PARIS SCIENCES & LETTRES (PSL)

A philosophy of data

Ongoing digitization has placed data at the center of economic and social life. We are producing a growing amount of data that are exchanged, secured and analyzed by increasingly sophisticated technologies. Data economics defines the value of these operations. Data policies are implemented both by governments and large corporations. An emerging business revolves around big data. But the precise nature of a datum remains unclear. A philosophical approach, such as the one developed by Luciano Floridi, can help us refine the definition.

10 April 2017

What is a datum? The advent of digital technology forced us to reexamine and clarify this long-neglected question that emerged at the end of the nineteenth century, when philosophy was divided between taking into account the “immediate data of consciousness” (Bergson) and an experimental method emphasizing the indirect measurement of “objective” data.

After Quine, the philosophy of science quickly dropped this pretense of objectivity. Nevertheless, to this day, any practical thinking concerned with accuracy and realism, any scientific approach, is based on data.

The known and the unknown
The very term “data” suggests something that is both given and that imposes itself. This is precisely what allows data to serve as a basis for shared reflection, to drive technical development, to establish public policy or to fuel scientific knowledge. Engineers, economists, physicists, botanists, agronomists, chemists... all use data.

Data is the known territory from which one explores the unknown.

But what is known requires an agreement, an acknowledgment. In short: some kind of self-evidence or convention that makes all participants in a discussion agree on what is “given.” Whether scientific, technical or sociological, data offers intelligible and valid facts that are publicly available – or at least, shared by a community of users who recognize their value.

From this perspective, quantitative data have an undeniable advantage over qualitative data. It is easier to agree on numbers than on qualities. Hence, modern science and engineering seek to quantify qualities, to describe them by using numbers. Grey is no longer a color between black and white. It is black at 25%, or 70%, etc. Any image can be broken down into pixels, and each pixel has a digital value on a scale that ranges from infrared to ultraviolet.

Therefore, the digitization of the world isn’t only about the “digital” as such, i.e. the translation of signals into series of 0s and 1s. It denotes a more global translation of the sensible world – including humanity – into data series. This translation began at the dawn of the modern era, and we are now experiencing its dramatic acceleration.

But this acceleration also carries a second evolution, one that shifts analysis toward stochastics: from the precise and rigorous reduction of mechanisms to the ex post revelation of statistical laws through data mining. This revolution, which has only begun, could be characterized as the triumph of inductive over deductive thinking. In a growing number of areas, knowledge is based on correlations extracted from vast amounts of data. It is less a matter of proving laws than of observing their emergence.
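This inductive turn can be illustrated with a minimal sketch in plain Python (the paired observations below are hypothetical, invented for illustration): instead of deriving a law from first principles, we let a statistical regularity emerge from the data.

```python
# Minimal sketch of "observing a law" rather than proving it:
# a Pearson correlation extracted from hypothetical paired observations.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical readings: hours of daylight vs. solar energy produced.
daylight = [8, 9, 10, 12, 14, 15]
energy   = [21, 24, 26, 31, 36, 38]

r = pearson(daylight, energy)
print(round(r, 3))  # close to 1: a strong correlation "appears" in the data
```

No causal mechanism is proved here; the correlation is simply observed, which is exactly the epistemic shift the paragraph describes.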

Statistics and algorithms have established themselves as fundamental tools of both knowledge and decision-making.

This revolution, in which we are all immersed, forces us to question the status of what fuels it: vast amounts of data stored in huge data centers. Starting with a question as simple as it is radical: is data the same as information?


Information atoms
The distinction is important and it is part of a chain that goes from facts to knowledge.

Sven Ove Hansson, professor of philosophy at the Royal Institute of Technology in Stockholm, summarized in a 2002 article the main differences between data, information and knowledge: “Data differs from information in that it need not have a form that lends itself to assimilation. If, instead of a book [of sociology that I am reading now], I read the tens of thousands of questionnaires on which this book is based, I would be looking at data instead of information. In short, data must be processable in order to be considered information, and assimilated in order to become knowledge.”

As a matter of fact, Hansson takes up a distinction already used by Roger Bohn in a 1994 article in the Sloan Management Review to differentiate data, information and knowledge. Data come from sensors and provide a measured value for a given variable. Information consists of data organized according to a given structure; placed in a given context, it makes sense. Knowledge goes one step further: it allows one to make predictions, establish causal relationships or make decisions.
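Bohn's three-level distinction can be sketched as a toy pipeline (all names, values and thresholds below are hypothetical, chosen purely for illustration): raw readings, then structured information, then an actionable decision.

```python
# Toy sketch of the data -> information -> knowledge chain.
# Names and thresholds are hypothetical, chosen for illustration.

# Data: raw values measured by a sensor, with no structure or context.
raw_readings = [18.2, 18.4, 19.1, 24.8, 25.3]

# Information: data organized in a structure and placed in a context.
information = {
    "sensor": "greenhouse-temperature",
    "unit": "celsius",
    "latest": raw_readings[-1],
    "average": sum(raw_readings) / len(raw_readings),
}

# Knowledge: information used to predict or decide.
def decide(info, threshold=25.0):
    if info["latest"] > threshold:
        return "open the vents"
    return "do nothing"

print(decide(information))  # -> "open the vents"
```

The list of floats alone decides nothing; only once it is structured and contextualized can a rule act on it, which is the step from information to knowledge.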

Value comes from knowledge. But as noted by Bohn, information is easier to store, describe and handle. Is the same true for data?

In numerical terms, yes: a datum is, in a sense, an atom of information, a minimal measurement at a given time and point in space. In short, something that can be reduced to a 0 or a 1.

In philosophical terms, a datum is easier to describe than information. It is a simple, less challenging concept. A datum would be the most immediate and rawest translation of a fact. But a datum is not a fact: rather, the minimum unit of observation that defines a fact.

It would be illusory, however, to claim that data are objective, free of any intention or project. Measurement itself results from a selection among all the measurable aspects of a phenomenon: you choose to measure one variable rather than another, and this screening defines the reality you seek to know.

But in the case of data, typically produced by a sensor, an appearance of objectivity arises in two ways: first, from the very small amount of information each datum contains; and second, from the presence of other sensors that help build a richer representation of the observed phenomenon. The state of your car's right front tire, for example, can be defined by its heat, vibrations, air pressure, age, service life and so on. Together, these parameters allow the on-board computer to provide highly reliable information.
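The tire example can be sketched as follows (the field names, readings and safe ranges are hypothetical, invented for this illustration): each sensor yields a single low-information datum, and only their combination produces a richer, more reliable piece of information.

```python
# Sketch of the tire example: several low-information data points,
# combined into one richer (and more reliable) piece of information.
# Field names, readings and safe ranges are hypothetical.

tire_data = {
    "temperature_c": 38.0,   # heat
    "vibration_hz": 12.5,    # vibrations
    "pressure_bar": 2.3,     # air pressure
    "age_years": 2,          # age
}

safe_ranges = {
    "temperature_c": (0, 80),
    "vibration_hz": (0, 30),
    "pressure_bar": (2.0, 2.6),
    "age_years": (0, 6),
}

def tire_status(data, ranges):
    """Each datum alone says little; together they yield information."""
    out_of_range = [k for k, (lo, hi) in ranges.items()
                    if not lo <= data[k] <= hi]
    return "OK" if not out_of_range else f"check: {', '.join(out_of_range)}"

print(tire_status(tire_data, safe_ranges))  # -> "OK"
```

Any single reading could be noise; agreement across independent sensors is what lends the final verdict its appearance of objectivity.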

A semantic definition of datum
The foregoing considerations cut both ways. On the one hand, they force us to make choices, to filter data and therefore to reconstruct a highly reductive representation of reality.

On the other hand, the quest for objectivity, the multiplication of sensors and the exponential growth of collected data foreshadow the scientific fantasy of a holistic representation of phenomena, an absolute digitization of the world.

Scientific version: when studying a specific bone pathology, we use 100,000 sets of very complete data from 100,000 different patients. We have an unprecedented opportunity to understand a phenomenon, or at the very least, to record it entirely, without leaving anything out. Daily version: the connected human being. Your blood pressure is analyzed every second by sensors, your position in space is captured constantly, etc. You are transforming yourself into a producer of ever-increasing volumes of data.

Ultimately, what is this data worth? In other words, what makes data form part of a chain that binds facts and knowledge via information?

The approach to the datum as an atom of information reaches its limits here, because it explains nothing about this transformation. Luciano Floridi, professor of philosophy and research director at the Oxford Internet Institute, offers a reflection that allows us to move beyond this limit.

He examines the possibility of grounding a definition of information in data. In other words, of defining the datum semantically, by asking what enables it to produce information.

First, he adopts a rigorous definition of the datum: “a datum is a putative fact regarding some difference or lack of uniformity within some context.”

This diaphora, this difference in the fabric of reality, opens up the possibility of information, under certain conditions. Floridi identifies three requirements: a) one or more data are required; b) the data must be well-formed, i.e. assembled according to specific rules; c) they must be meaningful, i.e. capable of being interpreted, translated or expressed differently.

It follows that the datum can be defined as a relational entity – this is a crucial feature.

Floridi’s theoretical insight allows us to understand this feature by shedding light on the notions of “difference” and “lack of uniformity.” According to Floridi, these two notions refer to what the Greeks called “diaphora,” a gap. He proposes a “diaphoric definition of the datum” that can be applied at three levels.

First, a datum can be defined as a “diaphora de re,” i.e. a lack of uniformity in reality. There is no specific name for such “data in nature.” One possibility is to call them “dedomena” (the ancient Greek word for “data”). It is worth noting that, etymologically, the word “datum” entered Latin through the translation of a work by Euclid, the Dedomena. One cannot know dedomena directly, but only infer them from experience.

Floridi explains that “dedomena” are pure data, or proto-epistemic data: data prior to any epistemic interpretation. As “fractures in the fabric of reality,” they are neither accessible nor elaborated independently of some level of abstraction. They cannot be experienced epistemically, but their presence is deduced empirically from (and required by) experience.

Beyond these proto-epistemic data, a datum is also a “diaphora de signo,” i.e. a lack of uniformity (or the perception of a lack of uniformity) between two physical states: the varying charge level of a battery, the electrical signal of a phone conversation or a dot in Morse code.

And finally, a datum is a “diaphora de dicto,” i.e. a lack of uniformity between two symbols, such as A and B in the Latin alphabet.

This organizing concept of diaphora, which unites these three versions of the datum, refers to a divergence: the moment when something starts to differ, a difference that calls for meaning. The datum is the symbolic entity that encodes this difference. It is the link between this gap – which one could be tempted to call insignificant – and meaning. Ultimately, a datum is a transition point from insignificance to significance.

Bruno Teboul
Senior Vice President of Science and Innovation, Keyrus Group