Let us briefly revisit this ongoing transformation, which is revolutionizing the field of information technology.
In 2013, nearly 6 zettabytes of information (one zettabyte being one thousand billion billion bytes) will be available on the global network. This amount doubles every two years, and the gradual public release of data from institutions and companies in the context of Open Data amplifies this phenomenon. The emerging Big Data technologies respond first and foremost to the exponential growth of available data.
A Big Data system has two closely related parts: storage and analysis. Analysis is based on storage, which facilitates access. Storage is based on analysis, which reduces volume. Analytical solutions that really respond to this problem have two features: induction and speed.
It is not only a matter of computing power. The world of Big Data involves a cultural revolution for all the players involved. They must somehow accept letting go, by agreeing to work with fragments of free and quickly accessible data. In turn, data should be considered as a consumable – and even perishable – product that does not need to be preserved. Ultimately, data must be transformed for a new use.
One can speak of a real industrial revolution. This new approach challenges the established customs of computer technology and announces a new, intensive movement of creation/destruction. Clearly, the ability to quickly process large swathes of sometimes heterogeneous data to extract useful information is a potential source of added value. Today, we see the birth of a new industry. While most of the older technological players are completely destabilized, some – such as IBM – have managed a successful conversion. The others are overwhelmed, mostly because they fail to understand what exactly is at stake, or refuse to reinvent themselves.
In the field of innovation, the same sequence often repeats itself. The propagation of a new concept generally follows the same steps. The emerging phase, heavily inspired (as always) by the Gartner Group and MIT, is over. Currently we are living through the rallying phase, which generates a great deal of confusion. Every technological player seeks to slip into the flow. Whether their offering is Big Data or not, they change the terms and make compromises: what they want is to be part of the movement. This will be followed by the mutation phase, during which new technological solutions will emerge from the ashes of previous ones, with a real added value integrating the “DNA” of the new concept. Some advanced players are already working on these solutions, which are slowly beginning to emerge. The last phase will be a construction phase, with the advent of pure players that will bring to the market solutions and services that we cannot imagine today.
An industry is born. Many companies will be created to extract and process data and obviously… they will produce more data. Let us take a closer look at the technological breakthroughs that are at the heart of this industrial revolution. All of these use real-time processing: that’s why speed is a core feature of these technologies. But speed is not everything. For the most part, they involve a kind of reasoning which, despite being very old, has long stayed outside the scientific and technical way of thought: inductive thinking. This is the heart of the Big Data disruption.
The analysis of huge amounts of data mainly focuses on finding correlations. This concept originates from biology and has been used in economics for a long time, but it is generally considered that from a scientific point of view, it only has a descriptive value: it may identify a relation between two variables, but does not explain it. It can still be used to test a hypothesis.
It is precisely here that Big Data deviates from “traditional” scientific knowledge, as formalized by Karl Popper. Traditional science mainly works by deduction: a consequence is logically deduced from a hypothesis. It is then checked and tested. Induction, on the other hand, is radically different.
What is it about? Induction, unlike deduction, is a mechanism used by the human brain at almost every moment. Indeed, despite the fact that deduction is considered cleaner and more scientific, it occupies only a small portion of the processing time of our brain. It is particularly relevant when analyzing a situation out of its context.
For example, to apply deductive logic to the decision of crossing a street, you would need to measure all vehicle speeds, locate them in space and calculate, using a set of equations, the right time to cross. Needless to say, the slowness of this technique of analysis would be more of an obstacle, compared with the use of your own senses and cognitive abilities...
In fact, our brain captures the scene as a whole and processes it by using induction. To do this, it generalizes the principles observed in similar situations involving us – or others – that we have observed (other people crossing streets, at any time, with or without light, on wet or dry ground, etc.). Our brain is able to integrate a huge number of parameters in a flash and project the results of its inductions on the current scene.
This is exactly what Big Data processing needs: search instantly for the critical information, process it as a whole without preconditions, reproduce effective mechanisms that have been observed in the past, and generate new data that can be used directly in the current situation. Chris Anderson, the former editor of Wired, was one of the first to point out the implications of this method in a famous article http://www.wired.com/science/discoveries/magazine/16-07/pb_theory published in 2008. The knowledge from Big Data will be produced by “agnostic” statistics. This lack of ideology is the very condition of their success: in their own way, numbers speak for themselves.
For anybody raised in the Western scientific tradition that derives from Descartes, this approach is absolutely stunning. It is easy to imagine that for scientific minds, trained in deductive thinking, this isn’t an easy revolution to live with. More generally speaking, apart from experts in epistemology and people familiar with constructivism http://en.wikipedia.org/wiki/Epistemology#Constructivism (an approach to knowledge based on the idea that our representation of reality, or the concepts that structure this representation, are produced by the human mind by interacting with this reality and not a reflection of reality itself), nobody knows much about the principles of induction and abduction. In most developed societies, education focuses on deductive thinking, the basis of Western logic. And yet, before the age of seven, a child is immersed in an inductive way of thought. It is only around the “age of reason” that children discover deductive logic and demonstrated reasoning. A student is very unlikely to learn about induction in his syllabus. Still, inductive logic is part of our daily lives, of our daily actions.
Induction allows us to generalize a phenomenon, even if it is observed only once. This logic, despite being so fundamentally human, is foreign to engineers and scientists trained in Cartesian epistemology. This explains a number of confusions that interfere with the understanding of Big Data. Some see induction as a form of statistics and mistake the search for singularity for a more refined segmentation of elements that are obtained statistically. Some even speak of intuition when describing induction.
In all of these cases, the confusion stems from a desire to compare different principles in identical fields. In fact there are areas where deduction is quite efficient and others where induction is required. It would be completely pointless to try to apply induction where deduction is effective and relevant... The opposite is also true. These two tools cannot be compared and in a sense, they aren’t even competitors. True wisdom is knowing which tool to use in which circumstances.
This duality is reflected in the temporal approach to analysis. Deduction, statistics or probability can be fueled once, with several years of data, to establish a “law” (i.e. a repeatable result). Induction, however, is a continuous technique that takes time. It is an ongoing process that will generate singularities, expand their base and measure the effectiveness of their application.
Inductive reasoning can’t be reduced to one single model. It depends on previous inductions and detected singularities. It is not repeatable. Again, it is far from Cartesian principles. Induction does not need complete and consistent parameters, since our brain will process them only partially, by concentrating on what it considers to be the critical information on the situation. On the other hand, this method will also produce errors.
For the last ten years, studies on inductive algorithms have proliferated in universities. The growth of social networks has increased the demand for these algorithms, which are at the heart of Big Data technology.
It should be noted that when computer science grasps a notion, it usually creates a meaning of its own which is often a simplistic version of the original. For example, in philosophy, ontology is the study of being. In computer science, ontology is a model of representation of knowledge. In philosophy again, induction is a kind of constructivist reasoning that produces probable orientations. In computer science, with a few exceptions, induction is simply about applying the principle of recurrence to graphs.
For a professional, processing data using an inductive method requires a change of approach. Many principles are completely reversed when we move from deductive to inductive reasoning.
For example, “store more data for more precision” becomes “forget more data for more possibilities.” “Eliminate singular cases to focus on the most common ones” becomes “abandon frequent cases to focus on differences.” “Model and standardize data” becomes “search for singularities and unknowns.” “Process data exhaustively” becomes “focus on the critical data.”
It would be a mistake when designing an inductive algorithm to seek to apply the principles of deductive logic. This is one of the major challenges of Big Data: we must know how to choose. Either we reason in the context of a deductive approach, or else we firmly choose an inductive approach.
This second option leads to several consequences. First, while deductive reasoning always comes to an end, inductive reasoning generally produces no final state. The results of new inferences are likely to alter the inferences already made, so it is possible to continue the reasoning indefinitely. The inductive algorithm must therefore be paired with a convergence function (also called reward function) to assess the benefits of new inferences and limit their number.
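The pairing of an inductive process with a convergence function can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not a description of any particular Big Data product: the "knowledge" is a toy set of graph links, new links are proposed by simple transitivity, and the hypothetical reward function `reach_of_a` stands for the purpose of the analysis. The loop stops as soon as a new inference no longer improves the reward.

```python
from typing import Callable, Set, Tuple

Link = Tuple[str, str]  # hypothetical graph edge: (source, target)


def propose_transitive(known: Set[Link]) -> Set[Link]:
    """Suggest a->c whenever a->b and b->c are already known."""
    return {(a, d) for (a, b) in known for (c, d) in known if b == c and a != d}


def induce(
    facts: Set[Link],
    reward: Callable[[Set[Link]], float],
    min_gain: float = 0.5,
    max_rounds: int = 100,
) -> Set[Link]:
    """Add inferred links one at a time while the reward keeps improving."""
    known = set(facts)
    score = reward(known)
    for _ in range(max_rounds):
        # Sorting makes the choice deterministic when rewards are tied.
        candidates = sorted(propose_transitive(known) - known)
        if not candidates:
            break  # nothing left to infer
        best = max(candidates, key=lambda link: reward(known | {link}))
        gain = reward(known | {best}) - score
        if gain < min_gain:
            break  # further inferences no longer serve the purpose
        known.add(best)
        score += gain
    return known


def reach_of_a(links: Set[Link]) -> float:
    """Hypothetical purpose: how many nodes are directly linked from 'a'."""
    return float(sum(1 for source, _ in links if source == "a"))


result = induce({("a", "b"), ("b", "c"), ("c", "d")}, reward=reach_of_a)
```

In this toy run the links a->c and a->d are inferred because they serve the stated purpose, while b->d, although just as plausible, is never added: without the reward function the process would keep inferring for as long as candidates exist.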
Second consequence: there is no single inductive solution to a given problem. The multiplicity of potential solutions can however be reduced, since it is common that a small number of solutions responds to a specific aim and purpose. Consequently, inductive algorithms are rarely universal.
Lastly, an algorithm designed within an inductive approach has a remarkable feature: although it isn’t intended to test an existing hypothesis, it may nevertheless be designed for a specific aim. Its “products”, i.e. the graphs produced from the analyzed data, have a purpose. The algorithm’s knowledge of this purpose is valuable information for measuring its own effectiveness, and it will include plausibility calculations. The best inductive algorithms can evolve: they “learn”, refining their way of processing data according to the most appropriate use that can be made of it.
This evolution over time is inseparable from the second major feature of Big Data: speed.
As long as the risk of error is accepted by the user (be it an individual or an organization), it is often preferable to have adequate information immediately rather than complete and reliable information later. In a world where every millisecond counts, where a competitor will sell online simply because its response time is shorter, real-time information becomes crucial.
For a human faced with a decision such as a purchase, thinking takes approximately ten minutes. This leaves enough time for decision-making systems to process data and provide guidance for a decision. But the Web has changed these landmarks. Indeed, the interaction between the user and his or her environment gradually slides from the sphere of slower cognition to the sphere of motivation, or even to that of high-speed emotion. Here, the interaction with the Internet user takes about a second, or even less.
In terms of psycho-sociology, this phenomenon is amplified by the propagation speed of information within connected communities. In a few hundred seconds, millions of people can be informed in a way likely to dramatically change their behavior. Having the information is not enough: its social impacts need to be anticipated.
Continuing with existing technologies would inevitably lead to the design of massively parallel computing networks: the size and complexity of these systems would certainly limit their use.
Big Data technologies are specifically designed to meet this challenge: processing an increasing amount of information more quickly. Again, this means turning to a constructivist approach.
The example, again, is right before our eyes, in our own societies. Nobody seeks to control their environment perfectly, but everybody is aware of the contribution they make. The equation relating speed to contribution is crucial in social life. Providing little, very quickly, is not decisive; providing more, but less quickly, means taking the risk of being surpassed. Every society lives in this balance between competition and collaboration. This is the spirit in which algorithms were developed to provide speed.
It is always possible to optimize the processing time of algorithms, but when the volume of information grows and, as we have seen, when the access criteria are unclear, processing speed is mathematically limited.
To design inductive speed algorithms, it is necessary to think otherwise. An algorithm must be contributory and able to anticipate. To achieve this, the computer application must be implemented in a continuous flow of data and regroup a collection of agents in charge of transforming this information as early as possible into usable knowledge for the next instants.
This is not without consequence. For example, in decision-support computing, raw data is usually kept for subsequent processes that are not yet determined. For inductive speed algorithms, it is preferable to transform information from a form “in extension” (e.g. keeping all the receipts of Mr. Smith) to a form “in comprehension” (e.g. Mr. Smith buys a loaf every Monday and sometimes a cake), which is more manageable and less bulky.
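The move from “in extension” to “in comprehension” can be sketched as follows. This is only an illustration under stated assumptions: receipts are reduced to hypothetical `(week, weekday, item)` triples, and the function `summarize` and its `always` threshold are invented for the example. The point is that a handful of labeled habits replaces the full list of receipts.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Receipt = Tuple[int, str, str]  # hypothetical record: (week number, weekday, item)


def summarize(receipts: Iterable[Receipt], always: float = 0.9) -> Dict[Tuple[str, str], str]:
    """Turn raw receipts into a compact description of buying habits."""
    unique = set(receipts)  # ignore duplicate lines for the same week
    weeks = {week for week, _, _ in unique}
    # Count in how many distinct weeks each (weekday, item) pair occurs.
    seen = Counter((weekday, item) for _, weekday, item in unique)
    return {
        habit: ("always" if n / len(weeks) >= always else "sometimes")
        for habit, n in seen.items()
    }


# Four weeks of receipts for Mr. Smith, kept "in extension".
receipts = [
    (1, "Monday", "loaf"),
    (2, "Monday", "loaf"),
    (2, "Monday", "cake"),
    (3, "Monday", "loaf"),
    (4, "Monday", "loaf"),
]
habits = summarize(receipts)
```

Here `habits` holds two entries, one marking the Monday loaf as "always" and one marking the Monday cake as "sometimes". Once such a summary exists, the raw receipts can be discarded, which is exactly the trade-off the text describes: less volume, at the cost of not being able to answer questions nobody anticipated.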
Another consequence: a speed algorithm should preferably be iterative, so that it can be stopped without losing results. Thus, information is available quickly and is refined in a continuous process.
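A very small example of an algorithm that can be stopped at any moment without losing its results is an incremental average. The sketch below is an assumption chosen for its simplicity, not a technique named by the text: the hypothetical generator `running_mean` consumes a stream one value at a time and yields a refined estimate after each value, without storing the raw data.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, one estimate per value."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: no need to keep past values in memory.
        mean += (value - mean) / count
        yield mean


estimates = running_mean([2, 4, 6, 100])
# Stop after three values: the estimates obtained so far remain usable.
first_three = [next(estimates) for _ in range(3)]
```

The caller decides when to stop. Each intermediate estimate is a usable answer, and letting the process run longer only refines it.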
Finally, a speed algorithm is part of a logic of contribution and competition with its neighbors in order to adapt the production of information to the use that is made of it. It has a mechanism that allows it to focus on the essentials.
What can we conclude from this overview? Firstly, that Big Data is not just a question of scale. The technologies that make its processing possible stem from a radical change in the way we process data.
Induction allows algorithms to reproduce observed phenomena by generalizing beyond their scope, as long as they remain effective, without trying to make models out of them. Speed allows algorithms to focus on the essential in order to maintain the balance between their own contribution and competition from the others. The analysis of Big Data proceeds from constructivist epistemology: each system makes its own references and takes part in the global system without ever knowing the latter precisely. Permanent learning, never completed, produces imperfect but useful knowledge. Any resemblance to the human brain is certainly not a coincidence.