Big Data is all over the news. Some fear a new Big Brother, others celebrate astonishing new possibilities in fields like marketing, epidemiology and city management, and Chris Anderson prophesies a science without theory. Obviously, we are on the verge of a revolution. But what exactly are we talking about?
Paris Innovation Review - In the last two or three years, the issue of Big Data has pervaded the public debate, generating enthusiasm and doubts... And yet, we sometimes have a hard time defining what exactly we are talking about. Could you briefly explain?
Henri Verdier - This confusion isn’t surprising, because not only is it a recent theme, but more importantly, there is an ongoing political and economic confrontation around its definition. The term “Big Data” refers to at least three different phenomena. In a narrow sense, it refers to new information technologies in the field of massive data processing. In a broader sense, it refers to the economic and social transformations induced by these technologies. And last, some analysts present it as an epistemological break: the leap from the hypothetico-deductive method – on which modern science is built – to inductive logics that are very different from it.
Moreover, the Big Data boom involves enormous interests, which probably adds to the confusion. For example, IBM rebuilt its empire from ashes in this field. Other giants, such as Google and Facebook, are also heavily involved. It’s an area that draws the attention of consultants and service providers, and all these people tend to exaggerate the impact of the technologies they are trying to sell.
Does this mean that we are witnessing a mere bubble, a fad?
Definitely not. But precisely because of the significance of this evolution, we must keep a cool head and examine rationally what’s happening before our eyes. First things first: the technologies involved.
The first phenomenon is the deluge of data generated by servers that currently store volumes of information unfathomable only a few years ago (information available in digital format increased from 193 petabytes in 1996 – the equivalent of all the books printed by humanity until then – to 2.7 zettabytes in 2012, roughly fourteen thousand times more). This explosion is made possible by technical progress, but it is also fueled by new practices. You and I, everyone, every day, increasingly produce and exchange messages: tweets, posts, comments, SMS, emails, and so on. With the popularity of the “quantified self”, which consists in collecting personal data and sharing it, generating raw data has even become a new way of being-in-the-world. But we also produce data without knowing it: when buying a product in a supermarket, when clicking on a newspaper article, or when allowing our smartphones to geolocate us. The Internet of Things will also increasingly produce, or incite us to produce, new volumes of data: noise and speed sensors will transform our bodies’ footprint into raw data – as our conversations on Facebook do today.
The second new feature is the ability to deal with this data. In a certain way, it is not quantity that defines Big Data, but rather a certain relationship with data, a certain way of playing with it. We learn every day how to manage, measure and interpret it better – and at lower cost. This “cheap” computing already opens the door to new players: you no longer need to be IBM to handle terabytes.
Even today, in Silicon Valley, one can observe the rise of Big Data technology: players like Facebook, SAP, IBM and Goldman Sachs have organized and funded programs to learn how to manage massive amounts of data. One of the challenges for them is to deal with Google, of course, which has emerged as the actor par excellence in Big Data processing. One example is the MapReduce framework, to which Google is also committed. This model handles parallel computations on large volumes of data. In terms of programming and system architecture, it marks the emergence of a new philosophy: one does not necessarily seek to develop sophisticated algorithms or use very powerful machines, but simply takes advantage of the available computing power by performing the same operation millions of times on machines set up in parallel. Amazon, for example, uses hundreds of servers in its cloud. In terms of software, this isn’t necessarily very impressive, but the results speak for themselves.
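The idea can be sketched with the canonical toy example, a word count split into a map phase and a reduce phase. This is a single-machine illustration of the programming model, not any particular framework’s API; the function names are ours.

```python
from collections import defaultdict
from itertools import chain

# Toy word count in the MapReduce style: the same simple operation is
# applied independently to each chunk of input (map), then the emitted
# results are merged by key (reduce). Real frameworks distribute the
# map calls across thousands of machines running in parallel.

def map_phase(chunk):
    # Emit a (word, 1) pair for every word in one chunk of the input.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Sum the counts emitted for each word across all chunks.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["big data big code", "data is the new code"]
mapped = chain.from_iterable(map_phase(c) for c in chunks)  # parallelizable step
result = reduce_phase(mapped)
```

The interest lies precisely in what the sketch makes visible: each map call is independent, so the work scales by adding machines rather than by writing a cleverer algorithm.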
This technological shift is not only a question of volume. It is often said that Big Data relies on the “three Vs”: variety, velocity and volume. Big Data information technology is renewed every day to deal with large amounts of often poorly structured data in lightning-short timeframes (as, for example, in high-frequency trading).
Performance, therefore, covers the amount of data processed, the diversity of sources and the search for real-time answers. This new available power opens the door to new strategies of data processing. You learn to handle complete distributions, to play with probabilities, to translate problems into automatic decision systems, to build new visualizations for new rules of interaction with data.
A new school of computer science is rising, a new way of programming, partly inspired by hacker culture. The members of this community focused on hardware in the 1970s, on software in the 1980s and 1990s, on content in the 2000s. And now, they are focusing on data. “Data is the new code”: this paradigm was proclaimed to emphasize that, from now on, data is no longer the adjustment variable and that code should be organized around it...
A new computer science, or rather, a new philosophy of computer science: does that require different professional skills?
Indeed, yes. Today, a new profession is emerging: data scientists. They could be defined as follows: first, they are good mathematicians and, more specifically, statisticians; next, they are good computer engineers and, if possible, excellent hackers, capable for instance of installing three virtual machines on a single server; last, and this is a crucial point, they are able to provide strategic advice, because most organizations today are completely unprepared for Big Data. These different functions may well split apart in the future. But for now, we need all three skills.
To these three basic skills, I might add data visualization: being able to give calculations a shape – a readable shape – is absolutely crucial if we want Big Data to be useful for anything.
Precisely: what is it for? What are the applications of these new skills?
Generally speaking, producing and capturing data creates value. The question, of course, is to know how and where.
On some subjects, applications already exist: marketing comes to mind, of course, where ad targeting is made possible by cloud processing of the data generated by each user. Another example is customization, as performed by Amazon, which is capable of suggesting books or movies amazingly close to your personal tastes. In a more distant future, we might see, as in the movie Minority Report, real-time customization of advertising boards that recognize the categories of people approaching them. After all, Minority Report did nothing more than bring to the screen innovations that were being developed at the MIT Media Lab.
But these are just obvious examples. Big Data allows many other things. It can, for example, assist organizations in analyzing complex problems and taking into account the variability of these situations, instead of always thinking in terms of the “average customer”, the “average patient” or the “average voter”...
Real-time management, through resynchronization and optimization systems, is another trend. Traffic is a good example: the best application I know of is Waze, a mobile navigation app that lets drivers build and use maps, real-time traffic updates and turn-by-turn navigation to improve their daily commute. In a completely different field, high-frequency trading is another application of Big Data. It is not only about multiplying financial transactions: it is also an asset for outpacing other operators, by reacting quickly to their operations through more efficient communication channels.
We could also mention the emerging field of the feedback economy, based on constant iterations to optimize supply, both in terms of available stocks and of prices. Or personal assistants such as Siri, which you can train yourself. Or applications such as Dr. Watson, which provide diagnostic support to high-end hospital teams.
Precisely, a tool like Dr. Watson raises the problem of the reliability of interpretations drawn from Big Data.
True. In this case, it is just an aid to diagnosis; it doesn’t replace a doctor’s visit. But we would be wrong to stop at this conclusion. There are situations where no reliable data is available at all. The UN, for example, receives economic data that is several years old, and sometimes even distorted. Epidemiology works with data that is both expensive and slow to produce. And yet, one can follow an influenza or dengue epidemic with simple queries on Google. Monitoring an epidemic in real time and with free data is priceless! Information produced by Big Data is often based on imperfect or incomplete sources and is therefore neither absolutely certain nor guaranteed nor reliable. But oddly enough, thanks to the law of large numbers, it is often an effective source of information.
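The principle behind such query-based surveillance can be sketched as a simple anomaly detector: flag the weeks where the volume of flu-related searches rises well above its recent baseline. The weekly counts and the threshold below are invented for illustration; real systems such as Google’s flu tracking used far more sophisticated statistical models.

```python
from statistics import mean, stdev

def flag_outbreak_weeks(volumes, window=4, threshold=3.0):
    # Flag weeks whose query volume exceeds the rolling baseline of the
    # previous `window` weeks by more than `threshold` standard deviations.
    # A crude stand-in for real syndromic-surveillance models.
    flagged = []
    for i in range(window, len(volumes)):
        baseline = volumes[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and (volumes[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical weekly counts of flu-related queries: stable, then a spike.
weekly = [100, 104, 98, 102, 101, 99, 103, 260]
spike_weeks = flag_outbreak_weeks(weekly)
```

The point of the example is the one made above: the query counts are noisy and unverified, yet across large volumes the signal emerges clearly and almost for free.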
But how should we interpret these phenomena? Among the recurring debates around Big Data, there is the idea of a scientific revolution, including the horizon of a “science without theory” as prophesied by Chris Anderson.
Again, we must make some distinctions. There is definitely something important at work in the social sciences, especially in marketing and sociology, which have never claimed to unveil universal laws. In these disciplines, Big Data not only leads to a greater ability to process data, but also to a form of liberation in the way we organize it. For instance, when mapping 30 million blogs, new sociological categories appear that sociologists would never have thought of. In short, sociological categories emerging from pure empirical observation can be far more relevant than previous conventional categories.
This is what led Chris Anderson, the editor of Wired magazine, to formulate his idea of a “science without theory”, which would use inductive rather than deductive logic. In this model, truth almost spontaneously sprouts from data. And indeed, thanks to machine learning, we sometimes end up in situations where we are able to predict, with equations we don’t really know, results we can’t really explain! I’m thinking, for example, of a study led by IBM in a Toronto maternity hospital which, based on the historical biological parameters of thousands or tens of thousands of infants, can predict, 24 hours before any pediatrician, which babies will develop neonatal infections. In this example, there is a very useful, even vital, forecasting ability – but no underlying theory. This shouldn’t prevent us from making the effort to understand: statisticians emphasize that any serious work on Big Data requires understanding the processes of data generation as well as their evolution. Moreover, data management is always based on causal inferences that should be stated and understood.
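The flavor of such “prediction without theory” can be conveyed by the simplest possible learner: a nearest-neighbor classifier, which labels a new case by analogy with the most similar past cases, with no model of why the outcome occurs. The feature names and data below are invented for illustration and have nothing to do with the actual IBM study.

```python
from math import dist

# Illustration only: a k-nearest-neighbor classifier "predicts" purely
# from past cases, without any explicit theory of the underlying process.
# Features are hypothetical (heart-rate variability, temperature in °C);
# the label records whether an infection later developed.

def predict(history, features, k=3):
    # Label a new case by majority vote among the k most similar past
    # cases, using Euclidean distance between feature vectors.
    nearest = sorted(history, key=lambda case: dist(case[0], features))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

history = [((0.90, 36.8), False), ((0.80, 37.0), False), ((0.85, 36.9), False),
           ((0.30, 37.9), True),  ((0.25, 38.1), True),  ((0.35, 37.8), True)]
```

The classifier forecasts an outcome for a new infant without containing any statement about the biology involved – a toy version of the useful-but-theory-free prediction described above.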
Public authorities often have access to large statistical databases: have they taken hold of the subject?
There is a clear interest, and some remarkable initiatives. New York City, for example, formed a small team of data scientists who were able to extract precise information from the masses of public data available to the city. For instance, they identified the areas and streets where fires were most likely to occur. This information guided safety inspections and consequently helped reduce the number of fires. They also developed an algorithm to detect tax fraud. And it works!
The United Nations, with its Global Pulse program, strives to put Big Data at the service of human development: the analysis of social networks and mobile communications can help identify, far more quickly than conventional indicators, pressures on food prices, the outbreak and progression of epidemics, job market fluctuations, etc.
Hence the “Big Brother” label often associated with Big Data?
The previous examples use statistical – not personal – data. But indeed, the development of Big Data should be confronted with the contemporary obsession for transparency, often flavored with a certain naïveté. This is disturbing. Douglas Klein (Barnes & Noble) once stated that “privacy is the elephant in the room”, suggesting that many U.S. private actors expect an inevitable wave of regulation resulting from citizen revolt.
Personally, I think we have a number of issues to deal with that are far more serious than privacy in the strict sense. Privacy will be protected, one way or another. Following Daniel Kaplan’s observations, I would add that beyond privacy, there is another equally important issue about which there have been far fewer studies: automatic decision-making. In the near future, the operations by which an online merchant sets the price of an item or service will no longer be based on an ensemble of average buyers but on the price that you, specifically, are willing to pay. One can well imagine a website able to build your buyer profile and offer you a price according to it. Not just any price, of course, but the highest price you are willing to pay. It is quite possible that this type of profiling will soon become the basis for relationships in many environments. And this is certainly a matter of concern.
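The mechanism can be made concrete with a deliberately crude sketch: the quoted price tracks an estimate of each visitor’s willingness to pay, derived from profile signals, instead of a single price for the average buyer. Every signal, weight and function name here is invented for illustration.

```python
# Hypothetical sketch of profile-based pricing: the quote follows an
# estimate of the individual buyer's willingness to pay rather than an
# average. All signals and weights below are invented for illustration.

BASE_PRICE = 40.0  # the price an "average buyer" would see

def estimated_willingness_to_pay(profile):
    # Crude linear score built from hypothetical behavioral signals.
    estimate = BASE_PRICE
    estimate += 15.0 * profile.get("past_premium_purchases", 0)
    estimate -= 10.0 * profile.get("price_comparison_visits", 0)
    return estimate

def quoted_price(profile):
    # In this toy model the merchant never quotes below the base price.
    return max(BASE_PRICE, estimated_willingness_to_pay(profile))
```

Even this caricature shows why automatic decision-making deserves scrutiny: two buyers of the same item face different prices for reasons neither of them can see.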
Getting back to public data, it is not limited to administrative use. Sometimes, the most interesting developments occur when the public sphere renounces its monopoly on certain data and organizes the possibility for others to work on it. GPS, originally developed by the U.S. Army, is a classic example of this strategy.
The open data movement is also a major issue. It works in both directions: the Open Government Summit, held since 2009, has shown that rules and methods can be imported into the public sphere to develop new services that create social and economic value; symmetrically, to accelerate the development of these services, it is in the public sector’s interest to share part of the data in its possession. The City of Paris has understood this very well, as have other cities around the world.
But we can reach still further. One of the most recent developments is known as smart disclosure: a strategy that returns data to those who produce it, so that they can make use of it. The best example, in my opinion, is the Blue Button for American veterans: when using certain online services, they press the button and the service is personalized, boosting its efficiency. Incidentally, in this example, data is not really “returned” to citizens; rather, they are allowed to retransmit it to whomever they want.
There is a potential political agenda here; let me summarize some of the issues at stake. The first is the possibility of using Big Data to quickly measure and improve the effectiveness of public policies. The second is to open up the most relevant public data, or target it, to rally private and social actors in support of public policies. And the last is to promote smart disclosure in order to offer new services to citizens.