PARIS SCIENCES & LETTRES (PSL)

Data scientists: who are they? And what do they do?

Here is a trade everyone is talking about and no one has a clear idea of... if only because its very name is confusing. Who are the data scientists, what skills do they rely on, and what exactly are they doing?

20 November 2017
Executive Summary

Who are the data scientists, what skills do they rely on, and what exactly are they doing? The expression emerged about ten years ago around Big Data, when data specialists started undergoing a rapid evolution. The source of this evolution is the exit from the “data warehouse,” where clean, operational data were stored, and the concomitant entry into a world dedicated to exploiting data flows in real time. The skills needed are different: not only are new tools emerging, but specialists are also required to write additional code. Large open source software suites compete with proprietary software such as SAS, which dominated the data warehouse world. The development of these open source suites leads to an explosion of solutions and an acceleration of the tool life cycle. Professionals who exploit and develop these new tools have often retrained from other roles, but specialized curricula have emerged and young graduates are coming onto the market. They all have in common a specialization in one of the major software packages, an ability to code and develop, and skills in statistics or other fields of mathematics.

Paris Innovation Review – The term “data scientist” was almost unknown ten years ago. In 2017, it almost seems too broad, considering the different specialties that have emerged.

Arnaud Contival – The term keeps a certain consistency, but the field around data is clearly evolving. In addition to initial categories such as data mining, new specialties are appearing, for example marketing scientists. Overall, we are witnessing a progressive segmentation. In practice, professionals all wear two or three hats, if only because they started working with Big Data a few years ago, when the field was not as segmented. It may thus be useful to go back ten years to understand current developments.

As a market trend, Big Data really emerged around 2008-2010, when a series of acquisitions occurred in the United States, suggesting that the world’s major software companies had reached their limits. They observed the emergence of a new market and sought to position themselves in it. They acquired modules in order to sell them, with a push strategy focused on large accounts.

Within large companies, departments such as finance, marketing and HR started to take a close look at the possibilities offered by Big Data, and that is when differentiated trades began to emerge. After the major software publishers, IT services companies entered the business, with a delayed effect due in part to the need to train professionals. The specialized Master’s programs that now train data scientists were created a little later.

Basically, the emergence of data scientists boils down to a fundamental phenomenon: an exit from the “data warehouse.” Until then, data was kept in “warehouses,” where information from operational databases was systematically stored to support decision-making. Big Data departs from this well-organized world and follows a very different logic: capturing and exploiting data that is far more varied and incomparably more abundant calls for new skills.

Let’s return briefly to the historical actors of the market. For example, how did a company such as AID become a data specialist?

Computer analysis of data has been our main specialty since the company’s foundation in 1972. At the time, and with technical means that had little in common with what is available today, it was already a matter of predictive analytics: in this case, Bison Futé, a software program for anticipating road traffic during mass holiday departures. Later, the company specialized in marketing, helping clients exploit their databases: the first CRM tools, loyalty cards, coupons and personalized offers. Then it took over the management of these databases in order to optimize their use, gradually incorporating Big Data technologies.

Today, data sources are expanding and the issue is more about using data flows than databases, even if storage remains crucial. The most interesting flows – geolocation, clickstream, or IoT – are also the most data-intensive. They are very large, but contain information that makes sense. They also convey more truth.

A significant dimension of this trade is choosing the right data to collect. Major platforms such as Facebook or Amazon are masters in this art. This choice is necessary. It can also be constrained: Uber, for example, was banned from taking off-track GPS data.

Data scientists are professionals or agencies that develop or use the tools needed to exploit these enormous data flows intelligently, and increasingly in real time.

Are these tools universal, or on the contrary, does every agency or big company develop its own technology?

There are some proprietary technologies but today, 90% are open source. Everything that is distributed, for example, is open source.

One element that makes a huge contribution to the success of open source software is the fact that it is used in universities. Data scientists who enter the market today have learned these technologies as students because, like everything that has been developed in open source, they are free. Besides, one can contribute to their development, which is an excellent way to learn!

The development of software suites in open source mode has led to numerous solutions, and to an acceleration of the cycle, speeding up the birth and the death of every new tool. Shared knowledge has developed around major software suites. Nevertheless, this has not brought universal solutions – quite the contrary.

It should also be noted that the tools used for the different segments of this activity are compatible with one another to a greater or lesser degree. AID, for instance, chose a consistent suite: Cassandra for storage and, for data science, Spark, which belongs to the Hadoop ecosystem. The largest software suites are Python, R, Scala and Spark. They can all be considered “general use,” but each one offers solutions that are more specifically focused on certain uses.

What are the skills used by data scientists?

Data scientists master analytical tools and, more than the data miners that preceded them, they are professionally bound to a specific tool. But they do not simply use these tools: they also develop specific solutions for them.

Today, the main training channel is that of data miners who become data scientists. But we also see a lot of reorientation, for example, people who have a Master’s in business intelligence software. Sometimes this reorientation is organized internally, and for companies such as AID retraining is part of our business.

To answer your question more specifically, there are three pillars. The first is a specialization focused on one major software package. The second is an ability to code and to develop. The third is skills in statistics or other areas of mathematics.

A whole industry which used proprietary software until now, such as SAS or SaaS tools, is transforming and integrating these new tools. Not without reason: first of all, there is a real cost difference, because proprietary software has its price (initial package, cost of training). But above all, there is the issue of power and scalability. R, for example, offers thousands of functions, as opposed to hundreds in SAS. And these new tools are always in motion: their development is constant, which allows them to be in phase with the quick evolution of needs and the emergence of new ideas. The functions are refined, because you can test many algorithms in parallel. Finally, all these tools also have a machine-learning dimension.

For all these reasons, large customers are switching to open source solutions as well.

Data scientists are young, highly sought after... how can they be retained?

By offering them the chance to explore new areas, varying their missions, and allowing them to take breaks – by playing table tennis for example! But large industrial companies only offer the first option, and for a limited time. Quite logically, the best talents can be found within agencies or working as freelancers. Some have an independent status while working with an agency. Finally, following a logic of scarcity or niche markets, there are often several very small companies connected to an agency, each specialized in a specific module or specialty.

Will this diversity of structures last long? Or are we witnessing a consolidation, with the emergence of monopolies or virtual monopolies, like what happened with ERPs?

It seems to me that there is room for everyone. In general, the market is promising: CIOs have an obligation to invest, if only to reduce costs, and they lack internal resources.

To meet this demand, there are major global software publishers on one side and more specialized actors on the other. The former deploy their technologies worldwide, and an ecosystem of agencies that provide both support and integration has emerged around them. Their weakness is that they offer ready-to-wear solutions... which are not always easy to use.

The “smaller” players have a capacity of co-construction, closer to the needs of the client. For example, they will provide a service for a business of connected objects that generates a lot of data and seeks to make better use of them. There are also intermediaries for companies that manage outsourced activities (e.g. accounting).

The interesting thing is the interpenetration between these specialized players and their clients. Data can evolve towards the equivalent of BPO (business process outsourcing). In-house data scientists from AID are hosted by the client, embedded in their activity, because this is where they can be put to use fastest. For example, in marketing, we send professionals into the departments that manage advertising networks. But there are also teams who work externally. It’s a combination of outsourcing, consulting, training... a form of integration that benefits both parties.

Arnaud Contival
Chairman and CEO, AID