The way we live today is almost inconceivable without the data processing systems built into the computers, smartphones, tablets and GPS devices that we use every day, to which we are now adding other connected objects. We are immersed in a controlled digital reality where a multitude of information flows converge. Processing this data has become a sensitive issue inasmuch as it relates to our private sphere, to our intimacy. We do have tools that let us regulate certain parameters, for example whether or not we agree to be geolocated. But this personal control is only partly effective, since almost nobody knows how to use it seriously. This is why experts have been discussing the advisability of Big Data governance. How can we proceed? Alongside the institutional answers based on the emergence of supervisory authorities, one possible path forward could be ethical data mining.
With the current development of Big Data technologies, the algorithms that exploit data flows are playing an increasingly decisive role in personal choices. Though it would be an exaggeration to say that these algorithms control us, they do steer a number of our decisions, from choosing a hotel to booking air tickets, planning car routes, buying books... or meeting friends.
By facilitating or shaping our choices, they contribute to the way social life is organized. What we are witnessing is the discreet emergence of an algorithmic “power” that feeds mainly on raw data inputs. The capacity of this power to intervene in the most intimate decisions we make is quite new: even totalitarian regimes, with their armies of spies and informers, would not have dreamt of it. Large-scale data exploitation produces personalized information that anticipates possible decisions and is designed to facilitate them, which is itself somewhat ambiguous. The organizations with which we are in digital contact are keen to know our choices and seek to anticipate and steer them. We have little control, and not even the means to understand the criteria underpinning the algorithms used to understand us and to influence our choices.
So how can we get back in control? One of the most interesting possibilities consists of designing and implementing an appropriate model to analyze, understand and process the large amounts of data involved. An “ethical modeling” process, so to speak.
Let us be clear about what we are discussing. It is not a question of trying to slow down Big Data technologies and processes, still less of returning to earlier practices, but rather of making sure that our lives are not driven by some blind rationality or shaped solely by the rules of marketing. The challenge we face is to build a modeling process compatible with both meaningful values and the immense potential of Big Data.
It is perfectly possible to imagine ethical modeling of complex data, for reasons that are integral to the way data mining operates. The inductive algorithms at the heart of Big Data processes are driven by a logic remarkably close to the practical wisdom that is at the core of ethics.
In our day-to-day lives, we human beings perceive incoming information, we establish links with other items we have already memorized and thus we acquire skills that we are able to exercise in a repetitive manner. In this way, we acquire a practical wisdom which can be formalized and refined into ethics, defined as the art of the right behavior.
The logic of data mining algorithms, as I said, is very close to the one we humans use in our daily routines. This logic is inductive, not deductive. Big Data algorithms were not designed to produce demonstrations that lead to undisputed results of the “a + b” type. They operate on partial or incomplete data with little structure, data that does not allow them to proceed to a formal demonstration. Their function is rather to identify repetitive routines, patterns and behavioral schemes: for example, they can see on Amazon that the reader of a given book would probably be interested in some other, identifiable, book. They gather the data, assemble it into strings of information, interpret these strings and connect them to other data already memorized, which enables the system to narrow down the offer towards a practical end. That end lies where our interests coincide with those of the organization that owns the data processing systems, and it comes in various configurations that run from near neutrality to a hard sell of a particular item or choice.
The key moment, for humans and for algorithms alike, is when the offer is simplified, i.e., when the complex set of raw data is transformed into a practical piece of information. Simplification here, in the area of information processing technologies, tends to favor very low entropy, i.e., a near-zero degree of disorder. To return for a moment to the example of Amazon, this leads to the decision not to offer a book on ethnology to a science-fiction fan. It is precisely at this crucial moment of the simplification process that an ethical modeling of complex data must step in, to accompany the operation and give it meaning.
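The drop in entropy that accompanies simplification can be made concrete with a minimal sketch. It assumes that a reader’s interests can be summarized as weights over book categories and that “simplifying the offer” means keeping only the highest-weighted categories; the weights, the category names and the helper functions are all hypothetical, chosen purely for illustration.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def normalize(weights):
    """Turn raw weights into probabilities that sum to 1."""
    total = sum(weights.values())
    return {category: w / total for category, w in weights.items()}

def simplify(weights, keep):
    """Keep only the `keep` highest-weighted categories (the 'simplified offer')."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:keep])

# Hypothetical interest weights inferred for one reader (illustrative values only).
raw_interest = {
    "science fiction": 40,
    "fantasy": 25,
    "cooking": 15,
    "history": 15,
    "ethnology": 5,
}

simplified_offer = simplify(raw_interest, keep=2)

entropy_before = shannon_entropy(normalize(raw_interest).values())
entropy_after = shannon_entropy(normalize(simplified_offer).values())

print(f"entropy before simplification: {entropy_before:.3f} bits")
print(f"entropy after simplification:  {entropy_after:.3f} bits")
print("categories dropped:", sorted(set(raw_interest) - set(simplified_offer)))
```

In this toy setting the simplified offer has lower entropy than the raw profile, and ethnology is among the categories that disappear. The question an ethical model would ask sits in the `simplify` step: who chooses the cut-off, and on what criteria.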
Two principles stand out. Firstly, the “information” we are discussing must be part of a systemic framework that connects it to action via knowledge. The information contents are aggregated into knowledge, but the knowledge produced this way is practical, directed towards the action itself. It is less knowledge than know-how.
The second principle derives directly from information theory. We can formulate it as follows: the system will display a preference for describing states rather than processes. The challenge for ethics, as for Big Data, is the transition from a given state of complex, disorganized, fuzzy knowledge to a set of simple, structured knowledge oriented towards an end.
A decisive step in data simplification lies in hierarchic ranking. It is this ranking that enables algorithms to be adjusted so that they effectively produce an exploitable result. But working on hierarchic ranking requires reflecting on the value of the data, and this reflection, in turn, leads to a long series of questions: how do you assess the data, for what purpose and with what objectives? How do you assess the value of a datum or a piece of information, and according to what criteria? And what exactly should be assessed?
The value of a datum can be assessed in terms of its content: for example, one click can mean that you like something, or that you are moving in a certain direction, or that you are doing a U-turn or paying for a transaction. It can also be assessed in terms of content redundancy, diversity and quantity. Value can also depend on the knowledge produced with the datum: certain data convey a small amount of knowledge, others a more significant quantity. Lastly, the datum value can be assessed in terms of the level of sharing, quality and quantity of the exchanges.
The value of a datum is also a function of the service it provides to the user. Assessing a piece of information amounts to determining the strategy used to disseminate it: providing access to the right information at the most opportune time, and transmitting information selectively according to the areas of interest and the needs of users, so as to fight both disinformation and information overload.
It is therefore essential to identify which data, and how much information, the system designer must make available to users. What data do they in fact need to make a “good” decision or to act “well”? To strike a practical balance in information processing systems between improving the data transmitted and overloading it, two variables can be used to optimize the hierarchic ranking and the selection process.
The first variable is how often the assignment of data to the various levels of the system is reassessed. If reassessment is too frequent, the overhead of shifting data back and forth may diminish or even cancel out the performance gains obtained by re-mapping the data on the hard disk drives.
The second variable is the volume of data included in the minimum storage unit that is then managed and moved through the processing system. Here again, too large a quantity of data will complicate matters and slow down any selective hierarchic ranking process.
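The trade-off attached to the first variable can be illustrated with a toy simulation. This is only a sketch under stated assumptions: a two-level store with a small fast level, a fixed saving for each access served from the fast level, and a fixed cost for each block moved into it. The function `simulate`, its parameters and the two workloads are hypothetical and do not describe any particular storage product.

```python
from collections import Counter

def simulate(workload, capacity, interval, saving=1.0, move_cost=5.0):
    """Net benefit of a two-level store reassessed every `interval` steps.

    workload : list of dicts mapping block name -> accesses during that step
    capacity : number of blocks that fit in the fast level
    saving   : gain per access served from the fast level (assumed constant)
    move_cost: cost of moving one block into the fast level (assumed constant)
    """
    fast = set()        # blocks currently held in the fast level
    window = Counter()  # accesses observed since the last reassessment
    net = 0.0
    for step, accesses in enumerate(workload):
        if step > 0 and step % interval == 0:
            hottest = {block for block, _ in window.most_common(capacity)}
            net -= move_cost * len(hottest - fast)  # pay for each promoted block
            fast = hottest
            window.clear()
        net += saving * sum(n for block, n in accesses.items() if block in fast)
        window.update(accesses)
    return net

# Hypothetical workloads: one whose hot block alternates at every step, one that is stable.
oscillating = [{"A": 10, "B": 2} if step % 2 == 0 else {"A": 2, "B": 10} for step in range(8)]
stable = [{"A": 10, "B": 2} for _ in range(8)]

for name, workload in (("oscillating", oscillating), ("stable", stable)):
    for interval in (1, 4):
        result = simulate(workload, capacity=1, interval=interval)
        print(f"{name:11s} interval={interval}: net benefit {result:+.1f}")
```

With the oscillating workload, reassessing at every step keeps paying migration costs while always chasing the previous step’s hot block, so the net benefit is lower than with infrequent reassessment. With the stable workload the opposite holds, which is why the interval has to be tuned to the data rather than fixed once and for all.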
Hierarchic ranking and assessment of data are essential, and this is where an ethical dimension can be introduced. To illustrate, let us examine a particularly sensitive practical case, that of medical data.
Medical data lies at a crossroads between two worlds: the patient’s private sphere, which must be protected, and statistical epidemiology, which benefits the population as a whole. The question is, how should we balance these two dimensions?
An ethical approach can be based on the four principles identified and described by Tom Beauchamp and James Childress in their reference textbook on these issues (Principles of Biomedical Ethics, 2001).
The first principle is beneficence, defined by the authors as a contribution to the well-being of others. A “beneficent” action must obey two precise rules: it must prove both beneficial and useful, i.e., there must be a positive cost-benefit ratio. The second principle is respect for autonomy: enabling persons to make reasoned, informed choices and to apply their own rules of conduct. This principle aims to have the patient participate in the decision process. The third principle is non-maleficence, viz., avoiding causing harm to persons for whom one has a responsibility, and sparing them injury or suffering that would be meaningless to them. The fourth and last principle is justice, which implies a distribution among patients of all available resources (time, money, power). This principle is closely tied to the notions of equality and equity as they apply in court proceedings. Ideally, any and all actions should tend towards perfect equity, but depending on circumstances and on the persons involved, the equity principle is often used to establish priorities and a certain hierarchic ranking of the acts to be performed.
An appropriate and well-conceived selection of medical data may comply with three out of the four ethical principles.
The principle of beneficence is met when appropriate dissemination of knowledge to users (health sector practitioners and citizens) guarantees the reasoning behind, and the legitimacy of, a given medical action. Communication here is more efficient.
The principle of respect for autonomy is met when clear, precise, adapted and comprehensible information guarantees the reasoned, informed consent of the person concerned. The patient here has the capacity to discuss, to decide and to act.
The principle of non-maleficence, lastly, is met when access to data is limited according to the profile and nature of the user, which improves the security, confidentiality and protection of personal data.
However, this selective procedure for data handling has a negative repercussion on the principle of justice, inasmuch as the information transmitted differs depending on the identity of the user connected to the data processing system. The system enforces attribution and access rules that differ according to each person’s level of authorization. The dissemination of knowledge here is discriminatory and casts doubt on the transparency of the information accessed.
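The tension between non-maleficence and justice can be seen in a minimal sketch of profile-based access. Everything in it is an assumption made for illustration: the roles, the field names, the access policy and the sample record are hypothetical, and a real system would rely on its own authorization model.

```python
# Hypothetical policy: for each field of a medical record, the roles allowed to read it.
ACCESS_POLICY = {
    "age_band": {"epidemiologist", "nurse", "treating_physician"},
    "diagnosis_code": {"epidemiologist", "nurse", "treating_physician"},
    "medication": {"nurse", "treating_physician"},
    "name": {"nurse", "treating_physician"},
    "address": {"treating_physician"},
}

def visible_fields(record, role):
    """Return only the fields this role may read; unlisted fields are hidden by default."""
    return {
        field: value
        for field, value in record.items()
        if role in ACCESS_POLICY.get(field, set())
    }

def coverage(record, role):
    """Share of the record visible to a role: a crude indicator of unequal access."""
    return len(visible_fields(record, role)) / len(record)

# Fictitious sample record (invented values, no real person).
record = {
    "age_band": "40-49",
    "diagnosis_code": "J45",
    "medication": "inhaled corticosteroid",
    "name": "Jane Doe",
    "address": "1 Example Street",
}

for role in ("epidemiologist", "nurse", "treating_physician"):
    shown = visible_fields(record, role)
    print(f"{role:18s} sees {sorted(shown)} ({coverage(record, role):.0%} of the record)")
```

Hiding identifying fields from the epidemiologist protects the patient, which serves non-maleficence. The differing coverage per role is the same mechanism seen from the side of justice: the three users do not receive the same information.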
In this approach, the ranking and sorting of data are carried out as a function of the importance attributed to their content and to the questions raised by their utilization and dissemination.
Simplifying the data transmitted leads to more efficient access and use, with better collection and a higher level of security. On the other hand, this process implies a lesser degree of data integrity. Consequently, hierarchic ranking of data simplifies work for the various users, but entails greater technical complexity for the designer of the data processing system.
Selective hierarchic ranking of data plays a major role in the level of complexity assigned to the data and in their accessibility for potential users. This can be likened to “organizational intelligence.” Algorithms designed according to the principles of ethical data mining lead to new information that we could see as “ethical information.” This information, pre-processed in compliance with an ethical assessment grid, has greater value for future exploitation.
This hierarchic ranking, followed by a selection of the initial data, improves the qualitative and entropic value of knowledge, though at the cost of a quantitative loss of data and information content. Thus, an automated selective hierarchic ranking procedure allows the contents of a storage bay to migrate automatically to a more relevant class of service, depending on the needs of each user.
This approach fits perfectly with ongoing work on the inductive algorithms at the core of Big Data technologies. For any given problem, there is no single universal inductive solution. Nonetheless, a small number of processes commonly suffice to reach a particular end. As with an ethical process, the most efficient inductive algorithms are evolutionary ones. They improve themselves by adjusting their data handling protocols according to the most relevant uses that can be made of the data. To create these algorithms, the data processing must be both anticipatory and contributive. For this purpose, the exploitation of Big Data must be able to convert the data as soon as possible into ethical information that can be used in subsequent steps.
In this context, studying selective hierarchic data ranking through an ethical prism enables us to better understand the unstable balance that exists between data availability, confidentiality and protection. The balance can swing one way or the other, depending on the context. An approach like this raises a series of questions before we implement a data selection procedure: what are the objectives, the aims, the stakes attached to this stage? What data am I going to use? Are they partial or complete data? How am I going to exploit them? Where? With whom as users? Framed more globally, the question is: how are we going to exploit a heterogeneous set of data accumulated and stored in an information system? What will be its relevance with respect to my personal situation? Will what I do distort the initial value of the information? Will the integrity of the final message be preserved?
Technology cannot provide all the answers to questions like these. We also need to invoke professional ethics and human behavior to guarantee the confidentiality and protection of personal data. There may be a case for preparing an ethical charter covering the design, implementation and use of personal data integrated into Big Data systems. The question would then arise of which agency or institution should be entrusted with preparing the charter, and of the process by which the algorithms used are deemed “ethical.”