Digitization means that more information about the world can be represented digitally, so that it can be easily shared, analysed, and used by the tools that have been developed. This allows us to reach greater insights, efficiency, and potential.

The development of artificial intelligence (AI) allows us to truly take advantage of the information at our disposal. We are using AI on a daily basis, e.g. to translate languages, generate subtitles for videos, or block email spam. Beyond making our lives easier, AI is helping us to solve some of the world’s biggest challenges: from treating chronic diseases or reducing fatality rates in traffic accidents to fighting climate change or anticipating cybersecurity threats. Like the steam engine or electricity in the past, AI is transforming our world, our society and our industry.

The breakneck progress being made in AI requires us to realign various aspects of our lives and our understanding of them. One notable aspect is the protection of personal information.

The expansion of personal data

Digitization is already far advanced. A lot of what happens in the real world is reflected digitally: our commercial and most of our private correspondence either takes place digitally or is digitized at some point; information on our activities is not only tracked in a fine-grained manner by tech companies through our online activity and our smartphones, but our social and economic impact is also monitored by states and companies so that they can make sense of the world. Many of the services we receive would be impossible without digital tools that need digital information to function.

Against this backdrop, the importance of personal information and the risks deriving from its abuse have become clear, which has led to the development of a culture of protecting this information. Personal information is in principle any information that relates to a specific individual. What exactly falls under this definition, legally speaking, depends on the applicable rules.

According to the GDPR, which is used here as a general reference point, personal data “only” includes information relating to natural persons who can be identified directly from the information in question, or who can be identified indirectly from that information in combination with other information.

Simple identifiers

At its core, personal information is information that on its own allows the identification of an individual (an identifier). The most traditional identifier is the full name of a person. Nowadays a full name is in many cases not unique enough to identify a specific individual, so it is often supplemented with additional information such as date of birth and/or address of residence.

Over time, additional identifiers have emerged as tools for identifying an individual. Aside from official ones, such as a social security number or ID card number, biometric ones, such as fingerprints or facial features, and digital ones, such as IP address or location, have been added. On their own these do not allow the identification of a specific person in the traditional way, but they can easily be associated with that person in combination with another piece of information and then used for identification purposes. Depending on the applicable law, personal data associated with this kind of identifier is usually considered pseudonymized personal data, meaning that the identifier can easily be associated with a physical person if the linking information can be obtained.
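As an illustration of pseudonymization, the following minimal Python sketch replaces an IP address with a keyed hash. The key, the records and the function name `pseudonymize` are invented for this example, and a keyed hash is only one of several possible techniques; it is an assumption of this sketch, not something prescribed by the rules discussed here.

```python
import hashlib
import hmac

# Hypothetical secret key. In practice it would be stored separately from
# the dataset, because it is the "linking information".
SECRET_KEY = b"example-key-kept-separately"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256) of its value."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Invented access log using documentation-only IP address ranges.
records = [
    {"ip": "203.0.113.7", "page": "/pricing"},
    {"ip": "203.0.113.7", "page": "/contact"},
    {"ip": "198.51.100.23", "page": "/pricing"},
]

# The shared dataset carries the pseudonym instead of the IP address.
pseudonymized = [
    {"user": pseudonymize(r["ip"]), "page": r["page"]} for r in records
]
```

Both visits from the same address receive the same pseudonym, so the records remain linkable to each other. Anyone holding the key can recompute the pseudonym for a known address and re-link it to a person, which is why such data is usually still treated as personal data.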

Complex identifiers

With the development of data analytics and the accumulation of available data, it has become easier to identify individuals from seemingly non-identifying information. It is in fact possible to combine data points that, taken individually, do not constitute an identifier but that, taken together, make it possible to single out individuals and the available data attached to them. Such an identifying combination can then be marked with a single indirect identifier, making it pseudonymized personal data. According to the GDPR, if a combination of physical, physiological, genetic, mental, economic, cultural or social identity factors makes it possible to single out an individual, this information and all other information attached to the singled-out individual is considered personal information. This expands the range of information that can potentially be considered an indirect identifier and makes it contingent on the availability of other data. The more data is available (i.e., already possessed, or accessible through third parties) and the more our ability to analyze information evolves, the more types of information can be deemed potentially identifying.
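The effect of combining data points can be sketched in a few lines of Python. The records, the column names and the helper `singled_out` below are all invented for illustration; the sketch simply counts how many records share each combination of values and reports those that are unique.

```python
from collections import Counter

# Invented records. No single column has a value held by only one person.
people = [
    {"postcode": "8001", "birth_year": 1985, "profession": "nurse"},
    {"postcode": "8001", "birth_year": 1990, "profession": "teacher"},
    {"postcode": "8002", "birth_year": 1985, "profession": "teacher"},
    {"postcode": "8002", "birth_year": 1990, "profession": "nurse"},
    {"postcode": "8001", "birth_year": 1985, "profession": "nurse"},
]


def singled_out(rows, columns):
    """Return the rows whose combination of values in `columns` is unique."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return [row for row, key in zip(rows, keys) if counts[key] == 1]


# Each column on its own singles out nobody ...
alone = [singled_out(people, [c]) for c in ("postcode", "birth_year", "profession")]

# ... but the combination of all three singles out three of the five people.
combined = singled_out(people, ["postcode", "birth_year", "profession"])
```

In this toy dataset every individual value is shared by at least two people, yet three of the five records become unique once the columns are combined. Adding further columns, or joining in data from a third party, can only keep or increase the number of unique combinations.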

Linked data

Personal data is however not limited to identifiers, be they direct or indirect; it also covers all information that is related to an individual in the course of its use. Any information that someone meaningfully attaches to an identifier (be this identifier direct or indirect) constitutes personal data, depending on how and for which purpose it is used.

This means that when working with data, personal data might emerge and affect all correlated data. It is therefore important to have a solid understanding of the basic measures required when working with personal data, even when processing personal data is not the intention.

The use of (personal) data

Data, as a digital representation of our world, is used everywhere for (almost) every activity. For someone using data, an ideal world would allow open access to all available information so as to further advance the activity undertaken. There is therefore a tension between the interests of the subjects from whom the data is sourced, or who have a connection with some data, and the interests of those who need and/or want to use this data.

This tension is not limited to individuals but can also be felt for information relating to companies, civil society groups, segments of the population and state agents. The discussion around existing regulation of the flow of data is however mostly focused on personal data, and this discussion is particularly impactful in fields such as cybersecurity, medical research, and advertising.

Medical research as an example

Data related to individuals or collected from them is personal at the time of collection. In the medical sector, personal data is primarily collected to track the health of an individual and administer the necessary care. Health data is however also precious for advancing medical science, and there is therefore a lot of interest in ways to make it accessible for scientific purposes. Among other uses, it is increasingly used to train AI algorithms that are then employed for healthcare purposes.

According to the GDPR, and other legislation such as the US HIPAA, health data is considered highly sensitive, and the use of health data for purposes other than the direct care of the patient is subject to stringent requirements, such as obtaining free and informed consent. Where consent is not obtained, other similarly stringent conditions for its use apply. This is often impractical because asking for the necessary consent is difficult and time consuming, because future scientific uses cannot be determined in advance (and obtaining informed consent is therefore impossible), and because the data usually needs to be transferable across jurisdictions, each of which has rules that might vary and conflict with each other.

The solution has often been to anonymize the data so that it can be freely shared, since in this way it falls outside the scope of data protection rules. This has allowed the sharing of a great deal of information that sustains research in different fields.

As data protection grows in importance and analytics grows in power, however, the amount of information that can be disclosed without potentially disclosing personal information (i.e., information that cannot be re-linked to specific individuals) is getting smaller and smaller.

AI adds to this problem by making the process of re-identification easier and therefore obliging anonymization techniques to become more and more sophisticated, which leads to a further reduction of the useful information that can be shared in anonymized form.

This trend is in contrast with the need of researchers to access data. Even without considering the most invasive exploratory data analyses (which try to identify correlations within datasets using as much information as possible, in order to find solutions to problems yet to be determined), it is impossible to foresee technological progress and all the analysis techniques and data combinations that could be applied to a published dataset. It is therefore very difficult to determine an anonymization process that makes the risk of re-identification acceptable.

Furthermore, since the process of anonymization tries to strip or blur any potential direct or indirect identifier from a specific dataset, the more securely a dataset is anonymized, the less information it contains, so that it becomes less and less useful for research purposes.
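This trade-off can be made concrete with a small Python sketch. It assumes generalization (coarser age bands and truncated postcodes) as the anonymization technique and uses the size of the smallest group of indistinguishable records as a rough measure of protection; the data, the function names and both of these choices are illustrative assumptions, not a recommended procedure.

```python
from collections import Counter

# Invented patient records.
patients = [
    {"age": 34, "postcode": "8001"},
    {"age": 36, "postcode": "8002"},
    {"age": 38, "postcode": "8004"},
    {"age": 51, "postcode": "8045"},
    {"age": 53, "postcode": "8047"},
    {"age": 58, "postcode": "8049"},
]


def generalize(row, age_band, postcode_digits):
    """Blur a record: put the age into a band and truncate the postcode."""
    low = (row["age"] // age_band) * age_band
    return (f"{low}-{low + age_band - 1}", row["postcode"][:postcode_digits])


def protection_and_detail(rows, age_band, postcode_digits):
    """Return (size of the smallest group, number of distinct combinations)."""
    counts = Counter(generalize(r, age_band, postcode_digits) for r in rows)
    return min(counts.values()), len(counts)


exact = protection_and_detail(patients, age_band=1, postcode_digits=4)
blurred = protection_and_detail(patients, age_band=10, postcode_digits=3)
```

With exact ages and full postcodes every patient forms a group of one, and six distinct combinations are available to a researcher. With ten-year age bands and three-digit postcodes the smallest group has three members, but only two distinct combinations remain.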

On top of that, there is a great deal of uncertainty about what truly constitutes anonymized data and, consequently, about whether this approach is to be considered as sharing anonymized or pseudonymized personal data. This has profound consequences, because it determines whether the consent of the data subject is required for sharing the information in this way.

Measures for the protection of personal data

We are at a point where publishing truly anonymized datasets derived from personal information is becoming either impossible, too difficult, or useless. The solution is therefore often to apply, alongside some form of pseudonymization or anonymization of personal data, additional protective measures, such as controlling who can access the data and requiring the accessing party to have safeguards in place for the protection of the shared data.

This approach is recognized as giving more freedom in sharing information for research purposes, but it limits the opportunity to perform research to institutions able to comply with these additional requirements.

Another approach being developed is that of generating synthetic data with AI. An AI algorithm is trained on real-world data and then used to produce realistic data that is however not associated with any real person. This data can be used as a proxy for real-life situations to identify promising directions for research, without the need to apply data protection rules. This reduces the time and cost of exploring speculative hypotheses, allowing for a less complicated exploration of research directions.
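The train-then-generate idea can be illustrated with a deliberately simple Python sketch. It is a toy stand-in, not an AI model: it assumes that each column can be summarized by a mean and a standard deviation and samples new values from a normal distribution. The measurements, column names and function names are invented for the example.

```python
import random
import statistics

# Invented "real" measurements.
real_rows = [
    {"age": 34, "systolic_bp": 118},
    {"age": 47, "systolic_bp": 126},
    {"age": 52, "systolic_bp": 131},
    {"age": 61, "systolic_bp": 139},
    {"age": 68, "systolic_bp": 144},
]


def fit(rows):
    """'Train': learn a mean and standard deviation for every column."""
    model = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        model[column] = (statistics.mean(values), statistics.stdev(values))
    return model


def generate(model, n, seed=0):
    """'Generate': sample n synthetic rows from the fitted distributions."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    return [
        {column: round(rng.gauss(mu, sigma)) for column, (mu, sigma) in model.items()}
        for _ in range(n)
    ]


model = fit(real_rows)
synthetic_rows = generate(model, n=3)
```

Because this sketch samples each column independently, it loses the relationship between age and blood pressure that is visible in the original rows; capturing such relationships is what more capable generative models are used for. A generated row can also coincide with a real one by chance, so whether a given synthetic dataset really falls outside data protection rules would still need to be assessed.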


There are still various hurdles when trying to use personal data, but a good data governance framework that makes use of encryption, anonymization and pseudonymization techniques, access controls and contractual commitments makes it possible to respect the personality of the people from whom the data originates while extracting the most value out of it.