Understanding the Data Anonymization Process in the Context of Data Privacy Laws – Data Protection



“Anonymized data can never be completely anonymous.”1 A major concern with data anonymization is de-anonymization: the process of combining information from different datasets to re-identify anonymized data. For example, a research team from the University of Texas at Austin set out to demonstrate how de-anonymization could take place with very little information. Using only basic information publicly available on IMDb, they managed to identify many Netflix users’ records, which could then be used to infer sensitive information such as their political preferences.2

In India, “anonymization” is dealt with under the Personal Data Protection Bill, 2019. The consequences of personal data falling into the wrong hands, or being carelessly handled by data controllers, can be catastrophic; anonymization has therefore become part of privacy jurisprudence.

An individual whose personal data is stolen is at risk of misuse and fraud, including identity theft, phishing attacks, and user tracking. Anonymization or pseudonymization is used to protect the integrity of an individual’s personal data by preventing such malicious use by third parties. So, let us start by understanding the meaning of these terms.

What is anonymization?

Personally identifiable information (PII) or sensitive data contains identifying markers such as name, age, address, date of birth, etc. When personal data is collected, these identifiers allow the data fiduciary or data controller to link the personal data to the individual. Many jurisdictions seek to regulate the use and flow of such personal data by data fiduciaries and data controllers. Examples include the General Data Protection Regulation of the European Union and the Indian Personal Data Protection Bill, 2019. The former introduced the concept of “pseudonymisation”, while the latter uses the term “anonymisation”; both prescribe such procedures to protect personal data against identification.

The Personal Data Protection Bill, 2019 defines “anonymisation” under clause 3(2) as follows: “such irreversible process of transforming or converting personal data to a form in which a data principal cannot be identified, which meets the standards of irreversibility specified by the Authority.”

The GDPR, 2016, meanwhile, defines “pseudonymisation” under Article 4(5) as follows: “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.”

“Anonymization is not just about deleting individuals’ personal information”

The term “anonymization” is used to refer to the wide range of techniques and processes that can be used to prevent the identification of the individuals to whom the data relates. Anonymization is the process of turning personal data into anonymous information, so that the person to whom the data relates is no longer identifiable.
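As an illustration of the pseudonymisation concept above, consider the following minimal sketch (a hypothetical example, not a legally prescribed procedure): a direct identifier is replaced with a keyed token, and the key, the “additional information” in Article 4(5) terms, is kept separately from the dataset. All names and values here are invented for illustration.

```python
# Hypothetical pseudonymisation sketch: replace the 'name' identifier with a
# keyed hash (HMAC-SHA256). Without the separately stored key, the token
# cannot be attributed back to the individual.
import hashlib
import hmac

def pseudonymise(record: dict, key: bytes) -> dict:
    """Return a copy of the record with 'name' replaced by a keyed token."""
    out = dict(record)
    token = hmac.new(key, record["name"].encode(), hashlib.sha256).hexdigest()
    out["name"] = token[:16]  # shortened token for readability
    return out

# The key must be stored apart from the published data, under access controls.
SECRET_KEY = b"stored-separately-under-access-controls"

record = {"name": "Asha Rao", "age": 34, "city": "Pune"}
pseudonymised = pseudonymise(record, SECRET_KEY)
```

Note that this is pseudonymisation rather than anonymisation: anyone holding the key can regenerate the token for a known name and so re-link the record, which is why the GDPR still treats pseudonymised data as personal data.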

How is it done?

Some practical examples of data anonymisation techniques, provided by the UK Information Commissioner’s Office,3 include:

  1. Data reduction
  • Deleting variables: The simplest anonymization method is to remove variables that act as direct or indirect identifiers in the data file. These need not be names; a variable should be deleted when it is highly identifying in the context of the data.

  • Deleting records: Deleting the records of particular units or individuals may be adopted as an extreme protection measure where a unit remains identifiable despite the application of other techniques.

  • Global recoding: The global recoding method aggregates the observed values of a variable into predefined classes. Every record in the table is recoded accordingly.

  • Local suppression: Local suppression replaces the observed value of one or more variables in a particular record with a “missing” value. For example, the ‘Age’ value in an identifying record can be recoded as ‘missing’.
  2. Data perturbation (changing values or adding noise)
  • Micro-aggregation: The idea of micro-aggregation is to replace a value with the average calculated over a small group of units.

  • Data swapping: Data swapping modifies records in the data by exchanging the values of variables between records.

  • Post-randomization method (PRAM): “Method of microdata protection in which scores of a categorical variable are changed with certain probabilities to other scores”.4

  • Adding noise: Adding noise consists of adding a random value “n” to all the values of the variable to be protected.

  • Resampling: Resampling involves three steps. First, identify how the sensitive or key variables vary across the population. Second, artificially generate a distorted sample that has the same parameter values as that estimate and the same size as the database. Third, replace the confidential data in the database with the distorted sample.
  3. Non-perturbative methods (do not modify the values)
  • Sampling: Sampling is used when there is enough original data for a sample to be meaningful. Instead of publishing the original data, a sample is extracted and published without identifiers. The resulting sample may still contain sensitive information, but it cannot be attributed to any particular individual.

  • Cross-tabulation: When a data table has two or more variables, another table can be created by tabulating the variables against each other.
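Several of the techniques above can be sketched in a few lines of code. The following is a minimal illustration (the records, age classes, and noise range are invented assumptions, not ICO-prescribed values) of deleting variables, global recoding, and adding noise:

```python
# Toy illustration of three anonymisation techniques from the list above.
# All data and parameters are hypothetical.
import random

records = [
    {"name": "A", "age": 23, "postcode": "110001", "income": 52000},
    {"name": "B", "age": 37, "postcode": "110002", "income": 61000},
    {"name": "C", "age": 41, "postcode": "110003", "income": 58000},
]

def delete_variables(rows, variables):
    """Deleting variables: drop direct/indirect identifiers entirely."""
    return [{k: v for k, v in r.items() if k not in variables} for r in rows]

def recode_age(age):
    """Global recoding: aggregate observed ages into predefined classes."""
    return "18-29" if age < 30 else "30-44" if age < 45 else "45+"

def add_noise(rows, variable, spread, rng):
    """Adding noise: add a random value n to every value of the variable."""
    return [{**r, variable: r[variable] + rng.randint(-spread, spread)} for r in rows]

anonymised = delete_variables(records, {"name", "postcode"})
anonymised = [{**r, "age": recode_age(r["age"])} for r in anonymised]
anonymised = add_noise(anonymised, "income", 2000, random.Random(0))
```

Even after these steps, the combination of remaining attributes (an age band plus an approximate income) can still act as an indirect identifier in a small population, which is exactly the residual risk the next paragraph describes.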

Although anonymizing data may seem like a simple way to protect personal data in theory, in practice it is not. A contextual reference can sometimes lead to identification even when the “variables” are not direct “identifiers”. The same risk arises with pseudonymization, where other available data could be combined to re-identify the pseudonymised data. These methods are therefore fallible and not an ideal route for the protection of personal data.

Here are some alternative suggestions for data anonymization:

  • Differential privacy: This is a technique by which information about a dataset is shared publicly by describing patterns of groups within the dataset, while withholding personally identifiable information.

  • Federated learning: Google introduced the technique in 2017. Federated learning allows researchers to train statistical models on decentralized servers, each holding a local dataset. This means there is no need to upload private data to the cloud or exchange it with other teams. Federated learning improves on traditional machine learning techniques by mitigating data security and privacy risks.

  • Homomorphic encryption: In this technique, calculations are performed on encrypted data without first decrypting it. Since homomorphic encryption allows manipulation of encrypted data without revealing the actual data, it has enormous potential in healthcare and financial services where the privacy of the person is most important.5
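Of the alternatives above, differential privacy is the easiest to illustrate compactly. The sketch below (a simplified assumption-laden example: a counting query with sensitivity 1, using the Laplace mechanism with a privacy budget epsilon chosen by the publisher) shows how a statistic can be released with calibrated noise rather than releasing the underlying records:

```python
# Minimal differential-privacy sketch: release a count with Laplace noise.
# The mechanism, sensitivity, and epsilon value are illustrative assumptions.
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a counting query with noise calibrated to sensitivity 1."""
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

The key property is that any single individual's presence or absence changes the true count by at most one, so the added noise masks each individual's contribution while the released value remains statistically useful in aggregate.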

In conclusion, anonymization and pseudonymization are commonly used tools for protecting personal data. The process of anonymizing data can be simple or complex, depending on how it is done. While foolproof anonymization is the ideal, it may not be achievable anytime soon. Moreover, even when anonymization techniques are used, there may still be a risk that the data subject will be identified. This risk does not mean that the technique is ineffective, nor that the data is not effectively anonymized for the purposes of protective legislation. However, the alternatives above could help close the technical loophole of de-anonymization. Data privacy laws should ideally evolve to incorporate more robust methods of protecting personal data, especially when better alternatives are available. It is not too late for India to include a strong alternative to “anonymization” in its Personal Data Protection Bill, 2019. Such a minor alteration would usher in a new era of more advanced data privacy law.


1 https://www.theguardian.com/technology/2019/jul/23/anonymised-data-never-be-anonymous-enough-study-finds

2 https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf

3 https://ico.org.uk/media/1061/anonymisation-code.pdf

4 https://stats.oecd.org/glossary/detail.asp?ID=6954

5 https://analyticsindiamag.com/data-anonymization-is-not-a-fool-proof-method-heres-why/


