Differential Privacy and Deep Learning

Differential privacy is a relatively recent topic in the field of deep learning. The aim is that when a neural network learns from sensitive data, it learns only the general patterns it is supposed to learn, and not details that single out any one individual.

Differential privacy is a concept in privacy-preserving data analysis that aims to protect the privacy of individuals while still allowing useful insights to be gleaned from the data. It is particularly relevant in the context of deep learning, where large amounts of sensitive data may be used to train models. Here are some ways that differential privacy can be used in deep learning:

Adding noise: One common approach to achieving differential privacy in deep learning is to add calibrated random noise during training. In practice the noise is usually added to quantities computed from the data, such as gradients or query results, rather than to the raw records. This limits how much the model can memorize about specific individuals or sensitive information in the data. The noise is typically drawn using the Laplace mechanism or the Gaussian mechanism.

Using private aggregation of teacher ensembles (PATE): PATE is a framework for training machine learning models with differential privacy, which is particularly useful in the context of deep learning. PATE involves training multiple “teacher” models on disjoint subsets of the sensitive data. The teachers then vote on labels for a separate set of unlabeled data, and noise is added to the vote counts to produce “noisy labels”. A final “student” model is trained only on these noisy labels and never sees the sensitive data directly, which helps ensure differential privacy.

Federated learning: Federated learning is an approach to deep learning where the training data remains on the user’s device, and only the model parameters are sent to a central server for aggregation. This can help protect user privacy, as the raw data never leaves the user’s device. Differential privacy can be used to add an additional layer of protection to federated learning, by ensuring that the aggregation of model updates does not leak sensitive information about the user’s data.
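
The last idea, protecting the aggregation of model updates, can be sketched in a few lines. This is a minimal illustration rather than a production recipe: the function names, the example updates, and the values of max_norm and noise_std are all assumptions made for the example. In a real system the noise level would be derived from the clipping norm and a chosen privacy budget.

```python
import math
import random

def clip_update(update, max_norm):
    """Scale the update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]

def private_average(updates, max_norm, noise_std, rng):
    """Average clipped client updates, adding Gaussian noise to each coordinate."""
    clipped = [clip_update(u, max_norm) for u in updates]
    dim = len(clipped[0])
    total = [sum(u[j] for u in clipped) for j in range(dim)]
    # Noise is added to the summed update, then the sum is averaged.
    noisy = [total[j] + rng.gauss(0.0, noise_std) for j in range(dim)]
    return [v / len(updates) for v in noisy]

rng = random.Random(0)
# Hypothetical updates from three clients (for example, weight changes).
client_updates = [[0.5, -1.0], [3.0, 4.0], [-0.2, 0.1]]
clipped = [clip_update(u, max_norm=1.0) for u in client_updates]
avg = private_average(client_updates, max_norm=1.0, noise_std=0.1, rng=rng)
print(avg)
```

Clipping bounds how much any single client can move the average, and the Gaussian noise then hides what remains of each individual contribution.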

Some potential advantages of using differential privacy in deep learning include improved privacy, increased trust in the model, and increased fairness. However, there are also some potential disadvantages, such as increased computational complexity and decreased accuracy. It is important to carefully consider the trade-offs and design appropriate differential privacy mechanisms for specific use cases.
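
To make the PATE approach listed above more concrete, here is a rough sketch of its aggregation step only: each teacher votes for a label, Laplace noise is added to the vote counts, and the label with the highest noisy count is returned. The function names, the example votes, and the noise scale are illustrative assumptions; training the teachers and the student is not shown.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_vote(teacher_labels, num_classes, scale, rng):
    """Return the class with the highest noisy vote count."""
    counts = Counter(teacher_labels)
    noisy_counts = [counts.get(c, 0) + laplace_noise(scale, rng)
                    for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: noisy_counts[c])

rng = random.Random(0)
# Hypothetical predictions from ten teachers for one unlabeled example.
teacher_votes = [0, 1, 1, 1, 2, 1, 1, 0, 1, 1]
label = noisy_vote(teacher_votes, num_classes=3, scale=1.0, rng=rng)
print(label)
```

When the teachers agree strongly, the noise rarely changes the winning label. When they are split, the noise can flip the result, and that uncertainty is what limits the influence of any single teacher's training data.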

A robust definition of privacy is given by Cynthia Dwork and Aaron Roth in their book The Algorithmic Foundations of Differential Privacy:

“Differential Privacy” describes a promise, made by a data holder, or curator, to a data subject, and the promise is like this: “You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.”

The general goal of differential privacy is to ensure that statistical analyses do not compromise privacy. Privacy is considered preserved if, after the analysis, the analyst knows nothing more about any individual in the data set than they could have known without that individual's data. Information that has already been made public elsewhere does not count as harm caused by the analysis.
To define privacy in the context of a simple database, suppose we run a query against the database. If removing any one person from the database does not change the query result, then that person's privacy is fully protected, because the result does not depend on their data.
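
This "removing one person barely changes the result" idea is usually made precise as follows. The symbols are introduced here for illustration: M is a randomized mechanism (a query plus noise), D and D' are any two databases that differ in one individual's record, S is any set of possible outputs, and ε is the privacy parameter.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \, \Pr[\,M(D') \in S\,]
```

A smaller ε forces the two output distributions to be closer, so the output reveals less about whether any one person's data was included.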

Let us Understand With An Example 

Consider a database that contains only the values ‘1’ and ‘0’, where each entry records something sensitive, such as whether an individual has a particular disease (which patients may not want to reveal).

  db = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]

Now build copies of the database, each with exactly one entry removed. These are called parallel databases (parallel DBs). If the original DB has length ‘n’, there are ‘n’ parallel DBs; in our case, 10. Consider the first parallel DB, in which the first individual is removed. What do we get?

pdbs[0] = [0, 1, 1, 0, 1, 0, 1, 0, 0]

This database has length ‘n-1’. To calculate sensitivity we need a query function, so we take the simplest one, ‘sum’, and compare two results:

 sum(db) = 5
 sum(pdbs[0]) = 4

The difference between these two results is ‘1’. The sensitivity is the maximum of these differences over all parallel DBs. Since this DB contains only ‘1’ and ‘0’, each difference is either ‘1’ (when a 1 is removed, as above) or ‘0’ (when a 0 is removed).
Therefore, the sensitivity of ‘sum’ in this example is ‘1’. This is the largest value possible for this data, so differencing attacks are easy with the ‘sum’ query: comparing the sum with and without a person reveals that person's value exactly.
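
The worked example above can be checked with plain Python lists. The helper names parallel_dbs and empirical_sensitivity are chosen for this sketch only.

```python
def parallel_dbs(db):
    """All copies of db with exactly one entry removed."""
    return [db[:i] + db[i + 1:] for i in range(len(db))]

def empirical_sensitivity(query, db):
    """Largest change in the query result when one entry is removed."""
    full_result = query(db)
    return max(abs(query(pdb) - full_result) for pdb in parallel_dbs(db))

db = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
pdbs = parallel_dbs(db)

print(len(pdbs))                       # 10
print(sum(db), sum(pdbs[0]))           # 5 4
print(empirical_sensitivity(sum, db))  # 1
```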

The sensitivity should be low. It gives a quantitative idea of how much a differencing attack can reveal, that is, how much private information a query can leak.
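
Sensitivity also tells us how much noise is needed to hide one person's contribution. The sketch below applies the Laplace mechanism mentioned earlier to the ‘sum’ query, with noise scale equal to sensitivity divided by epsilon. The function names and the choice epsilon = 0.5 are assumptions made for illustration.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_sum(db, epsilon, sensitivity, rng):
    """Sum query with Laplace noise of scale sensitivity / epsilon."""
    return sum(db) + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)
db = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
# Sensitivity of `sum` on 0/1 data is 1, as computed in the example above.
noisy = private_sum(db, epsilon=0.5, sensitivity=1.0, rng=rng)
print(noisy)
```

A query with lower sensitivity needs less noise for the same epsilon, which is why low-sensitivity queries give more accurate private answers.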

Implementing the sensitivity calculation in Python (using PyTorch):


import torch

# the number of entries in our database
num_entries = 5000
# random binary database: each entry is True or False with probability 0.5
db = torch.rand(num_entries) > 0.5

# generating parallel databases
def get_parallel_db(db, remove_index):
    # copy of db with the entry at remove_index left out
    return torch.cat((db[:remove_index], db[remove_index + 1:]))

def get_parallel_dbs(db):
    parallel_dbs = list()
    for i in range(len(db)):
        pdb = get_parallel_db(db, i)
        parallel_dbs.append(pdb)
    return parallel_dbs

pdbs = get_parallel_dbs(db)

# Creating the database and its parallel databases
def create_db_and_parallels(num_entries):
    db = torch.rand(num_entries) > 0.5
    pdbs = get_parallel_dbs(db)
    return db, pdbs

db, pdbs = create_db_and_parallels(2000)

# Creating sensitivity function
def sensitivity(query, n_entries=1000):
    db, pdbs = create_db_and_parallels(n_entries)
    full_db_result = query(db)
    max_distance = 0
    for pdb in pdbs:
        pdb_result = query(pdb)
        db_distance = torch.abs(pdb_result - full_db_result)
        if db_distance > max_distance:
            max_distance = db_distance
    return max_distance

# query our database and evaluate whether or not the result of the
# query is leaking "private" information
def query(db):
    return db.float().mean()

# sensitivity of the mean query on a random database of 1000 entries
print(sensitivity(query))

Input : A randomly generated database (created with the torch library)
Output : tensor(0.0005) (approximately; the exact value varies with the random database)

Explanation: First, we create a random database with the torch library. We then define two functions: get_parallel_db, which builds a single parallel database by removing one entry, and get_parallel_dbs, which builds the full list of parallel databases. The sensitivity function measures the difference between each parallel DB’s query result and the query result for the entire database, and returns the maximum of those differences. This value is called the “sensitivity”. For the mean query on 1000 entries it is about 0.0005, far lower than the value of 1 obtained for the ‘sum’ query in the earlier example.
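
The size of the output can also be checked by hand. The short derivation below uses symbols introduced here for illustration: the database has n binary entries x_1, ..., x_n with sum S and mean m = S/n, and it is assumed that both 0 and 1 occur in the database.

```latex
\begin{align*}
\left| \frac{S - x_i}{n-1} - \frac{S}{n} \right|
  &= \frac{|S - n x_i|}{n(n-1)}
   = \frac{|m - x_i|}{n-1}, \\
\max_i \frac{|m - x_i|}{n-1}
  &= \frac{\max(m,\, 1-m)}{n-1}
   \approx \frac{0.5}{999} \approx 0.0005
   \qquad (n = 1000,\ m \approx 0.5).
\end{align*}
```

So the sensitivity of the mean query shrinks roughly in proportion to 1/n as the database grows, whereas the sensitivity of the ‘sum’ query stays at 1.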

Last Updated : 02 Mar, 2023