
Privacy in Machine Learning

Changjiang Li

Over the past few years, machine learning has made great progress in many applications, such as face recognition, speech recognition, and image classification. Many cloud platforms have begun to provide Machine Learning as a Service (MLaaS) for these complex tasks to users. However, this also raises concerns about the privacy of machine learning. According to previous research, the privacy issues can be divided into two categories: model privacy and training data privacy. Model privacy mainly concerns the leakage of the model architecture and model parameters, which are the intellectual property of the model holder. Training data privacy concerns the leakage of sensitive information about the data used to train the model, for example through membership inference attacks. Such leakage poses great risks to data providers.

Generally speaking, to extract the target model, a black-box adversary queries an ML model to obtain predictions on chosen input features, and may or may not know the model type or the distribution of the data used to train the model. The adversary's goal is to extract an equivalent or near-equivalent ML model. In a membership inference attack, by contrast, the adversary infers whether an example was used to train the model by querying the model with that example, which may reveal sensitive information about the data provider.
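
To make the membership inference setting concrete, the sketch below implements a simple confidence-threshold attack, a common baseline in which a highly confident prediction is taken as evidence of membership. The query_model function, the example input, and the threshold value are hypothetical placeholders used only for illustration, not part of any particular system.

import numpy as np

def membership_inference(query_model, example, threshold=0.9):
    # query_model: black-box interface returning a probability vector for one input
    # threshold:   hypothetical confidence cutoff; overfitted models tend to assign
    #              higher confidence to training members than to non-members
    probs = np.asarray(query_model(example))   # one query to the target model
    confidence = float(np.max(probs))          # top-class confidence
    return confidence >= threshold             # True -> guess "member"

A single query per example suffices here; stronger attacks in the literature instead train shadow models to learn the decision rule, but the underlying signal they exploit is the same.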

In both attacks, the adversary has to query the target model and draw inferences from the returned predictions. Therefore, an intuitive countermeasure is to perturb the predictions before they are returned to the adversary. The perturbed predictions make it more difficult for the adversary to infer the desired information. However, such defenses may easily be broken by a stronger adversary. To develop more robust defense schemes, we must identify the root causes behind these attack surfaces.
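
As a minimal sketch of this countermeasure, the function below adds small Gaussian noise to the probability vector and re-normalizes it before it is returned to the querier; the noise_scale value is a hypothetical parameter, and a practical defense would also need to preserve the predicted label so that benign users are not affected.

import numpy as np

def perturb_prediction(probs, noise_scale=0.05, rng=None):
    # probs:       original probability vector produced by the target model
    # noise_scale: hypothetical magnitude of the perturbation
    rng = rng or np.random.default_rng()
    noisy = np.asarray(probs, dtype=float) + rng.normal(0.0, noise_scale, size=len(probs))
    noisy = np.clip(noisy, 0.0, None)          # keep entries non-negative
    return noisy / noisy.sum()                 # re-normalize to sum to 1

Coarser variants simply return the top label or round the confidences; all of them trade the utility of the returned prediction against the information available to the adversary.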

A major reason why training data privacy leakage exists is that the trained model is overfitted. As for model privacy, we still understand relatively little about the attack surface. We believe there is much work to be done to understand the privacy of machine learning. The first and most important step is to propose metrics that quantitatively measure privacy leakage. Besides, the relationship between model privacy and training data privacy also needs to be further explored.
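
One simple starting point for such a metric is the gap between the model's behavior on training members and on held-out data, since that gap is exactly what confidence-based membership inference exploits. The sketch below, assuming the same hypothetical query_model interface as above, measures the average top-class confidence gap as a crude proxy for leakage.

import numpy as np

def confidence_gap(query_model, train_examples, heldout_examples):
    # Average top-class confidence on training members minus held-out examples.
    # A large gap indicates overfitting and roughly tracks the success rate of
    # a confidence-threshold membership inference attack.
    def avg_conf(examples):
        return float(np.mean([np.max(query_model(x)) for x in examples]))
    return avg_conf(train_examples) - avg_conf(heldout_examples)

A value near zero suggests the model treats members and non-members similarly, while a large positive gap flags a model that is likely more vulnerable to membership inference.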

In summary, privacy issues are widespread throughout the machine learning pipeline, from training data to model. Developing highly effective methods to evaluate privacy leakage and defending against such leakage are urgent research problems.