Operationalizing the Legal Principle of Data Minimization
for Personalization
Abstract.
Article 5(1)(c) of the European Union’s General Data Protection Regulation (GDPR) requires that "personal data shall be […] adequate, relevant, and limited to what is necessary in relation to the purposes for which they are processed (‘data minimisation’)". To date, the legal and computational definitions of ‘purpose limitation’ and ‘data minimization’ remain largely unclear. In particular, the interpretation of these principles is an open issue for information access systems that optimize for user experience through personalization and do not strictly require personal data collection for the delivery of basic service.
In this paper, we identify a lack of a homogeneous interpretation of the data minimization principle and explore two operational definitions applicable in the context of personalization. The focus of our empirical study in the domain of recommender systems is on providing foundational insights about the (i) feasibility of different data minimization definitions, (ii) robustness of different recommendation algorithms to minimization, and (iii) performance of different minimization strategies. We find that the performance decrease incurred by data minimization might not be substantial, but that it might disparately impact different users—a finding which has implications for the viability of different formal minimization definitions. Overall, our analysis uncovers the complexities of the data minimization problem in the context of personalization and maps the remaining computational and regulatory challenges.
1. Introduction
Personalized services such as recommender systems or search engines collect large amounts of user interaction logs. Such data collection practice is widely accepted to be necessary for platforms to build high-quality models (Halevy et al., 2009; Sun et al., 2017). However, some prior work shows that exact user interaction profiles are not necessary to tailor the results of search or recommendations. For instance, Singla et al. show that it is possible to personalize results while storing a reduced user interaction history (Singla et al., 2014), while Biega et al. show that it is possible to shuffle user queries and ratings while preserving the quality of personalized search and recommendations (Biega et al., 2017).
If results can be personalized without exact user profiles, it is pertinent to ask: How much information and what information does an individual need to provide to receive quality personalized results? Note the parallel between this question and the principle of data minimization defined in Article 5 of the European Union’s General Data Protection Regulation (GDPR) (Regulation, 2016) as well as data protection regimes in other jurisdictions, which requires that a system only retain user data necessary to deliver service. The core idea we explore in this work is whether the principles of purpose limitation and data minimization can be complied with in the context of personalization and what minimizing data in this context entails.
In contrast to other GDPR concepts, such as the right to be forgotten or informed consent, there is to date only marginal regulatory and judicial guidance on the interpretation of data minimization. Reasoning about data minimization has largely been confined to setups involving immutable or relatively stationary user characteristics. For instance, examples mentioned in the guidelines issued by the UK’s Information Commissioner’s Office (Office, 2018b) discuss scenarios of collecting people’s names by debt collectors, or employee blood types by employers. More recent regulatory guidelines and industrial practice, however, recognize the multitude of challenges related to minimization in data-intensive applications (Binns and Gallo, 2019; Galdon Clavell et al., 2020).
To the best of our knowledge, this work is the first to operationalize the legal concepts of purpose limitation and data minimization in a scenario where user data collection is not strictly necessary to deliver a service, but where the collection of such data might improve service quality. We tie the purpose of data collection to performance metrics, and define performance-based minimization principles.
In this study, we investigate two possible technical definitions of performance-based data minimization. The first interpretation, which we refer to as global data minimization, minimizes per-user data collection subject to meeting a target mean performance across users. This aligns well with standard empirical risk minimization approaches in machine learning (Vapnik, 1992). Alternatively, per-user data minimization minimizes per-user data collection subject to each user meeting a target performance. Equivalently, this aligns with meeting a target performance for the minimum across all users.
We use these interpretations to compare different minimization strategies for personalized recommendations. We demonstrate that quality recommendations can be provided while collecting substantially less user data. However, we also find that the appropriate minimization strategy is sensitive to the base recommendation algorithm used. While our results suggest that systems should be able to achieve global data minimization, we demonstrate that preserving the average performance conceals substantial impact for individual users. To sum up, the salient contributions of this paper are:
• Identifying a lack of a homogeneous interpretation of the GDPR’s purpose limitation and data minimization principles in the context of personalization systems and proposing a computational definition of performance-based data minimization.
• An analysis of the feasibility of two different data minimization definitions in the domain of recommender systems.
• An analysis of the robustness of different recommendation algorithms to various minimization strategies, both on a population as well as an individual user level.
2. Data Minimization
2.1. A legal perspective
Article 5(1)(c) GDPR requires that personal data be ‘adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed.’ Data minimisation is the direct consequence of the legal principle of purpose limitation, which requires that personal data only be processed for specified, explicit and legitimate purposes and not further processed in a manner incompatible with these purposes. While these core data protection principles cannot be examined exhaustively here, it is worth noting that general statements such as ‘improving user experience’ are generally not specific enough to meet the legal threshold of purpose limitation. This raises the question of whether ‘personalization’ can be a purpose under the GDPR at all.
According to data minimisation, no more personal data than necessary to achieve the purpose can be processed. The first question to ask is thus whether data such as that studied in our paper is personal data. Article 4 GDPR embraces a very broad definition of personal data as ‘any information relating to an identified or identifiable natural person.’ In the past, movie ratings such as those in the MovieLens 20M dataset (Harper and Konstan, 2016) have been shown to allow for identification through linking of private and public datasets (Narayanan and Shmatikov, 2008). It is thus safe to assume that much of the data used in recommender systems, such as movie ratings, constitutes personal data and is hence subject to the GDPR (where within its geographical scope).
Data minimisation can be broken down into three distinct requirements. First, data must be adequate in relation to the purpose that is pursued. Arguably, adequacy is the most interesting of the three criteria as it may actually (and somewhat counterintuitively) require that more data is processed. It is well established that the omission of certain data can limit the usefulness of a dataset and the accuracy of an analysis done on that dataset. As such, to achieve accurate results, more data may need to be collected. Data minimisation indeed ought to be interpreted in light of the other substantive requirements in Article 5 GDPR such as fairness, transparency and accuracy and there are scenarios, often those involving underrepresented groups, where this can only be achieved through the processing of more personal data.
Second, data ought to be relevant in light of the purpose, meaning that only data that is pertinent for the purpose can be processed. For example, if an e-commerce provider requested users’ date of birth to provide personalised recommendations regarding future purchases, this data is unlikely to be relevant (except where recommendations have an astrological flavor). Relevance thus functions as a safeguard against accumulating data simply for the sake of doing so.
Third, the GDPR requires that data be limited to what is necessary, meaning that controllers ought to identify the minimum amount of personal data required to fulfil the stated purpose. Thus, where similarly robust results can be achieved through the processing of less personal data, the processing of personal data can likely not be accepted as being necessary. Where possible, only anonymised data should be used. However, given the practical limitations of achieving anonymisation, the latter cannot be assumed as a viable alternative to minimisation in many contexts (Finck and Pallas, 2019).
2.2. Performance-Based Data Minimization
Our focus in this paper is on operationalizing the third requirement of data minimization, namely that of limitation. According to the legal considerations detailed in the previous subsection, generic statements such as ‘improving user experience’ are not specific enough to be used as a purpose of data collection. Thus, we propose to reason about data minimization by tying the purpose to performance metrics. While there are many ways in which this proposition might be operationalized, in this paper, we begin investigating this space with an empirical study of two definitions.
Let $U$ be the set of users for whom the system needs to minimize the data and let $I$ be the set of items that a system can recommend. Each user $u \in U$ has rated some subset of items $I_u \subseteq I$. Let $\mathbf{r}_u$ be the vector of ratings for these items. Of the rated items in $I_u$, in a minimization setting, the system only sees a subset $I^*_u \subseteq I_u$, referred to as the observational pool for user $u$. Let $\mathbf{r}^*_u$ be the ratings for these observations. Given the observed ratings, a system generates $\hat{\mathbf{r}}_u$, its predicted ratings for $u$. The quality metric for $u$ is defined as $q(\hat{\mathbf{r}}_u, \mathbf{r}_u)$.
Definition 1 (Global data minimization).
A system satisfies global data minimization if it minimizes the amount of per-user data while achieving the quality of a system with access to the full data on average:

$\min_{\{I^*_u\}_{u \in U}} \max_{u \in U} |I^*_u|$  and  $\big|\, \mathbb{E}_u[\, q(\hat{\mathbf{r}}_u, \mathbf{r}_u) \,] - \mathbb{E}_u[\, q(\hat{\mathbf{r}}^*_u, \mathbf{r}_u) \,] \,\big| \leq \lambda,$

where $\hat{\mathbf{r}}^*_u$ is the prediction using the ratings in $\mathbf{r}^*_u$ and $\lambda$ is a threshold difference in the expected per-user performance.
Definition 2 (Per-user data minimization).
A system satisfies per-user data minimization if it minimizes the amount of per-user data while achieving the quality of a system with access to the full data for each user:

$\min_{\{I^*_u\}_{u \in U}} \max_{u \in U} |I^*_u|$  and  $\forall u \in U: \; \big|\, q(\hat{\mathbf{r}}_u, \mathbf{r}_u) - q(\hat{\mathbf{r}}^*_u, \mathbf{r}_u) \,\big| \leq \lambda,$

where $\hat{\mathbf{r}}^*_u$ is the prediction using the ratings in $\mathbf{r}^*_u$ and $\lambda$ is a threshold difference in the per-user performance.
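To make the difference between the two definitions concrete, the following sketch (hypothetical names; the quality values are assumed to come from a recommender evaluated as in Sec. 3) checks whether a set of minimized profiles satisfies either criterion for a higher-is-better metric such as NDCG; for RMSE the sign of the differences would flip. The numbers are illustrative only.

```python
import numpy as np

def satisfies_global_minimization(q_full, q_min, lam):
    """Global data minimization: the mean per-user quality under minimized
    profiles may drop by at most lam relative to the full-data mean."""
    return np.mean(q_full) - np.mean(q_min) <= lam

def satisfies_per_user_minimization(q_full, q_min, lam):
    """Per-user data minimization: every individual user's quality under
    minimized profiles may drop by at most lam."""
    return bool(np.all(np.asarray(q_full) - np.asarray(q_min) <= lam))

# q_full[u] / q_min[u]: quality (e.g., NDCG) for user u with full vs. minimized data.
q_full = [0.82, 0.75, 0.90]
q_min = [0.81, 0.74, 0.78]
print(satisfies_global_minimization(q_full, q_min, lam=0.05))   # True: average quality is preserved
print(satisfies_per_user_minimization(q_full, q_min, lam=0.05)) # False: the third user is harmed
```

The toy example mirrors the paper’s central observation: a minimized dataset can satisfy the global criterion while violating the per-user one.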
3. Experimental Setup
3.1. Datasets
We run our analyses using (1) the MovieLens 20M dataset (Harper and Konstan, 2016) and (2) the Google Location dataset (He et al., 2017). Because of space constraints, we report the results using dataset (1) and use dataset (2) for validation, reporting differences in observations where applicable. To properly reason about data minimization, we only select users who have at least 45 ratings in their profile. For efficiency reasons, we further subsample the users, creating (1) a MovieLens dataset containing around 2.5k users, 170k ratings, and 20k unique movies; the mean and median number of ratings in a user profile are 69.5 and 59, respectively, and (2) a Google Location dataset containing around 2.2k users, 185k ratings, and 150k unique items; the mean and median number of ratings in a user profile are 85.2 and 64, respectively.
3.2. Recommendation algorithms
We analyze data minimization properties for two fundamental classes of recommendation algorithms: neighborhood-based (k-nearest-neighbors) and matrix-factorization-based (SVD) (Funk, 2006), both as implemented in the Surprise library (Hug, 2017).
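For illustration, both algorithm families can be instantiated in Surprise roughly as follows; the concrete classes and the neighborhood size shown here are assumptions rather than a restatement of the paper’s exact configuration.

```python
from surprise import KNNBasic, SVD

# User-user kNN with mean-squared-difference similarity (cf. Sec. 3.2.2);
# the neighborhood size k below is a placeholder value.
knn = KNNBasic(k=30, sim_options={"name": "msd", "user_based": True})

# FunkSVD-style matrix factorization with 30 latent factors (cf. Sec. 3.2.3).
svd = SVD(n_factors=30)
```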
3.2.1. Notation.
For a user $u$ and item $i$, we use $r_{ui}$ to denote the true rating given by the user for the item and $\hat{r}_{ui}$ to denote the rating predicted for the user and item by a predictive model.
3.2.2. Neighborhood-based.
For the neighborhood-based recommendations, we use the user-user k-nearest-neighbors algorithm, with the neighborhood size $k$ set as per prior studies investigating recommendation performance on the MovieLens dataset (Ekstrand and Riedl, 2012). The rating prediction $\hat{r}_{ui}$ for user $u$ and item $i$ is computed as a weighted sum of the ratings of item $i$ made by $u$’s top-$k$ nearest neighbors among users who rated item $i$:

(1)  $\hat{r}_{ui} = \dfrac{\sum_{v \in N^k_i(u)} \mathrm{sim}(u, v) \cdot r_{vi}}{\sum_{v \in N^k_i(u)} \mathrm{sim}(u, v)}$

where $N^k_i(u)$ is the set of $k$ users who have rated item $i$ and who are most similar to user $u$ by the value of the similarity measure $\mathrm{sim}$.
User similarity $\mathrm{sim}(u, v)$ is computed as the inverse of the mean squared difference of ratings (with add-1 smoothing) over the set $I_u \cap I_v$ of items rated by both users.
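A minimal sketch of Eq. 1 together with the MSD-based similarity, assuming ratings are stored in per-user dictionaries mapping item ids to rating values (all names hypothetical):

```python
def msd_similarity(ratings_u, ratings_v):
    """Inverse mean squared difference of ratings over co-rated items, with add-1 smoothing."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)

def knn_predict(u_ratings, system_ratings, item, k):
    """Eq. 1: weighted sum of the ratings for `item` made by the k users most similar to u."""
    raters = [(v, r[item]) for v, r in system_ratings.items() if item in r]
    scored = [(msd_similarity(u_ratings, system_ratings[v]), r_vi) for v, r_vi in raters]
    top_k = sorted(scored, reverse=True)[:k]
    denom = sum(sim for sim, _ in top_k)
    return sum(sim * r_vi for sim, r_vi in top_k) / denom if denom > 0 else None
```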
3.2.3. Matrix-factorization-based.
For the matrix-factorization-based recommendations, we use an implementation of the FunkSVD algorithm (Funk, 2006) with 30 latent factors. The rating prediction $\hat{r}_{ui}$ for user $u$ and item $i$ is computed as:

(2)  $\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{q}_i^\top \mathbf{p}_u$

where $\mathbf{q}_i$ is a 30-dimensional latent vector representing item $i$, $\mathbf{p}_u$ is a 30-dimensional latent vector representing user $u$, $\mu$ is a global mean, and $b_i$ and $b_u$ are item and user biases, respectively.
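Eq. 2 corresponds to the following prediction rule (a sketch with hypothetical variable names; in practice the parameters come from the trained model):

```python
import numpy as np

def svd_predict(mu, b_u, b_i, p_u, q_i):
    """Eq. 2: global mean + user bias + item bias + dot product of the latent vectors."""
    return mu + b_u + b_i + float(np.dot(q_i, p_u))
```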
3.3. Error measures
We measure the quality of recommendations using: RMSE (comparing the differences between the predicted and true ratings for all items in the test set and thus assuming a user consumes the whole recommendation set) and NDCG@$k$ (measuring the quality of the top-$k$ results with a logarithmic discounting factor for errors in lower ranking positions (Järvelin and Kekäläinen, 2002)). In our experiments, we set $k = 10$.
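For reference, the two measures can be computed roughly as follows; the exact gain and tie-handling conventions are not restated in the paper, so this sketch uses raw ratings as gains.

```python
import numpy as np

def rmse(predicted, true):
    predicted, true = np.asarray(predicted, float), np.asarray(true, float)
    return float(np.sqrt(np.mean((predicted - true) ** 2)))

def ndcg_at_k(predicted, true, k=10):
    """Rank items by predicted rating, accumulate true ratings as gains with
    logarithmic position discounts, and normalize by the ideal ranking."""
    predicted, true = np.asarray(predicted, float), np.asarray(true, float)
    order = np.argsort(-predicted)[:k]
    ideal = np.sort(true)[::-1][:k]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = float(np.sum(true[order] * discounts[: len(order)]))
    idcg = float(np.sum(ideal * discounts[: len(ideal)]))
    return dcg / idcg if idcg > 0 else 0.0
```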
3.4. Protocol
We explore data minimization in the context of a system that begins with extensive data collection for a starting set of users. This may be gathered in-house or from a representative market not subject to data minimization constraints. While there will be situations where seed data is unavailable, we leave that for future work.
To simulate this situation, we randomly split the full dataset into two parts: the system data (70% of all users) and the minimization data (the remaining 30% of all users). Users are randomly assigned to one of these groups. For the minimizing users, we further randomly split their ratings into candidate and test data. Different minimization strategies will select different subsets of each user’s candidate data for use by the system. Recommendations generated based on the selected candidate data are evaluated using the remaining test data.
Data is selected from the candidate user data using a chosen minimization strategy and a minimization parameter $n$ (the number of items to select). We run experiments for $n \in \{1, 3, 7, 15, 100\}$.
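A sketch of the splitting protocol under stated assumptions: the 70/30 user split follows the percentage mentioned in Sec. 4, while the candidate/test ratio below is a placeholder, since the exact value is not restated here.

```python
import numpy as np

rng = np.random.default_rng(42)

def split_protocol(all_users, user_ratings, system_frac=0.7, candidate_frac=0.5):
    """Randomly assign users to the system or minimization group, then split each
    minimizing user's ratings into candidate data (visible to minimization
    strategies) and held-out test data."""
    users = rng.permutation(list(all_users))
    n_system = int(system_frac * len(users))
    system_users, minimizing_users = users[:n_system], users[n_system:]

    candidate, test = {}, {}
    for u in minimizing_users:
        ratings = list(user_ratings[u])
        idx = rng.permutation(len(ratings))
        cut = int(candidate_frac * len(ratings))
        candidate[u] = [ratings[i] for i in idx[:cut]]
        test[u] = [ratings[i] for i in idx[cut:]]
    return list(system_users), candidate, test
```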
3.5. Data minimization strategies
When minimizing data, we select a subset of user candidate items to present to the recommendation algorithm. While approaches with a similar problem structure have used greedy algorithms modeling the information-theoretic utility of data (Krause and Horvitz, 2010), greedy algorithms are less practical in a data minimization scenario. Since the utility of data is tied to a specific recommendation performance metric rather than modeled as information gain, the submodularity and monotonicity properties upon which guarantees for greedy algorithms are based do not necessarily hold. Moreover, greedy selection is costly in terms of runtime, since the recommendation algorithm needs to be run for every possible selection. This section presents the selection strategies we study in this paper.
3.5.1. Full.
We compare other minimization strategies against a baseline generating predictions based on the full observational pools of the minimizing users. Formally, $I^*_u = I_u$.
3.5.2. Empirical bounds.
We compare the minimization results against brute-force baselines that select the single item from a user’s profile that leads to (i) the highest prediction RMSE (One item worst), (ii) the lowest prediction RMSE (One item best). We also compute (iii) the average RMSE over all possible 1-item selections (One item avg); this value can be thought of as an empirical expected value of RMSE over random 1-item selections.
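The single-item bounds can be obtained by brute force over all 1-item selections; in the sketch below, `evaluate_rmse` stands for a hypothetical callback that runs the recommender on the chosen observational pool and scores it against the user’s test data.

```python
def one_item_bounds(candidate_items, evaluate_rmse):
    """Evaluate every possible single-item observational pool for one user and
    return the best, worst, and average RMSE over those selections."""
    scores = [evaluate_rmse([item]) for item in candidate_items]
    return min(scores), max(scores), sum(scores) / len(scores)
```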
3.5.3. Random.
This strategy selects ratings uniformly at random from the observational pools of the minimizing users. The key observation to make here is that this method will not create random user profiles as a result, but rather minimized profiles that preserve the average characteristics of each user. That is, if ratings of certain types (e.g., of a certain genre) are common in the full observational profile, they are likely to be preserved through the random sampling.
3.5.4. Most recent.
This strategy selects the most recent ratings from the observational pools of the minimizing users. Note that one can expect this method to behave similarly to the random method if the data is collected over a period of time short enough for user tastes to stay intact. If the observational data of each user spans a very long time, we could expect the predictions to be better than random when the test data is also sampled from the most recent ratings, and worse than random otherwise.
3.5.5. Most/least favorite.
These strategies select the ratings that have the highest/lowest value for a given user, respectively.
3.5.6. Most Rated.
This method uses the system data to determine the selection. For a given user, we select the items that have been rated most often (measured as the number of times an item has been rated across all users in the system data).
3.5.7. Most characteristic.
This method uses the system data to determine the selection for a given user. We create a binary vector representation $\mathbf{v}_i$ of each item $i$ by allocating each system-data user to a dimension of $\mathbf{v}_i$ and setting the value to 1 if the user has rated item $i$, and 0 otherwise. We then take the average $\bar{\mathbf{v}}$ of all the item representations. Finally, for a given user we select the items with the smallest Euclidean distance to the average item representation. Whereas the most rated strategy treats all users the same when creating its counts, this strategy rewards items for being rated by users who have rated many items and penalizes items that have been rated by users who have rated few items. Formally, $I^*_u = \operatorname{argmin}_{S \subseteq I_u, |S| = n} \sum_{i \in S} d(\mathbf{v}_i, \bar{\mathbf{v}})$, where $d$ is the Euclidean distance between two vectors, $\mathbf{v}_i$ is the binary representation of item $i$, and $\bar{\mathbf{v}}$ is the average item vector; all vectors are computed using the system data.
3.5.8. Highest variance.
This method is based on one of the standard approaches for feature selection in machine learning (Guyon and Elisseeff, 2003). It uses the system data to determine the selection for each user by looking at which items have the highest variance in their ratings. Formally, $I^*_u = \operatorname{argmax}_{S \subseteq I_u, |S| = n} \sum_{i \in S} \sigma(R_i)$, where $\sigma$ is the standard deviation and $R_i$ is the set of all ratings for item $i$ in the system data.
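The three strategies that rely on the system data can be sketched as follows, assuming `system_ratings` maps each item id to the list of (user, rating) pairs observed in the system data and `item_vectors` holds the binary item representations from Sec. 3.5.7 (names hypothetical):

```python
import numpy as np

def most_rated(candidate_items, system_ratings, n):
    """Select the n candidate items rated most often in the system data."""
    counts = {i: len(system_ratings.get(i, [])) for i in candidate_items}
    return sorted(candidate_items, key=lambda i: -counts[i])[:n]

def most_characteristic(candidate_items, item_vectors, n):
    """Select the n candidate items whose binary user-indicator vectors lie
    closest (in Euclidean distance) to the average item vector."""
    mean_vec = np.mean(list(item_vectors.values()), axis=0)
    dist = {i: np.linalg.norm(item_vectors[i] - mean_vec) for i in candidate_items}
    return sorted(candidate_items, key=lambda i: dist[i])[:n]

def highest_variance(candidate_items, system_ratings, n):
    """Select the n candidate items with the highest rating spread in the system data."""
    spread = {i: np.std([r for _, r in system_ratings.get(i, [])] or [0.0]) for i in candidate_items}
    return sorted(candidate_items, key=lambda i: -spread[i])[:n]
```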
4. Global data minimization
To guide the interpretation of the results, we make the following remarks. When reasoning about the feasibility of data minimization, it is important to understand what quality loss we would incur if we based personalized recommendations on minimized user profiles. The main purpose of our experimental study is thus to measure and compare the quality of recommendations under different minimization conditions.
To reason about the efficacy of a minimization condition (a maximum user profile size and a minimization strategy) for a given recommendation algorithm, we measure the difference between the quality of recommendations obtained under the minimization condition and the quality of recommendations obtained if the recommendation algorithm sees all available user data (the Full strategy). We conclude that minimization is feasible if this difference is not statistically significant, or if the difference is minimal (a low RMSE increase and a low NDCG decrease).
Table 1. RMSE and NDCG@10 of kNN recommendations under the different minimization strategies (random, most recent, most favorite, least favorite, most watched, most characteristic, highest variance) for n = 1, 3, 7, 15, 100.
Table 2. RMSE and NDCG@10 of SVD recommendations under the different minimization strategies (random, most recent, most favorite, least favorite, most watched, most characteristic, highest variance) for n = 1, 3, 7, 15, 100.
Figure 1. Sorted per-user RMSE (a, c) and NDCG (b, d) values for users in the minimization dataset under the random minimization strategies (kNN and SVD recommendations).
4.1. Feasibility of global data minimization
Tables 1 and 2, as well as Figure 1, present the performance of the k-NN and SVD recommendation algorithms for various minimization strategies and intensities (the parameter $n$ denotes the number of items from the observational pools that were shown to the recommendation algorithm). The numbers show the RMSE and NDCG values of the minimized recommendations, averaged over all minimizing users.
For both recommendation algorithms, we observe that the recommendation quality decreases as we present the algorithm with less data to base personalized recommendations on. We attribute the few exceptions (e.g., the increase of RMSE between n=3 and n=7) to the inherent noisiness of data and effects of sampling strategies.
We would like to highlight two observations. First, the overall loss incurred by minimization is relatively low when compared to the variation of error across users — see Figure 1 for a visualization of sorted error values for all users in the minimization dataset for the random minimization strategies. It is important not to overinterpret these results based on measures like RMSE, though. Ratings in recommendation datasets are often relatively homogeneous in terms of absolute values: In the MovieLens dataset, for instance, they vary between 0.5 and 5.0 in 0.5 increments. Moreover, most users abstain from using extreme values in their ratings: In our system data, out of the 10 different values in the rating scale, the three most used rating values make up the majority of all ratings.
Second, the distribution of error in the population remains the same even when the recommendations are based on minimized data. We observe that the shapes of the error value curves are similar for different minimization methods beyond random (effects similar to those in Figure 1). We exclude additional plots for lack of space.
4.1.1. Withheld data
While our experiments explicitly controlled the size of the user interaction logs available to a recommendation algorithm, the data withheld from the algorithm can be substantial: for the smaller values of $n$, most of a user’s candidate ratings are withheld on average. Note that this is not a comment about the total amount of data available to the system: In the setup we consider in this paper, the recommendation algorithm is trained on the full data of 70% of users, which means that the effective percentage of withheld data is lower.
4.2. Algorithm robustness to data minimization
We find that SVD is more robust to data minimization according to both quality measures. In the case of RMSE, the metric differences between the Full strategy and any other strategy and minimization parameter are lower for SVD than for kNN. This observation also holds for NDCG; moreover, the differences in NDCG between the performance of SVD on full data and on minimized data are not significant (under a two-tailed t-test with the Bonferroni correction). Note that the SVD robustness result is partly explained by our experimental protocol—the minimized observed data of each test user is ’folded in’ into the matrix one user at a time. While this approach is more computationally expensive than folding in all test users at once, the resulting decomposition is computed for a matrix where only one row differs from the full-data condition. On top of that, the NDCG measure is not sensitive to differences in predicted rating values as long as the predicted ranking of items remains the same (which is likely to happen when the decomposed matrix is similar to the full-data matrix). The lower minimization robustness of kNN can furthermore be explained by the fact that user similarities are computed over rating sets joint with other system users (see Sec. 3.2), and minimization thus leads to computing predictions over noisier neighbor sets.
4.2.1. Comparison to prior work.
Note that these findings are consistent with prior work. First, Chow et al. (Chow et al., 2013) demonstrate that, for similarity-based recommendations, performance often does not differ after removing random data points. Further, different data removal strategies can improve or degrade predictive performance relative to random removal; in some cases, strategies can improve over the non-minimized predictions (Chow et al., 2013, Fig. 1).
Second, Wen et al. (Wen et al., 2018) analyzed performance decreases in a recommendation privacy scenario where users provide an algorithm with their recommendation data only from the most recent days. This filtering strategy is similar to the Most Recent minimization strategy we introduce in Sec. 3.5. Wen et al. showed that predictions of matrix-factorization-based methods are robust, with performance not degrading even when data is limited to ratings from the previous one to seven days, especially when the percentage of minimizing users is low (Wen et al., 2018, Fig. 2). (The experimental protocol used in our paper maps to a setting in Wen et al. where the percentage of minimizing users is low.)
4.2.2. Factors influencing rating changes when minimizing for k-NN
Recall Eq. 1. What will influence the difference between an item prediction under the minimization condition and the prediction based on the full observational pool? Since the system data remains intact under our experimental setup, the neighbors’ ratings $r_{vi}$ will remain intact as well. The predicted rating $\hat{r}_{ui}$ will change, though, when $u$’s relative similarity to other users changes. This might happen when:
• The set of nearest neighbors changes and user $u$ is placed in a different neighborhood for item $i$. The nearest-neighbor summation of ratings happens over a different set of users (even if the relative similarities to those users stay the same).
• The set of nearest neighbors changes and user $u$ is placed in a neighborhood where the relative similarities to other users are different (even if the neighbor rating values are the same).
• The set of nearest neighbors stays the same, but the similarity of $u$ to other users within the neighborhood changes. Note that this is very likely to happen since the similarities will be computed over $u$’s minimized data.
While it is possible to enumerate these error contributing factors, analysis of how exactly they impact overall minimization error is challenging because the different dimensions (user similarity, neighborhoods, item popularity, etc.) all influence each other.
4.2.3. Factors influencing rating changes when minimizing for SVD
When will an item prediction under the minimization condition differ from the prediction based on the full observational pool? Note that the latent item representations $\mathbf{q}_i$ and item biases $b_i$ will largely stay intact – during training, most updates to them will come from the data of the system users. The rating change will primarily be influenced by a change in the latent user representation $\mathbf{p}_u$ and user bias $b_u$ – during training, updates to these components will come from the latent factors of the minimized observational items. Thus, we can expect the biggest rating differences if the items in the minimized user profile do not reflect the full user profile. To examine the relative importance of the latent factors and the biases, we run minimization for recommendations generated using an unbiased SVD (removing $\mu$, $b_u$, and $b_i$ from Eq. 2). We find that the errors incurred by minimization increase in this setup, suggesting that recommendation performance might be preserved by the bias terms when data is minimized.
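In Surprise, the unbiased variant used in this comparison can be obtained by disabling the bias terms; the snippet below is a sketch, with the other hyperparameters left at assumed values.

```python
from surprise import SVD

svd_biased = SVD(n_factors=30)                  # Eq. 2 with mu, b_u, and b_i
svd_unbiased = SVD(n_factors=30, biased=False)  # dot-product-only variant
```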
4.3. Best and worst minimization strategies
4.3.1. Random minimization strategy
Figure 1 presents sorted RMSE (a, c) and NDCG (b, d) error values per user in the MovieLens dataset population, respectively, when minimizing data using random selection strategies. Unsurprisingly, on average, recommendation error increases as we observe fewer items. The error increase is, however, not substantial. There are a number of factors that contribute to this effect. First, note that the random minimization strategy does not create random user profiles, but average user profiles, and the rating distributions over salient categories are likely to remain the same. Second, user profiles are of varying sizes, and for some users the minimizing methods already have access to their full observational pools. We tried to alleviate this effect by including only users whose observational pools have at least 45 ratings. To understand these limitations better, we also plot the empirical lower bound on the error for predictions based on empty observational pools (non-personalized predictions based on the system data only). While the random minimization strategy performs reasonably well, there exist better and worse minimization strategies for both recommendation algorithms.
4.3.2. Strategies performing better than random minimization.
For kNN recommendations, the Most Favorite and Most Watched strategies perform better than Random. The movies a user likes most are likely to contribute most to user-user similarity, and thus the Most Favorite strategy tends to quickly place users in the right neighborhoods. Most Watched, by asking about the most rated movies, will quickly place users belonging to large clusters of popular-movie watchers in the right neighborhood. Since there are many users with a taste for the most popular movies, this strategy overall leads to a good global minimization performance.
4.3.3. Strategies performing worse than random minimization.
For kNN recommendations, the Highest Variance selection strategy performs worse than the Random selection strategy for the lowest values of $n$. One hypothesis is that the items selected by this strategy often have very high or very low ratings for a given user, causing this strategy to effectively interpolate between the performance of the Most Favorite and Least Favorite strategies. Whereas Most Favorite usually performs slightly better than Random, Least Favorite often performs far worse, which, taken together, explains why the Highest Variance strategy often performs worse than the Random selection strategy. We believe this is because the most characteristic score is inversely correlated with the most watched count.
For SVD recommendations, the Most Favorite and Least Favorite strategies perform significantly worse than Random. We hypothesize that asking a user for ratings from just one side of their taste spectrum fails to populate all latent dimensions with relevant information. Moreover, since the most and least favorite items of a given user are likely correlated, asking for more items corroborates this effect by constructing an increasingly skewed user taste representation. This skew potentially explains a reversal effect we have observed—the Most Favorite and Least Favorite strategies initially decrease in performance as we increase $n$.
4.3.4. Other strategies.
For kNN recommendations, the Most Recent strategy performs on average worse than Random, likely due to the fact that the MovieLens 20M data was collected over a long period of time, yet our testing sample was random. The relatively bad performance of the Least Favorite strategy is related to the insensitivity of standard recommendation algorithms to negative rating feedback; systems generally need to be tuned to be able to learn from negative ratings (Frolov and Oseledets, 2016).
4.4. Differences between datasets
As described in Sec. 3.1, we run the same experiments with two different datasets, using the Google Location dataset for validation. We observe the same trends in terms of the performance of different minimization strategies. One major difference in the results is that we observe similar robustness to minimization for kNN and SVD recommendations. We attribute this fact to a key difference between the two datasets—in the Google Location dataset, item ratings are sparser (20k vs. 150k unique items for a similar total number of users), and minimization is thus less likely to lead to an overall change in similarities to other users.
5. Per-user data minimization
5.1. Feasibility of per-user data minimization
Figure 2 shows the error variation when the data is sorted only by the error value of the Full method - other error values correspond to users at the ranking positions determined by the sorting for Full. Note that the data plotted here is exactly the same as the data in Figure 1 — only the sorting differs. These results suggest that, while the distribution of error in the population across users remains largely similar irrespective of recommendation algorithm or minimization strategy (see Figure 1), errors incurred to individuals can be substantial. We observe this behavior for all tested minimization methods and recommendation algorithms, although the per-user variations are lower when minimizing for SVD recommendations.
This finding suggests that, for a fixed quality threshold, data can be less effectively minimized if the loss requirement applies to every individual as opposed to the population on average.
Since the error is not uniformly distributed, we dive deeper to try to understand which users are most impacted. The following sections analyze a number of user characteristics and their correlations with error deltas.
Figure 2. Per-user error values with users sorted by the error of the Full method; errors under minimization are shown at the ranking positions determined by that sorting (same data as Figure 1).
5.2. User characteristics vs. minimization error
We investigate whether the minimization errors (the difference in the quality metric when comparing the recommendations over minimized profiles and the recommendations over full profiles) are correlated with different user characteristics. For each user, we consider the following characteristics (measured over the user’s full profile, before minimization): (1) Number of ratings, (2) Average value of the ratings in a user’s profile, (3) Average popularity of items in a user’s profile (measured as the number of users in the system data who have rated a given item), (4) Profile diversity measured by the number of genres the movies in a user’s profile belong to, (5) Average similarity to all users in the system data, (6) Average similarity to the 30 most similar users in the system data.
5.2.1. Regression analysis
For each pair of recommendation algorithm and minimization strategy, we run an Ordinary Least Squares (OLS) regression with the error delta as the dependent variable and the above user characteristics as independent variables. The error delta is computed in two versions. We compute the coefficient of determination ($R^2$) to measure what proportion of variance in the dependent variable can be explained by the independent variables.
We find that the variance in neither delta version is well explained by the selected user variables, across recommendation algorithms and minimization strategies. For kNN and the first delta version, we get the highest $R^2$ for the Most Recent strategy, followed by the Most Characteristic strategy and the Least Favorite strategy. For kNN and the second version, the $R^2$ values are even lower. For SVD and the first version, we get the highest $R^2$ values for the Most and Least Favorite strategies. For SVD and the second version, the values follow similar trends.
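The regression analysis can be reproduced roughly as follows with statsmodels; `features` stands for a hypothetical DataFrame holding the six user characteristics and `error_delta` for the per-user deltas of one algorithm/strategy pair.

```python
import statsmodels.api as sm

def regress_error_delta(features, error_delta):
    """OLS of per-user error deltas on user characteristics; returns R^2."""
    X = sm.add_constant(features)         # add an intercept column
    model = sm.OLS(error_delta, X).fit()
    return model.rsquared
```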
5.2.2. A closer look
For a closer look into the complex dependencies between user characteristics, minimization strategies, recommendation algorithms, and minimization errors, we plot the most interesting cases in Figure 3.
Figure 3(a) shows the dependency between the number of ratings in a user’s full profile and the error delta (kNN+Random). The plot suggests that the smaller a user’s observational pool, the higher the variation in the incurred minimization error. We conjecture that the reason for this effect is that sparse profiles with little data are likely to misrepresent true user tastes.
Figure 3(b) shows the dependency between a user’s average similarity to all users in the system data and the error delta (kNN+Random). We observe a similar trend – lower global similarity means higher variance in minimization error. However, the reason for this effect is likely different. Users who are similar to many system users are likely to end up in a neighborhood with accurate recommendations irrespective of which items are minimized out of their profiles.
Figure 3(c) shows the dependency between a user’s RMSE error for recommendations over the full observational pool and the error delta (kNN+Random). We observe that lower RMSE values over the full data tend to imply higher error deltas, suggesting that users who are underserved by a system will be harmed the most when minimizing data.
Figures 3(d) and 3(e) reveal a curious observation about the dependency between the average value of ratings in a user profile and the error delta incurred by the Most and Least Favorite strategies for SVD. Users who tend to give lower movie ratings on average will receive worse results when minimizing using the Most Favorite strategy – likely because the movies they like the most will look like neutral movies when compared to the absolute values of ratings of other users. For a similar reason, though inverted, users who tend to give higher ratings on average will receive worse results when minimizing using the Least Favorite strategy. Figure 3(f) shows that for the Random strategy the effect is symmetric and less pronounced.
Figure 3. Dependencies between user characteristics and minimization error deltas: (a) number of ratings vs. error delta (kNN+Random); (b) average similarity to system users vs. error delta (kNN+Random); (c) full-profile RMSE vs. error delta (kNN+Random); (d) average rating value vs. error delta (SVD+Most Favorite); (e) average rating value vs. error delta (SVD+Least Favorite); (f) average rating value vs. error delta (SVD+Random).
6. Data Minimization vs. Privacy
The operational definitions of data minimization proposed in this paper, as shown in the experiments, will often lead to a decrease in data collection. However, it is a feasible scenario that each data point positively contributes to a system’s performance and data collection will not be decreased as a result. Rather than thinking of minimization as another computational definition of privacy, we look at data protection more broadly. For instance, the UK Information Commissioner’s Office defines data protection as ‘the fair and proper use of information about people’ (Office, 2018a). Nevertheless, because of the potential decrease in the collected data, the proposed definitions of data minimization are related to different computational concepts of privacy. We briefly discuss some of these relationships.
Identifiability. The presence of a unique combination of items in an anonymous user profile poses a deanonymization risk: if an attacker has the background knowledge that a user has rated these items, they can uniquely identify the profile and thus gain access to the rest of the user’s data. Analogous scenarios motivated the work on k-anonymity and related concepts (Sweeney, 2002). One way of quantifying identifiability without access to external datasets is through a lower bound on the number of items an attacker would need to know to identify a user in a given dataset. More specifically, we compute, for each user $u$, the minimum size of a subset of her ratings that does not exist in the profile of any other user: $\min \{\, |S| : S \subseteq I^*_u \;\wedge\; \forall v \neq u: S \not\subseteq I^*_v \,\}$. The higher the value of this measure, the larger the number of items an attacker would need to know to uniquely identify a user profile, and thus the lower the identifiability risk.
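The measure can be computed by brute force over growing subset sizes, as in the sketch below (exponential in the worst case; profiles are represented as sets of item ids, and the names are hypothetical):

```python
from itertools import combinations

def min_identifying_subset_size(user_items, other_profiles):
    """Smallest k such that some k-item subset of this user's (minimized) profile is
    contained in no other user's profile; larger values mean lower identifiability risk."""
    for k in range(1, len(user_items) + 1):
        for subset in combinations(user_items, k):
            s = set(subset)
            if not any(s <= set(other) for other in other_profiles):
                return k
    return len(user_items) + 1  # the whole profile is contained in some other profile
```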
Table 3 presents the identifiability statistics for user profiles minimized using different strategies, averaged over all users. The results suggest that minimization strategies selecting items based on the characteristics of the system data (Most Watched, Highest Variance, Most Characteristic) lead to lower profile identifiability than minimization methods based on an individual’s preferences (Most and Least Favorite). The Most Recent strategy leads to the lowest identifiability across different values of the minimization parameter $n$. We conjecture this is because, at a given time, many users rate the same new releases.
Profiling. Another computational privacy concept is that of profiling—collecting detailed topical profiles of users (Biega et al., 2017; Chen et al., 2011; Xu et al., 2007). Should data minimization lead to a decrease in collected data, it is likely that profiling risks also decrease. For instance, in our experiments, decreasing the number of movie ratings in each user’s profile to a maximum of 100 already reduces the average number of different genres represented in a user profile under the best strategy.
Other. While decreasing the size of data might also help with other privacy dimensions, such as protection from inference (du Pin Calmon and Fawaz, 2012) or differential privacy (Dwork, 2008) in case aggregate data is released, analysis of these dimensions is more complex and might lead to removal of different data points.
Table 3. Identifiability of minimized user profiles (average minimum number of items needed to uniquely identify a profile) for the different minimization strategies at n = 3, 7, 15, 100.
7. Related Work
Interpreting GDPR principles in practice. The core contribution of this paper is in pointing out the gap between the current understanding of GDPR’s data minimization principle and the reality of personalization systems, and proposing possible adequate re-interpretations. In this context, our work is related to other efforts to translate GDPR’s principles into data science practice. Prior work in this space has explored practical challenges behind revoking consent to data processing (Politou et al., 2018; Utz et al., 2019), and explored what the right to be forgotten (Koops, 2011) means in practice. Recent work proposes practical solutions for removing data points from trained machine learning models in case an individual included in the training data requests deletion (Ginart et al., 2019). The right to explanation (Kaminski, 2019), requiring service providers to be able to explain algorithmic decisions and results to their users, motivated the active area of explainability and transparency research. Another line of work analyzes changes to the online ecosystem incurred by the GDPR, including the presence of consent notices (Urban et al., 2019) or tracking scripts (Solomos et al., 2019; Sørensen and Kosta, 2019).
Privacy. As discussed in Sec. 6, data minimization is related to some of the computational concepts of privacy. In the context of personalized search, many works proposed mechanisms for perturbing user search logs while preserving the search quality, including mixing and merging queries into synthetic profiles (Biega et al., 2017; Eslami et al., 2017), grouping user profiles (Mac Aonghusa and Leith, 2018), or splitting them (Chen et al., 2011; Xu et al., 2007; Zhu et al., 2010). Privacy has also been interpreted as a probabilistic guarantee on data retention (Singla et al., 2014). To preserve the privacy of recommender system users, it has been proposed to prevent the collection of ratings locally if they are predicted to lead to privacy loss (Guerraoui et al., 2017), or to store the ratings of different users intermingled (Biega et al., 2017). Research in privacy-preserving information retrieval (Yang et al., 2016) moreover investigates problems related to search log anonymization (Zhang et al., 2016), or the relation between user behavior and privacy attitudes (Zimmerman et al., 2019).
Performance of recommender systems under varying conditions. The analyses we perform in this paper are related to a line of work analyzing the success and failure of recommender systems under changing conditions. Ekstrand et al. (Ekstrand and Riedl, 2012) analyze data factors that cause different recommendation algorithms to fail. Chow et al. (Chow et al., 2013) propose techniques to estimate the contributions of different data points to the overall recommendation quality. Vincent et al. (Vincent et al., 2019) propose ’data strikes’ as a form of collective action where users protest by withholding their data from a recommendation provider. Note that, while the goal of data strikes is to limit the availability of data to reduce recommendation performance, the goal of performance-based data minimization is to limit the availability of data while preserving recommendation performance. Wen et al. (Wen et al., 2018) analyzed performance decreases in a recommendation privacy scenario where users provide an algorithm with their recommendation data from the most recent N days.
Relation to other disciplines. We believe that further work on data minimization would lead to synergies not only with the legal community, but also with other computer science subdisciplines. While the focus of data minimization is on minimizing features rather than data points, the problem is related to works studying the relationships between training examples and algorithmic performance. This abstract description includes, for instance, the problems of data influence (Koh and Liang, 2017), data valuation (Ghorbani and Zou, 2019), active learning (Bachman et al., 2017), or budgeted learning (Lizotte et al., 2002).
8. Discussion and Conclusions
8.1. Summary of the findings
In this paper, we have identified a lack of a homogeneous interpretation of the GDPR’s purpose limitation and data minimization principles in the domain of personalization systems. We argue that these systems do not necessarily need to collect user data, but that they do so in order to improve the quality of the results. Thus, we propose two performance-based interpretations of the data minimization principle that tie the limitations of data collection to quality metrics. The first interpretation focuses on the global average algorithm performance, while the second focuses on the local per-user minimum performance.
We found SVD (FunkSVD with user and item biases) to be more robust to minimization than kNN user-user collaborative filtering across different minimization strategies. Among the minimization strategies, we found the random strategy to perform well, likely due to the fact that it preserves average user characteristics. However, for each recommendation algorithm, it is possible to find strategies that perform better or worse than random.
While the results suggest global data minimization can be quite successful (in some cases we can withhold as much as 90% of the user data while incurring an RMSE loss as low as 0.025), we show that the quality difference can be substantial for individual users. Furthermore, our analysis with Ordinary Least Squares regression shows that the variation in individual-level error is not well explained by standard user features. The complex interactions between the individual-level error and recommendation algorithms, minimization strategies, system data, and individual data require further study, also from a legal perspective. Indeed, further research should evaluate the desirability of both approaches, considering that, on the one hand, the GDPR requires that each data processing operation be examined on its own merits, yet on the other, purpose limitation or data protection by design and by default ought to be evaluated from a global perspective.
8.2. Potential negative impacts
Based on our observations about varying user-level errors, it is plausible that data minimization hurts marginalized groups, in particular if those groups form a minority of the data—the members of the majority population will be well served with just a few features (because there is sufficient statistical support), while minority populations will need to provide more features to get service of comparable quality. A scenario like this would further harm marginalized populations through decreased data protection.
Furthermore, our analysis assumes service providers have a collection of background data to base personalization on (purchased or collected from markets that are not legally bound by data minimization requirements). Companies might also need personal data to develop new services. In this work, we did not consider such provider costs.
8.3. Challenges for data minimization
While this paper enhances our understanding of what performance-based data minimization means in practice, a number of challenges emerge. Practical minimization mechanisms would not be able to measure quality loss directly, nor easily adapt selection mechanisms to each user if necessary without access to candidate user data. To support minimization, we need to design new protocols for user-system interaction, and new learning mechanisms that select data while respecting specific minimization requirements. Last but not least, further interdisciplinary work with the legal community is necessary to develop data minimization interpretations that are verifiable and viable, both legally and computationally.
Acknowledgements.
We wish to thank Solon Barocas for the discussion that helped shape the interdisciplinary direction of this work.

References
- Bachman et al. (2017) Philip Bachman, Alessandro Sordoni, and Adam Trischler. 2017. Learning algorithms for active learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. 301–310.
- Biega et al. (2017) Asia J Biega, Rishiraj Saha Roy, and Gerhard Weikum. 2017. Privacy through solidarity: A user-utility-preserving framework to counter profiling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 675–684.
- Binns and Gallo (2019) Reuben Binns and Valeria Gallo. 2019. Data minimisation and privacy-preserving techniques in AI systems. Retrieved May 26, 2020 from https://ico.org.uk/about-the-ico/news-and-events/ai-blog-data-minimisation-and-privacy-preserving-techniques-in-ai-systems/
- Chen et al. (2011) Gang Chen, He Bai, Lidan Shou, Ke Chen, and Yunjun Gao. 2011. UPS: efficient privacy protection in personalized web search. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. ACM, 615–624.
- Chow et al. (2013) Richard Chow, Hongxia Jin, Bart Knijnenburg, and Gokay Saldamli. 2013. Differential data analysis for recommender systems. In Proceedings of the 7th ACM conference on Recommender systems. 323–326.
- du Pin Calmon and Fawaz (2012) Flávio du Pin Calmon and Nadia Fawaz. 2012. Privacy against statistical inference. In 2012 50th annual Allerton conference on communication, control, and computing (Allerton). IEEE, 1401–1408.
- Dwork (2008) Cynthia Dwork. 2008. Differential privacy: A survey of results. In International conference on theory and applications of models of computation. Springer, 1–19.
- Ekstrand and Riedl (2012) Michael Ekstrand and John Riedl. 2012. When recommenders fail: predicting recommender failure for algorithm selection and combination. In Proceedings of the sixth ACM conference on Recommender systems. ACM, 233–236.
- Eslami et al. (2017) Sedigheh Eslami, Asia J Biega, Rishiraj Saha Roy, and Gerhard Weikum. 2017. Privacy of hidden profiles: Utility-preserving profile removal in online forums. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 2063–2066.
- Finck and Pallas (2019) Michèle Finck and Frank Pallas. 2019. They Who Must Not Be Identified-Distinguishing Personal from Non-Personal Data Under the GDPR. Max Planck Institute for Innovation & Competition Research Paper 19-14 (2019).
- Frolov and Oseledets (2016) Evgeny Frolov and Ivan Oseledets. 2016. Fifty Shades of Ratings: How to Benefit from a Negative Feedback in Top-N Recommendations Tasks. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 91–98.
- Funk (2006) Simon Funk. 2006. Netflix update: Try this at home (December 2006). http://sifter.org/~simon/journal/20061211.html
- Galdon Clavell et al. (2020) Gemma Galdon Clavell, Mariano Martín Zamorano, Carlos Castillo, Oliver Smith, and Aleksandar Matic. 2020. Auditing Algorithms: On Lessons Learned and the Risks of Data Minimization. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 265–271.
- Ghorbani and Zou (2019) Amirata Ghorbani and James Zou. 2019. Data Shapley: Equitable Valuation of Data for Machine Learning. In International Conference on Machine Learning. 2242–2251.
- Ginart et al. (2019) Antonio Ginart, Melody Guan, Gregory Valiant, and James Zou. 2019. Making AI Forget You: Data Deletion in Machine Learning. NIPS (2019).
- Guerraoui et al. (2017) Rachid Guerraoui, Anne-Marie Kermarrec, and Mahsa Taziki. 2017. The Utility and Privacy Effects of a Click. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 665–674.
- Guyon and Elisseeff (2003) Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection. Journal of machine learning research 3, Mar (2003), 1157–1182.
- Halevy et al. (2009) A. Halevy, P. Norvig, and F. Pereira. 2009. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems 24, 2 (March 2009), 8–12.
- Harper and Konstan (2016) F Maxwell Harper and Joseph A Konstan. 2016. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis) 5, 4 (2016), 19.
- He et al. (2017) Ruining He, Wang-Cheng Kang, and Julian McAuley. 2017. Translation-based recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems. 161–169.
- Hug (2017) Nicolas Hug. 2017. Surprise, a Python library for recommender systems. http://surpriselib.com.
- Järvelin and Kekäläinen (2002) Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. TOIS 20, 4 (2002), 422–446.
- Kaminski (2019) Margot E Kaminski. 2019. The right to explanation, explained. Berkeley Tech. LJ 34 (2019), 189.
- Koh and Liang (2017) Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. 1885–1894.
- Koops (2011) Bert-Jaap Koops. 2011. Forgetting footprints, shunning shadows: A critical analysis of the right to be forgotten in big data practice. SCRIPTed 8 (2011), 229.
- Krause and Horvitz (2010) Andreas Krause and Eric Horvitz. 2010. A utility-theoretic approach to privacy in online services. Journal of Artificial Intelligence Research 39 (2010), 633–662.
- Lizotte et al. (2002) Daniel J Lizotte, Omid Madani, and Russell Greiner. 2002. Budgeted learning of naive-Bayes classifiers. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence. 378–385.
- Mac Aonghusa and Leith (2018) Pol Mac Aonghusa and Douglas Leith. 2018. 3PS-Online Privacy through Group Identities. arXiv preprint arXiv:1811.11039 (2018).
- Narayanan and Shmatikov (2008) Arvind Narayanan and Vitaly Shmatikov. 2008. Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy, SP. 111–125.
- Office (2018a) UK Information Commisioner’s Office. 2018a. Guide to Data Protection. Some basic concepts. Retrieved Jan 22, 2020 from https://ico.org.uk/for-organisations/guide-to-data-protection/introduction-to-data-protection/some-basic-concepts/
- Office (2018b) UK Information Commisioner’s Office. 2018b. Guide to the General Data Protection Regulation (GDPR). Principle (c): Data minimisation. Retrieved Jan 22, 2020 from https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/principles/data-minimisation/
- Politou et al. (2018) Eugenia Politou, Efthimios Alepis, and Constantinos Patsakis. 2018. Forgetting personal data and revoking consent under the GDPR: Challenges and proposed solutions. Journal of Cybersecurity 4, 1 (2018), tyy001.
- Regulation (2016) Protection Regulation. 2016. REGULATION (EU) 2016/679 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL. Official Journal of the European Union (2016).
- Singla et al. (2014) Adish Singla, Eric Horvitz, Ece Kamar, and Ryen White. 2014. Stochastic privacy. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
- Solomos et al. (2019) Konstantinos Solomos, Panagiotis Ilia, Sotiris Ioannidis, and Nicolas Kourtellis. 2019. Clash of the Trackers: Measuring the Evolution of the Online Tracking Ecosystem. arXiv preprint arXiv:1907.12860 (2019).
- Sørensen and Kosta (2019) Jannick Sørensen and Sokol Kosta. 2019. Before and After GDPR: The Changes in Third Party Presence at Public and Private European Websites. In The World Wide Web Conference. ACM, 1590–1600.
- Sun et al. (2017) C. Sun, A. Shrivastava, S. Singh, and A. Gupta. 2017. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. In 2017 IEEE International Conference on Computer Vision (ICCV). 843–852.
- Sweeney (2002) Latanya Sweeney. 2002. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10, 05 (2002), 557–570.
- Urban et al. (2019) Tobias Urban, Dennis Tatang, Martin Degeling, Thorsten Holz, and Norbert Pohlmann. 2019. A Study on Subject Data Access in Online Advertising After the GDPR. In Data Privacy Management, Cryptocurrencies and Blockchain Technology. Springer, 61–79.
- Utz et al. (2019) Christine Utz, Martin Degeling, Sascha Fahl, Florian Schaub, and Thorsten Holz. 2019. (Un) informed Consent: Studying GDPR Consent Notices in the Field. In ACM SIGSAC Conference on Computer and Communications Security (CCS ’19).
- Vapnik (1992) Vladimir Vapnik. 1992. Principles of risk minimization for learning theory. In Advances in neural information processing systems. 831–838.
- Vincent et al. (2019) Nicholas Vincent, Brent Hecht, and Shilad Sen. 2019. “Data Strikes”: Evaluating the Effectiveness of a New Form of Collective Action Against Technology Companies. In The World Wide Web Conference. ACM, 1931–1943.
- Wen et al. (2018) Hongyi Wen, Longqi Yang, Michael Sobolev, and Deborah Estrin. 2018. Exploring recommendations under user-controlled data filtering. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 72–76.
- Xu et al. (2007) Yabo Xu, Ke Wang, Benyu Zhang, and Zheng Chen. 2007. Privacy-enhancing personalized web search. In Proceedings of the 16th international conference on World Wide Web. ACM, 591–600.
- Yang et al. (2016) Hui Yang, Ian Soboroff, Li Xiong, Charles LA Clarke, and Simson L Garfinkel. 2016. Privacy-Preserving IR 2016: Differential privacy, search, and social media. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1247–1248.
- Zhang et al. (2016) Sicong Zhang, Hui Yang, and Lisa Singh. 2016. Anonymizing query logs by differential privacy. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 753–756.
- Zhu et al. (2010) Yun Zhu, Li Xiong, and Christopher Verdery. 2010. Anonymizing user profiles for personalized web search. In Proceedings of the 19th international conference on World wide web. ACM, 1225–1226.
- Zimmerman et al. (2019) Steven Zimmerman, Alistair Thorpe, Chris Fox, and Udo Kruschwitz. 2019. Investigating the Interplay Between Searchers’ Privacy Concerns and Their Search Behavior. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 953–956.