A Comprehensive Study on GDPR-Oriented Analysis of Privacy Policies: Taxonomy, Corpus and GDPR Concept Classifiers

Peng Tang, Xin Li, Yuxin Chen, Weidong Qiu, Haochen Mei, Allison Holmes, Fenghua Li, and Shujun Li P. Tang, X. Li, Y. Chen, W. Qiu and H. Mei are with Shanghai Jiao Tong University, Shanghai, China, 200240. A. Holmes and S. Li are with University of Kent, Canterbury, UK. F. Li is with Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China, 100085.

Abstract

Machine learning based classifiers that take a privacy policy as the input and predict relevant concepts are useful in different applications such as (semi-)automated compliance analysis against requirements of the EU GDPR. In all past studies, such classifiers produce a concept label per segment (e.g., sentence or paragraph) and their performances were evaluated by using a dataset of labeled segments without considering the privacy policy they belong to. However, such an approach could overestimate the performance in real-world settings, where all segments in a new privacy policy are supposed to be unseen. Additionally, we also observed other research gaps, including the lack of a more complete GDPR taxonomy and the less consideration of hierarchical information in privacy policies. To fill such research gaps, we developed a more complete GDPR taxonomy, created the first corpus of labeled privacy policies with hierarchical information, and conducted the most comprehensive performance evaluation of GDPR concept classifiers for privacy policies. Our work leads to multiple novel findings, including the confirmed inappropriateness of splitting training and test sets at the segment level, the benefits of considering hierarchical information, and the limitations of the “one size fits all” approach, and the significance of testing cross-corpus generalizability.

Index Terms:

GDPR taxonomy, privacy policy corpus, legal compliance, concept classifier

I Introduction

To provide users with more personalized online services and for other legitimate lawful bases, online services typically collect personal data from their users. For online services provided via the Internet, privacy policies on their websites remain the main type of legal documents for online users to understand how service providers collect their personal data. A typical privacy policy describes different aspects about collection and processing of personal data by an online service, e.g., what personal data are collected, how they are collected, why they are collected, how such data are protected, how such data are stored, and what data are shared with third parties. Although accepting the content of a privacy policy is often made mandatory for starting using an online service, most users tend to omit reading privacy policies because they are often too long to read quickly and too difficult to understand due to the legal and formal wording used [1]. Despite the existence of data protection laws and regulations in many countries, it has been found that service providers’ privacy policies often do not fully comply with such laws and regulations, leading to concerns from online users, researchers and privacy advocates [2]. For example, in 2019, the CNIL, the French data protection authority, fined Google 50 million Euros for failing to provide transparent and understandable information in its data consent policy [3].

Among all data protection laws and regulations around the world, the EU (European Union)’s GDPR (General Data Protection Regulation) is one of the most demanding and comprehensive privacy regulations ever enacted. It was passed in 2016, and went into effect in the whole EU and EEA (European Economic Area) on May 25, 2018. After Brexit, the UK decided to keep the GDPR in its local law, known as the UK GDPR, which follows largely the same principles but with some differences on UK-specific matters. This leads to two versions of the GDPR: the EU GDPR and the UK GDPR. In this paper, we will use the loose term “the GDPR” to refer to both versions and consider the EU/EEA/UK as the region the GDPR is effective. The GDPR harmonizes data privacy laws across the EU/EEA/UK and is widely regarded as a benchmark for data protection legislation around the world. The GDPR defines a number of principles and requirements for data controllers and data processors to consider in order to be legally compliant. For example, under the GDPR’s transparency principle, data subjects have the right to be informed about the collection and processing of their personal data. A common approach to meet this requirement is to provide data subjects with a privacy policy document, which inform them about all important information they have a right to know according to the GDPR. The GDPR requires each country to have a national authority to take care of the enforcement of the GDPR, and such bodies often release guidelines to data controllers and data processors on how to be GDPR compliant. For instance, the Information Commissioner’s Office (ICO) [4], the UK’s national data protection authority in charge of the enforcement of the GDPR¹¹1The EU GDPR in the UK before Brexit and the UK GDPR afterwards., has provided a general guide on the GDPR [5] and also a template for data controllers to use as a reference of GDPR-compliance privacy policies [6]. In order to meet their legal obligation in terms of the GDPR, many organizations updated their privacy policy to be GDPR compliant, e.g., the New York Times updated its privacy policy on May 24, 2018, the day before the GDPR went into effect, to include provisions on international data transfers.

The importance of privacy policies for both data subjects and data controllers means that GDPR-oriented analyses of such legal documents can be very useful. It can help enhance data subjects’ awareness on important data protection issues, and also help data controllers refine their privacy policy to be more legally compliant to the GDPR. Such an analysis can obviously be done qualitatively by experienced GDPR experts, but more automated analysis is preferred in many applications to help reduce the time and costs spent. As reviewed in Section II of this paper, many researchers have investigated the automation of analysis of privacy policies, but they all evaluated their concept classifiers by splitting the training and testing sets at the segment level without considering which segment belongs to which privacy policy. Such an approach means that segments used in both the training and testing sets can cover all privacy policies, while in real-world applications the classifiers should be tested on segments from unseen privacy policies. Therefore, a classifier tested using the segment-level treatment could perform worse in real-world settings since its training and testing sets may have not covered all segments of an unseen privacy policy. In addition, we also observed other research gaps in the literature, including the lack of a more complete GDPR taxonomy and the less consideration of hierarchical information in privacy policies for developing GDPR-oriented concept classifiers.

In this paper, we fill all the above research gaps by presenting the most comprehensive comparative study on the performances of GDPR concept classifiers. Our study involved the development of a more complete GDPR taxonomy, the first privacy policy corpus covering hierarchical information, and a comprehensive performance analysis of many different GDPR concept classifiers based on a new document-level training-testing set splitting method, two architectures of hierarchical classifiers, and many different sets of features including some covering contextual information. More details about our contributions are summarized as follows (more significant contributions appearing first).

1) We propose a document-level performance evaluation framework for GDPR concept classifiers, and obtained the actual performance of the classifier in real-world scenarios. Different from previous segment-level evaluation, we divide the training and testing sets by documents so that no segments in the testing set are included in the training process. By comparing the performance results at the document and segment levels, we showed that the performances of GDPR concept classifiers reported in past studies are indeed significantly overestimated, so cannot reflect their real-world performances. This indicates the need for all future studies to follow the document-level performance framework.

2) We conduct a comparative study on GDPR concept classifiers (the most comprehensive one to the best of our knowledge). Our study goes beyond past research in multiple ways: i) considering contextual features based on the hierarchical nature of a privacy policy (derived from the parent node and the immediate sibling nodes); ii) comparing two architectures of hierarchical classifiers – local classifier per node (LCN) and local classifier per parent node (LCPN), not just LCPN in past work; iii) obtaining a range of new insights based on extensive experiments, such as evidence against the “one type of classifier fits all” idea followed in past work and the error propagation issue of LCPN classifiers.

3) We constructed a new and the first fully hierarchically encoded GDPR-oriented privacy policy corpus GoPPC-150 and the new framework for extending this new corpus and constructing other new corpora. GoPPC-150 includes 150 privacy policies collected from Alexa.com top websites. It includes expert-annotated GDPR concept labels in a hierarchical manner following a newly extended GDPR privacy policy taxonomy (see below), and is structured at the document-level. The inclusion of explicitly encoded hierarchical information in GoPPC-150 allows more context-aware development of GDPR concept classifiers and other relevant tools. The framework has a high level of automation to minimize human intervention.

4) We propose an extended taxonomy for facilitating GDPR-oriented analysis of privacy policies. This taxonomy is based on a smaller GDPR taxonomy proposed by Torre et al. [7], our own expertise, two important documents from the ICO and the IAPP (International Association of Privacy Professionals), and the work of the W3C Data Privacy Vocabularies and Controls CG (DPVCG). The end result is the most comprehensive GDPR privacy policy taxonomy reported in the literature so far.

To help other researchers reproduce our work reported in this paper, all our source code, data used and newly produced, and all results of our experiments have been uploaded to a GitHub repository available at https://github.com/tp-sh/GDPR_privacy_policies. We will refer to this GitHub repository frequently in the rest of the paper, with a direct URL pointing to the corresponding part of the repository when necessary.

The rest of this paper is structured as follows. Section II discusses related work. Section III shows the initial experimental evidence regarding the significantly reduced performances of selected concept classifiers previously trained at the segment level, when they are applied to unseen privacy policies. Section IV explains the more complete GDPR taxonomy and the new hierarchical corpus we developed, including the methods we used for their development. Section V introduces the comprehensive study on the GDPR concept classifiers and the evaluation results using the document-level evaluation framework. Section VI discusses the further research directions of our work and the field, and the last section concludes the paper.

II Related Work

In this section, we introduce related work in five closely related areas: (1) hierarchical multi-label text classification, (2) privacy policy corpora, (3) privacy policy concept classifiers, and (4) analyses of privacy policies.

Hierarchical multi-label text classification (HMTC). Privacy policy is a hierarchical text with multi-label, so the compliance checking of privacy policy can be regarded as an HMTC task. There are many studies on HMTC, in 2011, Silla et al. [8] summarized the three methods of HMTC, which are flat, local and global approaches. To achieve better results, the latter two are more concerned. Koller et al. [9] introduced the first type of local classifier which is proposed to explore the hierarchy by using local information and build multiple local classifiers around it. Following which, a series of local approaches including LCN (a local classifier per node), LCPN (a local classifier per parent node) as well as LCL (a local classifier per level) based methods are proposed [10, 11, 12]. However, local approaches on their own possess high risk of error-propagation. On the other hand, global approaches are designed to build a single classifier to explore hierarchical information globally, thus reducing the overall model size. Most global approaches are generally modified from the flat classification algorithms such as hAnt-Miner [13], Vens et al. [14]) (decision tree based) and GMNBwU [15](Naive Bayes classifier based). Recently, more neural network based HMTC algorithms which combine both local and global approaches are proposed [16, 17]. Huang [18] proposed HARNN, extracting hierarchical information at different label levels and generating document embeddings for local and global classifications. Zhang [19] introduced LA-HCN which employs label-based attention based on the individual label, and the respective embeddings are obtained for global and local classification for better error propagation control. However, when it comes to the compliance checking of privacy policy, few researches have applied global approaches for classification. Most of the classifiers are local [20, 21, 7, 22, 23]. Their specific approaches are shown in following Table I. The tuples in column Taxonomy explain the numbers of levels and nodes of the taxonomies as (3, 96) represents there are 96 nodes of 3-level in the taxonomy. The tuples in column Corpus/Corpora explain the number of samples, annotation levels and be in hierarchical structure or not. For example, (150, 3, Yes) represents that there are 150 privacy policies with annotations of 3-level in the corpus GoPPC-150 and has a hierarchical structure.

TABLE I: Our work compared with related work (* = new public corpus; ^† = GDPR-specific)

Related Work	Taxonomy	Features	Corpus/Corpora	Classifier(s)	Analysis
OPP-115 [24]	(2, 46)	Paragraph2Vec	OPP-115* (115, 2, No)	Multiple; multi-class; LCPN	OPP-115
Polisis [20]	No	fastText	OPP-115	CNN; multi-class; LCPN	No
Mustapha et al.’s [21]	No	XLNet	OPP-115	XLNet; multi-class; LCPN	No
PrivBERT [23]	No	RoBERTa	OPP-115	CNN; multi-class; LCPN	No
Torre et al.’s [25]	(3, 69)^†	GloVe	Private^†	SVM; binary^†; LCPN	Private corpus
Rahat et al.’s [22]	(1, 18)^†	fastText	Private	CNN; multi-class^†; LCPN	Private corpus
Our Work	(3, 96)^†	Multiple Context	GoPPC-150*^† (150, 3, Yes) OPP-115	Multiple; binary + multi-class; LCPN + LCN^†	GoPPC-150 Template-based

Privacy policy corpora. In 2016, Wilson et al. [24] created a public privacy policy corpus called “Online Privacy Policy” (OPP-115) by hiring three legal experts as annotators and it has been used by many other researchers. However, Sarne et al. [26] used unsupervised ML to model topics in privacy policies and found a mismatch between the topics in privacy policies they analyzed and topics covered in OPP-115. Sathyendra et al. [27] constructed a more fine-grained corpus based on OPP-115 with semi-automated annotation, focusing on “opt-out” in privacy policies. Leblanc and Liu [28] built a corpus of fuzzy words and sentences in privacy policies. Zimmeck et al. [29] created the App-350 Privacy Policy Corpus with the goal of checking compliance of behaviors and privacy policies of mobile apps. They selected 350 privacy policies of the most popular apps on Google Play and hired legal experts as annotators. Robaldo et al. [30] used I/O logic formula to code and model various legal documents, expressed legal statements in a logical language, explained various clauses, and demonstrated their work with the GDPR. Muller et al. [31] introduced a privacy policy dataset containing over 18,300 sentences, tagged according to five core privacy policy requirements of the GDPR. Srinath et al. [23] created PrivaSeer, an automatically constructed corpus containing privacy policies of over 1 million English websites, and studied the composition of the corpus, showing readability tests, document similarity, keyword extraction results, and exploring the corpus through topic modeling. Kuznetsov et al. [32] used a technique for identifying URLs of privacy policies of IoT devices to construct a new corpus with 592 privacy policies in 2022.

Privacy policy concept classifiers. Harkous et al. [20] proposed Polisis, a comprehensive framework for enabling multi-dimensional privacy policy analysis. Polisis provides an online service for privacy policies analysis, using a combination of NLP and deep learning (DL) techniques to extract fragments from privacy policies, each containing a set of labels describing data processing behaviors. Tesfay et al. [33] proposed an ML-based approach to classify multiple categories of privacy policy content using pre-defined keywords. Lippi et al. [34] provided 33 metadata types for privacy policies in terms of the GDPR compliance, and provided automatic support for ambiguity detection of privacy policies based on manual rules and machine learning (ML) based methods. A more comprehensive piece of work was done by Torre et al. [7], who proposed 55 metadata types, covering all types identified by Lippi et al. [34], and studied automated detection of privacy policy integrity using an advanced combination of natural language processing (NLP) and ML based on the 55 metadata types. Their more recent work [25] further improved their privacy policy completeness checking framework with the same GDPR taxonomy and classifiers. Srinath et al. [23] constructed a new privacy policy corpus called PrivSeer, built a pre-trained language model called PrivBERT based on PrivSeer, and then developed a number of concept classifiers for privacy policy analysis based on PriVERT. As a whole, the best-performing classifiers are PrivBERT-based ones and those reported in [25], achieving an F1 score above 0.8 for most concept classifiers, and even above 0.9 for some.

Analyses of privacy policies. There are many past studies focusing on the analyses of privacy policies, some of which are GDPR-oriented. Many of such studies are based on machine learning based concept classifiers, as part of a larger automated system. When it comes to the GDPR-oriented analyses, Tom et al. [35] presented a preliminary GDPR model that aims to provide a simple and visual overview for human operators to achieve a better understanding of the relationships between different concepts in the GDPR. It also describes a method of using their proposed model as a tool to develop privacy policies and illustrates how to extract compliance rules. Palmirani and Governatori [36] proposed a proof of concept applicable to the GDPR domain with the aim of detecting or preventing violations of privacy enforcement norms. Bhatia et al. [37] identified incompleteness of privacy policies by representing data practice descriptions as semantic frames. The approach was a grounded analysis to discover which semantic roles corresponding to a data action are needed to construct complete data practice descriptions. Mousavi et al. [38] proposed a tool called KnIGHT, whose innovation lies in the use of semantic similarity between words to associate sentences in a privacy policy with relevant paragraphs in the legal text of the GDPR. Torre et al. [39] proposed a model-based solution for GDPR-compliance analysis, using Unified Modeling Language (UML) and Object Constraint Language (OCL) to build a UML representation of the GDPR. Hamdani et al. [40] conceptualized a framework to implement document-central compliance checking methods in the data supply chain and developed concrete methods to automatically check GDPR-compliance of privacy policies. The other work focus on the commonalities of privacy policies. In order to increase transparency of privacy policies, Zimmeck et al. [41] proposed Privee, a software architecture for analyzing essential policy terms based on crowd sourcing and automatic classification techniques. They implemented Privee as a proof-of-concept web browser extension that retrieves policy analysis results from an online privacy policy repository or, if no such results are available, performs automatic classifications. Targeting privacy policies of websites and mobile apps, Caramujo et al. [42] proposed a domain-specific language and a model transformation approach for specifying privacy policy models. Pullonen et al. [43] proposed a multi-level model as an extension of the business process model and representation to support visualization, analysis, and communication of the privacy policy characteristics of business processes. Ayala-Rivera and Pasquale [44] proposed a model-based approach to help online services understand their data protection obligations under the GDPR. Zimmeck et al. [45] designed and implemented PrivacyFlash Pro, an automated privacy policy generator for iOS apps that leverages static analysis, which identifies code signatures composed of Plist permission strings, framework imports, class instantiations, authorization methods and other evidence that are mapped to privacy practices expressed in privacy policies. Wang et al. [46] proposed PRIVGUARD, a novel system design that reduces human participation required and improves the productivity of the compliance process, which is mainly comprised of two components: (1) PRIVANALYZER, a static analyzer based on abstract interpretation for partly enforcing privacy regulations, and (2) a set of components providing strong security protection on the data throughout its life cycle. Cui et al. [47] proposed POLIGRAPH, automated analysis of the information disclosed in a policy into a knowledge graph based on NLP technology. Although they also combine the context of the current text for classification, they only focus on the data collection part of the privacy policy, and do not consider the structural information of the privacy policy.

III Potential Performance Overestimation of Existing Concept Classifiers

As explained in the Introduction section, all past studies on concept classifiers of privacy policies all followed the segment-level performance evaluation approach. In order to show that such an approach could lead to an overestimation of the performance, we conducted a small experiment to test the performances of some PrivBERT-based concept classifiers with a good performance as reported in [23]²²2For this experiment, we did not select Torre et al.’s classifiers [25] because of the following reasons: 1) their classifiers are not open sourced; 2) they used various artificial rules so it is harder to generalize their classifiers; and 3) we reproduced their classifiers based on their descriptions in [25] and tested their classifiers’ performances against the new corpus we developed, but all the classifiers performed turned out to be very poor (see the results at https://github.com/tp-sh/GDPR_privacy_policies/blob/main/data/Torre_classifiers_results.csv), indicating that their classifiers may have overfitting problems., by applying them to three privacy policies the first author of the paper manually labeled. These classifiers, which were previously trained and tested at the segment level using the OPP-115 corpus, now needed to make predictions on unseen privacy policies, which are not included in the OPP-115 corpus. The results showed that all classifiers performed much worse than what was reported in [23]. Table II shows the performance comparison of 4 concept classifiers with more than 10 positive samples, and the full results of all the 10 concept classifiers can be found at https://github.com/tp-sh/GDPR_privacy_policies/blob/main/data/opp_results.csv.

TABLE II: Performance of classifier under different evaluation method

OPP-115 Concept	Old Evaluation (segment-level)			Our Evaluation (document-level)
	P	R	F1	P	R	F1
FIRST PARTY COLLECTION AND USE	0.913	0.945	0.929	51/70 (-18.4%)	51/58 (-6.5%)	0.797 (-13.2%)
THIRD PARTY SHARING AND COLLECTION	0.915	0.945	0.930	24/58 (-49.7%)	24/25 (+1.5%)	0.578 (-35.2%)
USER CHOICE/CONTROL	0.906	0.828	0.865	11/21 (-38.2%)	11/27 (-42.1%)	0.458 (-40.7%)
INTERNATIONAL AND SPECIFIC AUDIENCES	0.960	0.947	0.954	25/28 (-6.7%)	25/35 (-23.3%)	0.794 (-16.0%)

The results indicate that the PrivBERT classifiers trained and tested at the segment level could not perform as well as shown by the performance metrics reported in [23]. Considering that the way how privacy policies are written has not substantially changed over the years, we believe the significantly reduced performances are caused by the inappropriate way of splitting the training and test sets at the segment level. This calls the need to consider the document-level splitting approach for evaluating performances of such classifiers. This is one of the main foci of our work reported later.

IV Extended GDPR Taxonomy and New Corpus

In this section, we give details of our work on the extended GDPR taxonomy, the framework GoHPPC and the new corpus GoPPC-150.

IV-A Extended GDPR Taxonomy

In 2020, Torre et al. [7] proposed a GDPR-oriented, three-level conceptual model of privacy policy metadata by cooperating with legal experts. Although being professionally constructed, we noticed that Torre et al.’s model still has a number of issues, including some missed concepts and structural issues of some nodes. So we decided to refine it further by making a range of changes, leading to an extended taxonomy covering a more comprehensive set of GDPR-related concepts relevant for privacy policies. The extension was based on the GDPR-oriented privacy policy template provided by the ICO, the privacy policy mapping chart provided by the IAPP [48], the work of W3C Data Privacy Vocabularies and Controls CG (DPVCG) [49], the annotation scheme of the OPP-115 corpus [24], and our own expertise³³3Two co-authors of the paper have been actively teaching and researching the GDPR. One of them is a data protection and privacy law expert, and the other is a computer scientist with substantial research experience on interdisciplinary topics including GDPR-related matters.. Our extended taxonomy includes 96 nodes, 39.1% more than Torre et al.’s (69 nodes), including some important nodes missing from the latter, e.g., ‘DATA SHARING’ and the sub-nodes of ‘PD STORAGE DETAILS’.

The main changes we made to Torre et al.’s original model and the reasons of such changes are summarized below.

Change 1: Swapped ‘AUTO DECISION MAKING’ and ‘DATA SUBJECT RIGHT.COMPLAINT’.

Reason: The ICO guide has “Rights related to auto decision making including profiling” under the “Individual rights” category, and “complaints” is covered in a dedicated section in the ICO’s GDPR-oriented privacy policy template.

Change 2: A new ‘CONDITION’ node was added, which was combined with the original ‘RECIPIENTS’ under the new first-level node ‘DATA SHARING’.

Reason: When describing data sharing, privacy policies often include conditions for sharing.

Change 3: Changed ‘PD TIME STORED’ node to ‘TIME’, and merged the newly added ‘LOCATION’ node and ‘DISPOSAL METHOD’ node into the new first-level node ‘PD STORAGE DETAILS’.

Reason: These changes reflect the recommended content under “data storage” in the ICO template better.

Change 4: Split the ‘DIRECT’ sub-node of ‘PD ORIGIN’ into ‘DIRECT ACTIVE’ and ‘DIRECT PASSIVE’, and added third-level node ‘COOKIE’ under ‘INDIRECT’ to the scope of ‘DIRECT PASSIVE’.

Reason: When describing data collection, privacy policies often mention data provided by users (‘DIRECTIVE ACTIVE’), data automatically collected by service providers (‘DIRECTIVE PASSIVE’), and data from third-party sources (‘INDIRECT’). Torre et al.’s [7] taxonomy included just ‘DIRECT’ and ‘INDIRECT’ nodes, which cannot accurately cover data automatically collected by service providers (‘DIRECTIVE PASSIVE’). Cookies as a type of data automatically collected would now better be put under ‘DIRECTIVE PASSIVE’.

Change 5: Added the ‘INFORMATION’ node to the ‘DATA SUBJECT RIGHT’.

Reason: Articles 13 and 14 of the GDPR declare that a range of specific information “shall” be provided by data controllers.

Change 6: Some changes were made after comparing our taxonomy with the IAPP [48]. For example, processing of personal data by online services must satisfy some principles. Some of them are internal and unnecessary to be declared to users, but the others are external and need to be declared in privacy policies. We added a new first-level node called ‘DP PRINCIPLE’, containing ‘PURPOSE LIMITATION’ and ‘DATA MINIMIZATION’.

Change 7: For concepts that go beyond the GDPR, we added two new first-level nodes to the taxonomy: ‘NON-GDPR’ and ‘OTHERS’. ‘NON-GDPR’ refers to concepts related to data protection laws in other countries or regions, such as the California Consumer Privacy Act (CCPA) in the US [50], Brazil’s General Personal Data Protection Law (Lei Geral de Proteção de Dados Pessoais, LGPD) [51], and the Personal Information Protection Law of China [52]. ‘OTHERS’ covers all situations that cannot be labeled.

In addition to the above changes, we also compared our taxonomy with the work of W3C Data Privacy Vocabularies and Controls CG (DPVCG) [49], and confirmed that all key concepts in the latter are covered in our extended taxonomy.

Refer to caption — Figure 1: The proposed GDPR-oriented privacy policy taxonomy (red boxes = newly added concepts, percentages = coverage rates of different GDPR concept in the our new GoPPC-150 corpus).

According to the changes described above, we obtained a more comprehensive GDPR taxonomy, which lays the foundation of other work reported in this paper. As shown in Figure 1, our GDPR taxonomy is a three-level tree covering more important core GDPR concepts relevant for privacy policies. In order to enhance the readability of the visual presentation, different levels are colored differently and nodes we added or modified are highlighted inside red boxes. And the GDPR concepts with underlines are those covered in the ICO template. The percentage under the nodes are the coverage rates of different GDPR concept in the our new GoPPC-150 corpus. The results are as follows.

Out of the 19 first-level concepts, ‘PD ORIGIN’ and ‘DATA SUBJECT RIGHT’ are covered by all privacy policies, followed by ‘PD CATEGORY’ and ‘PROCESSING PURPOSES’ (all but one), indicating that these four aspects of the GDPR have been taken seriously by most websites covered in our corpus. On the other hand, many other first-level GDPR concepts have a much lower coverage rate, e.g., ‘PD PROVISION OBLIGED’ covered by only six (4.0%) privacy policies, despite that this concept is required to appear in privacy policies according to Article 13.2(e) of the GDPR (where the mandatory wording “shall” is used).

For the second-level concepts, ‘DATA SUBJECT RIGHT.INFORMATION’, ‘PD ORIGIN.DIRECT PASSIVE’ and ‘PD ORIGIN.DIRECT ACTIVE’ have the highest coverage, mentioned in 146 (97.9%), 143 (95.9%) and 142 (95.3%) privacy policies, respectively. On the low coverage end, ‘PD CATEGORY.SPECIAL’, ‘COMPLAINT.SA’ and ‘PD PRINCIPLE.DATA MINIMIZATION’ have the lowest coverage rate of 2.0%, covered in only 3 privacy policies.

Among the third-level concepts, only ‘DATA SUBJECT RIGHT.INFORMATION.POLICY CHANGE’ has a relatively high coverage, mentioned in 119 (79.8%) privacy policies, other third-level concepts are rarely mentioned. A possible explanation is that online services may consider such low-level details in privacy policies unnecessary.

IV-B The New Corpus GoPPC-150

Privacy policies of different websites and online services can vary significantly in many ways, e.g., in terms of their length, complexity, presentation format, and legal compliance requirements. Generally speaking, privacy policies of large companies are written and maintained by a dedicated legal team, which are likely to be more comprehensive and professionally written. For the new corpus we constructed, our aim is to cover typical privacy policies used by large companies, so we decided to use top sites returned by Alexa.com, a widely used website ranking online service for research purposes⁴⁴4This service was discontinued by in May 2022., for collecting privacy policy samples. The full list of the 150 websites is provided in the supplementary data file available at https://github.com/tp-sh/GDPR_privacy_policies/blob/main/data/Web_list.csv. Different from former corpora, our new corpus maintains the hierarchical structure of original privacy policies with annotations of three-level, and this will bring context features to the classifiers of compliance checking. The corpus contains manual annotations of 8K fine-grained paragraphs and 6k titles on 150 privacy policies in all.

We developed the framework GoHPPC (as shown in Figure 2) for constructing our corpus and facilitating its further extension in the future. The purpose of the GoHPPC is to support more automated identification of privacy policy web pages from a pre-defined list of URLs, extraction of the privacy policy’s relevant content, and converting the HTML elements in the privacy policy web page into a hierarchical structure following a well-defined XML schema we call PP-XML (privacy policy XML). The GoHPPC helps reduce human efforts greatly, and can be adapted to enhance automation of privacy policy analysis.

IV-B1 Identifying Privacy Policy Web Page

We observed that some websites change the content or even URL of their privacy policy according to the country or region a visitor comes from. In order to ensure that the collected privacy policies for our corpus are more relevant for the GDPR, we used a proxy server called Amazon EC2 located in London to simulate visits of a user from the EU/EEA/UK to the selected candidate websites. We developed a semi-automated process for identifying and downloading privacy policies from a list of pre-defined websites. The tool was implemented based on the web browser automation engine Selenium [53] and it works following the steps described below.

Step 1: The tool visits each website and searches (case-insensitively) for a <a> element whose content includes one of the following pre-defined keywords representing a privacy policy: ‘privacy policy’, ‘privacy notice’, and ‘privacy terms’. If only one link is found, the tool clicks the link to visit the privacy policy web page and goes to Step 4; otherwise, it goes to Step 2.

Step 2: The tool tries to identify the privacy policy web page via the user registration page, which normally includes a link to the privacy policy. To this end, it searches for the following keywords case-insensitively: ‘create account’, ‘register’, ‘sign up’, ‘sign-up’. If any of the keywords is found, it clicks the link and goes to the user registration page. Then, the tool applies Step 1 to the current page to find and visit the privacy policy page, after which it goes to Step 4. If no privacy policy link is found, the tool goes to Step 3.

Step 3: The tool seeks human intervention to identify the link of the privacy policy.

Step 4: The tool saves the identified privacy policy page as a local HTML file with all elements included, by calling a third-party web browser extension SingleFile [54].

IV-B2 Pre-Processing of Privacy Policy Web Page

After identifying a privacy policy web page, we need to pre-process the web page to remove three types of DOM elements: 1) multimedia-related elements that are not useful for analyzing textual content of the privacy policy contained in the web page, e.g., <picture>, <img>, <video>, and <audio>; 2) elements that embed an non-textual object or an external web page, e.g., <applet>, <embed>, <object>, and <iframe>; 3) elements whose are not semantically related to or can appear as part of the main body of a web page, e.g., <footer> and <nav>. Note that some of such elements are still useful for understanding the design and layout of a privacy policy web page, which is out of the scope of this study and will be our future work. The full list of HTML elements removed is as follows:

•

Type 1: Elements that are not useful for analyzing textual content of the privacy policy contained in the web page. These include <img>, <picture>, <video>, <audio>, <canvas>, <map>, <area> <figure>, <figcaption>, <source>, <track> and <svg> elements.
•

Type 2: Elements that embed an non-textual object or an external web page. These include <applet>, <embed>, <object>, <param>, <script>, <noscript> and <iframe> elements.
•

Type 3: Elements whose are not semantically related to or can appear as part of the main body of a web page. These include <footer>, <nav>, <form> and all input control elements.

IV-B3 Extracting Privacy Policy Content

After pre-processing a privacy policy web page, our next step is to extract the relevant content of the privacy policy. Here, the term “relevant conent” refers to one or more HTML elements that contain the actual content of a privacy policy. Inspecting all the 150 privacy policy web pages we collected, we observed that the relevant content is always under a single HTML element with multiple child elements each corresponding to a different part of the privacy policy (e.g., a <div> element including $n$ child <div> elements). In this way, we need to identify the single HTML element that contains the content of the privacy policy, which we call the PP element.

We observed that the immediate child elements under the PP element are more similar to each other as a whole in terms of the text length within each element, compared with non-PP elements at the same level. Here, the term “text length” is defined as the number of characters for human readers (not the characters in the HTML code), reflecting the amount of text a human reader is expected to read. Based on the above observations, we developed an algorithm shown in Algorithm 1. It measures the likelihood of a candidate element being the PP element using a simple metric we call children similarity score (CSS) – the standard deviation of all its children’s text lengths. Since the CSS is a relative concept, Algorithm 1 compares the CSS value of the current candidate element with the average CSS value of all non-PP elements observed so far, and if the ratio is below a threshold $r_{h}$ , the candidate element is considered the PP element.

Algorithm 1 Given an HTML DOM node (normally <body>), return an HTML element that is more likely the lowest node containing the privacy policy’s relevant content fully

1:function Extraction(node)

\mathbb{R}\leftarrow\varnothing

r\leftarrow\infty

4: PP_node

\leftarrow

node

5: while True do

6: if PP_node does not have any child node then

7: return PP_node

8: end if

s\leftarrow

the CSS value of PP_node

10: if

\#(\mathbb{R})>0

then

11:

r\leftarrow s/\text{average}(\mathbb{R})

12: end if

13: if

r<r_{h}

then

14: return PP_node

15: else

16:

\mathbb{R}\leftarrow\mathbb{R}\bigcup\{s\}

17: PP_node

\leftarrow

the child node in

\mathbb{C}

with the longest human-readable text

18: end if

19: end while

20:end function

We carried out experiments in 150 samples collected in Section IV-B1 to test the performance of Algorithm 1. In order to determine the optimal value of the parameter $r_{h}$ , we used 50 randomly selected samples as the training set to find its range that allowed us to achieve a zero identification error for the 50 samples. The range obtained is [0.5,0.6], and we assigned $r_{h}$ to be the midpoint 0.55. Applying Algorithm 1 with $r_{h}=0.55$ to the other 100 remaining policies as the testing set and achieved a perfect accuracy of 100%.

Despite the high accuracy of Algorithm 1, it may still fail when processing an unseen privacy policy (e.g., a privacy policy web page does not have a single HTML element that contains the relevant content), so results of this step should be checked and fixed manually when necessary.

IV-B4 Constructing Hierarchical Structure

After extracting the relevant content of a privacy policy web page, we convert the HTML-based DOM tree of the page content into a new hierarchical structure representing the semantic content of the privacy policy, which is stored as an XML file following a specific XML schema we call PP-XML. The whole privacy policy content has a single root node <policy>, which includes a number of semantic segments of the privacy policy, each encoded as a <segment> element. The <segment> elements can be nested to allow a hierarchical structure for privacy policy’s content. For each <segment> element, there is always one <title> element, which represents the semantic heading of the corresponding segment’s content. Each <segment> element should have one or more <paragraph> elements, representing the content of the corresponding segment. In addition to <segment> elements, the <policy> element can also include a number of standalone <paragraph> elements, e.g., for a number of leading paragraphs before the first <segment> element. A <paragraph> element can also include one or more <list> elements as its children, and a <list> element contains at least one <item> element as its children. An <item> element may include a <list> element, allowing nested lists in a <paragraph> element.

To convert HTML elements in the privacy policy to the new hierarchical structure based on PP-XML, we followed a five-stepped process described below. Note that there has been some related work on automatic segmentation of HTML documents, e.g., the ASDUS proposed in [55]. However, most such work focused on identifying top-level titles only, which is insufficient for our work.

Step 1: Pre-processing. Although many HTML elements have been removed as part of the first pre-processing step described in Section IV-B2, some ad hoc inline elements, such as those appearing in the middle of a <p> element, play no roles in PP-XML. We remove such inline elements. At the end, we put all remaining elements into a <policy> element.

Step 2: Initial conversion of some HTML elements to PP-XML elements. As shown in Table III, there are a precise mapping from some HTML elements to <list> and <item> elements in PP-XML, so such HTML elements can be directly converted.

TABLE III: Different types of PP-XML elements and HTML elements that may be converted to each PP-XML element type

PP-XML element	Description	Possible HTML elements
`<policy>`	The privacy policy as a whole (the root element)	NA (determined based on the method introduced in Section IV-B3)
`<segment>`	A semantic segment of the privacy policy (child of the `<policy>` element or another `<segment>` element)	NA (derived from `<title>` elements, see Step 4 discussed in Section IV-B4)
`<title>`	The title of a semantic segment (child of a `<segment>` element)	`<h1>`, `<h2>`, `<h3>`, `<h4>`, `<h5>`, `<h6>`, `<p>`, `<div>`, and highlighted inline elements
`<paragraph>`	A semantic paragraph in the privacy policy, normally but not always in a semantic segment (child of a `<segment>` element or the `<policy>` element)	Block-level elements such as `<p>` and `<div>`
`<list>`	A semantic list in the privacy policy (child of a `<paragraph>` element or an `<item>` element)	`<ul>`, `<ol>`, `<dd>`
`<item>`	An item in a semantic list (child of `<list>` element)	`<li>`, `<dt>`

Step 3: Classifying the remaining HTML elements into title elements at different levels and paragraph elements in PP-XML. In privacy policies, titles tend to have different attributes from paragraphs, e.g., titles are often in bold face and have a shorter text length. Based on manual inspection of title and paragraph elements of some randomly selected privacy policies, we identified the following attributes that may be useful for differentiating title elements from paragraph elements: i) the text length, ii) the font size, iii) the font weight, iv) if the text is italic, v) if the text is underlined, vi) the node depth in the HTML DOM tree, vii) the HTML tag, and viii) leading ordinal labels (LOLs) for titles (e.g., ‘1’, ‘2.3’, ‘a.’, and ‘ii)’) (see Table IV for more examples on LOLs). As part of the input features of the title and paragraph classifier, a 12-D descriptor is used to describe the leading ordinal label of a candidate title or paragraph element. The 12-D vector is composed of four 3-D sub-vectors, each representing a sub-label of the leading ordinal label at one of the four possible title levels. Each sub-vector follows the format (sub-label’s format, sub-label’s value, sub-separator’s format). The sub-label’s format and the sub-separator’s format are determined according to Table IV. The sub-label’s value is the 1-based natural ordinal value of the label, e.g., the number itself for Arabic numbers, 1 for ‘a’, ‘A’ and ‘i’. For instance, the sub-label ‘3.’ will be mapped to (1, 3, 1), ‘b)’ to (3, 2, 3), and a full label ‘3.a.i’ will be mapped to the 12-D descriptor [1 3 1 2 1 1 4 1 0 0 0 0].

In total, for each candidate title/paragraph element, we have these input features and an ML-based classifier can be constructed to predict the candidate element into one of the following five classes: title at Level $i$ ( $i=1,2,3,4$ ), and paragraphs. We tested several mainstream ML models that do not require a large training set for the multi-class classification task, including random forest (RF) [56], SVM (with linear kernel and RBF kernel) [57], Extra Trees (ET) [58], and XGBoost [59]. We used all the 150 privacy policies for training and testing purposes. We used 5-fold cross validation and 20% as the testing set. Table V shows the results of all classifiers, indicating that the ET classifier achieved the best performance, with the F1-score reaching 0.882. As can be seen, the performance is generally good enough to support the semi-automated corpus construction process.

TABLE IV: Mappings between sub-label’s format and sub-separator’s format and values in the leading ordinal label descriptor

Descriptor Value	Label’s Format	Separator’s Format
0	None	None
1	Arabic Number	Full Stop
2	Lowercase Letter	Colon
3	Uppercase Letter	Parenthesis
4	Roman Number	Others
5	Others	Others

TABLE V: Results of title and paragraph classifiers

Classifier	RF	SVM (Linear)	SVM (RBF)	XGBoost	ET
Precision	0.874	0.879	0.803	0.711	0.885
Recall	0.875	0.873	0.822	0.778	0.879
F1-score	0.875	0.876	0.812	0.742	0.882

Step 4: Constructing segment elements based on title elements and their levels. After the <title> elements and their levels are identified, we can use them to construct properly nested <segment> elements reflecting the hierarchical structure of the privacy policy. For each <title> element, we create a <segment> element to include the <title> element and all non-<title> elements after it until the next <title> element or the end of the PP-XML document. The level of the <segment> element is set to be the same as the <title> element it contains.

Step 5: Manual verification and correction. The last step of the process is to have one or more human experts to manually check the produced PP-XML document and the input HTML document to confirm the automated results and fix any errors.

IV-B5 GDPR Concept Annotation

Using the process described in Section IV-B4, we processed all 150 privacy policies in our corpus to get 150 PP-XML documents. Then, we annotated the 150 PP-XML documents with relevant GDPR concepts based on our extended GDPR taxonomy introduced in Section IV-A. The annotation was done for each title and paragraph element in each PP-XML document, with one or more tags each representing a unique GDPR node in our extended GDPR taxonomy.

For the annotation work, we decided to pay a commercial annotation company to ensure the quality of the work. We followed a multi-stepped quality assurance process. In Step 1, one of the authors of this paper, who has good knowledge of GDPR concepts, first trained four annotators of the company, and then the annotators attempted to annotate 20 PP-XML documents as a pilot. In Step 2, the results of the pilot annotation experiments were checked by the author and feedback was given to the four human annotators so they knew how to refine and align their annotations. In Step 3, the four annotators went ahead to annotate all the remaining PP-XML documents so that each document was annotated independently by two different annotators. In Step 4, a random sample of the annotation results of each annotator was checked by another annotator. Finally, the full annotation results were checked by three authors of this paper, who then discussed the results between them to agree on any different annotations. To quantitatively measure the quality of the four annotators’ work, we used the inter-rater agreement of Cohen’s Kappa ( $\kappa$ ) [60], and observed an average $\kappa$ score of 0.75 across all the annotated documents, indicating that the two independent human annotators had a substantial level of agreement for most documents. For disagreements, we observed that most are about the GDPR concepts ‘PD CATEGORY’, ‘PD ORIGIN’, and ‘PROCESSING PURPOSES’. The reason is that these concepts tend to be declared together in many privacy policies, making it difficult to separate them. At the end of the annotation work, we finally obtained a fully annotated privacy policy corpus GoPPC-150 following PP-XML and our extended GDPR taxonomy.

V GDPR Concept Classifiers and Evaluation Methods

As mentioned in Section II, past studies on automated GDPR compliance of privacy policies focuses mostly on long texts or paragraphs in the privacy policy without considering the hierarchical structure, e.g., no previous work has considered automatic analysis of titles in a privacy policy. Missing such context information in a privacy policy can lead to an incomplete and less accurate analysis. What’s more, previous work tends to view compliance checking task as multi-label classification, using one model classify all concepts of the same level. As a result, the input features of the model are all the same. This seems work when concept types are few. But for the 96 concepts of different levels in our GDPR taxonomy, the same features may well are not suitable. To figure out the effect of different input features and structures of models, we conducted a comprehensive study. As shown in Figure 3, we tried different combinations of structures of classifiers, input features, corpora and evaluation methods, then compared each performance on each concept.

Noticeably, the binary RF model with 100-D TF-IDF vectorization input features is enough for titles’ classification, and the length of titles’ text limit the experiment much. So we only conducted the comprehensive study on paragraphs and applied the simple configuration on title classification task.

V-A Structures of Classifiers

As mentioned in Section II, compliance checking of privacy policy can be regard as an HMTC task and performed in three ways: flat, local and global approaches. Within local approaches [8], there are LCN, LCPN, LCL and so on [10, 11, 12]. Until now, related work has focused on the LCPN approach, using one classifier classifying all concepts of the same level. But we also tried LCN approach as it is flexible to set different configurations for different concepts. When there are many concepts to be classified, LCN classifiers allow us to train the best models for specific concepts. When the samples of specific concepts are not enough, upsampling can be applied without affecting other concepts’ samples in LCN classifiers. So we tried LCPN and LCN approaches to construct the classifiers. All systematic experiments are both performed on two classifier types respectively.

V-B Input Features

When classifying a paragraph or title in a privacy policy, we considered the following three sub-groups of features as the input of the classifier: (1) the vector embeddings of the paragraph or title, (2) the vector embeddings of the already processed parent and sibling node of the paragraph or title, and (3) a binary vector representing if some pre-defined keywords associated with each GDPR concept appears or not in the current paragraph or title. The first-subgroup of features cover all vector embeddings used in past studies, and the third one was proposed by Amaral et al. [25]. The second subgroup is new features proposed by us for this work to cover capture the hierarchical information between the current node and its parent and the contextual information between the current node and its siblings under the same parent.

For the first sub-group of features on vector embeddings, we first tested three widely used traditional vectorization methods with three different dimensionalities (100-D, 200-D and 300-D): TF-IDF [61], GloVe [62], fastText [63]. We also tested combinations of three different pairs of the three traditional vectorization methods, with a 100-D subvector for each method and a 200-D vector for the combination. The experimental results showed that 300-D TF-IDF performed the best for binary classifiers and we implemented the methods as the default vectorization for binary classifiers. We then compared TF-IDF and the more advanced vectorization method based on PrivBERT [23] as two alternative choices for our more comprehensive experiments.

For the second sub-group of features on capturing hierarchical and contextual information, we use the vector embeddings of already processed parent and sibling nodes of the current node in the same privacy policy. Considering parent nodes tend to be titles while sibling nodes tend to be paragraphs when the current node is a paragraph, we considered that 100-D is enough for the former and 300-D is suitable for the latter. The vector embedding methods considered were also TF-IDF and PrivBERT. These features help us capture the hierarchical information between the current node and its parent and sibling nodes. Our experiments prove the features improved the performance of classifiers as shown in Tables VII and VIII.

For the third sub-groups of features about occurrences of pre-defined keywords, we used a 96-D binary vector (the number of all nodes in our GDPR taxonomy), where each feature indicates if at least one pre-defined keywords associated with a specific GDPR concept appears in the candidate element at least once. For example, a sentence like “When Google shares your information” will be labeled as ‘DATA SHARING.CONDITION’ because it contains the keywords ‘share’ and ‘when’. Because Amaral et al. [25] did not release their pre-defined keywords and our GDPR taxonomy includes more concepts, we decided to define our own keywords list for each of the 96 concepts. The full list of such keywords can be found in the supplementary data file available at https://github.com/tp-sh/GDPR_privacy_policies/blob/main/data/keyword_list.csv). Our experimental results in Tables VII and VIII show that these keyword-based features could help improve the performance of LCN classifiers.

For the LCPN classifiers, we set two sub-groups of input features, which are PrivBERT embeddings [23] of current node’s text and PrivBERT embeddings of the parent’s and sibling’s text’s embeddings. The experiments showed the parent’s and sibling’s embeddings had few help in classifications, so we used the PrivBERT embeddings of current node’s text as the input features of the fine-tuned PrivBERT.

V-C Corpus

As the OPP-115 corpus is labeled at the segment-level, it cannot accomplish document-level evaluation. Therefore, we use our corpus GoPPC-150 as the corpus of classifiers, which contains manual annotations of 14K (8k paragraphs and 6k titles) fine-grained data practices on 150 privacy policies. The annotations contain 96 nodes of three levels.

V-D ML Models

For LCN classification, Torre et al. [7] reported that the SVM model achieved the best performance. Therefore, we adopted the SVM model for our testing and conducted comparison experiments using other mainstream ML models, including random forest (RF), XGBoost, ET and neural network (MLP). We used 300-D TF-IDF features of current node as input to test the performance of these models, and the results showed that RF achieved the best performance for the most concept. The full results of all classifiers can be found at https://github.com/tp-sh/GDPR_privacy_policies/blob/main/data/comparison_experiments.csv. Therefore, we took RF classifiers as the representative of LCN classifiers. For LCPN classification, a multi-class classifier is required to classify all the child nodes under the same parent node. In the current literature, most studies [20, 21, 22, 23] use multi-class neural network models as the classifier. Therefore, we opted for multi-class neural network as the representative of LCPN classifiers.

V-E Evaluation Frameworks

Different from evaluating classifier performance at the segment level in the literature, we propose a document-level evaluation framework that is more suitable for real-world scenarios. A detailed comparison of segment-level and document-level evaluation frameworks is shown in Figure 4. Document-level evaluation is achieved by selecting some documents in the corpus to form the training set to train the classifier, and using the remaining documents as the test set. In the testing phase, the trained classifiers are used to make predictions for different concepts for each paragraph and title in each document of the test set, and the performance metrics of each classifier are determined according to the predicted results. The advantage of this evaluation framework is that it is more in line with real-world application scenarios, that is, the trained classifiers are applied to process new (i.e., unseen) privacy policies. For our task, we chose 120 privacy policies in our corpus GoPPC-150 as the training set, and the rest 30 privacy policies as the test set.

By trying different combinations of input features, classifier structures and models, we got 12 types of classifiers (see Table VI for the detailed configurations of the 12 classifiers) and the results of each classifier. We used three common metrics, precision (P), recall (R), and F1-scores, to evaluate the performance of each classifier. The difference of performance between each two types can show the effect of the different configurations. The overall performances of selected classifiers are shown in Table VII and detailed performances of some examples are shown in Table VIII. Due to the space limit, we only show the F1-scores of representative 6 classifiers in the two tables. The full results of all classifiers can be found at https://github.com/tp-sh/GDPR_privacy_policies/blob/main/data/complete_classifier_results.csv.

TABLE VI: The 12 different types of classifiers

Type	Input Features	Machine Learning Model
1	300-D TF-IDF features of the current node	Binary RF
2	300-D TF-IDF features of the current, parent and sibling nodes	Binary RF
3	300-D TF-IDF features of the current node, keyword occurrences	Binary RF
4	300-D TF-IDF features of the current, parent and sibling nodes, keyword occurrences	Binary RF
5	PrivBERT embeddings of the current node	Binary RF
6	PrivBERT embeddings of the current, parent and sibling nodes	Binary RF
7	300-D TF-IDF features of the current node	Multi-class neural network
8	300-D TF-IDF features of the current, parent and sibling nodes	Multi-class neural network
9	300-D TF-IDF features of the current node, keyword occurrences	Multi-class neural network
10	300-D TF-IDF features of the current, parent and sibling nodes, keyword occurrences	Multi-class neural network
11	PrivBERT embeddings of the current node	Multi-class neural network
12	PrivBERT embeddings of the current, parent and sibling nodes	Multi-class neural network

TABLE VII: Overall performance results in both segment-level and document-level evaluation frameworks (F1-scores)

Evaluation

Segment-level

Document-level

Type 1

Type 2

Type 3

Type 4

Type 11

Type 12

Type 1

Type 2

Type 3

Type 4

Type 11

Type 12

Macro averages of level-1 nodes

0.748

0.797

0.753

0.804

0.773

0.805

0.512

0.558

0.509

0.554

0.658

0.669

Macro averages of all nodes

0.702

0.717

0.702

0.716

0.666

0.682

0.453

0.454

0.426

0.448

0.527

0.529

Ratio of maximum F1-score in level-1 nodes

0/14

1/14

0/14

1/14

0/14

1/14

0/14

1/14

4/14

3/14

Ratio of maximum F1-score in all nodes

4/35

3/35

1/35

0/35

1/35

3/35

2/35

1/35

2/35

8/35

5/35

TABLE VIII: Detail performance results (F1-scores). For each cell, the performance figure for the segment-level evaluation is shown first, followed by that of the document-level evaluation in parentheses. For LCPN classifiers (Types 11 and 12) on Level 2, their performances are based on the those of two cascaded classifiers on Levels 1 and 2. In the Name column, DSR represents “DATA SUBJECT RIGHT”, DS represents “DATA SHARING”, LB represents “LAWFUL BASIS”.

Concept		Type 1	Type 2	Type 3	Type 4	Type 11	Type 12
Level	Name	Type 1	Type 2	Type 3	Type 4	Type 11	Type 12
1	CONTROLLER	0.749 (0.519)	0.726 (0.554)	0.788 (0.598)	0.738 (0.455)	0.741 (0.593)	0.735 (0.575)
	DATA SUBJECT RIGHT	0.783 (0.493)	0.833 (0.477)	0.799 (0.350)	0.821 (0.497)	0.816 (0.658)	0.828 (0.694)
	PD ORIGIN	0.775 (0.538)	0.852 (0.655)	0.779 (0.444)	0.845 (0.699)	0.825 (0.668)	0.842 (0.770)
	PD SECURITY	0.752 (0.792)	0.857 (0.698)	0.777 (0.737)	0.844 (0.846)	0.800 (0.767)	0.840 (0.744)
	DATA SHARING	0.764 (0.640)	0.833 (0.697)	0.761 (0.716)	0.837 (0.764)	0.806 (0.757)	0.849 (0.772)
2	DSR.RESTRICTION	0.778 (0.457)	0.694 (0.400)	0.781 (0.359)	0.658 (0.451)	0.733 (0.557)	0.731 (0.556)
	DSR.OBJECT	0.684 (0.440)	0.747 (0.190)	0.675 (0.237)	0.628 (0.172)	0.586 (0.382)	0.512 (0.425)
	LB.LEGITIMATE INTEREST	0.729 (0.419)	0.564 (0.280)	0.667 (0.421)	0.709 (0.235)	0.536 (0.426)	0.540 (0.440)
	LB.CONSENT	0.424 (0.244)	0.313 (0.098)	0.486 (0.195)	0.389 (0.108)	0.333 (0.244)	0.308 (0.280)
	DS.RECIPIENT	0.637 (0.562)	0.713 (0.570)	0.618 (0.537)	0.715 (0.626)	0.639 (0.516)	0.681 (0.554)

Table VII shows that, for some GDPR concepts, non-PrivBERT classifiers could outperform PrivBERT-based ones with the segment-level evaluation method. Our results give a very different picture from what was reported in [23], which showed PrivBERT-based classifiers as the best. Despite differences under the document-level evaluation framework, PrivBERT-based classifiers performed better in comparison with those non-PrivBERT ones, but all classifiers showed significant declines in F1 score, which were not sufficient to support automation analysis of privacy policies. An important insight learned here is that more comprehensive comparative studies across multiple settings like what we did are needed to evaluate performances of GDPR concept classifiers. What’s more, under the document-level evaluation framework, PrivBERT-based classifiers have the best macro average F1-scores for all nodes, it shows the best performance in most nodes. However, the results in the segment-level evaluation case are less stable, as the best performance appears under different types for different concepts. This indicates that the document-level evaluation framework may be able to produce more consistent results for deciding the best feature set.

The detail results in Table VIII show that the effects of different types of features are different for different concept classifiers, for both segment-level and document-level evaluation frameworks. The results were expected by us because unique characteristics of different concepts may be better captured by different combinations of features. For instance, for the contextual features we introduced in this paper, this concept-dependency may be explained as follows: some concepts are ambiguous and contextual features can help distinguish them from others, but for some other concepts contextual features may just add more redundant information and confuse the learning process. As a result, the optimal input features of different GDPR concept classifiers may need to be different. This strategy can be more easily applied to LCN classifiers than to LCPN classifiers since the latter combine multiple concepts. The structure of LCPN classifiers also means inevitable cascade errors. As shown in Table VIII, when it comes to level-2 nodes, the LCPN classifiers’ cascade performances go down due to the limit of classifiers at level 1. What’s more, all concepts at the same level need to be trained in LCPN classifications even if there are not enough data for one or more concepts involved, while for LCN classifiers we can choose not to train a specific classifier if there are not enough training samples.

VI Further discussion

The significant difference between the performance results obtained by these two evaluation methods further demonstrates the value of document-level evaluation methods. The results obtained by the previous segment-level evaluation methods are somewhat inflated, and the excellent performance in the test does not mean that the correct prediction can be made in the real environment. Only by using the document-level evaluation method can we understand the performance of the classifiers in the real environment. However, previous work did not pay attention to this point, which may be because the current open source corpus is also segment-level, such as our document-divided corpus is not common, so relevant work can only use segment-level assessment methods. Therefore, we propose that the future corpus need to be divided into documents, so that the document-level evaluation method can be adopted.

The results of our evaluation method show that the performance of current concept classifiers is overestimated, and in fact, the performance of classifiers can not meet the requirements of large-scale automated detection. This results in an error bar that is too large to be trusted. Therefore, it is not possible to do large-scale automated detection of privacy policies at present, and the focus of current work should also be to improve the performance of classifiers. In order to achieve this goal, the corpus needs to be greatly expanded, which cannot be done by our efforts alone, and requires the cooperation of the entire community. We plan to set up an open website and open source our corpus and code to facilitate such a community-wide effort.

At the same time, we found that a single set of classifiers can not always obtain the best performance in all concepts, and for some concepts, the simple configuration of classifiers can get better performance, which indicates that the previous ”one size fits all” point is not correct, especially for the privacy policy analysis scenario that requires dozens of classifiers. It is obviously far-fetched to always adopt the same configuration. For obtaining the best classifier for each concept, it is necessary to conduct comprehensive experiments and analysis to get the best classifier configuration.

Furthermore, LLM may also represent a potential development direction of this part of work. The large training samples of LLM can mitigate the limitations of corpora and ensure the robustness of the classification.

VII Conclusion

This paper proposes a new evaluation method for concept classifiers which can present the performance of classifiers in real environment. Different from previous method, the new approach divides samples in document-level while training and evaluating. The gap between our evaluation results and previous work shows that the former classifiers are highly overestimated. The real performance of the classifiers are not enough for the large scale automation compliance analysis. We also noticed some other gaps in previous work and improved them. This includes the most complete GDPR-oriented privacy policy taxonomy, a new GDPR-oriented privacy policy corpus GoPPC-150 and an automation-enhancing frameworks for supporting future research on GDPR-oriented analysis of privacy policies. This paper also provides the most comprehensive study on the GDPR-oriented analysis of privacy policies. The study led to a range of new findings and insights, including the usefulness of context-based features for improving performance of machine learning based GDPR concept classifiers, the importance to consider the LCN architecture of hierarchical classifiers and testing cross-corpus generalizability, and the finding that a “one size fits all” idea does not work for GDPR concept classifiers so a more locally optimized approach is needed. Our study also led to a range of new GDPR concept classifiers that can be used by researchers and practitioners. Although our work focused on GDPR-oriented analysis, most results and outcomes can either be easily extended to study other and multiple data protection laws, or the general insights can be re-validated in the wider data protection landscape.

References

[1] A. M. McDonald and L. F. Cranor, “The cost of reading privacy policies,” I/S: A Journal of Law and Policy for the Information Society, vol. 4, pp. 543–568, 2008. [Online]. Available: http://www.aleecia.com/authors-drafts/readingPolicyCost-AV.pdf
[2] H. Li, H. Zhu, S. Du, X. Liang, and X. Shen, “Privacy leakage of location sharing in mobile social networks: Attacks and defense,” IEEE Transactions on Dependable and Secure Computing, vol. 15, no. 4, pp. 646–660, 2016.
[3] Digitalguardian. (2020) Google fined $57m by data protection watchdog over GDPR violations. [Online]. Available: https://digitalguardian.com/blog/google-fined-57m-data-protection-watchdog-over-gdpr-violations
[4] ICO, UK. Home — ICO. [Online]. Available: https://ico.org.uk/
[5] ——. (2022) Guide to the UK General Data Protection Regulation (UK GDPR). [Online]. Available: https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/
[6] ——. Privacy policy template. [Online]. Available: https://ico.org.uk/media/for-organisations/documents/4019666/privacy-template.docx
[7] D. Torre, S. Abualhaija, M. Sabetzadeh, L. Briand, K. Baetens, P. Goes, and S. Forastier, “An AI-assisted approach for checking the completeness of privacy policies against GDPR,” in Proceedings of the 2020 IEEE 28th International Requirements Engineering Conference. IEEE, 2020, pp. 136–146.
[8] C. N. J. Silla and A. A. Freitas, “A survey of hierarchical classification across different application domains,” Data Mining and Knowledge Discovery, vol. 22, pp. 31–72, 2011.
[9] D. Koller and M. Sahami, “Hierarchically classifying documents using very few words,” Tech. Rep. 1997-75, 1997. [Online]. Available: http://ilpubs.stanford.edu:8090/291/
[10] E. P. Costa, A. C. Lorena, A. C. Carvalho, A. A. Freitas, and N. Holden, “Comparing several approaches for hierarchical classification of proteins with decision trees,” in Advances in Bioinformatics and Computational Biology: Second Brazilian Symposium on Bioinformatics, BSB 2007, Angra dos Reis, Brazil, August 29-31, 2007, Proceedings. Springer, 2007, pp. 126–137.
[11] T. Fagni and F. Sebastiani, “On the selection of negative examples for hierarchical text categorization,” in Proceedings the 3rd Lang & Technology Conference, 2007, pp. 24–28. [Online]. Available: https://openportal.isti.cnr.it/data/2007/160839/2007_160839.pdf
[12] A. Secker, M. N. Davies, A. A. Freitas, E. Clark, J. Timmis, and D. R. Flower, “Hierarchical classification of g-protein-coupled receptors with data-driven selection of attributes and classifiers,” International Journal of Data Mining and Bioinformatics, vol. 4, no. 2, pp. 191–210, 2010.
[13] F. E. Otero, A. A. Freitas, and C. G. Johnson, “A hierarchical classification ant colony algorithm for predicting gene ontology terms,” in Proceedings of the 7th European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. Springer, 2009, pp. 68–79.
[14] C. Vens, J. Struyf, L. Schietgat, S. Džeroski, and H. Blockeel, “Decision trees for hierarchical multi-label classification,” Machine Learning, vol. 73, no. 2, pp. 185–214, 2008.
[15] C. N. Silla Jr. and A. A. Freitas, “A global-model naive bayes approach to the hierarchical prediction of protein functions,” in Proceedings of the 2009 IEEE International Conference on Data Mining. IEEE, 2009, pp. 992–997.
[16] Y. Mao, J. Tian, J. Han, and X. Ren, “Hierarchical text classification with reinforced label assignment,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. ACL, 2019, pp. 445–455.
[17] J. Wehrmann, R. Cerri, and R. Barros, “Hierarchical multi-label classification networks,” in Proceedings of the 35th International Conference on Machine Learning. PMLR, 2018, pp. 5075–5084. [Online]. Available: https://proceedings.mlr.press/v80/wehrmann18a.html
[18] W. Huang, E. Chen, Q. Liu, Y. Chen, Z. Huang, Y. Liu, Z. Zhao, D. Zhang, and S. Wang, “Hierarchical multi-label text classification: An attention-based recurrent network approach,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management. ACM, 2019, pp. 1051–1060.
[19] X. Zhang, J. Xu, C. Soh, and L. Chen, “LA-HCN: Label-based attention for hierarchical multi-label text classification neural network,” Expert Systems with Applications, vol. 187, pp. 115 922:1–115 922:9, 2022.
[20] H. Harkous, K. Fawaz, R. Lebret, F. Schaub, K. G. Shin, and K. Aberer, “Polisis: Automated analysis and presentation of privacy policies using deep learning,” in Proceedings of the 27th USENIX Security Symposium. USENIX Association, 2018, pp. 531–548. [Online]. Available: https://www.usenix.org/conference/usenixsecurity18/presentation/harkous
[21] M. Mustapha, K. Krasnashchok, A. Al Bassit, and S. Skhiri, “Privacy policy classification with XLNet (short paper),” in Data Privacy Management, Cryptocurrencies and Blockchain Technology: ESORICS 2020 International Workshops, DPM 2020 and CBT 2020, Guildford, UK, September 17–18, 2020, Revised Selected Papers, ser. Lecture Notes in Computer Science. Springer, 2020, vol. 12484, pp. 250–257.
[22] T. A. Rahat, T. Le, and Y. Tian, “Automated detection of GDPR disclosure requirements in privacy policies using deep active learning,” arXiv:2111.04224 [cs.CR], 2021.
[23] M. Srinath, S. Wilson, and C. L. Giles, “Privacy at scale: Introducing the PrivaSeer corpus of web privacy policies,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. ACL, 2021, pp. 6829–6839.
[24] S. Wilson, F. Schaub, A. A. Dara, F. Liu, S. Cherivirala, P. G. Leon, M. S. Andersen, S. Zimmeck, K. M. Sathyendra, N. C. Russell, T. B. Norton, E. Hovy, J. Reidenberg, and N. Sadeh, “The creation and analysis of a website privacy policy corpus,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, vol. 1. ACL, 2016, pp. 1330–1340.
[25] O. Amaral, S. Abualhaija, D. Torre, M. Sabetzadeh, and L. Briand, “AI-enabled automation for completeness checking of privacy policies,” IEEE Transactions on Software Engineering, vol. 48, no. 11, pp. 4647–4674, 2022.
[26] D. Sarne, J. Schler, A. Singer, A. Sela, and I. Bar Siman Tov, “Unsupervised topic extraction from privacy policies,” in Companion Proceedings of the 2019 World Wide Web Conference. ACM, 2019, pp. 563–568.
[27] K. M. Sathyendra, S. Wilson, F. Schaub, S. Zimmeck, and N. Sadeh, “Identifying the provision of choices in privacy policy text,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. ACL, 2017, pp. 2774–2779.
[28] L. Lebanoff and F. Liu, “Automatic detection of vague words and sentences in privacy policies,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. ACL, 2018, pp. 3508–3517.
[29] S. Zimmeck, P. Story, D. Smullen, A. Ravichander, Z. Wang, J. R. Reidenberg, N. C. Russell, and N. Sadeh, “MAPS: Scaling privacy compliance analysis to a million apps,” Proceedings on Privacy Enhancing Technologies, vol. 2019, pp. 66–86, 2019.
[30] L. Robaldo, C. Bartolini, and G. Lenzini, “The DAPRECO knowledge base: Representing the GDPR in LegalRuleML,” in Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, 2020, pp. 5688–5697. [Online]. Available: https://aclanthology.org/2020.lrec-1.698
[31] N. M. Müller, D. Kowatsch, P. Debus, D. Mirdita, and K. Böttinger, “On GDPR compliance of companies’ privacy policies,” in Text, Speech, and Dialogue: 22nd International Conference, TSD 2019, Ljubljana, Slovenia, September 11–13, 2019, Proceedings, ser. Lecture Notes in Computer Science, vol. 11697. Springer, 2019, pp. 151–159.
[32] M. Kuznetsov, E. Novikova, I. Kotenko, and E. Doynikova, “Privacy policies of IoT devices: Collection and analysis,” Sensors, vol. 22, no. 5, pp. 1838:1–1838:23, 2022.
[33] W. B. Tesfay, P. Hofmann, T. Nakamura, S. Kiyomoto, and J. Serna, “PrivacyGuide: Towards an implementation of the EU GDPR on internet privacy policy evaluation,” in Proceedings of the 4th ACM International Workshop on Security and Privacy Analytics. ACM, 2018, pp. 15–21.
[34] M. Lippi, P. Pałka, G. Contissa, F. Lagioia, H.-W. Micklitz, G. Sartor, and P. Torroni, “CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service,” Artificial Intelligence and Law, vol. 27, no. 2, pp. 117–139, 2019.
[35] J. Tom, E. Sing, and R. Matulevičius, “Conceptual representation of the GDPR: Model and application directions,” in Perspectives in Business Informatics Research: 17th International Conference, BIR 2018, Stockholm, Sweden, September 24-26, 2018, Proceedings, ser. Lecture Notes in Business Information Processing, vol. 330. Springer, 2018, pp. 18–28.
[36] M. Palmirani and G. Governatori, “Modelling legal knowledge for GDPR compliance checking,” in Legal Knowledge and Information Systems, ser. Frontiers in Artificial Intelligence and Applications. IoS Press, 2018, vol. 313, pp. 101–110.
[37] J. Bhatia, M. C. Evans, and T. D. Breaux, “Identifying incompleteness in privacy policy goals using semantic frames,” Requirements Engineering, vol. 24, no. 3, pp. 291–313, 2019.
[38] N. M. Nejad, S. Scerri, and J. Lehmann, “KnIGHT: Mapping privacy policies to GDPR,” in Knowledge Engineering and Knowledge Management: 21st International Conference, EKAW 2018, Nancy, France, November 12-16, 2018, Proceedings. Springer, 2018, pp. 258–272.
[39] D. Torre, G. Soltana, M. Sabetzadeh, L. C. Briand, Y. Auffinger, and P. Goes, “Using models to enable compliance checking against the GDPR: An experience report,” in Proceedings of the 2019 ACM/IEEE 22nd International Conference on Model Driven Engineering Languages and Systems. IEEE, 2019.
[40] R. E. Hamdani, M. Mustapha, D. R. Amariles, A. Troussel, S. Meeùs, and K. Krasnashchok, “A combined rule-based and machine learning approach for automated GDPR compliance checking,” in Proceedings of the 18th International Conference on Artificial Intelligence and Law. ACM, 2021, pp. 40–49.
[41] S. Zimmeck and S. M. Bellovin, “Privee: An architecture for automatically analyzing web privacy policies,” in Proceedings of the 23rd USENIX Security Symposium. USENIX Association, 2014, pp. 1–16. [Online]. Available: https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/zimmeck
[42] J. Caramujo, A. R. da Silva, S. Monfared, A. Ribeiro, P. Calado, and T. Breaux, “RSL-IL4Privacy: a domain-specific language for the rigorous specification of privacy policies,” Requirements Engineering, vol. 24, no. 1, pp. 1–26, 2019.
[43] P. Pullonen, J. Tom, R. Matulevičius, and A. Toots, “Privacy-enhanced BPMN: enabling data privacy analysis in business processes models,” Software and Systems Modeling, vol. 18, no. 6, pp. 3235–3264, 2019.
[44] V. Ayala-Rivera and L. Pasquale, “The grace period has ended: An approach to operationalize GDPR requirements,” in Proceedings of the 2018 IEEE 26th International Requirements Engineering Conference. IEEE, 2018, pp. 136–146.
[45] S. Zimmeck, R. Goldstein, and D. Baraka, “PrivacyFlash Pro: Automating privacy policy generation for mobile apps,” in Proceedings of the 28th Network and Distributed System Security Symposium. Internet Society, 2021.
[46] L. Wang, U. Khan, J. Near, Q. Pang, J. Subramanian, N. Somani, P. Gao, A. Low, and D. Song, “PrivGuard: Privacy regulation compliance made easier,” in Proceedings of the 31st USENIX Security Symposium. USENIX Association, 2022, pp. 3753–3770. [Online]. Available: https://www.usenix.org/conference/usenixsecurity22/presentation/wang-lun
[47] H. Cui, R. Trimananda, A. Markopoulou, and S. Jordan, “PoliGraph: Automated privacy policy analysis using knowledge graphs,” in Proceedings of the 32nd USENIX Security Symposium. USENIX Association, 2023. [Online]. Available: https://www.usenix.org/conference/usenixsecurity23/presentation/cui
[48] International Association of Privacy Professionals (IAPP). (2021) Global comprehensive privacy law mapping chart. [Online]. Available: https://iapp.org/resources/article/global-comprehensive-privacy-law-mapping-chart/
[49] W3C Community and Business Groups. (2017) W3C Data Privacy Vocabularies and Controls CG. [Online]. Available: https://www.w3.org/community/dpvcg/
[50] California State Legislature. (2018) California Consumer Privacy Act of 2018. [Online]. Available: https://leginfo.legislature.ca.gov/faces/codes_displayText.xhtml?division=3.&part=4.&lawCode=CIV&title=1.81.5
[51] Brazil Government. (2020) Lei Geral de Proteção de Dados Pessoais. [Online]. Available: http://www.planalto.gov.br/ccivil_03/_at02015-2018/2018/lei/113709.htm
[52] J. Bryant. (2021) China’s PIPL takes effect, compliance ‘a challenge’. [Online]. Available: https://iapp.org/news/a/chinas-pipl-takes-effect-compliance-a-challenge/
[53] Software Freedom Conservancy. Selenium. [Online]. Available: https://www.selenium.dev/
[54] Chrome Web Store. SingleFile. [Online]. Available: https://chrome.google.com/webstore/detail/singlefile/mpiodijhokgodhhofbcjdecpffjipkle
[55] A. A. M. Gopinath, S. Wilson, and N. Sadeh, “Supervised and unsupervised methods for robust separation of section titles and prose text in web documents,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. ACL, 2018, pp. 850–855.
[56] B. Williams, C. Halloin, W. Löbel, F. Finklea, E. Lipke, R. Zweigerdt, and S. Cremaschi, “Data-driven model development for cardiomyocyte production experimental failure prediction,” in 30th European Symposium on Computer Aided Process Engineering, ser. Computer Aided Chemical Engineering. Elsevier, 2020, vol. 48, pp. 1639–1644.
[57] J. M. Moguerza and A. Muñoz, “Support vector machines with applications,” Statistical Science, vol. 21, no. 3, pp. 322–336, 2006.
[58] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Machine Learning, vol. 63, pp. 3–42, 2006.
[59] XGBoost community. XGBoost. [Online]. Available: https://xgboost.ai/
[60] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[61] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.
[62] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. ACL, 2014, pp. 1532–1543.
[63] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.