
Deep Active Learning with Crowdsourcing Data for Privacy Policy Classification

Wenjun Qiu University of Toronto
[email protected]
   David Lie University of Toronto
[email protected]
Abstract

Privacy policies are statements that notify users of the services’ data practices. However, few users are willing to read through policy texts due to their length and complexity. While automated tools based on machine learning exist for privacy policy analysis, achieving high classification accuracy requires classifiers to be trained on a large labeled dataset. Most existing policy corpora are labeled by skilled human annotators, requiring a significant amount of labor and effort. In this paper, we leverage active learning and crowdsourcing techniques to develop an automated classification tool named Calpric (Crowdsourcing Active Learning PRIvacy Policy Classifier), which produces annotations equivalent to those of skilled human annotators with high accuracy while minimizing the labeling cost. Specifically, active learning allows classifiers to proactively select the most informative segments to be labeled. On average, our model is able to achieve the same F1 score using only 62% of the original labeling effort. Calpric’s use of active learning also addresses the naturally occurring class imbalance in unlabeled privacy policy datasets, as there are many more statements declaring the collection of private information than declaring its absence. By selecting samples from the minority class for labeling, Calpric automatically creates a more balanced training set.

I Introduction

Privacy policies are legal documents that disclose how a party collects, uses, and shares private information. Privacy legislation, such as the California Online Privacy Protection Act (CalOPPA) and the General Data Protection Regulation (GDPR), requires online services to use privacy policies to obtain consent for the collection and use of private information. However, studies have shown that users are unlikely to read privacy policies, as it would take hundreds of hours to read all the privacy policies a typical person encounters over a year [23]. Given this challenge, there have been many proposals to use machine learning-based text processing tools to distill critical information from privacy policies and provide it to both users and regulators [10, 48]. While there has been some success, an ongoing challenge with this approach is the difficulty of obtaining the large, high-quality labeled training sets that machine learning requires to be effective. For instance, Liu et al. [18] show that privacy policy paragraphs can be classified with an average micro-F1 score of 0.78, with their model being trained on the OPP-115 corpus [40], which consists of 23K data practices, 128K practice attributes, and 103K labeled text spans extracted from 115 privacy policies.

The difficulty of obtaining labeled data stems from the conventional wisdom that privacy policies are hard to read and understand, and thus must be labeled by skilled annotators. As an example, the most widely-used OPP-115 dataset was prepared by 10 skilled annotators (i.e., law students or those with legal training), who spent an average of 72 minutes on each privacy policy. While labeling privacy policies is labor intensive and expensive, unlabeled policies are easily accessible. In this work, we propose the use of active learning over policy segments, where a segment is a short series of sentences extracted from the original privacy policy, all related to the same data practice. An ideal segment contains all necessary information for labelers to understand the described data practices and excludes redundant details that may require extra time and effort to read.

To evaluate this idea, we design and implement Calpric (Crowdsourcing Active Learning PRIvacy Policy Classifier), which classifies whether a privacy policy collects three of the most commonly collected categories of private data: user contacts, user location, and the device identifier. Calpric uses active learning to select the policy segments that will most improve model accuracy. These segments are sent to Amazon Mechanical Turk (mTurk) and labeled by crowdsourcing workers (Turkers). Calpric is based on the Privacy Policy Word Embedding bidirectional Long Short Term Memory (PPWE-biLSTM) model and focuses on labeling first party data practices involving the collection/use of contact, location, and device information. We apply Calpric to a corpus of 52K privacy policies obtained by scraping the Google Play Store, a large online market for Android applications. Using active learning and policy segmentation, Calpric achieves a classification accuracy of 97.6%, which exceeds that of models trained on datasets labeled by skilled labelers (e.g., law students).

We find that one of the reasons Calpric achieves significantly better accuracy is that its use of active learning allows it to mitigate class imbalance, which is known to lead to lower model accuracy. Privacy policy labeling suffers from heavy class imbalance, as there are significantly more positive segments that assert data collection (e.g., “The Application also accesses the names of individuals in the user’s contacts.”) than negative segments that assert the absence of data collection (e.g., “We do not collect your name, email addresses, postal addresses, and/or telephone numbers.”). In fact, the most widely used labeled privacy policy datasets, OPP-115 and APP-350, contain only 2.0% and 18.6% negative segments, respectively. By selecting the policy segments that are most likely to improve model accuracy, active learning allows Calpric to identify and use the negative samples in our large dataset and thus achieve a balanced training set despite the naturally occurring bias towards positive samples.

In summary, this paper makes the following contributions:

  1.

    We present Calpric, the first system we are aware of that applies active learning to the problem of privacy policy classification.

  2.

    We find that by automatically decomposing privacy policies into segments, Calpric is able to use active learning to identify the individual segments across privacy policies that are most valuable, allowing it to increase its accuracy with far fewer labeled segments.

  3.

    We study different design options in deploying Calpric and show that Calpric is able to achieve 97.6% labeling accuracy with Amazon Mechanical Turk workers, which exceeds that of previous studies that used skilled workers to label policies.

II Related Works

We review prior work on automated privacy policy analysis in Section II-A and discuss active learning techniques and their application on top of deep learning models in Section II-B.

II-A Privacy Policy Analysis

Given the number and length of privacy policies, much of the related work focuses on extracting important information from them and identifying the related data practices. Watanabe et al. [39] use keyword extraction on privacy policies to identify non-compliance between mobile apps and their privacy policies. Wilson et al. [40] built and trained text classifiers using the OPP-115 Corpus, which consists of 115 website privacy policies labeled by legal experts. Using the same corpus, Liu et al. [18] presented a performance comparison among Logistic Regression (LR), Support Vector Machine (SVM), and Convolutional Neural Network (CNN) models, and this approach has also been used to detect non-compliance in Android applications [48]. Harkous et al. [10] developed a multi-label CNN classifier to label policies with major categories and finer-grained attributes. Elisa et al. [5] presented a solution to automatically assess the completeness of a policy using more complex algorithms such as Linear Support Vector Machines (LSVM). Zimmeck et al. [47] also created a labeled mobile app-specific privacy policy corpus, APP-350. To compensate for the rarity of negative annotation labels, they introduced synthetic data by manually changing positive policy texts into negative samples. Finally, Elisa et al. [5] extracted texts from 64 privacy policies, which were labeled by a single annotator. Because Calpric leverages crowdsourcing to efficiently label samples, it is able to build a larger labeled dataset and achieve significantly greater accuracy than previous work. In addition, while Zimmeck et al. [47] used synthetic training points to address class imbalance, Calpric overcomes the imbalanced training set by mining a large unlabeled dataset for potential negative samples and getting them labeled through active learning.

Although Zimmeck et al. [46] and Wilson et al. [41] also explore the crowdsourcing option, in both studies labelers are required to read through the entire privacy policy and answer questions accordingly. In contrast, we preprocess policies into segments that address the same data practice to reduce labeling effort, as suggested by Schaub et al. [28]. Whereas Wilson et al. [41] focus on methods to increase crowdsourcing productivity, such as highlighting the most relevant paragraphs in a privacy policy, we shift the focus and extend our work to the classification of policy segments. Our study also differs from all prior published work on privacy policy classification in that we integrate active learning algorithms on top of regular classifiers. Combined with the use of policy segments, Calpric is able to address the class imbalance issue by automatically querying more negative samples from the unlabeled training pool.

II-B Active Learning

Active learning is an iterative training strategy in which an initial model is trained in the normal fashion, and is then allowed to select unlabeled training instances using a query strategy. In each active learning iteration, a number of selected instances are sent to a query oracle for labeling and the model is then updated with the newly labeled training points—we use Amazon mTurk as our query oracle. The model can be trained for an arbitrary number of such iterations until a pre-defined stopping point is reached.

In general, there are three active learning scenarios: membership query synthesis, stream-based selective sampling, and pool-based selective sampling. Pool-based selective sampling is the most well-studied scenario, especially for text classification [38] and information extraction [32]. The major advantage of this method is that it evaluates and ranks the entire set of unlabeled training points before selecting the next one to label [30]. We select pool-based sampling because it is applicable to Calpric’s scenario, as we have the entire unlabeled training set up front, and it is the most effective and widely used sampling mode for text classification.

To our knowledge, this is the first paper that performs deep active learning classification on privacy policies. However, there are related studies using similar learning approaches in areas such as image analysis and classification [42, 9], and natural language processing (NLP). Zhang et al. [43] implemented active learning strategies on top of CNNs for sentiment analysis, whereas Shen et al. [34] investigated uncertainty-based active learning heuristics for sequence tagging on a newly proposed CNN-CNN-LSTM architecture. We build upon prior work and experiment with our PPWE-biLSTM and fine-tuned BERT [7] classifiers, which have been shown to be state-of-the-art NLP models for text classification tasks [45, 12].

III Design Overview

Figure 1 shows a high-level overview of our proposed system, comprising four major components: data preparation (Section IV), crowdsourcing (Section V), and querying and active learning (Section VI).

Figure 1: A simplified overview of the active learning system

The objective of Calpric is to determine whether a privacy policy declares collection of a particular private data category. Currently, Calpric supports three private data categories: user contacts, location, and device ID. To do so, we design Calpric as a combination of multiple binary classifiers, each responsible for classifying one private data category. Each binary classifier is trained on policy segments of the corresponding category.

In the preprocessing stage, we download 375K Android app privacy policies from the Google Play Store, filter out duplicates, and process and sanitize them, resulting in a total of 52K privacy policies in raw text format. To reduce crowdsourcing effort and increase label accuracy, instead of requiring labelers to read through the entire privacy policy, we extract smaller policy segments from the original policies and pre-classify them into groups using basic keyword extraction heuristics. In each active learning iteration, the model selects the most representative segments \mathbb{S} from the unlabeled training pool, denoted by \mathbb{X}^{U}. These segments are sent to mTurk and labeled by multiple Turkers in order to reduce the possibility of erroneous labels. The assigned labels are post-processed into a single labeled policy segment and added to the accumulated labeled training set \mathbb{X}^{L}. The process of requesting and retrieving labeled data is handled by the query oracle. Using the training set with the new labels added, the classification model updates its weights \omega and repeats the process until a pre-defined stopping criterion is satisfied. We describe Calpric in more detail below.
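
As an illustration, the following minimal Python sketch outlines the loop described above. The helper callables (select_segments, crowdsource_labels, consolidate, stop) are hypothetical placeholders for the components detailed in Sections IV–VI; this is a simplified sketch, not Calpric’s actual implementation.

    # Minimal sketch of Calpric's pool-based active learning loop (simplified).
    # The helper callables passed in are hypothetical placeholders, not Calpric's API.

    def active_learning_loop(model, unlabeled_pool, labeled_set,
                             select_segments, crowdsource_labels, consolidate,
                             stop, query_size=30, max_iterations=100):
        for _ in range(max_iterations):
            # 1. The query strategy picks the most informative unlabeled segments S.
            segments = select_segments(model, unlabeled_pool, k=query_size)

            # 2. The query oracle publishes them to mTurk; each is labeled by 5 Turkers.
            raw_labels = crowdsource_labels(segments)

            # 3. Majority-vote consolidation; unaligned segments are dropped or re-queued.
            aligned = consolidate(segments, raw_labels)
            labeled_set.extend(aligned)
            for segment, _label in aligned:
                unlabeled_pool.remove(segment)

            # 4. Retrain the classifier weights on the accumulated labeled set X^L.
            texts, labels = zip(*labeled_set)
            model.fit(texts, labels)

            if stop(model, labeled_set):   # pre-defined stopping criterion
                break
        return model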

IV Data Preparation

Calpric requires a small set of labeled privacy policy segments to boot-strap the active learning process and a large set of unlabeled privacy policy segments. The data preparation process begins with privacy policy acquisition, which is covered in Section IV-A, followed by policy segmentation in Section IV-B.

IV-A Policy Downloading

We scrape Android application metadata, including the link to the app’s privacy policy, from the Google Play Store. We then filter out broken or invalid privacy policy links. We only consider policies in HTML format, though we could easily extend our preprocessing to handle less common formats such as PDF or raw text. Our web scraper is based on dragnet’s pre-trained model [25]. The downloaded HTML is then passed through a sanitization pipeline, in which irrelevant elements, including HTML tags, advertisements, and UI features (e.g., navigation bars), are removed. We verify that the downloaded pages are actually privacy policies by checking for the presence of keywords such as “privacy policy” and “legal”, and by excluding any documents with fewer than 50 words. We also verify the text language using the Python langdetect library [24], as the Calpric models currently only handle English text. Finally, we identify and remove duplicates, leaving 51,781 usable privacy policies.
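
As a sketch, the document-level filtering step might look like the following, assuming the input text has already been extracted from HTML and sanitized; the keyword list and thresholds mirror the checks described above.

    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException

    POLICY_KEYWORDS = ("privacy policy", "legal")  # indicative keywords from above
    MIN_WORDS = 50

    def is_usable_policy(text: str) -> bool:
        """Return True if the sanitized document looks like an English privacy policy."""
        lowered = text.lower()
        if not any(kw in lowered for kw in POLICY_KEYWORDS):
            return False                      # does not look like a privacy policy
        if len(text.split()) < MIN_WORDS:
            return False                      # too short to be a real policy
        try:
            return detect(text) == "en"       # Calpric currently handles English only
        except LangDetectException:
            return False                      # language could not be determined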

IV-B Policy Segmentation

Next, we extract policy segments, which will form the training points for Calpric. Policy segmentation includes the following tasks: keyword filtering, segmentation and pre-classification.

We first tokenize each privacy policy into a list of sentences using the Python NLTK library [1]. Calpric then filters the sentences using keywords corresponding to three private data categories: contacts, location and device ID. For example, we use keywords such as “email” and “phone number” to extract sentences that discuss the handling of private data in the contacts category and keywords such as “IP address” for the device ID category.
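
A minimal sketch of this keyword-based pre-classification is shown below; the keyword lists are illustrative samples (only “email”, “phone number”, and “IP address” are taken from the text above, the rest are assumed).

    from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")

    # Illustrative (not exhaustive) keyword lists for the three data categories.
    CATEGORY_KEYWORDS = {
        "contact": ["email", "phone number"],
        "location": ["location", "gps"],          # assumed examples
        "device": ["ip address", "device id"],    # "device id" is an assumed example
    }

    def pre_classify_sentences(policy_text: str) -> dict:
        """Group a policy's sentences by the private data category they mention."""
        groups = {category: [] for category in CATEGORY_KEYWORDS}
        for sentence in sent_tokenize(policy_text):
            lowered = sentence.lower()
            for category, keywords in CATEGORY_KEYWORDS.items():
                if any(kw in lowered for kw in keywords):
                    groups[category].append(sentence)
        return groups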

We then segment the privacy policies. The objective of segmentation is to include only the necessary sentences required to understand what the privacy policy is declaring with respect to the private data category. A typical example of a policy segment is as follows:

Personal information is data that can be used to uniquely identify or contact a single person. When you visit, download or upgrade our app or our products, we do not use this information explicitly. However, we may collect personal information to improve our services and deliver a better experience.

All sentences in the segment contribute to the meaning of the segment. For instance, without the first sentence, it is not clear that personal information refers to contact information. However, if the segment only contains the first two sentences, readers may be misled to provide a negative label to the segment. Finally, the sentence immediately following the snippet was “Please be aware that when you register and set up an account, you will at minimum have to download the Application onto your mobile device.” This sentence is not required to understand whether the application was collecting private contact information or not, and thus should not be included in the segment.

Segments are constructed in two steps. First, Calpric checks for keywords such as “include” or “for example”, as well as punctuation marks such as semicolons and question marks, which indicate that the previous sentence may contain information related to the current one.

Second, Calpric uses NLP algorithms to measure the similarity of consecutive sentences. As part of the algorithm, similar to Harkous et al. [10], we create a domain-specific word embedding trained on the 52K downloaded policies on top of Fasttext [2] and use it to generate vector representations of sentences. While various general-purpose pre-trained word embeddings are available, customized embeddings tend to achieve better performance for classification tasks [37]. As a Fasttext-based embedding that allows vector training on sub-words, the policy embedding is able to interpret words with spelling mistakes and, more importantly, capture the actual meaning of proper nouns, which usually consist of several words.
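
The paper trains its embedding on top of the general-purpose Fasttext embedding; as a simplified sketch, a domain-specific Fasttext-style embedding could be trained from scratch on the policy corpus with gensim roughly as follows (hyperparameters are illustrative assumptions).

    from gensim.models import FastText
    from gensim.utils import simple_preprocess

    def train_policy_embedding(policies, dim=300):
        """Train a sub-word (character n-gram) embedding on the downloaded policy texts."""
        tokenized = [simple_preprocess(p) for p in policies]
        model = FastText(sentences=tokenized, vector_size=dim,
                         window=5, min_count=3, epochs=10)
        return model.wv  # KeyedVectors used for sentence vectors and WMD

    # policy_wv = train_policy_embedding(policies)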

Calpric measures sentence similarity using the Word Mover’s Distance (WMD) [16], which evaluates how relevant the previous sentence is to the current one. Dias et al. [8] propose a threshold function that consists of the average and the standard deviation of the downhill depths, where downhills represent topic shifts in a document. To determine if there is a topic boundary between two sentences, our tool compares the WMD similarity of these sentences against a Segment Threshold (ST), which is calculated for every individual privacy policy; if the similarity is larger than the threshold, the two sentences are considered related and no boundary is placed between them.

ST is calculated as follows:

ST = \mu + Topic\_Boundary\_Constant \times \sigma,   (1)

where \mu and \sigma are the mean and standard deviation of the WMD scores between sentences in a privacy policy, respectively. The topic boundary constant in Calpric is set to 2.5.
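
A minimal sketch of this boundary decision is shown below, reusing the policy_wv KeyedVectors from the embedding sketch above (WMD support in gensim requires an optimal-transport backend such as POT). Since WMD is a distance (lower means more similar), this sketch places a boundary when the distance between consecutive sentences exceeds ST; this is the distance-space reading of the comparison described above, and the keyword filtering and punctuation checks from the first step are omitted.

    import numpy as np

    def segment_policy(tokenized_sentences, policy_wv, boundary_constant=2.5):
        """Split a policy (list of token lists) into topically coherent segments."""
        if len(tokenized_sentences) < 2:
            return [tokenized_sentences]

        # WMD between consecutive sentences; lower distance means more similar.
        dists = [policy_wv.wmdistance(tokenized_sentences[i - 1], tokenized_sentences[i])
                 for i in range(1, len(tokenized_sentences))]
        st = np.mean(dists) + boundary_constant * np.std(dists)  # Equation (1)

        segments, current = [], [tokenized_sentences[0]]
        for sentence, dist in zip(tokenized_sentences[1:], dists):
            if dist > st:            # unusually dissimilar: treat as a topic boundary
                segments.append(current)
                current = [sentence]
            else:                    # related to the previous sentence: same segment
                current.append(sentence)
        segments.append(current)
        return segments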

After segmentation, Calpric has a set of unlabeled segments grouped by private data category: \mathbb{X}^{U}_{contact}, \mathbb{X}^{U}_{device}, and \mathbb{X}^{U}_{location}. In total, our 52K privacy policies produced 153K, 63K, and 38K segments for contacts, location, and device ID, respectively.

V Crowdsourcing

Individual segments are labeled by crowdworkers hired to perform Human Intelligence Tasks (HITs) on the Amazon mTurk service. Each segment is labeled by five workers and we consolidate the results into a single label, which we add to the labeled training set \mathbb{X}^{L} for the respective data category. In this section, we cover the design of the HIT survey questions in Section V-A and methods to ensure data reliability in Section V-B. The use of human workers was approved by our institutional review board (IRB).

V-A Survey Questions

Amazon mTurk provides a platform for requesters to publish HITs, in the form of surveys, to a marketplace where they are accepted and performed by Turkers. Our Amazon mTurk HITs consist of two questions (see Appendix A). The first question confirms that the policy segment describes First Party Collection/Use, while the second question asks whether or not the policy segment claims to collect/use private data of that type. Labels for segments where Turkers answer “No” to the first question are discarded, as these segments are likely not relevant to data collection. For the remaining labels, the answer to the second question is converted to a positive label if Turkers answer “Yes” and a negative label if Turkers answer “No”.

Because some Turkers may label policy segments incorrectly, each survey is sent to five Turkers; we measure the level of agreement among the Turkers for each segment and only accept labels where there is sufficient agreement. In addition, some privacy policy segments are inherently ambiguous and open to interpretation. Because they are a vehicle for obtaining consent, privacy policies should be understandable by people with no legal training. As a result, if regular people cannot agree on the meaning of a policy segment, then it has no definite meaning, and thus is not a useful training point for Calpric.

Our final survey questions were selected from a set of four survey variants we constructed. To determine the best one, we computed the Alignment Rate (AR) for each survey variant. To calculate AR, we first need to introduce several definitions. Agreement Percentage (AP) refers to the percentage of Turkers who labeled a segment the same way. For instance, if five workers select Yes for a segment, two select No, and three select Other, the AP will be 5/(5+2+3) = 50%. To filter out low-confidence labels, we pre-define a threshold above which the AP must fall for the label to be considered reliable (i.e., the segment is not ambiguous and the Turkers labeled the segment correctly). We call this threshold the Acceptance Threshold (AT) and set it to 80% by default in this study. We refer to labels that pass the AT as Aligned Labels. AR is thus the number of aligned labels extracted from segments over the total number of segments published for labeling. From this we can see that, when used on the same set of segments and run over a sufficiently large number of Turkers, a survey variant that achieves a higher AR is likely more consistently interpreted by the Turkers and will thus yield fewer erroneous labels. We note that this simple AR metric achieves a similar result as more complex measures of internal consistency because our labels are binary (collect or no-collect).
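
A minimal sketch of the AP/AT consolidation and the AR computation, as defined above, could look as follows.

    from collections import Counter

    def consolidate_label(votes, acceptance_threshold=0.8):
        """Majority-vote consolidation of Turker answers for one segment.

        Returns the majority label if its agreement percentage (AP) meets the
        acceptance threshold (AT), otherwise None (the label is unaligned)."""
        label, count = Counter(votes).most_common(1)[0]
        ap = count / len(votes)
        return label if ap >= acceptance_threshold else None

    def alignment_rate(all_votes, acceptance_threshold=0.8):
        """AR = number of aligned labels / number of segments published for labeling."""
        aligned = [consolidate_label(v, acceptance_threshold) for v in all_votes]
        return sum(label is not None for label in aligned) / len(all_votes)

    # Example from the text: five Yes, two No, three Other -> AP = 0.5 < AT, unaligned.
    # consolidate_label(["Yes"] * 5 + ["No"] * 2 + ["Other"] * 3)  # returns None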

We measure AR for our variants by preparing three batches of questions, with 10 different policy segments per batch. We randomly select the segments from the APP-350 dataset, which were labeled by skilled labelers and which we consider 100% reliable. Using the four different versions of survey questions, we calculate the AR on the 30 policy segments drawn from the APP-350 dataset, with each segment labeled by five Turkers. The AR varies from 45% to 73%, with the average labeling accuracy between 70% and 75% for all versions. We select the version with the highest AR for all subsequent crowdsourcing tasks.

V-B Label Reliability

Because we create training labels by crowdsourcing instead of using legal experts, we need to ensure label reliability and reduce data noise as much as possible. Previous studies show that there should be an emphasis on quality control when dealing with crowdsourcing tasks, because workers may have varying degrees of skill or may not pay full attention when performing HITs [5]. We introduce a set of quality controls to validate crowdsourcing workers and filter out low-quality data.

V-B1 Worker Requirements

The published HITs only allow qualified workers to answer survey questions. We set the following requirements on Amazon mTurk:

  • Approval Rate > 85%

  • Number of HITs Approved > 50

  • English Speaker

  • Android Mobile User

V-B2 Consolidating by the Rule of Majority

As mentioned, the policy texts are processed and segmented into short segments, which are labeled by crowdsourcing workers on Amazon mTurk. After gathering segment labels from the workers, we consolidate the results using the rule of majority: most of the annotators must agree on the same label. If the AP of a certain label is smaller than the pre-defined AT, we consider it a low-confidence label and do not include it in our training set, as it may contaminate the classifiers.

V-B3 Repeat Workers

Due to the nature of majority rule, we need to ensure that no worker answers the same question repeatedly. We implement a qualification test embedded in every published batch that checks whether a worker ID has previously accepted the same batch of questions. Permission to access our survey is only given to Turkers who have not previously participated in the same question.

V-B4 Knowledge Test and Honesty Test

There are cases where crowdworkers provide bogus answers without actually doing the work [41], or where adversarial bots enter spam answers on crowdsourcing platforms to earn money [22]. To prevent noise from such interference, we design a qualification test in the form of simple questions to verify whether workers are qualified to answer our questions.

We first consider a knowledge test, which uses segments from APP-350 as a test, mixed with other unlabeled segments, to see whether the Turker is able to label the test segment correctly. If the Turker fails the test, we do not use any of that Turker’s labels. To test the effectiveness of the knowledge test, we randomly select 60 questions from APP-350 and experiment with two surveys, randomly designating a question in one survey as the knowledge test and including no knowledge test question in the other. Each survey is given to five Turkers. Our results show that many Turkers who answer the test question incorrectly still provide useful labels: the accuracy of the survey with the knowledge test and the survey without it is 100% and 97.6%, respectively. For this small difference in accuracy, we would have had to discard a large number of surveys—using the knowledge test, we would need to run twice as many surveys to obtain the same number of labeled segments. We thus conclude that the knowledge test is not an effective way to improve label quality.

Instead, we implement a simple honesty-checking question, where we ask Turkers whether they paid close attention to the questions and answered accordingly, and highlight the fact that they will still receive full payment even if they did not. Such questions have been shown to be effective in improving labeling accuracy [26]. We performed a similar test survey with the honesty-checking question and discarded all labels from Turkers who answered “No” to it. The accuracy of this survey was 97.7%, and when deployed, we found that the honesty question was answered “Yes” 99.63% of the time. We include the testing results in Table I.

                      Aligned Labels    AR       Correct Labels    Accuracy
With Know_Test              20          33.3%          20           100%
Without Any Test            42          70.0%          41           97.6%
With Honesty_Test           43          71.7%          42           97.7%
TABLE I: Impact of the knowledge test and the honesty-checking question on data collection performance using 60 test segments

V-B5 Turker Wages and Performance

We experimented with worker wages and decided on a payment of $0.60 per batch per Turker. The average time to complete a 40-question survey is 7 minutes. As Lovett et al. [20] suggest, a standard of $0.15 per minute should be adequate for most Turkers. We tested three different payments ($0.10, $0.60, and $1.50 per batch) using five different batches; the resulting average ARs are 31.9%, 77.0%, and 76.0%, respectively, as shown in Table II. Note that the $0.10 payment results in a significant drop in AR. One batch was not even completed, as no more workers were willing to take the task for such a low payment.

         Batch 1   Batch 2   Batch 3   Batch 4   Batch 5   Overall
$0.1      22.5%     12.5%     20.0%     72.5%       -       31.9%*
$0.6      80.0%     57.5%     82.5%     85.0%     80.0%     77.0%
$1.5      75.0%     55.0%     87.5%     85.0%     77.5%     76.0%
Avg       59.2%     41.7%     63.3%     80.8%     78.8%     63.8%
* Calculated based on n=4, without considering the uncompleted batch
TABLE II: Alignment rate for different payments

We observe that experienced workers are not attracted by the extremely low payment, while workers who accepted these tasks did not spend much effort on the questions, as reflected by the results for $0.10. On the other hand, surprisingly, the higher pay rate of $1.50 did not necessarily produce better results. Although these Turkers spent more time on the labeling tasks, the overall AR did not vary much from that of the $0.60 pay rate. We observe a performance ceiling due to the fact that some policy segments are vague by nature; it is difficult to label them even for legal experts. We therefore conclude that $0.60 per batch per worker is an appropriate payment and use this setting in all other experiments and evaluations.

VI Automated Privacy Policy Classification

In this section, we present our automated privacy policy analysis tool. We discuss the basic machine learning (ML) model selection in Section VI-A, and active learning strategies to improve label efficiency in Section VI-B.

VI-A Classification Model Description

Because crowdsourcing holds the promise of providing more labeled data than previous studies, we are able to use more complex, higher-capacity models than previous works, which used models such as LR and SVM. Traditional CNNs struggle with long-distance dependencies in text classification, while recent NLP studies address this problem with RNNs, LSTMs, and BERT models. These models have been shown to achieve better overall performance in text classification [19, 45, 7].

A commonly used, state-of-the-art language model is BERT, which can be fine-tuned for various tasks including text classification. However, because active learning is an iterative training algorithm, an important factor is the retraining time of the selected model. While accurate, BERT suffers from slow training due to its complexity and the number of parameters involved, despite being an application of transfer learning. Joselson and Hallen [12] evaluate a fine-tuned BERT model and a customized biLSTM classifier for sentiment analysis, and show that the former requires hours more training while achieving similar performance (69% vs. 71%). As a result, we developed our own model based on biLSTM, which we call PPWE-biLSTM.

The model is implemented with bidirectional LSTM layers and a customized privacy policy word embedding trained on top of the general-purpose Fasttext embedding. Our LSTM nodes are bidirectional, so two recurrent layers running in opposite directions allow the model to use both past and future context for a given time step. The PPWE-biLSTM used in our experiments consists of a hidden layer with 100 densely connected memory units, as this structure has been shown to achieve lower perplexity than regular stacked models [13, 11]. We use leaky ReLUs [21] and dropout [35] of 0.1 to prevent over-fitting while sustaining the weight updates to avoid vanishing gradients during propagation [21]. We apply a sigmoid activation function to the model output for binary prediction. We use binary cross-entropy loss and the Adam optimizer [14] to find model weights, with a batch size of 20 and 4 epochs. We apply the same vectorization method with our customized word embedding for the ML component. An evaluation of PPWE-biLSTM vs. BERT is included in Section VII-C. We use the above model setup as the default for all experiments, unless specified otherwise.
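
A minimal Keras sketch of this architecture is shown below, under the assumption that segments are fed in as fixed-length sequences of PPWE word vectors; the sequence length, embedding dimension, intermediate dense layer size, and the exact placement of the leaky ReLU are illustrative assumptions not specified in the paper.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    MAX_LEN, EMBED_DIM = 200, 300  # illustrative values, not specified in the paper

    def build_ppwe_bilstm():
        model = models.Sequential([
            tf.keras.Input(shape=(MAX_LEN, EMBED_DIM)),     # pre-vectorized PPWE inputs
            layers.Bidirectional(layers.LSTM(100)),         # 100 memory units per direction
            layers.Dropout(0.1),                            # dropout of 0.1
            layers.Dense(64),                               # assumed intermediate dense layer
            layers.LeakyReLU(),                             # leaky ReLU activation
            layers.Dense(1, activation="sigmoid"),          # binary collect / not-collect output
        ])
        model.compile(optimizer=tf.keras.optimizers.Adam(),
                      loss="binary_crossentropy",
                      metrics=["accuracy"])
        return model

    # model = build_ppwe_bilstm()
    # model.fit(X_train, y_train, batch_size=20, epochs=4)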

VI-B Applying Active Learning

Recall that we use pool-based sampling because we have the entire unlabeled training set available and it is the most effective and widely used sampling mode for text classification. The detailed setup is described in the following sections, including querying strategies (Section VI-B1), the acceptance threshold (Section VI-B2), and re-labeling strategies (Section VI-B3).

VI-B1 Querying Strategies

As mentioned, the active learner proactively selects a subset of available examples in each learning iteration. The key question is: how does the learner identify which instances would be the most informative to label? While Settles [30] describes a wide range of existing querying strategies, we first investigate uncertainty-based sampling [17], as it is the most commonly used query framework. In this framework, unlabeled samples are ranked according to how confident the current model is in its predictions for them. Within uncertainty-based sampling, there are a number of algorithms:

  • Least Confidence (LC): selects the least certain instance from the unlabeled set and requests it to be labeled.

  • Margin Sampling (MS): selects instances where the difference between the first and second most likely classes is the smallest.

  • Entropy-based Sampling (ES): selects samples with the largest entropy in class probabilities.

For binary classification, MS and ES reduce to LC. In such cases, the three methods will query instances with a class posterior closest to 0.5, that is, the most ambiguous segments [30]. As a result, we will refer to these algorithms as LC.
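
A minimal sketch of this binary LC criterion, assuming the classifier exposes a predicted positive-class probability for each unlabeled segment, is shown below.

    import numpy as np

    def least_confidence_query(model, unlabeled_vectors, k=30):
        """Return indices of the k unlabeled segments whose predicted positive-class
        probability is closest to 0.5 (i.e., the most ambiguous ones)."""
        probs = model.predict(unlabeled_vectors).ravel()   # sigmoid outputs in [0, 1]
        uncertainty = np.abs(probs - 0.5)                  # 0 means maximally uncertain
        return np.argsort(uncertainty)[:k]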

We further explore other querying strategies based on the reduction [27], density-weighted [31], and batch-mode [3] frameworks:

  • Expected Error Reduction (EER): selects instances that make the most impact on the current model, specifically, by estimating how much the generalization error is likely to be reduced as the new samples are introduced to the model. Building upon Roy and McCallum’s work [27], we calculate the expected future error of a classifier using Monte Carlo estimation.

  • Information Density (ID): considers not only uncertain instances, but also those which are representative of the dense regions of the input space.

  • Batch Mode Uncertainty (BMU): first prioritizes diversity in the initial iterations, providing a global view of the input distribution; then, as the number of labeled instances increases, the system shifts priority to instances about which the classifier is uncertain. Our design makes use of the modAL active learning framework, which is built upon the study of Cardoso et al. [3].

We describe the detailed implementations of each of these algorithms in Appendix C.

VI-B2 Acceptance Threshold & Alignment Rate

As described in the previous section, the acceptance threshold (AT) defines the minimum agreement percentage (AP) among labelers that must be achieved for a label to be accepted. There are several prior studies on privacy policy classification (without AL) using crowdsourcing methods, in which 80% and 100% are the most common thresholds [36, 41].

Setting the AT has a clear trade-off: increasing the AT will result in a decrease in Alignment Rate (AR), which eventually leads to the query oracle needing to create more mTurk HITs to achieve the same number of labeled samples. This directly increases the cost of labeling for Calpric, but may lead to higher quality training data. However, we also need to take into account that some segments may be inherently ambiguous and thus may never achieve an AP greater than the AT, regardless of how many HITs are published.

As a result, while Calpric uses a default AT of 80%, its AT is configurable, and we evaluate the effect of AT on Calpric’s accuracy in Section VII-E.

VI-B3 Re-labeling Strategies

Recall that labels for segments where the query oracle does not achieve an AP above the AT are not added to \mathbb{X}^{L}. Calpric implements two options for handling such segments:

Label and Discard: This option compensates for the possibility that some labels will be discarded because they do not pass the AT. In order to obtain the 30 labels Calpric aims to add to \mathbb{X}^{L} in each iteration, we use the average AR to estimate the total number of segments we should submit for labeling. In our experiments, we estimate the AR for our survey to be 73%, which requires Calpric to request 42 segments to be labeled in each iteration. Segments that do not meet the AT are discarded and can never be selected by the query strategy again, based on the assumption that they are inherently ambiguous.

Incremental Re-labeling: The above strategy may be overly conservative in discarding segments that fail the AT test after only one iteration, as segments may also fail the test due to poor crowdsourced labelers. Discarding such segments reduces the overall unlabeled pool the querying strategy can select from. This motivates an incremental re-labeling strategy, which we implement based on that of Zhao et al. [44]. In this approach, we publish the exact number of segments we aim to have labeled (30). Instead of discarding segments that fail the AT test, Calpric republishes those segments in following iterations. We use N to represent the pre-set number of labeling iterations (i.e., the maximum number of labeling requests for one policy segment); in our case, we set it to 3. The following example shows the workflow: in the first labeling iteration (N=1), we publish 30 unlabeled segments to Amazon mTurk, each labeled by 5 crowdworkers. Any label with AP < AT is considered unaligned, and the query oracle requests an additional 5 labels for the same segment in the second labeling iteration (N=2), resulting in a total of 10 labels. We repeat the same consolidation process to check whether the labels for each segment reach an agreement. If a clear majority fails to emerge after a total of N=3 tries, resulting in a total of 15 labels for the segment, we mark the sample as ambiguous and remove it from the training pool.
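
A sketch of this incremental re-labeling workflow is shown below; request_labels is a hypothetical helper that publishes a segment to mTurk and returns the requested number of Turker answers, and consolidate_label is the majority-vote helper from the earlier sketch.

    def incremental_relabel(segment, request_labels, acceptance_threshold=0.8,
                            labels_per_round=5, max_rounds=3):
        """Re-label a segment for up to max_rounds (N=3) rounds, 5 labels per round."""
        votes = []
        for _ in range(max_rounds):
            votes += request_labels(segment, labels_per_round)       # hypothetical helper
            label = consolidate_label(votes, acceptance_threshold)   # from the earlier sketch
            if label is not None:          # AP >= AT: an aligned label was found
                return label
        return None                        # still ambiguous after 15 labels: drop the segment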

VII Evaluation and Measurements

For the evaluation, we present measurements of the data distribution in Section VII-A, data similarity in Section VII-B, and a comparison of PPWE-biLSTM with BERT in Section VII-C. We then evaluate the effectiveness of Calpric against the non-AL baseline model (Section VII-D), as well as its sensitivity to various configuration parameters by comparing the performance of different querying strategies. We present an evaluation of different AT values in Section VII-E and explore different options for dealing with non-alignment in Section VII-F. Lastly, we show how the active learning algorithm addresses the issue of class imbalance (Section VII-G).

VII-A Data Distribution

As mentioned, we focus on policy segments for First Party Collection/Use, with the data types being geographical location, device information, and contact information. We extracted these labels from APP-350 and OPP-115 and grouped them into the same categories. We analyzed the two corpora and summarize the results in Tables III and IV. The numbers 115 and 350 in OPP-115 and APP-350 indicate the respective numbers of privacy policies covered by each dataset; note that the number of policies covered in a corpus is different from the number of segments with useful labels.

                      Location   Device   Contact   Total
First Party Actions      403       608      1019     2030
- Positive Samples       396       604       990     1990
- Negative Samples         7         4        29       40
TABLE III: Data distribution of the OPP-115 corpus (number of segments per category)
                      Location   Device   Contact   Total
First Party Actions      852      1633      1608     4093
- Positive Samples       669      1394      1270     3333
- Negative Samples       183       239       338      760
TABLE IV: Data distribution of the APP-350 corpus (number of segments per category)

We observe that the number of segments of interest in APP-350 is much larger than that in OPP-115. Note that the web-based OPP-115 dataset is more fine-grained and covers a larger set of categories, whereas the mobile application policies in APP-350 tend to be simpler. Policy segments in OPP-115 are mostly short paragraphs, while APP-350 has a smaller average segment length, usually one to two sentences. Similar to APP-350, our data are collected from Android mobile applications, and our segmentation algorithm generates policy segments with lengths similar to those of APP-350.

One of the existing issues in the classification of privacy policies is the lack of negative samples. As we can see from the data distribution of OPP-115, the positive/negative ratio is highly imbalanced; in fact, the average Negative Sample Ratio (NSR) is only 2.0%. Research shows that such imbalanced training data can result in poor performance for many machine learning models [15]. While APP-350 addresses the class imbalance issue by introducing synthetic negative data with manual modifications, which may not be the ideal solution, its NSR is only 18.6%. In our proposed model, the active learner is able to select the most representative segments and balance the training set automatically.

VII-B Data Similarity

                    OPP-115 &     OPP-115 &     APP-350 &     Category
                    APP-350       CPPS          CPPS          Average
Contact             0.79/0.77     0.83/0.75     0.82/0.76     0.81/0.76
Device              0.74/0.71     0.76/0.73     0.71/0.70     0.73/0.71
Location            0.75/0.77     0.81/0.77     0.78/0.76     0.78/0.77
Corpus Average      0.76/0.75     0.80/0.75     0.76/0.74     0.78/0.75
TABLE V: Inter-similarity / re-segmented inter-similarity (similarity distance)

We conduct data similarity tests on OPP-115, APP-350, and our Crowdsourcing Privacy Policy Segments (CPPS) for two reasons: to evaluate our proposed segmentation algorithm and to show that CPPS is similar enough to APP-350 that it is feasible to use labels in APP-350 to evaluate our model trained on CPPS.

We extract segments from the three datasets and select 100 in each category: contact, location, and device. We evaluate similarity on segments rather than entire privacy policies because our classifiers are trained on policy segments. Note that even within a set of data obtained from a single source, individual policies may be very different from one another in terms of the way they are structured, their length, their complexity, and their wording preferences. To address this, we introduce three similarity measurements:

  1.

    Intra-comparison (Intra): Evaluates data similarity within the same corpus by comparing the similarity of segments that share the same labeling categories.

  2.

    Inter-comparison (Inter): Evaluates data similarity across different corpora for the same categories (e.g., location labels in APP-350 vs. location labels in CPPS).

  3.

    Inter-comparison on re-segmented policies (Re-inter): Instead of using the original policy segments in OPP-115 and APP-350, which were extracted manually by human labelers, we re-segment the complete policy texts of these two datasets using Calpric’s segmentation algorithm described in Section IV-B and evaluate inter-similarity on these segments.

While many existing similarity measures are intolerant of disjoint distributions, WMD is able to produce a smooth measure even if two sentences do not share any words in common. We therefore select WMD, apply it on top of our customized Fasttext word embeddings, and compute it over the 100 segments from each dataset. The overall intra-similarity for a corpus is defined as the average of all text similarities measured across the selected segments. The similarity value represents the distance between sentences; in other words, a lower distance value indicates a higher similarity. We denote this as Similarity Distance (SD).

To evaluate our segmentation tool, we tabulate Inter/Re-inter in Table V, with the respective values in each cell separated by slashes. Re-inter similarity distances are generally lower than Inter similarity distances because the original segments in OPP-115 and APP-350 were created differently. However, the average difference between Inter and Re-inter is reasonably small, only 0.3%. We therefore conclude that our automatic segmentation method produces segments similar to the ones manually created by skilled human labelers.

Our subsequent experiments use a test set of segments drawn from the APP-350 dataset. To ensure that the underlying distributions of these datasets are similar, we compare Inter and Intra across all three datasets. The three corpora share similar Intra values, with an average of 0.75. Compared to this value, Inter across different corpora is slightly higher, with an average SD of 0.78, indicating a slightly larger difference across corpora than within the same set. On closer inspection, we observe that APP-350 and CPPS are very similar in terms of segment structure and labels (0.77), whereas OPP-115 and CPPS form the most distinct combination of all (0.80). We therefore conclude that it is feasible to use APP-350 as the validation set, since the inter-similarity between APP-350 and CPPS is relatively close to the intra-similarity of CPPS.

VII-C Classification Model Evaluation

In this section, we demonstrate that the proposed PPWE-biLSTM gives an accuracy similar to the state-of-the-art BERT model, while significantly outperforming the baseline LR classifier, which was widely used in previous privacy policy studies.

We randomly selected 450 segments from each category of the unlabeled training pool \mathbb{X}^{U}. The segments were labeled by crowdsourcing workers. We applied our post-processing algorithms to discard low-confidence labels and consolidated them into three sets of training data for contact, location, and device, each containing 300 labeled segments. We prepared three test sets, each containing 300 labeled segments randomly selected from the APP-350 Corpus.

For each classification algorithm, we trained three classifiers and evaluated them using the prepared test sets. The experimental results are shown in Table VI, where we use biLSTM as an abbreviation for PPWE-biLSTM. We use two metrics to measure accuracy: the F1 score and the Matthews Correlation Coefficient (MCC). MCC, which is equivalent to the mean square contingency coefficient, represents the correlation between targets and predictions; we include the formula in Appendix B. In binary classification, MCC is more informative since it takes into account the balance ratios of the four confusion matrix categories, especially when class imbalance issues [4] exist. The value varies between -1 and +1, where 1 is a perfect agreement between the actual and predicted labels, -1 is a perfect disagreement, and 0 occurs when the predictions are random with respect to the actual labels. We include this additional metric since accuracy and F1 scores are asymmetric and sensitive to data imbalance. We perform an additional experiment using the OPP-115 and APP-350 Corpora. Similar to the previous experiment, the classifiers are trained on 300 labels selected from each of the three categories in the datasets. As an example, Table VII shows the results for the location classifiers; results for the other categories are included in Appendix F.
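
For reference, the standard definition of MCC in terms of the binary confusion matrix counts (the formula included in Appendix B) is:

    MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}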

From the two tables we observe:

  1.

    Both PPWE-biLSTM and BERT outperform the LR baseline model significantly, with average F1 scores of 89.7% and 90.1% versus 67.4%, and average MCC values of 61.3% and 63.8% versus 10.7%.

  2.

    Based on the F1 score and MCC values, our PPWE-biLSTM model achieves an accuracy similar to BERT, whereas its training time is significantly shorter than that of BERT. We therefore proceed to investigate active learning approaches with the PPWE-biLSTM model.

  3.

    The average training accuracy, F1, and MCC of PPWE-biLSTM are 98.2±1.2%, 98.4±0.9%, and 97.1±1.6%, respectively. Compared with the testing results, there is a reasonable difference of <10%. As an additional example, Figure 2 shows the trend of F1 and MCC values for contact classifiers trained on 2000 labeled segments, which further confirms that these setups do not suffer from over-fitting.

  4.

    Although the classifiers trained on OPP-115 have very high accuracy, precision, and recall values, the MCC is inferior due to the unbalanced dataset. The lack of negative samples results in nearly no training on DoesNot data actions; that is, the classifier predicts every label as positive.

Figure 2: F1 and MCC for PPWE-biLSTM and BERT models trained on contact segments
Category   Model    Acc.     Prec.    Recall   F1       MCC
Contact    LR       52.5%    58.7%    49.1%    53.5%    5.89%
           biLSTM   80.5%    80.5%    96.6%    88.6%    58.3%
           BERT     82.7%    89.1%    85.6%    91.2%    62.5%
Device     LR       76.8%    86.9%    85.9%    86.4%    7.10%
           biLSTM   81.6%    97.4%    82.9%    88.7%    46.7%
           BERT     84.3%    83.4%    97.9%    89.8%    53.4%
Location   LR       58.6%    72.3%    54.8%    62.4%    19.1%
           biLSTM   92.7%    92.7%    90.6%    91.7%    58.6%
           BERT     90.1%    98.3%    85.6%    91.2%    62.0%
TABLE VI: Performance comparison of LR, PPWE-biLSTM, and BERT
        Model    Acc.     Prec.    Recall   F1       MCC
OPP     LR       94.9%    94.9%    99.9%    97.4%    0.00%
        biLSTM   97.9%    97.9%    100%     98.9%    0.00%
APP     LR       63.6%    75.7%    50.9%    60.9%    31.3%
        biLSTM   90.7%    87.9%    95.8%    90.8%    82.3%
CPPS    LR       58.6%    72.3%    54.8%    62.4%    19.1%
        biLSTM   92.7%    92.7%    90.6%    91.7%    58.6%
TABLE VII: Performance of different location classifiers

VII-D Evaluation of Querying Strategies

Moving to the performance evaluation of active learning, the following experimental setups are used in all related tests. We specify a batch size of 20 for the initial boot-strap phase, which is non-AL, and a batch size of 8 for the stage-two active learning phase, even though the number of new instances labeled by crowdworkers may not be constant. A large training pool leads to significant memory consumption during training; for testing efficiency, we limit the unlabeled training pool \mathbb{X}^{U} to 12K segments, approximately 4K for each category, randomly selected from the 52K policies crawled from Google Play. Note that the actual labeled training set contains fewer segments, as some instances queried by the learner do not pass the AT and fail to be aligned. Unless specified otherwise, we repeat each experiment three times and report the average accuracy.

We conduct two sets of experiments using different validation data. Both test sets are generated from the APP-350 corpus and consist of 300 labels for each category. The difference is that labels in Test_set_1 are randomly selected, while Test_set_2 is selected to be a balanced set. The former is used to evaluate model accuracy against the APP-350 dataset, whereas the latter represents a generalized true classification accuracy.

For the following experiments, we boot-strap our classifiers with 100 labeled segments. The initial data are prepared by random selection from the unlabeled pool \mathbb{X}^{U}, and are labeled and consolidated using the rule of majority. The labeled training data ready to be used are denoted by \mathbb{X}^{L}. In each active learning iteration, the active learner queries new instances from \mathbb{X}^{U}; the selected instances are labeled by Turkers and the aligned labels are added to \mathbb{X}^{L}. Note that the percentage of negative samples in the initial \mathbb{X}^{L} differs from one category to another: contact has the highest negative sample rate at 31%, whereas the percentage is 23% for location and only 8% for device.

In addition to the four querying strategies mentioned in Section VI-B1, we also include a baseline non-AL model, BASE, for comparison, which uses random sampling to obtain new instances. In both sets of experiments, we measure the performance of the querying algorithms using the F1 score and MCC, and conduct the same experiments for all three categories. We note that EER suffers from a much slower training speed than the other methods, as it requires a complete pass over all instances to calculate the expected error in each active learning iteration. However, the classifier training time is still overshadowed by the wait time for crowdsourcing tasks, which can take days to complete. We point out, though, that if we continued our experiments and labeled more segments, the training time would continue to grow while the labeling time would stay constant.

Figure 3: F1 score of location classifiers with different querying strategies (Test_set_1)
Figure 4: MCC score of location classifiers with different querying strategies (Test_set_1)

Figures 3 and 4 compare the F1 and MCC performance of different querying strategies for the location classifiers evaluated on Test_set_1. In both figures, each querying algorithm is associated with individual scores shown as dots and an estimated trendline based on its average performance. We observe that all active learning models outperform the baseline. As a reference, while the active models converge faster, BASE reaches an F1 of 98.3% with more than 2200 labels, which we denote as the converged F1 value. Figure 3 shows that BMU, the most effective active learning model, reaches an F1 of 93.4% (95% of the converged F1 value) with 191 labels, while BASE needs 375 labels for the same score. That is, the best active learning model uses 50.93% of the training labels used by BASE to achieve the same result. Similarly, BMU is able to achieve F1 = 97.3% (99% of the converged F1 value) using 749 labels, which is only 48.97% of the number of training samples used by BASE.

To evaluate how much saving in training labels active learning provides, we introduce two Percentile Scores (PS): PS_high and PS_low, representing the 90th and 85th percentiles for MCC, and the 99th and 95th percentiles for the F1 score, since F1 values converge faster. We also define the Training Effort Percentage (TEP) for each Percentile Score (PS) as TEP = \frac{n_{AL}}{n_{BASE}}, where n_{BASE} is the number of training samples used by BASE to achieve a specific performance goal, and n_{AL} is that number for the best-case active learning model. A lower TEP indicates a more effective active learning algorithm, that is, a querying strategy that uses fewer training samples to achieve the same performance.
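
As a concrete instance using the location F1 numbers above (191 labels for BMU versus 375 labels for BASE):

    TEP = \frac{n_{AL}}{n_{BASE}} = \frac{191}{375} \approx 50.9\%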

Returning to the example of Figure 3, for the location classifiers, the 48.97% we calculated is the TEP at PS_low for the F1 score, whereas the TEP at PS_high is 50.93%; both are also shown in Table VIII. In general, while all active learning querying models outperform BASE, BMU has the best performance of all. We also observe that, compared to the baseline, the largest amount of training effort saved by the active learning models occurs for the device classifiers, where the initial training set is the most imbalanced. The improvements for the contact classifiers are relatively small compared to the other two categories, as the contact dataset contains more negative samples. Further evaluations of active learning models and class imbalance are presented in Section VII-G.

Category   Metric   Converged Value   PS_low    PS_high
Contact    MCC      0.78              70.87%    71.43%
           F1       0.93              85.37%    85.63%
Device     MCC      0.90              27.11%    26.53%
           F1       0.99              35.00%    38.89%
Location   MCC      0.91              42.85%    31.88%
           F1       0.98              48.97%    50.93%
TABLE VIII: Performance improvement: AL vs. non-AL (Test_set_1)

We also observe from Table VIII that the most effective active learning models are the ones trained on the device category, whereas the contact active learning classifiers do not save as much labeling effort. This is reasonable, since privacy policy segments on device information tend to be simpler and more straightforward than those on contact and location information, and the average length of device segments is also slightly shorter. For example, the main indicators for device labels are “device information” and “IP address”, whereas the contact category contains many more relevant keywords, such as email, phone number, contact book, and different social network accounts such as Facebook, Twitter, or Google. Moreover, the keyword “contact” may also carry other meanings; for instance, most privacy policies include a section for users to contact the app developers. Such sentences are also associated with email addresses or phone numbers, making it difficult for the classifiers to differentiate these segments from the actual data practice of collecting/using contact information. As a result, more labeling effort is required to achieve a nearly converged accuracy.

In the second set of experiments, we evaluate the models using the balanced validation set Test_set_2. As the initial training set has significant class imbalance, performance scores on Test_set_2 converge more slowly than those on Test_set_1. We therefore focus on the early stage of training and investigate the amount of labeling effort saved by active learning models to achieve an early performance goal. We set the stopping criteria for all classification models to be MCC > 0.2 and F1 > 70%. We run experiments on the three categories and observe results similar to those for Test_set_1: all active learning models outperform BASE, with BMU being the fastest model to reach the stopping goal. Figure 5 summarizes the difference in labeling effort between the active learning and non-AL BASE models. We conclude that our active learning model is able to achieve the same performance with fewer training labels, whether the test sets are perfectly balanced or reflect the actual data distribution of the input space.

Similar to the previous experiments, the TEP is calculated as the ratio of $n_{AL}$ to $n_{BASE}$. However, we observe a difference in these TEP values: the device active learning classifier has a higher TEP than the other two categories, while the TEP for the contact classifier drops to the lowest. This is also reasonable, as the second experiment examines the early stage of the training process. Looking at the actual numbers of training samples shown in Figure 5, we observe that the classifiers for all data categories require approximately 200 labels to surpass a 70% F1 score, and 280 labels to achieve an MCC higher than 0.2.

Figure 5: Training effort on Test_set_2: AL vs. non-AL

VII-E Impacts of Acceptance Threshold

To investigate the relationship between AT values and learning speed, we experiment with three AT values (100%, 80%, and 60%) using the same active learning model. The testing model uses BMU, which was shown to be the best querying strategy in the previous section. We aim to introduce 30 new labels in each active learning iteration, so we set the learner to query 42 unlabeled segments per iteration, according to the average alignment rate of 73%. After annotation, Calpric downloads the resulting labels and processes them by majority vote: it consolidates the labels for a segment only when the number of majority votes reaches the threshold under test. For instance, if the current AT is 1.0, we train our classifiers using only the perfectly aligned labels (all workers agree on the same label) and discard the remaining ones where AR < AT.
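For clarity, a minimal sketch of this consolidation step is shown below (our own reimplementation of the rule described above; the function and variable names are hypothetical):

from collections import Counter

def consolidate(worker_labels, acceptance_threshold=0.8):
    """Return the majority label if its alignment rate (AR) reaches the AT,
    otherwise None (the segment is discarded or sent for re-labeling)."""
    label, votes = Counter(worker_labels).most_common(1)[0]
    alignment_rate = votes / len(worker_labels)
    return label if alignment_rate >= acceptance_threshold else None

print(consolidate([1, 1, 1, 1, 0], acceptance_threshold=0.8))  # 1    (AR = 0.8)
print(consolidate([1, 1, 1, 0, 0], acceptance_threshold=0.8))  # None (AR = 0.6)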

Figure 6: MCC performance for different AT

We define each crowdsourced label obtained from Amazon mTurk as one unit of labeling effort (LE). In other words, LE also counts segments that are sent for labeling but do not reach the minimum AT. In this experiment, we allocate 8000 units of LE to each AT value during the active querying phase; that is, the active learning model can query up to 1600 different unlabeled segments, each labeled by five Turkers, even if some of them will not align. We set the bootstrap LE at 100 for each classifier, which means the classifiers may end up with different numbers of initial training labels because of the different AT values. We train classifiers separately using the post-processed aligned labels under each pre-set AT, and evaluate their performance on the balanced validation set Test_set_2.

Figure 6 shows learning speed vs. MCC performance. Both AT=0.8 and AT=1.0 outperform AT=0.6. While they share a similar performance trend, AT=1.0 is less stable, as shown by its large fluctuations, because its training set is significantly smaller than those of AT=0.8 and AT=0.6. We include a figure showing the relationship between the number of aligned labels and the amount of allocated LE in Appendix E. On the other hand, despite having a large training set, AT=0.6 yields the worst performance, because the low AT value admits low-confidence labels into the dataset, introducing unwanted label noise. We include the graph of the F1 improvement trend in Appendix D, as it is less pronounced than that of MCC. In general, active learning yields a larger improvement in MCC than in F1, because active learning in our study naturally targets the class imbalance problem, and MCC directly measures how balanced the classifiers' predictions are. In practice, to improve MCC, one should train the model on a more balanced dataset; the most effective way to increase the F1 score, by contrast, is to train on a larger training set. Our AT experiments are done on relatively small training sets with an average of 847 aligned labels, so the improvement in F1 is rather modest.

The full performance comparison is summarized in Table IX. Although AT=1.0 and AT=0.8 outperform AT=0.6 in most cases, we observe an exception for the location classifiers: when AT is set to 1.0, both the F1 and MCC performance drop to the worst among the three AT values.

Category | AT | Aligned Labels | F1 | MCC
Contact | 0.6 | 1266 | 0.6318 | 0.1273
Contact | 0.8 | 673 | 0.6535 | 0.1870
Contact | 1.0 | 239 | 0.6503 | 0.1889
Device | 0.6 | 1308 | 0.6499 | 0.0909
Device | 0.8 | 885 | 0.6583 | 0.1059
Device | 1.0 | 547 | 0.6547 | 0.1147
Location | 0.6 | 1333 | 0.6795 | 0.1167
Location | 0.8 | 829 | 0.6999 | 0.1596
Location | 1.0 | 542 | 0.6551 | 0.0679
TABLE IX: AT performance comparison using 1600 queried segments

While the other classifiers are trained on samples with similar NSR despite different AT settings, the location classifiers are not. We calculated the standard deviation of NSR within each category: 0.0425 for contact, 0.0301 for device, and 0.1239 for location. Specifically, labels consolidated with AT=0.6 and AT=0.8 contain an average of 52.0% and 52.1% negative samples, respectively, whereas AT=1.0 has an NSR of only 25.5%. Furthermore, as the AT value increases, the total number of aligned labels decreases, leaving only 138 negative labels for the AT=1.0 classifier. It is possible that some crowdsourcers did not agree on the labels of potential negative samples; because AT is set to 100%, any disagreement prevents a label from passing the AT, and such unaligned labels cannot be added to the training set. Since we test on a relatively small $\mathbb{X}^{U}$ (4K unlabeled segments), we may have exhausted the negative samples quickly, leaving the active learner unable to query more informative segments. Trained on an imbalanced dataset, the location AT=1.0 model suffers from under-fitting, and its F1 score and MCC reflect this. We explore re-labeling methods for handling discarded labels in Section VII-F, and further investigate the impact of negative samples in Section VII-G.

Also note that the AR for AT=0.8 in this experiment appears lower than the 73% observed in previous experiments, in which instances were randomly selected. We suspect this is due to active selection bias: the queried segments contain more uncertainty than others, making it harder for crowdsourcers to agree on a label.

From the above experiments, we conclude that AT=0.8 is an appropriate pre-set threshold for active learning policy classification systems of a size similar to our prototype. That said, given a very large unlabeled training pool and an unlimited amount of LE, AT=1.0 achieves the performance ceiling, as it has the lowest probability of introducing additional label noise into the training set.

VII-F Re-labeling Strategies

In theory, we should not discard unaligned labels in active learning classification, as they contain a significant amount of information that could potentially improve model performance. As described in Section VI-B3, we implement two methods for handling unaligned labels: Label and Discard (L&D) and Incremental Re-Labeling (IRL). In this section, we compare these strategies using the following setup: active learning models with the BMU querying strategy and the AT value set to 80%. We use the 12K training pool and evaluate the classifiers on Test_set_2.

We run the IRL model on the classifiers for the three categories and allow each classifier to re-label up to 100 segments; this number is denoted as the allocated re-labeling resource. As mentioned, we set the maximum number of labeling iterations to N=3. Under these settings, if we re-label a segment once (N=2) and the resulting total of 10 crowdsourcers still cannot agree on a single label, we can publish this segment once more (N=3), resulting in a total of 15 labels. If we publish the same segment twice, we count this as two units of the re-labeling resource.
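The loop below is an illustrative sketch of this procedure (our own; request_labels and consolidate are hypothetical helpers standing in for publishing a HIT and for the AT-gated majority vote of Section VII-E):

def incremental_relabel(segment, request_labels, consolidate,
                        max_iterations=3, labels_per_round=5):
    votes, used_resource = [], 0
    for iteration in range(1, max_iterations + 1):
        votes += request_labels(segment, labels_per_round)  # publish / re-publish the HIT
        if iteration > 1:
            used_resource += 1       # each re-publication costs one re-labeling unit
        label = consolidate(votes)   # majority vote gated by the AT
        if label is not None:
            return label, used_resource
    return None, used_resource       # still unaligned after N rounds: discard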

We introduce a measure of how effective the re-labeling process is: the Re-labeling Success Rate (RSR), defined as the ratio of the number of successfully aligned labels after re-labeling to the total allocated re-labeling resource. We calculate the RSR for each classifier. Unfortunately, the results for re-labeling are not promising: out of 100 re-labeled segments, only 2 passed the AT after N=2 and only 1 passed the AT after N=3, resulting in an overall RSR of 3%.
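In our notation (the symbols below are ours), this can be written as:

$RSR=\frac{n_{\mathrm{aligned}}}{n_{\mathrm{allocated}}}$,

where $n_{\mathrm{aligned}}$ is the number of re-labeled segments whose labels eventually pass the AT and $n_{\mathrm{allocated}}$ is the allocated re-labeling resource.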

One possible reason for such a low RSR is that, because our survey questions are carefully designed and reviewed and are paired with qualification tests, crowdsourcers can provide correct labels as long as they pay attention to the questions; segments that fail to align in the first round are therefore unlikely to succeed later. Furthermore, if a segment is longer and more complex than usual, the chance of misinterpretation increases, lowering the rate of label alignment. Our experiment supports this: most segments that were re-labeled and then aligned are longer than average.

Another consideration is that some labels are ambiguous by nature. For instance, segments that were re-labeled and still failed to align include: "your GPS geo-location is not accessed without your consent" and "you will be asked for your permission each time a location-based service is requested to provide you with local search results." It is clear that the software is trying to access user information in order to provide a service, yet users also have the choice to disable such functionality or opt out of specific features. Even when crowdsourcers agree on a label for such segments, the label may not reflect the actual legal content. Worse, these labels may introduce additional noise and contaminate the training set.

We conclude that developing a more complex re-labeling design may not be an ideal way to avoid under-fitting. Many of the segments that do not reach the AT the first time are likely ambiguous and are unlikely to pass the AT even if more labels are obtained from Turkers, so it is generally a more efficient use of resources to discard such segments. Instead, we should construct a larger unlabeled training pool $\mathbb{X}^{U}$ so that the active learning system can query as many informative labels as possible.

VII-G Percentage of Negative Samples

In the previous experiments, we observe that the active learning models not only improve training speed but also mitigate the existing class imbalance problem in privacy policy classification.

Figure 7 shows the NSR trends of the labeled training set $\mathbb{X}^{L}$ under different querying strategies. We observe steep rises for LC, BMU, and EER, and a slightly higher NSR for ID than for BASE. This is reasonable, as the initial bootstrap training set contains fewer negative samples; the classifiers are more uncertain about such instances and query more of them in each active learning iteration. We note that all NSR values converge back toward the initial value as the size of $\mathbb{X}^{L}$ increases. This is due to our limited $\mathbb{X}^{U}$, which contains only 4K unlabeled segments that can potentially be queried. As mentioned, negative samples in privacy policies are much rarer than positive ones, so it is likely that all negative samples are used up as training continues.

Figure 8 confirms this hypothesis: all active learning models query more negative instances than the random baseline model. As a result, while negative samples remain in $\mathbb{X}^{U}$ for BASE, the other models run out of negative samples within the first few hundred learning iterations. For future work, it will be crucial to obtain a reasonably large $\mathbb{X}^{U}$ before training the active learning models, to ensure the pool contains an adequate number of negative samples. We can approximate the number of negative training labels needed to achieve a given goal, and back-calculate the required size of the unlabeled training pool $\mathbb{X}^{U}$ using the estimated NSR of the pool.
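As a rough illustration of this back-calculation (our own sketch with made-up numbers), one might estimate the required pool size as follows:

def required_pool_size(needed_negative_labels, estimated_pool_nsr, alignment_rate=0.73):
    # Each queried negative segment yields a usable label only if its crowd
    # labels align, so we also divide by the average alignment rate.
    return int(needed_negative_labels / (estimated_pool_nsr * alignment_rate))

print(required_pool_size(500, 0.10))  # ~6849 unlabeled segments for 500 negative labels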

Figure 7: Train set NSR for different querying strategies
Figure 8: Unlabeled pool NSR for different querying strategies

VIII Limitations and Discussion

As the first automated privacy policy analyzer that uses active learning, Calpric was designed and built as a prototype. Since we train on contact, location, and device segments, our classifiers identify only the data practices associated with these data types. We plan to expand our binary classifiers to more categories, as well as to a second level in which data practices are classified into First Party Collection/Use and Third Party Sharing. We also aim to develop an automated reporting tool that takes raw privacy policy text and outputs a human-readable report summarizing all user data mentioned in the policy.

IX Conclusion

We propose Calpric, the first crowdsourcing active learning framework that performs automated classification of privacy policies with high accuracy using only a small number of training labels. It proactively selects the most informative batches of policy segments from the unlabeled training pool, increasing the benefit of each label. Calpric opens opportunities for privacy policy classifiers to achieve high accuracy with a limited labeling budget, helping app users and regulators analyze data practices in privacy policies without reading through copious amounts of text.

References

  • [1] S. Bird, E. Klein, and E. Loper, Natural language processing with Python, 1st ed.   Beijing; Cambridge, MA: O’Reilly, 2009.
  • [2] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, Dec. 2017. [Online]. Available: https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00051
  • [3] T. N. Cardoso, R. M. Silva, S. Canuto, M. M. Moro, and M. A. Gonçalves, “Ranked batch-mode active learning,” Information Sciences, vol. 379, pp. 313–337, Feb. 2017. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0020025516313949
  • [4] D. Chicco and G. Jurman, “The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation,” BMC Genomics, vol. 21, no. 1, p. 6, Dec. 2020. [Online]. Available: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7
  • [5] E. Costante, Y. Sun, M. Petković, and J. den Hartog, “A machine learning solution to assess privacy policy completeness: (short paper),” in Proceedings of the 2012 ACM workshop on Privacy in the electronic society - WPES ’12.   Raleigh, North Carolina, USA: ACM Press, 2012, p. 91. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2381966.2381979
  • [6] A. Culotta and A. Mccallum, “Reducing labeling effort for structured prediction tasks.” in Proceedings of the National Conference on Artificial Intelligence, vol. 2, 01 2005, pp. 746–751.
  • [7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2018.
  • [8] G. Dias, E. Alves, and J. G. P. Lopes, “Topic segmentation algorithms for text summarization and passage retrieval: An exhaustive evaluation,” in AAAI, 2007.
  • [9] Y. Gal, R. Islam, and Z. Ghahramani, “Deep Bayesian active learning with image data,” arXiv:1703.02910 [cs, stat], Mar. 2017, arXiv: 1703.02910. [Online]. Available: http://arxiv.org/abs/1703.02910
  • [10] H. Harkous, K. Fawaz, R. Lebret, F. Schaub, K. G. Shin, and K. Aberer, “Polisis: Automated analysis and presentation of privacy policies using deep learning,” arXiv:1802.02561 [cs], Jun. 2018, arXiv: 1802.02561. [Online]. Available: http://arxiv.org/abs/1802.02561
  • [11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).   Las Vegas, NV, USA: IEEE, Jun. 2016, pp. 770–778. [Online]. Available: http://ieeexplore.ieee.org/document/7780459/
  • [12] N. Joselson and R. Hallen, “Emotion classification with natural language processing (comparing bert and bi-directional lstm models for use with twitter conversations),” Ph.D. dissertation, University of Cape Town, 2019. [Online]. Available: http://rgdoi.net/10.13140/RG.2.2.16482.27844
  • [13] J. Kim, M. El-Khamy, and J. Lee, “Residual LSTM: Design of a deep recurrent architecture for distant speech recognition,” in Interspeech 2017.   ISCA, Aug. 2017, pp. 1591–1595. [Online]. Available: http://www.isca-speech.org/archive/Interspeech_2017/abstracts/0477.html
  • [14] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980 [cs], Jan. 2017, arXiv: 1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980
  • [15] B. Krawczyk, “Learning from imbalanced data: Open challenges and future directions,” Progress in Artificial Intelligence, vol. 5, no. 4, pp. 221–232, Nov. 2016. [Online]. Available: http://link.springer.com/10.1007/s13748-016-0094-0
  • [16] M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger, “From word embeddings to document distances,” Proceedings on International Conference on Machine Learning, p. 10, 2015.
  • [17] D. D. Lewis and J. Catlett, “Heterogeneous uncertainty sampling for supervised learning,” in Machine Learning Proceedings 1994.   Elsevier, 1994, pp. 148–156. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/B978155860335650026X
  • [18] F. Liu, S. Wilson, P. Story, S. Zimmeck, and N. Sadeh, “Towards automatic classification of privacy policy text,” p. 5, 2018.
  • [19] P. Liu, X. Qiu, and X. Huang, “Recurrent neural network for text classification with multi-task learning,” arXiv:1605.05101 [cs], May 2016, arXiv: 1605.05101. [Online]. Available: http://arxiv.org/abs/1605.05101
  • [20] M. Lovett, S. Bajaba, M. Lovett, and M. J. Simmering, “Data quality from crowdsourced surveys: A mixed method inquiry into perceptions of Amazon’s Mechanical Turk Masters,” Applied Psychology, vol. 67, no. 2, pp. 339–366, Apr. 2018. [Online]. Available: http://doi.wiley.com/10.1111/apps.12124
  • [21] A. L. Maas, “Rectifier nonlinearities improve neural network acoustic models,” in International Conference on Machine Learning (ICML), 2013.
  • [22] R. Mccreadie, C. Macdonald, and I. Ounis, “Crowdsourcing a news query classification dataset,” in Proceedings of CSE, 2010.
  • [23] A. M. McDonald and L. F. Cranor, “The cost of reading privacy policies,” Isjlp, vol. 4, p. 543, 2008.
  • [24] M. Nakatani Shuyo, “langdetect: Language detection library ported from Google’s language-detection.” [Online]. Available: https://github.com/Mimino666/langdetect
  • [25] M. E. Peters and D. Lecocq, “Content extraction using diverse feature sets,” in Proceedings of the 22nd International Conference on World Wide Web - WWW ’13 Companion.   Rio de Janeiro, Brazil: ACM Press, 2013, pp. 89–90. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2487788.2487828
  • [26] S. V. Rouse, “A reliability analysis of Mechanical Turk data,” Computers in Human Behavior, vol. 43, pp. 304–307, Feb. 2015. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0747563214005809
  • [27] N. Roy and A. McCallum, “Toward optimal active learning through sampling estimation of error reduction,” in Proceedings of the Eighteenth International Conference on Machine Learning, ser. ICML ’01.   San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2001, p. 441–448.
  • [28] F. Schaub, T. D. Breaux, and N. Sadeh, “Crowdsourcing the extraction of data practices from privacy policies,” Association for the Advancement of Artificial Intelligence (AAAI), p. 2, 2014.
  • [29] T. Scheffer, C. Decomain, and S. Wrobel, “Active hidden Markov models for information extraction,” in Advances in Intelligent Data Analysis, G. Goos, J. Hartmanis, J. van Leeuwen, F. Hoffmann, D. J. Hand, N. Adams, D. Fisher, and G. Guimaraes, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2001, vol. 2189, pp. 309–318. [Online]. Available: http://link.springer.com/10.1007/3-540-44816-0_31
  • [30] B. Settles, “Active learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 6, no. 1, pp. 1–114, Jun. 2012. [Online]. Available: http://www.morganclaypool.com/doi/abs/10.2200/S00429ED1V01Y201207AIM018
  • [31] B. Settles and M. Craven, “An analysis of active learning strategies for sequence labeling tasks,” in Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing.   Honolulu, Hawaii: Association for Computational Linguistics, Oct. 2008, pp. 1070–1079. [Online]. Available: https://www.aclweb.org/anthology/D08-1112
  • [32] B. Settles, M. Craven, and L. Friedland, “Active learning with real annotation costs,” Proceedings of the NIPS Workshop on Cost-Sensitive Learning, 01 2008.
  • [33] C. E. Shannon, “A mathematical theory of communication,” ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, Jan. 2001. [Online]. Available: https://dl.acm.org/doi/10.1145/584091.584093
  • [34] Y. Shen, H. Yun, Z. Lipton, Y. Kronrod, and A. Anandkumar, “Deep active learning for named entity recognition,” in Proceedings of the 2nd Workshop on Representation Learning for NLP.   Vancouver, Canada: Association for Computational Linguistics, 2017, pp. 252–256. [Online]. Available: http://aclweb.org/anthology/W17-2630
  • [35] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 06 2014.
  • [36] Y. P. Sun, “Investigating the effectiveness of android privacy policies,” Master’s thesis, University of Toronto, 2018.
  • [37] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, “Learning sentiment-specific word embedding for Twitter sentiment classification,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).   Baltimore, Maryland: Association for Computational Linguistics, 2014, pp. 1555–1565. [Online]. Available: http://aclweb.org/anthology/P14-1146
  • [38] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” J. Mach. Learn. Res., vol. 2, p. 45–66, Mar. 2002. [Online]. Available: https://doi.org/10.1162/153244302760185243
  • [39] T. Watanabe, M. Akiyama, T. Sakai, H. Washizaki, and T. Mori, “Understanding the inconsistency between behaviors and descriptions of mobile apps,” IEICE Transactions on Information and Systems, vol. E101.D, no. 11, pp. 2584–2599, Nov. 2018. [Online]. Available: https://www.jstage.jst.go.jp/article/transinf/E101.D/11/E101.D_2017ICP0006/_article
  • [40] S. Wilson, F. Schaub, A. A. Dara, F. Liu, S. Cherivirala, P. Giovanni Leon, M. Schaarup Andersen, S. Zimmeck, K. M. Sathyendra, N. C. Russell, T. B. Norton, E. Hovy, J. Reidenberg, and N. Sadeh, “The creation and analysis of a website privacy policy corpus,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).   Berlin, Germany: Association for Computational Linguistics, 2016, pp. 1330–1340. [Online]. Available: http://aclweb.org/anthology/P16-1126
  • [41] S. Wilson, F. Schaub, R. Ramanath, N. Sadeh, F. Liu, N. A. Smith, and F. Liu, “Crowdsourcing annotations for websites’ privacy policies: Can it really work?” in Proceedings of the 25th International Conference on World Wide Web - WWW ’16.   Montreal, Quebec, Canada: ACM Press, 2016, pp. 133–143. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2872427.2883035
  • [42] L. Yang, Y. Zhang, J. Chen, S. Zhang, and D. Z. Chen, “Suggestive annotation: A deep active learning framework for biomedical image segmentation,” 2017.
  • [43] Y. Zhang, M. Lease, and B. C. Wallace, “Active discriminative text representation learning,” arXiv:1606.04212 [cs], Dec. 2016, arXiv: 1606.04212. [Online]. Available: http://arxiv.org/abs/1606.04212
  • [44] L. Zhao, G. Sukthankar, and R. Sukthankar, “Incremental relabeling for active learning with noisy crowdsourced annotations,” in 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing, 2011, pp. 728–733.
  • [45] P. Zhou, Z. Qi, S. Zheng, J. Xu, H. Bao, and B. Xu, “Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers.   Osaka, Japan: The COLING 2016 Organizing Committee, Dec. 2016, pp. 3485–3495. [Online]. Available: https://www.aclweb.org/anthology/C16-1329
  • [46] S. Zimmeck and S. M. Bellovin, “Privee: An architecture for automatically analyzing web privacy policies,” in 23rd USENIX Security Symposium (USENIX Security 14).   San Diego, CA: USENIX Association, Aug. 2014, pp. 1–16. [Online]. Available: https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/zimmeck
  • [47] S. Zimmeck, P. Story, D. Smullen, A. Ravichander, Z. Wang, J. Reidenberg, N. Cameron Russell, and N. Sadeh, “MAPS: Scaling privacy compliance analysis to a million apps,” Proceedings on Privacy Enhancing Technologies, vol. 2019, no. 3, pp. 66–86, Jul. 2019. [Online]. Available: https://content.sciendo.com/doi/10.2478/popets-2019-0037
  • [48] S. Zimmeck, Z. Wang, L. Zou, R. Iyengar, B. Liu, F. Schaub, S. Wilson, N. Sadeh, S. M. Bellovin, and J. Reidenberg, “Automated Analysis of Privacy Requirements for Mobile Apps,” in Proceedings 2017 Network and Distributed System Security Symposium.   San Diego, CA: Internet Society, 2017. [Online]. Available: https://www.ndss-symposium.org/ndss2017/ndss-2017-programme/automated-analysis-privacy-requirements-mobile-apps/

Appendix A Survey Questions

Below are the questions used in our Amazon mTurk HITs. Note that the questions are adjusted depending on the type of private information we are asking the Turkers to label (i.e., “contact” would be adjusted to “location” and “device ID” accordingly).

  1. Does the segment talk about FIRST PARTY data practice (collect/use information from users)?

     ○ Yes
     ○ No

  2. Does the segment claim to collect/use CONTACT information?

     ○ Yes
     ○ No

Appendix B Matthews Correlation Coefficient

Below is the formula for the Matthews Correlation Coefficient (MCC):

$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},$ (2)

where TP, TN, FP, FN are true positives, true negatives, false positives and false negatives, respectively.
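For reference, a direct Python transcription of Equation (2) (equivalent in value to scikit-learn's matthews_corrcoef when computed from the same confusion matrix; the example counts are arbitrary):

from math import sqrt

def mcc(tp, tn, fp, fn):
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # define MCC = 0 when undefined

print(mcc(tp=40, tn=35, fp=10, fn=15))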

Appendix C Additional Information on Active Learning Querying Strategies

Basic Uncertainty: Basic uncertainty sampling is also called the Least Confidence (LC) strategy, proposed by Culotta and McCallum [6]. The active learner selects the least certain instance in the unlabeled set and requests that it be labeled. The uncertainty score is calculated as:

$1-\underset{y\in Y}{\max}\;\mathbb{P}(y\mid\mathbf{x})$ (3)

Margin & Entropy Sampling: Other popular uncertainty querying strategies include Margin Sampling and Entropy Sampling. The former selects instances where the difference between the most likely and second most likely classes is smallest [29], whereas the latter chooses samples with the largest entropy over the class probabilities [33]. In other words, instead of minimizing the classification error, the entropy method tries to minimize the log loss.

For binary classification, margin and entropy-based sampling reduce to the LC strategy: all three methods query the instances whose class posterior is closest to 0.5, that is, the most ambiguous segments [30]. We confirmed this on our CPPS dataset. In the remaining evaluation, we therefore use Uncertainty Sampling to represent all three querying strategies.
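This equivalence is easy to check numerically; the snippet below (our own toy example with arbitrary posteriors) shows that the three strategies rank the same segment as most informative in the binary case:

import numpy as np

p = np.array([0.05, 0.35, 0.52, 0.80, 0.97])     # P(y=1 | x) for five segments
probs = np.stack([1 - p, p], axis=1)

least_confidence = 1 - probs.max(axis=1)          # LC score
margin = -np.abs(probs[:, 1] - probs[:, 0])       # negated margin (smaller gap = more informative)
entropy = -(probs * np.log(probs)).sum(axis=1)    # Shannon entropy of the posterior

# All three pick the segment whose posterior is closest to 0.5.
print(np.argmax(least_confidence), np.argmax(margin), np.argmax(entropy))  # 2 2 2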

Expected Error Reduction: Instead of considering individual instances alone, some methods take the entire input space into account, which has the potential to prevent sub-optimal queries. The Expected Error Reduction (EER) sampling method selects the instances that make the most impact on the current model; specifically, it measures how much the generalization error is likely to be reduced as new samples are introduced. Building on Roy and McCallum's work [27], we calculate the expected future error of a classifier using Monte Carlo estimation. The formula below estimates the binary loss used in our evaluation:

$\mathbb{E}_{P^{*}_{\mathbb{X}^{L}}}(\mathrm{binary})=\frac{1}{\left|\mathbb{X}^{U}\right|}\sum_{x\in\mathbb{X}^{U}}\Big(1-\underset{y\in Y}{\max}\,P^{*}_{\mathbb{X}^{L}}(y\mid x)\Big),$ (4)

where $\mathbb{X}^{U}$ is the unlabeled training pool and $P^{*}_{\mathbb{X}^{L}}$ is the learner's prediction distribution after the chosen query $(x^{*},y^{*})$ is added to the original labeled training set $\mathbb{X}^{L}$. We calculate the expected error for each possible label $y\in\{y_{0},y_{1}\}$ using $P^{*}_{\mathbb{X}^{L}}$. As the true label of $x^{*}$ is unknown before the query, we use the current model to estimate the label probabilities. We design our model with a p_subsample of 0.5 to improve runtime on large sample pools.
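A compact sketch of this estimation procedure is given below (our own; it uses a scikit-learn logistic regression as a stand-in for the deep classifier and omits the p_subsample step for brevity):

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def expected_binary_loss(model, X_unlabeled):
    proba = model.predict_proba(X_unlabeled)
    return np.mean(1.0 - proba.max(axis=1))        # the inner term of Equation (4)

def eer_query(model, X_labeled, y_labeled, X_unlabeled):
    """Return the index of the unlabeled instance whose addition is expected
    to minimize the estimated future binary loss."""
    current_proba = model.predict_proba(X_unlabeled)
    best_idx, best_loss = None, np.inf
    for i, x in enumerate(X_unlabeled):
        expected = 0.0
        for y in (0, 1):                           # hypothesize each possible label for x
            m = clone(model).fit(np.vstack([X_labeled, x]),
                                 np.append(y_labeled, y))
            expected += current_proba[i, y] * expected_binary_loss(m, X_unlabeled)
        if expected < best_loss:
            best_idx, best_loss = i, expected
    return best_idx

# usage sketch: model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
#               idx = eer_query(model, X_labeled, y_labeled, X_unlabeled)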

Density-Weighted Methods: Settles and Craven [31] propose the Information Density (ID) framework. When querying new samples, ID considers not only uncertain instances but also those representative of dense regions of the input space. The ID value of an instance $x$ is calculated as follows:

$\mathbf{ID}(x)=\frac{1}{\left|\mathbb{X}^{U}\right|}\sum_{\bar{x}\in\mathbb{X}^{U}}sim(x,\bar{x}),$ (5)

where $sim(x,\bar{x})$ is a similarity function such as cosine similarity or Euclidean similarity, and $\bar{x}$ ranges over the other instances in the input distribution; the ID value is thus the average similarity between $x$ and all other instances. A higher ID value indicates a closer relationship between the given instance and the rest of the samples in $\mathbb{X}^{U}$.
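A short sketch of the density term in Equation (5), assuming precomputed segment embeddings and cosine similarity (our own illustration; for simplicity the average includes the instance itself):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def information_density(embeddings_unlabeled):
    sim = cosine_similarity(embeddings_unlabeled)   # pairwise sim(x, x') over the pool
    return sim.mean(axis=1)                         # average similarity for each instance x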

Batch Sampling Querying: Traditional active learning query strategies suffer from sub-optimal record selection when n_instances > 1, that is, when querying multiple instances per iteration. The Batch-Mode Uncertainty (BMU) sampling method addresses this issue by enforcing an importance ranking over the records in each batch. Our design uses the modAL active learning framework, which builds on the work of Cardoso et al. [3]: their framework prioritizes diversity in the initial iterations, providing a global view of the input distribution, and, as the number of labeled instances grows, shifts priority to instances that the classifier is uncertain about. The ranking score of each instance $x$ is calculated as:

$BMU(x)=\alpha\,\big(1-sim(x,\mathbb{X}^{L})\big)+(1-\alpha)\,LC(x),$ (6)

where $\alpha=\frac{|\mathbb{X}^{U}|}{|\mathbb{X}^{U}|+|\mathbb{X}^{L}|}$ and $LC(x)$ is the uncertainty of the prediction for $x$ computed by the LC algorithm. The similarity function $sim(x,\mathbb{X}^{L})$ measures to what extent the feature space near $x$ has already been explored. The BMU score is calculated for all $x$ in $\mathbb{X}^{U}$ and ranked; in each active learning iteration, a preset number of the highest-ranked instances is selected from the pool, and the scores are then re-calculated.
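A minimal end-to-end sketch of BMU querying with modAL is shown below. It is our own illustration, substituting a logistic regression for Calpric's deep classifier and random vectors for segment embeddings, with 42 queries per iteration as in Section VII-E:

import numpy as np
from sklearn.linear_model import LogisticRegression
from modAL.models import ActiveLearner
from modAL.batch import uncertainty_batch_sampling

rng = np.random.default_rng(0)
X_seed = rng.normal(size=(100, 300))               # bootstrap embeddings (placeholder)
y_seed = rng.integers(0, 2, 100)                   # bootstrap labels (placeholder)
X_pool = rng.normal(size=(4000, 300))              # unlabeled pool embeddings

learner = ActiveLearner(
    estimator=LogisticRegression(max_iter=1000),
    query_strategy=uncertainty_batch_sampling,     # ranked batch-mode sampling (Cardoso et al. [3])
    X_training=X_seed, y_training=y_seed,
)

query_idx, query_segments = learner.query(X_pool, n_instances=42)
# ...send the queried segments to mTurk, consolidate the aligned labels, then:
# learner.teach(X_pool[aligned_idx], aligned_labels)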

Appendix D Evaluation of F1 Performance for Different AT Using Contact Classifiers

Figure 9 shows the trend of learning speed vs. F1 score. In general, the scores in all three cases are relatively close to each other. However, AT=1.0 is less stable than the others, as shown by its large fluctuations, because its training set is significantly smaller than those of AT=0.8 and AT=0.6.

Figure 9: F1 performance for different AT on contact classifiers

Appendix E Number of Aligned Labels vs. Labeling Effort (LE) for Different AT Using Contact Classifiers

Figure 10 shows how many labeled instances are generated from the same LE when the AT values vary. We observe a pattern in which the lower the AT is, the more aligned labels we receive from the same amount of allocated LE.

Figure 10: Aligned labels vs. LE for different AT on contact classifiers

Appendix F Evaluation Using Single-sourced Dataset

In addition to Table VII, evaluation results for the other two categories are included in Tables X and XI. We use biLSTM as an abbreviation for PPWE-biLSTM.

Dataset | Model | Acc. | Prec. | Recall | F1 | MCC
Opp | LR | 88.9% | 91.2% | 96.5% | 93.8% | 43.3%
Opp | biLSTM | 97.9% | 97.9% | 100% | 98.9% | 0.00%
App | LR | 64.6% | 68.5% | 67.3% | 67.9% | 28.6%
App | biLSTM | 87.7% | 96.9% | 82.0% | 88.5% | 76.9%
CPPS | LR | 52.5% | 58.7% | 49.1% | 53.5% | 5.89%
CPPS | biLSTM | 88.6% | 83.5% | 95.3% | 88.7% | 75.8%
TABLE X: Performance of contact classifiers trained on LR and PPWE-biLSTM models
Dataset | Model | Acc. | Prec. | Recall | F1 | MCC
Opp | LR | 96.9% | 96.9% | 99.9% | 98.5% | 0.00%
Opp | biLSTM | 100% | 100% | 100% | 100% | 0.00%
App | LR | 59.6% | 64.2% | 61.8% | 62.9% | 18.6%
App | biLSTM | 88.6% | 86.2% | 90.6% | 88.2% | 74.6%
CPPS | LR | 76.8% | 86.9% | 85.9% | 86.4% | 7.10%
CPPS | biLSTM | 81.6% | 97.4% | 82.9% | 88.7% | 46.7%
TABLE XI: Performance of device classifiers trained on LR and PPWE-biLSTM models