Muhammad Haroon
Avengers Ensemble! Improving Transferability of Authorship Obfuscation
Abstract
Stylometric approaches have been shown to be quite effective for real-world authorship attribution. To mitigate the privacy threat posed by authorship attribution, researchers have proposed automated authorship obfuscation approaches that aim to conceal the stylometric artefacts that give away the identity of an anonymous document’s author. Recent work has focused on authorship obfuscation approaches that rely on black-box access to an attribution classifier to evade attribution while preserving semantics. However, to be useful under a realistic threat model, it is important that these obfuscation approaches work well even when the adversary’s attribution classifier is different from the one used internally by the obfuscator. Unfortunately, existing authorship obfuscation approaches do not transfer well to unseen attribution classifiers. In this paper, we propose an ensemble-based approach for transferable authorship obfuscation. Our experiments show that if an obfuscator can evade an ensemble attribution classifier, which is based on multiple base attribution classifiers, it is more likely to transfer to different attribution classifiers. Our analysis shows that ensemble-based authorship obfuscation achieves better transferability because it combines the knowledge from each of the base attribution classifiers by essentially averaging their decision boundaries.
1 Introduction
Authorship obfuscation is the process of concealing stylometric markers in a text document that may reveal the identity of its author. The problem has become increasingly relevant today considering the erosion of privacy due to recent advances in the performance of state-of-the-art authorship attribution approaches. Sophisticated machine learning models can determine the author of a given text document [27, 44] using hand-crafted stylometric features [2, 8, 35, 12, 3] or automated features such as word embeddings [42, 23]. State-of-the-art authorship attribution approaches have achieved impressive results in a multitude of settings, ranging from social media posts [4, 37, 41] to large-scale settings with up to 100,000 possible authors [36].
The desire to maintain anonymity in this increasingly hostile environment motivates the need for effective authorship obfuscation methods. Obfuscation approaches can be broadly divided into two groups: those that do not rely on feedback from an authorship attribution classifier and those that do require such feedback.
In the first group, there are a number of efforts, especially from the PAN digital text forensics initiative [1]. These authorship obfuscation approaches mostly use rule-based transformations (e.g., splitting or joining sentences) guided by some general criteria, such as moving the text towards an average writing style, moving it away from the author’s writing patterns, text simplification, or machine translation [29, 9, 28, 40]. These approaches generally struggle to achieve an appropriate trade-off between evasion effectiveness and preserving text semantics.
In the second group, obfuscators that rely on access to an authorship attribution classifier are more relevant to our research. In a seminal work, McDonald et al. [35] proposed Anonymouth – an obfuscator that relies on access to the JStylo attributor to guide manual text obfuscation. A4NT [43] proposed a generative adversarial network (GAN) based automated approach to obfuscation that also requires access to the attribution classifier. More recently, Mutant-X [33] used a genetic algorithm and ParChoice [22] used combinatorial paraphrasing for automated obfuscation; both require access to an attribution classifier. These methods have shown promise in effectively evading attribution classifiers while reasonably preserving text semantics.
While prior authorship obfuscation methods can suitably trade off between evading attribution and preserving semantics, they do not work well when the adversary uses a different attribution classifier than the one used internally by the obfuscator [33, 22, 43]. However, it is important that authorship obfuscators can protect the author’s identity even when the adversary uses a different attribution classifier. In other words, obfuscation should transfer to previously unseen attribution classifiers.
The lack of transferability is essentially because of the mismatch between the obfuscator’s internal classifier and the adversary’s attribution classifier. To address the transferability issue, our key insight is that if an obfuscator can evade a meta-classifier, which is based on multiple base classifiers that target different feature subspaces, it is more likely to evade an unseen attribution classifier. Building on this insight, we propose an ensemble-based approach for transferable authorship obfuscation. We explore the design space of the ensemble using different feature subspaces, base classifiers, and aggregation techniques. Our experimental evaluation shows that the ensemble-based authorship obfuscation approach yields state-of-the-art transferability results. We find that obfuscation using our ensemble approach achieves 1.7× and 2.1× better transferability, in terms of the attack success rate (ASR), than the baseline RFC and SVM attributors, respectively. The ensemble achieves an average METEOR score of 0.36, which is comparable with the RFC at 0.42 and the SVM at 0.40.
We summarize our key contributions and findings as follows:
1. We explore the problem of transferability of authorship obfuscation against unseen attribution classifiers.
2. We propose an ensemble approach that consists of multiple base classifiers, each capturing a different feature subspace, to guide automated text obfuscation.
3. We evaluate the evasion effectiveness, semantic preservation, and transferability of the ensemble obfuscator and show that it achieves much better transferability against unseen attribution classifiers than prior approaches.
2 Preliminaries & Methods

2.1 Authorship Attribution vs. Obfuscation
Stylometry is the analysis of an author’s writing style that helps distinguish them from other authors. For example, writeprints [2] is a well-known stylometric feature set that has been used to analyze writing style for the sake of authorship attribution. The primary goal of authorship obfuscation is to evade attribution by concealing such stylometric features in the document while retaining its original meaning. Early approaches such as Anonymouth [35] highlighted the distinctive stylometric properties of the text that could then be modified by the user to evade attribution. Follow-up work at PAN-CLEF aimed to automatically obfuscate documents using simple predefined rules. For example, Mansoorizadeh et al. [34] used WordNet to identify synonyms for the words most commonly used by the author and replaced them with similar words. Castro et al. [9] used sentence simplification techniques, such as replacing contractions with their expansions, to obfuscate a document. Keswani et al. [29] used round-trip translation to obfuscate a document. While these automated approaches managed to evade attribution, they severely compromised the obfuscated text’s semantics. These approaches, rather unsuccessfully, navigate the trade-off between evading attribution and preserving semantics [33].
Recent work such as A4NT [43], Mutant-X [33], and ParChoice [22] employs more sophisticated adversarial obfuscation to evade authorship attribution classifiers. Their threat model assumes that the obfuscator can query the adversary’s attribution classifier to guide obfuscation. For example, A4NT uses a generative adversarial network (GAN) for obfuscation that requires white-box access to the adversary’s attribution classifier. Mutant-X uses a genetic algorithm for obfuscation that requires black-box access to the adversary’s attribution classifier. ParChoice uses combinatorial paraphrasing for obfuscation that requires black-box access to the adversary’s attribution classifier. While these obfuscation approaches achieve a better trade-off between attribution evasion and preserving semantics, they all assume white- or black-box access to the adversary’s attribution classifier. This key assumption limits their effectiveness in the real world because the adversary’s attribution classifier might be different or unknown. For example, the evasion effectiveness of Mutant-X drops drastically when the adversary uses a different attribution classifier than assumed by Mutant-X [33]. Similarly, an adversarially retrained attribution classifier is resistant to obfuscation by ParChoice using the original classifier [22]. This lack of transferability to unseen attribution classifiers has major ramifications in the real world, as the obfuscator’s effectiveness is questionable when the adversary happens to use a different attribution classifier. Figure 1 provides an overview of this threat model involving the obfuscator and multiple unseen adversaries.
2.2 Problem Statement
The obfuscator seeks to obfuscate the stylometric properties of an input document D written by author a by modifying its text to produce an obfuscated document D' such that the attributor incorrectly classifies D' to some other author a' ≠ a. State-of-the-art authorship obfuscators mainly consist of two components: a generator G and an internal authorship attribution classifier C_int. The generator modifies the input document based on some rules and queries the internal classifier to predict whether these modifications would degrade the likelihood of successful authorship attribution. The two components work in tandem over a number of iterations to progressively obfuscate the input document by generating new obfuscation samples and measuring the degradation in attribution reported by C_int. It is noteworthy that the adversary might use a different authorship attribution classifier C_adv. There could, in fact, be multiple adversaries in this setting, each using an attribution classifier different from the obfuscator’s internal classifier C_int. The primary goal of the obfuscator is to obfuscate an input document using its internal classifier C_int such that it also evades attribution by the adversary classifier C_adv. This problem is also referred to as transferability in the field of adversarial machine learning.
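Stated compactly, the obfuscator’s objective can be written as follows (this formalization and its notation are ours, added for clarity):

$$C_{adv}\big(G(D, C_{int})\big) \neq a \quad \text{while} \quad \mathrm{sim}\big(D,\, G(D, C_{int})\big) \text{ remains high},$$

where G(D, C_int) = D' is the obfuscated output produced by the generator with feedback from the internal classifier, and sim is a semantic similarity measure such as METEOR.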
2.3 Approach
Intuition. The obfuscator relies on feedback from its internal classifier as a proxy to identify suitable transformations that can help evade attribution by the adversary’s classifier. These transformations essentially aim to move the document to the wrong side of the decision boundary, which partitions the different author classes, of the obfuscator’s internal classifier. Since these transformations are specific to the decision boundary of the obfuscator’s internal classifier, they may not achieve the same result on the adversary’s attribution classifier. When they do not, the obfuscated document evades the attribution classifier of the obfuscator but not that of the adversary. These differences in the decision boundaries of the two classifiers arise from differences in the emphasis they place on different features; transformations that target a feature emphasized by one classifier might be useless against the other classifier. To address this issue, our key insight is that if an obfuscator can evade a meta-classifier, whose decision boundary is based on the decision boundaries of multiple base classifiers, it is more likely to evade those base classifiers. We hypothesize that a meta-classifier consisting of multiple base classifiers, each emphasizing different features, will better capture the relative importance of various features. Intuitively, when used as the internal attribution classifier, this ensemble of base classifiers provides a more nuanced view of the entire feature space and classifies the document in a manner that essentially averages the decision boundaries of the base classifiers.

Ensemble Approach. An ensemble is a learning algorithm that takes a set of classifiers and uses their individual outputs to make the final classification for a given input. The classifiers in this set are referred to as the base classifiers of the ensemble. The number of base classifiers affects how the model fits the training set: too few and the model is likely to underfit; too many will likely result in overfitting. A suitable number can be determined using cross-validation, though training and validating multiple ensembles is a time-consuming exercise [24]. The base classifiers can be different classifiers trained on the same training set, or they can be the same classifier trained on different subsets of the training set (a technique known as bagging) or on different subspaces of the feature set [48].
The outputs of the base classifiers are then polled by the ensemble through either a majority vote or by training another classifier on top of them (a technique called stacking) [16]. A majority vote gives uniform weight to the output of each base classifier, whereas stacking learns to weight them, downplaying classifiers that are more often inaccurate. While the base classifiers might not be very accurate on their own, the ensemble as a whole can capitalize on their combined knowledge and make more accurate predictions.
We construct our ensemble using the feature subspace method and describe it as follows. A subspace is a subset of the entire universal set of features that are available to the classifier. We train the base classifiers of the ensemble on different subspaces of the feature set. The goal of using a subspace of features is to train a base learner that is specialized in that distinct and local set of features. This is motivated by our findings of feature importance and decision boundaries which we discuss later in Section 4.2. The subspaces can be selected randomly [47], through sampling [50], or through feature selection techniques [17].
Figure 2 illustrates the architecture of our proposed ensemble. The original feature space is divided into multiple subspaces which are then used to train the base classifiers. The outputs from these base classifiers are then aggregated to produce the final classification of the ensemble for the given input.
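For concreteness, the sketch below illustrates one way such an ensemble can be built (our illustration under stated assumptions, not the exact implementation evaluated later): linear SVM base classifiers are trained on random feature subspaces and aggregated by majority vote, with the vote fraction doubling as a soft confidence score that an obfuscator could query. Here X is assumed to be a stylometric feature matrix and y the corresponding author labels.

```python
import numpy as np
from sklearn.svm import LinearSVC


class RandomSubspaceEnsemble:
    """Majority-vote ensemble of linear SVMs, each trained on a random feature subspace."""

    def __init__(self, n_classifiers=10, subspace_size=50, seed=0):
        self.n_classifiers = n_classifiers
        self.subspace_size = subspace_size
        self.rng = np.random.default_rng(seed)
        self.members = []  # list of (feature_indices, fitted_classifier)

    def fit(self, X, y):
        n_features = X.shape[1]
        for _ in range(self.n_classifiers):
            # Sample a random subspace of the feature set (without replacement).
            idx = self.rng.choice(n_features, size=self.subspace_size, replace=False)
            clf = LinearSVC(max_iter=5000).fit(X[:, idx], y)
            self.members.append((idx, clf))
        self.classes_ = np.unique(y)
        return self

    def predict_proba(self, X):
        # Fraction of base classifiers voting for each author: a soft score
        # that an obfuscator such as Mutant-X could use as attribution confidence.
        votes = np.zeros((X.shape[0], len(self.classes_)))
        for idx, clf in self.members:
            preds = clf.predict(X[:, idx])
            for j, c in enumerate(self.classes_):
                votes[:, j] += (preds == c)
        return votes / self.n_classifiers

    def predict(self, X):
        # Majority vote: every base classifier carries equal weight.
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```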
3 Experimental Setup
In this section, we state the assumptions and describe the setup for our experiments. Specifically, we describe the dataset we use, the attribution classifiers used by the obfuscator and the adversaries, the layout of the experiments, and finally the evaluation metrics used to assess the results.
3.1 Data
The Extended Brennan Greenstadt (EBG) Corpus [7] comprises writing samples submitted by various authors through Amazon’s Mechanical Turk (AMT) platform. The corpus is unique because it was collected expressly for the purpose of adversarial stylometry in text and was vetted against a strict set of guidelines, imposed by AMT and the authors themselves, to ensure quality. The guidelines required that the submissions be professionally written, be free of anything other than the writing itself (e.g., citations, URLs, headings), and contain at least 6,500 words. The imposition of these strict guidelines ensured that the submissions were of high quality, reflected each author’s particular writing style, and provided sufficient data to train an attribution classifier. Out of the 100 submissions, the authors selected the 45 that most closely followed the guidelines and then split them into passages of roughly 500 words, averaging 15 documents per author, to create the final corpus. In our experiments, we divide the corpus into groups of 5 authors based on document length and report results on these groupings by further splitting them into an 80% training and 20% testing set.
3.2 Obfuscator’s Attribution Classifiers
Baseline obfuscator: We use Mutant-X as the baseline obfuscator for our experiments as its generator only requires black-box access to the attribution classifier [33]. This loose coupling between the two components allows the attribution classifier to be easily swapped for another. Additionally, we use the Writeprints [2] feature set throughout the experiments to train the internal attribution classifiers for Mutant-X. This feature set incorporates lexical and syntactic features to capture the stylometric properties of the author’s writing style. The lexical features include character-level and word-level features such as the total number of words, the average word length, and the proportions of different character classes, among others. The syntactic features include POS tags, the usage of function words, and various punctuation marks.
The fitness function for Mutant-X takes into account the detection probability of a given attribution classifier and the semantic similarity between the original and the obfuscated document. We train the two classifiers that were originally used for Mutant-X, a random forest classifier and a support vector machine, on the Writeprints feature set to serve as baselines for comparing the performance of the ensemble.
Writeprints-Static + Ensemble: We use the same feature set as our baselines to train the ensemble, constructing it by training base classifiers on subspaces of the entire feature set. We use a linear SVM as the base classifier owing to its stability and its demonstrated use as a base classifier for an ensemble in prior work [48]. We use the random subspace method [47] to construct the feature subspaces by randomly choosing a distinct set of features to train each base classifier. The ensemble then aggregates the results from these base classifiers by polling their individual predictions through a majority vote, which gives uniform weight to all the base classifiers. We configure the remaining parameters of the ensemble architecture by conducting small-scale experiments in a variety of settings. Just as before, we use the training portion of the EBG 5 dataset to select appropriate values for the following hyper-parameters:
- the number of internal (base) classifiers
- the length of the feature subspaces
The results from these experiments show that while a lower subspace length yields inaccurate individual classifiers, the accuracy of the overall ensemble remains much higher. This supports the notion that a robust and highly accurate model can be created from a group of weak learners. In light of this, we conservatively set the values of both hyper-parameters, as we noticed that higher values yield similar results. These settings are retained throughout the entirety of our experiments.
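For illustration, a minimal version of such a small-scale search could look as follows, reusing the RandomSubspaceEnsemble sketch from Section 2.3; the grid values, the train/validation split, and the variables X and y (Writeprints features and author labels) are assumptions, not the actual settings used in our experiments.

```python
import numpy as np
from itertools import product
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# X, y: Writeprints feature matrix and author labels for the EBG-5 training split (assumed).
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

for n_clf, k in product([5, 10, 20], [25, 50, 100]):  # illustrative grid only
    ens = RandomSubspaceEnsemble(n_classifiers=n_clf, subspace_size=k).fit(X_tr, y_tr)
    ens_acc = accuracy_score(y_val, ens.predict(X_val))
    # Accuracy of each (weak) base learner on the same held-out documents.
    base_acc = np.mean([accuracy_score(y_val, clf.predict(X_val[:, idx]))
                        for idx, clf in ens.members])
    print(f"n={n_clf:2d}  k={k:3d}  ensemble acc={ens_acc:.2f}  mean base acc={base_acc:.2f}")
```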
3.3 Adversary’s Attribution Classifiers
To assess the transferability of the obfuscated samples to other classifiers, we train a series of classifiers and measure the performance of the baselines and the ensemble against them. Specifically, we train multiple classifiers that use different types of techniques to measure the cross-technique transferability [38] of the method used to generate the samples. These classifiers are also trained on the Writeprints feature set and are as follows: k-nearest neighbors (KNN), naive Bayes (NB), multilayer perceptron (MLP), and logistic regression (LR), in addition to the already trained random forest classifier (RFC), support vector machine (SVM), and the ensemble (Ens) itself. Additionally, we incorporate counter-measures for the finding by Gröndahl et al. [22] that using an internal classifier results in highly specific transformations which fail to transfer not only to different classifiers but even to the same classifier if it is retrained. We accomplish this by training multiple versions of the classifiers that exhibit randomness during training (RFC, Ensemble, and MLP) and then report the average transferability result in their respective columns.
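A sketch of how this pool of adversaries could be instantiated is shown below; scikit-learn defaults are used purely for illustration, and X_train, y_train stand for the Writeprints features and author labels (assumed, not our exact configuration).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Deterministic adversaries: a single trained instance each.
adversaries = {
    "KNN": KNeighborsClassifier().fit(X_train, y_train),
    "NB": GaussianNB().fit(X_train, y_train),
    "LR": LogisticRegression(max_iter=1000).fit(X_train, y_train),
}

# Adversaries with training-time randomness (the MLP here): train several
# seeded instances and later average their transferability scores, so the
# evaluation does not overfit to one particular trained model.
mlp_instances = [MLPClassifier(random_state=seed, max_iter=500).fit(X_train, y_train)
                 for seed in range(3)]
```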
In addition to these Writeprints based classifiers, we also measure the transferability of the samples to JGAAP [26], a well-known system for authorship attribution that provides a wide array of features and classifiers, and to another MLP model trained on the Basic-9 feature set [35]. This setting allows us to explore the performance of the ensemble against an adversary that uses a different feature set and classifier implementation. We borrow the configuration recommended by Juola et al. [27] for JGAAP. The final configuration for JGAAP is listed in Table 1.
Pre-processing | Features | Classifiers
---|---|---
Lowercase all | Word unigrams | Weka SMO (SVM)
Strip punctuation | Word bigrams | Weka Naive Bayes (NB)
Normalize whitespaces | Character bigrams | Linear Discriminant Analysis (LDA)
 | Character trigrams |
 | Function words |
 | Sentence length |
3.4 Design of Transferability Experiments
We conduct several experiments, each corresponding to a particular internal classifier used by Mutant-X. In each of these experiments, all Mutant-X parameters are kept consistent and only its internal classifier is replaced. We configure the Mutant-X parameters based on the findings from the original pilot experiments performed on the EBG 5 dataset. The values for the different Mutant-X parameters are specified in Table 2. For each experiment, we report the average METEOR score of the obfuscated documents, the transferability rate across our fixed set of adversaries, and the overall attack success rate of that technique.
Setting | Value |
---|---|
Number of word replacements | 5 |
Number of iterations | 25 |
Number of runs per document | 10 |
Weight assigned to attribution confidence in fitness function | 0.75 |
Number of top individuals retained in each iteration | 5 |
Number of document mutants | 5 |
In its default setting, Mutant-X stops obfuscation once the mutated document has been misclassified by its attribution classifier. We alter this behavior to allow Mutant-X to continue obfuscation until all iterations have been performed. This alteration is partly inspired by the idea that stopping early because of one successful misclassification is detrimental to the overall goal of transferability to a wider set of adversaries.
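Schematically, the altered loop looks like the following sketch; this is our paraphrase of the genetic-algorithm control flow, not Mutant-X’s actual code, and mutate and fitness stand in for its word-replacement and fitness-scoring steps.

```python
def obfuscate(doc, true_author, internal_clf, fitness, mutate,
              n_iterations=25, top_k=5):
    """Run every iteration instead of stopping at the first misclassification."""
    population, best = [doc], None
    for _ in range(n_iterations):
        # Generate mutant documents, rank them by fitness, keep the top individuals.
        population = sorted(mutate(population), key=fitness, reverse=True)[:top_k]
        for candidate in population:
            if internal_clf.predict([candidate])[0] != true_author:
                # The default behavior would stop here; continuing lets later
                # iterations push the document further away from the internal
                # classifier's decision boundary, which helps transferability.
                if best is None or fitness(candidate) > fitness(best):
                    best = candidate
    return best if best is not None else population[0]
```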
3.5 Evaluation Metrics
Originally, Mutant-X was evaluated through two metrics that measured the safety and soundness of the obfuscation. While effective for measuring obfuscation across one adversary, they fail to quantify the transferability of the obfuscated samples to multiple adversaries. To alleviate this, we utilize a third metric called the Attack Success Rate (ASR) to capture this information and slightly modify the safety metric to accommodate it. The final metrics for evaluation are as follows.
1. Evasion Effectiveness: An obfuscated document generated using an internal attribution classifier effectively evades an adversary if it is misclassified by that particular adversary. We refer to this property of the internal classifier as its transferability to an adversary and report it as the percentage of obfuscated documents produced using that classifier that are misclassified by the adversary. For an adversary A that misclassifies m out of n obfuscated documents generated using the internal classifier C, we measure transferability as:

   $$T(C, A) = \frac{m}{n} \times 100 \qquad (1)$$

2. Attack Success Rate: The attack success rate [11] measures the overall transferability of the obfuscated documents across the entire set of adversaries. It is the average of the transferability scores reported for that specific internal classifier. Given a fixed set of adversaries, the attack success rate of classifier C can be reported as:

   $$ASR(C) = \frac{1}{|\mathcal{A}|} \sum_{A \in \mathcal{A}} T(C, A) \qquad (2)$$

3. Semantic Similarity: An obfuscated document has to maintain semantic similarity to the original document. As in the original evaluation of Mutant-X, we use the METEOR score [14] to assess this similarity. The score lies in the range [0, 1], with 1 indicating perfect similarity and 0 indicating the opposite. The final score reported is the average METEOR score of the obfuscated documents, where a higher score implies that the final documents were more similar to the originals.
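As a reference implementation of the two transferability metrics above (our sketch; adversaries are assumed to expose a scikit-learn-style predict method):

```python
import numpy as np

def transferability(adversary, obfuscated_docs, true_authors):
    """Percentage of obfuscated documents misclassified by one adversary (Eq. 1)."""
    preds = adversary.predict(obfuscated_docs)
    return 100.0 * np.mean(np.asarray(preds) != np.asarray(true_authors))

def attack_success_rate(adversaries, obfuscated_docs, true_authors):
    """Average transferability across the fixed set of adversaries (Eq. 2)."""
    return float(np.mean([transferability(a, obfuscated_docs, true_authors)
                          for a in adversaries]))
```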
4 Results
4.1 Evaluation
Mutant-X internal classifier | Writeprints-Static: RFC* | Writeprints-Static: SVM | Writeprints-Static: KNN | Writeprints-Static: NB | Writeprints-Static: MLP* | Writeprints-Static: LR | Writeprints-Static: Ens* | JGAAP: SVM | JGAAP: LDA | JGAAP: NB | Basic-9: MLP* | ASR | METEOR
---|---|---|---|---|---|---|---|---|---|---|---|---|---
SVM | 1.6 | 93.7 | 18.5 | 20.6 | 10.1 | 1.6 | 7.4 | 5.0 | 15.0 | 10.5 | 16.9 | 18.3 | 0.40
RFC | 28.2 | 26.2 | 19.4 | 18.4 | 14.6 | 5.8 | 29.1 | 24.0 | 28.0 | 20.0 | 25.2 | 21.7 | 0.42
Ensemble | 18.4 | 61.0 | 41.6 | 52.9 | 21.9 | 15.8 | 71.9 | 32.0 | 39.0 | 32.0 | 31.0 | 38.0 | 0.36
The main results of the experiments are presented in Table 3. The rows correspond to the internal classifier used by Mutant-X and the subsequent columns correspond to the classifier used by the adversary and the feature set they are trained on. The cell values are the percentage of documents generated using that method that were misclassified by the adversary’s classifier, i.e., the transferability of that method. The final two columns contain the attack success rate (mean of transferability to adversaries) and the mean METEOR score of the technique. Additionally, we reiterate the counter-measure discussed in Section 3.3 and train multiple versions of classifiers that exhibit randomness during training. As such, the columns for the Ensemble, MLP, and RFC report the average transferability to different versions of the classifiers.
Impact of attribution classifier: The transferability achieved by the SVM ranges from 1.6% (against the RFC* adversary) to 93.7% (against the SVM adversary), whereas for the RFC it ranges from 5.8% (LR) to 29.1% (Ens*). In comparison, the ensemble achieves transferability ranging from 15.8% (LR) to 71.9% (Ens*).
We notice that the cases where the baselines perform better are those where the internal classifier and the adversary’s classifier are the same (SVM→SVM and RFC→RFC*); these cases are fairly trivial and advantageous to Mutant-X. In fact, the 93.7% in the SVM→SVM case is unrealistic because the adversary’s classifier is exactly the same model used by Mutant-X, whereas the RFC→RFC* and Ensemble→Ens* scenarios have been normalized by training multiple instances of the adversary and reporting their average (see Section 3.3). Regardless, the ensemble still manages to outperform the other baseline in these trivial cases.
The effects of re-training the ensemble and RFC are also evident in these cases. The average transferability in the RFC→RFC* case is quite low at 28.2%, corroborating the findings of Gröndahl et al. [22]. However, this does not appear to hold for the ensemble, which still reports a fairly high transferability of 71.9% against its retrained counterparts, indicating its robustness.
In the non-trivial cases where the adversary’s classifier is different from the internal classifier, the ensemble fares far better than the baselines. In the case of KNN, the ensemble achieves a transferability of 41.6% compared to the RFC at 19.4%. In the case of NB, it achieves a transferability of 52.9%, which is 32.3% higher than the SVM at 20.6%. On average, the ensemble achieves 21% higher transferability than the SVM and 13.8% higher than the RFC across the set of adversaries where the internal classifier is different.
A comparison of the overall performance (trivial and non-trivial cases) between the ensemble and the baselines shows that the ensemble outperforms both across the wide range of adversarial settings. The ensemble’s overall attack success rate of 38.0% is the best of the three, 1.7× higher than the RFC’s 21.7% and 2.1× higher than the SVM’s 18.3%.
The ensemble does not perform as well as the other methods when we compare METEOR scores. The samples generated using the RFC and SVM retain better semantic similarity to the original documents, with average METEOR scores of 0.42 and 0.40, respectively. In contrast, the ensemble reports an average score of 0.36, indicating that its generated samples differ more from the originals. We attribute this lower score to the trade-off between protecting the author’s identity and staying true to the original content. We believe that ensuring transferability requires more substantial changes to the document, which leads to lower similarity between the source and the obfuscated text, and consequently a lower average METEOR score.
The ensemble is thus clearly more effective as an internal classifier than the baselines. The high attack success rate and a comparable METEOR score make it a reliable alternative to other conventional classifiers for use alongside an obfuscator like Mutant-X.
Impact of feature set: Mutant-X may have an inherent advantage when the obfuscator and adversary’s classifiers are trained on the same feature set as this likely provides the obfuscator unfair insight into how the adversary operates. We test this by observing results from experiments where the adversary is trained on a different feature set and classification technique than the internal classifier.
Within the JGAAP setting, the SVM does not perform as well as in the other two settings. It surprisingly performs worst in the SVM→SVM (JGAAP) case, achieving a transferability of only 5%. We attribute this to the difference between the kernel functions of the two SVMs: as opposed to the linear kernel used in the internal classifier, JGAAP’s default setting uses an RBF kernel. In comparison, the ensemble and the RFC achieve higher degrees of transferability, yielding attack success rates of 34.3% and 24% respectively, with the ensemble outperforming the RFC by 10.3%.
We see similar results in the Basic-9 setting, where the ensemble achieves a transferability that is 6% higher than the RFC’s and almost twice as high as the SVM’s. This affirms that the ensemble performs just as well against adversaries trained on a different feature set and outperforms other conventional classifiers.
4.2 Discussion
We now examine the various design decisions we made concerning the ensemble and try to understand their impact on its inner workings.
Impact of ensemble diversity: It is widely understood that combining a diverse set of individual classifiers leads to a more robust ensemble [15]. While the ensemble is diverse in the sense that the base classifiers are trained on distinct subspaces, we instead focus on the predictions of the base classifiers to measure diversity. Intuitively, the diversity of an ensemble is the difference in the predictions of the individual members that constitute it [30]. To try and understand how the diversity of the ensemble impacts obfuscation, we train multiple ensembles with varying degrees of diversity and compare their performance with each other.
To measure the diversity of an ensemble, we use the non-pairwise entropy metric [30]. The entropy value of an ensemble lies in the range [0, 1]: a value closer to 0 means that the individual classifiers mostly agree, and a value closer to 1 means that they mostly disagree with one another. This measure assumes that a diverse set of classifiers will disagree with one another, as opposed to correlated classifiers, which will agree more often.
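For reference, the entropy measure can be computed along these lines (our sketch; correct is a boolean matrix indicating which base classifiers label which validation documents correctly):

```python
import numpy as np

def ensemble_entropy(correct):
    """Non-pairwise entropy diversity measure of Kuncheva and Whitaker [30].

    correct: (n_samples, n_classifiers) boolean array, True where a base
    classifier predicts a document's author correctly.  Returns a value in
    [0, 1]: 0 when the members always agree, 1 at maximal disagreement.
    """
    n_samples, L = correct.shape
    n_correct = correct.sum(axis=1)          # classifiers correct per document
    denom = L - int(np.ceil(L / 2))          # normalizer from the definition
    return float(np.mean(np.minimum(n_correct, L - n_correct)) / denom)
```

In practice, correct can be built by stacking, for each base classifier, a boolean vector of its per-document correctness on a held-out split.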
We train multiple ensembles and control for their diversity by selecting appropriate base classifiers. Specifically, we create 4 bins of entropy values and train 10 ensembles for each bin, each ensemble having approximately the same entropy as the bin it is assigned to. We then conduct 40 experimental runs of Mutant-X, and in each run use one of these ensembles as the internal classifier to obfuscate documents and measure the attack success rate of these documents against our set of adversaries.
Figure 3 shows the results of these experiments. The y-axis represents the attack success rate, while the x-axis ticks represent the entropy value of each bin. Contrary to our intuition, we notice that ensembles with higher entropy (more diverse) had lower transferability, while ensembles with lower entropy (less diverse) performed comparatively better. We explain our interpretation of these results as follows.
Recalling that Mutant-X uses the confidence score of the internal classifier to make decisions regarding obfuscation, we investigate the impact of diversity on the accuracy of the ensemble and the confidence of the classifier in its classifications. To increase the diversity of an ensemble, we need to promote disagreement between the individual classifiers that comprise it. Consequently, this makes the ensemble overall less confident in its final classification even if it is correct. Since Mutant-X uses this confidence score as an indicator of attribution, a lower score leads to poorer decision making on Mutant-X’s part and reduces the quality of obfuscation. We note that this problem is likely unique to how Mutant-X operates and not necessarily an artefact of using ensembles. In future work, it will be interesting to explore how diversity impacts ensemble based transferability in different settings that are not bound by this restriction.

Impact of feature subspaces: Our main set of experiments consider an ensemble of base classifiers trained on randomly selected subspaces of the feature set. We now consider more systematic approaches for constructing the subspaces and see how they affect transferability.
At a higher level, the Writeprints feature set incorporates lexical and syntactic features that are qualitatively distinct. This distinction indicates the presence of contextual subspaces within the feature set. More specifically, there are 9 distinct subspaces: the frequencies of special characters, letters, digits, parts-of-speech tags, and punctuation; the most common letter bigrams and trigrams; the percentage of certain function words; and the ratio of unique words (hapax ratio) [2]. We train the base classifiers on this division of subspaces and measure the attack success rate of the resulting ensemble, noting that, in contrast to the random subspace method, this yields base classifiers trained on subspaces of different lengths.
Additionally, we explore feature selection techniques to construct the subspaces. Using a one-way ANOVA test [18], we measure the dependency between the features and the author label, and use the highest-ranking features to train the first base classifier. We repeat this for the next base classifier by considering the remaining set of features, and so on for the rest of the classifiers. This yields base classifiers of varying performance: the initial ones are highly accurate, but accuracy gradually drops as the remaining features become insufficient predictors. For this particular experiment, we fix the number of classifiers and the subspace length so that the features are distributed consistently across the base classifiers.
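A sketch of this feature-selection based construction is shown below (our illustration; n_classifiers and subspace_size are the two hyper-parameters discussed in Section 3.2):

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.svm import LinearSVC

def anova_subspace_members(X, y, n_classifiers, subspace_size):
    """Partition features into disjoint subspaces by one-way ANOVA F-score.

    The first base classifier is trained on the highest-ranking block of
    features, the next one on the following block, and so on, so later
    members are progressively weaker predictors.
    """
    f_scores, _ = f_classif(X, y)
    ranking = np.argsort(f_scores)[::-1]     # feature indices, best first
    members = []
    for i in range(n_classifiers):
        idx = ranking[i * subspace_size:(i + 1) * subspace_size]
        members.append((idx, LinearSVC(max_iter=5000).fit(X[:, idx], y)))
    return members
```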
The results of these experiments are as follows: the contextual subspace ensemble yields an ASR of 37.1%, whereas the feature selection subspace ensemble yields an ASR of 34.7%. Both are comparable with the random subspace ensemble, which had an ASR of 38.0%.
Considering the security aspect of these techniques, the contextual subspace approach is relatively risky. Since the features composing these subspaces are easy to identify, an adversary can undo the effects of obfuscation through adversarial training: building a classifier that recognizes obfuscation by training it on the obfuscated documents generated using the internal classifier [21]. In contrast, an ensemble built using random subspaces offers a good balance: it achieves a comparable degree of transferability and provides a good defence against adversarial training, as its random nature is unpredictable to the adversary.
Feature importance and decision boundaries: While the goal of all classifiers is to map the input document to an author, there are fundamental differences in the way they operate and actually classify the data. These differences highlight the notion of feature importance: some features are more important for one type of classifier than for another. We now interpret these models to identify the features they consider important and see how this affects transferability.
An RFC is a collection of decision trees that counts the votes of the individual trees to make the final classification. The decision trees consist of several nodes that split the training set into subsets based on the values of certain features. Arguably, features that are used more often for splitting and that split a sizable portion of the training set bear greater significance for the model. This is captured by the Gini importance of a feature [6]: the number of times the feature is used for a split, weighted by the number of samples it splits.
An SVM classifies the data by learning a hyperplane that separates the classes. In a linear SVM, this hyperplane is chosen to maximize the margin between the classes. Since the coefficients of this hyperplane are associated with the features, their absolute values represent the significance of the corresponding features relative to one another. In a multi-class setting, the SVM learns multiple hyperplanes separating the classes, each with its own set of coefficients.
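Rankings like those in Table 4 can be extracted roughly as follows (our sketch; X_train, y_train, and feature_names, which maps Writeprints feature indices to names, are assumed):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

rfc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
svm = LinearSVC(max_iter=5000).fit(X_train, y_train)

# Gini importance: impurity reduction contributed by each feature across the
# forest, weighted by the number of samples it splits.
rfc_top5 = np.argsort(rfc.feature_importances_)[::-1][:5]

# For a linear SVM, the absolute value of a hyperplane coefficient reflects the
# relative weight of the corresponding feature; coef_[0] is one of the
# one-vs-rest hyperplanes in the multi-class setting.
svm_top5 = np.argsort(np.abs(svm.coef_[0]))[::-1][:5]

print("RFC:", [feature_names[i] for i in rfc_top5])
print("SVM:", [feature_names[i] for i in svm_top5])
```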
RFC | SVM
---|---
posTagFrequency - Space | functionWordsPercentage - by
frequencyOfLetters - a | functionWordsPercentage - and
functionWordsPercentage - is | functionWordsPercentage - was
frequencyOfDigits - 1 | functionWordsPercentage - in
functionWordsPercentage - had | functionWordsPercentage - that
We assess the differences between the features that are important to the RFC and the SVM. Table 4 lists the top 5 features for the baseline RFC and for one of the SVM hyperplanes. We note that these are different for the two classifiers; moreover, this trend holds even beyond the top 5 features. In a high-dimensional feature space such as Writeprints, this difference in feature emphasis means that some features lose their relative importance for a given classifier, and the obfuscator consequently ignores them. This highlights a fundamental flaw in the obfuscator: the obfuscation will always be tuned to the features preferred by its internal classifier and will fail to transfer to a different classifier that emphasizes different features.
Our approach of using feature subspaces in an ensemble alleviates this flaw to an extent; base classifiers trained on smaller random sets of features emphasize the importance of those features. The base classifier then specializes in its localized subspace of features and, while it may not be accurate, it is representative of a certain aspect of the feature space that might be of significance to an adversary’s classifier.
Looking at the decision boundaries of the base classifiers in Figure 4 helps explain how they improve transferability. We use Principal Component Analysis (PCA) [49] to reduce the high-dimensional feature space to two dimensions for plotting the boundaries. The data points are the documents from the test set projected into the PCA dimensions. The colored regions in the background represent the decision regions of the classifier for a particular label, i.e., points that fall in those regions are classified with that label. We stress that this two-dimensional projection is merely an approximation of the actual high-dimensional feature space, so some misalignments are expected. Looking at the decision boundaries of the base classifiers, we see that they vary significantly and that some of the classifiers perform better at classifying a particular author than others. The decision boundaries also highlight the limited access each base classifier has to the entire feature space, observable in the disjoint patches that belong to the same decision region: while the projection takes the entire feature space into account, the decision region is based only on the subspace that the particular classifier was concerned with.
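Plots of this kind can be produced roughly as follows (our sketch; author labels are assumed to be integer-encoded, and clf may be the ensemble itself or a small wrapper around a base classifier that selects its subspace from the full feature vector):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

def plot_decision_regions_2d(clf, X, y, resolution=200):
    """Visualize a high-dimensional decision boundary in a 2-D PCA projection."""
    pca = PCA(n_components=2).fit(X)
    X2 = pca.transform(X)
    xx, yy = np.meshgrid(
        np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, resolution),
        np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, resolution))
    # Map each 2-D grid point back into the full feature space before predicting,
    # so the colored regions approximate the classifier's actual boundary.
    grid = pca.inverse_transform(np.c_[xx.ravel(), yy.ravel()])
    zz = clf.predict(grid).reshape(xx.shape)
    plt.contourf(xx, yy, zz, alpha=0.3)
    plt.scatter(X2[:, 0], X2[:, 1], c=y, edgecolor="k")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()
```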
Figure 5 shows the decision boundary of the ensemble formed from these base classifiers. We see that the decision region of the ensemble more closely encapsulates the data points than the base classifiers. Since the ensemble classifies according to the majority vote of the base classifiers, its decision boundary is approximately the average of all their decision boundaries. The voting mechanism also ensures that the base classifiers are weighted equally so as to not downplay the role of a certain subspace. Therefore, the ensemble capitalizes on the individual knowledge of the base classifiers and effectively serves as a middle-ground for the obfuscator to compare against.

5 Related Work
We survey related research on the transferability of adversarial attacks designed to evade machine learning classifiers.
There is a rich body of literature in the image classification context on the transferability of adversarial attacks in both white-box and black-box settings. Biggio et al. [5] and Szegedy et al. [46] first showed that an adversary can launch attacks by creating minor perturbations in the input that cause machine learning models to misclassify it. Follow-up work has studied the practicality of these adversarial attacks in the real world by studying whether they can transfer even when the adversary might not have complete access to the machine learning classifier (e.g., [32, 39, 45, 13]). For example, Papernot et al. [39] proposed a black-box attack against a variety of machine learning approaches including deep neural networks, logistic regression, SVM, decision trees, and nearest neighbors, outperforming existing attacks in terms of transferability. Suciu et al. [45] and Demontis et al. [13] studied if and why adversarial attacks (do not) transfer in real-world settings. They showed that the target model’s complexity and its alignment with the adversary’s source model significantly impact the transferability of adversarial attacks.
Adversarial attacks in the continuous vision/image domain are different from adversarial attacks in the discrete text domain. Much of the prior work on adversarial attacks is focused on the vision domain and cannot be easily adapted to the text domain [51]. Adversarial attacks on text classification models mostly work by simply misspelling certain words [19, 31]. While these attacks are effective, they are easy to counter with standard pre-processing steps such as fixing misspelled, out-of-vocabulary words. Jin et al. (TextFooler) [25] and Garg et al. (BAE) [20] proposed black-box adversarial attacks on text classification models that replace certain words using word embeddings and language models, respectively. Their evaluations showed that these black-box adversarial attacks at best only moderately transfer to unseen models.
Recent adversarial attacks on machine learning based authorship attribution models, which employ feedback from attribution classifiers, are quite similar. Mahmood et al. (Mutant-X) proposed a black-box adversarial attack that replaces selected words using word embeddings guided by a genetic algorithm [33]. Gröndahl et al. [22] proposed a similar black-box adversarial attack (ParChoice) that uses paraphrasing to replace selected text. While these adversarial attack approaches are effective at authorship obfuscation, they do not transfer well to unseen authorship classifiers. Transferable authorship obfuscation in such settings remains an open challenge that we address in our work.
Another relevant line of research has investigated using ensembles to improve transferability of adversarial attacks. For example, Liu et al. [32] showed that if an adversarial attack succeeds in evading an ensemble of models it will have better transferability because the source ensemble model and the target models are more likely to share decision boundaries. Most recently, Che et al. [10] studied the effectiveness of different ensemble strategies in improving transferability of adversarial attacks. They also conclude that an attack model that evades an ensemble of multiple source models is more likely to transfer to different target models.
6 Conclusion
In the arms race between authorship attribution and obfuscation, it is crucial that obfuscation transfers when an adversary deploys a different attributor than the one assumed by the obfuscator. In this paper, we showed that an ensemble that uses multiple base attribution classifiers, each exploiting a random portion of the feature space, achieves 1.7× and 2.1× better transferability than the baseline RFC and SVM attributors, respectively. Moreover, we showed that this advantage holds even when the adversary’s attributor operates on a different feature set. We also found that ensemble diversity, in the sense of disagreement among base classifiers, is not crucial for transferability, as it only hinders the obfuscator by lowering the ensemble’s attribution confidence.
References
- pan [2018] Author Obfuscation. https://pan.webis.de/clef18/pan18-web/author-obfuscation.html, 2018.
- Abbasi and Chen [2008] Ahmed Abbasi and Hsinchun Chen. Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems (TOIS), 26(2):7, 2008.
- Afroz et al. [2014] Sadia Afroz, Aylin Caliskan Islam, Ariel Stolerman, Rachel Greenstadt, and Damon McCoy. Doppelgänger finder: Taking stylometry to the underground. In 2014 IEEE Symposium on Security and Privacy, pages 212–226. IEEE, 2014.
- Almishari et al. [2014] Mishari Almishari, Dali Kaafar, Ekin Oguz, and Gene Tsudik. Stylometric linkability of tweets. In Proceedings of the 13th Workshop on Privacy in the Electronic Society, pages 205–208, 2014.
- Biggio et al. [2013] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion Attacks against Machine Learning at Test Time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2013.
- Breiman [2017] L. Breiman. Classification and Regression Trees. CRC Press, 2017. ISBN 9781351460484. URL https://books.google.com.pk/books?id=gLs6DwAAQBAJ.
- Brennan et al. [2012a] Michael Brennan, Sadia Afroz, and Rachel Greenstadt. Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Transactions on Information and System Security (TISSEC), 15, 11 2012a. 10.1145/2382448.2382450.
- Brennan et al. [2012b] Michael Brennan, Sadia Afroz, and Rachel Greenstadt. Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Transactions on Information and System Security (TISSEC), 15(3):1–22, 2012b.
- Castro-Castro et al. [2017] Daniel Castro-Castro, R. O. Bueno, and R. Muñoz. Author masking by sentence transformation. In CLEF, 2017.
- Che et al. [2020] Zhaohui Che, Ali Borji, Guangtao Zhai, Suiyi Ling, Jing Li, and Patrick Le Callet. A New Ensemble Adversarial Attack Powered by Long-Term Gradient Memories. In AAAI Conference on Artificial Intelligence (AAAI-20), 2020.
- Chen and Vorobeychik [2018] Yifan Chen and Yevgeniy Vorobeychik. Regularized ensembles and transferability in adversarial learning. CoRR, abs/1812.01821, 2018. URL http://arxiv.org/abs/1812.01821.
- Clark and Hannon [2007] Jonathan H Clark and Charles J Hannon. An algorithm for identifying authors using synonyms. In Eighth Mexican International Conference on Current Trends in Computer Science (ENC 2007), pages 99–104. IEEE, 2007.
- Demontis et al. [2019] Ambra Demontis, Marco Melis, Maura Pintor, Matthew Jagielski, Battista Biggio, Alina Oprea, Cristina Nita-Rotaru, and Fabio Roli. Why Do Adversarial Attacks Transfer? Explaining Transferability of Evasion and Poisoning Attacks. In USENIX Security Symposium, 2019.
- Denkowski and Lavie [2014] Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics. 10.3115/v1/W14-3348. URL https://www.aclweb.org/anthology/W14-3348.
- Dietterich [2000a] Thomas G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, pages 1–15, Berlin, Heidelberg, 2000a. Springer Berlin Heidelberg.
- Dietterich [2000b] Thomas G. Dietterich. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, MCS ’00, page 1–15, Berlin, Heidelberg, 2000b. Springer-Verlag. ISBN 3540677046.
- Dyer et al. [2013] Eva L. Dyer, Aswin C. Sankaranarayanan, and Richard G. Baraniuk. Greedy feature selection for subspace clustering. J. Mach. Learn. Res., 14(1):2487–2517, January 2013. ISSN 1532-4435.
- Elssied et al. [2014] Nadir Elssied, Assoc Prof. Dr. Othman Ibrahim, and Ahmed Hamza Osman. A novel feature selection based on one-way anova f-test for e-mail spam classification. Research Journal of Applied Sciences, Engineering and Technology, 7:625–638, 01 2014. 10.19026/rjaset.7.299.
- Gao et al. [2018] Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers. In IEEE Security and Privacy Workshop on Deep Learning and Security, 2018.
- Garg and Ramakrishnan [2020] Siddhant Garg and Goutham Ramakrishnan. BAE: BERT-based Adversarial Examples for Text Classification. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
- Goodfellow et al. [2014] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adversarial Examples. arXiv e-prints, art. arXiv:1412.6572, Dec 2014.
- Gröndahl and Asokan [01 Oct. 2020] Tommi Gröndahl and N. Asokan. Effective writing style transfer via combinatorial paraphrasing. Proceedings on Privacy Enhancing Technologies, 2020(4):175 – 195, 01 Oct. 2020. https://doi.org/10.2478/popets-2020-0068. URL https://content.sciendo.com/view/journals/popets/2020/4/article-p175.xml.
- Howard and Ruder [2018] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.
- Htike [2016] Kyaw Htike. Efficient determination of the number of weak learners in adaboost. Journal of Experimental & Theoretical Artificial Intelligence, 29:1–16, 12 2016. 10.1080/0952813X.2016.1266038.
- Jin et al. [2020] Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. In AAAI 2020, 2020.
- Juola [2009] P. Juola. Jgaap: A system for comparative evaluation of authorship attribution. 2009.
- Juola and Vescovi [2010] Patrick Juola and Darren Vescovi. Empirical evaluation of authorship obfuscation using jgaap. In Proceedings of the 3rd ACM workshop on Artificial Intelligence and Security, pages 14–18, 2010.
- Karadzhov et al. [2017] Georgi Karadzhov, Tsvetomila Mihaylova, Yasen Kiprov, Georgi Georgiev, Ivan Koychev, and Preslav Nakov. The case for being average: A mediocrity approach to style masking and author obfuscation. pages 173–185, 07 2017. ISBN 978-3-319-65812-4. 10.1007/978-3-319-65813-1_18.
- Keswani et al. [2016] Yashwant Keswani, H. Trivedi, Parth Mehta, and P. Majumder. Author masking through translation. In CLEF, 2016.
- Kuncheva and Whitaker [2004] L. Kuncheva and C. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51:181–207, 2004.
- Li et al. [2019] Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. TextBugger: Generating Adversarial Text Against Real-world Applications. In Network and Distributed Systems Security (NDSS) Symposium, 2019.
- Liu et al. [2017] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. International Conference on Learning Representations, 2017.
- Mahmood et al. [2019] Asad Mahmood, Faizan Ahmad, Zubair Shafiq, Padmini Srinivasan, and Fareed Zaffar. A girl has no name: Automated authorship obfuscation using mutant-x. PoPETs, 2019(4):54–71, 2019. 10.2478/popets-2019-0058. URL https://doi.org/10.2478/popets-2019-0058.
- Mansoorizadeh et al. [2016] Muharram Mansoorizadeh, Taher Rahgooy, Mohammad Aminian, and M. Eskandari. Author obfuscation using wordnet and language models. In CLEF, 2016.
- McDonald et al. [2012] Andrew WE McDonald, Sadia Afroz, Aylin Caliskan, Ariel Stolerman, and Rachel Greenstadt. Use fewer instances of the letter “i”: Toward writing style anonymization. In International Symposium on Privacy Enhancing Technologies Symposium, pages 299–318. Springer, 2012.
- Narayanan et al. [2012] Arvind Narayanan, Hristo Paskov, Neil Zhenqiang Gong, John Bethencourt, Emil Stefanov, Eui Chul Richard Shin, and Dawn Song. On the feasibility of internet-scale author identification. In 2012 IEEE Symposium on Security and Privacy, pages 300–314. IEEE, 2012.
- Overdorf and Greenstadt [2016] Rebekah Overdorf and Rachel Greenstadt. Blogs, twitter feeds, and reddit comments: Cross-domain authorship attribution. Proceedings on Privacy Enhancing Technologies, 2016(3):155–171, 2016.
- Papernot et al. [2016] Nicolas Papernot, Patrick D. McDaniel, and Ian J. Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. ArXiv, abs/1605.07277, 2016.
- Papernot et al. [2017] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical Black-Box Attacks against Machine Learning. In ACM Asia Conference on Computer and Communications Security (AsiaCCS), 2017.
- Potthast et al. [2016] Martin Potthast, Matthias Hagen, and Benno Stein. Author obfuscation: Attacking the state of the art in authorship verification. In CLEF, 2016.
- Rajapaksha et al. [2017] Praboda Rajapaksha, Reza Farahbakhsh, and Noël Crespi. Identifying content originator in social networks. In GLOBECOM 2017-2017 IEEE Global Communications Conference, pages 1–6. IEEE, 2017.
- Ruder et al. [2016] Sebastian Ruder, Parsa Ghaffari, and John G Breslin. Character-level and multi-channel convolutional neural networks for large-scale authorship attribution. arXiv preprint arXiv:1609.06686, 2016.
- Shetty et al. [2017] Rakshith Shetty, Bernt Schiele, and Mario Fritz. A4NT: Author attribute anonymity by adversarial training of neural machine translation. CoRR, abs/1711.01921, 2017. URL http://arxiv.org/abs/1711.01921.
- Stolerman et al. [2014] Ariel Stolerman, Rebekah Overdorf, Sadia Afroz, and Rachel Greenstadt. Breaking the closed-world assumption in stylometric authorship attribution. In IFIP International Conference on Digital Forensics, pages 185–205. Springer, 2014.
- Suciu et al. [2018] Octavian Suciu, Radu Marginean, Yigitcan Kaya, Hal Daume III, and Tudor Dumitras. When Does Machine Learning FAIL? Generalized Transferability for Evasion and Poisoning Attacks. In USENIX Security Symposium, 2018.
- Szegedy et al. [2014] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.
- Tin Kam Ho [1998] Tin Kam Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998. 10.1109/34.709601.
- Ting et al. [2011] Kai Ming Ting, Jonathan R Wells, Swee Chuan Tan, Shyh Wei Teng, and Geoffrey I Webb. Feature-subspace aggregating: ensembles for stable and unstable learners. Machine Learning, 82(3):375–397, 2011.
- Wold et al. [1987] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1):37 – 52, 1987. ISSN 0169-7439. https://doi.org/10.1016/0169-7439(87)80084-9. URL http://www.sciencedirect.com/science/article/pii/0169743987800849. Proceedings of the Multivariate Statistical Workshop for Geologists and Geochemists.
- Ye et al. [2013] Yunming Ye, Qingyao Wu, Joshua Zhexue Huang, Michael K. Ng, and Xutao Li. Stratified sampling for feature subspace selection in random forests for high dimensional data. Pattern Recognition, 46(3):769 – 787, 2013. ISSN 0031-3203. https://doi.org/10.1016/j.patcog.2012.09.005. URL http://www.sciencedirect.com/science/article/pii/S0031320312003974.
- Zhang et al. [2020] Wei Emma Zhang, Quan Z. Sheng, Ahoud Alhazmi, and Chenliang Li. Adversarial Attacks on Deep Learning Models in Natural Language Processing: A Survey. In ACM Transactions on Intelligent Systems and Technology, 2020.