
Elephant in the Room: An Evaluation Framework for
Assessing Adversarial Examples in NLP

Ying Xu
IBM Research, Australia

Xu Zhong
IBM Research, Australia

Antonio Jose Jimeno Yepes
IBM Research, Australia

Jey Han Lau
University of Melbourne

Abstract

An adversarial example is an input transformed by small perturbations that machine learning models consistently misclassify. While there are a number of methods proposed to generate adversarial examples for text data, it is not trivial to assess the quality of these adversarial examples, as minor perturbations (such as changing a word in a sentence) can lead to a significant shift in their meaning, readability and classification label. In this paper, we propose an evaluation framework consisting of a set of automatic evaluation metrics and human evaluation guidelines, to rigorously assess the quality of adversarial examples based on the aforementioned properties. We experiment with six benchmark attacking methods and find that some of them generate adversarial examples with poor readability and content preservation. We also learn that multiple factors can influence the attacking performance, such as the length of the text inputs and the architecture of the classifiers.

1 Introduction

Adversarial examples, a term introduced in Szegedy et al. (2013), are inputs transformed by small perturbations that machine learning models consistently misclassify. The original experiments were conducted in the context of computer vision (CV), and the core idea is encapsulated by an illustrative example: after imperceptible noise is added to a panda image, an image classifier predicts, with high confidence, that it is a gibbon. Interestingly, these adversarial examples can also be used to improve the classifier — either as additional training data Szegedy et al. (2013) or as a regularisation objective Goodfellow et al. (2014) — thus providing motivation for generating effective adversarial examples.

The germ of this paper comes from our investigation of adversarial attack methods for natural language processing (NLP) tasks, e.g. sentiment classification, which drives us to quantify what makes an “effective” or “good” adversarial example. In the context of images, a good adversarial example is typically defined according to two criteria:

  (a) it has successfully fooled the target classifier;

  (b) it is visually similar to the original example.

In NLP, defining a good adversarial example is a little more involved. While criterion (b) can be measured with comparable text similarity metrics (e.g. BLEU or edit distance) and semantic similarity metrics (e.g. cosine distance between sentence embeddings), an adversarial example should also:

  (c) be fluent or natural;

  (d) preserve its original label. (In the CV example, if the perturbed panda image looks like a panda, it fulfils criteria (b) and (d). In an NLP task such as sentiment classification, even though a perturbed sentence may look similar to the original and so satisfies criterion (b), the perturbed sentence might have the opposite sentiment because of a word change, e.g. from good to tolerable.)

These two additional criteria are generally irrelevant for images, as adding minor perturbations to an image is unlikely to: (1) create an uninterpretable image (while changing one word in a sentence can render a sentence incoherent), and (2) change how we perceive the image, say from seeing a panda to a gibbon (but a sentence’s sentiment can be reversed by simply adding a negative adverb such as not). Without considering criterion (d), generating adversarial examples in NLP would be trivial, as the model can learn to simply replace a positive adjective (amazing) with a negative one (awful) to attack a sentiment classifier, or substitute a numeric token with another number to attack a machine comprehension system that is queried for the year of an event. In other words, while criterion (d) is directly implied by criterion (b) in CV (a visually similar perturbed image generally preserves its original label), this is not the case for NLP. To the best of our knowledge, most studies on adversarial example generation in NLP have largely ignored these additional criteria Wang et al. (2019); Ebrahimi et al. (2017); Tsai et al. (2019); Gong et al. (2018).

The core contribution of our paper is to introduce a systematic evaluation framework that combines automatic metrics and human judgements to assess the quality of adversarial examples for NLP. We focus on sentiment classification as the target task, as it is a popular application that highlights the importance of the criteria discussed above. It is worth noting, however, that our framework is generic and applies to any NLP task.

We test our evaluation framework on a number of attacking methods for generating adversarial examples, ranging from white-box to black-box attacks (a white-box attack assumes full access to the target classifier’s architecture and parameters; a black-box attack does not). For the human judgements, we crowdsource the annotations to assess criteria (b), (c) and (d). Our results reveal that examples generated by most attacking methods are successful in fooling the target classifiers, but their language is often unnatural and the original label is not properly preserved. We also found that a number of external factors have a substantial impact on the attacking performance, such as the length of the text inputs and the classifier architectures. Lastly, we evaluate the transferability of the adversarial examples and the computational time of different attacking methods. Transferability measures how effective the adversarial examples (generated for one classifier) are in attacking other classifiers.

2 Related Work

Most adversarial attack methods for text inputs are derived from methods originally designed for image inputs. These methods can be categorised into three types: gradient-based attacks, optimisation-based attacks and model-based attacks.

Gradient-based attacks are white-box attacks that rely on the gradients of the target classifier with respect to the input representation. This class of attacking methods Kurakin et al. (2016); Dong et al. (2018) is by and large inspired by the fast gradient sign method (FGSM) Goodfellow et al. (2014), which has been shown to be effective in attacking CV classifiers. However, these gradient-based methods cannot be applied to text directly, because perturbed word embeddings do not necessarily map to valid words. Other methods, such as DeepFool Moosavi-Dezfooli et al. (2016), that rely on perturbing the word embedding space face similar roadblocks. Gong et al. (2018) propose to use nearest neighbour search to find the closest word to the perturbed embedding.

Both optimisation-based and model-based attacks treat adversarial attack as an optimisation problem whose objectives are to maximise the loss of the target classifier and to minimise the difference between the original and adversarial examples. Between these two, the former uses optimisation algorithms directly, while the latter trains a separate model to generate the adversarial examples and therefore involves a training process. Some of the most effective attacks on images are achieved by optimisation-based methods, such as Goodfellow et al. (2014) and Carlini and Wagner (2017) for white-box attacks and Chen et al. (2017) for black-box attacks. For text, white-box attacks Ebrahimi et al. (2017) and black-box attacks Gao et al. (2018); Li et al. (2018) have also been proposed in this category.

Model-based attacks are generally seen as grey-box attacks, as they require access to the target classifier during the training phase; once trained, however, they can generate adversarial examples independently. Xiao et al. (2018) introduce a generative adversarial network to generate the image perturbation from a noise map. In model-based attacks, the attacking model and the target classifier generally form a larger network, and the attacking model is trained using the loss from the target classifier. Note, however, that it is not straightforward to use these model-based techniques for text directly, because the words in the adversarial examples are discrete and the network is therefore not fully differentiable.

3 Methodology

3.1 Sentiment Classifiers

There are a number of off-the-shelf neural models for sentiment classification Kim (2014); Wang et al. (2016), most of which are based on long short-term memory networks (LSTM; Hochreiter and Schmidhuber (1997)) or convolutional neural networks (CNN; Kim (2014)). In this paper, we pre-train three sentiment classifiers: BiLSTM, BiLSTM+A, and CNN. These classifiers are targeted by different attacking methods to generate adversarial examples (detailed in Section 3.2). BiLSTM is composed of an embedding layer that maps individual words to pre-trained word embeddings, a number of bi-directional LSTM layers that capture sequential contexts, and an output layer that maps the averaged LSTM hidden states to a binary output. BiLSTM+A is similar to BiLSTM except that it has an extra self-attention layer which learns to attend to salient words for sentiment classification, and we compute the weighted mean of the LSTM hidden states prior to the output layer. Manual inspection of the attention weights shows that polarity words such as awesome and disappointed are assigned higher weights. Finally, CNN has a number of convolutional filters of varying sizes, and their outputs are concatenated, pooled and fed to a fully-connected layer followed by a binary output layer.
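
To make the classifier description concrete, the following is a minimal PyTorch sketch of a BiLSTM+A-style model; the layer sizes, attention formulation and variable names are illustrative assumptions rather than the exact configuration used in the experiments.

```python
# Minimal PyTorch sketch of a BiLSTM+A-style classifier (illustrative only:
# layer sizes and the attention formulation are assumptions, not the exact
# configuration used in the experiments).
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=128, attn_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # initialised from pre-trained vectors
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Sequential(                            # self-attention scorer over time steps
            nn.Linear(2 * hidden_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1)
        )
        self.out = nn.Linear(2 * hidden_dim, 2)               # binary sentiment output

    def forward(self, token_ids):                             # token_ids: (batch, seq)
        h, _ = self.lstm(self.embedding(token_ids))           # (batch, seq, 2 * hidden_dim)
        weights = torch.softmax(self.attn(h).squeeze(-1), dim=1)   # (batch, seq)
        context = (weights.unsqueeze(-1) * h).sum(dim=1)      # weighted mean of hidden states
        return self.out(context)                              # (batch, 2) logits
```

Dropping the attention block and simply averaging the hidden states recovers the plain BiLSTM variant.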

Recent developments in transformer-based pre-trained models have produced state-of-the-art performance on a range of NLP tasks Devlin et al. (2018); Yang et al. (2019). To validate the transferability of the attacking methods, we also test their adversarial examples on a fine-tuned BERT classifier. That is, we use the adversarial examples generated for attacking the three previous classifiers (BiLSTM, BiLSTM+A or CNN) as test data for BERT and measure its classification performance to understand whether these adversarial examples can also fool BERT.

3.2 Benchmark Attacking Methods

We experiment with six benchmark attacking methods for text, ranging from white-box attacks — FGM, FGVM and DeepFool Gong et al. (2018), HotFlip Ebrahimi et al. (2017), and TYC Tsai et al. (2019) — to a black-box attack, TextFooler Jin et al. (2019).

To perturb the discrete inputs, both FGM and FGVM introduce noise in the word embedding space via the fast gradient method Goodfellow et al. (2014) and reconstruct the input by mapping the perturbed word embeddings to valid words via nearest neighbour search. Between FGM and FGVM, the former introduces noise proportional to the sign of the gradients, while the latter introduces perturbations proportional to the gradients directly; the proportion is known as the overshoot value and is denoted by ϵ. DeepFool uses the same trick to deal with discrete inputs except that, instead of using the fast gradient method, it uses the method introduced in Moosavi-Dezfooli et al. (2016) for images to search for an optimal direction in which to perturb the word embeddings.
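
The shared perturb-then-project idea behind these three methods can be sketched as follows. This is a schematic illustration only, assuming a differentiable classifier that takes embedded inputs and an embedding matrix; it is not the original implementation.

```python
# Schematic sketch of the FGM-style perturb-then-project step (assumptions:
# `model` maps a batch of embedded inputs to logits and is differentiable;
# `emb_matrix` is the (vocab_size, emb_dim) embedding table).
import torch
import torch.nn.functional as F

def fgm_attack(embedded, label, model, emb_matrix, epsilon=1.0):
    embedded = embedded.clone().requires_grad_(True)              # (seq_len, emb_dim)
    loss = F.cross_entropy(model(embedded.unsqueeze(0)), label.unsqueeze(0))
    loss.backward()
    # FGM perturbs along the sign of the gradient; FGVM would use the raw
    # gradient (scaled by epsilon) instead.
    perturbed = embedded + epsilon * embedded.grad.sign()
    # Project each perturbed embedding back to the nearest valid word.
    dists = torch.cdist(perturbed, emb_matrix)                    # (seq_len, vocab_size)
    return dists.argmin(dim=1)                                    # adversarial token ids
```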

Unlike the previous three methods, HotFlip and TYC rely on performing one or more atomic flip operations to replace words while monitoring the label change given by the target classifier. In HotFlip, the directional derivatives w.r.t. the flip operations are calculated, and the flip operation that results in the largest increase in loss is selected. (While the original paper explores both character flips and word flips, we test only word flips here; the rationale is that introducing character flips to word-based target classifiers essentially changes word tokens to [unk], which creates a confound for our experiments.) TYC is similar to FGM, FGVM and DeepFool in that it also uses nearest neighbour search to map the perturbed embeddings to valid words, but instead of using the perturbed tokens directly, it uses greedy search or beam search to flip original tokens to perturbed ones one at a time, in order of their vulnerability.
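
A hedged sketch of the greedy one-at-a-time flipping loop that HotFlip and TYC share is shown below; `predict` and `rank_flips` are assumed helper functions (the two methods differ precisely in how candidate flips are ranked), not part of either method's released code.

```python
# Greedy flip loop shared (in spirit) by HotFlip and TYC: flip one token at a
# time, in order of estimated vulnerability, until the prediction changes or
# the flip budget runs out. `predict` and `rank_flips` are assumed helpers.
def greedy_flip(tokens, predict, rank_flips, max_flips=3):
    original_label = predict(tokens)
    adversarial = list(tokens)
    for position, replacement in rank_flips(adversarial)[:max_flips]:
        adversarial[position] = replacement                 # apply one atomic flip
        if predict(adversarial) != original_label:
            return adversarial                              # successful adversarial example
    return None                                             # attack failed within the budget
```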

TextFooler, the only black-box attack method tested, is a query-based method. Since it assumes no access to the full architecture of the target classifier, it learns the order of vulnerability of the tokens in an input sentence from the change in prediction scores produced by the target classifier when a specific token is discarded. Once this order is identified, similar to HotFlip and TYC, it greedily replaces tokens one at a time, in order of vulnerability, until the prediction of the target classifier changes. To ensure similarity between the adversarial examples and the original ones, the substituted tokens are selected to satisfy semantic, part-of-speech and sentence embedding similarity constraints. Note that TextFooler uses the classifier’s prediction scores to learn token vulnerability; in a more realistic black-box scenario, the classifier may only reveal the predicted labels (without showing the underlying scores associated with each label).
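
The deletion-based vulnerability ranking at the heart of this procedure can be sketched as below; `score(tokens, label)` is an assumed query function returning the classifier's probability for the original label, and the synonym substitution and similarity constraints are omitted.

```python
# Sketch of TextFooler-style vulnerability ranking by token deletion
# (assumption: `score(tokens, label)` returns the classifier's probability for
# `label`, i.e. we have query access to prediction scores). Synonym selection
# and the semantic/POS constraints are omitted.
def rank_by_vulnerability(tokens, label, score):
    base = score(tokens, label)
    drops = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]      # sentence with token i removed
        drops.append((base - score(reduced, label), i))
    # A larger score drop means a more vulnerable token; attack those first.
    return [i for _, i in sorted(drops, reverse=True)]
```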

4 Experiments

4.1 Datasets

We construct three datasets based on Yelp reviews (https://www.yelp.com/dataset) and the sentence-level Rotten Tomatoes (RT) movie reviews (http://www.cs.cornell.edu/people/pabo/movie-review-data/). For Yelp, we binarise the ratings (ratings ≥4 are set as positive and ratings ≤2 as negative) and create two datasets, keeping only reviews with ≤50 tokens (yelp50) and ≤200 tokens (yelp200). For RT (rt), we directly use the binarised dataset, which contains 5331 positive and 5331 negative tokenised sentences. We randomly partition each dataset into train/dev/test sets (90/5/5 for yelp50; 99/0.5/0.5 for yelp200; and 80/10/10 for rt). For yelp50 and yelp200, we use spaCy (https://spacy.io) for tokenisation. We train and tune the target classifiers (see Section 3.1) using the training and development sets, and evaluate their performance on the original examples in the test sets as well as on the adversarial examples generated by the attacking methods for the test sets. These datasets vary in text length (the average number of words for rt, yelp50 and yelp200 is 22, 34 and 82 respectively) and training data size (rt: 8K examples, yelp50: 407K examples, yelp200: 2M examples).
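
For illustration, the binarisation and length filtering described above can be sketched as follows, assuming the Yelp records are available as dicts with "text" and "stars" fields (the field names and the handling of 3-star reviews are assumptions, not a description of our exact preprocessing code).

```python
# Sketch of the Yelp preprocessing described above (assumed record format:
# dicts with "text" and "stars" fields). Ratings >= 4 become positive, <= 2
# negative; 3-star reviews are implicitly dropped. Reviews are then filtered
# by token length using spaCy's tokeniser.
import spacy

nlp = spacy.blank("en")  # tokeniser-only pipeline

def build_dataset(records, max_tokens=50):
    examples = []
    for record in records:
        if record["stars"] == 3:
            continue                                  # neither clearly positive nor negative
        tokens = [t.text for t in nlp(record["text"])]
        if len(tokens) <= max_tokens:                 # e.g. 50 for yelp50, 200 for yelp200
            label = 1 if record["stars"] >= 4 else 0
            examples.append((tokens, label))
    return examples
```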

4.2 Implementation Details

We use the pre-trained glove.840B.300d embeddings Pennington et al. (2014) for the first five attacking methods and the counter-fitted word embeddings Mrkšić et al. (2016) for TextFooler. For FGM, FGVM and DeepFool, we tune the overshoot hyper-parameter ϵ (Section 3.2) and keep the number of iterative steps n fixed at 5. (We search for the best ϵ within a large range, with candidate values differing by orders of magnitude, to achieve a particular attacking performance.) For TYC, besides ϵ we also tune the upper limit of flipped words, ranging from 10%–100% of the maximum length. For HotFlip and TextFooler, we tune only the upper limit of flipped words, in the range [1, 7].
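
The ϵ search mentioned above amounts to a coarse sweep over several orders of magnitude; a hedged sketch is given below, where `attack` and `evaluate` are assumed helpers rather than functions from any released code.

```python
# Coarse sweep over epsilon (candidate values orders of magnitude apart),
# keeping the first value that pushes the target classifier's accuracy on the
# resulting adversarial examples below a chosen threshold (e.g. T0/T1/T2).
# `attack` and `evaluate` are assumed helpers.
def tune_epsilon(attack, evaluate, target_acc, candidates=(0.01, 0.1, 1.0, 10.0, 100.0)):
    for eps in candidates:
        adversarial = attack(epsilon=eps)
        if evaluate(adversarial) <= target_acc:
            return eps, adversarial
    return None, None                                 # no candidate reached the threshold
```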

For the target classifiers, we tune the batch size, learning rate, number of layers, number of units, attention size (for BiLSTM+A), and filter sizes and dropout probability (for CNN). For BERT, we use the default fine-tuning hyper-parameter values except for the batch size, which we adjust based on memory consumption. Note that after the target classifiers are trained, their weights are not updated when running the attacking methods.

5 Evaluation

We propose both automatic metrics and human evaluation strategies to assess the quality of adversarial examples, based on four criteria defined in Section 1: (a) attacking performance (i.e. how well they fool the classifier); (b) textual and semantic similarity between the original input and the adversarial input; (c) fluency of the adversarial example; and (d) label preservation. Note that the automatic metrics only address the first 3 criteria (a, b and c); we contend that criterion (d) requires manual evaluation, as the judgement of whether the original label is preserved is inherently a human decision.

5.1 Automatic Evaluation: Metrics

As sentiment classification is our target task, we use the standard classification accuracy (ACC, the lower the better) to evaluate the attacking performance of adversarial examples (criterion (a)).

To assess the similarity between the original and (transformed) adversarial examples (criterion (b)), we compute BLEU scores Papineni et al. (2002) to measure word overlap; and SEM scores, cosine similarity between the representations of original examples and adversarial examples generated by the universal sentence encoder Cer et al. (2018), to measure semantic similarity. For both metrics, higher scores represent better performance.
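
These two similarity metrics can be computed roughly as follows; the sketch assumes NLTK's sentence-level BLEU with default n-gram weights and a generic `encode(text)` function standing in for the universal sentence encoder, which may differ from the exact settings we use.

```python
# Rough sketch of the two similarity metrics. Assumptions: NLTK's
# sentence-level BLEU with default n-gram weights, and a generic
# `encode(text) -> np.ndarray` function standing in for the universal
# sentence encoder.
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_score(original_tokens, adversarial_tokens):
    # One reference (the original sentence); smoothing handles short inputs.
    return sentence_bleu([original_tokens], adversarial_tokens,
                         smoothing_function=SmoothingFunction().method1)

def sem_score(original_text, adversarial_text, encode):
    # Cosine similarity between sentence embeddings.
    a, b = encode(original_text), encode(adversarial_text)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```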

To measure fluency, we first explore a supervised BERT model fine-tuned to predict linguistic acceptability Devlin et al. (2018). However, in preliminary experiments we found that BERT performs very poorly at predicting the acceptability of adversarial examples (e.g. it predicts word-salad-like sentences generated by FGVM as very acceptable), revealing the brittleness of these supervised models. We next explore unsupervised approaches Lau et al. (2017, 2020), using normalised sentence probabilities estimated by pre-trained language models for measuring acceptability. Following Lau et al. (2020), we use XLNet Yang et al. (2019) as the language model. The acceptability score of a sentence is calculated based on the normalised sentence probability: log P(s) / ((5 + |s|) / (5 + 1))^α, where s is the sentence and α is a hyper-parameter (set to 0.8) to dampen the impact of large values Vaswani et al. (2017). To measure how fluency differs between an adversarial example and the original sentence, we compute the difference in their acceptability scores, giving us the acceptability metric ACPT.
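
In code, the acceptability score and the resulting ACPT difference reduce to the small helper below; it assumes the sentence log-probability and token count have already been obtained from the language model (e.g. XLNet).

```python
# Acceptability score from the formula above: log P(s) divided by the length
# penalty ((5 + |s|) / (5 + 1)) ** alpha, with alpha = 0.8. ACPT is then the
# adversarial example's score minus the original example's score. Assumes the
# sentence log-probabilities and token counts are already computed.
def acceptability(log_prob, num_tokens, alpha=0.8):
    return log_prob / (((5 + num_tokens) / (5 + 1)) ** alpha)

def acpt(log_prob_adv, len_adv, log_prob_orig, len_orig):
    return acceptability(log_prob_adv, len_adv) - acceptability(log_prob_orig, len_orig)
```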

Note that we only compute BLEU and ACPT for adversarial examples that have successfully fooled the classifier. Our rationale is that unsuccessful examples can artificially boost these scores by not making any modifications, and so the better approach is to only consider successful examples.

5.2 Automatic Evaluation: Results

We present the performance of the attacking methods against 3 target classifiers (Table 1A; top) and on 3 datasets (Table 1B; bottom). We choose 3 ACC thresholds for the attacking performance: T0, T1 and T2, which correspond approximately to accuracy scores of 90%, 80% and 70% for the Yelp datasets (yelp50, yelp200), and 60%, 50% and 30% for the RT dataset (rt). (Most methods are capable of pushing ACC to 30% or lower, although that comes with severe degradation of the quality of the adversarial examples; we choose 30% rather than 40% for rt because, for HotFlip, flipping one word already drops accuracy to 30.1%.) Our rationale is that each method should be compared on the same basis if our focus is to provide a fair assessment of the quality of the adversarial examples, and the attacking performance constitutes a reasonable basis.

Looking at BLEU, SEM and ACPT, TextFooler is the most consistent method over multiple datasets and classifiers. HotFlip is also fairly competitive, occasionally producing better BLEU scores (CNN at T1; at T0, T1, T2 on yelp200 and T2 on rt). Gradient-based methods FGM and FGVM perform very poorly. In general, they tend to produce word salad adversarial examples, as indicated by their poor ACPT scores. DeepFool similarly generates incoherent sentences with low BLEU scores, but occasionally produces good SEM (at T0) and ACPT (BiLSTM at T1 and T2), suggesting potential brittleness of the automatic evaluation approach for evaluating semantic similarity and acceptability.

(A) Dataset: yelp50. Target classifiers (clean accuracy): BiLSTM+A 96.8 | CNN 94.3 | BiLSTM 96.6

                 |        BiLSTM+A          |           CNN            |          BiLSTM
    Attack       |  ACC  BLEU   SEM   ACPT  |  ACC  BLEU   SEM   ACPT  |  ACC  BLEU   SEM   ACPT
T0  FGM          | 93.0  34.6   17.4  -25.9 | 92.3  65.9   47.1  -16.6 | 90.5  30.0   2.3   -25.9
    FGVM         | 93.4  29.1   10.1  -17.7 | 93.4  89.0   80.0  -5.8  | 92.9  19.6   16.2  -21.4
    DeepFool     | 94.7  20.1   61.7  -20.6 | 92.7  68.6   72.8  -15.0 | 93.9  16.0   68.8  -17.8
    TYC          | 91.7  64.7   38.9  -14.4 | 90.4  59.0   41.5  -16.9 | 90.8  65.4   34.9  -13.6
    HotFlip      | 92.5  92.6   66.5  -3.7  | –     –      –     –     | 93.2  92.7   67.3  -3.5
    TextFooler   | –     –      –     –     | –     –      –     –     | –     –      –     –
T1  FGM          | 88.7  15.4   -13.1 -28.9 | 82.2  16.3   -7.1  -35.2 | 81.0  11.6   -15.6 -27.7
    FGVM         | 83.9  7.6    -24.2 -12.1 | 82.6  20.6   2.2   -34.0 | 85.4  11.4   -13.9 -15.8
    DeepFool     | 86.8  13.4   27.6  -10.1 | 84.6  17.5   33.4  -34.4 | 80.8  5.6    14.1  -0.7
    TYC          | 83.8  48.3   11.6  -18.9 | 87.6  41.2   29.4  -21.8 | 81.8  47.4   8.9   -19.0
    HotFlip      | 80.3  85.6   47.9  -7.0  | 81.5  92.5   77.1  -3.8  | 82.8  85.1   47.6  -7.0
    TextFooler   | 86.5  92.6   88.7  -1.8  | 87.7  91.9   94.2  -2.1  | 85.8  92.8   82.1  -1.6
T2  FGM          | 72.7  2.7    -33.6 -34.5 | 71.9  2.4    -30.8 -38.5 | 71.4  3.3    -26.3 -28.8
    FGVM         | 77.8  4.6    -20.9 -9.3  | 71.7  7.0    -18.7 -38.3 | 70.9  0.3    -37.0 -4.6
    DeepFool     | 72.1  3.1    -28.5 -12.6 | 70.9  5.4    -20.8 -38.3 | 72.0  2.9    6.0   0.5
    TYC          | 75.3  41.2   -7.6  -20.7 | 73.4  38.9   -15.3 -21.4 | 77.5  43.1   0.6   -19.9
    HotFlip      | 75.3  80.0   38.1  -7.8  | 70.8  84.7   63.4  -7.1  | 70.6  78.7   36.7  -9.8
    TextFooler   | 73.6  88.5   84.1  -2.9  | –     –      –     –     | 70.8  88.4   86.9  -2.77

(B) Target classifier: BiLSTM+A. Datasets (clean accuracy): yelp50 96.8 | yelp200 97.9 | rt 78.8

                 |         yelp50           |         yelp200          |            rt
    Attack       |  ACC  BLEU   SEM   ACPT  |  ACC  BLEU   SEM   ACPT  |  ACC  BLEU   SEM   ACPT
T0  FGM          | 93.0  34.6   17.4  -25.9 | 92.1  13.8   3.2   -37.8 | 66.3  4.9    -15.9 -29.1
    FGVM         | 93.4  29.1   10.1  -17.7 | 94.2  55.4   63.7  -18.3 | 66.3  25.3   13.3  -24.4
    DeepFool     | 94.7  20.1   61.7  -20.6 | –     –      –     –     | 62.2  2.1    -31.9 -5.9
    TYC          | 91.7  64.7   38.9  -14.4 | 90.3  51.2   44.5  -20.3 | 65.4  67.2   33.2  -9.3
    HotFlip      | 92.5  92.6   66.5  -3.7  | 90.8  96.4   75.6  -3.3  | –     –      –     –
    TextFooler   | –     –      –     –     | 93.8  96.3   94.4  -1.6  | –     –      –     –
T1  FGM          | 88.7  15.4   -13.4 -28.9 | 81.6  23.0   6e-3  -37.0 | –     –      –     –
    FGVM         | 83.9  7.6    -24.2 -12.1 | 80.8  17.2   10.6  -35.7 | 48.8  7.1    -11.6 -23.7
    DeepFool     | 86.8  13.4   27.6  -10.1 | 82.3  19.1   -0.4  -9.0  | 49.8  0.4    -41.7 -18.6
    TYC          | 83.8  48.3   11.6  -18.9 | 86.9  42.6   35.7  -23.7 | 50.6  46.0   18.7  -16.1
    HotFlip      | 80.3  85.6   47.9  -7.0  | 83.2  94.8   67.2  -4.1  | –     –      –     –
    TextFooler   | 86.5  92.6   88.7  -1.8  | 84.2  92.6   90.3  -3.3  | 49.9  87.7   73.0  -3.5
T2  FGM          | 72.7  2.7    -33.6 -34.5 | 72.8  6.9    14.8  -28.5 | –     –      –     –
    FGVM         | 77.8  4.6    -20.9 -9.3  | 70.6  1.5    -26.5 -5.4  | –     –      –     –
    DeepFool     | 72.1  3.1    -28.5 -12.6 | 72.2  7.7    25.2  -25.4 | –     –      –     –
    TYC          | 75.3  41.2   -7.6  -20.7 | 76.9  36.3   8.7   -23.5 | –     –      –     –
    HotFlip      | 75.3  80.0   38.1  -7.8  | 77.0  93.8   62.9  -4.7  | 30.1  89.0   64.5  -5.0
    TextFooler   | 73.6  88.5   84.1  -2.9  | 75.7  90.2   88.2  -4.5  | 33.7  82.0   69.1  -5.5

Table 1: Results based on automatic metrics. The top half (A) presents 3 different target classifiers evaluated on one dataset (yelp50); the bottom half (B) tests 3 datasets using one classifier (BiLSTM+A). For ACPT, less negative values are better. Boldface indicates the best performance for a given attacking threshold and target classifier. Missing entries (–) indicate that the method is unable to produce the desired accuracy, e.g. HotFlip with only 1 word flip produces 81.5% accuracy (T1) when attacking CNN on yelp50, and so T0 accuracy is unachievable.

Comparing the performance across different ACC thresholds, we observe a consistent pattern of decreasing scores on all metrics as the attacking strength increases from T0 to T2; the drop is especially drastic for FGM, FGVM and DeepFool when the attacking rate changes from T0 to T1. These observations suggest that all methods trade off fluency and content preservation as they attempt to generate stronger adversarial examples.

We now focus on Table 1A to understand the impact of the target classifier’s architecture. With one word flip as the upper limit for HotFlip, the accuracy of BiLSTM+A and BiLSTM drops to T0 (approximately a 4% accuracy decrease) while the accuracy of CNN drops to T1 (approximately a 13% accuracy decrease). Within the same attacking performance thresholds, FGM, FGVM and DeepFool achieve higher BLEU, SEM and ACPT scores when targeting CNN than when targeting the other two models. These observations suggest that convolutional networks are more vulnerable to attacks (noting that they are the predominant architecture in CV). Interestingly, the CNN model seems to be very robust against TextFooler on yelp50, where the accuracy stays at around 87.0% regardless of how we tune TextFooler.

Looking at Table 1B, we also find that the attacking performance is influenced by the input text length and the number of training examples for the target classifier. For HotFlip and TextFooler, we see improvements in BLEU, SEM and ACPT as text length increases from yelp50 to yelp200, indicating that the performance of these two methods is more affected by input length; we think this is because with more words it is more likely for them to find a vulnerable spot to target. For TYC, we see improvements in BLEU and ACPT as the number of training examples for the target classifier decreases from yelp200 (2M) to yelp50 (407K), indicating that TYC is less effective at attacking target classifiers trained with more data. This suggests that increasing the training data for a classifier could potentially improve its robustness against certain attacks. The effect of the number of training examples is further validated by the attacking performance of TextFooler and HotFlip on rt. Even though rt has the shortest inputs (22 tokens on average), changing one word with TextFooler and HotFlip drops accuracy from 78.8% to 49.9% and 30.1%, respectively. For the same number of word changes, the accuracy drops introduced by TextFooler and HotFlip are 10% and 4% on yelp50, and 4% and 7% on yelp200.

Another factor that possibly makes rt easier to attack is that movie reviews are more descriptive, which creates potential ambiguity in their expression of sentiment. In comparison, the restaurant reviews are more straightforward, using polarising words such as awesome or awful. The net effect is that sentiment classification is “easier” for the restaurant reviews (as the classifier can make the decision based on vocabulary choices), which in turn makes adversarial attacks “harder”.

To check how well these adversarial examples generalise to fooling other classifiers, also known as transferability, we feed the adversarial examples from the 3 best methods, i.e. TYC, HotFlip and TextFooler, to a BERT model fine-tuned for sentiment classification and measure its accuracy (Figure 1). Unsurprisingly, we observe that the attacking performance (i.e. the drop in ACC) is not as good as that reported in Table 1. Interestingly, we find that TextFooler, the best performing method, produces the least effective adversarial examples against BERT, indicating poor transferability, while examples generated by TYC perform best in fooling BERT. Manually inspecting the examples generated by TextFooler, we notice it tries to replace as few words as possible to move an example just across the decision boundary of the target classifier (indicated by very close scores between the different labels). This change appears to be very targeted at a specific classifier, and as such is unlikely to fool other classifiers.

Figure 1: Accuracy of BERT on adversarial examples generated from TYC, HotFlip and TextFooler for different target classifiers (top row) and for different datasets (bottom row). BERT’s accuracy on the original examples is denoted by the red line in each plot.

As an additional insight, we also evaluate the computational time of different attacking methods. We calculate the generation time of the three best performing methods, TYC, HotFlip and TextFooler, when they attack the test sets of rt, yelp50 and yelp200 at the T2 attacking threshold. Table 2 shows the corresponding computational time (seconds per example). We find that HotFlip achieves comparable performance with TextFooler but consumes only about a third of its computational time, while TYC is even more time-consuming than TextFooler. Moreover, the speed of HotFlip seems unaffected by increasing sentence length, whereas TextFooler and TYC take much longer to attack longer input examples.

To summarise, our results demonstrate that the best attacking methods (e.g. TextFooler and HotFlip) may not produce adversarial examples that generalise to fooling other classifiers. We also saw that convolutional networks are generally more vulnerable than recurrent networks, and that dataset features such as text length and training data size can influence the performance of adversarial attacks.

5.3 Human Evaluation: Design

Automatic metrics provide a proxy to quantify the quality of the adversarial examples. To validate that these metrics work, we conduct a crowdsourcing experiment on Figure-Eight (https://www.figure-eight.com/). Recall that the automatic metrics do not assess sentiment preservation (criterion (d)); we evaluate that aspect here.

Methods       rt      yelp50   yelp200
TYC           –       1.2      13.8
HotFlip       0.06    0.3      0.6
TextFooler    0.5     1.0      8.0

Table 2: Computational time (seconds per example) of TYC, HotFlip and TextFooler when attacking different datasets (“–” marks a setting where the method did not reach the T2 threshold).

We experiment with the 3 best methods (TYC, HotFlip and TextFooler) at 2 accuracy thresholds (T0 and T2), using BiLSTM+A as the classifier. For each method and threshold, we randomly sample 25 positive-to-negative and 25 negative-to-positive examples. To control for quality, we reserve and annotate 10% of the samples ourselves as control questions. Workers are first presented with 10 control questions as a quiz, and only those who pass the quiz with at least 80% accuracy can continue to work on the task. We display 10 questions per page, where one control question is embedded to continuously monitor worker performance (workers are required to maintain 80% accuracy throughout the annotation process, and the task is designed such that each control question can only be seen once per worker). We restrict our jobs to workers in the United States, the United Kingdom, Australia and Canada.

To evaluate the criteria (b) textual similarity, (c) fluency, and (d) sentiment preservation, we ask the annotators three questions:

  1. Is snippet B a good paraphrase of snippet A?
     ○ Yes    ○ Somewhat yes    ○ No

  2. How natural does the text read?
     ○ Very unnatural    ○ Somewhat natural    ○ Natural

  3. What is the sentiment of the text?
     ○ Positive    ○ Negative    ○ Cannot tell

For question 1, we display both the adversarial input and the original input, while for questions 2 and 3 we present only the adversarial example. As a baseline, we also run a survey on questions 2 and 3 for 50 random original (unperturbed) samples.

Figure 2: Human evaluation results for (a) original examples, (b) ACC threshold T0, and (c) ACC threshold T2.

5.4 Human Evaluation: Results

We present the percentage of answers to each question in Figure 2. The green bars illustrate how well the adversarial examples paraphrase the original ones; the blue bars how natural the adversarial examples read; and the red bars whether the sentiment of the adversarial examples is consistent with that of the originals.

Looking at the results for the original sentences (“(a) Original examples”), we see that their language is judged largely fluent and their sentiment generally agrees with the original labels.

On content preservation (criterion (b); green bars), all methods produce poor paraphrases on yelp50 except for TextFooler.

Next we look at fluency (criterion (c); blue bars). We see a similar trend: as the attacking performance increases, the readability of the adversarial examples generated by the different attacking methods deteriorates. HotFlip is fairly competitive, producing adversarial examples that are only marginally less fluent than the originals at T0. At T2, however, it begins to trade off fluency. TextFooler, on the other hand, achieves even slightly better fluency than the original examples at T2, indicating that the substituted tokens fit the context very well.

Lastly, we consider sentiment preservation (criterion (d); red bars). All methods trade off sentiment preservation to achieve better attacking performance. Again, HotFlip and TextFooler are the better methods here (interestingly, we observe an increase in agreement as their attacking performance increases from T0 to T2).

Comparing the automatic and human evaluation results, we find that the SEM scores are generally consistent with the human judgements of semantic preservation, while BLEU is less effective, as evidenced by HotFlip’s high BLEU scores on yelp50 despite its poor paraphrasing performance in the same settings. ACPT appears to be a solid metric for evaluating fluency, as we see good agreement with the human evaluation across the three attacking methods.

Summarising our findings, TextFooler is generally the best method across all criteria, although its adversarial examples have the poorest transferability. HotFlip produces results comparable to TextFooler on the four criteria and similarly suffers from poor transferability. TYC generates adversarial examples with better transferability but does not do well in terms of content preservation and fluency. In terms of computational time, HotFlip is the most efficient, consuming less than a third of the time required by the other two methods; the difference is more pronounced for longer input sentences. FGM, FGVM and DeepFool perform very poorly, as they largely sacrifice example quality to achieve attacking performance, indicating that directly mapping perturbed word embeddings back to words is not well suited to NLP. All said, we found that all methods tend to trade off sentiment preservation for attacking performance, revealing that these methods in a way “cheat” by simply flipping the sentiments of the original sentences to fool the classifier; the resulting adversarial examples might therefore be ineffective for adversarial training, as they are not examples that reveal genuine vulnerabilities in the classifier.

6 Conclusion

We propose an evaluation framework for assessing the quality of adversarial examples in NLP, based on four criteria: (a) attacking performance; (b) textual similarity; (c) fluency; and (d) label preservation. Our framework involves both automatic and human evaluation, and we test 6 benchmark methods spanning white-box and black-box attacks. We found that the architecture of the target classifier is an important factor in attacking performance, e.g. CNNs are more vulnerable than LSTMs. Data features such as text length and input domain also affect how difficult it is to perform an adversarial attack. Lastly, we observe in our human evaluation that on short texts expressing clear positive or negative sentiment (such as yelp50), these methods produce adversarial examples that tend not to preserve semantic content and have low readability. More importantly, these methods “cheat” by simply flipping the sentiment in the adversarial examples; this behaviour is especially evident on the yelp50 dataset, suggesting they could be ineffective for adversarial training.

References

  • Carlini and Wagner (2017) Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE.
  • Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder.
  • Chen et al. (2017) Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. 2017. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dong et al. (2018) Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. 2018. Boosting adversarial attacks with momentum. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9185–9193.
  • Ebrahimi et al. (2017) Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. HotFlip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751.
  • Gao et al. (2018) Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pages 50–56. IEEE.
  • Gong et al. (2018) Zhitao Gong, Wenlu Wang, Bo Li, Dawn Song, and Wei-Shinn Ku. 2018. Adversarial texts with gradient methods. arXiv preprint arXiv:1801.07175.
  • Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Jin et al. (2019) Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2019. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • Kurakin et al. (2016) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2016. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.
  • Lau et al. (2020) Jey Han Lau, Carlos S. Armendariz, Shalom Lappin, Matthew Purver, and Chang Shu. 2020. How furiously can colourless green ideas sleep? Sentence acceptability in context. arXiv e-prints, page arXiv:2004.00881.
  • Lau et al. (2017) Jey Han Lau, Alexander Clark, and Shalom Lappin. 2017. Grammaticality, Acceptability, and Probability: A Probabilistic View of Linguistic Knowledge. Cognitive Science, 41:1202–1241.
  • Li et al. (2018) Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. 2018. Textbugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271.
  • Moosavi-Dezfooli et al. (2016) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. 2016. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE CVPR, pages 2574–2582.
  • Mrkšić et al. (2016) Nikola Mrkšić, Diarmuid O Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. arXiv preprint arXiv:1603.00892.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th ACL, pages 311–318, Philadelphia, Pennsylvania, USA.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP 2014, pages 1532–1543.
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
  • Tsai et al. (2019) Yi-Ting Tsai, Min-Chu Yang, and Han-Yu Chen. 2019. Adversarial attack on sentiment classification. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 233–240.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.
  • Wang et al. (2019) Wenqi Wang, Lina Wang, Run Wang, Zhibo Wang, and Aoshuang Ye. 2019. Towards a robust deep neural network in texts: A survey.
  • Wang et al. (2016) Yequan Wang, Minlie Huang, Li Zhao, et al. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615.
  • Xiao et al. (2018) Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. 2018. Generating adversarial examples with adversarial networks. arXiv preprint arXiv:1801.02610.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.