Word Shape Matters:
Robust Machine Translation with Visual Embedding

Haohan Wang Peiyan Zhang Eric P. Xing
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA, USA

Abstract

Neural machine translation has achieved remarkable empirical performance over standard benchmark datasets, yet recent evidence suggests that the models can still fail easily when dealing with substandard inputs such as misspelled words. To overcome this issue, we introduce a new encoding heuristic of the input symbols for character-level NLP models: it encodes the shape of each character through the images depicting the letters when printed. We name this new strategy visual embedding, and it is expected to improve the robustness of NLP models because humans also process the corpus visually through printed letters, rather than through machine-oriented one-hot vectors. Empirically, our method improves models’ robustness against substandard inputs, even in the test scenario where the models are tested with noises beyond what is available during the training phase.

1 Introduction

Despite the remarkable empirical successes Neural Machine Translation (NMT) has achieved over various benchmark datasets, models tested in real-world scenarios soon alarmed the community about the lack of robustness behind the appealing high numbers. For example, evidence shows that the impressive performances can hardly be met when the models are tested with out-of-domain or noisy data Luong and Manning (2015); Belinkov and Bisk (2017); Anastasopoulos et al. (2019), while these datasets barely raise any difficulty for human translators.

This robustness disparity between humans and NMT has motivated multiple recent works aiming to improve the resilience of NMT systems against either natural or synthetic noises Belinkov and Bisk (2017); Zhou et al. (2019); Vaibhav et al. (2019); Sano et al. (2019); Levy et al. (2019). For example, most preceding works emphasized the importance of training with noisy inputs (i.e., adversarial training) Belinkov and Bisk (2017); Vaibhav et al. (2019); Levy et al. (2019), which is further extended by Sano et al. (2019) to perturb the intermediate representations other than the input data.

Different from the previous efforts along this line, we notice that previous NMT models mostly process the corpus through artificial one-hot vectors of words or characters, while humans read through eyes, analyzing the shape of the printed letters. We conjecture that this disparity is the root of the notorious vulnerability of many NMT models towards substandard words, such as misspellings and elongated words, which humans can read through effortlessly.

Therefore, in this paper, building around the central argument

Humans read through eyes; so shall the models¹

¹ Assuming the models are expected to show human-level resilience towards substandard inputs.

we introduce a simple alternative to the one-hot character encoding mechanism for character-level models: we encode the shape of the symbols. This encoding of the letter shape, which we refer to as visual embedding (VE), is obtained as a dimension-reduced representation of the images depicting the characters. The embedding exhibits remarkably robust behavior against input noises in comparison to the previous standard. Our embedding also comes with a minor advantage: it is even smaller than the typical one-hot embedding of characters, thus helping to reduce the input size of the models.

The remainder of this paper is organized as follows. We first discuss the related works in Section 2, and then introduce our visual embedding (VE) and the corresponding algorithms in Section 3. We demonstrate the empirical strength of our method in Section 4, where we first verify the concept with synthetic data, and then compare against previous popular methods. We finally offer some discussion in Section 5 before we conclude our paper in Section 6.

2 Related Work

The vulnerability of modern neural networks towards human-imperceptible input variations has been studied since Szegedy et al. (2013), primarily in the computer vision community (e.g., Goodfellow et al. (2015)), and was later extended to the NLP community (e.g., Ebrahimi et al. (2017); Liang et al. (2017); Yin et al. (2020); Jones et al. (2020); Jia et al. (2019); Huang et al. (2019); Liu et al. (2019); Pruthi et al. (2019)). Recent studies suggest that the fragility of neural networks is rooted in the fact that the data contains multiple signals that can reduce the empirical risk: when a model is forced to reduce the training error, it picks up whatever information diminishes the empirical loss, regardless of whether the learnt knowledge aligns with human perception Wang et al. (2019b). This connects the adversarial robustness problem with the problem of bias in data, which has also been studied for a while (e.g., Wang et al. (2016); Goyal et al. (2017); Kaushik and Lipton (2018); Wang et al. (2019a)).

NMT, despite the high empirical numbers on various leaderboards, complicated architecture design, and even occasional human-parity claims Wu et al. (2016); Hassan et al. (2018), is fundamentally a statistical task in which a model is asked to search for the patterns that reduce the empirical loss on training data and is evaluated numerically on testing data. Thus, one may expect the models to “cheat” through distribution-specific signals for high scores and to behave differently from humans when tested more thoroughly Läubli et al. (2018).

As expected, Belinkov and Bisk showed that some noises, either synthetic or natural, can easily break NMT models Belinkov and Bisk (2017), while humans can process the noised texts effortlessly. Fortunately, after revealing this pitfall, they also proposed a simple heuristic as a remedy Belinkov and Bisk (2017): augmenting the training samples with noises can significantly improve the robustness of these models against noises at test time. They also showed that the noises injected at training time and testing time have to be aligned to maximize the performance of this augmentation method.

This augmentation method, also known as a member of the adversarial training family, has become the main force in the battle against the vulnerability of NMT models Levy et al. (2019); Sano et al. (2019); Cheng et al. (2019). For example, Levy et al. improved the robustness against character-level variations of the source language, such as typos Levy et al. (2019). They experimented with a transformer-based MT model with CNN character encoders, set up the experiment so that natural noises are only injected at testing time (so the evaluation result is a more natural reflection of the real world), and observed that injecting synthetic noises during training helps to improve the robustness against natural noises at testing.

There are also other emerging forces against the fragility of NMT models. For example, inspired by a popular psychological observation suggesting that humans are invariant to jumbled letters², Belinkov and Bisk attempted to use the average embedding of the characters composing the word Belinkov and Bisk (2017), and Sakaguchi et al. introduced an architecture that predicts the original word back from the jumbled words Sakaguchi et al. (2017). Recently, along another direction, Zhou et al. demonstrated the resilience of a multitask learning mechanism against input variations Zhou et al. (2019).

² Most readers should be able to process the following text even though nearly all the words are jumbled: “Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae.” For more details, we refer readers to the online article linked in the next footnote.

Key Differences: Our work is also inspired by the psychological observation discussed above. However, instead of explicitly regularizing the models to fix what has been revealed, we ask what a fundamental disparity between a machine and a human could be. Our answer, and the main argument of this paper, is that humans read through eyes; thus the shape of the word matters more than the exact permutation of its constituent letters. Interestingly, we notice that our argument is partially supported by an online post written by a psycholinguist commenting on the aforementioned observation³. Therefore, we design a new input regime that can encode the shape of the letters, and expect that the shape of the words will be more accurately described by our technique.

³ www.mrc-cbu.cam.ac.uk/people/matt-davis/cmabridge/

3 Method

We first introduce the procedure that generates the visual embedding (VE) describing the shape of the letters. The VE is expected to work with any existing character-level NLP model with no extra effort, so we will only mention the corresponding neural architecture briefly.

3.1 Visual Embedding

With a predetermined set of characters $\mathcal{C}$, we first choose a font (e.g., Times New Roman) and decide the dimension of the image depicting a letter (e.g., $15\times 15$ for each letter). Then, for every character in $\mathcal{C}$, we print it onto an image, which naturally offers us an embedding of the character (e.g., a vector of length $225$, following our previous configuration). Further, to have a more memory-efficient representation, we collect the embeddings of all the characters in $\mathcal{C}$ into a matrix of representations (e.g., a matrix of $|\mathcal{C}|\times 225$, following the previous configuration, where we use $|\cdot|$ to denote the cardinality of the set) and use PCA to map the matrix to a lower dimension $d$, which can be straightforwardly determined by inspecting the variance explained (details to follow). Therefore, we obtain a $d$-dimensional vector for each character, which we name the VE of the character. Algorithm 1 offers more formal details of this procedure.

Determining the output dimension $d$ by examining ratio of explained variances.

As the effective dimension of VE is much smaller than the dimension of the image depicting the letters, we use PCA to reduce the dimension to a smaller dimension $d$ .

We use a simple heuristic to decide an appropriate $d$ : we plot the ratio of the explained variance of the first $d$ components calculated by PCA for every $d$ , then we manually inspect the plot to select a cut-off where the first $d$ components can sufficiently explain a large ratio (e.g., $95\%$ ) of the variance.

The ratio of explained variance can be conveniently calculated by the ratio of the sum of the first $d$ eigenvalues over the sum of all the eigenvalues, where eigenvalues are calculated with the corresponding covariance matrix, thus guaranteed to be non-negative.

There are also other algorithms that can choose an appropriate $d$ automatically. However, since PCA at this scale of data can be calculated in negligible time on a modern computer, and it only needs to be calculated once, we consider the plot-and-check procedure as the main method to select an appropriate $d$. Also, the manual check may be more reliable than automatic algorithms.
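As a rough illustration of the explained-variance heuristic described above, the following sketch computes the cumulative ratio from the eigenvalues of the covariance matrix and returns the smallest $d$ that reaches a chosen threshold. The function names and the use of NumPy are assumptions made for this sketch, and the automatic cut-off merely stands in for the manual plot-and-check step.

```python
import numpy as np

def explained_variance_ratio(R):
    """Cumulative ratio of variance explained by the first d components, for every d.

    R is a (number of characters) x (number of pixels) matrix of flattened images.
    """
    centered = R - R.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (R.shape[0] - 1)
    # eigvalsh returns eigenvalues in ascending order; reverse to descending.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()

def smallest_d(R, threshold=0.95):
    """Smallest d whose first d components explain at least `threshold` of the variance."""
    ratios = explained_variance_ratio(R)
    return int(np.argmax(ratios >= threshold)) + 1
```

In practice one would still plot the returned ratios against $d$ and inspect the curve, as the text recommends.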

Input: set of characters $\mathcal{C}$, font $T$, image dimension $m\times n$, output dimension $d$;
Output: embedding of the shape $|\mathcal{C}|\times d$ (in other words, a length-$d$ vector for each character in the set);
Initialize a matrix $\mathbf{R}$ with shape $|\mathcal{C}|\times mn$;
for every element $c$ in $\mathcal{C}$ do
    Initialize a blank image $\mathbf{I}_{c}$ of the size $m\times n$;
    Print $c$ with font $T$ onto $\mathbf{I}_{c}$;
    Reshape the image into a vector $\mathbf{v}_{c}$ of dimension $mn$;
    Collect $\mathbf{v}_{c}$ into the corresponding row of $\mathbf{R}$;
end for
Perform PCA to project $\mathbf{R}$ into a lower dimension matrix of the shape $|\mathcal{C}|\times d$

Algorithm 1: Generating the visual embedding
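A minimal sketch of Algorithm 1 in Python follows. It assumes the character images have already been rendered with the chosen font by some imaging library (the rendering step is outside this sketch), and it implements PCA through a singular value decomposition. The function name `visual_embedding` and the tiny $3\times 3$ bitmaps are made up for illustration and are not taken from the paper.

```python
import numpy as np

def visual_embedding(bitmaps, d):
    """Map each character to a length-d vector via PCA over its flattened bitmap.

    bitmaps: dict {character: 2-D array of shape (m, n)}, assumed to be
             rendered beforehand with the chosen font.
    """
    chars = sorted(bitmaps)
    # R has one row per character and m*n columns (the matrix R of Algorithm 1).
    R = np.stack([np.asarray(bitmaps[c], dtype=float).reshape(-1) for c in chars])
    if d > min(R.shape):
        raise ValueError("d cannot exceed min(number of characters, m*n)")
    centered = R - R.mean(axis=0, keepdims=True)
    # PCA via SVD: rows of Vt are principal directions, ordered by variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    Z = centered @ Vt[:d].T  # shape: |C| x d
    return {c: Z[i] for i, c in enumerate(chars)}

# Hypothetical 3x3 bitmaps standing in for rendered glyphs.
toy_bitmaps = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "l": [[0, 1, 0], [0, 1, 0], [0, 1, 1]],
    "O": [[1, 1, 1], [1, 0, 1], [1, 1, 1]],
}
ve = visual_embedding(toy_bitmaps, d=2)
```

In this toy example the visually similar "I" and "l" end up closer to each other than either is to "O", which is the property the embedding is meant to capture.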

3.2 Training with Visual Embedding

In simple words, to use VE, all one needs is to replace the traditional one-hot character input with VE.

Formal Discussion

The data is a pair of collections of $n$ sentences, denoted as $\{\mathbf{X},\mathbf{Y}\}$, where $\mathbf{X}$ denotes the source sentences and $\mathbf{Y}$ denotes the target sentences. The $i$-th sample of $\mathbf{X}$ (or $\mathbf{Y}$) is denoted as $\mathbf{x}_{i}$ (or $\mathbf{y}_{i}$), which consists of a sequence of characters, denoted as $\{x_{i}^{1},x_{i}^{2},\dots,x_{i}^{l(i)}\}$ (or $\{y_{i}^{1},y_{i}^{2},\dots,y_{i}^{k(i)}\}$). $l(i)$ (or $k(i)$) denotes the number of characters of the $i$-th source (or target) sentence. Sentences do not necessarily have the same length.

Through the VE technique, we will map the source sentence $\mathbf{X}$ (or $\mathbf{x}$, $x$) into the representation $\mathbf{Z}$ (or $\mathbf{z}$, $z$, respectively). We use $e(\cdot;\theta)$ to denote the encoder, with $\theta$ denoting its parameters, and $d(\cdot;\phi)$ to denote the decoder, with $\phi$ denoting its parameters. The training of our NMT model is to optimize

$$\hat{\theta},\hat{\phi}=\operatorname*{arg\,min}_{\theta,\phi}\mathbb{E}_{\mathbf{x},\mathbf{y}}\,l(d(e(v(\mathbf{x});\theta);\phi),\mathbf{y})$$

where $l(\cdot,\cdot)$ is a generic loss function and $v(\cdot)$ stands for the function mapping characters into VE, so that $\mathbf{z}=v(\mathbf{x})$.
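To make the role of $v(\cdot)$ concrete, the sketch below contrasts the conventional one-hot input with a VE lookup for the same sentence. The two-dimensional table values and the three-symbol alphabet are hypothetical; in practice the table would come from the procedure of Section 3.1.

```python
import numpy as np

def one_hot_encode(sentence, alphabet):
    """Conventional input: one row per character, one column per symbol."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    encoded = np.zeros((len(sentence), len(alphabet)))
    for pos, ch in enumerate(sentence):
        encoded[pos, index[ch]] = 1.0
    return encoded

def ve_encode(sentence, ve_table):
    """v(x): look up the visual embedding of each character of the sentence."""
    return np.stack([ve_table[ch] for ch in sentence])

# Hypothetical 2-dimensional VE table for a three-symbol alphabet.
alphabet = "ab "
ve_table = {
    "a": np.array([0.3, -1.2]),
    "b": np.array([0.1, 0.9]),
    " ": np.array([0.0, 0.0]),
}

x = "ab ba"
print(one_hot_encode(x, alphabet).shape)  # (5, 3)
print(ve_encode(x, ve_table).shape)       # (5, 2)
```

Only the input layer changes; the encoder and decoder consume the resulting matrix exactly as they would consume the one-hot matrix, up to the input width.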

Model Specification:

As our VE mainly concerns the character-level input, we discuss one corresponding network architecture and the associated hyperparameters. Although the vectors can be integrated into almost any character-level model, the model used in this paper builds upon a transformer-based machine translation model Vaswani et al. (2017), with a CNN-based character encoder Kim et al. (2016) serving as the encoder, based on the Fairseq implementation Ott et al. (2019).

4 Experiments

We first use a synthetic experiment to validate the effectiveness of VE for simple robust text classification, which is also a territory where we can discuss related questions such as the choice of font and the choice of the effective dimension $d$ of VE. Then we demonstrate the empirical strength of our method with standard MT benchmarks when the test sentences are perturbed with noises.

4.1 Synthetic Experiment

We first prove the concept of VE with a simple binary text classification task in which the test sequences contain noises that are not seen by the models during training.

Experiment Setup

We generate the data by first sampling some positive “words” and negative “words”, where each word is a sequence of three random letters. Each sample is a “sentence” of three “words”, and the label (negative or positive) is determined by the majority polarity of the “words”. Following this rule, we sampled 15 “words” for each category and generated roughly 180k samples for training and 60k samples for validation. Further, in addition to the samples from the same distribution, we also generate two distributions of noised test samples: one mixes random characters into the samples generated with the above rule; the other replaces a “word” with a random sequence of three letters when the label of the “sentence” can still be determined by the remaining two “words”. Altogether, there are 60k samples for testing.
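The data-generating rule above can be sketched as follows. This is only an illustration under stated assumptions: the lowercase ASCII alphabet, the uniform choice of polarities, and the number of inserted characters are not specified in the text, and all function names are hypothetical.

```python
import random
import string

LETTERS = string.ascii_lowercase  # assumed alphabet

def random_word(rng, length=3):
    return "".join(rng.choice(LETTERS) for _ in range(length))

def make_vocabulary(rng, per_class=15):
    """Sample disjoint sets of positive and negative three-letter 'words'."""
    words = set()
    while len(words) < 2 * per_class:
        words.add(random_word(rng))
    words = sorted(words)
    rng.shuffle(words)
    return words[:per_class], words[per_class:]

def make_sample(rng, positive, negative):
    """Return (words, polarities, label); the label is the majority polarity."""
    polarities = [rng.randint(0, 1) for _ in range(3)]
    words = [rng.choice(positive if p else negative) for p in polarities]
    label = int(sum(polarities) >= 2)
    return words, polarities, label

def replace_word_noise(rng, words, polarities):
    """Replace one word by random letters while keeping the label decidable.

    A word may be replaced only if the other two words share a polarity,
    so the majority label is unchanged.
    """
    safe = [i for i in range(3)
            if len({polarities[j] for j in range(3) if j != i}) == 1]
    i = rng.choice(safe)
    noisy = list(words)
    noisy[i] = random_word(rng)
    return noisy

def insert_char_noise(rng, sentence, count=1):
    """Mix `count` random characters into a sentence string."""
    chars = list(sentence)
    for _ in range(count):
        chars.insert(rng.randint(0, len(chars)), rng.choice(LETTERS))
    return "".join(chars)
```

Because any three binary polarities contain at least two that agree, `replace_word_noise` always finds a word it can replace without changing the label.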

The model we consider is inspired by the CNN-LSTM architecture Vosoughi et al. (2016), and we also implement the highway network Srivastava et al. (2015). With this model, we compare the performance when the input is a one-hot vector against the performance when the input is VE.

Results

The results are shown in Table 1, where we report the performance of one-hot embedding and the results of VE over four different fonts and nine different choices of the effective dimension $d$. First of all, we can observe the non-robustness issue of the conventional one-hot embedding, as its test accuracy is significantly lower than its train accuracy. However, this issue is not observed for VE. Further, we can see that the choice of font barely matters for the end performance; we believe this is mainly because all the fonts, despite their visual differences, still essentially encode the shape of the characters. Similarly, we can observe that the choice of the effective dimension $d$ barely matters for the end performance. Given this evidence, the following experiments in this paper use Times New Roman as the font and set $d=128$.

Fonts

Additionally, to have a more thorough understanding of the choice of font, we further plot the layout of the VE with t-SNE in Figure 1. Although the layouts appear distinct across fonts, the neighborhood structure roughly agrees. For example, the cluster of ‘C’, ‘O’, ‘Q’, and ‘G’, as well as the cluster of ‘e’, ‘o’, and ‘c’, are preserved reasonably well across multiple fonts. Interestingly, we can see that the embedding introduced by the font “Comic Sans MS” deviates the most from the layouts of the other fonts; this deviation should be expected, as letters printed with “Comic Sans MS” also look the most different from those of the other fonts.

We also investigate whether our simple heuristic for determining the dimension $d$ is affected significantly by the font. Our analysis suggests that there are barely any differences across fonts in the relationship between $d$ and the ratio of explained variance.

$d$	One-hot		Visual embedding
			Arial		Comic Sans MS		Times New Roman		Verdana
	train	test	train	test	train	test	train	test	train	test
5	0.9784	0.8345	0.9453	0.9993	0.9224	0.9978	0.9433	0.9986	0.9269	0.9987
10			0.9588	0.9996	0.9460	0.9990	0.9583	0.9996	0.9506	0.9994
20			0.9389	0.9998	0.9427	0.9994	0.9549	0.9999	0.9606	0.9996
30			0.9699	0.9994	0.9591	0.9995	0.9460	0.9999	0.9565	0.9990
40			0.8842	0.9977	0.9707	0.9999	0.9450	0.9999	0.9548	0.9989
50			0.968	0.9997	0.9478	0.9999	0.9552	0.9998	0.9473	0.9999
60			0.9575	1.0000	0.9668	0.9999	0.9596	0.9999	0.9489	0.9999
70			0.9453	1.0000	0.9454	0.9999	0.9639	0.9998	0.9562	0.9997
80			0.9588	0.9998	0.9597	0.9998	0.9637	0.9999	0.9731	0.9999

Table 1: Synthetic experiments with simple robust text classification when VE is tested with different font choices and dimensions. Classification accuracy is reported.

4.2 Robust Machine Translation

We proceed to examine the robustness of an NMT model trained over benchmark datasets against variations of inputs at test time. Following previous works Belinkov and Bisk (2017); Levy et al. (2019), we mainly use the IWSLT 2016 machine translation benchmark Junczys-Dowmunt and Birch (2016), with the test data polluted with various noises. We consider three language pairs: German-English (de-en), French-English (fr-en), and Czech-English (cs-en).

4.2.1 Competing Methods

As the VE is a generic method of encoding the characters of a corpus, our method is expected to work with most, if not all, other character-level models. Therefore, we mainly test the baseline and the adversarial training methods, and compare them to the same methods when the input is VE. To be specific, we study the following methods:

• Base: The character-level transformer described in Section 3.2.
• Base-V: The Base method when the input is VE.
• ADT: The input is the conventional one-hot embedding, but the training data is augmented with synthetic noises, as done by Belinkov and Bisk (2017); Levy et al. (2019). This process is often discussed as adversarial training. In particular, we have the following five methods:
  – ADT_D: the synthetic noise deletes a random character.
  – ADT_I: the synthetic noise inserts a random character at a random position.
  – ADT_R: the synthetic noise replaces a random character with another one.
  – ADT_S: the synthetic noise swaps two random adjacent characters.
  – ADT_A: the synthetic noise is the combination of all the above synthetic noises.
• ADT-V: The corresponding methods when the input is VE; accordingly, we have ADT_D-V, ADT_I-V, ADT_R-V, ADT_S-V, and ADT_A-V.

Therefore, we tested 12 methods altogether. Following Levy et al. (2019), the synthetic noises are only added to the training data; the validation data remain untouched. Also, we follow the hyperparameter setup in Levy et al. (2019) and choose the probability $p=10\%$ for injecting synthetic noises.
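The four synthetic noise operations can be sketched as below. Several details are assumptions made for illustration rather than facts from the paper: noise is applied independently to each whitespace-separated word with probability $p$, the replacement alphabet is lowercase ASCII, and the combined setting (`"A"`) picks one of the four operations at random for each perturbed word.

```python
import random
import string

ALPHABET = string.ascii_lowercase  # assumed replacement alphabet

def delete_char(rng, word):
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]

def insert_char(rng, word):
    i = rng.randint(0, len(word))  # any gap, including both ends
    return word[:i] + rng.choice(ALPHABET) + word[i:]

def replace_char(rng, word):
    if not word:
        return word
    i = rng.randrange(len(word))
    choices = [c for c in ALPHABET if c != word[i]]
    return word[:i] + rng.choice(choices) + word[i + 1:]

def swap_chars(rng, word):
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

NOISES = {"D": delete_char, "I": insert_char, "R": replace_char, "S": swap_chars}

def add_noise(rng, sentence, kind, p=0.10):
    """Perturb each word with probability p.

    kind is "D", "I", "R", "S", or "A" (a random one of the four per word).
    """
    out = []
    for word in sentence.split():
        if rng.random() < p:
            op = NOISES[rng.choice("DIRS")] if kind == "A" else NOISES[kind]
            word = op(rng, word)
        out.append(word)
    return " ".join(out)
```

For example, `add_noise(random.Random(0), "ab cd", "S", p=1.0)` returns `"ba dc"`, since each two-letter word has only one adjacent pair to swap.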

4.2.2 Test Data with Noise

Following precedents Belinkov and Bisk (2017); Levy et al. (2019), we consider two different situations for injecting noises into the test data: the natural noise case and the synthetic noise case.

Natural Noise

We evaluate the models when natural noises are injected into the test data. We follow Belinkov and Bisk (2017) to add natural noises by leveraging collections of frequently misspelled words and replacing the words in test sentences with the misspelled ones. For each source language, we have:

• French: Wikipedia Correction and Paraphrase Corpus Max and Wisniewski (2010).
• German: a combination of the RWSE Wikipedia Revision Dataset Zesch (2012) and the MERLIN corpus of language learners Wisniewski et al. (2013).
• Czech: manually annotated essays written by non-native speakers Šebesta et al. (2017).

Synthetic Noise

We also evaluate the models when different degrees of synthetic noises are injected into the test data. However, as the ADT family models are trained with synthetic noises injected into the text with probability $p=10\%$, we can intuitively expect these models to excel at test time when the test sentences are perturbed with a similar degree of noise, even when these models are not in fact resilient towards variations in a broader scope. To avoid such inaccurate evaluations, we conduct a more comprehensive testing regime: for each model, we test it with all the different kinds of synthetic noises, and report the average score of these test performances. Therefore, even if a model has seen a certain noise pattern during the training phase, it may still be fragile towards other noise patterns. Further, we also test beyond the setup where models are trained with the synthetic noises injected with probability $p$: we additionally test the scenarios where the test sentences are perturbed with probability $2p$ and $3p$. For convenience of further discussion, we refer to the probability with which we inject synthetic noises as the “noise-level” (NL).
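The testing regime above amounts to averaging a model's score over the noise kinds at each noise-level. A minimal sketch follows, where `score_fn` is a hypothetical callable that perturbs the test set with a given noise kind and noise-level, translates it, and returns the resulting score; nothing about its internals is taken from the paper.

```python
from statistics import mean

def averaged_scores(score_fn, kinds=("D", "I", "R", "S"), p=0.10, multiples=(1, 2, 3)):
    """Average a test score over all noise kinds at noise-levels p, 2p, 3p.

    Returns a dict keyed by the multiple m (1, 2, 3), i.e. noise-level m * p.
    """
    return {m: mean(score_fn(kind, m * p) for kind in kinds) for m in multiples}
```

A model that only memorized one noise pattern at level $p$ would score well on a single (kind, level) pair but not on the averages this function reports.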

4.2.3 Results

Natural Text (Original Text and Natural Noise)

We first evaluate the performances of these models on natural texts that can appear in the real world, including the original test texts and the misspelled texts generated by replacing words with their frequently misspelled counterparts. Table 2 shows the results, where natural texts are reported in the first two columns.

Overall, we can see that the best performances in these two categories are both obtained by methods with VE, although the best-performing method differs between the two categories. In particular, we notice that ADT_S-V behaves reasonably well: it outperforms all the methods without VE on both the original text and the natural noises.

If we compare the performance of each method with that of its VE counterpart, the improvement of VE is best represented in the Base method: Base-V almost maintains the performance of Base on the original text and shows an improvement on the text with natural noises. Interestingly, we did not observe a clear change-of-performance pattern when the ADT family models adopt VE. In particular, some synthetic noises (deletion) hurt the convergence of Base, and VE can recover (and improve) the performance, while some other synthetic noises (insertion and substitution) improve the performance over Base, but VE degrades the performance. We ran the experiments repeatedly and found this performance pattern to be stable.

Method	Original text	Natural noise	Synthetic noise, NL = $p$	Synthetic noise, NL = $2p$	Synthetic noise, NL = $3p$	STD
Base	57.35	28.86	51.49	46.39	41.60	10.82
Base-V	57.07	33.09	53.68	48.53	44.02	9.36
ADT_D	48.60	38.32	43.31	40.64	38.05	4.36
ADT_D-V	63.55	51.01	49.47	45.88	42.90	7.92
ADT_I	60.84	50.13	51.70	39.73	37.97	9.38
ADT_I-V	49.19	39.61	41.48	49.62	47.85	4.66
ADT_R	58.94	47.77	52.95	40.25	38.65	8.53
ADT_R-V	50.41	40.98	42.11	50.92	48.89	4.75
ADT_S	60.66	50.15	51.78	47.92	45.08	5.90
ADT_S-V	63.26	52.78	51.06	49.04	46.55	6.43
ADT_A	57.36	48.72	48.73	41.51	40.54	6.79
ADT_A-V	49.65	42.72	42.33	47.44	46.10	3.12

Table 2: Test performances of French-English translation. Models with one-hot embedding and their VE counterparts are reported together. Performances are reported on text that can appear naturally (original text and natural noise) and on text perturbed with synthetic noises at three different noise levels (NL). The standard deviation (STD) of each row is also reported.

Synthetic Noises

Further, we inspect the models’ robustness when the test sentences are perturbed by synthetic noises. Results are also shown in Table 2.

First of all, when we test the models with a more stringent setup, we get a message that counters the previous beliefs: our results indicate that training with synthetic noises alone does not necessarily increase the models’ robustness against synthetic noises at test time. In particular, although models trained with synthetic noises can outperform Base when the noise-level is $p$ , the advantage of these models can be barely observed if we simply increase the noise-level.

If we increase the noise-level, we can see a clear performance drop for all the models with one-hot embeddings, whether trained with synthetic noises or not. This behavior aligns with our conjecture: synthetic noises help improve the robustness of NMT models mostly because they offer an opportunity for the models to see the noise pattern during training; thus, when we shift the noise pattern (by increasing the noise-level), these models become fragile again.

In contrast, models with VE demonstrate an impressive level of resilience towards input variations as we increase the noise-level. Surprisingly, some models even see an increase in test score when the noise-level increases. For other models, as we increase the noise-level, the test score also goes down, but at a much slower pace than for the models not boosted with VE.

The standard deviation also endorses the strength of VE. With two exceptions in Table 2 (ADT_D-V and ADT_S-V), all models with VE report a smaller standard deviation than their one-hot counterparts, sometimes even less than half of it.
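The STD column of Table 2 can be reproduced from the five scores in each row. The sketch below does so for two rows; note that the match with the sample standard deviation (denominator $n-1$) is inferred from the reported numbers and is not stated explicitly in the text.

```python
from statistics import stdev

# Scores copied from Table 2: original text, natural noise, NL = p, 2p, 3p.
rows = {
    "Base":   [57.35, 28.86, 51.49, 46.39, 41.60],
    "Base-V": [57.07, 33.09, 53.68, 48.53, 44.02],
}

for name, scores in rows.items():
    # statistics.stdev uses the n - 1 denominator (sample standard deviation).
    print(name, round(stdev(scores), 2))
# Base 10.82
# Base-V 9.36
```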

More Results from Appendix

Due to the space limitation, we only report the detailed dissection of results on French-English translation. Results of the other translations (German-English, Czech-English) are reported in Appendix A. Briefly, all these results lead to the same conclusion: VE helps improve the robustness of NMT models towards input variations, especially when the noises during the test phase are different from the ones used during the training phase.

5 Discussion

Broader Usage of Visual Embedding:

We believe our method can also be used beyond the MT task discussed, especially in the applications that constantly work with substandard inputs or compound words that are not even recorded in standard dictionaries.

For example, we experiment with a collection of bioinformatics articles, over a task of sentiment analysis for drug reviews Gräßer et al. (2018). In biology research, new compound words are frequently created; therefore, it is likely that the training corpus is inadequate to represent all the words a model will see during the test phase. Our results suggest that VE leads to significantly better results in comparison to the previous deep learning driven methods. For example, a standard one-hot embedding model (CNN-LSTM architecture with highway network) can only achieve a classification accuracy of around 0.35, while VE can boost the performance up to 0.72 with the same set of hyperparameters. As a reference, a recent manuscript reports an accuracy of 0.443 with a word-level neural network Škrlj et al. (2020).

Broader Definition of Visual Embedding:

Our paper describes the VE in the context of encoding the shape of characters. However, one should be able to extend the concept to word level or even the sentence level (e.g., Sun et al. (2018, 2019)).

We have also attempted to extend the concept directly to the sentence or paragraph level: we directly print every sample into an image, and turn the NLP problem into a computer vision problem. However, we notice that this method is limited by the number of characters of a sample. For example, for a paragraph with 5000 characters, we roughly need an image with size $3000\times 200$ to have all the characters printed clearly. The image will have 600k pixels and can easily surpass the capacity of most computer vision techniques. Converting the image into a lower-resolution one also fails as the blurry images will not be able to carry all the information the text has.

Broader Scope of Related Work:

We are not aware of related works that encode the shape of input text in Germanic or Latin languages, especially in the context of learning robust NLP models. However, there is a large collection of related works that use the shape information of Sino-Tibetan languages, especially Chinese, over various applications. We offer a brief overview of these works, as they are connected to our paper in the sense of encoding the visual information of characters.

For example, glyph-aware embedding has been explored to incorporate vision techniques to boost the language modelling and word segmentation tasks of Chinese Dai and Cai (2017). The visual encoding of Chinese also offers a convenient opportunity to investigate the radicals of the language, which has been taken advantage of for text classification and word segmentation Shi et al. (2015). Learning glyph embeddings from vision representations has also been attempted for word analogy and word similarity in Chinese Su and Lee (2017); Cao et al. (2018). Later, recent advances in this direction have enabled the usage of visual representations of characters, especially Chinese, in more high-level semantic applications Liu et al. (2017); Meng et al. (2019); Zhang and LeCun (2017).

Potential Limitations:

Since our contribution, despite its simplicity, is a fundamental innovation that can be applied to nearly all the character-level models for nearly all the NLP tasks, we hope to discuss the potential limitations beyond specific application and model within the scope of this paper.

First, although VE is usually smaller than the one-hot embedding, and the model integrating the technique can usually finish one epoch faster, we notice that models using VE need more epochs to converge. Second, the usage of VE may be limited in applications where the shape of the words has undetermined effects. For example, sometimes changing a few letters to upper case may hinder both humans’ and VE’s recognition of a word (e.g., banana vs. baNANA), but sometimes it may hinder only VE and not humans (e.g., banana vs. Banana). This misalignment may limit the usage of VE for some dedicated applications. Fortunately, there should exist multiple heuristics, yet to be explored, that can account for these issues, such as always mapping the first letter of a word to lower case.
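The last heuristic mentioned above could look like the following preprocessing step. It is only a sketch of one possible realization: the function name is hypothetical, and words are assumed to be separated by single spaces.

```python
def lowercase_first_letters(sentence):
    """Map the first letter of every space-separated word to lower case."""
    return " ".join(word[:1].lower() + word[1:] for word in sentence.split(" "))

print(lowercase_first_letters("Banana is Yellow"))  # banana is yellow
print(lowercase_first_letters("baNANA"))            # baNANA
```

The second call shows that capitals inside a word are left untouched, so the heuristic addresses only the "banana vs. Banana" case.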

6 Conclusion

Motivated by the discrepancy between how humans and machine learning models process text data, we aim to improve the robustness of neural machine translation (NMT) models towards substandard inputs by aligning the data processing procedure between humans and the models. In particular, building around the belief that one of the reasons for the vulnerability of the models against substandard inputs is the discrepancy in processing data, we argue “humans read through eyes; so shall the models.”

Following this argument, we introduced the visual embedding (VE), which encodes the shape of the characters as a replacement for the conventional one-hot embedding of characters. The VE can be constructed very efficiently with only a few lines of code, and it can be integrated into almost any existing character-level model that uses one-hot embedding as input.

Further, in the context of machine translation over noised texts, we tested the performance of models with one-hot embedding and with VE. We mainly compared against methods that augment the training samples with noises (e.g., adversarial training), which is usually considered one of the most effective methods for robust machine learning. With a more comprehensive evaluation, we demonstrated the superiority of the VE: models with VE (especially those trained with synthetic noises) are resilient towards noised input even when the noises at test time are introduced with a greater probability than in the training phase, a scenario in which conventional methods usually fail. Overall, the empirical performance strongly endorses the efficacy of VE, especially in the context of robustness towards substandard inputs not seen during training.

References

Anastasopoulos et al. (2019) Antonios Anastasopoulos, Alison Lui, Toan Q Nguyen, and David Chiang. 2019. Neural machine translation of text from non-native speakers. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3070–3080.
Belinkov and Bisk (2017) Yonatan Belinkov and Yonatan Bisk. 2017. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173.
Cao et al. (2018) Shaosheng Cao, Wei Lu, Jun Zhou, and Xiaolong Li. 2018. cw2vec: Learning chinese word embeddings with stroke n-gram information. In Thirty-second AAAI conference on artificial intelligence.
Cheng et al. (2019) Yong Cheng, Lu Jiang, and Wolfgang Macherey. 2019. Robust neural machine translation with doubly adversarial inputs. arXiv preprint arXiv:1906.02443.
Dai and Cai (2017) Falcon Z. Dai and Zheng Cai. 2017. Glyph-aware embedding of chinese characters. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, Copenhagen, Denmark, September 7, 2017, pages 64–69. Association for Computational Linguistics.
Ebrahimi et al. (2017) Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. Hotflip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751.
Goodfellow et al. (2015) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
Goyal et al. (2017) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913.
Gräßer et al. (2018) Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-based sentiment analysis of drug reviews applying cross-domain and cross-data learning. In Proceedings of the 2018 International Conference on Digital Health, pages 121–125.
Hassan et al. (2018) Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic chinese to english news translation. arXiv preprint arXiv:1803.05567.
Huang et al. (2019) Po-Sen Huang, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krishnamurthy Dvijotham, and Pushmeet Kohli. 2019. Achieving verified robustness to symbol substitutions via interval bound propagation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4081–4091. Association for Computational Linguistics.
Jia et al. (2019) Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4127–4140. Association for Computational Linguistics.
Jones et al. (2020) Erik Jones, Robin Jia, Aditi Raghunathan, and Percy Liang. 2020. Robust encodings: A framework for combating adversarial typos.
Junczys-Dowmunt and Birch (2016) Marcin Junczys-Dowmunt and Alexandra Birch. 2016. The university of edinburgh’s systems submission to the mt task at iwslt. In Proceedings of the First Conference on Machine Translation, Seattle, USA.
Kaushik and Lipton (2018) Divyansh Kaushik and Zachary C Lipton. 2018. How much reading does reading comprehension require? a critical investigation of popular benchmarks. arXiv preprint arXiv:1808.04926.
Kim et al. (2016) Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence.
Läubli et al. (2018) Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. Has machine translation achieved human parity? a case for document-level evaluation. arXiv preprint arXiv:1808.07048.
Levy et al. (2019) Omer Levy, Jacob Eisenstein, Marjan Ghazvininejad, et al. 2019. Training on synthetic noise improves robustness to natural noise in machine translation. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 42–47.
Liang et al. (2017) Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2017. Deep text classification can be fooled. arXiv preprint arXiv:1704.08006.
Liu et al. (2017) Frederick Liu, Han Lu, Chieh Lo, and Graham Neubig. 2017. Learning character-level compositionality with visual features. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 2059–2068. Association for Computational Linguistics.
Liu et al. (2019) Hairong Liu, Mingbo Ma, Liang Huang, Hao Xiong, and Zhongjun He. 2019. Robust neural machine translation with joint textual and phonetic embedding. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 3044–3049. Association for Computational Linguistics.
Luong and Manning (2015) Minh-Thang Luong and Christopher D Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation, pages 76–79.
Max and Wisniewski (2010) Aurélien Max and Guillaume Wisniewski. 2010. Mining naturally-occurring corrections and paraphrases from wikipedia’s revision history. In LREC.
Meng et al. (2019) Yuxian Meng, Wei Wu, Fei Wang, Xiaoya Li, Ping Nie, Fan Yin, Muyu Li, Qinghong Han, Xiaofei Sun, and Jiwei Li. 2019. Glyce: Glyph-vectors for chinese character representations. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 2742–2753.
Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
Pruthi et al. (2019) Danish Pruthi, Bhuwan Dhingra, and Zachary C. Lipton. 2019. Combating adversarial misspellings with robust word recognition. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 5582–5591. Association for Computational Linguistics.
Sakaguchi et al. (2017) Keisuke Sakaguchi, Kevin Duh, Matt Post, and Benjamin Van Durme. 2017. Robsut wrod reocginiton via semi-character recurrent neural network. In Thirty-First AAAI Conference on Artificial Intelligence.
Sano et al. (2019) Motoki Sano, Jun Suzuki, and Shun Kiyono. 2019. Effective adversarial regularization for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 204–210.
Šebesta et al. (2017) Karel Šebesta, Zuzanna Bedřichová, Kateřina Šormová, Barbora Štindlová, Milan Hrdlička, Tereza Hrdličková, Jiří Hana, Vladimír Petkevič, Tomáš Jelínek, Svatava Škodová, et al. 2017. Czesl grammatical error correction dataset (czesl-gec). LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
Shi et al. (2015) Xinlei Shi, Junjie Zhai, Xudong Yang, Zehua Xie, and Chao Liu. 2015. Radical embedding: Delving deeper to chinese radicals. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 2: Short Papers, pages 594–598. The Association for Computer Linguistics.
Škrlj et al. (2020) Blaž Škrlj, Matej Martinc, Jan Kralj, Nada Lavrač, and Senja Pollak. 2020. tax2vec: Constructing interpretable features from taxonomies for short text classification. Computer Speech & Language, page 101104.
Srivastava et al. (2015) Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Advances in neural information processing systems, pages 2377–2385.
Su and Lee (2017) Tzu-Ray Su and Hung-Yi Lee. 2017. Learning Chinese word representations from glyphs of characters. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 264–273, Copenhagen, Denmark. Association for Computational Linguistics.
Sun et al. (2019) Baohua Sun, Lin Yang, Catherine Chi, Wenhan Zhang, and Michael Lin. 2019. Squared english word: A method of generating glyph to use super characters for sentiment analysis. arXiv preprint arXiv:1902.02160.
Sun et al. (2018) Baohua Sun, Lin Yang, Patrick Dong, Wenhan Zhang, Jason Dong, and Charles Young. 2018. Super characters: A conversion from sentiment classification to image classification. arXiv preprint arXiv:1810.07653.
Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
Vaibhav et al. (2019) Vaibhav Vaibhav, Sumeet Singh, Craig Stewart, and Graham Neubig. 2019. Improving robustness of machine translation with synthetic noise. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1916–1920.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
Vosoughi et al. (2016) Soroush Vosoughi, Prashanth Vijayaraghavan, and Deb Roy. 2016. Tweet2vec: Learning tweet embeddings using character-level cnn-lstm encoder-decoder. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 1041–1044.
Wang et al. (2016) Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, and Eric P Xing. 2016. Select-additive learning: Improving generalization in multimodal sentiment analysis. ICME.
Wang et al. (2019a) Haohan Wang, Da Sun, and Eric P Xing. 2019a. What if we simply swap the two text fragments? a straightforward yet effective way to test the robustness of methods to confounding signals in nature language inference tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7136–7143.
Wang et al. (2019b) Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P Xing. 2019b. High frequency component helps explain the generalization of convolutional neural networks. arXiv preprint arXiv:1905.13545.
Wisniewski et al. (2013) Katrin Wisniewski, Karin Schöne, Lionel Nicolas, Chiara Vettori, Adriane Boyd, Detmar Meurers, Andrea Abel, and Jirka Hana. 2013. Merlin: An online trilingual learner corpus empirically grounding the european reference levels in authentic learner data. In ICT for Language Learning 2013, Conference Proceedings, Florence, Italy. Libreriauniversitaria.it Edizioni.
Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Yin et al. (2020) Fan Yin, Quanyu Long, Tao Meng, and Kai-Wei Chang. 2020. On the robustness of language encoders against grammatical errors.
Zesch (2012) Torsten Zesch. 2012. Measuring contextual fitness using error contexts extracted from the wikipedia revision history. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 529–538. Association for Computational Linguistics.
Zhang and LeCun (2017) Xiang Zhang and Yann LeCun. 2017. Which encoding is the best for text classification in chinese, english, japanese and korean? CoRR, abs/1708.02657.
Zhou et al. (2019) Shuyan Zhou, Xiangkai Zeng, Yingqi Zhou, Antonios Anastasopoulos, and Graham Neubig. 2019. Improving robustness of neural machine translation with multi-task learning. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 565–571.

Appendices

Appendix A Additional Results

Method	original text	natural noise	synthetic noise (NL = $p$)	synthetic noise (NL = $2p$)	synthetic noise (NL = $3p$)	STD
Base	55.45	26.76	31.05	29.78	28.46	11.93
Base-V	46.47	29.57	34.72	33.15	31.38	6.66
ADT_D	50.09	40.96	35.81	33.97	32.42	7.16
ADT_D-V	47.93	47.04	40.48	38.97	37.49	4.79
ADT_I	45.29	37.28	39.99	38.78	37.40	3.29
ADT_I-V	45.86	38.28	41.25	40.09	39.30	2.95
ADT_R	52.73	37.44	39.98	38.60	37.50	6.50
ADT_R-V	45.2	43.03	45.50	43.89	42.38	1.35
ADT_S	49.58	38.4	40.91	39.20	37.83	4.83
ADT_S-V	46.69	40.21	42.06	40.34	38.80	3.06
ADT_A	50.78	38.22	36.60	35.99	35.58	6.42
ADT_A-V	43.29	44.17	41.85	40.94	40.19	1.64

Table 3: Test performance of German-English translation. Each method is reported with one-hot embedding and with VE (rows with the -V suffix). Performance is reported on text that can appear naturally (original text and natural noise) and on text perturbed with synthetic noise at three noise levels (NL); the standard deviation (STD) of each row is also reported.

Method	original text	natural noise	synthetic noise (NL = $p$)	synthetic noise (NL = $2p$)	synthetic noise (NL = $3p$)	STD
Base	46.31	30.78	44.02	41.91	39.83	5.98
Base-V	38.87	32.91	38.37	37.31	36.02	2.38
ADT_D	43.57	30.35	38.33	36.79	35.25	4.80
ADT_D-V	53.53	32.89	44.36	41.80	39.44	7.54
ADT_I	42.57	32.16	40.17	38.71	37.28	3.89
ADT_I-V	36.2	37.73	37.21	37.44	37.50	0.60
ADT_R	43.61	32.29	39.98	38.56	37.14	4.14
ADT_R-V	36.48	33.88	37.54	36.85	36.09	1.39
ADT_S	46.69	31.64	38.95	37.41	35.84	5.52
ADT_S-V	49.69	34.14	38.23	38.32	38.28	5.85
ADT_A	36.77	28.9	22.56	20.82	19.54	7.15
ADT_A-V	35.21	27.48	23.41	22.95	22.34	5.39

Table 4: Test performance of Czech-English translation. Each method is reported with one-hot embedding and with VE (rows with the -V suffix). Performance is reported on text that can appear naturally (original text and natural noise) and on text perturbed with synthetic noise at three noise levels (NL); the standard deviation (STD) of each row is also reported.

We report the results of German-English and Czech-English translation in Table 3 and Table 4 to support the discussion in the main manuscript. As expected, methods with visual embedding surpass their one-hot counterparts on the noised sentences in most cases, although visual embedding does not seem to help much on the original text. Interestingly, we noticed that the Base model works quite well on Czech-English translation, especially when NL = $2p$ and NL = $3p$; we conjecture that this is because the other methods, which adopt adversarial training, overfit the distribution of NL = $p$.
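As a rough illustration of what evaluating at NL = $p$, $2p$ and $3p$ involves, the sketch below builds three noisy copies of a test set by perturbing characters with probability $p$, $2p$ and $3p$. The noise type (swapping adjacent characters) and the names `add_swap_noise` and `noisy_test_sets` are assumptions chosen for illustration; they are not claimed to match the noise generation used to produce the tables.

```python
import random


def add_swap_noise(text, prob, rng):
    """Swap adjacent character pairs, each with probability `prob`.

    Assumed noise type for illustration: scanning left to right, each pair is
    swapped with probability `prob`; a swapped pair is skipped so that every
    character moves at most once.
    """
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)


def noisy_test_sets(sentences, p, seed=0):
    """Return {1: ..., 2: ..., 3: ...}: copies of the test set at NL = p, 2p, 3p."""
    rng = random.Random(seed)  # fixed seed so the perturbation is repeatable
    return {
        k: [add_swap_noise(s, k * p, rng) for s in sentences]
        for k in (1, 2, 3)
    }


test_sets = noisy_test_sets(["word shape matters"], p=0.1)
for level, sents in test_sets.items():
    print(f"NL = {level}p:", sents[0])
```

A model trained only with noise at probability $p$ would then be scored on all three sets, which is the setting where overfitting to NL = $p$ would show up as a drop at $2p$ and $3p$.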

Additionally, the STD column, which measures how much a model's score varies across the five input conditions, shows that visual embedding can significantly improve the robustness of the models towards variations of the input.
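The reported STD values appear consistent with the sample standard deviation (denominator $n-1$) of the five scores in each row. This is an inference from the numbers rather than something stated in the captions. A minimal check on two rows of Table 3, using a hypothetical helper `row_std`:

```python
from statistics import stdev  # sample standard deviation (n - 1 denominator)

# Scores copied from Table 3 (German-English): original text, natural noise,
# and synthetic noise at NL = p, 2p, 3p.
base_scores = [55.45, 26.76, 31.05, 29.78, 28.46]
adt_r_v_scores = [45.2, 43.03, 45.50, 43.89, 42.38]


def row_std(scores):
    """Sample standard deviation of one table row, rounded to two decimals."""
    return round(stdev(scores), 2)


print(row_std(base_scores))     # 11.93, matching the STD reported for Base
print(row_std(adt_r_v_scores))  # 1.35, matching the STD reported for ADT_R-V
```

A lower value means the model's score changes less between clean and noised inputs, which is the sense in which the STD column is read as a robustness indicator.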

Overall, together with the results reported in the main manuscript, we can fairly conclude that visual embedding is preferred over the one-hot embedding given the performances discussed.
