Text Classification through Glyph-aware Disentangled Character Embedding and Semantic Sub-character Augmentation

Takumi Aoki  Shunsuke Kitada  Hitoshi Iyatomi
Department of Applied Informatics, Graduate School of Science and Engineering
Hosei University, Tokyo, Japan
{takumi.aoki.4g, shunsuke.kitada.8y}@stu.hosei.ac.jp
[email protected]
Abstract

We propose a new character-based text classification framework for non-alphabetic languages, such as Chinese and Japanese. Our framework consists of a variational character encoder (VCE) and a character-level text classifier. The VCE is composed of a β-variational auto-encoder (β-VAE) that learns the proposed glyph-aware disentangled character embedding (GDCE). Since our GDCE provides zero-mean, unit-variance character embeddings that are dimensionally independent, it is applicable to our interpretable data augmentation method, namely, semantic sub-character augmentation (SSA). In this paper, we evaluated our framework using Japanese text classification tasks at the document and sentence level. We confirmed that our GDCE and SSA not only provided embedding interpretability but also improved the classification performance. Our proposal achieved results competitive with the state-of-the-art model while also providing model interpretability.

1 Introduction

Some Asian languages (e.g., Chinese and Japanese) use glyphs that give visual meaning to characters. For example, the following Japanese characters share the common component “辶,” a sub-character whose meaning relates to the word road: “迫” (approach: come near the destination by road) and “追” (follow: track the road). In consideration of these characteristics, several glyph-aware natural language processing (NLP) models have been proposed Shimada et al. (2016); Liu et al. (2017); Kitada et al. (2018); Sun et al. (2019). These deep-learning-based models treat input text as a sequence of character images and learn character embeddings from the images.

In general, the interpretability of an NLP model is important for its reliability, in addition to providing the required performance on the task. If image-based models can learn these sub-characters in an interpretable way, it greatly helps improve the overall interpretability of the models.

In terms of improving the interpretability of models, disentangled representation learning has received a great deal of attention in recent years, with methods such as InfoGAN Chen et al. (2016) and the β-variational auto-encoder (β-VAE) Higgins et al. (2017). These methods transform the input data into low-dimensional representations whose dimensions are independent of each other while still retaining the important content. Although disentangled representation learning has been actively discussed in the field of computer vision, there are few applications in the field of NLP.

In terms of ensuring model robustness, data augmentation is essential in modern machine learning. Glyph-aware embedding (i.e., image-based character embedding) allows data augmentation without contextual consideration, such as word dropout Iyyer et al. (2015) and wildcard training Shimada et al. (2016). However, simple dropout-based data augmentation does not consider the features of the input space. If a glyph-aware embedding is highly interpretable, such as a disentangled representation, an effective data augmentation method can be built on it. This improves not only the robustness of the model but also its interpretability.

In this paper, we propose a general-purpose text classification framework that gives interpretability to data augmentation for image-based glyph-aware character embedding, which has the various advantages mentioned above. The framework consists of two novel methods: (1) glyph-aware disentangled character embedding (GDCE) and (2) semantic sub-character augmentation (SSA). Each method has the following simple but effective features:

  • The GDCE is obtained from the variational character encoder (VCE), which is the encoder part of the β-VAE. The VCE takes advantage of the β-VAE to create a low-dimensional representation of the characters, where each dimension follows an independent normal distribution. Therefore, the GDCE provides a disentangled character embedding in which each dimension corresponds to the structure of a sub-character.

  • The SSA alters only one dimension of the GDCE, which corresponds to altering some part of the shape of the original character, and can show how the character has changed. In other words, these combinations are equivalent to replacing a sub-character of a character with another readable sub-character.

Our framework improves the interpretability of character embedding through the GDCE, and the SSA provides interpretable data augmentation suited to the GDCE. We verified the text classification ability of our proposed framework using Japanese text classification tasks. The code required to reproduce the experiments is available on GitHub: https://github.com/IyatomiLab/GDCE-SSA

2 Related work

2.1 Glyph-aware Natural Language Processing

Embedding methods based on character images have been proposed and have achieved excellent results Chen et al. (2015); Sun et al. (2016); Yu et al. (2017); Sun et al. (2019); Dai and Cai (2017); Shimada et al. (2016); Liu et al. (2017); Kitada et al. (2018); Ke and Hagiwara (2017); Aldón Mínguez et al. (2016). These methods are also called glyph-aware embedding, as they generate embeddings that take into account the shape of characters or sub-characters. These image-based methods mainly use convolutional neural networks (CNNs) or convolutional auto-encoders (CAEs) Masci et al. (2011) to learn character embeddings, and they perform well because of the following advantages: (1) they operate without the cumbersome word segmentation required by some Asian languages, and (2) they can apply additional image-based data augmentation.

2.2 Data Augmentation for Natural Language Processing

For NLP tasks, it is challenging to apply data augmentation methods because of the need to consider the context of the text Sennrich et al. (2016); Jia and Liang (2016); Silfverberg et al. (2017); Edunov et al. (2018). Several data augmentation methods that do not require text analysis have been proposed for word embedding Iyyer et al. (2015); Zhang et al. (2016) and character embedding Shimada et al. (2016). In particular, Shimada et al. (2016) achieved significant performance improvements by applying dropout Hinton et al. (2012)-based data augmentation, called wildcard training (WT), to character embeddings. However, these methods offer little interpretability of what the data augmentation means in the input text, partly due to the lack of interpretability of the embedding itself. Our proposed SSA is an improved version of WT that replaces a sub-character of a character with another readable sub-character.

2.3 Learning Interpretable Character Embeddings

For learning latent representations that can be interpreted, InfoGAN Chen et al. (2016) and the β-VAE Higgins et al. (2017) are well known. Unlike InfoGAN, the β-VAE is stable during training, requires fewer assumptions about the data, and relies on only a single hyperparameter β. Because of these advantages, several improved models based on the β-VAE have been proposed (e.g., Factor-VAE Kim and Mnih (2018), HFVAE Esmaeili et al. (2019)). Therefore, in this paper, we use the β-VAE as the VCE to learn interpretable character embeddings.

3 Methodology

In this paper, we propose a new character-based text classification framework built around a new character embedding method, consisting of glyph-aware disentangled character embedding (GDCE) and semantic sub-character augmentation (SSA). Figure 1 shows an overview of the proposed text classification framework.

3.1 Glyph-aware Disentangled Character Embedding (GDCE)

We obtain the GDCE using the VCE, which is based on the β-VAE. Since the GDCE provides dimensionally independent features, we expect it to resolve the poor interpretability of character embeddings obtained by the CAE.

Figure 1: Overview of our text classification framework. Each character in the target text is transformed into an image and forwarded as a glyph feature to the subsequent VCE. The VCE is composed of a β-VAE and learns the proposed GDCE. Owing to the attractive properties of the GDCE, the character-level text classifier can take advantage of the interpretable and highly effective data augmentation method, SSA.

The β-VAE is a generative model that estimates the data distribution p(𝒙), where 𝒙 is a d-dimensional input. Let 𝒛 be a d′-dimensional latent variable, which corresponds to the GDCE in this paper; p(𝒛) is a normal distribution serving as the prior over the latent variables, q(𝒛|𝒙) is the posterior distribution, and p(𝒙|𝒛) is the generative model. We optimize the following objective:

ℒ_{β-VAE} = 𝔼_{q(𝒛|𝒙)}[log p(𝒙|𝒛)] − β·D_{KL}[q(𝒛|𝒙) ‖ p(𝒛)],        (1)

where β is a balancing coefficient for the second term. The first term represents the reconstruction error of the character image. The second term is a regularization that encourages the latent variables to follow the prior distribution via the KL divergence D_{KL}[·‖·]. As the coefficient β increases, it becomes possible to obtain a feature representation whose dimensions are independent Higgins et al. (2017).

However, sampling the latent variables is a stochastic operation through which gradients cannot be backpropagated to the encoder. Hence, the reparameterization trick Kingma and Welling (2013) is used as an approximation. We let 𝜶 be a random variable sampled from 𝒩(𝟎, 𝑰_{d′}) and calculate the latent variables as follows:

𝒛 = μ(𝒙) + 𝜶 ⊙ σ(𝒙),   𝜶 ∼ 𝒩(𝟎, 𝑰_{d′}),        (2)

where ⊙ is the element-wise product, μ(𝒙) is the mean of the distribution, and σ(𝒙) is its standard deviation. Both μ(𝒙) and σ(𝒙) are d′-dimensional vectors obtained from the encoder of the β-VAE.
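To make the VCE concrete, below is a minimal PyTorch sketch of Eqs. 1 and 2 with layer sizes following Table 1. It is an illustration under our assumptions rather than the authors' released code: the paper does not specify padding, the reconstruction loss, or whether the encoder outputs log-variance, so those choices (and names such as `VCE` and `beta_vae_loss`) are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VCE(nn.Module):
    """Minimal beta-VAE sketch; layer sizes follow Table 1 (assumed padding=1)."""
    def __init__(self, z_dim: int = 10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 4
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, 2 * z_dim),  # outputs [mu, log-variance]
        )
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, 1024), nn.ReLU(),
            nn.Unflatten(1, (64, 4, 4)),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=1)
        # Reparameterization trick (Eq. 2): z = mu + alpha * sigma
        alpha = torch.randn_like(mu)
        z = mu + alpha * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def beta_vae_loss(x, x_recon, mu, logvar, beta: float = 8.0):
    """Negative of Eq. 1 (to minimize): reconstruction error + beta * KL."""
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```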

3.2 Character-level Text Classification with Semantic Sub-character Augmentation (SSA)

The sequence of c embedded characters C = {𝒛⁽¹⁾, 𝒛⁽²⁾, …, 𝒛⁽ᶜ⁾} from the GDCE, where 𝒛⁽ᵗ⁾ is the t-th character embedding produced by the VCE, is provided to the subsequent character-level text classifier. The parameters of the classifier are optimized by back-propagation using the cross-entropy loss.

In this paper, we propose SSA as a data augmentation method. Taking advantage of the desirable properties of the embedding created by the GDCE, we expect the SSA to replace a sub-character of a character with another readable sub-character.

Let γ be the perturbation range. The SSA for the i-th dimension zᵢ⁽ᵗ⁾ of the character embedding 𝒛⁽ᵗ⁾ is defined as follows:

z′ᵢ⁽ᵗ⁾ = zᵢ⁽ᵗ⁾ + u,   u ∼ 𝒰(−γ, γ),        (3)

where u ∼ 𝒰(a, b) indicates that the random variable u follows a uniform distribution with minimum a and maximum b. Since each dimension of the GDCE follows 𝒩(𝟎, 𝑰_{d′}), the character embedding transformed by Eq. 3 falls within the range of trained character-embedding values.
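As a minimal sketch, SSA can be implemented as follows, applying Eq. 3 to one randomly chosen dimension of each character embedding (following the single-dimension property described in the introduction). The function name `ssa` and the batched formulation are our own illustration.

```python
import torch

def ssa(z: torch.Tensor, gamma: float) -> torch.Tensor:
    """Semantic sub-character augmentation (Eq. 3), as we read it:
    perturb one randomly chosen dimension of each character embedding
    with uniform noise in [-gamma, gamma].

    z: character embeddings of shape (c, d') -- one row per character.
    """
    z = z.clone()
    c, d = z.shape
    i = torch.randint(d, (c,))                  # dimension to alter, per character
    u = torch.empty(c).uniform_(-gamma, gamma)  # u ~ U(-gamma, gamma)
    z[torch.arange(c), i] += u
    return z

# Usage: augment a sequence of c=128 GDCE embeddings with gamma=1.5 (newspaper).
z = torch.randn(128, 10)
z_aug = ssa(z, gamma=1.5)
```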

4 Experiment Settings

4.1 Evaluation Datasets

We evaluated our framework with two datasets: newspaper and livedoor. Each dataset was split into 80% for training and 20% for evaluation. Because these datasets contain new words and/or meanings related to current affairs, accurate word segmentation through morphological analysis is a challenge for conventional word-level processing of Japanese. We avoid this difficulty by using character-level input instead of word-level input (it is generally known that character-level models perform better than word-level models in Chinese and Japanese Zhang and LeCun (2017)).

Newspaper.

The newspaper dataset used in Shimada et al. (2016) contains 5,610 articles from each of four major Japanese web newspapers (Asahi, Mainichi, Sankei, and Yomiuri) in the categories of politics, economy, and international news, for a total of 22,440 articles.

Livedoor.

The livedoor dataset is commonly used to evaluate models for Japanese (https://www.rondhuit.com/download.html#ldcc). The dataset contains, for example, 870 and 900 Japanese sentences in the movie-enter and sports-watch categories, respectively. Across all nine categories, it contains a total of 7,367 articles.

4.2 Model Architectures

We trained the VCE, based on the β-VAE, and the character-level CNN (CLCNN) Zhang et al. (2015) text classifier independently. The hyperparameters of these models were tuned on a validation set split from the training set, and results on the evaluation set are reported.

Layer  Encoder
1      Conv2d (k=(4,4), o=32, s=2) → ReLU
2      Conv2d (k=(4,4), o=32, s=2) → ReLU
3      Conv2d (k=(4,4), o=64, s=2) → ReLU
4      Conv2d (k=(4,4), o=64, s=2) → ReLU
5      Linear (o=256) → ReLU
6      Linear (o=2×10)

Layer  Decoder
1      Linear (o=256) → ReLU
2      Linear (o=1024) → ReLU
3      Deconv2d (k=(4,4), o=64, s=2) → ReLU
4      Deconv2d (k=(4,4), o=32, s=2) → ReLU
5      Deconv2d (k=(4,4), o=32, s=2) → ReLU
6      Deconv2d (k=(4,4), o=1, s=2) → Sigmoid

Table 1: Architecture of the β-VAE. k denotes the kernel size, o the output size, and s the stride.

β-variational auto-encoder (β-VAE).

Table 1 shows the architecture of the β-VAE. Training a β-VAE is generally unstable and requires careful adjustment of hyperparameters; in this paper, we tuned them based on Locatello et al. (2019). Adam Kingma and Ba (2014) was used to maximize ℒ_{β-VAE} in Eq. 1. We set the training batch size to 64 and the learning rate to 1e-4.

To obtain the GDCE, we trained the VCE with 6,631 common Japanese characters, including Hiragana, Katakana, and Kanji (the first and second levels of the Japanese Industrial Standards), as well as the English alphabet and symbols. These characters were converted to 64 × 64 grayscale character images (d = 64 × 64) and used as the input 𝒙 to the VCE. We set β = 8 and d′ = 10 for all tasks, with γ = 1.5 for newspaper and γ = 2.0 for livedoor.
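The paper does not detail the character-rendering pipeline, but a plausible sketch of converting a character to a 64 × 64 grayscale input looks like the following; the font path is a placeholder, and the centering heuristic is our assumption.

```python
from PIL import Image, ImageDraw, ImageFont
import numpy as np

FONT_PATH = "/path/to/a-japanese-font.ttf"  # placeholder; any font covering JIS levels 1-2

def render_glyph(ch: str, size: int = 64) -> np.ndarray:
    """Render a single character as a size x size grayscale image in [0, 1]."""
    font = ImageFont.truetype(FONT_PATH, int(size * 0.9))
    img = Image.new("L", (size, size), color=0)
    draw = ImageDraw.Draw(img)
    # Center the glyph using its bounding box (our heuristic).
    left, top, right, bottom = draw.textbbox((0, 0), ch, font=font)
    draw.text(((size - (right - left)) / 2 - left,
               (size - (bottom - top)) / 2 - top), ch, fill=255, font=font)
    return np.asarray(img, dtype=np.float32) / 255.0
```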

Layer  CLCNN
1      Conv1d (k=3, o=512) → ReLU
2      Maxpool1d (k=3, s=3)
3      Conv1d (k=3, o=512) → ReLU
4      Maxpool1d (k=3, s=3)
5      Conv1d (k=3, o=512) → ReLU
6      Conv1d (k=3, o=512) → ReLU
7      Linear (o=#classes)

Table 2: Architecture of the CLCNN. k denotes the kernel size, o the output size, and s the stride.

Character-level convolutional neural network (CLCNN).

Table 2 shows the architecture of the CLCNN. We trained the CLCNN with the same parameters as in Shimada et al. (2016). As with the character embedding model, Adam was used, here to minimize the cross-entropy loss. We set the learning rate to 1e-4, the weight decay to 1e-4, and the training batch size to 256 for livedoor and 512 for newspaper.
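Below is a minimal PyTorch sketch of the CLCNN in Table 2. The table leaves the classifier head underspecified, so the flatten step and the lazily sized final linear layer are our assumptions; the input arranges the d′ = 10 embedding dimensions as channels.

```python
import torch
import torch.nn as nn

def make_clcnn(num_classes: int, z_dim: int = 10) -> nn.Sequential:
    """CLCNN per Table 2. Input: (batch, z_dim, c) -- embedding dims as channels.
    The Flatten + LazyLinear head is our assumption; the table only
    specifies Linear(o=#classes)."""
    return nn.Sequential(
        nn.Conv1d(z_dim, 512, kernel_size=3), nn.ReLU(),
        nn.MaxPool1d(kernel_size=3, stride=3),
        nn.Conv1d(512, 512, kernel_size=3), nn.ReLU(),
        nn.MaxPool1d(kernel_size=3, stride=3),
        nn.Conv1d(512, 512, kernel_size=3), nn.ReLU(),
        nn.Conv1d(512, 512, kernel_size=3), nn.ReLU(),
        nn.Flatten(),
        nn.LazyLinear(num_classes),
    )

# Usage: a batch of 4 texts, c=128 characters, d'=10 embedding dims, 3 classes.
model = make_clcnn(num_classes=3)
logits = model(torch.randn(4, 10, 128))  # -> (4, 3)
```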

In training the CLCNN, we used the GDCE obtained by the VCE as the input. For training, c = 128 consecutive characters were extracted from the body text in newspaper, and c = 80 consecutive characters were extracted from the title text in livedoor. For evaluation on newspaper, a window of c = 128 characters was slid one character at a time so that the entire text was used as input, in the same manner as Shimada et al. (2016); for livedoor, evaluation was performed in the same way as training.
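The window extraction just described could be sketched as follows; the padding of short texts and how per-window predictions are aggregated are our assumptions, as the paper does not specify them.

```python
import random

def train_window(ids, c=128, pad=0):
    """Randomly crop c consecutive character ids for training (pad short texts)."""
    if len(ids) <= c:
        return ids + [pad] * (c - len(ids))
    start = random.randrange(len(ids) - c + 1)
    return ids[start:start + c]

def eval_windows(ids, c=128):
    """Slide a c-character window one character at a time over the whole text;
    predictions over all windows can then be aggregated (e.g., averaged)."""
    if len(ids) <= c:
        return [ids]
    return [ids[i:i + c] for i in range(len(ids) - c + 1)]
```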

5 Results and Discussion

Accuracy [%]

                       Newspaper                        Livedoor
+ CLCNN        Vanilla   + WT   + SSA (Ours)    Vanilla   + WT   + SSA (Ours)
VCE (Ours)      81.02    82.78     84.00         67.16    68.59     69.05
CAE             79.81    81.62     81.35         58.39    60.87     60.53

Table 3: A comparison between the VCE (with the proposed GDCE) and the CAE on the newspaper and livedoor datasets. We compared our proposed framework (a disentangled representation) with the state-of-the-art framework of Shimada et al. (2016) (without consideration of disentangled representation). Our proposed framework had the highest performance, and the model using the VCE performed better than the CAE.

First, as a comparison of embedding methods, we compared the GDCE with the conventional CAE-based embedding Shimada et al. (2016). Second, as a comparison of data augmentation methods for image-based character embedding, we compared the proposed SSA with the conventional WT; the latter has reported excellent results but offers no way of interpreting the change in the embedding space.

5.1 Effectiveness of the Proposal on Text Classification

Table 3 presents a comparison of the proposed GDCE and CAE-based embedding. The GDCE showed better document- and sentence-level classification performance than the conventional CAE-based character embedding without data augmentation. This may be because the characters learned by the VCE are distributed in a limited embedding space centered on zero, which makes the subsequent CLCNN training more effective. The WT, which randomly sets all dimensions of a particular character embedding to zero, enhanced the discrimination of both models. The effect on the CAE-based model was particularly large, as reported in previous studies. We confirmed the effect of the WT as a dropout that prevents overfitting, but it provides no interpretation of what was changed in the character embeddings.

The proposed SSA gave us an idea of what the embedding changes look like, while providing the same discriminative capacity as the WT. This may be because the GDCE has a standardized metric in the embedding space (i.e., the embedding follows a normal distribution), so the distances between character embeddings stay within an expected range. Hence, the size of the applied perturbations could be designed, allowing for meaningful data augmentation. However, the CAE with SSA did not show improved classification performance, possibly because perturbing the CAE embedding does not change it into a meaningful character representation.

(a) VCE (proposed)
(b) CAE
Figure 2: Results of reconstructing character images from the character embeddings trained by the VCE (a) and the CAE (b), with perturbations added within ±2.0σ. The upper side shows the reconstructed images of “迫” (approach) and “追” (follow). In the reconstruction from the VCE embedding, adding noise to the fifth dimension of the embedding of “迫” or “追” (containing the sub-character “辶,” meaning road) can be interpreted as changing it to “氵” (sub-character of water) or “辶” (sub-character of road, the same as “辶”). The lower side shows the reconstructed images of “綱” (rope) and “縄” (cord). In the reconstruction from the VCE embedding, adding noise to the first dimension of the embedding of “綱” or “縄” (containing the sub-character “糸,” meaning yarn) can be interpreted as changing it to “扌” (sub-character of hand) or “金” (sub-character of gold).

5.2 Effectiveness of the Proposal on Interpretation

Figure 2 shows a comparison of the reconstructed character images when a ±2.0σ perturbation is applied to the character embeddings obtained by the VCE (the GDCE; Figure 2a) and by the CAE (Figure 2b). In Figure 2a, we can confirm that the character changed in the input space into a different interpretable character with a similar subcomponent replaced. In particular, by adding a perturbation to the fifth dimension of the embedding of “迫” or “追” (containing the sub-character “辶,” meaning road), it can be interpreted that the character changed to one containing “氵” (sub-character of water) or “辶” (sub-character of road, the same as “辶”). In addition, by adding a perturbation to the first dimension of the embedding of “綱” or “縄” (containing the sub-character “糸,” meaning yarn), it can be interpreted that the character changed to one containing “扌” (sub-character of hand) or “金” (sub-character of gold). From these results, we are convinced that such replacements in the embedding result in more effective data augmentation for training the model.

In contrast, as seen in Figure 2b, we were unable to identify such trends for the CAE. We consider this contrast to be one of the typical benefits of our framework: each dimension of the GDCE is independent and affects a single character component (e.g., a sub-character or radical of the character). In other words, we can change only some part of the character by changing certain dimensions of the embedding.

Since the SSA is a local transformation of character parts as shown above, some characters that do not actually exist can be generated by the combination of parts. These are not readable as correct characters, but we can still make certain interpretations of them. In sum, the combination of the proposed GDCE and SSA provides interpretability of both the data augmentation and the character embedding while maintaining high discriminative power.

(a) The effect of coefficient β (γ = 0, i.e., without SSA).
(b) The effect of perturbation range γ (β = 8).
Figure 3: The effect of hyperparameters in our framework on the evaluation performance, using the livedoor dataset.

5.3 The Effect of Hyperparameters

To understand the effect of hyperparameters, we analyzed the coefficient β and the perturbation range γ on livedoor, as shown in Figure 3.

The effect of coefficient β.

Figure 3a shows the effect of the coefficient β on the evaluation performance with γ = 0 (i.e., without SSA). In our experiments, we confirmed that β = 8 is best from the viewpoint of disentanglement and accuracy.

The effect of perturbation range γ.

Figure 3b shows the effect of the perturbation range γ in SSA on the evaluation performance with β = 8. Based on the notion that each dimension of the target character embedding follows 𝒩(𝟎, 𝑰_{d′}), the perturbation range γ was chosen from 1.0σ (covering 68% of the distribution) to 3.0σ (covering almost the entire distribution). The best performance was obtained when the perturbation range was set to γ = 2.0. This suggests that the character embedding trained by the VCE followed a normal distribution with mean μ = 0 and standard deviation σ = 1.0; to cover the distribution, it is useful to add perturbations in the range of γ = 2.0, corresponding to 2.0σ (covering 95% of the distribution).
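The quoted coverage figures are standard facts about the normal distribution and can be checked directly:

```python
from math import erf, sqrt

def coverage(k: float) -> float:
    """Fraction of N(0, 1) probability mass within [-k*sigma, +k*sigma]."""
    return erf(k / sqrt(2))

for k in (1.0, 2.0, 3.0):
    print(f"{k:.1f} sigma -> {coverage(k):.1%}")  # 68.3%, 95.4%, 99.7%
```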

5.4 Limitations of the Current Study

At present, the role of each dimension of the GDCE in character reconstruction cannot be clearly defined in advance, because it depends on the training of the model. Also, since the VCE was trained independently of the classifier (i.e., not in an end-to-end manner), the trained embeddings capture only visual features, not semantic ones. We will address these issues in future work.

6 Conclusion

We proposed a new character-based text classification framework for non-alphabetic languages. The combination of our GDCE and SSA not only provided embedding interpretability but also improved text classification performance. Our GDCE provided better text classification performance than conventional CAE-based character embedding without data augmentation. Finally, our framework achieved results competitive with the conventional state-of-the-art CAE-based embedding with WT while also providing model interpretability.

References