
From Fully Trained to Fully Random Embeddings: Improving Neural Machine Translation with Compact Word Embedding Tables

Krtin Kumar
Thomson Reuters
[email protected]
Peyman Passban
Amazon
[email protected]
Mehdi Rezagholizadeh
Huawei Noah’s Ark Lab
Yiu Sing Lau
McGill University, Statistics
[email protected]
Qun Liu
Huawei Noah’s Ark Lab
[email protected]
*Work done while Krtin Kumar, Peyman Passban, and Yiu Sing Lau were at Huawei.
Abstract

Embedding matrices are key components in neural natural language processing (NLP) models, responsible for providing numerical representations of input tokens. (In this paper, words and subwords are referred to as tokens, and the term embedding refers only to embeddings of inputs.) In this paper, we analyze the impact and utility of such matrices in the context of neural machine translation (NMT). We show that removing syntactic and semantic information from word embeddings and running NMT systems with random embeddings is not as damaging as it initially sounds. We also show how incorporating only a limited amount of task-specific knowledge from fully-trained embeddings can boost the performance of NMT systems. Our findings demonstrate that, in exchange for a negligible deterioration in performance, any NMT model can be run with partially random embeddings. Working with such structures means a minimal memory requirement, as there is no longer a need to store large embedding tables, which is a significant gain in industrial and on-device settings. We evaluated our embeddings in translating English into German and French and achieved a 5.3x compression rate. Despite having a considerably smaller architecture, our models in some cases are even able to outperform state-of-the-art baselines.

1 Introduction

One of the main challenges in NLP is to properly encode discrete tokens into continuous vector representations. The most common practice in this regard is to use an embedding matrix, which provides a one-to-one mapping from tokens to n-dimensional, real-valued vectors (Mikolov et al., 2013; Li and Yang, 2018). Typically, the values of these vectors are optimized via back-propagation with respect to a particular objective function. Learning embedding matrices with robust performance across different domains and data distributions is a complex task and can directly impact translation quality (Tian et al., 2014; Shi et al., 2015; Sun et al., 2016).

Embedding matrices are a key component of seq2seq NMT models. For instance, in a Transformer-base NMT model Vaswani et al. (2017) with a vocabulary size of 50k, 36% of the total model parameters are used by the embedding matrices; a significant fraction of the parameters is therefore devoted solely to representing tokens. Existing work on embedding compression for NMT systems Khrulkov et al. (2019); Chen et al. (2018); Shu and Nakayama (2017) has shown that these matrices can be compressed significantly with only a minor drop in performance. Our focus in this work is to study the importance of embedding matrices in the context of NMT. We experiment with Transformers Vaswani et al. (2017) and LSTM-based seq2seq architectures, which are the two most popular seq2seq models in NMT. In Section 3.1, we compare the performance of a fully-trained embedding matrix with a completely random word embedding (RWE), and find that using a random embedding matrix leads to a drop of about 1 to 4 BLEU Papineni et al. (2002) points on different NMT benchmark datasets. Neural networks have shown impressive performance with random weights for image classification tasks Ramanujan et al. (2020); our experiments show similar results for the embedding matrices of NMT models.

Figure 1: k is a hyper-parameter that controls the number of Gaussian distributions required to approximate the embedding matrix.

RWE uses no trainable embedding parameters, so it might be possible to recover the drop in BLEU score by increasing the number of parameters through additional layers. We increase the number of layers in the RWE model such that the number of parameters used by the entire model is comparable to the fully-trained embedding model. We find that even in this comparable-parameter setting, the RWE model remains inferior to the fully-trained model by about 1 BLEU point for the high- and medium-resource datasets. Our results suggest that even though the embedding parameters can be compressed to a large extent with only a minor drop in accuracy, embedding matrices are essential components for token representation and cannot be replaced by deeper layers of Transformer-based models.

RWE assumes a fully random embedding sampled from a single Gaussian distribution. To control the amount of random information, and to better understand the importance of embedding matrices, we introduce Gaussian Product Quantization (GPQ). GPQ assumes that k Gaussian distributions are required to approximate the embedding matrix of an NMT model. The means and variances of the k distributions are learned from a fully-trained embedding matrix trained on the same dataset as the GPQ model. k is a hyper-parameter that controls the amount of information distilled from a pre-trained embedding into the partially random embedding of the GPQ model. GPQ can move from a fully random embedding to a fully-trained embedding by increasing the number of distributions k, as shown in Figure 1. Our results show that only 50 Gaussian distributions are sufficient to approximate the embedding matrix without a drop in performance. GPQ compresses the embedding matrix 5 times without any significant drop in performance, and uses only 100 floating-point values for storing the means and variances. GPQ also acts as an effective regularizer of the embedding matrix, outperforming the Transformer baseline with a fully-trained embedding by 1.2 BLEU points on the En→Fr dataset and 0.5 BLEU points on the Pt→En dataset. A similar increase in performance was observed for LSTM-based models, further showing the effectiveness of GPQ for embedding regularization.

Product Quantization (PQ) (Jegou et al., 2010) was originally proposed for fast approximate nearest-neighbour search. PQ has been adapted for compressing embedding matrices in different NLP problems Shu and Nakayama (2017); Kim et al. (2020); Tissier et al. (2019); Li et al. (2018). Our method is an extension of PQ: first, we incorporate variance information in GPQ; second, we define unified partitioning to learn a more robust shared space for approximating the embedding matrix. Our extensions consistently outperform the original PQ-based models with better compression rates and higher BLEU scores. These improvements are observed for Transformer-based models as well as conventional recurrent neural networks.

2 Product Quantization (PQ)

PQ is the core of our techniques, so we briefly discuss its details in this section; for more details see Jegou et al. (2010). As previously mentioned, NLP models encode tokens from a discrete to a continuous domain via an embedding matrix E, as shown in Equation 1:

E \in \mathbb{R}^{|V| \times n} \quad (1)

where V is the vocabulary set of unique tokens and n is the dimension of each embedding. We use the notation E_w \in \mathbb{R}^{n} to refer to the embedding of the w-th word in the vocabulary. In PQ, E is first partitioned into g groups across columns, with G_i \in \mathbb{R}^{|V| \times \frac{n}{g}} representing the i-th group. Each group G_i is then clustered using K-means into c clusters. Cluster indices and cluster centroids are stored in the vector Q_i \in \mathbb{N}^{|V|} and the matrix C_i \in \mathbb{R}^{c \times \frac{n}{g}}, respectively. Figure 2 illustrates the PQ-based decomposition process.

Figure 2: The matrix on the left-hand side shows the original embedding matrix E after dividing its columns into 3 groups (G_i; i \in \{1,2,3\}) and applying the k-means algorithm with 2 clusters (c = 2) to each group. The digit inside each block is the cluster index. The figure on the right-hand side shows the quantization matrix Q and the codebook C.

We obtain the quantization matrix Q (also known as the index matrix in the literature) and the codebook C by concatenating Q_i and C_i for all g groups along the columns, as shown in Equation 2.

Q = \text{Concat}_{\text{column}}(Q_1, Q_2, \ldots, Q_g), \qquad C = \text{Concat}_{\text{column}}(C_1, C_2, \ldots, C_g) \quad (2)

The decomposition applied by PQ is reversible: any embedding table decomposed by PQ can be reconstructed using the matrices Q_i and C_i. The size of an embedding table compressed via PQ can be calculated as shown in Equation 3:

\text{Size}_{\text{PQ}} = \log_2(c) \times |V| \times g + c\, n\, f_p \text{ (in bits)} \quad (3)

where f_p is the number of bits used to store floating-point parameters; in our experiments f_p is 32 for all settings. We use Equation 3 to measure the compression rate of different models.
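To make the decomposition concrete, the following is a minimal sketch of PQ applied to an embedding matrix, assuming NumPy and scikit-learn's KMeans; the function names and the size helper implementing Equation 3 are illustrative rather than the exact implementation used in our experiments.

```python
# A minimal PQ sketch (assumed helpers, not the exact implementation used here).
import numpy as np
from sklearn.cluster import KMeans

def pq_decompose(E, g, c):
    """Split E (|V| x n) into g column groups and run K-means with c clusters on each."""
    V, n = E.shape
    w = n // g                                   # width of each sub-vector (n/g)
    Q = np.zeros((V, g), dtype=np.int64)         # quantization (index) matrix
    C = np.zeros((c, n), dtype=E.dtype)          # codebook: g blocks, each c x n/g
    for i in range(g):
        G_i = E[:, i * w:(i + 1) * w]            # i-th column group, |V| x n/g
        km = KMeans(n_clusters=c, n_init=10).fit(G_i)
        Q[:, i] = km.labels_                     # cluster index of each word's sub-vector
        C[:, i * w:(i + 1) * w] = km.cluster_centers_
    return Q, C

def pq_reconstruct(Q, C, n):
    """Rebuild the approximate embedding matrix from Q and C."""
    V, g = Q.shape
    w = n // g
    return np.concatenate(
        [C[Q[:, i], i * w:(i + 1) * w] for i in range(g)], axis=1)

def pq_size_bits(V, n, g, c, f_p=32):
    """Equation 3: integer indices plus the floating-point codebook."""
    return np.log2(c) * V * g + c * n * f_p
```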

3 Methodology

In this section, we first explain RWE, which stands for Random Word Embeddings. In RWE, all embeddings are initialized with completely random values, so no syntactic or semantic information is available.

Next, we move from completely random embeddings to weakly supervised embeddings by incorporating some knowledge from a pre-trained embedding matrix. To this end, we propose our Gaussian Product Quantization (GPQ) technique, which applies PQ to a pre-trained embedding matrix and models each cluster with a Gaussian distribution. GPQ extends PQ by taking intra-cluster variance information into account. This approach is particularly useful when PQ clusters have a high variance, which can be the case for large embedding matrices.

3.1 Random Word Embeddings (RWE)

A simple approach to form random embeddings is to sample their values from a standard Gaussian distribution, as shown in Equation 4:

S = \mathcal{N}(E_{i,j} \,|\, \mu, \sigma^2) \;\; \forall (i,j), \qquad E'_i = \frac{S_i}{\lVert S_i \rVert} \;\; \forall i \quad (4)

where \mathcal{N} is a normal distribution with \mu = 0 and \sigma = 1, S is a sampled matrix of the same dimensions as E, E' is the reconstructed matrix, and E'_i and S_i denote the i-th row vectors of E' and S, respectively. We normalize each word embedding vector to unify the magnitude of all word embeddings, ensuring that we do not add any bias to the embedding matrix.

Since this approach relies on completely random values, to increase the expressiveness of the embeddings and to handle any potential dimension mismatch between the embedding table and the first layer of the neural model, we place an optional linear transformation matrix W \in \mathbb{R}^{n \times m} after the embedding lookup. The weight matrix W uses n \times m trainable parameters, where n is the size of the embedding vector and m is the dimension in which the model accepts its inputs. The increase in the number of parameters is negligible, as it only depends on the embedding and input dimensions (n and m), which are both considerably smaller than the number of words in the vocabulary (n, m \ll |V|).
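As a reference, the following is a minimal PyTorch sketch of RWE under the above description; the module name and its interface are illustrative. The table is sampled once from a standard Gaussian, row-normalized as in Equation 4, and kept frozen, while the optional projection W is the only trainable part.

```python
# A minimal RWE sketch (illustrative module, not the exact training code).
import torch
import torch.nn as nn

class RandomWordEmbedding(nn.Module):
    def __init__(self, vocab_size, emb_dim, model_dim=None):
        super().__init__()
        S = torch.randn(vocab_size, emb_dim)          # sample every entry from N(0, 1)
        E = S / S.norm(dim=1, keepdim=True)           # row-normalize (Equation 4)
        self.register_buffer("E", E)                  # frozen: no trainable embedding table
        # optional linear map W in R^{n x m} to match the model's input dimension
        self.proj = nn.Linear(emb_dim, model_dim, bias=False) if model_dim else None

    def forward(self, token_ids):
        x = self.E[token_ids]                         # lookup of fixed random vectors
        return self.proj(x) if self.proj is not None else x
```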

3.2 Gaussian Product Quantization (GPQ)

RWE considers completely random embeddings, which can be a very strict condition for an NLP model. Therefore, we propose Gaussian Product Quantization (GPQ) to boost random embeddings with prior information from a pre-trained embedding matrix. We assume that there is an existing embedding matrix E obtained from a model with the same architecture as ours, trained for the same task on the same dataset.

The pipeline defined for GPQ is as follows. First, we apply PQ to the pre-trained embedding matrix E to derive the quantization matrix Q_{|V| \times g} and the codebook matrix C_{c \times n}. The codebook matrix stores the centers of the c clusters for all G_i groups, and the quantization matrix stores the index mapping each sub-vector in the embedding matrix to the corresponding cluster center in the codebook. The two matrices Q and C can be used to reconstruct the original matrix E from the PQ process.

The PQ technique only stores the cluster centers in the codebook and does not take their variance into account. In contrast, our GPQ technique models each cluster with a single Gaussian distribution, setting its mean and variance parameters to the cluster center and the intra-cluster variance. We define \mathcal{C}_i^j (1 \leq j \leq c) to be the cluster corresponding to the j-th row of C_i in the codebook (that is, C_i^j), with mean \mu_i^j \in \mathbb{R}^{\frac{n}{g}} and variance (\sigma_i^j)^2. We then define the entries of the GPQ codebook as follows:

\hat{C}_i^j \sim \mathcal{N}_i^j(\mu_i^j, (\sigma_i^j)^2) \quad (5)

Consequently, we model each PQ cluster with a single Gaussian distribution parameterized by two values. The embedding matrix \hat{E} can then be reconstructed using a lookup function that maps the indices in Q_i to rows of \hat{C}_i. We illustrate our GPQ method and compare it with PQ in Figure 3. In GPQ, we only need to store the index matrix, the codebook, and the corresponding variances in order to reconstruct the embedding matrix.
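The following sketch illustrates how the GPQ codebook could be formed from the PQ output, assuming the pq_decompose helper sketched in Section 2 and a scalar intra-cluster variance per cluster (\sigma_i^2 \in \mathbb{R}^{c}, as in Section 3.2.1); it is an illustration of Equation 5, not the exact implementation.

```python
# A GPQ sketch: replace each PQ centroid with a sample from its cluster's Gaussian.
import numpy as np

def gpq_codebook(E, Q, C, c, rng=np.random.default_rng(0)):
    V, n = E.shape
    g = Q.shape[1]
    w = n // g
    C_hat = np.zeros_like(C)
    sigma2 = np.zeros((c, g))                          # one scalar variance per cluster
    for i in range(g):
        G_i = E[:, i * w:(i + 1) * w]                  # i-th column group of pre-trained E
        for j in range(c):
            members = G_i[Q[:, i] == j]                # sub-vectors assigned to cluster j
            mu = C[j, i * w:(i + 1) * w]               # cluster mean (PQ centroid)
            sigma2[j, i] = members.var() if len(members) else 0.0
            # Equation 5: sample the codebook entry from N(mu, sigma^2)
            C_hat[j, i * w:(i + 1) * w] = rng.normal(mu, np.sqrt(sigma2[j, i]), size=w)
    return C_hat, sigma2

# The embedding matrix is then rebuilt exactly as in PQ, e.g. pq_reconstruct(Q, C_hat, n),
# so only Q (integers), the cluster means, and the variances need to be stored.
```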

The PQ/GPQ method relies on partitioning the embedding matrix into equally sized groups G_i across the columns and then clustering them. We refer to the partitioning function used in PQ as structured partitioning and propose a new scheme, which we refer to as unified partitioning. The details of each scheme are described in the following.

Figure 3: The figure on the left-hand side shows the original embedding matrix E after applying the k-means algorithm with 2 clusters (c = 2) and 3 groups (g = 3), which results in 6 partitions (P_i). The figure in the middle shows the difference between PQ and GPQ. The figure on the right-hand side is the reconstructed matrix E'.

3.2.1 Structured Partitioning

The original PQ method uses structured partitioning: the input matrix E_{|V| \times n} is partitioned uniformly along the columns into g groups G_i of size |V| \times \frac{n}{g}, such that E = \text{Concat}_{\text{column}}(G_1, G_2, \ldots, G_g). Each group G_i is clustered into c clusters along the rows using any clustering algorithm. If we choose the K-means clustering algorithm for this purpose, we obtain the cluster centers C_i \subset C (where C_i \in \mathbb{R}^{c \times \frac{n}{g}}), their variances \sigma_i^2 \in \mathbb{R}^{c}, and the quantization vector Q_i \in \mathbb{N}^{|V|} corresponding to G_i as follows:

[C_i, \sigma_i^2, Q_i] = \text{K-means}(G_i, c) \quad (6)

The total number of clusters in this case is k = c \cdot g. In our technique, we store the variance of each cluster in addition to its mean, which requires more floating-point parameters than Equation 3, as shown in Equation 7:

\text{Size}_{\text{GPQ}}^{\text{Structured}} = \log_2(c) \times |V| \times g + 2\, c\, n\, f_p \text{ (in bits)} \quad (7)
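As a back-of-the-envelope illustration of Equation 7 (assuming the Table 1 setting of a 32k vocabulary, n = 512, g = 512, c = 50, f_p = 32, and taking \log_2(50) \approx 5.64):

```latex
\text{Size}_{\text{GPQ}}^{\text{Structured}}
  \approx 5.64 \times 32\text{k} \times 512 \;+\; 2 \times 50 \times 512 \times 32
  \approx 92.4\text{M} + 1.6\text{M bits}
  \approx 11.8 \text{ MB},
```

which is roughly the En→Fr embedding size reported for the structured variants in Table 3; the integer index term dominates, while the floating-point term contributes comparatively little.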

3.2.2 Unified Partitioning

Structured partitioning has limited ability to exploit redundancies across groups. Motivated by this shortcoming, we propose a unified partitioning approach in which we concatenate the g groups along the rows to form a matrix \mathcal{G} \in \mathbb{R}^{g|V| \times \frac{n}{g}} and then apply K-means to \mathcal{G} (instead of applying K-means to each group G_i separately):

\mathcal{G} = \text{Concat}_{\text{row}}(G_1, G_2, \ldots, G_g) \in \mathbb{R}^{(g|V|) \times (n/g)} \quad (8)
[C, \sigma^2, Q] = \text{K-means}(\mathcal{G}, c) \quad (9)

For a better understanding, we illustrate our method in Figure 4. Equation 10 defines the parameter size for the unified partitioning function:

\text{Size}_{\text{GPQ}}^{\text{Unified}} = \log_2(c) \times |V| \times g + 2\, c\, \frac{n}{g}\, f_p \text{ (in bits)} \quad (10)
Figure 4: The unified partitioning function with 3 groups.
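A minimal sketch of unified partitioning, reusing the NumPy/scikit-learn setup from the PQ sketch in Section 2 (again an illustration rather than the exact implementation): the g column groups are stacked along the rows and clustered jointly, so a single shared codebook serves all groups.

```python
# Unified partitioning sketch: one K-means over all stacked sub-vectors.
import numpy as np
from sklearn.cluster import KMeans

def unified_partition(E, g, c):
    V, n = E.shape
    w = n // g
    # Equation 8: stack the g groups along rows -> shape (g * |V|, n/g)
    G = np.concatenate([E[:, i * w:(i + 1) * w] for i in range(g)], axis=0)
    km = KMeans(n_clusters=c, n_init=10).fit(G)      # Equation 9: one shared codebook
    Q = km.labels_.reshape(g, V).T                   # back to a |V| x g index matrix
    C = km.cluster_centers_                          # shared centroids, c x n/g
    sigma2 = np.array([G[km.labels_ == j].var() for j in range(c)])
    return Q, C, sigma2
```

With g = n (so n/g = 1), the shared codebook reduces to c scalar means and c variances, i.e. only 2c floating-point values, matching the counts in Table 1, while the index matrix remains purely integer-valued.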
Model         | Float  | Integer
PQ            | 25.6k  | 16.3M
PQ (Unified)  | 50     | 16.3M
GPQ (Unified) | 100    | 16.3M
Table 1: Comparison of floating-point vs. integer parameters, for a vocabulary size of 32k, embedding size 512, 512 groups, and 50 clusters.
Model       | WMT En-Fr                  | WMT En-De                  | IWSLT Pt-En
            | Layers  Model Params  BLEU | Layers  Model Params  BLEU | Layers  Model Params  BLEU
Transformer | 6       60.7M         38.41| 6       63M           27.03| 3       11.9M         39.58
RWE+linear  | 6       44.4M         35.76| 6       44.2M         23.04| 3       3.7M          38.14
RWE+linear  | 8       59M           37.11| 8       58.9M         25.83| 6       11.1M         40.69
Table 2: NMT results for Random Word Embeddings (RWE) with different hyper-parameters. M denotes millions; Transformer is the baseline model Vaswani et al. (2017).
Model                | WMT En-Fr          | WMT En-De          | IWSLT Pt-En
                     | Emb. Size   BLEU   | Emb. Size   BLEU   | Emb. Size   BLEU
Transformer Baseline | 62.5 MB     39.29  | 72.3 MB     27.38  | 31.3 MB     39.88
PQ (Structured)      | 11.82 MB    39.95  | 13.65 MB    27.04  | 5.9 MB      40.33
GPQ (Structured)     | 11.92 MB    40.04  | 13.75 MB    27.11  | 5.95 MB     40.16
PQ (Unified)         | 11.7 MB     40.02  | 13.54 MB    26.87  | 5.86 MB     39.34
GPQ (Unified)        | 11.7 MB     40.51  | 13.55 MB    27.35  | 5.86 MB     40.36
Table 3: NMT results on the Transformer baseline Vaswani et al. (2017), with 8 layers, the number of groups (g) equal to the embedding size n, and 50 clusters (c).

4 Experiments

We evaluate our models on the machine translation task using the WMT 2014 English→French (En→Fr), WMT 2014 English→German (En→De), and IWSLT Portuguese→English (Pt→En) datasets. We chose these pairs as they are good representatives of high-, medium-, and low-resource language pairs.

For the En→Fr experiments, we used SentencePiece Kudo and Richardson (2018) to extract a shared vocabulary of 32k sub-word units. We chose newstest2013 as our validation set and newstest2014 as our test set. The dataset contains about 36M sentence pairs.

For the En→De experiments, we use the same setup as Vaswani et al. (2017). The dataset contains about 4.5M sentence pairs. We use a shared vocabulary of 37k sub-word units extracted using SentencePiece.

For the Pt→En experiments, we replicate the dataset configuration of Tan et al. (2019) for individual models. The dataset contains 167k training pairs. We use a shared vocabulary of 32k sub-words extracted with SentencePiece.

Evaluation

For all language pairs, we report case-sensitive BLEU scores Papineni et al. (2002) using SacreBLEU (https://github.com/mjpost/sacreBLEU) Post (2018). We train for 100K steps and save a checkpoint every 5,000 steps for the En→Fr and En→De language pairs. For Pt→En translation, we train for 50K steps and save a checkpoint every 2,000 steps. We select the best checkpoints based on validation loss, averaging the best 5 checkpoints for En→Fr and En→De and the best 3 checkpoints for Pt→En. We do not use checkpoint averaging for our LSTM experiments. We use beam search with a beam width of 4 for all language pairs.
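For reference, checkpoint averaging amounts to an element-wise mean over the saved parameters; the sketch below assumes PyTorch checkpoints whose state dict is stored under a hypothetical "model" key, and the file names in the usage comment are illustrative.

```python
# A hedged checkpoint-averaging sketch (illustrative paths and checkpoint layout).
import torch

def average_checkpoints(paths):
    """Element-wise average of the parameters stored in several checkpoints."""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")["model"]   # assumed "model" key
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g. for Pt-En, averaging the best 3 checkpoints selected by validation loss:
# averaged = average_checkpoints(["step_46000.pt", "step_48000.pt", "step_50000.pt"])
```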

4.1 Model and Training Details

We use the Transformer model Vaswani et al. (2017) as the baseline for all our experiments. We also report results on LSTM-based models to study our hypothesis for RNN-based architectures. Our LSTM-based encoder uses multiple layers of bidirectional LSTMs, followed by a linear layer on the last layer to combine the internal states of the LSTM. The decoder uses multiple layers of unidirectional LSTMs, and its final output attends over the encoder output when generating words. All our models share weights between the embedding matrix and the output projection layer Press and Wolf (2016).

For the transformer-base configuration, the model hidden size n is set to 512 and the feed-forward hidden size d_ff to 2048. For the transformer-small configuration, the model hidden size n is set to 256 and the feed-forward hidden size d_ff to 1024. In both cases, we vary the number of layers across experiments instead of using the default 6 layers, and transformer-small uses the same dropout configuration as transformer-base. For the LSTM experiments, LSTM-small uses 2 layers with a hidden size of 256, and LSTM-large uses 7 layers with a hidden size of 512. We use transformer-base and LSTM-large for the WMT datasets, and transformer-small and LSTM-small for the IWSLT dataset. For all experiments, we set the NumPy Van Der Walt et al. (2011) seed to 0 and the PyTorch Paszke et al. (2017) seed to 3435.

All models are optimized using Adam Kingma and Ba (2014) and the same learning rate schedule as proposed by Vaswani et al. (2017). We use label smoothing with 0.1 weight for the uniform prior distribution over the vocabulary Szegedy et al. (2015); Pereyra et al. (2017).

We train all our models on 8 NVIDIA V100 GPUs. Each training batch contains a set of sentence pairs with approximately 6,144 source tokens and 6,144 target tokens per GPU worker. We implement our method using the OpenNMT library Klein et al. (2017) in PyTorch.

Model          | WMT En-Fr          | WMT En-De          | IWSLT Pt-En
               | Emb. Size   BLEU   | Emb. Size   BLEU   | Emb. Size   BLEU
LSTM Baseline  | 62.5 MB     32.61  | 72.3 MB     22.17  | 31.3 MB     35.12
PQ (Unified)   | 13.67 MB    32.05  | 15.8 MB     22.13  | 6.8 MB      35.35
GPQ (Unified)  | 13.67 MB    33.82  | 15.8 MB     22.14  | 6.8 MB      35.58
Table 4: NMT results on the LSTM-based model; we use 128 clusters (c) and the number of groups (g) equal to the embedding size n.
Model          | WMT En-Fr                | WMT En-De                | IWSLT Pt-En
               | Emb. Size  BLEU   c      | Emb. Size  BLEU   c      | Emb. Size  BLEU   c
Transformer    | 62.5 MB    39.29  -      | 72.3 MB    27.38  -      | 31.3 MB    39.88  -
PQ (Struct.)   | 1.25 MB    39.13  140    | 1.44 MB    25.35  160    | 1.22 MB    40.6   256
GPQ (Struct.)  | 1.25 MB    39.18  102    | 1.44 MB    26.08  116    | 1.1 MB     41.1   128
PQ (Unified)   | 1.25 MB    39.29  1024   | 1.44 MB    25.83  1024   | 1.24 MB    40.62  1024
GPQ (Unified)  | 1.28 MB    39.77  1024   | 1.47 MB    25.96  1024   | 1.25 MB    41.32  1024
Table 5: NMT results on the Transformer baseline Vaswani et al. (2017), with 32 groups (g) and 8 layers. c denotes the number of clusters; Struct. and Unified denote the structured and unified partitioning functions, respectively.

5 Results

5.1 Random Word Embeddings

We replace the embedding matrix of the Transformer model with the RWE method and summarize our results in Table 2. We evaluate our model in two settings.

In the first setting, we use exactly the same hyper-parameters as the baseline model and only replace the embedding matrix. In this case, our model has only 2 embedding parameters, storing the mean and variance of the Gaussian distribution, plus an additional linear layer for re-scaling the embedding vectors. Our results show that without any embedding parameters beyond a linear transformation, the Transformer model scores only about 1.44 BLEU points below the baseline for Pt→En. The deterioration for En→Fr is approximately 1 BLEU point larger than for Pt→En, and for En→De the drop is about 4 BLEU points.

In the previous setting, our model used at least 27% fewer parameters than the baseline model. To compare RWE in a fair parameter setting, we increase the number of layers in the Transformer to approximately match the total parameter count of the baseline. In this setting, our model performs within 1.3 BLEU points of the baseline for En→De and En→Fr. On the Pt→En experiments, our model scores about 1.0 BLEU points higher than the baseline with slightly fewer parameters.

Overall, our experiments show that Transformer-based neural networks have the capacity to learn semantic and syntactic information efficiently in deeper layers, although embedding tables remain beneficial, not redundant, for token representation.

5.2 Gaussian Product Quantization

We apply the GPQ method to the embedding matrix of Transformer and LSTM-based models. We design a set of experiments to investigate 1) whether we can approximate the embedding matrix in an almost discrete space using integer values, and 2) the compression ability of our method compared to PQ.

Experiments for Discrete Space Approximation

We evaluate our method on Transformer (Table 3) and LSTM-based models (Table 4). For all experiments, we set the number of groups g equal to the size of the embedding vector n. This maximizes the number of groups and, in turn, maximizes the amount of discrete information (integer values) while minimizing the number of floating-point parameters required to reconstruct the embedding matrix. We use 50 clusters for the Transformer-based model and 128 clusters for the LSTM model. Our results for the GPQ method with the unified partitioning function show that, with only 50 clusters for the Transformer model and 128 clusters for the LSTM model, we are able to perform on par with or better than the baseline models. This implies that in all experiments we were able to approximate the embedding matrix with nearly 100% integer values. Specifically, when g = n, our unified partitioning function uses only 2c floating-point values (for storing the cluster means and variances), while the structured partitioning function uses 2cn floating-point values. In Table 3, GPQ with the unified partitioning function outperforms the other models consistently across all datasets. For En→De, our approach is the only method that performs on par with the baseline model with 5x fewer embedding parameters.

In Table 4, we report LSTM experiments only for the unified partitioning function, as it was the best-performing method. For LSTMs, GPQ (Unified) performs better than PQ (Unified) on En→Fr by about 1.77 BLEU points; for the other languages, the difference is insignificant.

Experiments for Compression Analysis

Our objective is to compare the capacity of the GPQ method to compress the embedding matrix against PQ. We set the number of clusters to 1024 and the number of groups to 32, as reducing the number of groups has a greater effect on the compression rate. We report the results in Table 5. We find that with the unified partitioning function, the GPQ method performs significantly better than the Transformer baseline on the En→Fr (+0.48) and Pt→En (+0.7) datasets. For En→De, GPQ (Unified) performs only 0.03 BLEU points below the baseline. For all language pairs, GPQ (Unified) is the best-performing model compared to PQ and the other variants.

Figure 5: NMT results on IWSLT Pt-En for the GPQ method with the unified partitioning function. The x-axis shows the number of clusters/groups (4, 8, 16, 32, 64) and the y-axis the BLEU score (roughly 38 to 40), with one curve fixing 128 clusters while varying the group size and one fixing 256 groups while varying the cluster size.
Model                               | WMT En-Fr           | WMT En-De
                                    | Compression  BLEU   | Compression  BLEU
Transformer Baseline                | 1.0x         38.41  | 1.0x         27.03
SVD with rank 64*                   | 7.87x        37.44  | 7.89x        26.32
GroupReduce Chen et al. (2018)*     | 7.79x        37.63  | 7.88x        26.75
Tensor Train Khrulkov et al. (2019) | 7.72x        37.27  | 7.75x        26.19
GPQ (Unified)                       | 7.9x         39.65  | 7.9x         26.84
Table 6: NMT results on Transformer-based models, with 256 clusters (c), 256 groups (g), and 6 layers. Results in rows marked with * are taken from Lioutas et al. (2019).

6 Discussion

6.1 Cluster and Group Size Analysis

We study the effect of different cluster and group sizes on the GPQ method with the unified partitioning function on the Pt→En dataset and plot the results in Figure 5. We first fix the number of groups to g = 256 and vary the number of clusters, and find that with only 4 clusters, i.e., only 8 floating-point parameters, our model performs best. We also fix the number of clusters and vary the group size: a group size of 4 is the worst-performing setting, as a result of excessive regularization, while all group sizes larger than 4 have only a minor influence on performance. Overall, we need only 8 floating-point parameters to approximate the embedding matrix without any performance drop. Thus, the most important information lies in a discrete latent space consisting of integer values.

6.2 Compression Analysis

We compare the performance of our model in a compression setting against different compression baselines in Table 6. In particular, we compare against two state-of-the-art methods Khrulkov et al. (2019); Chen et al. (2018), as well as two standard baselines reported in Lioutas et al. (2019). For all models, we take the results directly from Lioutas et al. (2019) and choose a similar compression ratio. For both the En→Fr and En→De language pairs, our GPQ (Unified) model performs best among all compared models. Additionally, GPQ (Unified) is the only model that outperforms the baseline model on the En→Fr language pair.

6.3 Importance of Variance

We extended PQ by introducing variance information in our GPQ method. The results in Table 4 highlight the importance of this information for LSTM-based models. Table 6 shows that incorporating variance information can also be beneficial for Transformer-based models in a high-compression setup. Lastly, Table 3 shows that the GPQ model is particularly beneficial when using the unified partitioning function. The unified partitioning function uses a much smaller total number of clusters, so each cluster has a higher variance around its centroid than in the structured partitioning case. Thus, incorporating variance information through GPQ is beneficial when cluster values have high variance.

7 Related Work

Existing embedding compression techniques are generally based on either singular value decomposition (SVD) or product quantization (PQ). In general, SVD-based methods Chen et al. (2018); Khrulkov et al. (2019) approximate embedding matrices with fewer parameters by projecting high-dimensional matrices into lower dimensions. The learning process is therefore modified to work with new, smaller matrices whose multiplication reconstructs the original values. The model proposed by Chen et al. (2018) is a well-known example of an SVD-based model in this field: words are first divided into groups based on frequency, and weighted SVD is then applied to each group, allowing a compressed representation of the embedding matrix. Khrulkov et al. (2019) used a multi-linear variant instead of regular SVD: the embedding dimensions are first factorized to obtain the indices of a smaller tensor embedding, which has fewer parameters and can be used to reconstruct the original embedding matrix.

Shu and Nakayama (2017) proposed to compress the embedding matrix by learning an integer code vector for every word in the vocabulary (the code matrix) using the Gumbel-softmax technique. The code matrix is equivalent to the quantization matrix in PQ. An embedding vector is reconstructed by looking up all the values in the code matrix from the corresponding set of vector dictionaries; the vectors are then added (in PQ the vectors are concatenated) to obtain the final embedding vector.

Svenstrup et al. (2017) proposed a hashing-based technique with a common memory for storing vectors. The method is similar to Shu and Nakayama (2017), with two key differences: first, the code matrix is not learned but assigned uniformly using a hashing function; second, instead of a simple summation of component vectors, a weighted summation is used to obtain the final embedding vector.

Kim et al. (2020) learn the code matrix using the binarized code learning introduced by Tissier et al. (2019). The key novelty introduced by Kim et al. (2020) is to learn a different code-vector length for each word.

8 Conclusion

Our work points towards the need to rethink the process of encoding tokens into real-valued vectors for machine translation. We show that the Transformer model is capable of recovering useful syntactic and semantic information from a random assignment of embedding vectors. Our variant of product quantization approximates the embedding matrix in an almost 100% discrete space, with better performance than the baseline model. A discrete space is easier to interpret than a continuous one and can motivate future research on handling unknown tokens through optimal cluster selection.

References

  • Chen et al. (2018) Patrick Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. 2018. Groupreduce: Block-wise low-rank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems, pages 10988–10998.
  • Jegou et al. (2010) Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2010. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128.
  • Khrulkov et al. (2019) Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. 2019. Tensorized embedding layers for efficient model compression. arXiv preprint arXiv:1901.10787.
  • Kim et al. (2020) Yeachan Kim, Kang-Min Kim, and SangKeun Lee. 2020. Adaptive compression of word embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3950–3959.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.
  • Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. Opennmt: Open-source toolkit for neural machine translation. In Proc. ACL.
  • Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
  • Li and Yang (2018) Yang Li and Tao Yang. 2018. Word embedding for understanding natural language: a survey. In Guide to Big Data Applications, pages 83–104. Springer.
  • Li et al. (2018) Zhongliang Li, Raymond Kulhanek, Shaojun Wang, Yunxin Zhao, and Shuang Wu. 2018. Slim embedding layers for recurrent neural language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
  • Lioutas et al. (2019) Vasileios Lioutas, Ahmad Rashid, Krtin Kumar, Md Akmal Haidar, and Mehdi Rezagholizadeh. 2019. Distilled embedding: non-linear embedding factorization using knowledge distillation.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. In NIPS-W.
  • Pereyra et al. (2017) Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. CoRR, abs/1701.06548.
  • Post (2018) Matt Post. 2018. A call for clarity in reporting bleu scores. Proceedings of the Third Conference on Machine Translation: Research Papers.
  • Press and Wolf (2016) Ofir Press and Lior Wolf. 2016. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859.
  • Ramanujan et al. (2020) Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. 2020. What’s hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11893–11902.
  • Shi et al. (2015) Tianze Shi, Zhiyuan Liu, Yang Liu, and Maosong Sun. 2015. Learning cross-lingual word embeddings via matrix co-factorization. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 567–572.
  • Shu and Nakayama (2017) Raphael Shu and Hideki Nakayama. 2017. Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068.
  • Sun et al. (2016) Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. 2016. Sparse word embeddings using l1 regularized online learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2915–2921. AAAI Press.
  • Svenstrup et al. (2017) Dan Tito Svenstrup, Jonas Hansen, and Ole Winther. 2017. Hash embeddings for efficient word representations. In Advances in Neural Information Processing Systems, pages 4928–4936.
  • Szegedy et al. (2015) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567.
  • Tan et al. (2019) Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. CoRR, abs/1902.10461.
  • Tian et al. (2014) Fei Tian, Hanjun Dai, Jiang Bian, Bin Gao, Rui Zhang, Enhong Chen, and Tie-Yan Liu. 2014. A probabilistic model for learning multi-prototype word embeddings. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 151–160.
  • Tissier et al. (2019) Julien Tissier, Christophe Gravier, and Amaury Habrard. 2019. Near-lossless binarization of word embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7104–7111.
  • Van Der Walt et al. (2011) Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. 2011. The numpy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.