
Frequency-aware Dimension Selection for Static Word Embedding by Mixed Product Distance

Lingfeng Shen (Johns Hopkins University)
Haiyun Jiang (Tencent AI Lab)
Lemao Liu (Tencent AI Lab)
Ying Chen (China Agricultural University)
Abstract

Static word embedding is still useful, particularly for context-unavailable tasks, because when no context is available, pre-trained language models often perform worse than static word embeddings. Although dimension is a key factor determining the quality of static word embeddings, automatic dimension selection is rarely discussed. In this paper, we investigate the impact of word frequency on dimension selection, and empirically find that word frequency is so vital that it needs to be taken into account during dimension selection. Based on this empirical finding, we propose a dimension selection method that uses a metric, the Mixed Product Distance (MPD), to select a proper dimension for word embedding algorithms without training any word embedding. By applying a post-processing function to oracle matrices, the MPD-based method can de-emphasize the impact of word frequency. Experiments on both context-unavailable and context-available tasks demonstrate the better efficiency-performance trade-off of our MPD-based dimension selection method over baselines.

1 Introduction

Word embedding has been widely used in numerous NLP tasks, such as recommendation systems Zhang et al. (2019), text classification Chung and Glass (2018), information retrieval Palangi et al. (2016) and machine translation Guo et al. (2019). Although pre-trained language models (PLMs) like BERT Devlin et al. (2018) have become the prevailing paradigm in NLP, static word embeddings, such as Word2Vec Mikolov et al. (2013) and GloVe Pennington et al. (2014), are still useful in scenarios where contexts are missing, such as entity retrieval Liu et al. (2021); Han et al. (2020) and word similarity tasks Halawi et al. (2012); Bruni et al. (2014). Under such scenarios, static word embeddings often perform better than PLMs.

Although there are many algorithms for training static word embeddings, the impact of dimension on word embedding has been less investigated. Yin and Shen (2018) have shown that a critical hyper-parameter, the choice of dimension for word vectors, significantly influences the performance of a model that builds on word embeddings. An extremely low-dimensional word embedding is probably not expressive enough to represent the meanings of words; in contrast, a high-dimensional word embedding is expressive but possibly leads to over-fitting and high computational cost. For most NLP models built on word embeddings, the dimension is selected ad hoc or by grid search, which results in either sub-optimal model performance or heavy computational cost. Thus, it is promising to establish a dimension selection criterion that can automatically select a proper dimension for static word embedding algorithms with a good efficiency-performance trade-off.

Yin and Shen (2018) first investigated dimension selection for static word embedding. They proposed a metric called Pairwise Inner Product (PIP) loss which attempts to reflect the relationship between the dimension and the quality of word embedding and then developed a PIP-based dimension selection method that selects a dimension with the minimal PIP loss. Later, Wang (2019) proposed a dimension selection method based on Principal Components Analysis (PCA). The PCA-based dimension selection trains a high-dimensional word embedding and then uses the embedding to conduct a grid search to find a proper dimension, which is time-consuming. However, both the PIP-based dimension selection and the PCA-based dimension selection overlook one dominant element that may influence the dimension selection: word frequency.

Thus, in this paper, we first investigate whether word frequency is influential in the dimension selection of static word embedding. Our results reveal that extreme word frequency may cause the PIP-based or PCA-based methods to select an improper dimension. That is, without diminishing the impact of word frequency, these methods are more likely to select an improper dimension and thus generate a low-quality word embedding. We then explore a way to diminish the impact of word frequency. Inspired by the success of post-processing methods on static word embeddings Mu and Viswanath (2018); Liu et al. (2019), which de-emphasize the influence of some words in an embedding, we attempt to transfer them to dimension selection.

Specifically, we propose an MPD-based dimension selection, which minimizes a metric called Mixed Product Distance (MPD) to find a proper dimension. MPD can reduce the impact of word frequency by applying a post-processing function to oracle matrices, which results in a more proper dimension. To investigate the effectiveness of our method, we conduct evaluations on context-unavailable and context-available tasks. Comprehensive experiments demonstrate that the MPD-based dimension selection method consistently performs better than baselines.

In summary, the main contributions of this paper are as follows:

  • We expose a critical issue overlooked by previous dimension selection methods: word frequency. It can influence the dimension selection and result in degraded word embeddings.

  • We design a metric called Mixed Product Distance (MPD) for static word embedding dimension selection, which uses post-processing functions to overcome the word frequency problem.

  • We evaluate the MPD-based dimension selection on various NLP tasks, including context-unavailable and context-available tasks. Comprehensive experiments demonstrate that our method achieves a better efficiency-performance trade-off.

2 Related Work

2.1 Static Word Embedding and Dimension Selection

Existing static word embedding algorithms can be formulated as either explicit or implicit matrix factorization, and the most popular word embedding algorithms can be regarded as implicit matrix factorization. For example, Word2Vec and GloVe perform implicit factorization on the symmetric matrix (i.e., $\alpha=0.5$), where the signal matrix $M$ is the shifted PPMI matrix for Word2Vec and the log-count matrix for GloVe. Although intensive studies have been done on static word embedding algorithms that learn word embeddings via matrix factorization, selecting a proper dimension for these algorithms has been less investigated. Yin and Shen (2018) proposed a PIP-based dimension selection method: for every candidate dimension, its PIP loss is computed, and the dimension with the minimal PIP loss is selected. Notably, the PIP-based method can be conducted without training any word embedding. In contrast, Wang (2019) proposed a PCA-based method for dimension selection, which trains a high-dimensional word embedding to guide the dimension selection and is therefore extremely time-consuming.

2.2 Post-processing for Word Embedding

To further improve the quality of a word embedding, post-processing, which reinforces linguistic constraints on the embedding, is often used. According to the studies on post-processing Mu and Viswanath (2018); Liu et al. (2019), one of the important linguistic constraints is to diminish the influence of word frequency. Specifically, in a word embedding, principal components (PCs), which are mainly encoded by frequent words, are shared by all words. As a result, the vectors of less frequent words have a higher variance than those of frequent words, and such a high variance hurts the embedding quality because word frequency is unrelated to lexical semantics. Two popular post-processing functions, All-but-the-top (ABTT) Mu and Viswanath (2018) and Conceptor Negation (CN) Liu et al. (2019), have been used to remove or suppress PCs in word embeddings. In our method, these post-processing functions are applied to the oracle matrix of word embedding algorithms instead of to the word embedding.
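To make the idea concrete, the following is a minimal sketch of ABTT-style post-processing under our own simplifying assumptions (it is not the reference implementation of Mu and Viswanath): the embedding matrix is mean-centered and its top principal components, which mainly encode frequency-related directions, are projected out; the number of removed components is a hypothetical choice.

```python
import numpy as np

def abtt(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
    """ABTT-style post-processing sketch: subtract the mean vector, then
    remove the projections onto the top principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Principal directions of the centered embedding matrix via SVD.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_pcs = vt[:n_components]                 # shape: (n_components, dim)
    # Project out the dominant, frequency-related components.
    return centered - centered @ top_pcs.T @ top_pcs
```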

3 Preliminaries

This section presents the basic procedure for the dimension selection of word embedding. A static word embedding algorithm is chosen to obtain a word embedding matrix $X\in\mathbb{R}^{n\times k}$, where $X$ is composed of $n$ vectors, and word $w_{i}$ is represented by vector $v_{i}\in\mathbb{R}^{k}$, $i\in[n]$. Generally, dimension selection aims to find a proper dimension $k$ with a good performance-efficiency trade-off.

There are two existing dimension selection methods: PIP-based and PCA-based. The PIP-based method can compute the dimension directly without training any word embedding, while the PCA-based method needs to train a word embedding before dimension selection, which incurs much more computational cost. Hence, the PCA-based method has a worse performance-efficiency trade-off than the PIP-based method. In this paper, we work on dimension selection without training any word embedding.

Following previous works Levy and Goldberg (2014), we leverage an oracle matrix for dimension selection, which can be obtained through matrix decomposition without any training process. Formally, for a static word embedding algorithm with two hyper-parameters ($\alpha$ and dimension $k$) and a signal matrix $M$ (i.e., the co-occurrence matrix), an oracle matrix $X$ is derived from $M$ as $X=f_{\alpha,k}(M)=U_{\cdot,1:k}D_{1:k,1:k}^{\alpha}$, where $M=UDV^{T}$ is the SVD of $M$ obtained by matrix factorization Levy et al. (2015).

To implement the dimension selection, we divide the training corpus into two equal-sized partitions $C_{1}$ and $C_{2}$, and then obtain the co-occurrence matrices $M_{1}$ and $M_{2}$ on each partition, respectively. Correspondingly, we obtain two oracle matrices $X$ and $\hat{X}$. Then, a common way to select a proper dimension is used: define a proper distance $d$ and find the $k$ that minimizes the distance $d(X,\hat{X})$. Notice that a distance whose near-zero value ($\approx 0$) guarantees the unitary-invariance of the two oracle matrices $X$ and $\hat{X}$ may serve as a proper distance for dimension selection Yin and Shen (2018), where unitary-invariance means that two matrices are essentially identical if one can be transformed into the other by a unitary operation (e.g., a simple rotation); unitary-invariance thus defines the equivalence class of the oracle matrices.
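For concreteness, a minimal sketch of this procedure is given below. It assumes the co-occurrence (signal) matrices from the two partitions have already been built; the candidate dimensions and the distance function are left as parameters (the distance itself is defined in Section 5), and in practice a truncated SVD would replace the full one.

```python
import numpy as np

def oracle_matrix(M: np.ndarray, k: int, alpha: float = 0.5) -> np.ndarray:
    """Oracle embedding X = U[:, :k] * D[:k]**alpha obtained from the SVD of M."""
    U, D, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * (D[:k] ** alpha)

def select_dimension(M1, M2, candidate_ks, distance, alpha=0.5):
    """Pick the k that minimizes distance(X, X_hat) between the two oracle matrices."""
    best_k, best_d = None, float("inf")
    for k in candidate_ks:
        X, X_hat = oracle_matrix(M1, k, alpha), oracle_matrix(M2, k, alpha)
        d = distance(X, X_hat)
        if d < best_d:
            best_k, best_d = k, d
    return best_k
```

Any distance with the unitary-invariance property discussed above can be plugged in as `distance`.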

4 Is word frequency influential in dimension selection?

Although previous works Mu and Viswanath (2018); Gong et al. (2018) have pointed out that word frequency may bring negative consequences to word embeddings, the impact of word frequency on dimension selection has been overlooked by previous studies on dimension selection Yin and Shen (2018); Wang (2019). Therefore, we explore a question in this section: does word frequency influence the dimension selection for word embeddings?

We first design an experiment with extreme word frequency on two tasks: a context-unavailable one (the WordSim353 dataset, Finkelstein et al., 2001) and a context-available one (the SST-2 dataset, Socher et al., 2013). In the experiment, the following two kinds of dimensions, $k^{*}$ and $k^{+}$, are used. Generally, the word embedding with $k^{+}$ consistently outperforms the one with $k^{*}$, and the performance gap between $k^{+}$ and $k^{*}$ reflects the effectiveness of a dimension selection method.

  • Criterion-selected dimension $k^{*}$: Given a dimension selection method, the dimension with the minimal distance $d$ is selected as the criterion-selected dimension $k^{*}$.

  • Empirically optimal dimension $k^{+}$ (Optimal): Given a word embedding algorithm, a word embedding is trained for each dimension in a grid search (i.e., the dimension is varied from 50 to 1000 with an increment step of 2). Then, each word embedding is evaluated on the benchmark, and the dimension achieving the highest performance is selected as the empirically optimal dimension $k^{+}$.

Dataset      Criterion          0      3%     5%     10%
SST-2        $k^{+}$ (Optimal)  81.9   81.4   81.5   81.8
             $k^{*}$ (PIP)      80.0   79.5   79.2   78.7
             $k^{*}$ (PCA)      79.8   79.5   79.0   78.5
WordSim353   $k^{+}$ (Optimal)  69.1   69.2   69.1   69.1
             $k^{*}$ (PIP)      65.5   64.4   63.9   63.5
             $k^{*}$ (PCA)      64.4   64.0   63.5   63.2
Table 1: Performance on the two tasks with extreme word frequency. The embeddings are generated by Word2Vec with a dimension selected by a criterion (PIP, PCA, or grid search). The '10%' column means adding $10\%\cdot N$ duplicate sentences to the training corpus.

Given the training corpus, the Gigaword 5 corpus which contains $N$ sentences, we create varying word frequencies as follows: we randomly select a sentence (whose length is larger than 20, so that the duplication does not affect the overall fluency of the corpus given that the Word2Vec window size is 5), duplicate it $3\%\cdot N$, $5\%\cdot N$, and $10\%\cdot N$ times, and merge the copies with the corpus. The results are listed in Table 1. As the word frequency becomes more extreme, the performance gap between $k^{*}$ and $k^{+}$ enlarges, indicating that extreme word frequency significantly hurts the effectiveness of existing dimension selection methods. Thus, it is necessary to diminish the influence of word frequency during dimension selection.
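For reference, a minimal sketch of this corpus manipulation is shown below; only the sentence-length filter and the duplication ratio come from the description above, while the tokenization (whitespace splitting) and in-memory handling of the corpus are simplifying assumptions.

```python
import random

def add_extreme_frequency(sentences, ratio=0.03, min_len=20, seed=0):
    """Duplicate one randomly chosen long sentence ratio*N times to create
    extreme word frequencies, then merge the copies with the corpus."""
    random.seed(seed)
    long_sentences = [s for s in sentences if len(s.split()) > min_len]
    duplicated = random.choice(long_sentences)
    n_copies = int(ratio * len(sentences))
    return sentences + [duplicated] * n_copies
```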

5 Methodology

5.1 Overview

The formulation of our proposed distance, the Mixed Product Distance (MPD), is defined as follows:

$\mathrm{MPD}(X,\hat{X})=\sqrt{d_{r}(X,\hat{X})\cdot d_{p}(X,\hat{X})}$   (1)

where $X$ and $\hat{X}$ correspond to the two oracle matrices mentioned in Sec 3. Concretely, MPD consists of two distances: the primitive relative distance $d_{r}$ and the post relative distance $d_{p}$. Based on this distance criterion, an MPD-based dimension selection method is developed to select a proper dimension for static embedding algorithms.

5.2 Primitive Relative Distance

Definition 1 gives the definition of the primitive relative distance $d_{r}$, and this definition guarantees the unitary-invariance of $X$ and $\hat{X}$. Moreover, unless stated otherwise, the matrix norms used in this paper are Frobenius norms.

Definition 1

The primitive relative distance $d_{r}$ between the oracle matrices $X\in\mathbb{R}^{n\times k}$ and $\hat{X}\in\mathbb{R}^{n\times k}$ is defined as follows:

$d_{r}\left(X,\hat{X}\right)=\dfrac{\left\|XX^{T}-\hat{X}\hat{X}^{T}\right\|^{2}}{\left\|XX^{T}\right\|\left\|\hat{X}\hat{X}^{T}\right\|}$   (2)

where the denominator $\|XX^{T}\|\,\|\hat{X}\hat{X}^{T}\|$ is a scaling term.

Then, the relationship between $d_{r}$ and unitary-invariance is presented as follows. According to Theorem 1, if $d_{r}$ is close to 0, the two matrices $X$ and $\hat{X}$ are (approximately) unitary-invariant. In other words, when the numerator $\|XX^{T}-\hat{X}\hat{X}^{T}\|^{2}$ in Eq. 2 is 0, $X$ and $\hat{X}$ can be regarded as identical Smith et al. (2017); Artetxe et al. (2016). Therefore, the definition of $d_{r}$ guarantees that when $d_{r}$ is close to 0, the unitary-invariance of $X$ and $\hat{X}$ holds. Thus, the dimension that minimizes $d_{r}$ is a proper dimension for word embedding, as discussed in Sec 3.

Theorem 1

Suppose $d_{r}(A,B)\approx 0$; then $A$ and $B$ are unitary-invariant (i.e., there exists a unitary matrix $T$ such that $B\approx AT$).

The proof is deferred to Appendix A.
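As an illustrative numerical check of Definition 1 and Theorem 1 (not part of the paper's pipeline), the sketch below shows that $d_{r}$ between a random matrix and a unitarily rotated copy of itself is numerically zero, while it is clearly non-zero for an unrelated matrix.

```python
import numpy as np

def d_r(X, X_hat):
    """Primitive relative distance of Eq. 2, using Frobenius norms."""
    num = np.linalg.norm(X @ X.T - X_hat @ X_hat.T) ** 2
    den = np.linalg.norm(X @ X.T) * np.linalg.norm(X_hat @ X_hat.T)
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
T, _ = np.linalg.qr(rng.normal(size=(50, 50)))   # random orthogonal (unitary) matrix

print(d_r(X, X @ T))                          # ~0 up to floating-point error
print(d_r(X, rng.normal(size=(1000, 50))))    # clearly non-zero for an unrelated matrix
```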

5.3 Post Relative Distance

In order to overcome the word frequency problem, the post relative distance $d_{p}$ is defined as shown in Definition 2, where $F(\cdot)$ is a post-processing function applied to the oracle matrices. Notice that the post-processing function in $d_{p}$ can be any post-processing method that has been used on word embeddings.

Definition 2

Given a post-processing function $F$, the post relative distance $d_{p}$ between the oracle matrices $X\in\mathbb{R}^{n\times k}$ and $\hat{X}\in\mathbb{R}^{n\times k}$ is defined as follows:

$d_{p}\left(Y,\hat{Y}\right)=\dfrac{\left\|YY^{T}-\hat{Y}\hat{Y}^{T}\right\|^{2}}{\left\|YY^{T}\right\|\left\|\hat{Y}\hat{Y}^{T}\right\|}$   (3)

where $Y=F(X)$ and $\hat{Y}=F(\hat{X})$.

Specifically, we transfer the success of post-processing functions on word embedding matrices to oracle matrices. Post-processing functions essentially 'normalize' a word embedding by de-emphasizing the impact of some words. Since the oracle matrices $X$ and $\hat{X}$ are also influenced by word frequency, we believe that applying a post-processing function to the oracle matrices can likewise diminish this influence.

Moreover, the post relative distance is essentially the same as the primitive relative distance, except for the use of the post-processing function $F$. Similar to the discussion of $d_{r}$, $d_{p}$ also guarantees unitary-invariance.

5.4 Combination

It is also worth mentioning that eliminating the impact of word frequency is not always beneficial for word embedding, because some crucial information in a word embedding is inevitably eliminated along with it Gong et al. (2018). Therefore, we combine $d_{r}$ and $d_{p}$ with a geometric mean in MPD.
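Putting Eqs. 1-3 together, the following is a self-contained sketch of the MPD computation. The helper `relative_distance` implements the shared form of $d_{r}$ and $d_{p}$, and `post_process` stands in for the post-processing function $F$ (the paper uses ABTT or CN; a simple mean-and-top-PC removal is used here as an assumed placeholder).

```python
import numpy as np

def relative_distance(A, B):
    """Shared form of d_r (Eq. 2) and d_p (Eq. 3): ||AA^T - BB^T||^2 / (||AA^T|| ||BB^T||)."""
    num = np.linalg.norm(A @ A.T - B @ B.T) ** 2
    den = np.linalg.norm(A @ A.T) * np.linalg.norm(B @ B.T)
    return num / den

def post_process(X, n_components=2):
    """Placeholder post-processing F: mean removal plus top-PC removal (ABTT-style)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    pcs = vt[:n_components]
    return Xc - Xc @ pcs.T @ pcs

def mpd(X, X_hat, F=post_process):
    """Mixed Product Distance (Eq. 1): geometric mean of d_r and d_p."""
    d_r = relative_distance(X, X_hat)
    d_p = relative_distance(F(X), F(X_hat))
    return np.sqrt(d_r * d_p)
```

This `mpd` function can then serve as the `distance` argument of the dimension-selection loop sketched in Section 3.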

Embedding   Criterion   WS      MTurk   MC      MEN     RG      RW
BERT        None        41.15   32.89   49.13   41.23   34.66   13.15
GloVe       Optimal     69.12   58.78   70.20   73.75   76.95   33.55
            PIP         65.48   55.48   67.19   71.56   74.32   31.49
            PCA         64.44   55.23   66.82   71.09   73.56   31.67
            MPD         66.89   57.02   68.66   72.92   75.23   32.91
Word2Vec    Optimal     68.03   58.12   63.96   70.27   70.01   25.43
            PIP         64.68   55.63   62.19   68.45   68.76   23.12
            PCA         64.56   55.91   62.03   68.70   69.04   22.41
            MPD         67.14   57.11   62.88   69.79   69.54   24.64
Table 2: Performance on word similarity using different word embeddings. Except for the dynamic embedding produced by BERT, the word embeddings are generated by a static word embedding algorithm with a dimension selected by a criterion (PIP, PCA, MPD, or a grid search). Each number is 100 $\times$ $\rho$ (Spearman correlation); bold numbers in the original table indicate the best performance except 'Optimal'.

6 Experiment

This section carries out three sets of experiments to comprehensively analyze our MPD-based dimension selection. The first two sets of experiments evaluate the effectiveness of our MPD-based dimension selection on context-unavailable and context-available tasks, respectively. The third set of experiments compares the efficiency-performance trade-off of different dimension selection methods.

Specifically, two popular static word embedding algorithms (Word2Vec, for which we use the CBOW version, and GloVe) are chosen to train word embeddings, and both use Gigaword 5 as the training corpus. For the MPD-based dimension selection method, two commonly-used post-processing methods (CN and ABTT) are used, with CN as the default. Moreover, for each experiment, a criterion-selected dimension $k^{*}$ and an empirically optimal dimension $k^{+}$ are computed, whose definitions can be found in Sec 4.

Practical Settings

Considering that SVD decomposition on large matrices requires extremely large memory resources, we reduce the vocabulary size of the word embedding to 25,000, which can be handled by a device with 16GB of memory. Moreover, previous work Yin and Shen (2018) has shown that as long as the vocabulary size is beyond a threshold (i.e., 10,000), the performance does not degrade much even when the vocabulary is reduced.

6.1 Context-unavailable Tasks

Dataset    Embedding       Human Score
Chinese    BERT            0.79
           $k^{*}$ (PIP)   1.12
           $k^{*}$ (PCA)   1.08
           $k^{*}$ (MPD)   1.38
English    BERT            0.68
           $k^{*}$ (PIP)   1.12
           $k^{*}$ (PCA)   1.05
           $k^{*}$ (MPD)   1.29
Table 3: Performance (human score) on semantic expansion using different word embeddings, which are generated by GloVe with a dimension selected by PIP, PCA, or MPD. The empirically optimal dimension $k^{+}$ (Optimal) is not reported because it is selected according to the final performance, and such a selection is not feasible for semantic expansion since it would require hundreds of rounds of human annotation.

Experiment Settings

We conduct experiments on the following two tasks in which contexts are unavailable.

  • Word similarity: The quality of each embedding is evaluated on the following six benchmarks: WordSim353 Finkelstein et al. (2001), MTurk-771 Halawi et al. (2012), MC Miller and Charles (1991), MEN Bruni et al. (2014), RG Rubenstein and Goodenough (1965), and RW Luong et al. (2013). Each dataset contains a list of word pairs with human scores. The performance is measured by the correlation between the cosine similarities of word vectors and the human scores (a minimal evaluation sketch follows this list).

  • Semantic expansion (SE): The Chinese and English entity lists in TexSmart Liu et al. (2021) are selected. Then, a word embedding trained with a selected dimension is applied to each entity to retrieve its related entities. Finally, the top five expansion results of each input entity are judged by human annotators in terms of quality and relatedness. Each result is annotated by two annotators, and a label of 2, 1, or 0 is assigned, corresponding to highly related, slightly related, and not related, respectively.
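As referenced in the word-similarity item above, here is a minimal sketch of the evaluation protocol; the embedding lookup (a plain `dict`) and the dataset format are simplifying assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_score(embeddings, pairs):
    """embeddings: dict word -> vector; pairs: list of (word1, word2, human_score)."""
    model_scores, human_scores = [], []
    for w1, w2, human in pairs:
        if w1 in embeddings and w2 in embeddings:   # skip out-of-vocabulary pairs
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(human)
    rho, _ = spearmanr(model_scores, human_scores)
    return 100 * rho   # reported as 100 x Spearman correlation, as in Table 2
```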

Results

The performances of different dimension selection methods on word similarity and semantic expansion are listed in Table 2 and Table 3, respectively. First of all, we can see that BERT Devlin et al. (2018) performs catastrophically on the context-unavailable tasks since its dynamic embeddings rely on contextual learning. Second, MPD outperforms the baselines (PIP and PCA) on the word similarity tests because the post-processing function in MPD diminishes the impact of word frequency on the oracle matrices. Finally, MPD outperforms the baselines on semantic expansion because the incorporation of the post-processing function enables the dimension selection method to neglect redundant components and capture word semantics better.

6.2 Context-available Tasks

Experiment Settings

Since it is often the case that success on context-unavailable tasks does not transfer well to context-available tasks Schnabel et al. (2015), we conduct experiments that directly compare the effectiveness of different dimension selection methods on various downstream NLP tasks with contexts. The tasks can be divided into the following two kinds:

Single-sentence tasks: (1) Text classification: the movie review (MR) Pang and Lee (2005), SST-5 Socher et al. (2013), and SST-2 Socher et al. (2013) datasets; (2) Linguistic acceptability: the CoLA dataset Warstadt et al. (2019); (3) Dialogue state tracking: the Wizard-of-Oz (WOZ) dataset Wen et al. (2017).

Sentence-pair tasks: (1) Sentence paraphrase: the MRPC Dolan and Brockett (2005) and QQP Wang et al. (2018) datasets; (2) Semantic textual similarity: the STS-B dataset Cer et al. (2017); (3) Language inference: the MNLI dataset Williams et al. (2018).

Specifically, for two of the single-sentence tasks (text classification and linguistic acceptability), a sentence is encoded with a static word embedding and the resulting embedding vector is fed to a Bi-LSTM classifier. For the remaining single-sentence task, dialogue state tracking, a deep-neural-network-based model, the Neural Belief Tracker (NBT) Mrksic and Vulic (2018), is chosen. For the three sentence-pair tasks, the two sentences are encoded independently to produce two embedding vectors $u$ and $v$, and the vector $[u;v;|u-v|]$ is fed to an MLP-based classifier. Notice that the performances on these tasks can reflect whether a selected dimension is proper for static word embedding algorithms, but such performances still lag behind those obtained with BERT. For the context-available tasks, it is not fair to compare PLMs (e.g., BERT) with static word embedding algorithms, because the former leverage contexts and the latter do not. Therefore, we only compare the four dimension selection methods (PIP-based, PCA-based, MPD-based, and grid search) in these experiments.
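A minimal sketch of the sentence-pair feature construction described above follows; the sentence encoder shown here (a simple average of word vectors) is an assumed placeholder rather than the encoder used in the paper, and only the $[u;v;|u-v|]$ concatenation follows the text.

```python
import numpy as np

def encode_sentence(tokens, embeddings, dim):
    """Placeholder encoder: average of word vectors (the paper's encoder may differ)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def pair_features(tokens_a, tokens_b, embeddings, dim):
    """Build the [u; v; |u - v|] feature vector fed to the MLP-based classifier."""
    u = encode_sentence(tokens_a, embeddings, dim)
    v = encode_sentence(tokens_b, embeddings, dim)
    return np.concatenate([u, v, np.abs(u - v)])
```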

Dataset   Word2Vec                                        GloVe
          PIP    PCA    MPD(ABTT)  MPD(CN)  Optimal       PIP    PCA    MPD(ABTT)  MPD(CN)  Optimal
MR        63.6   63.2   70.6       70.9     71.4          65.0   69.4   71.3       72.0     72.5
SST-2     80.0   79.8   81.6       81.6     81.9          86.0   88.5   90.0       89.8     90.2
SST-5     32.4   32.0   38.0       38.2     38.7          34.4   35.4   39.4       39.2     39.7
CoLA      4.3    4.4    5.6        5.5      5.6           14.4   14.3   15.6       15.6     15.8
WOZ       72.5   72.7   83.9       83.4     85.6          74.6   79.4   86.5       90.2     92.4
MRPC      79.4   79.3   80.6       80.9     80.9          80.0   80.2   81.2       81.3     81.5
QQP       53.9   53.6   56.1       56.2     56.6          60.4   60.5   62.4       62.7     62.9
STS-B     56.6   56.5   58.6       58.8     59.0          58.0   58.7   60.0       60.2     60.4
MNLI      54.4   54.0   57.0       57.7     58.0          67.9   67.6   71.5       71.4     71.6
Table 4: Performance on the context-available tasks. Rows (e.g., 'WOZ') denote the datasets; columns 'PIP', 'PCA', 'MPD(ABTT)', 'MPD(CN)', and 'Optimal' denote the PIP $k^{*}$, PCA $k^{*}$, MPD $k^{*}$ (with ABTT or CN as the post-processing function), and the empirically optimal dimension $k^{+}$, respectively.

Results

The experimental results of the six downstream NLP tasks are reported in Table 4. As shown in Table 4, the dimension selection methods using post-processing (i.e., the 'MPD(ABTT)' and 'MPD(CN)' columns) significantly outperform those without post-processing (i.e., the 'PIP' and 'PCA' columns), which shows the necessity of introducing post-processing into dimension selection. Moreover, the MPD-based dimension selection achieves performance competitive with the optimal dimensions ('Optimal'). For example, on the 'WOZ' task, compared to PIP, the accuracy based on MPD(ABTT) increases from 72.5% to 83.9%, which is close to the optimal performance (85.6%). This means that MPD-based dimension selection works effectively for downstream tasks with contexts.

Furthermore, as illustrated in Table 4, the performance of MPD changes with the post-processing method used. Except for the 'WOZ' task, MPD(CN) usually performs on a par with MPD(ABTT), indicating that the choice of post-processing function does not greatly affect the MPD-based dimension selection.

6.3 Efficiency-performance Trade-off

In this part, we compare the efficiency-performance trade-off of grid search and the three dimension selection methods (PIP-based, PCA-based, and MPD-based) using Word2Vec, where efficiency is measured by computation time and performance is evaluated by the average per-task performance. Ideally, a method should perform well while remaining computationally efficient. The efficiency of grid search depends on the grid granularity, because a coarser grid search saves time at the cost of performance loss. Thus, in this experiment, we set the dimension range to $[50,1000]$ and use three grid searches with increments of 2, 5, and 10, denoted as 'GS-2', 'GS-5', and 'GS-10', respectively. The efficiency-performance results are listed in Table 5.

In Table 5, among the three grid searches, 'GS-10' performs most closely to MPD but takes 14.01x the time. Although 'GS-2' and 'GS-5' achieve slightly better performances than MPD, they take 56.04x and 28.02x the computational cost, which is a poor efficiency-performance trade-off. On the other hand, compared to PCA, MPD achieves a 1.7% performance improvement while costing less than a tenth of PCA's computation time (Table 5). Compared to PIP, MPD has a similar computation cost but achieves a 1.3% performance increase. Therefore, the MPD-based method achieves the best efficiency-performance trade-off.

Criterion   Performance   Computation time
GS-2        59.5          56.04x
GS-5        59.3          28.02x
GS-10       59.0          14.01x
PIP         57.8          0.98x
PCA         57.3          10.85x
MPD         59.1          1.00x
Table 5: Comparison of efficiency-performance trade-offs. The whole MPD-based dimension selection procedure takes 6.3 min on one RTX2080 device. For example, '10.85x' means 10.85 times the computational cost of MPD.
            WS353                                  MT771
            MPD    Prim-D  Post-D  Optimal         MPD    Prim-D  Post-D  Optimal
GloVe       0.66   0.62    0.65    0.69            0.56   0.55    0.55    0.58
Word2Vec    0.65   0.60    0.64    0.68            0.55   0.54    0.55    0.58
Table 6: Ablation study on the context-unavailable tasks using GloVe and Word2Vec. Bold numbers in the original table indicate the best performance except 'Optimal'.
Dataset   ABTT                                     CN
          MPD    Prim-D  Post-D  Optimal           MPD    Prim-D  Post-D  Optimal
MR        71.3   70.3    70.8    72.0              72.0   70.5    70.2    72.5
SST-2     90.0   89.2    89.6    90.2              89.8   89.1    88.9    90.2
SST-5     39.4   38.5    38.9    39.6              39.2   38.4    38.6    39.7
CoLA      15.6   15.0    15.4    15.7              15.6   14.9    15.1    15.8
WOZ       86.1   84.5    85.5    89.8              90.2   88.2    88.0    92.4
MRPC      81.2   80.4    80.9    81.4              81.3   80.3    80.5    81.5
QQP       62.4   61.5    62.0    62.6              62.7   62.0    61.9    62.9
STS-B     60.0   59.2    59.8    60.2              60.2   59.6    59.7    60.4
MNLI      71.5   69.3    70.6    71.6              71.4   69.6    69.9    71.6
Table 7: Ablation analysis on the context-available tasks using GloVe.

7 Ablation

In the ablation studies, we principally answer two questions: (1) Is the distance combination in MPD effective? (2) Is the post-processing function effective in the case of extreme word frequency?

7.1 Is the distance combination in MPD effective?

To further explore the effectiveness of combining the two distances (the primitive relative distance $d_{r}$ and the post relative distance $d_{p}$) in MPD, we add two extra dimension selection criteria: (1) Prim-D, which uses $d_{r}$ alone; (2) Post-D, which uses $d_{p}$ alone. Then, we compare the performances of the three dimension selection methods: MPD-based, Prim-D-based, and Post-D-based.

Table 6 and Table 7 show the performance of the different selection methods on the context-unavailable and context-available tasks, respectively. As we can see, MPD obtains better results than the other two criteria, validating the effectiveness of MPD, which combines $d_{r}$ and $d_{p}$ through a geometric mean. The findings indicate that either completely preserving or completely eliminating the impact of word frequency is harmful to word embeddings, so a neutralized combination is appropriate. Although there may be other combinations of $d_{p}$ and $d_{r}$ with better performance, the choice of combination method is not the focus of this paper.

7.2 Is the post-processing function effective in the case of extreme word frequency?

For the training corpus with extreme word frequency, the performances of the different dimension selection methods are shown in Table 8. As we can see, the methods with post-processing functions (i.e., MPD and Post-D) perform better than or on a par with those without post-processing functions (i.e., PIP, PCA, and Prim-D), demonstrating that the post-processing functions are effective in the case of extreme word frequency.

Dataset      Criterion          0      3%     5%     10%
SST-2        $k^{+}$ (Optimal)  81.9   81.4   81.5   81.8
             $k^{*}$ (PIP)      80.0   79.5   79.2   78.7
             $k^{*}$ (PCA)      79.8   79.5   79.0   78.5
             $k^{*}$ (Prim-D)   80.2   80.4   80.0   80.9
             $k^{*}$ (Post-D)   80.6   80.9   79.9   80.5
             $k^{*}$ (MPD)      81.6   81.0   81.2   81.3
WordSim353   $k^{+}$ (Optimal)  69.1   69.2   69.1   69.1
             $k^{*}$ (PIP)      65.5   64.4   63.9   63.5
             $k^{*}$ (PCA)      64.4   64.0   63.5   63.2
             $k^{*}$ (Prim-D)   62.3   62.2   61.2   61.0
             $k^{*}$ (Post-D)   65.1   65.1   65.0   64.9
             $k^{*}$ (MPD)      66.9   66.6   66.5   66.5
Table 8: Ablation analysis on the two tasks with extreme word frequency. The experimental settings are the same as in Table 1.

8 Conclusion

In this paper, we investigate an overlooked problem, namely whether word frequency influences automatic dimension selection for static word embedding algorithms, and find that extreme word frequency can degrade existing dimension selection methods. Based on this empirical finding, we propose MPD, a novel metric that combines the primitive relative distance $d_{r}$ and the post relative distance $d_{p}$. Through $d_{p}$, post-processing functions are introduced into the dimension selection. We then develop an MPD-based dimension selection method that automatically selects a proper dimension for static embedding algorithms. Extensive experiments on context-unavailable and context-available tasks show that our MPD-based dimension selection achieves better performance than baselines with a good efficiency-performance trade-off.

References

  • Artetxe et al. (2016) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294.
  • Bruni et al. (2014) Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of artificial intelligence research, 49:1–47.
  • Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation.
  • Chung and Glass (2018) Yu-An Chung and James Glass. 2018. Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. arXiv preprint arXiv:1803.08976.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dolan and Brockett (2005) Bill Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005).
  • Finkelstein et al. (2001) Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414.
  • Gong et al. (2018) Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2018. Frage: Frequency-agnostic word representation. Advances in neural information processing systems, 31.
  • Guo et al. (2019) Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. 2019. Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3723–3730.
  • Halawi et al. (2012) Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. 2012. Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1406–1414.
  • Han et al. (2020) Jialong Han, Aixin Sun, Haisong Zhang, Chenliang Li, and Shuming Shi. 2020. Case: Context-aware semantic expansion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7871–7878.
  • Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pages 2177–2185.
  • Levy et al. (2015) Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.
  • Liu et al. (2021) Lemao Liu, Haisong Zhang, Haiyun Jiang, Yangming Li, Enbo Zhao, Kun Xu, Linfeng Song, Suncong Zheng, Botong Zhou, Dick Zhu, et al. 2021. Texsmart: A system for enhanced natural language understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 1–10.
  • Liu et al. (2019) Tianlin Liu, Lyle Ungar, and Joao Sedoc. 2019. Unsupervised post-processing of word vectors via conceptor negation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6778–6785.
  • Luong et al. (2013) Minh-Thang Luong, Richard Socher, and Christopher D Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the seventeenth conference on computational natural language learning, pages 104–113.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Miller and Charles (1991) George A Miller and Walter G Charles. 1991. Contextual correlates of semantic similarity. Language and cognitive processes, 6(1):1–28.
  • Mrksic and Vulic (2018) Nikola Mrksic and Ivan Vulic. 2018. Fully statistical neural belief tracking. In ACL (2).
  • Mu and Viswanath (2018) Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations.
  • Palangi et al. (2016) Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. 2016. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(4):694–707.
  • Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 115–124.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
  • Rubenstein and Goodenough (1965) Herbert Rubenstein and John B Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.
  • Schnabel et al. (2015) Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 298–307.
  • Smith et al. (2017) Samuel L Smith, David HP Turban, Steven Hamblin, and Nils Y Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.
  • Socher et al. (2013) Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pages 926–934. Citeseer.
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.
  • Wang (2019) Yu Wang. 2019. Single training dimension selection for word embedding with pca. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3597–3602.
  • Warstadt et al. (2019) Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.
  • Wen et al. (2017) Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gasic, Lina M Rojas Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449.
  • Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122.
  • Yin and Shen (2018) Zi Yin and Yuanyuan Shen. 2018. On the dimensionality of word embedding. In Advances in Neural Information Processing Systems, pages 887–898.
  • Zhang et al. (2019) Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 52(1):1–38.

Appendix A Proof of Theorem 1

Proof 1

Let $A=UDV^{T}$ and $B=X\Lambda Y^{T}$ be the SVDs of $A$ and $B$, where $D=\operatorname{diag}(d_{i})$, $\Lambda=\operatorname{diag}(\lambda_{j})$, and $d_{i}$ and $\lambda_{j}$ are the singular values of $A$ and $B$, respectively. If we can prove $U\approx X$ and $D\approx\Lambda$, then $B\approx X\Lambda Y^{T}\approx UDY^{T}=UDV^{T}(VY^{T})=AT$, where $T=VY^{T}$ is a unitary matrix, which completes the proof. The argument proceeds as follows.

Firstly, the primitive relative distance $d_{r}$ defined in Eq. 2 is reformulated as Eq. 4:

$d_{r}(A,B)=\dfrac{\|AA^{T}-BB^{T}\|^{2}}{C}=\dfrac{\|UD^{2}U^{T}-X\Lambda^{2}X^{T}\|^{2}}{C}$   (4)

where $C=\|AA^{T}\|\,\|BB^{T}\|$ is the scaling term. If $d_{r}(A,B)\approx 0$, then $\|AA^{T}-BB^{T}\|\approx 0$.

Secondly, let $x_{1}$ be the first column of $X$ (i.e., the singular vector corresponding to the largest singular value $\lambda_{1}$), and suppose $\lambda_{1}\geq d_{1}$. According to Eqs. 5-7, $0\leq\lambda_{1}^{2}-d_{1}^{2}\leq\|AA^{T}-BB^{T}\|\approx 0$. Then, $d_{1}\approx\lambda_{1}$ and $u_{1}\approx x_{1}$.

$\left\|BB^{T}x_{1}\right\|-\left\|AA^{T}x_{1}\right\|\leq\left\|\left(AA^{T}-BB^{T}\right)x_{1}\right\|\leq\left\|AA^{T}-BB^{T}\right\|_{op}\leq\left\|AA^{T}-BB^{T}\right\|$   (5)

where $AA^{T}-BB^{T}$ is considered as an operator.

$\left\|BB^{T}x_{1}\right\|=\left\|X\Lambda^{2}X^{T}x_{1}\right\|=\lambda_{1}^{2}$   (6)
$\left\|AA^{T}x_{1}\right\|=\left\|UD^{2}U^{T}x_{1}\right\|\leq d_{1}^{2}$   (7)

The same argument can be applied to the other singular vectors of $X$ to recover $U$ and $D$. Because $U\approx X$ and $D\approx\Lambda$, the unitary matrix $T=VY^{T}$ is obtained, which completes the proof.

Appendix B Limitation

Since our MPD is specific to static word embeddings, it cannot be applied to pre-trained language models like BERT. Notice that BERT's 'dimension' refers to the hidden state size (768 in the base model). To investigate the relationship between dimension and BERT's performance, BERT would need to be re-trained whenever the hidden state size changes, which is extremely difficult since re-training BERT requires massive computational resources, whereas re-training GloVe/Word2Vec can be done easily. Besides, there are differences among NLP tasks, but our dimension selection method lacks task-specific designs, which we will explore in the future.