Frequency-aware Dimension Selection for Static Word Embedding by Mixed Product Distance
Abstract
Static word embeddings are still useful, particularly for tasks where no context is available, because in such cases pre-trained language models often perform worse than static word embeddings. Although dimension is a key factor determining the quality of static word embeddings, automatic dimension selection is rarely discussed. In this paper, we investigate the impact of word frequency on dimension selection, and empirically find that word frequency is so influential that it needs to be taken into account during dimension selection. Based on this empirical finding, we propose a dimension selection method that uses a metric, the Mixed Product Distance (MPD), to select a proper dimension for word embedding algorithms without training any word embedding. By applying a post-processing function to oracle matrices, the MPD-based method de-emphasizes the impact of word frequency. Experiments on both context-unavailable and context-available tasks demonstrate the better efficiency-performance trade-off of our MPD-based dimension selection method over baselines.
1 Introduction
Word embedding has been widely used in numerous NLP tasks, such as recommendation systems Zhang et al. (2019), text classification Chung and Glass (2018), information retrieval Palangi et al. (2016), and machine translation Guo et al. (2019). Although pre-trained language models (PLMs) like BERT Devlin et al. (2018) have become the prevailing paradigm in NLP, static word embeddings, such as Word2Vec Mikolov et al. (2013) and GloVe Pennington et al. (2014), are still useful in scenarios where contexts are missing, such as entity retrieval Liu et al. (2021); Han et al. (2020) and word similarity tasks Halawi et al. (2012); Bruni et al. (2014). Under such scenarios, static word embeddings often perform better than PLMs.
Although there are many algorithms for training static word embeddings, the impact of dimension on word embedding quality has received less attention. Yin and Shen (2018) showed that a critical hyper-parameter, the dimension of the word vectors, significantly influences the performance of models built on word embeddings. An extremely low-dimensional word embedding is probably not expressive enough to represent the meanings of words; in contrast, a high-dimensional word embedding is expressive but may lead to over-fitting and high computational cost. For most NLP models built on word embeddings, the dimension is selected ad hoc or by grid search, which results in sub-optimal performance or high computational cost. Thus, it is promising to establish a dimension selection criterion that can automatically select a proper dimension for static word embedding algorithms with a good efficiency-performance trade-off.
Yin and Shen (2018) were the first to investigate dimension selection for static word embedding. They proposed a metric called Pairwise Inner Product (PIP) loss, which reflects the relationship between the dimension and the quality of a word embedding, and then developed a PIP-based dimension selection method that selects the dimension with the minimal PIP loss. Later, Wang (2019) proposed a dimension selection method based on Principal Component Analysis (PCA). The PCA-based method first trains a high-dimensional word embedding and then uses it to conduct a grid search for a proper dimension, which is time-consuming. However, both the PIP-based and the PCA-based dimension selection overlook one dominant factor that may influence dimension selection: word frequency.
Thus, in this paper, we first investigate whether word frequency influences the dimension selection of static word embedding. Our results reveal that extreme word frequency can lead the PIP-based and PCA-based methods to select improper dimensions. That is, without diminishing the impact of word frequency, a dimension selection method is more likely to select an improper dimension and thus produce a low-quality word embedding. We then explore a way to diminish this impact. Inspired by the success of post-processing methods for static word embeddings Mu and Viswanath (2018); Liu et al. (2019), which de-emphasize the influence of certain words in an embedding, we transfer them to dimension selection.
Specifically, we propose an MPD-based dimension selection, which minimizes a metric called the Mixed Product Distance (MPD) to find a proper dimension. MPD reduces the impact of word frequency by applying a post-processing function to the oracle matrices, which results in a more proper dimension. To investigate the effectiveness of our method, we conduct evaluations on context-unavailable and context-available tasks. Comprehensive experiments demonstrate that the MPD-based dimension selection method consistently performs better than the baselines.
Above all, the main contributions of the paper can be outlined as follows:
• We expose a critical issue overlooked by previous dimension selection methods: word frequency. It can influence the dimension selection and result in degraded word embeddings.
• We design a metric called Mixed Product Distance (MPD) for static word embedding dimension selection, which uses post-processing functions to overcome the word frequency problem.
• We evaluate the MPD-based dimension selection on various NLP tasks, including context-unavailable and context-available tasks. Comprehensive experiments demonstrate that our method achieves a better efficiency-performance trade-off.
2 Related Work
2.1 Static Word Embedding and Dimension Selection
Existing static word embedding algorithms can be formulated as either explicit or implicit matrix factorization, and the most popular word embedding algorithms can be regarded as implicit matrix factorization. For example, Word2Vec and GloVe perform an implicit factorization of a symmetric signal matrix $M$, where $M$ is the shifted PPMI matrix for Word2Vec and the log-count matrix for GloVe. Although intensive studies have been done on static word embedding algorithms that learn word embeddings via matrix factorization, selecting a proper dimension for these algorithms is less investigated. Yin and Shen (2018) proposed a PIP-based dimension selection method: for every possible dimension, its PIP loss is computed, and the dimension with the minimal PIP loss is selected. Notably, the PIP-based method can be conducted without training any word embedding. In contrast, Wang (2019) proposed a PCA-based method for dimension selection, which trains a high-dimensional word embedding to support the dimension selection, and is therefore extremely time-consuming.
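To make the matrix-factorization view concrete, the following is a minimal sketch that builds a shifted positive PMI (PPMI) signal matrix from a word-word co-occurrence count matrix, the matrix that skip-gram with negative sampling implicitly factorizes according to Levy and Goldberg (2014). It is an illustration rather than released code; the dense-matrix layout and the default shift `k` are assumptions.

```python
import numpy as np

def shifted_ppmi(counts: np.ndarray, k: float = 5.0) -> np.ndarray:
    """Shifted PPMI signal matrix from a (vocab x vocab) co-occurrence count matrix.

    counts[i, j]: number of times word j appears in the context window of word i.
    k: negative-sampling shift; log(k) is subtracted before clipping at zero.
    """
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)          # word marginal counts
    col = counts.sum(axis=0, keepdims=True)          # context marginal counts
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))   # pointwise mutual information
    pmi[~np.isfinite(pmi)] = -np.inf                 # cells with zero counts carry no signal
    return np.maximum(pmi - np.log(k), 0.0)          # shift by log(k), keep positive part
```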
2.2 Post-processing for Word Embedding
To further improve the quality of a word embedding, post-processing, which reinforces linguistic constraints on the embedding, is often used. According to studies on post-processing Mu and Viswanath (2018); Liu et al. (2019), one of the important linguistic constraints is to diminish the influence of word frequency. Specifically, in a word embedding, the dominant principal components (PCs), which mainly encode frequent words, are shared by all words. As a result, the vectors of less frequent words have a higher variance than those of frequent words, and such high variance hurts embedding quality because word frequency is unrelated to lexical semantics. Two popular post-processing functions, All-but-the-top (ABTT) Mu and Viswanath (2018) and Conceptor Negation (CN) Liu et al. (2019), have been used to remove or suppress these PCs in word embeddings. In our method, instead of the word embedding, these post-processing functions are applied to the oracle matrices of word embedding algorithms.
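As a concrete reference for this frequency-removing post-processing, here is a small NumPy rendering of the ABTT procedure: centre the vectors and remove their projections onto the top principal components, the directions dominated by frequency information. The default number of removed components below is only illustrative.

```python
import numpy as np

def all_but_the_top(E: np.ndarray, n_components: int = 3) -> np.ndarray:
    """All-but-the-top post-processing (Mu and Viswanath, 2018).

    E: (vocab_size, dim) embedding or oracle matrix.
    n_components: number of dominant principal components to remove
                  (Mu and Viswanath suggest roughly dim / 100).
    """
    centered = E - E.mean(axis=0, keepdims=True)
    # Principal directions of the centred matrix; the top few mostly encode frequency.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]                              # (n_components, dim)
    # Subtract each vector's projection onto the dominant directions.
    return centered - centered @ top.T @ top
```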
3 Preliminaries
This section presents the basic procedure for the dimension selection of word embedding. A static word embedding algorithm is chosen to obtain a word embedding matrix $E \in \mathbb{R}^{n \times d}$, where $E$ is composed of $n$ row vectors and word $w_i$ is represented by the $d$-dimensional vector $v_i$, $1 \le i \le n$. Generally, dimension selection aims to find a proper dimension $d$ with a good performance-efficiency trade-off.
There are two existing dimension selection methods: PIP-based and PCA-based. The PIP-based method can select the dimension directly, without training any word embedding, while the PCA-based method needs to train a word embedding before dimension selection, which leads to much higher computational cost. Hence, the PCA-based method has a worse performance-efficiency trade-off than the PIP-based one. In this paper, we work on dimension selection without training any word embedding.
Following previous works Levy and Goldberg (2014), we leverage the oracle matrix for dimension selection, which can be obtained through matrix decomposition without any training process. Formally, for a static word embedding algorithm with two hyper-parameters (the exponent $\alpha$ and the dimension $d$) and a signal matrix $M$ (i.e., the co-occurrence matrix), an oracle matrix $E$ is derived from $M$ with the form $E = U_{:,1:d} D_{1:d,1:d}^{\alpha}$, where $M = UDV^{\top}$ is the SVD of $M$ obtained by matrix factorization Levy et al. (2015).
To implement the dimension selection, we divide the training corpus into two equal partitions $C_1$ and $C_2$, and obtain the co-occurrence matrices $M_1$ and $M_2$ on each partition, respectively. Correspondingly, we have two oracle matrices $E_1$ and $E_2$. Then, a common way to select a proper dimension is used: define a proper distance $\mathcal{D}$, and find the dimension $d^*$ that minimizes $\mathcal{D}(E_1, E_2)$, where both oracle matrices are derived with the candidate dimension. Notice that a distance $\mathcal{D}$ that guarantees the unitary-invariance of the two oracle matrices $E_1$ and $E_2$ may serve as a proper distance for dimension selection Yin and Shen (2018), where unitary-invariance means that two matrices are essentially identical if one can be transformed into the other by a unitary operation (e.g., a simple rotation); unitary-invariance thus defines the equivalence class of the oracle matrices.
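The oracle matrix can be computed directly from a signal matrix with one SVD, as in the minimal sketch below, which follows the $E = U_{:,1:d} D_{1:d,1:d}^{\alpha}$ form stated above. The default $\alpha = 0.5$ is a common choice rather than a value prescribed here, and building the co-occurrence matrices $M_1$ and $M_2$ from the two corpus halves is left out.

```python
import numpy as np

def oracle_matrix(M: np.ndarray, d: int, alpha: float = 0.5) -> np.ndarray:
    """Rank-d oracle embedding E = U[:, :d] * D[:d]**alpha from the SVD M = U D V^T.

    M: signal (co-occurrence) matrix computed on one corpus partition.
    d: candidate embedding dimension.
    alpha: exponent applied to the singular values.
    """
    U, s, _ = np.linalg.svd(M, full_matrices=False)   # singular values in descending order
    return U[:, :d] * (s[:d] ** alpha)                # scale each kept column by its singular value
```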
4 Is word frequency influential in dimension selection?
Although previous works Mu and Viswanath (2018); Gong et al. (2018) have pointed out that word frequency may bring negative consequences to word embeddings, the impact of word frequency on dimension selection has been overlooked by previous studies on dimension selection Yin and Shen (2018); Wang (2019). Therefore, we explore a question in this section: does word frequency influence dimension selection for word embeddings?
We first design an experiment with extreme word frequency on two tasks: a context-unavailable task (the WordSim353 dataset, Finkelstein et al., 2001) and a context-available task (the SST-2 dataset, Socher et al., 2013). In the experiment, the following two kinds of dimensions are used: the criterion-selected dimension and the empirically optimal dimension. Generally, the word embedding trained with the empirically optimal dimension consistently outperforms the one trained with the criterion-selected dimension, and the performance gap between them reflects the effectiveness of the dimension selection method.
• Criterion-selected dimension: Given a dimension selection method, the dimension with the minimal distance is selected as the criterion-selected dimension.
• Empirically optimal dimension (Optimal): Given a word embedding algorithm, a word embedding is trained for each dimension in a grid search (i.e., the dimension ranges from 50 to 1000 with an increment step of 2). Then, each word embedding is evaluated on the benchmark, and the dimension achieving the highest performance is selected as the empirically optimal dimension.
Table 1: Results of the empirically optimal dimension and the criterion-selected dimensions (PIP, PCA) on SST-2 and WordSim353 as extreme word frequency is injected into the training corpus.

Dataset | Criterion | 0 | 3% | 5% | 10%
---|---|---|---|---|---
SST-2 | Optimal | 81.9 | 81.4 | 81.5 | 81.8
SST-2 | PIP | 80.0 | 79.5 | 79.2 | 78.7
SST-2 | PCA | 79.8 | 79.5 | 79.0 | 78.5
WordSim353 | Optimal | 69.1 | 69.2 | 69.1 | 69.1
WordSim353 | PIP | 65.5 | 64.4 | 63.9 | 63.5
WordSim353 | PCA | 64.4 | 64.0 | 63.5 | 63.2
The training corpus is the Gigaword 5 corpus. To obtain varying word frequencies, we randomly select a sentence (whose length is larger than 20, so that the duplication will not affect the overall fluency of the corpus given that the Word2Vec window size is 5), duplicate it to different extents (the 3%, 5%, and 10% settings in Table 1), and then merge the duplicates with the corpus. The results are listed in Table 1. As word frequency becomes more extreme, the performance gap between the criterion-selected dimension and the empirically optimal dimension enlarges, indicating that extreme word frequency significantly hurts the effectiveness of existing dimension selection methods. Thus, it is necessary to diminish the influence of word frequency during dimension selection.
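For reference, the corpus manipulation above can be sketched as follows; the function and the exact duplication counts are illustrative, since the injected amount is reported only via the 3%, 5%, and 10% settings.

```python
import random

def inject_extreme_frequency(sentences, n_copies, min_len=20, seed=0):
    """Duplicate one randomly chosen long sentence n_copies times and append
    the copies to the corpus, creating artificially extreme word frequencies."""
    rng = random.Random(seed)
    long_sentences = [s for s in sentences if len(s.split()) > min_len]
    chosen = rng.choice(long_sentences)
    return sentences + [chosen] * n_copies
```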
5 Methodology
5.1 Overview
The formulation of our proposed distance, the Mixed Product Distance (MPD), is defined as follows:

$\mathrm{MPD}(E_1, E_2) = \sqrt{\mathcal{D}_{prim}(E_1, E_2) \cdot \mathcal{D}_{post}(E_1, E_2)}$    (1)

where $E_1$ and $E_2$ correspond to the two oracle matrices mentioned in Sec 3. Concretely, MPD consists of two distances: the primitive relative distance $\mathcal{D}_{prim}$ and the post relative distance $\mathcal{D}_{post}$. Based on this distance criterion, an MPD-based dimension selection method is developed to select a proper dimension for static embedding algorithms.
5.2 Primitive Relative Distance
Definition 1 gives the definition of the primitive relative distance $\mathcal{D}_{prim}$, and such a definition can guarantee the unitary-invariance of $E_1$ and $E_2$. Moreover, unless otherwise stated, the matrix norms used in this paper are Frobenius norms.
Definition 1
The primitive relative distance between the oracle matrices $E_1$ and $E_2$ is defined as follows:

$\mathcal{D}_{prim}(E_1, E_2) = \dfrac{\|E_1 E_1^{\top} - E_2 E_2^{\top}\|}{\|E_1 E_1^{\top}\| + \|E_2 E_2^{\top}\|}$    (2)

where the denominator is a scaling term.
Then, the relationship between $\mathcal{D}_{prim}$ and unitary-invariance is presented as follows. According to Theorem 1, if $\mathcal{D}_{prim}$ is close to 0, the two oracle matrices $E_1$ and $E_2$ are nearly unitary-invariant. In other words, when the numerator in Eq 2 is 0, $E_1$ and $E_2$ can be regarded as identical Smith et al. (2017); Artetxe et al. (2016). Therefore, the definition of $\mathcal{D}_{prim}$ guarantees that when $\mathcal{D}_{prim}$ is close to 0, the unitary-invariance of $E_1$ and $E_2$ holds. Thus, the dimension that minimizes $\mathcal{D}_{prim}$ is a proper dimension for word embedding, as presented in Sec 3.
Theorem 1
Suppose $\mathcal{D}_{prim}(E_1, E_2) = 0$; then $E_1$ and $E_2$ are unitary-invariant (i.e., there exists a unitary matrix $T$ such that $E_1 = E_2 T$).
The proof is deferred to Appendix A.
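A minimal NumPy sketch of the primitive relative distance follows, assuming the scaled pairwise-product form of Eq. 2 above (the exact scaling term is an assumption):

```python
import numpy as np

def primitive_relative_distance(E1: np.ndarray, E2: np.ndarray) -> float:
    """Primitive relative distance D_prim (Eq. 2): Frobenius distance between the
    pairwise-product matrices E E^T, scaled by the sum of their norms."""
    P1, P2 = E1 @ E1.T, E2 @ E2.T
    return float(np.linalg.norm(P1 - P2) / (np.linalg.norm(P1) + np.linalg.norm(P2)))
```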
5.3 Post Relative Distance
In order to overcome the word frequency problem, the post relative distance $\mathcal{D}_{post}$ is defined, as shown in Definition 2, where $f$ is a post-processing function applied to the oracle matrices. Notice that the post-processing function in $\mathcal{D}_{post}$ can be any post-processing method that has been used on word embeddings.
Definition 2
Given a post-processing function $f$, the post relative distance between the oracle matrices $E_1$ and $E_2$ is defined as follows:

$\mathcal{D}_{post}(E_1, E_2) = \mathcal{D}_{prim}(\tilde{E}_1, \tilde{E}_2)$    (3)

where $\tilde{E}_1 = f(E_1)$ and $\tilde{E}_2 = f(E_2)$.
Specifically, we transfer the success of post-processing functions from word embedding matrices to oracle matrices. Post-processing functions essentially 'normalize' a word embedding by de-emphasizing the impact of some words. Since the oracle matrices $E_1$ and $E_2$ are also influenced by word frequency, we believe that applying a post-processing function to them can likewise diminish this influence.
Moreover, the post relative distance is essentially the same as the primitive relative distance, except for the use of the post-processing function $f$. Similar to the discussion of $\mathcal{D}_{prim}$, $\mathcal{D}_{post}$ also possesses unitary-invariance.
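Given any post-processing function $f$ (for example, the `all_but_the_top` sketch from Sec 2.2), the post relative distance of Definition 2 reduces to the primitive relative distance between the post-processed oracle matrices:

```python
def post_relative_distance(E1, E2, f=all_but_the_top):
    """Post relative distance D_post (Eq. 3): D_prim applied to f(E1) and f(E2).
    Reuses primitive_relative_distance and all_but_the_top from the earlier sketches."""
    return primitive_relative_distance(f(E1), f(E2))
```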
5.4 Combination
It is also worth mentioning that eliminating the impact of word frequency is not always beneficial for word embedding, because some crucial information in a word embedding is inevitably eliminated as well Gong et al. (2018). So, we combine $\mathcal{D}_{prim}$ and $\mathcal{D}_{post}$ with a geometric mean in MPD, as shown in Eq 1.
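Putting the pieces together, a sketch of the MPD criterion and the resulting dimension selection loop is given below. It reuses the helper functions from the previous sketches, and recomputing the SVD for every candidate dimension is done only for clarity; in practice the SVD can be computed once and truncated per dimension.

```python
def mpd(E1, E2, f=all_but_the_top):
    """Mixed Product Distance (Eq. 1): geometric mean of D_prim and D_post."""
    return (primitive_relative_distance(E1, E2)
            * post_relative_distance(E1, E2, f)) ** 0.5

def select_dimension(M1, M2, candidate_dims, alpha=0.5, f=all_but_the_top):
    """Return the candidate dimension whose oracle matrices (one per corpus half)
    minimize MPD; no word embedding is trained at any point."""
    scores = {d: mpd(oracle_matrix(M1, d, alpha), oracle_matrix(M2, d, alpha), f)
              for d in candidate_dims}
    return min(scores, key=scores.get)
```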
Table 2: Results of different dimension selection criteria on the word similarity benchmarks.

Embedding | Criterion | WS | MTurk | MC | MEN | RG | RW
---|---|---|---|---|---|---|---
BERT | None | 41.15 | 32.89 | 49.13 | 41.23 | 34.66 | 13.15
GloVe | Optimal | 69.12 | 58.78 | 70.20 | 73.75 | 76.95 | 33.55
GloVe | PIP | 65.48 | 55.48 | 67.19 | 71.56 | 74.32 | 31.49
GloVe | PCA | 64.44 | 55.23 | 66.82 | 71.09 | 73.56 | 31.67
GloVe | MPD | 66.89 | 57.02 | 68.66 | 72.92 | 75.23 | 32.91
Word2Vec | Optimal | 68.03 | 58.12 | 63.96 | 70.27 | 70.01 | 25.43
Word2Vec | PIP | 64.68 | 55.63 | 62.19 | 68.45 | 68.76 | 23.12
Word2Vec | PCA | 64.56 | 55.91 | 62.03 | 68.70 | 69.04 | 22.41
Word2Vec | MPD | 67.14 | 57.11 | 62.88 | 69.79 | 69.54 | 24.64
6 Experiment
This section carries out three sets of experiments to comprehensively analyze our MPD-based dimension selection. The first two sets of experiments evaluate the effectiveness of our MPD-based dimension selection on context-unavailable and context-available tasks, respectively. The third set of experiments compares the efficiency-performance trade-off of different dimension selection methods.
Specifically, two popular static word embedding algorithms, Word2Vec (the CBOW version) and GloVe, are chosen to train word embeddings, and both use Gigaword 5 as the training corpus. For the MPD-based dimension selection method, two commonly-used post-processing methods (CN and ABTT) are used, with CN as the default. Moreover, for each experiment, a criterion-selected dimension and an empirically optimal dimension are computed, whose definitions can be found in Sec 4.
Practical Settings
Considering that SVD on large matrices needs extremely large memory, we reduce the vocabulary size of the word embedding to 25,000, which can be handled by a 16GB-memory device. Moreover, previous work Yin and Shen (2018) has shown that as long as the vocabulary size is beyond a threshold (i.e., 10,000), the performance does not degrade much even when the vocabulary is reduced.
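When memory is limited, an alternative to capping the vocabulary is to compute only the leading singular triplets with a sparse truncated SVD; the sketch below shows this option, which is not the setup used in our experiments (those simply cap the vocabulary and use a full SVD).

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def truncated_oracle(M_sparse: csr_matrix, d: int, alpha: float = 0.5) -> np.ndarray:
    """Rank-d oracle matrix from a sparse co-occurrence matrix via truncated SVD,
    avoiding the dense full decomposition that a large vocabulary would require."""
    U, s, _ = svds(M_sparse.asfptype(), k=d)
    order = np.argsort(-s)            # svds returns singular values in ascending order
    return U[:, order] * (s[order] ** alpha)
```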
6.1 Context-unavailable Tasks
Table 3: Average human scores on the semantic expansion task.

Dataset | Embedding | Human Score
---|---|---
Chinese | BERT | 0.79
Chinese | PIP | 1.12
Chinese | PCA | 1.08
Chinese | MPD | 1.38
English | BERT | 0.68
English | PIP | 1.12
English | PCA | 1.05
English | MPD | 1.29
Experiment Settings
We conduct experiments on the following two tasks in which contexts are unavailable.
• Word similarity: The quality of each embedding is evaluated on the following six benchmarks: WordSim353 Finkelstein et al. (2001), MTurk-771 Halawi et al. (2012), MC Miller and Charles (1991), MEN Bruni et al. (2014), RG Rubenstein and Goodenough (1965), and RW Luong et al. (2013). Each dataset contains a list of word pairs with human scores. The performance is measured by the correlation between the cosine similarities of word vectors and the human scores (a sketch of this evaluation is given after this list).
• Semantic expansion (SE): The Chinese and English entity lists in TexSmart Liu et al. (2021) are selected. Then, a word embedding trained with a selected dimension is applied to each entity to retrieve its related entities. Finally, the top five expansion results for each input entity are judged by human annotators in terms of quality and relatedness. Each result is annotated by two annotators, and a label of 2, 1, or 0 is assigned, corresponding to highly related, slightly related, and not related, respectively.
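Both evaluations reduce to simple vector arithmetic over the trained embedding: cosine similarities correlated with human scores for word similarity, and top-k cosine neighbours for semantic expansion. The sketch below assumes Spearman correlation, the usual choice for these benchmarks, and a plain Python dict from word to vector.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_score(emb: dict, pairs) -> float:
    """emb: word -> vector; pairs: iterable of (w1, w2, human_score).
    Returns the Spearman correlation between model and human similarities."""
    model, human = [], []
    for w1, w2, score in pairs:
        if w1 in emb and w2 in emb:                  # skip out-of-vocabulary pairs
            model.append(cosine(emb[w1], emb[w2]))
            human.append(score)
    rho, _ = spearmanr(model, human)
    return rho

def top_k_related(emb: dict, query: str, k: int = 5):
    """Nearest neighbours by cosine similarity, as used for semantic expansion."""
    sims = {w: cosine(emb[query], v) for w, v in emb.items() if w != query}
    return sorted(sims, key=sims.get, reverse=True)[:k]
```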
Results
The performances of different dimension selection methods on word similarity and semantic expansion are listed in Table 2 and Table 3, respectively. First of all, BERT Devlin et al. (2018) performs poorly on context-unavailable tasks, since its embeddings are generated from contexts, which are absent here. Then, MPD outperforms the baselines (PIP and PCA) on the word similarity tests because the post-processing function in MPD can diminish the impact of word frequency on the oracle matrices. Finally, MPD outperforms the baselines on semantic expansion for the same reason: the incorporation of the post-processing function enables the dimension selection method to neglect redundant components and capture word semantics better.
6.2 Context-available Tasks
Experiment Settings
Since it is often the case that success on context-unavailable tasks does not transfer well to context-available tasks Schnabel et al. (2015), we also conduct experiments that directly compare the effectiveness of different dimension selection methods on various downstream NLP tasks with contexts. The tasks fall into the following two kinds:
Single-sentence tasks: (1) Text classification: the movie review (MR) Pang and Lee (2005), SST-5, and SST-2 datasets Socher et al. (2013); (2) Linguistic acceptability: the CoLA dataset Warstadt et al. (2019); (3) Dialogue state tracking: the Wizard-of-Oz (WOZ) dataset Wen et al. (2017).
Sentence-pair tasks: (1) Sentence paraphrase: the MRPC Dolan and Brockett (2005) and QQP Wang et al. (2018) datasets; (2) Semantic textual similarity: the STS-B dataset Cer et al. (2017); (3) Language inference: the MNLI dataset Williams et al. (2018).
Specifically, for the two single-sentence tasks of text classification and linguistic acceptability, a sentence is encoded with a static word embedding and the resulting representation is fed to a Bi-LSTM classifier. For the remaining single-sentence task, dialogue state tracking, a deep-neural-network-based model, the Neural Belief Tracker (NBT) Mrksic and Vulic (2018), is chosen. For the three sentence-pair tasks, the two sentences are encoded independently to produce two embedding vectors, which are then combined and fed to an MLP-based classifier. Notice that the performances on these tasks can reflect whether a selected dimension is proper for static word embedding algorithms, but such performances still lag behind those obtained with BERT. For the context-available tasks, it is not fair to compare PLMs (e.g., BERT) with static word embedding algorithms, because the former leverages contexts and the latter does not. Therefore, we only compare the four dimension selection methods (PIP-based, PCA-based, MPD-based, and grid search) in these experiments.
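A minimal PyTorch sketch of the single-sentence setup described above: the selected-dimension embedding matrix is loaded frozen, a Bi-LSTM encodes the token sequence, and a linear head produces class scores. The hidden size, mean pooling, and classifier head are illustrative choices rather than the exact experimental configuration.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Single-sentence classifier over a frozen static word embedding."""

    def __init__(self, embedding_matrix: torch.Tensor, num_classes: int, hidden: int = 128):
        super().__init__()
        # Rows of embedding_matrix are the pre-trained vectors at the selected dimension.
        self.embed = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        self.lstm = nn.LSTM(embedding_matrix.size(1), hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) indices into the frozen embedding table.
        states, _ = self.lstm(self.embed(token_ids))
        sentence = states.mean(dim=1)        # mean pooling over time steps
        return self.head(sentence)
```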
Table 4: Results of different dimension selection methods on the context-available (downstream) tasks.

Dataset | Word2Vec: PIP | Word2Vec: PCA | Word2Vec: MPD(ABTT) | Word2Vec: MPD(CN) | Word2Vec: Optimal | GloVe: PIP | GloVe: PCA | GloVe: MPD(ABTT) | GloVe: MPD(CN) | GloVe: Optimal
---|---|---|---|---|---|---|---|---|---|---
MR | 63.6 | 63.2 | 70.6 | 70.9 | 71.4 | 65.0 | 69.4 | 71.3 | 72.0 | 72.5
SST-2 | 80.0 | 79.8 | 81.6 | 81.6 | 81.9 | 86.0 | 88.5 | 90.0 | 89.8 | 90.2
SST-5 | 32.4 | 32.0 | 38.0 | 38.2 | 38.7 | 34.4 | 35.4 | 39.4 | 39.2 | 39.7
CoLA | 4.3 | 4.4 | 5.6 | 5.5 | 5.6 | 14.4 | 14.3 | 15.6 | 15.6 | 15.8
WOZ | 72.5 | 72.7 | 83.9 | 83.4 | 85.6 | 74.6 | 79.4 | 86.5 | 90.2 | 92.4
MRPC | 79.4 | 79.3 | 80.6 | 80.9 | 80.9 | 80.0 | 80.2 | 81.2 | 81.3 | 81.5
QQP | 53.9 | 53.6 | 56.1 | 56.2 | 56.6 | 60.4 | 60.5 | 62.4 | 62.7 | 62.9
STS-B | 56.6 | 56.5 | 58.6 | 58.8 | 59.0 | 58.0 | 58.7 | 60.0 | 60.2 | 60.4
MNLI | 54.4 | 54.0 | 57.0 | 57.7 | 58.0 | 67.9 | 67.6 | 71.5 | 71.4 | 71.6
Results
The experimental results on the six downstream NLP tasks are reported in Table 4. As shown in Table 4, the dimension selection methods using post-processing (i.e., the 'MPD(ABTT)' and 'MPD(CN)' columns) significantly outperform the ones without post-processing (i.e., the 'PIP' and 'PCA' columns). This shows the necessity of introducing post-processing into dimension selection. Moreover, the MPD-based dimension selection achieves performance competitive with the optimal one ('Optimal'). For example, on the WOZ task, the accuracy based on MPD(ABTT) is substantially higher than that based on PIP and is much closer to the optimal performance. This means that the MPD-based dimension selection works effectively for downstream tasks with contexts.
Then, as illustrated in Table 4, the performance of MPD changes as the post-processing method changes. Except for the WOZ task, MPD(CN) usually performs on a par with MPD(ABTT). This indicates that the choice of post-processing function does not much affect the MPD-based dimension selection.
6.3 Efficiency-performance Trade-off
In this part, we compare the efficiency-performance trade-off among grid search and the three dimension selection methods (PIP-based, PCA-based, and MPD-based) using Word2Vec, where efficiency is measured by computation time and performance is evaluated by the average per-task performance. Ideally, a method should perform well while remaining computationally efficient. The efficiency of grid search depends on its granularity, because a coarser grid search saves time at the cost of performance. Thus, in this experiment, we use the same dimension range as in Sec 4 (from 50 to 1000) and run three grid searches with increments of 2, 5, and 10, denoted as 'GS-2', 'GS-5', and 'GS-10', respectively. The efficiency-performance results are listed in Table 5.
In Table 5, among the three grid searches, 'GS-10' performs closest to MPD but takes 14.01x the time. Although 'GS-2' and 'GS-5' achieve slightly better performance than MPD, they take 56.04x and 28.02x the computational cost, which is a poor efficiency-performance trade-off. On the other hand, compared to PCA, MPD achieves a 1.8-point performance improvement while costing less than a tenth of PCA's computation time. Compared to PIP, MPD has a similar computation cost but achieves a 1.3-point performance increase. Therefore, the MPD-based method achieves the best efficiency-performance trade-off.
Table 5: Efficiency-performance trade-off of different dimension selection methods (performance is the average per-task score; computation time is relative to MPD).

Criterion | Performance | Computation time
---|---|---
GS-2 | 59.5 | 56.04x
GS-5 | 59.3 | 28.02x
GS-10 | 59.0 | 14.01x
PIP | 57.8 | 0.98x
PCA | 57.3 | 10.85x
MPD | 59.1 | 1.00x
Table 6: Ablation on the distance combination: word similarity results on WS353 and MT771.

Embedding | WS353: MPD | WS353: Prim-D | WS353: Post-D | WS353: Optimal | MT771: MPD | MT771: Prim-D | MT771: Post-D | MT771: Optimal
---|---|---|---|---|---|---|---|---
GloVe | 0.66 | 0.62 | 0.65 | 0.69 | 0.56 | 0.55 | 0.55 | 0.58
Word2Vec | 0.65 | 0.60 | 0.64 | 0.68 | 0.55 | 0.54 | 0.55 | 0.58
Table 7: Ablation on the distance combination: context-available results with ABTT and CN post-processing.

Dataset | ABTT: MPD | ABTT: Prim-D | ABTT: Post-D | ABTT: Optimal | CN: MPD | CN: Prim-D | CN: Post-D | CN: Optimal
---|---|---|---|---|---|---|---|---
MR | 71.3 | 70.3 | 70.8 | 72.0 | 72.0 | 70.5 | 70.2 | 72.5
SST-2 | 90.0 | 89.2 | 89.6 | 90.2 | 89.8 | 89.1 | 88.9 | 90.2
SST-5 | 39.4 | 38.5 | 38.9 | 39.6 | 39.2 | 38.4 | 38.6 | 39.7
CoLA | 15.6 | 15.0 | 15.4 | 15.7 | 15.6 | 14.9 | 15.1 | 15.8
WOZ | 86.1 | 84.5 | 85.5 | 89.8 | 90.2 | 88.2 | 88.0 | 92.4
MRPC | 81.2 | 80.4 | 80.9 | 81.4 | 81.3 | 80.3 | 80.5 | 81.5
QQP | 62.4 | 61.5 | 62.0 | 62.6 | 62.7 | 62.0 | 61.9 | 62.9
STS-B | 60.0 | 59.2 | 59.8 | 60.2 | 60.2 | 59.6 | 59.7 | 60.4
MNLI | 71.5 | 69.3 | 70.6 | 71.6 | 71.4 | 69.6 | 69.9 | 71.6
7 Ablation
In the ablation studies, we principally answer two questions: (1) Is the distance combination in MPD effective? (2) Is the post-processing function effective in the case of extreme word frequency?
7.1 Is the distance combination in MPD effective?
To further explore the effectiveness of combining the two distances (the primitive relative distance $\mathcal{D}_{prim}$ and the post relative distance $\mathcal{D}_{post}$) in MPD, we add two extra dimension selection criteria: (1) Prim-D, which selects the dimension minimizing only $\mathcal{D}_{prim}$; and (2) Post-D, which selects the dimension minimizing only $\mathcal{D}_{post}$. Then, we compare the performances of the three dimension selection methods: MPD-based, Prim-D-based, and Post-D-based.
Table 6 and Table 7 report the performance of the different selection methods on the context-unavailable and context-available tasks, respectively. As we can see, MPD obtains better results than the other two criteria, validating the effectiveness of MPD, which combines $\mathcal{D}_{prim}$ and $\mathcal{D}_{post}$ through a geometric mean. The findings indicate that either completely preserving or completely diminishing the impact of word frequency is harmful to word embeddings, so a neutralized combination is appropriate. Although there may be other combinations of $\mathcal{D}_{prim}$ and $\mathcal{D}_{post}$ with better performance, the choice of combination method is not the focus of this paper.
7.2 Is the post-processing function effective in the case of extreme word frequency?
For the training corpus with extreme word frequency, the performances of different dimension selection methods are shown in Table 8. As we can see, the methods with post-processing functions (i.e., MPD and Post-D) perform better than or on a par with those without post-processing functions (i.e., PIP, PCA, and Prim-D), demonstrating that the post-processing functions are effective in the case of extreme word frequency.
Table 8: Results of different dimension selection criteria under extreme word frequency.

Dataset | Criterion | 0 | 3% | 5% | 10%
---|---|---|---|---|---
SST-2 | Optimal | 81.9 | 81.4 | 81.5 | 81.8
SST-2 | PIP | 80.0 | 79.5 | 79.2 | 78.7
SST-2 | PCA | 79.8 | 79.5 | 79.0 | 78.5
SST-2 | Prim-D | 80.2 | 80.4 | 80.0 | 80.9
SST-2 | Post-D | 80.6 | 80.9 | 79.9 | 80.5
SST-2 | MPD | 81.6 | 81.0 | 81.2 | 81.3
WordSim353 | Optimal | 69.1 | 69.2 | 69.1 | 69.1
WordSim353 | PIP | 65.5 | 64.4 | 63.9 | 63.5
WordSim353 | PCA | 64.4 | 64.0 | 63.5 | 63.2
WordSim353 | Prim-D | 62.3 | 62.2 | 61.2 | 61.0
WordSim353 | Post-D | 65.1 | 65.1 | 65.0 | 64.9
WordSim353 | MPD | 66.9 | 66.6 | 66.5 | 66.5
8 Conclusion
In this paper, we investigate an overlooked problem, whether word frequency influences automatic dimension selection for static word embedding algorithms, and find that extreme word frequency can degrade existing dimension selection methods. Based on this empirical finding, we propose MPD, a novel metric that combines the primitive relative distance $\mathcal{D}_{prim}$ and the post relative distance $\mathcal{D}_{post}$. Through $\mathcal{D}_{post}$, post-processing functions are introduced into the dimension selection. We then develop an MPD-based dimension selection method that automatically selects a proper dimension for static embedding algorithms. Extensive experiments on context-unavailable and context-available tasks show that our MPD-based dimension selection achieves better performance than baseline methods with a good efficiency-performance trade-off.
References
- Artetxe et al. (2016) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294.
- Bruni et al. (2014) Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of artificial intelligence research, 49:1–47.
- Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14.
- Chung and Glass (2018) Yu-An Chung and James Glass. 2018. Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. arXiv preprint arXiv:1803.08976.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dolan and Brockett (2005) Bill Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005).
- Finkelstein et al. (2001) Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414.
- Gong et al. (2018) Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2018. Frage: Frequency-agnostic word representation. Advances in neural information processing systems, 31.
- Guo et al. (2019) Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. 2019. Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3723–3730.
- Halawi et al. (2012) Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. 2012. Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1406–1414.
- Han et al. (2020) Jialong Han, Aixin Sun, Haisong Zhang, Chenliang Li, and Shuming Shi. 2020. Case: Context-aware semantic expansion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7871–7878.
- Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pages 2177–2185.
- Levy et al. (2015) Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.
- Liu et al. (2021) Lemao Liu, Haisong Zhang, Haiyun Jiang, Yangming Li, Enbo Zhao, Kun Xu, Linfeng Song, Suncong Zheng, Botong Zhou, Dick Zhu, et al. 2021. Texsmart: A system for enhanced natural language understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 1–10.
- Liu et al. (2019) Tianlin Liu, Lyle Ungar, and Joao Sedoc. 2019. Unsupervised post-processing of word vectors via conceptor negation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6778–6785.
- Luong et al. (2013) Minh-Thang Luong, Richard Socher, and Christopher D Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the seventeenth conference on computational natural language learning, pages 104–113.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
- Miller and Charles (1991) George A Miller and Walter G Charles. 1991. Contextual correlates of semantic similarity. Language and cognitive processes, 6(1):1–28.
- Mrksic and Vulic (2018) Nikola Mrksic and Ivan Vulic. 2018. Fully statistical neural belief tracking. In ACL (2).
- Mu and Viswanath (2018) Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations.
- Palangi et al. (2016) Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. 2016. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(4):694–707.
- Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 115–124.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
- Rubenstein and Goodenough (1965) Herbert Rubenstein and John B Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.
- Schnabel et al. (2015) Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 298–307.
- Smith et al. (2017) Samuel L Smith, David HP Turban, Steven Hamblin, and Nils Y Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.
- Socher et al. (2013) Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pages 926–934. Citeseer.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.
- Wang (2019) Yu Wang. 2019. Single training dimension selection for word embedding with pca. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3597–3602.
- Warstadt et al. (2019) Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.
- Wen et al. (2017) Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gasic, Lina M Rojas Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449.
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122.
- Yin and Shen (2018) Zi Yin and Yuanyuan Shen. 2018. On the dimensionality of word embedding. In Advances in Neural Information Processing Systems, pages 887–898.
- Zhang et al. (2019) Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 52(1):1–38.
Appendix A Proof of Theorem 1
Proof 1
Let $E_1 = U_1 D_1 V_1^{\top}$ and $E_2 = U_2 D_2 V_2^{\top}$ be the SVDs. If we can prove $U_1 = U_2$ and $D_1 = D_2$, then $E_1 = E_2 V_2 V_1^{\top}$, and $T = V_2 V_1^{\top}$ is the required unitary matrix, which directly completes the proof. We denote the singular values of $E_1$ and $E_2$ by $\sigma_1 \ge \sigma_2 \ge \cdots$ and $\tau_1 \ge \tau_2 \ge \cdots$, respectively. The proof proceeds as follows.
Firstly, the primitive relative distance defined in Eq. 2 is reformulated as Eq. 4:
(4)
If $\mathcal{D}_{prim}(E_1, E_2) = 0$, then the numerator of Eq. 4 is 0.
Secondly, let $u_1$ be the first column of $U_1$ (i.e., the singular vector corresponding to the largest singular value $\sigma_1$). From Eqs. 5-7 it follows that $\sigma_1 = \tau_1$ and that the leading singular vectors of $E_1$ and $E_2$ coincide.
(5)
(6)
(7)
The same argument can be applied to the other singular vectors of $E_1$ and $E_2$ to obtain $U_1 = U_2$ and $D_1 = D_2$. Because $U_1 = U_2$ and $D_1 = D_2$, the unitary matrix $T = V_2 V_1^{\top}$ is obtained, which completes the proof.
Appendix B Limitation
Since our MPD is specific to static word embeddings, it cannot be applied to pre-trained language models like BERT. Notice that BERT's 'dimension' refers to the hidden state size (768 in BERT-base). To study the relationship between dimension and BERT's performance, BERT would need to be re-trained whenever the hidden state size changes. This is extremely difficult because re-training BERT requires massive computational resources, whereas re-training GloVe/Word2Vec can be done easily. Besides, different NLP tasks have different characteristics, but our dimension selection method lacks task-specific designs, which we will explore in the future.