
Frequency-aware Dimension Selection for Static Word Embedding by Mixed Product Distance

Lingfeng Shen (Johns Hopkins University)
Haiyun Jiang (Tencent AI Lab)
Lemao Liu (Tencent AI Lab)
Ying Chen (China Agricultural University)
Abstract

Static word embedding is still useful, particularly for context-unavailable tasks, because when no context is available, pre-trained language models often perform worse than static word embeddings. Although dimension is a key factor determining the quality of static word embeddings, automatic dimension selection is rarely discussed. In this paper, we investigate the impact of word frequency on dimension selection, and empirically find that word frequency is so vital that it needs to be taken into account during dimension selection. Based on this empirical finding, we propose a dimension selection method that uses a metric, the Mixed Product Distance (MPD), to select a proper dimension for word embedding algorithms without training any word embedding. By applying a post-processing function to oracle matrices, the MPD-based method can de-emphasize the impact of word frequency. Experiments on both context-unavailable and context-available tasks demonstrate the better efficiency-performance trade-off of our MPD-based dimension selection method over baselines.

1 Introduction

Word embedding has been widely used in numerous NLP tasks, such as recommendation systems Zhang et al. (2019), text classification Chung and Glass (2018), information retrieval Palangi et al. (2016) and machine translation Guo et al. (2019). Although pre-trained language models (PLMs) like BERT Devlin et al. (2018) have become the prevailing paradigm in NLP, static word embeddings, such as Word2Vec Mikolov et al. (2013) and GloVe Pennington et al. (2014), are still useful in scenarios where contexts are missing, such as entity retrieval Liu et al. (2021); Han et al. (2020) and word similarity tasks Halawi et al. (2012); Bruni et al. (2014). Under such scenarios, static word embeddings often perform better than PLMs.

Although there are many algorithms for training static word embeddings, the impact of dimension on word embedding has been less investigated. Yin and Shen (2018) have shown that a critical hyper-parameter, the choice of dimension for word vectors, significantly influences the performance of a model that builds on word embeddings. An extremely low-dimensional word embedding is probably not expressive enough to represent the meanings of words; in contrast, a high-dimensional word embedding is expressive but possibly leads to over-fitting and high computational cost. For most NLP models built on word embeddings, the dimension is selected ad hoc or by grid search, which results in either sub-optimal model performance or heavy computational cost. Thus, it is promising to establish a dimension selection criterion that can automatically select a proper dimension for static word embedding algorithms with a good efficiency-performance trade-off.

Yin and Shen (2018) first investigated dimension selection for static word embedding. They proposed a metric called Pairwise Inner Product (PIP) loss which attempts to reflect the relationship between the dimension and the quality of word embedding and then developed a PIP-based dimension selection method that selects a dimension with the minimal PIP loss. Later, Wang (2019) proposed a dimension selection method based on Principal Components Analysis (PCA). The PCA-based dimension selection trains a high-dimensional word embedding and then uses the embedding to conduct a grid search to find a proper dimension, which is time-consuming. However, both the PIP-based dimension selection and the PCA-based dimension selection overlook one dominant element that may influence the dimension selection: word frequency.

Thus, in this paper, we first investigate whether word frequency is influential in the dimension selection of static word embedding. Our results reveal that extreme word frequency may cause the PIP-based or PCA-based methods to select an improper dimension. That is, without diminishing the impact of word frequency, these methods are more likely to select an improper dimension and thus generate a low-quality word embedding. We then explore a way to diminish the impact of word frequency. Inspired by the success of post-processing methods on static word embeddings Mu and Viswanath (2018); Liu et al. (2019), which de-emphasize the influence of some words in an embedding, we attempt to transfer them to dimension selection.

Specifically, we propose an MPD-based dimension selection, which minimizes a metric called Mixed Product Distance (MPD) to find a proper dimension. MPD can reduce the impact of word frequency by applying a post-processing function to oracle matrices, which results in a more proper dimension. To investigate the effectiveness of our method, we conduct evaluations on context-unavailable and context-available tasks. Comprehensive experiments demonstrate that the MPD-based dimension selection method consistently performs better than baselines.

In summary, the main contributions of this paper are as follows:

  • We expose a critical issue overlooked by previous dimension selection methods: word frequency. It can influence the dimension selection and result in degraded word embeddings.

  • We design a metric called Mixed Product Distance (MPD) for static word embedding dimension selection, which uses post-processing functions to overcome the word frequency problem.

  • We evaluate the MPD-based dimension selection on various NLP tasks, including context-unavailable and context-available tasks. Comprehensive experiments demonstrate that our method achieves a better efficiency-performance trade-off.

2 Related Work

2.1 Static Word Embedding and Dimension Selection

Existing static word embedding algorithms can be formulated as either explicit or implicit matrix factorization, and the most popular word embedding algorithms can be regarded as implicit matrix factorization. For example, Word2Vec and GloVe perform implicit factorization on the symmetric matrix (i.e., $\alpha=0.5$), where the signal matrix $M$ is the shifted PPMI matrix for Word2Vec and the log-count matrix for GloVe. Although intensive studies have been done on static word embedding algorithms that learn word embeddings via matrix factorization, selecting a proper dimension for these algorithms has been less investigated. Yin and Shen (2018) proposed a PIP-based dimension selection method: for every candidate dimension, its PIP loss is computed, and the dimension with the minimal PIP loss is selected. Notably, the PIP-based method can be conducted without training any word embedding. In contrast, Wang (2019) proposed a PCA-based method for dimension selection, which trains a high-dimensional word embedding to guide the dimension selection and is therefore extremely time-consuming.

2.2 Post-processing for Word Embedding

To further improve the quality of a word embedding, post-processing, which reinforces linguistic constraints on the embedding, is often used. According to the studies on post-processing Mu and Viswanath (2018); Liu et al. (2019), one of the important linguistic constraints is to diminish the influence of word frequency. Specifically, in a word embedding, principal components (PCs), which are mainly encoded by frequent words, are shared by all words. As a result, the vectors of less frequent words have a higher variance than those of frequent words, and such a high variance hurts the embedding quality because word frequency is unrelated to lexical semantics. Two popular post-processing functions, All-but-the-top (ABTT) Mu and Viswanath (2018) and Conceptor Negation (CN) Liu et al. (2019), have been used to remove or suppress PCs in word embeddings. In our method, these post-processing functions are applied to the oracle matrix of word embedding algorithms instead of to the word embedding.
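To make the idea concrete, the following is a minimal sketch of ABTT-style post-processing under our own simplifying assumptions (it is not the reference implementation of Mu and Viswanath): the embedding matrix is mean-centered and its top principal components, which mainly encode frequency-related directions, are projected out; the number of removed components is a hypothetical choice.

```python
import numpy as np

def abtt(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
    """ABTT-style post-processing sketch: subtract the mean vector, then
    remove the projections onto the top principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Principal directions of the centered embedding matrix via SVD.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_pcs = vt[:n_components]                 # shape: (n_components, dim)
    # Project out the dominant, frequency-related components.
    return centered - centered @ top_pcs.T @ top_pcs
```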

3 Preliminaries

This section presents the basic procedure for the dimension selection of word embedding. A static word embedding algorithm is chosen to obtain a word embedding matrix $X\in\mathbb{R}^{n\times k}$, where $X$ is composed of $n$ vectors, and word $w_{i}$ is represented by vector $v_{i}\in\mathbb{R}^{k}$, $i\in[n]$. Generally, dimension selection aims to find a proper dimension $k$ with a good performance-efficiency trade-off.

There are two existing dimension selection methods: PIP-based and PCA-based. The PIP-based method can compute the dimension directly without training any word embedding, while the PCA-based method needs to train a word embedding before dimension selection, which incurs much more computational cost. Hence, the PCA-based method has a worse performance-efficiency trade-off than the PIP-based method. In this paper, we work on dimension selection without training any word embedding.

Following previous works Levy and Goldberg (2014), we leverage an oracle matrix for dimension selection, which can be obtained through matrix decomposition without any training process. Formally, for a static word embedding algorithm with two hyper-parameters ($\alpha$ and dimension $k$) and a signal matrix $M$ (i.e., the co-occurrence matrix), an oracle matrix $X$ is derived from $M$ as $X=f_{\alpha,k}(M)=U_{\cdot,1:k}D_{1:k,1:k}^{\alpha}$, where $M=UDV^{T}$ is the SVD of $M$ obtained by matrix factorization Levy et al. (2015).

To implement the dimension selection, we divide the training corpus into two equal-sized partitions $C_{1}$ and $C_{2}$, and then obtain the co-occurrence matrices $M_{1}$ and $M_{2}$ on each partition, respectively. Correspondingly, we obtain two oracle matrices $X$ and $\hat{X}$. Then, a common way to select a proper dimension is used: define a proper distance $d$ and find the $k$ that minimizes the distance $d(X,\hat{X})$. Notice that a distance whose near-zero value ($\approx 0$) guarantees the unitary-invariance of the two oracle matrices $X$ and $\hat{X}$ may serve as a proper distance for dimension selection Yin and Shen (2018), where unitary-invariance means that two matrices are essentially identical if one can be transformed into the other by a unitary operation (e.g., a simple rotation); unitary-invariance thus defines the equivalence class of the oracle matrices.
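For concreteness, a minimal sketch of this procedure is given below. It assumes the co-occurrence (signal) matrices from the two partitions have already been built; the candidate dimensions and the distance function are left as parameters (the distance itself is defined in Section 5), and in practice a truncated SVD would replace the full one.

```python
import numpy as np

def oracle_matrix(M: np.ndarray, k: int, alpha: float = 0.5) -> np.ndarray:
    """Oracle embedding X = U[:, :k] * D[:k]**alpha obtained from the SVD of M."""
    U, D, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * (D[:k] ** alpha)

def select_dimension(M1, M2, candidate_ks, distance, alpha=0.5):
    """Pick the k that minimizes distance(X, X_hat) between the two oracle matrices."""
    best_k, best_d = None, float("inf")
    for k in candidate_ks:
        X, X_hat = oracle_matrix(M1, k, alpha), oracle_matrix(M2, k, alpha)
        d = distance(X, X_hat)
        if d < best_d:
            best_k, best_d = k, d
    return best_k
```

Any distance with the unitary-invariance property discussed above can be plugged in as `distance`.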

4 Is word frequency influential in dimension selection?

Although previous works Mu and Viswanath (2018); Gong et al. (2018) have pointed out that word frequency may bring negative consequences to word embeddings, the impact of word frequency on dimension selection has been overlooked by previous studies on dimension selection Yin and Shen (2018); Wang (2019). Therefore, we explore a question in this section: does word frequency influence the dimension selection for word embeddings?

We first design an experiment with extreme word frequency on two tasks: a context-unavailable one (the WordSim353 dataset, Finkelstein et al., 2001) and a context-available one (the SST-2 dataset, Socher et al., 2013). In the experiment, the following two kinds of dimensions, $k^{*}$ and $k^{+}$, are used. Generally, the word embedding with $k^{+}$ consistently outperforms the one with $k^{*}$, and the performance gap between $k^{+}$ and $k^{*}$ reflects the effectiveness of a dimension selection method.

  • Criterion-selected dimension $k^{*}$: Given a dimension selection method, the dimension with the minimal distance $d$ is selected as the criterion-selected dimension $k^{*}$.

  • Empirically optimal dimension $k^{+}$ (Optimal): Given a word embedding algorithm, a word embedding is trained for each dimension in a grid search (i.e., the dimension is varied from 50 to 1000 with an increment step of 2). Then, each word embedding is evaluated on the benchmark, and the dimension achieving the highest performance is selected as the empirically optimal dimension $k^{+}$.

Dataset      Criterion          0      3%     5%     10%
SST-2        $k^{+}$ (Optimal)  81.9   81.4   81.5   81.8
             $k^{*}$ (PIP)      80.0   79.5   79.2   78.7
             $k^{*}$ (PCA)      79.8   79.5   79.0   78.5
WordSim353   $k^{+}$ (Optimal)  69.1   69.2   69.1   69.1
             $k^{*}$ (PIP)      65.5   64.4   63.9   63.5
             $k^{*}$ (PCA)      64.4   64.0   63.5   63.2
Table 1: Performance on the two tasks with extreme word frequency. The embeddings are generated by Word2Vec with a dimension selected by a criterion (PIP, PCA, or grid search). The '10%' column means adding $10\%\cdot N$ duplicate sentences to the training corpus.

Given the training corpus, the Gigaword 5 corpus which contains $N$ sentences, we create varying word frequencies as follows: we randomly select a sentence (whose length is larger than 20, so that the duplication does not affect the overall fluency of the corpus given that the Word2Vec window size is 5), duplicate it $3\%\cdot N$, $5\%\cdot N$, and $10\%\cdot N$ times, and merge the copies with the corpus. The results are listed in Table 1. As the word frequency becomes more extreme, the performance gap between $k^{*}$ and $k^{+}$ enlarges, indicating that extreme word frequency significantly hurts the effectiveness of existing dimension selection methods. Thus, it is necessary to diminish the influence of word frequency during dimension selection.
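For reference, a minimal sketch of this corpus manipulation is shown below; only the sentence-length filter and the duplication ratio come from the description above, while the tokenization (whitespace splitting) and in-memory handling of the corpus are simplifying assumptions.

```python
import random

def add_extreme_frequency(sentences, ratio=0.03, min_len=20, seed=0):
    """Duplicate one randomly chosen long sentence ratio*N times to create
    extreme word frequencies, then merge the copies with the corpus."""
    random.seed(seed)
    long_sentences = [s for s in sentences if len(s.split()) > min_len]
    duplicated = random.choice(long_sentences)
    n_copies = int(ratio * len(sentences))
    return sentences + [duplicated] * n_copies
```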

5 Methodology

5.1 Overview

The formulation of our proposed distance, the Mixed Product Distance (MPD), is defined as follows:

$\mathrm{MPD}(X,\hat{X})=\sqrt{d_{r}(X,\hat{X})\cdot d_{p}(X,\hat{X})}$   (1)

where $X$ and $\hat{X}$ correspond to the two oracle matrices mentioned in Sec 3. Concretely, MPD consists of two distances: the primitive relative distance $d_{r}$ and the post relative distance $d_{p}$. Based on this distance criterion, an MPD-based dimension selection method is developed to select a proper dimension for static embedding algorithms.

5.2 Primitive Relative Distance

Definition 1 gives the definition of the primitive relative distance $d_{r}$, and this definition guarantees the unitary-invariance of $X$ and $\hat{X}$. Moreover, unless stated otherwise, the matrix norms used in this paper are Frobenius norms.

Definition 1

The primitive relative distance $d_{r}$ between the oracle matrices $X\in\mathbb{R}^{n\times k}$ and $\hat{X}\in\mathbb{R}^{n\times k}$ is defined as follows:

$d_{r}\left(X,\hat{X}\right)=\dfrac{\left\|XX^{T}-\hat{X}\hat{X}^{T}\right\|^{2}}{\left\|XX^{T}\right\|\left\|\hat{X}\hat{X}^{T}\right\|}$   (2)

where the denominator $\|XX^{T}\|\,\|\hat{X}\hat{X}^{T}\|$ is a scaling term.

Then, the relationship between $d_{r}$ and unitary-invariance is presented as follows. According to Theorem 1, if $d_{r}$ is close to 0, the two matrices $X$ and $\hat{X}$ are (approximately) unitary-invariant. In other words, when the numerator $\|XX^{T}-\hat{X}\hat{X}^{T}\|^{2}$ in Eq. 2 is 0, $X$ and $\hat{X}$ can be regarded as identical Smith et al. (2017); Artetxe et al. (2016). Therefore, the definition of $d_{r}$ guarantees that when $d_{r}$ is close to 0, the unitary-invariance of $X$ and $\hat{X}$ holds. Thus, the dimension that minimizes $d_{r}$ is a proper dimension for word embedding, as discussed in Sec 3.

Theorem 1

Suppose $d_{r}(A,B)\approx 0$; then $A$ and $B$ are unitary-invariant (i.e., there exists a unitary matrix $T$ such that $B\approx AT$).

The proof is deferred to Appendix A.
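As an illustrative numerical check of Definition 1 and Theorem 1 (not part of the paper's pipeline), the sketch below shows that $d_{r}$ between a random matrix and a unitarily rotated copy of itself is numerically zero, while it is clearly non-zero for an unrelated matrix.

```python
import numpy as np

def d_r(X, X_hat):
    """Primitive relative distance of Eq. 2, using Frobenius norms."""
    num = np.linalg.norm(X @ X.T - X_hat @ X_hat.T) ** 2
    den = np.linalg.norm(X @ X.T) * np.linalg.norm(X_hat @ X_hat.T)
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
T, _ = np.linalg.qr(rng.normal(size=(50, 50)))   # random orthogonal (unitary) matrix

print(d_r(X, X @ T))                          # ~0 up to floating-point error
print(d_r(X, rng.normal(size=(1000, 50))))    # clearly non-zero for an unrelated matrix
```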

5.3 Post Relative Distance

In order to overcome the word frequency problem, the post relative distance $d_{p}$ is defined as shown in Definition 2, where $F(\cdot)$ is a post-processing function applied to the oracle matrices. Notice that the post-processing function in $d_{p}$ can be any post-processing method that has been used on word embeddings.

Definition 2

Given a post-processing function $F$, the post relative distance $d_{p}$ between the oracle matrices $X\in\mathbb{R}^{n\times k}$ and $\hat{X}\in\mathbb{R}^{n\times k}$ is defined as follows:

$d_{p}\left(Y,\hat{Y}\right)=\dfrac{\left\|YY^{T}-\hat{Y}\hat{Y}^{T}\right\|^{2}}{\left\|YY^{T}\right\|\left\|\hat{Y}\hat{Y}^{T}\right\|}$   (3)

where $Y=F(X)$ and $\hat{Y}=F(\hat{X})$.

Specifically, we transfer the success of post-processing functions on word embedding matrices to oracle matrices. Post-processing functions essentially 'normalize' a word embedding by de-emphasizing the impact of some words. Since the oracle matrices $X$ and $\hat{X}$ are also influenced by word frequency, we believe that applying a post-processing function to the oracle matrices can likewise diminish this influence.

Moreover, the post relative distance is essentially the same as the primitive relative distance, except for the use of the post-processing function $F$. Similar to the discussion of $d_{r}$, $d_{p}$ also guarantees unitary-invariance.

5.4 Combination

It is also worth mentioning that eliminating the impact of word frequency is not always beneficial for word embedding, because some crucial information in a word embedding is inevitably eliminated along with it Gong et al. (2018). Therefore, we combine $d_{r}$ and $d_{p}$ with a geometric mean in MPD.
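Putting Eqs. 1-3 together, the following is a self-contained sketch of the MPD computation. The helper `relative_distance` implements the shared form of $d_{r}$ and $d_{p}$, and `post_process` stands in for the post-processing function $F$ (the paper uses ABTT or CN; a simple mean-and-top-PC removal is used here as an assumed placeholder).

```python
import numpy as np

def relative_distance(A, B):
    """Shared form of d_r (Eq. 2) and d_p (Eq. 3): ||AA^T - BB^T||^2 / (||AA^T|| ||BB^T||)."""
    num = np.linalg.norm(A @ A.T - B @ B.T) ** 2
    den = np.linalg.norm(A @ A.T) * np.linalg.norm(B @ B.T)
    return num / den

def post_process(X, n_components=2):
    """Placeholder post-processing F: mean removal plus top-PC removal (ABTT-style)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    pcs = vt[:n_components]
    return Xc - Xc @ pcs.T @ pcs

def mpd(X, X_hat, F=post_process):
    """Mixed Product Distance (Eq. 1): geometric mean of d_r and d_p."""
    d_r = relative_distance(X, X_hat)
    d_p = relative_distance(F(X), F(X_hat))
    return np.sqrt(d_r * d_p)
```

This `mpd` function can then serve as the `distance` argument of the dimension-selection loop sketched in Section 3.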

Embedding   Criterion   WS      MTurk   MC      MEN     RG      RW
BERT        None        41.15   32.89   49.13   41.23   34.66   13.15
GloVe       Optimal     69.12   58.78   70.20   73.75   76.95   33.55
            PIP         65.48   55.48   67.19   71.56   74.32   31.49
            PCA         64.44   55.23   66.82   71.09   73.56   31.67
            MPD         66.89   57.02   68.66   72.92   75.23   32.91
Word2Vec    Optimal     68.03   58.12   63.96   70.27   70.01   25.43
            PIP         64.68   55.63   62.19   68.45   68.76   23.12
            PCA         64.56   55.91   62.03   68.70   69.04   22.41
            MPD         67.14   57.11   62.88   69.79   69.54   24.64
Table 2: Performance on word similarity using different word embeddings. Except for the dynamic embedding produced by BERT, the word embeddings are generated by a static word embedding algorithm with a dimension selected by a criterion (PIP, PCA, MPD, or a grid search). Each number is 100 $\times$ $\rho$ (Spearman correlation); bold numbers in the original table indicate the best performance except 'Optimal'.

6 Experiment

This section carries out three sets of experiments to comprehensively analyze our MPD-based dimension selection. The first two sets of experiments evaluate the effectiveness of our MPD-based dimension selection on context-unavailable and context-available tasks, respectively. The third set of experiments compares the efficiency-performance trade-off of different dimension selection methods.

Specifically, two popular static word embedding algorithms (Word2Vec, for which we use the CBOW version, and GloVe) are chosen to train word embeddings, and both use Gigaword 5 as the training corpus. For the MPD-based dimension selection method, two commonly-used post-processing methods (CN and ABTT) are used, with CN as the default. Moreover, for each experiment, a criterion-selected dimension $k^{*}$ and an empirically optimal dimension $k^{+}$ are computed, whose definitions can be found in Sec 4.

Practical Settings

Considering that SVD decomposition on large matrices requires extremely large memory resources, we reduce the vocabulary size of the word embedding to 25,000, which can be handled by a device with 16GB of memory. Moreover, previous work Yin and Shen (2018) has shown that as long as the vocabulary size is beyond a threshold (i.e., 10,000), the performance does not degrade much even when the vocabulary is reduced.

6.1 Context-unavailable Tasks

Dataset    Embedding       Human Score
Chinese    BERT            0.79
           $k^{*}$ (PIP)   1.12
           $k^{*}$ (PCA)   1.08
           $k^{*}$ (MPD)   1.38
English    BERT            0.68
           $k^{*}$ (PIP)   1.12
           $k^{*}$ (PCA)   1.05
           $k^{*}$ (MPD)   1.29
Table 3: Performance (human score) on semantic expansion using different word embeddings, which are generated by GloVe with a dimension selected by PIP, PCA, or MPD. The empirically optimal dimension $k^{+}$ (Optimal) is not reported because it is selected according to the final performance, and such a selection is not feasible for semantic expansion since it would require hundreds of rounds of human annotation.

Experiment Settings

We conduct experiments on the following two tasks in which contexts are unavailable.

  • Word similarity: The quality of each embedding is evaluated on the following six benchmarks: WordSim353 Finkelstein et al. (2001), MTurk-771 Halawi et al. (2012), MC Miller and Charles (1991), MEN Bruni et al. (2014), RG Rubenstein and Goodenough (1965), and RW Luong et al. (2013). Each dataset contains a list of word pairs with human scores. The performance is measured by the correlation between the cosine similarities of word vectors and the human scores (a minimal evaluation sketch follows this list).

  • Semantic expansion (SE): The Chinese and English entity lists in TexSmart Liu et al. (2021) are selected. Then, a word embedding trained with a selected dimension is applied to each entity to retrieve its related entities. Finally, the top five expansion results of each input entity are judged by human annotators in terms of quality and relatedness. Each result is annotated by two annotators, and a label of 2, 1, or 0 is assigned, corresponding to highly related, slightly related, and not related, respectively.
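As referenced in the word-similarity item above, here is a minimal sketch of the evaluation protocol; the embedding lookup (a plain `dict`) and the dataset format are simplifying assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_score(embeddings, pairs):
    """embeddings: dict word -> vector; pairs: list of (word1, word2, human_score)."""
    model_scores, human_scores = [], []
    for w1, w2, human in pairs:
        if w1 in embeddings and w2 in embeddings:   # skip out-of-vocabulary pairs
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(human)
    rho, _ = spearmanr(model_scores, human_scores)
    return 100 * rho   # reported as 100 x Spearman correlation, as in Table 2
```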

Results

The performances of different dimension selection methods on word similarity and semantic expansion are listed in Table 2 and Table 3, respectively. First of all, we can see that BERT Devlin et al. (2018) performs catastrophically on the context-unavailable tasks since its dynamic embeddings rely on contextual learning. Second, MPD outperforms the baselines (PIP and PCA) on the word similarity tests because the post-processing function in MPD diminishes the impact of word frequency on the oracle matrices. Finally, MPD outperforms the baselines on semantic expansion because the incorporation of the post-processing function enables the dimension selection method to neglect redundant components and capture word semantics better.

6.2 Context-available Tasks

Experiment Settings

Since it is often the case that success on context-unavailable tasks does not transfer well to context-available tasks Schnabel et al. (2015), we conduct experiments that directly compare the effectiveness of different dimension selection methods on various downstream NLP tasks with contexts. The tasks can be divided into the following two kinds:

Single-sentence tasks: (1) Text classification: the movie review (MR) Pang and Lee (2005), SST-5 Socher et al. (2013), and SST-2 Socher et al. (2013) datasets; (2) Linguistic acceptability: the CoLA dataset Warstadt et al. (2019); (3) Dialogue state tracking: the Wizard-of-Oz (WOZ) dataset Wen et al. (2017).

Sentence-pair tasks: (1) Sentence paraphrase: the MRPC Dolan and Brockett (2005) and QQP Wang et al. (2018) datasets; (2) Semantic textual similarity: the STS-B dataset Cer et al. (2017); (3) Language inference: the MNLI dataset Williams et al. (2018).

Specifically, for two of the single-sentence tasks (text classification and linguistic acceptability), a sentence is encoded with a static word embedding and the resulting embedding vector is fed to a Bi-LSTM classifier. For the remaining single-sentence task, dialogue state tracking, a deep-neural-network-based model, the Neural Belief Tracker (NBT) Mrksic and Vulic (2018), is chosen. For the three sentence-pair tasks, the two sentences are encoded independently to produce two embedding vectors $u$ and $v$, and the vector $[u;v;|u-v|]$ is fed to an MLP-based classifier. Notice that the performances on these tasks can reflect whether a selected dimension is proper for static word embedding algorithms, but such performances still lag behind those obtained with BERT. For the context-available tasks, it is not fair to compare PLMs (e.g., BERT) with static word embedding algorithms, because the former leverage contexts and the latter do not. Therefore, we only compare the four dimension selection methods (PIP-based, PCA-based, MPD-based, and grid search) in these experiments.
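A minimal sketch of the sentence-pair feature construction described above follows; the sentence encoder shown here (a simple average of word vectors) is an assumed placeholder rather than the encoder used in the paper, and only the $[u;v;|u-v|]$ concatenation follows the text.

```python
import numpy as np

def encode_sentence(tokens, embeddings, dim):
    """Placeholder encoder: average of word vectors (the paper's encoder may differ)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def pair_features(tokens_a, tokens_b, embeddings, dim):
    """Build the [u; v; |u - v|] feature vector fed to the MLP-based classifier."""
    u = encode_sentence(tokens_a, embeddings, dim)
    v = encode_sentence(tokens_b, embeddings, dim)
    return np.concatenate([u, v, np.abs(u - v)])
```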

Dataset   Word2Vec                                        GloVe
          PIP    PCA    MPD(ABTT)  MPD(CN)  Optimal       PIP    PCA    MPD(ABTT)  MPD(CN)  Optimal
MR        63.6   63.2   70.6       70.9     71.4          65.0   69.4   71.3       72.0     72.5
SST-2     80.0   79.8   81.6       81.6     81.9          86.0   88.5   90.0       89.8     90.2
SST-5     32.4   32.0   38.0       38.2     38.7          34.4   35.4   39.4       39.2     39.7
CoLA      4.3    4.4    5.6        5.5      5.6           14.4   14.3   15.6       15.6     15.8
WOZ       72.5   72.7   83.9       83.4     85.6          74.6   79.4   86.5       90.2     92.4
MRPC      79.4   79.3   80.6       80.9     80.9          80.0   80.2   81.2       81.3     81.5
QQP       53.9   53.6   56.1       56.2     56.6          60.4   60.5   62.4       62.7     62.9
STS-B     56.6   56.5   58.6       58.8     59.0          58.0   58.7   60.0       60.2     60.4
MNLI      54.4   54.0   57.0       57.7     58.0          67.9   67.6   71.5       71.4     71.6
Table 4: Performance on the context-available tasks. Rows (e.g., 'WOZ') denote the datasets; columns 'PIP', 'PCA', 'MPD(ABTT)', 'MPD(CN)', and 'Optimal' denote the PIP $k^{*}$, PCA $k^{*}$, MPD $k^{*}$ (with ABTT or CN as the post-processing function), and the empirically optimal dimension $k^{+}$, respectively.

Results

The experimental results of the six downstream NLP tasks are reported in Table 4. As shown in Table 4, the dimension selection methods using post-processing (i.e., the 'MPD(ABTT)' and 'MPD(CN)' columns) significantly outperform those without post-processing (i.e., the 'PIP' and 'PCA' columns), which shows the necessity of introducing post-processing into dimension selection. Moreover, the MPD-based dimension selection achieves performance competitive with the optimal dimensions ('Optimal'). For example, on the 'WOZ' task, compared to PIP, the accuracy based on MPD(ABTT) increases from 72.5% to 83.9%, which is close to the optimal performance (85.6%). This means that MPD-based dimension selection works effectively for downstream tasks with contexts.

Furthermore, as illustrated in Table 4, the performance of MPD changes with the post-processing method used. Except for the 'WOZ' task, MPD(CN) usually performs on a par with MPD(ABTT), indicating that the choice of post-processing function does not greatly affect the MPD-based dimension selection.

6.3 Efficiency-performance Trade-off

In this part, we compare the efficiency-performance trade-off of grid search and the three dimension selection methods (PIP-based, PCA-based, and MPD-based) using Word2Vec, where efficiency is measured by computation time and performance is evaluated by the average per-task performance. Ideally, a method should perform well while remaining computationally efficient. The efficiency of grid search depends on the grid granularity, because a coarser grid search saves time at the cost of performance loss. Thus, in this experiment, we set the dimension range to $[50,1000]$ and use three grid searches with increments of 2, 5, and 10, denoted as 'GS-2', 'GS-5', and 'GS-10', respectively. The efficiency-performance results are listed in Table 5.

In Table 5, among the three grid searches, 'GS-10' performs most closely to MPD but takes 14.01x the time. Although 'GS-2' and 'GS-5' achieve slightly better performances than MPD, they take 56.04x and 28.02x the computational cost, which is a poor efficiency-performance trade-off. On the other hand, compared to PCA, MPD achieves a 1.7% performance improvement while costing less than a tenth of PCA's computation time (Table 5). Compared to PIP, MPD has a similar computation cost but achieves a 1.3% performance increase. Therefore, the MPD-based method achieves the best efficiency-performance trade-off.

Criterion   Performance   Computation time
GS-2        59.5          56.04x
GS-5        59.3          28.02x
GS-10       59.0          14.01x
PIP         57.8          0.98x
PCA         57.3          10.85x
MPD         59.1          1.00x
Table 5: Comparison of efficiency-performance trade-offs. The whole MPD-based dimension selection procedure takes 6.3 min on one RTX2080 device. For example, '10.85x' means 10.85 times the computational cost of MPD.
            WS353                                  MT771
            MPD    Prim-D  Post-D  Optimal         MPD    Prim-D  Post-D  Optimal
GloVe       0.66   0.62    0.65    0.69            0.56   0.55    0.55    0.58
Word2Vec    0.65   0.60    0.64    0.68            0.55   0.54    0.55    0.58
Table 6: Ablation study on the context-unavailable tasks using GloVe and Word2Vec. Bold numbers in the original table indicate the best performance except 'Optimal'.
Dataset   ABTT                                     CN
          MPD    Prim-D  Post-D  Optimal           MPD    Prim-D  Post-D  Optimal
MR        71.3   70.3    70.8    72.0              72.0   70.5    70.2    72.5
SST-2     90.0   89.2    89.6    90.2              89.8   89.1    88.9    90.2
SST-5     39.4   38.5    38.9    39.6              39.2   38.4    38.6    39.7
CoLA      15.6   15.0    15.4    15.7              15.6   14.9    15.1    15.8
WOZ       86.1   84.5    85.5    89.8              90.2   88.2    88.0    92.4
MRPC      81.2   80.4    80.9    81.4              81.3   80.3    80.5    81.5
QQP       62.4   61.5    62.0    62.6              62.7   62.0    61.9    62.9
STS-B     60.0   59.2    59.8    60.2              60.2   59.6    59.7    60.4
MNLI      71.5   69.3    70.6    71.6              71.4   69.6    69.9    71.6
Table 7: Ablation analysis on the context-available tasks using GloVe.

7 Ablation

In the ablation studies, we principally answer two questions: (1) Is the distance combination in MPD effective? (2) Is the post-processing function effective in the case of extreme word frequency?

7.1 Is the distance combination in MPD effective?

To further explore the effectiveness of combining the two distances (the primitive relative distance $d_{r}$ and the post relative distance $d_{p}$) in MPD, we add two extra dimension selection criteria: (1) Prim-D, which uses $d_{r}$ alone; (2) Post-D, which uses $d_{p}$ alone. Then, we compare the performances of the three dimension selection methods: MPD-based, Prim-D-based, and Post-D-based.

Table 6 and Table 7 show the performance of the different selection methods on the context-unavailable and context-available tasks, respectively. As we can see, MPD obtains better results than the other two criteria, validating the effectiveness of MPD, which combines $d_{r}$ and $d_{p}$ through a geometric mean. The findings indicate that either completely preserving or completely eliminating the impact of word frequency is harmful to word embeddings, so a neutralized combination is appropriate. Although there may be other combinations of $d_{p}$ and $d_{r}$ with better performance, the choice of combination method is not the focus of this paper.

7.2 Is the post-processing function effective in the case of extreme word frequency?

For the training corpus with extreme word frequency, the performances of the different dimension selection methods are shown in Table 8. As we can see, the methods with post-processing functions (i.e., MPD and Post-D) perform better than or on a par with those without post-processing functions (i.e., PIP, PCA, and Prim-D), demonstrating that the post-processing functions are effective in the case of extreme word frequency.

Dataset      Criterion          0      3%     5%     10%
SST-2        $k^{+}$ (Optimal)  81.9   81.4   81.5   81.8
             $k^{*}$ (PIP)      80.0   79.5   79.2   78.7
             $k^{*}$ (PCA)      79.8   79.5   79.0   78.5
             $k^{*}$ (Prim-D)   80.2   80.4   80.0   80.9
             $k^{*}$ (Post-D)   80.6   80.9   79.9   80.5
             $k^{*}$ (MPD)      81.6   81.0   81.2   81.3
WordSim353   $k^{+}$ (Optimal)  69.1   69.2   69.1   69.1
             $k^{*}$ (PIP)      65.5   64.4   63.9   63.5
             $k^{*}$ (PCA)      64.4   64.0   63.5   63.2
             $k^{*}$ (Prim-D)   62.3   62.2   61.2   61.0
             $k^{*}$ (Post-D)   65.1   65.1   65.0   64.9
             $k^{*}$ (MPD)      66.9   66.6   66.5   66.5
Table 8: Ablation analysis on the two tasks with extreme word frequency. The experimental settings are the same as in Table 1.

8 Conclusion

In this paper, we investigate an overlooked problem, namely whether word frequency influences automatic dimension selection for static word embedding algorithms, and find that extreme word frequency can degrade existing dimension selection methods. Based on this empirical finding, we propose MPD, a novel metric that combines the primitive relative distance $d_{r}$ and the post relative distance $d_{p}$. Through $d_{p}$, post-processing functions are introduced into the dimension selection. We then develop an MPD-based dimension selection method that automatically selects a proper dimension for static embedding algorithms. Extensive experiments on context-unavailable and context-available tasks show that our MPD-based dimension selection achieves better performance than baselines with a good efficiency-performance trade-off.

References

  • Artetxe et al. (2016) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294.
  • Bruni et al. (2014) Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of artificial intelligence research, 49:1–47.
  • Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation.
  • Chung and Glass (2018) Yu-An Chung and James Glass. 2018. Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. arXiv preprint arXiv:1803.08976.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dolan and Brockett (2005) Bill Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005).
  • Finkelstein et al. (2001) Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414.
  • Gong et al. (2018) Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2018. Frage: Frequency-agnostic word representation. Advances in neural information processing systems, 31.
  • Guo et al. (2019) Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. 2019. Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3723–3730.
  • Halawi et al. (2012) Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. 2012. Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1406–1414.
  • Han et al. (2020) Jialong Han, Aixin Sun, Haisong Zhang, Chenliang Li, and Shuming Shi. 2020. Case: Context-aware semantic expansion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7871–7878.
  • Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pages 2177–2185.
  • Levy et al. (2015) Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.
  • Liu et al. (2021) Lemao Liu, Haisong Zhang, Haiyun Jiang, Yangming Li, Enbo Zhao, Kun Xu, Linfeng Song, Suncong Zheng, Botong Zhou, Dick Zhu, et al. 2021. Texsmart: A system for enhanced natural language understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 1–10.
  • Liu et al. (2019) Tianlin Liu, Lyle Ungar, and Joao Sedoc. 2019. Unsupervised post-processing of word vectors via conceptor negation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6778–6785.
  • Luong et al. (2013) Minh-Thang Luong, Richard Socher, and Christopher D Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the seventeenth conference on computational natural language learning, pages 104–113.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Miller and Charles (1991) George A Miller and Walter G Charles. 1991. Contextual correlates of semantic similarity. Language and cognitive processes, 6(1):1–28.
  • Mrksic and Vulic (2018) Nikola Mrksic and Ivan Vulic. 2018. Fully statistical neural belief tracking. In ACL (2).
  • Mu and Viswanath (2018) Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations.
  • Palangi et al. (2016) Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. 2016. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(4):694–707.
  • Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 115–124.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
  • Rubenstein and Goodenough (1965) Herbert Rubenstein and John B Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.
  • Schnabel et al. (2015) Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 298–307.
  • Smith et al. (2017) Samuel L Smith, David HP Turban, Steven Hamblin, and Nils Y Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.
  • Socher et al. (2013) Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pages 926–934. Citeseer.
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.
  • Wang (2019) Yu Wang. 2019. Single training dimension selection for word embedding with pca. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3597–3602.
  • Warstadt et al. (2019) Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.
  • Wen et al. (2017) Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gasic, Lina M Rojas Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449.
  • Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122.
  • Yin and Shen (2018) Zi Yin and Yuanyuan Shen. 2018. On the dimensionality of word embedding. In Advances in Neural Information Processing Systems, pages 887–898.
  • Zhang et al. (2019) Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 52(1):1–38.

Appendix A Proof of Theorem 1

Proof 1

Let $A=UDV^{T}$ and $B=X\Lambda Y^{T}$ be the SVDs of $A$ and $B$, where $D=\operatorname{diag}(d_{i})$, $\Lambda=\operatorname{diag}(\lambda_{j})$, and $d_{i}$ and $\lambda_{j}$ are the singular values of $A$ and $B$, respectively. If we can prove $U\approx X$ and $D\approx\Lambda$, then $B\approx X\Lambda Y^{T}\approx UDY^{T}=UDV^{T}(VY^{T})=AT$, where $T=VY^{T}$ is a unitary matrix, which completes the proof. The argument proceeds as follows.

Firstly, the primitive relative distance $d_{r}$ defined in Eq. 2 is reformulated as Eq. 4:

$d_{r}(A,B)=\dfrac{\|AA^{T}-BB^{T}\|^{2}}{C}=\dfrac{\|UD^{2}U^{T}-X\Lambda^{2}X^{T}\|^{2}}{C}$   (4)

where $C=\|AA^{T}\|\,\|BB^{T}\|$ is the scaling term. If $d_{r}(A,B)\approx 0$, then $\|AA^{T}-BB^{T}\|\approx 0$.

Secondly, let $x_{1}$ be the first column of $X$ (i.e., the singular vector corresponding to the largest singular value $\lambda_{1}$), and suppose $\lambda_{1}\geq d_{1}$. According to Eqs. 5-7, $0\leq\lambda_{1}^{2}-d_{1}^{2}\leq\|AA^{T}-BB^{T}\|\approx 0$. Then, $d_{1}\approx\lambda_{1}$ and $u_{1}\approx x_{1}$.

$\left\|BB^{T}x_{1}\right\|-\left\|AA^{T}x_{1}\right\|\leq\left\|\left(AA^{T}-BB^{T}\right)x_{1}\right\|\leq\left\|AA^{T}-BB^{T}\right\|_{op}\leq\left\|AA^{T}-BB^{T}\right\|$   (5)

where $AA^{T}-BB^{T}$ is considered as an operator.

$\left\|BB^{T}x_{1}\right\|=\left\|X\Lambda^{2}X^{T}x_{1}\right\|=\lambda_{1}^{2}$   (6)
$\left\|AA^{T}x_{1}\right\|=\left\|UD^{2}U^{T}x_{1}\right\|\leq d_{1}^{2}$   (7)

The same argument can be applied to the other singular vectors of $X$ to recover $U$ and $D$. Because $U\approx X$ and $D\approx\Lambda$, the unitary matrix $T=VY^{T}$ is obtained, which completes the proof.

Appendix B Limitation

Since our MPD is specific to static word embeddings, it cannot be applied to pre-trained language models like BERT. Notice that BERT's 'dimension' refers to the hidden state size (768 in the base model). To investigate the relationship between dimension and BERT's performance, BERT would need to be re-trained whenever the hidden state size changes, which is extremely difficult since re-training BERT requires massive computational resources, whereas re-training GloVe/Word2Vec can be done easily. Besides, there are differences among NLP tasks, but our dimension selection method lacks task-specific designs, which we will explore in the future.