Fast Extraction of Word Embedding from Q-contexts
Abstract.
The notion of word embedding plays a fundamental role in natural language processing (NLP). However, pre-training word embeddings for very large-scale vocabularies is computationally challenging for most existing methods. In this work, we show that with merely a small fraction of contexts (Q-contexts) that are typical in the whole corpus (and their mutual information with words), one can construct high-quality word embeddings with negligible errors. Mutual information between contexts and words can be encoded canonically as a sampling state, so Q-contexts can be constructed quickly. Furthermore, we present an efficient and effective WEQ method, which is capable of extracting word embeddings directly from these typical contexts. In practical scenarios, our algorithm runs 11–13 times faster than well-established methods. By comparing with well-known methods such as matrix factorization, word2vec, GloVe and fasttext, we demonstrate that our method achieves comparable performance on a variety of downstream NLP tasks, while maintaining run-time and resource advantages over all these baselines.
1. Introduction
Word embedding plays a fundamental role in the development and real-world applications of natural language processing (NLP). It efficiently provides meaningful representations of individual words in a continuous space, allowing smooth integration with machine learning models in various downstream NLP tasks (Maruf and Haffari, 2018; Khabiri et al., 2019). The notion of word embedding is also the predecessor of deep contextualization models, including the recent powerful pre-trained contextual embedding models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019).
High-quality word embeddings can help boost the performance of many machine learning models on NLP tasks. Recent work on word embedding can be categorized into two genres, i.e., neural network based methods (Mikolov et al., 2013b; Bojanowski et al., 2017; Pennington et al., 2014) and global matrix factorization based methods (Deerwester et al., 1990; Levy and Goldberg, 2014; Arora et al., 2016). Word2vec, GloVe and fasttext are the most popular neural network based methods. Most existing methods focus on improving the quality of word embeddings. However, it is computationally expensive to obtain such embeddings: it takes several days and, typically, around a hundred CPU cores to attain decent-quality representations of words (Mikolov et al., 2013b; Mnih and Kavukcuoglu, 2013; Pennington et al., 2014; Mikolov et al., 2013a). Global matrix factorization based methods for generating word embeddings have roots stretching as far back as LSA (Deerwester et al., 1990). These methods utilize low-rank approximations to decompose large matrices that capture statistical information about a corpus. Previous work has shown that both the word2vec and GloVe methods can be viewed as implicit factorization of special information matrices (Levy and Goldberg, 2014; Arora et al., 2016). Although global matrix factorization based methods are more efficient than neural network based methods, they still need to factorize a large $n \times n$ information matrix for a large-scale vocabulary, where $n$ is the size of the vocabulary. This makes direct factorization highly expensive for large-scale word embedding learning.
To address the efficiency limitations of existing work, we study word embedding learning for large-scale vocabularies with the goals of efficiency and theoretical guarantees. Recent literature has shown that a quantum perspective can provide advantages for classical machine learning (Rebentrost et al., 2014; Zeng and Coecke, 2016). Coecke et al. have previously demonstrated a potential quantum advantage for NLP in various ways, including algorithmic speed-ups for search-related and classification tasks (Coecke et al., 2020). By mimicking how word meanings are encoded in quantum states, we design algorithms, implemented on classical computers, that speed up the word embedding learning problem. The main idea is to construct a small, typical information matrix that is a good approximation of the original information matrix; both the construction and the factorization of this small matrix are cheap. With this design, we are able to demonstrate a running-time advantage on large-scale word embedding problems while maintaining accuracy for various downstream NLP tasks.
We reveal a simple relation between the word vector for a target word $w$ and what we call Q-contexts. Q-contexts are a small fraction of contexts that are typical of the whole corpus, capable of capturing the most important information encoded in the information matrix $M$: they are certain rows of $M$ chosen to represent the original information matrix. The word vector of $w$ is shown to be a combination of its interactions with these contextual environments
$$\vec{w} \;=\; \sum_{t=1}^{q} M_{c_t, w}\, x_{c_t}, \qquad (1)$$
where $M_{c,w}$ is the entry in the information matrix indexed by context $c$ and word $w$, and $x_c$ is a constant vector for a context $c$ to be determined (for a more detailed description see Section 3.1). The information matrix can be naturally encoded as a sampling state, enabling a fast construction of Q-contexts. To the best of our knowledge, extracting meaningful word embeddings from their mutual information with only a few contexts has not been studied before.
Based on these observations, we develop the WEQ method, which substantially accelerates word embedding learning: our method is at least 11 times faster than well-established methods and has fewer resource requirements on the CC corpus. We show empirically that WEQ achieves performance comparable to well-known methods such as direct matrix factorization, word2vec, GloVe and fasttext (Mikolov et al., 2013b; Pennington et al., 2014; Levy and Goldberg, 2014; Bojanowski et al., 2017), and maintains high accuracy in various downstream NLP tasks. Further, WEQ's efficiency and effectiveness are theoretically backed up: the small Q-contexts matrix approximates the original information matrix with negligible error, so the learned embedding retains its representation power.
The organization of the rest of this article is as follows. In Section 2, we recall a general matrix factorization perspective on well-known word embedding methods. In Section 3, we introduce the WEQ method. In Section 4, we analyze the approximation error of the Q-contexts with theoretical proof. In Section 5, we conduct a comprehensive set of experiments demonstrating the accuracy and efficiency of our method. In Section 6, we review the related work of word embedding. Finally, we give a conclusion in Section 7.
Notation | Description
---|---
$\mathcal{D}$ | the multiset of context-word pairs
$w$ | the target word
$c$ | context word around the target word
$\#(w,c)$ | the number of co-occurrences of $w$ and $c$ in $\mathcal{D}$
$\#(w)$ | the number of times $w$ appears in $\mathcal{D}$
$\#(c)$ | the number of times $c$ appears in $\mathcal{D}$
$M_{c,w}$ | entry in the information matrix indexed by $c$ and $w$
$\vec{w}$ | word (row) vector for the target word $w$
$\vec{c}$ | context (row) vector for context $c$
$\|v\|$ | the $\ell^2$-norm of a vector $v$
$M$ | matrix
$\|M\|_F$ | the Hilbert-Schmidt norm of $M$
$\|M\|$ | the operator norm of matrix $M$
$M_i$ | the $i$-th row of $M$
$M^{(j)}$ | the $j$-th column of $M$
$M^{T}$ | the transpose of $M$
$\mathrm{nnz}(M)$ | the number of nonzeros in $M$
$\tilde{M}$ | Q-contexts matrix
$S$ | the normalized version of $\tilde{M}$
2. Preliminaries
Commonly, word embeddings are learned by capturing the semantic relationship between word-context pairs $(w, c)$. For a target word $w$, its context words $c$ are obtained from the neighborhoods centered around the locations where $w$ appears in the corpus. Previously established results show that the factorization of information matrices provides a unified framework for many important existing word embedding algorithms, including word2vec, GloVe, PMI, and NCE (Arora et al., 2016; Pennington et al., 2014; Mnih and Kavukcuoglu, 2013; Levy and Goldberg, 2014). The main difference between these methods lies in the choice of the mutual information matrix between contexts and words.
As indicated in (Church and Hanks, 1989; Arora et al., 2016; Dagan et al., 1994; Turney and Pantel, 2010), factorizing the point-wise mutual information (PMI) matrix
$$\mathrm{PMI}(w,c) = \log\frac{\#(w,c)\,|\mathcal{D}|}{\#(w)\,\#(c)}$$
yields effective word representations, where $\mathcal{D}$ is the multiset of context-word pairs, $\#(w,c)$ is the number of co-occurrences of context word $c$ and target word $w$ in $\mathcal{D}$, and $\#(w)$ and $\#(c)$ are the numbers of times $w$ and $c$ appear in $\mathcal{D}$, respectively. It is shown in (Levy and Goldberg, 2014) that word vectors from word2vec can be obtained from the factorization of a shifted version of PMI, $\mathrm{PMI}(w,c) - \log k$, where $k$ denotes the number of negative samples. They also show that the NCE model (Mnih and Kavukcuoglu, 2013) is in fact factorizing $\log\frac{\#(w,c)}{\#(c)} - \log k$. To improve performance, a positive version of PMI (PPMI) and a shifted version of PPMI (SPPMI with shift parameter $k$) are proposed, where
$$\mathrm{PPMI}(w,c) = \max\big(\mathrm{PMI}(w,c),\, 0\big), \qquad \mathrm{SPPMI}_k(w,c) = \max\big(\mathrm{PMI}(w,c) - \log k,\, 0\big).$$
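As a concrete sanity check (ours, not from the paper), the toy Python snippet below evaluates these quantities for a single made-up $(w, c)$ pair; the counts are purely illustrative.

```python
import math

# Toy check of the PMI-family scores on one (word, context) pair.
# Counts are made up for illustration: #(w,c)=80, #(w)=1000, #(c)=2000, |D|=100000.
count_wc, count_w, count_c, D = 80, 1000, 2000, 100_000

pmi = math.log(count_wc * D / (count_w * count_c))    # PMI(w, c) = log 4 ~= 1.386
ppmi = max(pmi, 0.0)                                  # positive PMI
sppmi = max(pmi - math.log(5), 0.0)                   # shifted PPMI with k = 5

print(round(pmi, 3), round(ppmi, 3), round(sppmi, 3))  # 1.386 1.386 0.0
```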
It is shown in (Arora et al., 2016) that the GloVe objective is in fact optimizing (modulo some error term)
$$\langle \vec{c}, \vec{w}\rangle + b_c + b_w \approx \log \#(w,c),$$
where $\vec{c}$ is the context vector for context $c$, $\vec{w}$ is the word vector for the target word $w$, and $b_c, b_w$ are bias terms. In their theory, the authors also show that for some constant $Z$,
$$\log p(w,c) \approx \frac{\|\vec{w}+\vec{c}\|^{2}}{2d} - 2\log Z, \qquad \log p(w) \approx \frac{\|\vec{w}\|^{2}}{2d} - \log Z.$$
Since $\mathrm{PMI}(w,c) = \log p(w,c) - \log p(w) - \log p(c)$, we conclude that $\langle \vec{c}, \vec{w}\rangle \approx d\cdot \mathrm{PMI}(w,c)$, i.e., GloVe vectors also implicitly factorize a (scaled) PMI matrix.
Factorization of mutual information matrices thus constitutes a unified framework for these word embedding algorithms: find word and context vectors such that $\langle \vec{c}, \vec{w}\rangle \approx M_{c,w}$. We list the necessary notation and descriptions in Table 1.
3. Method
In this section, we present the WEQ method, an efficient and effective method for the large-scale word embedding learning problem. WEQ constructs and factorizes a small, typical information matrix that approximates the original information matrix. The method is composed of three steps, as illustrated in Fig. 1. First, it calculates the information matrix $M$ from the co-occurrence matrix. Second, it constructs the small typical information matrix (Q-contexts) from the original information matrix through $\ell^2$-norm sampling. Third, it conducts a singular value decomposition of the Q-contexts matrix to obtain the word embedding.
3.1. Q-contexts Definition
We first introduce $\ell^2$-norm sampling and then give the definition of Q-contexts.
$\ell^2$-norm sampling: The $\ell^2$-norm sampling technique has well exhibited its effectiveness in machine learning (Hazan et al., 2011; Song et al., 2016) and randomized linear algebra (Drineas et al., 2008). In fact, the work by Frieze, Kannan, and Vempala (Frieze et al., 2004) shows that under certain $\ell^2$-norm sampling assumptions, a form of singular value estimation can be achieved in time independent of the size of the input matrix. Further, the work by Tang (Tang, 2019) shows that sampling from the projection of a vector onto a subspace is not outside the realm of feasibility.
Inspired by these results, we leverage $\ell^2$-norm sampling to construct a small typical information matrix and thereby solve the large-scale word embedding learning problem. We now elaborate on the proposed relation between Q-contexts and words in Equation (1).
Given an information matrix $M$, we first encode it into an $\ell^2$-norm state, which is described in detail in Section 3.2. Once the mutual information is prepared as a state, each $\ell^2$-norm sample yields a context $c$ with probability $p_c = \|M_c\|^{2}/\|M\|_F^{2}$. The collection of the corresponding row vectors is what we call Q-contexts. In our proposed scheme, $p_c$ is designed to correctly reflect the amount of information carried by the context $c$, so that contexts with more information are more likely to be sampled.
Definition 3.1.
A Q-contexts matrix $\tilde{M} \in \mathbb{R}^{q \times n}$ is the collection of $q$ sampled rows of the information matrix $M$:
$$\tilde{M} = \big(M_{c_1}^{T}, M_{c_2}^{T}, \ldots, M_{c_q}^{T}\big)^{T},$$
where context $c_t$ is the $t$-th sampling outcome from the information matrix $M$.
The central idea in Equation (1) is that a few $\ell^2$-norm samples of the information matrix provide sufficient information, so that a good word embedding can be obtained from a linear combination of its mutual information with the Q-contexts,
$$\vec{w} = \sum_{t=1}^{q} M_{c_t, w}\, x_{c_t},$$
where the context coefficient vectors $x_{c_t}$ are determined in Section 3.2.
3.2. Word Embedding from Q-contexts (WEQ)
We now give the full algorithm (Algorithm 2) for computing word embeddings based on the Q-contexts matrix $\tilde{M}$. The WEQ method consists of three steps: information matrix calculation, Q-contexts construction, and calculating the word embedding from Q-contexts.
Step 1: Information matrix calculation. As discussed in Section 2, there are different kinds of information matrices. Here, we choose the PPMI and SPPMI matrices proposed in (Levy and Goldberg, 2014), which shows that exactly factorizing the PPMI/SPPMI matrix with SVD yields solutions at least as good as SGNS's. Given a co-occurrence matrix, we construct the PPMI information matrix as
$$\mathrm{PPMI}(w,c) = \max\!\left(\log\frac{\#(w,c)\,|\mathcal{D}|}{\#(w)\,\#(c)},\; 0\right),$$
and a shifted version of PPMI (SPPMI with shift parameter $k$)
$$\mathrm{SPPMI}_k(w,c) = \max\!\left(\log\frac{\#(w,c)\,|\mathcal{D}|}{\#(w)\,\#(c)} - \log k,\; 0\right).$$
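For concreteness, a minimal sketch of Step 1 is given below. It assumes the co-occurrence counts are available as a SciPy sparse matrix and that the marginals $\#(c)$, $\#(w)$, $|\mathcal{D}|$ are taken from the matrix itself; the function name and interface are ours, not the paper's released code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def ppmi_matrix(cooc: csr_matrix, shift_k: float = 1.0) -> csr_matrix:
    """Build a (S)PPMI information matrix from a co-occurrence matrix.

    cooc[c, w] = #(c, w); shift_k = 1 gives plain PPMI, shift_k > 1 gives SPPMI.
    Only nonzero co-occurrence entries are touched, so the result stays sparse.
    """
    cooc = cooc.tocoo()
    total = cooc.data.sum()                                   # |D|
    context_counts = np.asarray(cooc.sum(axis=1)).ravel()     # #(c)
    word_counts = np.asarray(cooc.sum(axis=0)).ravel()        # #(w)

    # PMI(c, w) = log( #(c,w) * |D| / (#(c) * #(w)) ), on nonzero entries only
    pmi = np.log(cooc.data * total
                 / (context_counts[cooc.row] * word_counts[cooc.col]))
    pmi -= np.log(shift_k)                                    # shift for SPPMI
    pmi = np.maximum(pmi, 0.0)                                # keep the positive part

    M = csr_matrix((pmi, (cooc.row, cooc.col)), shape=cooc.shape)
    M.eliminate_zeros()
    return M
```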
Step 2: Q-contexts construction. As in Definition 3.1, we encode the information matrix $M$ into an $\ell^2$-norm state, which admits a fast construction of the Q-contexts matrix $\tilde{M}$.
In our method, we propose a data structure that achieves fast $\ell^2$-norm sampling in practice. For the information matrix $M$, we store the cumulative summation of its squared row norms $\|M_i\|^{2}$. To perform an $\ell^2$-norm sample, we first generate a random number $u$ uniformly from $[0, \|M\|_F^{2})$ and perform an efficient binary search to find the leftmost index $i$ such that $u$ is less than or equal to the $i$-th cumulative sum. Repeating the $\ell^2$-norm sampling $q$ times yields the small Q-contexts matrix $\tilde{M}$. We normalize each of its rows to get $S$.
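The sketch below illustrates this construction under our reading of the method: cumulative squared row norms, one binary search per sample (via `np.searchsorted`), and a $1/\sqrt{q\,p_{c_t}}$ rescaling of each sampled row, which is the standard length-squared-sampling normalization and matches the matrix $S$ used in Theorem 4.2. The names and the SciPy-based interface are our assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

def sample_q_contexts(M: csr_matrix, q: int, rng=None):
    """l2-norm (length-squared) row sampling of the information matrix M.

    Returns the sampled row indices and the normalized Q-contexts matrix S,
    whose rows are M[c_t] / sqrt(q * p_{c_t}) so that S.T @ S estimates M.T @ M.
    Assumes M has no all-zero rows of interest (zero rows have probability 0).
    """
    rng = np.random.default_rng(rng)
    row_sq_norms = np.asarray(M.multiply(M).sum(axis=1)).ravel()  # ||M_i||^2
    cumsum = np.cumsum(row_sq_norms)                  # cumulative summation
    frob_sq = cumsum[-1]                              # ||M||_F^2

    # draw q uniform numbers in [0, ||M||_F^2) and binary-search the row index
    u = rng.uniform(0.0, frob_sq, size=q)
    rows = np.searchsorted(cumsum, u, side='left')    # leftmost index with u <= cumsum

    p = row_sq_norms[rows] / frob_sq                  # sampling probabilities p_{c_t}
    scale = 1.0 / np.sqrt(q * p)
    S = diags(scale) @ M[rows]                        # normalized Q-contexts matrix
    return rows, S
```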
In addition, we can use column sampling to reduce the matrix once more. Applying the same theorem and a similar analysis to $S^{T}$, we get an even smaller matrix $S'$ for which, with high probability, $S'S'^{T} \approx SS^{T}$. Then the singular values and left singular vectors of $S'$ approximate those of $S$, and we obtain good approximations of the right singular vectors and singular values of $M$ by simply doing the same calculations for the much smaller matrix $S'$.
Step 3: Calculating word embedding from Q-contexts. It is known (Mnih and Kavukcuoglu, 2013) that taking the embedding matrix to be of the form $W = V\Sigma^{1/2}$ can be beneficial for predictive performance, where $V$ and $\Sigma$ are the right singular vectors and singular values of the information matrix $M$. In view of Theorem 4.2 below, with high probability $S^{T}S \approx M^{T}M$, which implies that $V\Sigma^{1/2} \approx V_S\Sigma_S^{1/2}$, where $V_S$, $\Sigma_S$ and $U_S$ are the right singular vectors, singular values and left singular vectors of $S$. Since $U_S$ is a (partial-)isometry, we obtain the matrix form of Equation (1):
$$W = S^{T}X, \qquad (2)$$
where $X = U_S\Sigma_S^{-1/2}$ is the matrix of context coefficients.
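A sketch of Step 3, assuming SciPy's sparse truncated SVD; the square-root weighting of the singular values follows the symmetric convention described above (computing $V_S\Sigma_S^{1/2}$ directly is equivalent to $S^{T}U_S\Sigma_S^{-1/2}$), and the function name is ours.

```python
import numpy as np
from scipy.sparse.linalg import svds

def embedding_from_q_contexts(S, dim: int = 300) -> np.ndarray:
    """Step 3 of WEQ: word embedding from the normalized Q-contexts matrix S.

    S is q x n (sampled contexts x words); its top singular values and right
    singular vectors approximate those of the full information matrix M.
    Requires dim < min(S.shape).
    """
    # svds returns singular values in ascending order; reorder to descending
    _, sigma, vt = svds(S, k=dim)
    order = np.argsort(sigma)[::-1]
    sigma, vt = sigma[order], vt[order]
    W = vt.T * np.sqrt(sigma)    # n x dim matrix; rows are word vectors (W = V Sigma^1/2)
    return W
```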
Complexity Analysis. We get Algorithm 2 by putting the above procedures together. As for line 1, it requires time proportional to the number of nonzero elements of the co-occurrence matrix, since the point-by-point operations only touch those entries. As for lines 2 and 3, state preparation costs time linear in $\mathrm{nnz}(M)$, and each $\ell^2$-norm sample costs one binary search over the cumulative sums. As for line 4, the remaining time is spent computing singular vectors of the small matrix $S$; in the MF method, one must instead compute singular vectors of the original information matrix $M$, which is far more expensive. As for line 5, the time is spent in the matrix multiplication of Equation (2).
4. Theoretical proof
In this section, we demonstrate that very few $\ell^2$-norm samples of the information matrix suffice to extract high-quality word embeddings.
The idea that the word embedding hidden in the mutual information matrix $M$ can be extracted from a few $\ell^2$-norm samples of the state of $M$ might seem surprising at first glance. The root of this phenomenon is that a Q-context is a randomized object, and its probabilistic behavior has an almost deterministic nature in the sense of the Central Limit Theorem.
In more precise terms, the word representation hides in the symmetric form $M^{T}M = \sum_{c} M_c^{T}M_c$. In view of the random nature of the context $c$, this form can be written as the expectation of a rank-one matrix relating to the context $c$,
$$M^{T}M = \sum_{c} p_c\,\frac{M_c^{T}M_c}{p_c} = \mathbb{E}[X], \qquad (3)$$
where $p_c$ is the probability of getting context $c$ from a measurement, and $c$ runs over the set of contexts. Precisely, $X$ is the random matrix taking value $M_c^{T}M_c/p_c$ with probability $p_c$.
The idea of the Central Limit Theorem and its general form, probabilistic measure concentration, is also valid in a functional analytic (matrix) context, where a scalar-valued random variable is generalized to a matrix-valued one. One of the most celebrated results is the operator/matrix Bernstein concentration inequality (Tropp, 2015).
Theorem 4.1.
Let $X_1, \ldots, X_q$ be a sequence of independent, identically distributed, symmetric, norm-bounded random matrices with mean $\mu$. Then for sufficiently small $\varepsilon > 0$ we have
$$\Pr\!\left[\Big\|\frac{1}{q}\sum_{t=1}^{q} X_t - \mu\Big\| \geq \varepsilon\right] \;\leq\; \mathrm{srank}(\Sigma)\, e^{-\Omega\left(q\varepsilon^{2}/\|\Sigma\|\right)},$$
where $\Sigma$ is the covariance matrix of $X$: $\Sigma = \mathbb{E}[X^{2}] - \mu^{2}$, and $\mathrm{srank}(\cdot)$ denotes the stable rank of a matrix.
Thanks to the matrix concentration Theorem 4.1, we see that, with high probability, the expectation $\mathbb{E}[X] = M^{T}M$ can be well approximated by the average of $q$ samples $X_1, \ldots, X_q$. The average of these samples is
$$\frac{1}{q}\sum_{t=1}^{q}\frac{M_{c_t}^{T} M_{c_t}}{p_{c_t}} \;=\; S^{T}S,$$
where $S$ is the normalized version of the Q-contexts matrix $\tilde{M}$:
$$S_t = \frac{M_{c_t}}{\sqrt{q\, p_{c_t}}}, \qquad t = 1, \ldots, q.$$
Precisely, we get the following theorem:
Theorem 4.2.
Let $\varepsilon > 0$ be sufficiently small; then we have
$$\Pr\!\left[\big\|S^{T}S - M^{T}M\big\| \geq \varepsilon\right] \;\leq\; \mathrm{srank}(\Sigma)\, e^{-\Omega\left(q\varepsilon^{2}/\|\Sigma\|\right)},$$
where $\Sigma$ is the covariance matrix of $X$, with stable rank bounded by the rank of $M$, and
$$\big\|\mathbb{E}[X^{2}]\big\| \;\leq\; \|M\|_F^{2}\,\|M\|^{2} \quad\text{and hence}\quad \|\Sigma\| \;\leq\; \|M\|_F^{2}\,\|M\|^{2}. \qquad (4)$$
Proof.
The probability inequality in the theorem is a direct consequence of Theorem 4.1. We only need to prove inequality (4) and the bound on the stable rank of $\Sigma$ in the theorem.
By direct calculation, we have
$$\mathbb{E}[X^{2}] = \sum_{c} p_c\,\frac{(M_c^{T}M_c)^{2}}{p_c^{2}} = \sum_{c} \frac{\|M_c\|^{2}}{p_c}\, M_c^{T}M_c.$$
Therefore,
$$\big\|\mathbb{E}[X^{2}]\big\| \;\leq\; \Big(\max_{c}\frac{\|M_c\|^{2}}{p_c}\Big)\,\big\|M^{T}M\big\|.$$
The Lagrange multiplier theorem implies that $\max_{c}\|M_c\|^{2}/p_c$ achieves its minimum only when $p_c$ is proportional to $\|M_c\|^{2}$, namely $p_c = \|M_c\|^{2}/\|M\|_F^{2}$. In this case,
$$\big\|\mathbb{E}[X^{2}]\big\| \;\leq\; \|M\|_F^{2}\,\|M\|^{2},$$
which is the first part of inequality (4); the second part follows since $\Sigma = \mathbb{E}[X^{2}] - (M^{T}M)^{2} \preceq \mathbb{E}[X^{2}]$.
The stable rank of $\Sigma$ is bounded by the rank of $M$ as a consequence of inequality (5).
∎
From the proof of Theorem 4.2 (especially the proof of the first part of inequality (4)), we can see that the choice of $p_c$ is motivated by a variance minimization scheme. Such a sampling strategy therefore lets the sample average converge to the matrix mean $M^{T}M$ at a faster rate.
4.1. Error Analysis
We remark that the approximation in Theorem 4.2 is of high quality, both in theory and in practice.
From a mathematical point of view, the approximation error in Theorem 4.2 is small provided that enough rows are sampled from the information matrix $M$. Our analysis is tighter than (Frieze et al., 2004) since we make use of a stronger matrix concentration bound (Theorem 4.1) involving the stable rank and spectral norm of a matrix, instead of its dimension and Frobenius norm. The stable rank is upper bounded by the rank of a matrix, which is more effective in a low-rank regime such as the word embedding problem, and the spectral norm is bounded above by the Frobenius norm. Moreover, the Frobenius norm can fail to capture the genuine behavior of random matrices in even very simple examples (Tropp, 2015). Therefore, our analysis shows that with very few sampled rows, $S^{T}S$ can be a good approximation of $M^{T}M$ with negligible error.
These theoretical observations are also consistent with our experiments. Indeed, in Section 5.5, we will see that with information matrix dimensions varying from 50k to 400k, the number of row samples required to achieve a good approximation is relatively stable (in our tests it ranges from 40k to 50k), while less than 2% of accuracy is lost compared to direct calculation with the original information matrix.
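The following small numerical experiment (ours, on a synthetic low-rank matrix rather than a real information matrix) illustrates the concentration behind Theorem 4.2: the relative operator-norm error of the length-squared row sketch shrinks as the number of sampled rows $q$ grows.

```python
import numpy as np

def concentration_check(M: np.ndarray, q: int, rng=None) -> float:
    """Return the relative operator-norm error ||S^T S - M^T M|| / ||M^T M||."""
    rng = np.random.default_rng(rng)
    row_sq = np.sum(M * M, axis=1)
    p = row_sq / row_sq.sum()                      # p_c = ||M_c||^2 / ||M||_F^2
    rows = rng.choice(len(p), size=q, p=p)
    S = M[rows] / np.sqrt(q * p[rows])[:, None]    # normalized Q-contexts
    err = np.linalg.norm(S.T @ S - M.T @ M, 2)     # spectral norm of the deviation
    return err / np.linalg.norm(M.T @ M, 2)

# toy low-rank "information matrix": the error shrinks roughly like 1/sqrt(q)
rng = np.random.default_rng(0)
M = rng.standard_normal((2000, 30)) @ rng.standard_normal((30, 400))
for q in (50, 200, 800):
    print(q, round(concentration_check(M, q, rng), 3))
```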
5. Experiments
In this section, we evaluate the proposed WEQ method on three different evaluation tasks: Word Similarity, Text Classification and Named Entity Recognition (NER), which have been commonly used to evaluate previous word embedding methods (Zhang et al., 2019; Levy and Goldberg, 2014). We introduce our training corpora and baselines in Section 5.1. We describe the details of implementation and evaluation tasks in Section 5.2 and Section 5.3. We report experimental results and ablation study in Section 5.4 and Section 5.5, respectively.
 | enwik9 | WebBase | CC
---|---|---|---
#tokens | 124,301,826 | 2,976,897,565 | 6,065,531,635
#words | 833,184 | 3,107,950 | 10,558,748
#pairs | 42,234,884 | 930,896,198 | 1,861,679,208
5.1. Training Corpora and Baselines
Training corpora. We conduct our experiments on three large-scale English corpora: enwik9 with about 0.1 billion tokens, WebBase (Han et al., 2013) with about 3 billion tokens, and a filtered Common Crawl corpus (CC) (Ortiz Suárez et al., 2020) with about 6 billion tokens. We first lowercase the text and then remove noisy text such as HTML spans. Data pre-processing mainly follows the script used in word2vec (http://mattmahoney.net/dc/textdata.html). After pre-processing, the vocabulary sizes of the three corpora are 56,466 (enwik9), 277,704 (WebBase) and 400,000 (CC), respectively. We prepare co-occurrence matrices with a context window size of 10. The co-occurrence matrices are square symmetric matrices whose row sizes equal the vocabulary sizes. Our training corpora are publicly available (enwik9: http://mattmahoney.net/dc/enwik9.zip, WebBase: http://ebiquity.umbc.edu/redirect/to/resource/id/351/UMBC-webbase-corpus, CC: https://oscar-public.huma-num.fr/shuff-dedup/en/). Note that the original CC is a super-scale corpus containing over 418B tokens; to facilitate experiments, we take 6B tokens. The statistics of the three raw corpora are listed in Table 2.
Baselines. In our experiments, we mainly compare our method WEQ with popular benchmarks listed below.
- MF (Levy and Goldberg, 2014) is a global matrix factorization method that uncovers the semantic information implied in the matrix, such as the PPMI and SPPMI matrices.
- word2vec (Mikolov et al., 2013b; https://code.google.com/archive/p/word2vec/source/default/source) is a two-layer neural network that is trained to reconstruct linguistic contexts of words. We choose the SGNS architecture in our experiments.
- GloVe (Pennington et al., 2014; https://github.com/stanfordnlp/GloVe) is a typical co-occurrence-count-based neural method, which captures global information from the word-word co-occurrence matrix of the training corpus.
- fasttext (Bojanowski et al., 2017; https://github.com/facebookresearch/fastText) is a character-based method which represents each word by its character n-grams.
5.2. Implementation Details
In our experiments, we use the co-occurrence code from GloVe (https://github.com/stanfordnlp/GloVe/blob/master/src/cooccur.c) to obtain the co-occurrence matrix. For MF and WEQ, matrices are stored in a list-of-lists (LIL) sparse matrix format to allow efficient row operations. The shift parameter of SPPMI is set to $k=5$ on all training corpora. Sample sizes $q$ for our method WEQ are chosen from the range [10, 100000], as shown in Fig. 2. We factorize the Q-contexts matrix to get singular vectors and singular values. Following the official guides, we use 15 iterations to train the GloVe and word2vec methods and 5 iterations to train the fasttext method. The dimension of the embeddings is set to 300; note that in the ablation study of sample size we set the embedding dimension to the sample size when $q$ is smaller than 300. All experiments are carried out on a cloud server with Intel(R) Xeon(R) Gold 6231C CPUs. We set the number of CPU cores to 20, 20, 10, 1 and 1 for fasttext, word2vec, GloVe, MF and WEQ, respectively.
(MEN and WS353 are word similarity tasks; SST-2 and 5AG are text classification; Conll and BTC are NER. Training time is reported as minutes × CPU cores (Cc); ± values are variations over 5 WEQ runs.)
Dataset | Method | Time / mins | MEN | WS353 | SST-2 | 5AG | Conll | BTC
---|---|---|---|---|---|---|---|---
enwik9 (0.1B) | MF-PPMI | 27.4 × 1 Cc | 74.24 | 70.31 | 78.61 | 86.63 | 87.72 | 71.06
 | WEQ-PPMI (q=40000) | 14.9 × 1 Cc | 73.19±0.28 | 69.23±0.41 | 78.73±0.57 | 87.25±0.29 | 87.74±0.11 | 71.16±0.49
 | MF-SPPMI | 13.2 × 1 Cc | 72.26 | 67.14 | 78.54 | 86.19 | 86.13 | 69.88
 | WEQ-SPPMI (q=40000) | 3.9 × 1 Cc | 71.92±0.58 | 66.52±1.60 | 78.15±0.20 | 87.18±0.30 | 86.34±0.12 | 69.70±0.12
 | GloVe | 44.6 × 10 Cc | 69.45 | 66.31 | 79.26 | 86.51 | 86.78 | 71.00
 | word2vec | 59.9 × 20 Cc | 74.09 | 68.96 | 79.13 | 85.62 | 86.51 | 70.19
 | fasttext | 67.3 × 20 Cc | 73.13 | 65.79 | 80.59 | 86.85 | 87.63 | 70.65
WebBase (3B) | MF-PPMI | 185.4 × 1 Cc | 77.97 | 68.77 | 80.94 | 86.32 | 88.21 | 71.31
 | WEQ-PPMI (q=50000) | 28.4 × 1 Cc | 77.01±0.44 | 68.74±0.84 | 81.59±0.14 | 87.17±0.22 | 88.48±0.17 | 71.37±0.49
 | MF-SPPMI | 67.6 × 1 Cc | 76.84 | 67.62 | 79.13 | 85.34 | 87.07 | 69.31
 | WEQ-SPPMI (q=50000) | 7.3 × 1 Cc | 76.31±0.55 | 67.66±0.55 | 78.98±0.28 | 86.86±0.23 | 87.68±0.12 | 70.44±0.24
 | GloVe | 182.7 × 10 Cc | 75.20 | 63.43 | 82.66 | 86.90 | 87.78 | 71.02
 | word2vec | 1211.4 × 20 Cc | 74.73 | 63.98 | 80.93 | 87.64 | 88.58 | 69.89
 | fasttext | 1843.7 × 20 Cc | 73.10 | 58.80 | 81.07 | 87.74 | 88.93 | 72.83
CC (6B) | MF-PPMI | 406.5 × 1 Cc | 77.72 | 70.87 | 83.99 | 86.85 | 87.48 | 72.55
 | WEQ-PPMI (q=50000) | 31.4 × 1 Cc | 76.39±0.54 | 69.54±0.54 | 82.90±0.74 | 87.69±0.15 | 88.59±0.25 | 72.36±0.36
 | MF-SPPMI | 153.9 × 1 Cc | 75.30 | 69.35 | 80.25 | 85.92 | 87.95 | 70.37
 | WEQ-SPPMI (q=50000) | 13.3 × 1 Cc | 73.92±0.35 | 66.15±0.96 | 80.79±0.66 | 86.36±0.17 | 87.78±0.18 | 70.86±0.56
 | GloVe | 270.5 × 10 Cc | 76.37 | 65.41 | 81.70 | 86.47 | 88.56 | 71.23
 | word2vec | 2369.7 × 20 Cc | 79.29 | 69.90 | 81.57 | 87.14 | 88.65 | 72.93
 | fasttext | 3419.4 × 20 Cc | 79.64 | 68.97 | 82.70 | 87.53 | 89.46 | 73.92
 | random | / | 0.31 | 2.38 | 73.82 | 82.51 | 81.63 | 59.22
 | GloVe (released) | / | 73.75 | 57.26 | 82.79 | 87.34 | 91.15 | 72.26
5.3. Evaluation Tasks
To measure the quality of embeddings, we conduct experiments on three different evaluation tasks: Word Similarity, Text Classification and Named Entity Recognition (NER).
Word similarity. The word similarity task is to measure how well the semantic relationship between words is captured by word embeddings. Experiments are conducted on MEN (Bruni et al., 2014) and WS353 (Agirre et al., 2009) datasets. To estimate the quality of word embeddings, we compute the Spearman Rank Correlation score between the word embedding cosine similarities and human-annotated scores.
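As an illustration of this protocol, a minimal evaluation sketch is shown below; the embedding dictionary and the list of (word, word, score) triples are assumed to be loaded from the MEN or WS353 files beforehand, and pairs with out-of-vocabulary words are skipped.

```python
import numpy as np
from scipy.stats import spearmanr

def word_similarity_score(emb: dict, pairs: list) -> float:
    """Spearman correlation between embedding cosine similarities and human scores.

    emb maps a word to its vector; pairs is a list of (word1, word2, human_score).
    """
    cosines, gold = [], []
    for w1, w2, score in pairs:
        if w1 in emb and w2 in emb:
            v1, v2 = emb[w1], emb[w2]
            cosines.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            gold.append(score)
    return spearmanr(cosines, gold).correlation
```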
Text classification. In the text classification task, the classifier predicts predefined labels for the given texts. We conduct experiments on two datasets, i.e., 5AbstractsGroup (5AG) (Liu et al., 2018) and the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013). We choose TextCNN (Kim, 2014) as our classifier and measure performance by the weighted F1 metric. The kernel sizes are 2, 3 and 4, respectively. We use the Adam optimizer with a mini-batch size of 128 and a learning rate of 0.001.
Named entity recognition. NER is the task of identifying and categorizing the entities in a given text. We conduct experiments on two benchmark datasets: Conll (Tjong Kim Sang and De Meulder, 2003) and BTC (Derczynski et al., 2016). We choose wordLSTM+charCNN+CRF (Ma and Hovy, 2016) as our NER model and measure performance by the entity-level F1 metric. The wordLSTM is built upon a one-layer bidirectional LSTM with a hidden size of 200. We employ the SGD algorithm to optimize model parameters on mini-batches of size 100, with a learning rate of 0.001. In both downstream setups the pre-trained vectors serve to initialize the model's embedding layer, as sketched below.
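The sketch below shows one plausible way to build that initial embedding-layer weight matrix from the pre-trained vectors; the fallback to small random vectors for out-of-vocabulary words is our assumption, not a detail specified in the paper.

```python
import numpy as np

def build_embedding_layer(task_vocab: list, pretrained: dict, dim: int = 300,
                          rng=None) -> np.ndarray:
    """Initialize a downstream model's embedding matrix from pretrained vectors.

    Words in the task vocabulary that are missing from the pretrained embedding
    (or every word, for the 'random' baseline with an empty dict) fall back to
    small random vectors; the matrix is handed to TextCNN / wordLSTM as the
    initial weights of its embedding layer.
    """
    rng = np.random.default_rng(rng)
    weights = rng.uniform(-0.1, 0.1, size=(len(task_vocab), dim))
    for i, word in enumerate(task_vocab):
        if word in pretrained:
            weights[i] = pretrained[word]
    return weights
```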
5.4. Results
Table 3 reports the results on the evaluation tasks and the training times of all word embedding methods. It can be observed that our method WEQ outperforms all baselines on the training time (and resources) metric on all three corpora, while the performance metrics are fairly close.
In terms of the evaluation tasks, we notice that MF and WEQ are superior to GloVe, word2vec and fasttext on most word similarity tasks, while performing slightly worse on the text classification and NER tasks. For both downstream NLP tasks (text classification and NER), models using pre-trained word embeddings show remarkable improvement over the one using random embeddings. Compared with MF, the results obtained with WEQ are fairly close and sometimes even better, despite the fact that we only use a fraction of the original information matrix.
We also observe that the released GloVe embedding is about 3% better on the Conll task than the word embeddings trained by us. One possible reason is a vocabulary mismatch (Ma and Hovy, 2016): we reuse the word2vec pre-processing script, which excludes punctuation and digits, and these are important in the Conll task.
To verify the stability of our method, we run WEQ 5 times for each setting. We only compute the mean and variance for WEQ because the other methods need a long time to train word embeddings. The small variances shown in Table 3 indicate that the word embeddings obtained by WEQ are stable.
With regard to training time, the MF method innately has an advantage over the neural network based methods (word2vec, GloVe and fasttext) under the same computational settings, and our method WEQ improves significantly on top of that. As can be seen from Table 3, WEQ-PPMI only requires 54.38% (on enwik9), 15.32% (on WebBase) and 7.72% (on CC) of the training time of MF-PPMI, respectively. On the CC corpus, GloVe needs 270.5 minutes with 10 CPU cores to obtain word embeddings, while WEQ-PPMI takes only 31.4 minutes with 1 CPU core. Although the training time of WEQ increases with the vocabulary size, the time taken by MF increases by a significantly larger margin.
Note that in the case closest to practice (6B tokens and a 400K vocabulary), we achieve at least an 11-times speed-up compared with other baselines. In more realistic scenarios, where the number of tokens is in the trillions and the vocabulary has several million words, one can infer from the growing trend that the speed advantage of our algorithm could potentially reach several orders of magnitude.
The analysis presented above confirms that the WEQ method can substantially reduce training time while maintaining benchmark performance.
5.5. Discussion and Analysis
Method | Information matrix | State preparation | Sampling | Singular values/vectors | Embedding computation
---|---|---|---|---|---
WEQ-PPMI | 26.37% | 3.74% | 6.93% | 46.40% | 16.56%
WEQ-SPPMI | 52.93% | 3.76% | 3.73% | 30.92% | 8.66%
Time analysis. Recall the three main steps of the WEQ method: information matrix computation, Q-contexts construction (including state preparation and sampling), and calculating the word embedding from Q-contexts (including the computation of singular values and left singular vectors, and the embedding computation). The breakdown of computational time is displayed in Table 4. Note that the Q-contexts construction step takes up no more than 11% of the whole time, which shows the efficiency of the Q-contexts construction step.
Sample size. In our method, the Q-contexts matrix is obtained by sampling from the information matrix. To investigate how the sample size $q$ affects the performance of WEQ, we carry out experiments with different sample sizes on all three corpora, as shown in Fig. 2. We plot a representative benchmark score, the similarity task MEN, against the training time under the corresponding sample sizes; results for the other evaluation tasks are shown in Fig. 3. As the sample size increases, the performance of WEQ on MEN approaches that of MF, and the convergence is fairly fast, especially when the corpus is large. To reach at least 98% of the performance of MF, the sample size of WEQ only needs to be 40k (enwik9), 50k (WebBase) and 50k (CC), which are about 70.83%, 18.00% and 12.50% of the original sizes of the information matrices, respectively. This validates the theoretical observations in Section 4.1: as the size of the original information matrix increases, the percentage of samples needed decreases. In terms of training time, it rises slowly at small sample sizes and then trends upward sharply at larger sample sizes, which indicates that sampling a small matrix to get the word embedding effectively reduces training time.
$\ell^2$-norm sampling vs. uniform sampling. In the WEQ method, we use $\ell^2$-norm sampling to capture the typical information of the original information matrix. To verify its effectiveness, we replace $\ell^2$-norm sampling with uniform sampling in the Q-contexts construction step; uniform sampling means that each row of the information matrix is sampled with equal probability. We compare the performance of word embeddings trained with these two sampling algorithms; the evaluation results are shown in Fig. 4. $\ell^2$-norm sampling has a clear advantage over uniform sampling on the word similarity scores, and it also shows an improvement on the downstream tasks (text classification, NER). These experimental results are consistent with our theoretical analysis in Section 4.
Nonzeros/Sparsity analysis. To investigate the correlation between matrix sparsity and algorithm efficiency, we compare the time against the number of nonzeros $\mathrm{nnz}(M)$ of the information matrix in Fig. 5. By TES, we mean the total time excluding the SVD computation, which corresponds to Algorithm 3 excluding line 3; since the time for SVD computation depends on the SVD engine used in our method, we do not provide a detailed analysis of it. The Pearson correlation score between $\mathrm{nnz}(M)$ and TES is 0.9583, which demonstrates that our algorithm's efficiency is approximately linearly and positively correlated with the number of nonzero elements of $M$. We also observe that the number of nonzero elements in the SPPMI matrix is significantly smaller than in the PPMI matrix, because only PMI values higher than $\log k$ are retained after the shift operation. The decrease in the number of nonzero elements leads to a sparser matrix and a shorter TES.
6. Related work
In this section, we review the related work of word embedding and low-rank approximation.
6.1. Word Embedding
Constructing word embeddings that express semantic features is a fundamental task in natural language processing, and it benefits many NLP tasks (Maruf and Haffari, 2018; Khabiri et al., 2019). Briefly, recent work on word embedding can be categorized into two genres, i.e., neural network based methods (Mikolov et al., 2013b; Bojanowski et al., 2017; Pennington et al., 2014) and global matrix factorization based methods (Deerwester et al., 1990; Levy and Goldberg, 2014; Arora et al., 2016). These two categories can be bridged by unifying neural network based methods into a matrix factorization framework (Arora et al., 2016; Levy and Goldberg, 2014).
Most neural network based methods model the relation between the target word and its contextual words. Among them, the SGNS (Skip-Gram with Negative Sampling) model is the popular implementation of word2vec (Mikolov et al., 2013b); it slides local context windows over the whole corpus and maximizes the likelihood of contextual words given the target word. Fasttext (Bojanowski et al., 2017) is an extension of word2vec which represents each word by the character n-grams obtained from a sliding window. GloVe (Pennington et al., 2014) is trained on the global word-word co-occurrence counts of a corpus. As a representative global matrix factorization based method, MF (Levy and Goldberg, 2014) factorizes the SPPMI matrix explicitly, rather than iteratively tuning network parameters.
More recently, ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) have achieved state-of-the-art results on a variety of NLP tasks. Despite these results, pre-trained word embeddings should not be neglected; indeed, ELMo is dependent on GloVe embeddings as input during training (Peters et al., 2018). Moreover, both ELMo and BERT are extremely large deep networks that require huge computing resources: the pre-training of BERT takes 4 days to complete on 64 TPUs (Devlin et al., 2019). It is also difficult to apply deep contextualization models directly on low-resource PCs or mobile phones (Strubell et al., 2019; Sun et al., 2020).
In the studies described above, researchers have paid less attention to the computational cost of word embedding training, yet large-scale word embedding training does incur significant computational costs (Mikolov et al., 2013b; Mnih and Kavukcuoglu, 2013; Pennington et al., 2014; Mikolov et al., 2013a). In this work, we present a novel, efficient method for obtaining word embeddings of a large-scale vocabulary while maintaining effectiveness.
6.2. Low-Rank Approximation
Low-rank approximation, i.e., approximating a matrix by one whose rank is lower than that of the original matrix, has been studied in mathematical modeling for decades (Frieze et al., 2004; Dahiya et al., 2018). The aim is a more compact information representation and less complex data modeling with limited loss. It arises in many applications such as recommendation systems (Arrazola et al., 2020; Tang, 2019). Unlike their experiments on random matrices and matrices of small size (Arrazola et al., 2020), our WEQ investigates the actual large matrices associated with natural language and reveals new relations for word embeddings. Our work also differs from (Tang, 2019; Frieze et al., 2004) in that we provide more detailed theoretical and empirical analysis of matrix concentration and a direct analysis of singular vectors, together with novel treatments for sparse matrices and better algorithmic designs for state preparation.
7. Conclusion
In this work, we introduce the notion of Q-contexts (and the Q-contexts matrix), which can be constructed efficiently; they constitute only a small fraction (less than 12.5% in the practical scenario) of all the contexts in the entire corpus. We also present a novel relation between word vectors and Q-contexts, and provide a theoretical foundation with rigorous analysis. Based on this relation, we design a novel and efficient WEQ method for fast computation of large-scale word embeddings. Empirical experiments show that WEQ runs at least 11 times faster than well-established matrix factorization methods, and its resource and time advantages over word2vec, GloVe and fasttext are even more pronounced. We have also shown that WEQ enjoys decent accuracy in a variety of NLP tasks compared to the other methods tested in this study. In the future, we would like to efficiently learn word embeddings for million-word vocabularies. It would also be interesting to apply our method to other domains such as graph embedding.
Acknowledgements.
This work was supported by the National Natural Science Foundation of China (62076100), the Fundamental Research Funds for the Central Universities, SCUT (D2210010, D2200150, and D2201300), and the Science and Technology Planning Project of Guangdong Province (2020B0101100002).
References
- Agirre et al. (2009) Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. In NAACL ’09.
- Arora et al. (2016) Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. A latent variable model approach to pmi-based word embeddings. Transactions of the Association for Computational Linguistics 4 (2016), 385–399.
- Arrazola et al. (2020) Juan Miguel Arrazola, Alain Delgado, Bhaskar Roy Bardhan, and Seth Lloyd. 2020. Quantum-inspired algorithms in practice. Quantum (2020).
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
- Bruni et al. (2014) Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of artificial intelligence research 49 (2014), 1–47.
- Church and Hanks (1989) Kenneth Ward Church and Patrick Hanks. 1989. Word Association Norms, Mutual Information, and Lexicography. In ACL ’89.
- Coecke et al. (2020) Bob Coecke, Giovanni de Felice, Konstantinos Meichanetzidis, and Alexis Toumi. 2020. Foundations for Near-Term Quantum Natural Language Processing. arXiv preprint arXiv:2012.03755 (2020).
- Dagan et al. (1994) Ido Dagan, Fernando Pereira, and Lillian Lee. 1994. Similarity-Based Estimation of Word Cooccurrence Probabilities. In ACL ’94.
- Dahiya et al. (2018) Yogesh Dahiya, Dimitris Konomis, and David P. Woodruff. 2018. An Empirical Evaluation of Sketching for Numerical Linear Algebra. In KDD ’18.
- Deerwester et al. (1990) Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American society for information science 41, 6 (1990), 391–407.
- Derczynski et al. (2016) Leon Derczynski, Kalina Bontcheva, and Ian Roberts. 2016. Broad twitter corpus: A diverse named entity recognition resource. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. 1169–1179.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL ’19. 4171–4186.
- Drineas et al. (2008) Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. 2008. Relative-Error CUR Matrix Decompositions. SIAM J. Matrix Anal. Appl. 30, 2 (2008), 844–881. https://doi.org/10.1137/07070471X
- Frieze et al. (2004) Alan Frieze, Ravi Kannan, and Santosh Vempala. 2004. Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations. J. ACM (2004).
- Han et al. (2013) Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield, and Jonathan Weese. 2013. UMBC_EBIQUITY-CORE: Semantic Textual Similarity Systems. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity.
- Hazan et al. (2011) Elad Hazan, Tomer Koren, and Nati Srebro. 2011. Beating SGD: Learning SVMs in Sublinear Time. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger (Eds.). 1233–1241.
- Khabiri et al. (2019) Elham Khabiri, Wesley M Gifford, Bhanukiran Vinzamuri, Dhaval Patel, and Pietro Mazzoleni. 2019. Industry specific word embedding and its application in log classification. In CIKM ’19. 2713–2721.
- Kim (2014) Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In EMNLP ’14.
- Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Neural Word Embedding as Implicit Matrix Factorization. In NeurIPS ’14.
- Liu et al. (2018) Qian Liu, Heyan Huang, Yang Gao, Xiaochi Wei, Yuxin Tian, and Luyang Liu. 2018. Task-oriented Word Embedding for Text Classification. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2023–2032.
- Ma and Hovy (2016) Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In ACL ’16.
- Maruf and Haffari (2018) Sameen Maruf and Gholamreza Haffari. 2018. Document Context Neural Machine Translation with Memory Networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 1275–1284.
- Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. ICLR Workshop’13 (2013).
- Mikolov et al. (2013b) Tomás Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In NeurIPS ’13, Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger (Eds.).
- Mnih and Kavukcuoglu (2013) Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In NeurIPS ’13, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.).
- Ortiz Suárez et al. (2020) Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. 2020. A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages. In ACL ’20.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In EMNLP ’14. 1532–1543.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of NAACL-HLT). Association for Computational Linguistics, New Orleans, Louisiana, 2227–2237.
- Rebentrost et al. (2014) Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd. 2014. Quantum support vector machine for big data classification. Physical review letters 113, 13 (2014), 130503.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP ’13. 1631–1642.
- Song et al. (2016) Zhao Song, David P. Woodruff, and Huan Zhang. 2016. Sublinear Time Orthogonal Tensor Decomposition. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (Eds.). 793–801.
- Strubell et al. (2019) Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. In ACL ’19. Association for Computational Linguistics, Florence, Italy, 3645–3650.
- Sun et al. (2020) Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices. In ACL ’20. Association for Computational Linguistics, Online, 2158–2170.
- Tang (2019) Ewin Tang. 2019. A quantum-inspired classical algorithm for recommendation systems. STOC ’19 (2019).
- Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In NAACL ’03.
- Tropp (2015) Joel A. Tropp. 2015. An Introduction to Matrix Concentration Inequalities. (2015).
- Turney and Pantel (2010) Peter D. Turney and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. J. Artif. Int. Res. (2010).
- Zeng and Coecke (2016) William Zeng and Bob Coecke. 2016. Quantum algorithms for compositional natural language processing. arXiv preprint arXiv:1608.01406 (2016).
- Zhang et al. (2019) Yun Zhang, Yongguo Liu, Jiajing Zhu, Ziqiang Zheng, Xiaofeng Liu, Weiguang Wang, Zijie Chen, and Shuangqing Zhai. 2019. Learning Chinese word embeddings from stroke, structure and pinyin of characters. In CIKM ’19. 1011–1020.