
Interpretable Neural Embedding with Sparse Words Self-Representation

Minxue Xia
Vector Lab
JD Finance
Hao Zhu
Vector Lab
JD Finance

Abstract

Interpretability benefits the theoretical understanding of representations. Existing word embeddings are generally dense representations, so the meaning of their latent dimensions is difficult to interpret. This makes word embeddings behave like a black box and prevents them from being human-readable and from further manipulation. Many methods employ sparse representations to learn word embeddings with better interpretability. However, they also suffer from the instability of grouped selection under $\ell_1$ regularization and online dictionary learning, and therefore tend to yield different results each time. To alleviate this challenge, we propose a novel method that associates data self-representation with a shallow neural network to learn expressive, interpretable word embeddings. In experiments, we report that the resulting word embeddings achieve comparable, and even slightly better, interpretability than baseline embeddings. We also show that our approach performs competitively on all downstream tasks and outperforms benchmark embeddings on a majority of them.

1 Introduction

Word embeddings have gained immense success and obtained state-of-the-art results in a wide range of NLP tasks, such as parsing Lazaridou et al. (2013b); Bansal et al. (2014), named entity recognition Guo et al. (2014), and sentiment analysis Socher et al. (2013). One of the main attractions of these embeddings is that they capture the underlying complexity and latent trends in the data. However, a critical issue is that they are difficult to interpret: people are unaware of what each dimension represents. Subramanian et al. (2018) examined the top-ranked words per dimension ("top dimension words") in benchmark embeddings like GloVe Pennington et al. (2014) and Word2Vec Mikolov et al. (2013) and demonstrated that they do not form semantically coherent groups. For instance, the word 'Internet' is mainly composed of the following top dimension words in GloVe (as shown in Table 1), which is far from interpretable.

People have utilized sparse coding to transform word embeddings into sparse vectors Faruqui et al. (2015); Subramanian et al. (2018). These methods learn interpretable embeddings by applying non-negativity and sparsity constraints to the original word embeddings.

However, these solutions have several technical drawbacks. First, they fail to handle grouped selection due to their use of $\ell_1$ regularization (Lasso) Tibshirani (1996). More specifically, if there is a group of words among which the pairwise correlations are very high, Lasso tends to select only one word from that group and does not care which one is selected Zou and Hastie (2005). For instance, suppose there exist two groups of words representing the two semantic concepts college and creature. Given the target word youtube, Lasso tends to select only the most similar word from each group and combine them to represent the target word. Second, online learning makes the learned dictionary unstable and unrepeatable: training the model multiple times may yield different dictionary bases. We observed this phenomenon in SPINE Subramanian et al. (2018): when SPINE is trained twice, the overlap between the top 5 dimensions accounts for only 7%. Last but not least, rare words are difficult to include in the basis of the dictionary, because they cannot be fitted well by other words and cannot fit other words well.

To alleviate these problems, we propose a method based on words self-representation (SR) to generate highly efficient and interpretable word embeddings. Motivated by the linear algebraic structure of word senses Arora et al. (2018), we extend the linear assertion and assume that a word in a subspace can always be expressed as a linear combination of other words in that subspace. We therefore use a neural model with Sparse Words Self-Representation (SWSR) to obtain the desired embeddings. The self-expressiveness of data guarantees the stability of the basis of the dictionary, and the neural approach efficiently avoids the grouped-selection issue.

Our contributions are as follows: 1) we formulate word embedding learning with the self-expressiveness property of data; 2) we propose a neural model that solves words self-representation under non-negativity and sparsity constraints; 3) our approach is competitive with existing methods on many downstream tasks.

Top dimension words
           thousands, residents, palestinian
'Internet' intelligence, government, foreign
           nhl, writer, writers, drama
Table 1: An example of un-interpretable word embeddings. Each row lists the top-ranked words from one of the top dimensions of GloVe for the word 'Internet'.

2 Related Work

To learn interpretable word embeddings, current approaches are generally based on sparse coding. There are two major types of inputs: those derived from Positive Point-wise Mutual Information (PPMI) and those derived from neural embeddings. Murphy et al. (2012); Luo et al. (2015) apply sparse matrix factorization with non-negativity constraints to PPMI inputs. Subramanian et al. (2018); Faruqui et al. (2015) employ neural inputs such as pre-trained GloVe and Word2Vec embeddings. Comparing the two, the former uses a neural-based method to enforce sparse, non-negative embeddings, while the latter uses convex optimization to transform dense representations into sparse, interpretable ones. Other methods, such as Park et al. (2017), rotate word vectors to enhance interpretability. Our method builds on the neural approach and differs from the others in that it is more expressive than linear methods like matrix factorization or rotation. Furthermore, words in our method naturally participate at varying levels in different dimensions.

3 Methodology

In order to obtain stable and interpretable embeddings, we introduce the self-expressiveness property of data, which states that each word in a union of subspaces can be efficiently represented as a linear combination of other words. In general, such a representation is not unique, because there are infinitely many ways to express a word through a combination of other words. Ideally, a sparse representation of a word corresponds to a combination of a few words from its own subspace.

From the viewpoint of sparse representation, this avoids learning a dictionary in advance and therefore yields a more stable 'dictionary'. To realize the sparse representation, we implement words self-representation in a shallow neural network and modify the loss function to impose sparsity constraints on the parameters.

3.1 Learning Embedding with Self-Expressiveness Property

Given a set of word embeddings $X=\{x_i\}_{i=1,\dots,|V|}\in\mathbb{R}^{d\times|V|}$, where $|V|$ is the vocabulary size and $d$ is the number of dimensions of the input word embeddings, we assume that word representations are drawn from multiple linear subspaces $\{S_i\}_{i=1,\dots,K}$. Each word in a subspace can be regarded as a linear combination of other words in the same subspace. This property is called the self-expressiveness property (SEP) in Rao et al. (2008). We can express the property with a single equation, $X=XC$, where $C$ is the self-representation coefficient matrix. It has been shown in Rao et al. (2008) that, under the assumption that the subspaces are independent, minimizing certain norms of $C$ guarantees that $C$ has a block-diagonal structure (up to certain permutations), i.e., $c_{ij}>0$ if and only if points $x_i$ and $x_j$ lie in the same subspace. We can therefore leverage the words self-representation $C$ to construct new word embeddings that are interpretable, because only similar words belong to the same subspace and thus have similar representations in $C$.
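To make the block-diagonal claim concrete, here is a small NumPy sketch on toy data of our own construction (not the paper's embeddings): six "words" are drawn from two independent one-dimensional subspaces, and each word is regressed on the others with a small ridge penalty standing in for the norm minimization in Eq. 1.

```python
import numpy as np

# Toy illustration of the self-expressiveness property (SEP).
# Hypothetical data: six "words" drawn from two independent 1-D subspaces.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))   # two orthonormal directions
u, v = Q[:, :1], Q[:, 1:]
X = np.hstack([u * [1.0, 2.0, -1.0], v * [1.0, -2.0, 0.5]])  # d x |V| = 5 x 6

n = X.shape[1]
C = np.zeros((n, n))
lam = 1e-3  # small ridge penalty standing in for the norm regularizer
for i in range(n):
    others = [j for j in range(n) if j != i]   # enforce diag(C) = 0
    A = X[:, others]
    c = np.linalg.solve(A.T @ A + lam * np.eye(n - 1), A.T @ X[:, i])
    C[others, i] = c

# C comes out block-diagonal: each word is expressed only by words
# from its own subspace.
print(np.round(C, 2))
```

Words 0-2 get nonzero coefficients only from each other, and likewise for words 3-5, which is exactly the structure that makes dimensions of $C$ interpretable.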

The self-representation coefficient matrix can be obtained by solving the optimization problem:

\min_{C}\|C\|_{p}\;\;\text{s.t.}\;\;X=XC,\;\mathrm{diag}(C)=0,\;C_{ij}\geq 0 (1)

where $\|\cdot\|_{p}$ represents an arbitrary matrix norm, and the optional diagonal constraint on $C$ prevents trivial solutions for sparsity-inducing norms, such as the $\ell_1$-norm. Various norms for $C$ have been proposed in the literature, e.g., the $\ell_1$-norm in Sparse Subspace Clustering (SSC) Elhamifar and Vidal (2009), the nuclear norm in Low Rank Representation (LRR) Liu et al. (2013) and Low Rank Subspace Clustering (LRSC) Favaro et al. (2011), and the Frobenius norm in Least-Squares Regression (LSR) Lu et al. (2012).

3.2 Self-Expressive Layer in Shallow Neural Network

In this paper, we propose a neural model to obtain the words self-representation. Our goal is to train a shallow neural network whose inputs are well-suited to SEP with a parameter $C$. To this end, we introduce a new layer that encodes the notion of the self-expressiveness property. To encode words self-representation under the architecture of a neural network, we reformulate Eq. 1 into the following form:

\min_{C}\frac{1}{2}\|X-XC\|_{2}^{2}+\lambda\Omega(C) (2)

where $\Omega(C)$ denotes the regularization term. Typical choices include the sparsity-inducing norm $\|\cdot\|_{1}$ Elhamifar and Vidal (2009) and the nuclear norm Favaro et al. (2011).

In contrast to sparse coding or convex optimization approaches to solving for the parameter $C$ Elhamifar and Vidal (2009); Liu et al. (2013), we propose a shallow neural network with a sparsity penalty and a non-negativity constraint on $C$ to obtain sparse, interpretable embeddings. The forward step in our method is then defined as:

\hat{X}=Xf(C) (3)

where $f$ is an appropriate element-wise activation function ensuring that $C$ is non-negative, and $C\in\mathbb{R}^{|V|\times|V|}$ are the model parameters. In order to produce non-negative values and exact zeros, we use Capped-ReLU Subramanian et al. (2018) as the activation function to map continuous values into $[0,1]$. Mathematically, we define the activation as:

\textbf{Capped-ReLU}(x)=\begin{cases}0,&\text{if }x\leq 0\\ x,&\text{if }0<x<1\\ 1,&\text{if }x\geq 1\end{cases} (4)
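In code, the Capped-ReLU of Eq. 4 is simply an element-wise clip to $[0,1]$; a minimal NumPy sketch:

```python
import numpy as np

def capped_relu(x):
    """Capped-ReLU (Eq. 4): element-wise clip to [0, 1], yielding
    exact zeros (for sparsity) and exact ones (for binary-like codes)."""
    return np.clip(x, 0.0, 1.0)

print(capped_relu(np.array([-0.5, 0.0, 0.3, 1.0, 2.7])))
```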

In this setting, given $X$, our shallow neural network is trained to minimize the following loss function:

\textbf{L}=\textbf{RL}+\lambda_{1}\textbf{ASL}+\lambda_{2}\textbf{PSL} (5)

where the Reconstruction Loss (RL) is the average loss in reconstructing the input representation from the learned representation, defined as a mean squared error:

\textbf{RL}=\frac{1}{|V|}\sum_{i=1}^{|V|}\|X_{i}-\hat{X}_{i}\|_{2}^{2} (6)

In order to simulate a sparsity-inducing norm in our neural model, we penalize any deviation of the observed average activation value in $C$ from the desired average activation value of a given hidden unit. We formulate this Average Sparsity Loss (ASL) as follows:

\textbf{ASL}=\sum_{i=1}^{n}\max\Big(\frac{\sum_{j=1}^{n}c_{ij}}{n}-\rho,\;0\Big) (7)

where $\rho$ is the desired sparsity ratio.

Beyond obtaining k-sparse activation values for the input, we also hope to have discriminative representations such as binary codes. Thus, a Partial Sparsity Loss (PSL) term Subramanian et al. (2018) is used to penalize values that are neither close to 0 nor close to 1, pushing them toward either extreme. The formulation of PSL is as follows:

\textbf{PSL}=\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}c_{ij}\left(1-c_{ij}\right) (8)

where $C_{i}$ is the $i$-th row of the parameter $C$ and $c_{ij}$ is its $j$-th entry.
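Putting Eqs. 3 and 5-8 together, the objective can be sketched in NumPy as follows (the hyperparameter values here are illustrative, not the paper's):

```python
import numpy as np

def capped_relu(x):
    return np.clip(x, 0.0, 1.0)

def swsr_loss(X, C_raw, rho=0.05, lam1=1.0, lam2=1.0):
    """Sketch of the SWSR objective (Eq. 5).

    X     : d x n matrix of pre-trained embeddings (n = |V|).
    C_raw : n x n unconstrained parameters; Capped-ReLU maps them
            into [0, 1], making C non-negative (Eq. 3).
    """
    n = X.shape[1]
    C = capped_relu(C_raw)
    np.fill_diagonal(C, 0.0)                              # diag(C) = 0
    X_hat = X @ C                                         # forward step, Eq. 3
    rl = np.mean(np.sum((X - X_hat) ** 2, axis=0))        # RL: mean column error
    asl = np.sum(np.maximum(C.sum(axis=1) / n - rho, 0))  # ASL: row activations
    psl = np.mean(np.sum(C * (1.0 - C), axis=1))          # PSL: push to {0, 1}
    return rl + lam1 * asl + lam2 * psl

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 6))
print(swsr_loss(X_demo, rng.normal(size=(6, 6))))
```

In practice the gradient of this loss with respect to `C_raw` would be taken by an autodiff framework; the NumPy version only shows how the three terms combine.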

4 Experiments

To enable fair comparisons, experiments are conducted within the framework of SPINE Subramanian et al. (2018). Briefly, we use pre-trained GloVe and Word2Vec embeddings for training; both embeddings are 300-dimensional.

We compare the embeddings generated by our model (denoted SWSR) against GloVe, Word2Vec, SPOWV, and SPINE. All hyperparameters of SPINE and SPOWV follow the authors' recommendations. To test the quality of these embeddings, we use them in the following benchmark downstream classification tasks: 1) news classification on the 20 Newsgroups dataset (Sports/Computers/Religion); 2) noun phrase bracketing Lazaridou et al. (2013a); 3) question classification Li and Roth (2006). For the text classification tasks, we use the average of the embeddings of the words in a sentence as features, and experiment with classifiers such as SVM, logistic regression, and random forests.
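The sentence-level features described above can be sketched as follows; `vectors` is a hypothetical word-to-vector lookup, and skipping out-of-vocabulary words is an assumption on our part:

```python
import numpy as np

def sentence_features(sentence, vectors, dim=300):
    """Average the embeddings of a sentence's in-vocabulary words."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim)
    return np.mean([vectors[w] for w in words], axis=0)

# Tiny 2-D demo vocabulary; "!" is out of vocabulary and is skipped.
vecs = {"good": np.array([1.0, 0.0]), "movie": np.array([0.0, 1.0])}
print(sentence_features("Good movie !", vecs, dim=2))
```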

4.1 Downstream Task Evaluations

To test the quality of the embeddings, all performances are evaluated through accuracy (Tables 2 and 3). Embeddings generated by our method perform competitively on all benchmark tasks and do significantly better on a majority of them.

Tasks GloVe SPOWV SPINE SWSR
Sports 95.31 95.80 96.42 91.53
Religion 80.62 81.27 79.80 81.59
Computers 71.64 78.16 75.35 80.77
NPBracket 73.31 69.89 73.58 68.69
Question. 82.80 87.45 88.75 90.28
Table 2: Accuracy on downstream tasks using GloVe word embeddings.
Tasks Word2Vec SPOWV SPINE SWSR
Sports 93.01 95.72 93.96 94.1
Religion 80.09 83.77 83.43 84.17
Computers 71.37 76.32 74.26 81.85
NPBracket 77.91 70.34 72.56 70.65
Question. 89.02 90.50 92.34 93.49
Table 3: Accuracy on downstream tasks using Word2Vec word embeddings.

4.2 Interpretability

We investigate the interpretability of word embeddings on the word intrusion detection task, which measures how coherent each dimension of these vectors is. For a given dimension, we construct a word set containing the top 5 words of that dimension and a noisy word (called the intruder) drawn from the bottom half of that dimension while ranking high (top 10%) in some other dimension.

We follow the automatic evaluation metric DistRatio proposed by Sun et al. (2016) to measure the interpretability of word representations. The formulation is as follows:

\textbf{DistRatio}=\frac{1}{d}\sum_{i=1}^{d}\frac{\textbf{InterDist}_{i}}{\textbf{IntraDist}_{i}} (9)

\textbf{IntraDist}_{i}=\sum_{w_{j}\in top_{k}(i)}\sum_{\substack{w_{k}\in top_{k}(i)\\ w_{k}\neq w_{j}}}\frac{dist(w_{j},w_{k})}{k(k-1)} (10)

\textbf{InterDist}_{i}=\sum_{w_{j}\in top_{k}(i)}\frac{dist(w_{j},w_{b_{i}})}{k} (11)

where $top_{k}(i)$ denotes the top-$k$ words on dimension $i$ and $w_{b_{i}}$ the intruder word for dimension $i$. We use the Euclidean distance for $dist(w_{i},w_{j})$. IntraDist stands for the average distance among the top words, while InterDist represents the average distance between the top words and the intruder word.
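The metric above can be sketched in NumPy as follows; `top_words` and `intruders` are assumed to be precomputed per dimension as described earlier:

```python
import numpy as np

def dist_ratio(vectors, top_words, intruders):
    """DistRatio averaged over dimensions; vectors maps word -> ndarray,
    top_words[i] is the top-k word list and intruders[i] the intruder
    word for dimension i."""
    ratios = []
    for top, b in zip(top_words, intruders):
        vs = [vectors[w] for w in top]
        k = len(vs)
        # IntraDist: average pairwise distance among the top-k words
        # (self-pairs contribute 0, so the full double sum is equivalent)
        intra = sum(np.linalg.norm(x - y) for x in vs for y in vs) / (k * (k - 1))
        # InterDist: average distance from the top words to the intruder
        inter = sum(np.linalg.norm(x - vectors[b]) for x in vs) / k
        ratios.append(inter / intra)
    return sum(ratios) / len(ratios)

# One tight dimension ("a", "b", "c") with a distant intruder "z":
# the ratio comes out much larger than 1, i.e., an interpretable dimension.
vecs = {"a": np.array([0.0, 0.0]), "b": np.array([0.1, 0.0]),
        "c": np.array([0.0, 0.1]), "z": np.array([5.0, 5.0])}
print(dist_ratio(vecs, [["a", "b", "c"]], ["z"]))
```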

As we can see from Table 4, SWSR achieves performance comparable to previous baseline approaches and even performs best on pre-trained Word2Vec embeddings. These results confirm that our model generates more interpretable word embeddings. We attribute the success of SWSR to the expressiveness of a neural model; the non-negativity and sparsity constraints further lead to semantically coherent dimensions. In this way, SWSR constructs expressive and interpretable word representations.

Model Sparsity DistRatio
GloVe 0% 0.99
SPINE 85% 1.25
SPOWV 90% 1.12
SWSR 95% 1.27
Word2Vec 0% 0.98
SPINE 85% 1.28
SPOWV 90% 1.08
SWSR 99% 1.34
Table 4: Average performances across all vector models on the word intrusion task, measured with DistRatioDistRatio.

5 Conclusion

We have presented a method that converts word representations derived from any state-of-the-art embedding model into sparse ones by implementing self-representation within a shallow neural network. Experiments demonstrate that embeddings generated by our method perform competitively with currently recognized methods on a diverse set of downstream tasks, while achieving comparable and even slightly better interpretability. Moreover, the use of words self-representation guarantees that multiple words from the same top dimensions are selected to construct interpretable representations, and the full usage of data points in SR ensures the repeatability and stability of the dictionary.

References

  • Arora et al. (2018) Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2018. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association of Computational Linguistics, 6:483–495.
  • Bansal et al. (2014) M. Bansal, K. Gimpel, and K Livescu. 2014. Tailoring continuous word representations for dependency parsing. In ACL (2), 809–815.
  • Elhamifar and Vidal (2009) Ehsan Elhamifar and René Vidal. 2009. Sparse subspace clustering. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2790–2797. IEEE.
  • Faruqui et al. (2015) M. Faruqui, Y. Tsvetkov, D. Yogatama, C. Dyer, and N. A Smith. 2015. Sparse overcomplete word vector representations. In Proc. of ACL.
  • Favaro et al. (2011) Paolo Favaro, René Vidal, and Avinash Ravichandran. 2011. A closed form solution to robust subspace estimation and clustering. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1801–1807. IEEE.
  • Guo et al. (2014) J. Guo, W. Che, H. Wang, and T. Liu. 2014. Revisiting embedding features for simple semi-supervised learning. In EMNLP, 110–120.
  • Lazaridou et al. (2013a) A. Lazaridou, E. M. Vecchi, and M. Baroni. 2013a. Fish transporters and miracle homes: How compositional distributional semantics can help np parsing. In Proc. of EMNLP.
  • Lazaridou et al. (2013b) Angeliki Lazaridou, Eva Maria Vecchi, and Marco Baroni. 2013b. Fish transporters and miracle homes: How compositional distributional semantics can help np parsing. In Proceedings of EMNLP.
  • Li and Roth (2006) X. Li and D Roth. 2006. Learning question classifiers: the role of semantic information. Natural Language Engineering, 12(03):229–249.
  • Liu et al. (2013) Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. 2013. Robust recovery of subspace structures by low-rank representation. IEEE transactions on pattern analysis and machine intelligence, 35(1):171–184.
  • Lu et al. (2012) Can-Yi Lu, Hai Min, Zhong-Qiu Zhao, Lin Zhu, De-Shuang Huang, and Shuicheng Yan. 2012. Robust and efficient subspace segmentation via least squares regression. In European conference on computer vision, pages 347–360. Springer.
  • Luo et al. (2015) Hongyin Luo, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2015. Online learning of interpretable word embeddings. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, page 1687–1692.
  • Mikolov et al. (2013) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 3111–3119.
  • Murphy et al. (2012) Brian Murphy, Partha Talukdar, and Tom Mitchell. 2012. Learning effective and interpretable semantic models using non-negative sparse embedding. In Proc. of COLING.
  • Park et al. (2017) S. Park, J. Bak, and A Oh. 2017. Rotated word vector representations and their interpretability. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 401–411.
  • Pennington et al. (2014) J. Pennington, R. Socher, and C. D Manning. 2014. Glove: Global vectors for word representation. In EMNLP, 14, 1532–43.
  • Rao et al. (2008) Shankar R Rao, Roberto Tron, René Vidal, and Yi Ma. 2008. Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories.
  • Socher et al. (2013) R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), 1631, 1642.
  • Subramanian et al. (2018) Anant Subramanian, Danish Pruthi, Harsh Jhamtani, Taylor Berg-Kirkpatrick, and Eduard Hovy. 2018. Spine: Sparse interpretable neural embeddings. Proceedings of the Thirty Second AAAI Conference on Artificial Intelligence (AAAI).
  • Sun et al. (2016) Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. 2016. Sparse word embeddings using l1 regularized online learning. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16).
  • Tibshirani (1996) Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
  • Zou and Hastie (2005) Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.