
Stochastic Neighbor Embedding
with Gaussian and Student-t Distributions: Tutorial and Survey

Benyamin Ghojogh Department of Electrical and Computer Engineering,
Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada
   Ali Ghodsi Department of Statistics and Actuarial Science & David R. Cheriton School of Computer Science,
Data Analytics Laboratory, University of Waterloo, Waterloo, ON, Canada
   Fakhri Karray Department of Electrical and Computer Engineering,
Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, ON, Canada
   Mark Crowley Department of Electrical and Computer Engineering,
Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada
Abstract

Stochastic Neighbor Embedding (SNE) is a manifold learning and dimensionality reduction method with a probabilistic approach. In SNE, every point is considered to be a neighbor of all other points with some probability, and this probability is preserved, as well as possible, in the embedding space. SNE uses the Gaussian distribution for these probabilities in both the input and embedding spaces. In contrast, t-SNE uses the Gaussian and Student-t distributions in the input and embedding spaces, respectively. In this tutorial and survey paper, we explain SNE, symmetric SNE, t-SNE (or Cauchy-SNE), and t-SNE with general degrees of freedom. We also cover the out-of-sample extension and acceleration of these methods.

To appear as a part of an upcoming academic book on dimensionality reduction and manifold learning.


1 Introduction

Stochastic Neighbor Embedding (SNE) (Hinton & Roweis, 2003) is a manifold learning and dimensionality reduction method which can be used for feature extraction (Ghojogh et al., 2019). It has a probabilistic approach. It fits the data in the embedding space locally, hoping to preserve the global structure of data (Saul & Roweis, 2003). The idea of SNE is to consider every point as a neighbor of every other point with some probability, where closer points are neighbors with higher probability. Therefore, rather than considering the $k$ nearest neighbors in a binary manner (either being neighbors or not), it considers neighbors in a stochastic way (how probable it is for two points to be neighbors). It tries to preserve these neighborhood probabilities in the low-dimensional embedding space. It is noteworthy that there exist some other similar probabilistic dimensionality reduction methods which make use of the Gaussian distribution for neighborhood. Some examples are Neighborhood Component Analysis (NCA) (Goldberger et al., 2005), deep NCA (Liu et al., 2020), and Proxy-NCA (Movshovitz-Attias et al., 2017).

SNE uses the Gaussian distribution for neighbors in both the input and embedding spaces. The Student-t distributed SNE, or so-called t-SNE (van der Maaten & Hinton, 2008), uses the Gaussian and Student-t distributions in the input and embedding spaces, respectively. The reason for using the Student-t distribution in t-SNE is its heavier tails, so it can accommodate more of the information of the high-dimensional data. t-SNE is one of the state-of-the-art methods for data visualization; for example, it has been used for DNA and single-cell data visualization (Kobak & Berens, 2019). In this paper, we explain SNE, symmetric SNE, t-SNE (or Cauchy-SNE), t-SNE with general degrees of freedom, their out-of-sample extensions, and their accelerations. We also show the results of simulations for visualization of embeddings.

The goal of SNE is to embed the high-dimensional data $\{\boldsymbol{x}_i\}_{i=1}^n$ into the lower-dimensional data $\{\boldsymbol{y}_i\}_{i=1}^n$, where $n$ is the number of data points. We denote the dimensionality of the high- and low-dimensional spaces by $d$ and $p$, respectively, i.e., $\boldsymbol{x}_i \in \mathbb{R}^d$ and $\boldsymbol{y}_i \in \mathbb{R}^p$. We usually have $p \ll d$. For data visualization, we have $p \in \{2, 3\}$.

The remainder of this paper is organized as follows. In Sections 2 and 3, we explain SNE and symmetric SNE, respectively. Section 4 introduces the crowding problem and the t-SNE (or Cauchy-SNE) method, and Section 5 covers t-SNE with general degrees of freedom. The out-of-sample embedding and the acceleration of these methods are introduced in Sections 6 and 7, respectively. Recent improvements of t-SNE are briefly mentioned in Section 8. Finally, Section 9 concludes the paper.

2 Stochastic Neighbor Embedding (SNE)

In SNE (Hinton & Roweis, 2003), we consider a Gaussian probability distribution around every point $\boldsymbol{x}_i$, where the point $\boldsymbol{x}_i$ is at the mean and the distribution gives the probability of accepting any other point as a neighbor of $\boldsymbol{x}_i$; farther points are neighbors with smaller probability. Hence, the variable is the distance, denoted by $d \in \mathbb{R}$, and the Gaussian probability is:

$$f(d) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\Big(-\frac{d^{2}}{2\sigma^{2}}\Big), \qquad (1)$$

where the mean of the distribution is assumed to be zero. The fixed multiplier $\frac{1}{\sqrt{2\pi\sigma^{2}}}$ can be dropped; however, $\exp(-d^{2}/2\sigma^{2})$ alone does not integrate to one and thus is not a probability density function. In order to tackle this problem, we can divide $\exp(-d^{2}/2\sigma^{2})$ by the summation of all its possible values, yielding a softmax function. Therefore, the probability that the point $\boldsymbol{x}_i \in \mathbb{R}^d$ takes $\boldsymbol{x}_j \in \mathbb{R}^d$ as its neighbor is:

$$\mathbb{R} \ni p_{ij} := \frac{\exp(-d_{ij}^{2})}{\sum_{k\neq i}\exp(-d_{ik}^{2})}, \qquad (2)$$

where:

$$\mathbb{R} \ni d_{ij}^{2} := \frac{\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\|_{2}^{2}}{2\sigma_{i}^{2}}. \qquad (3)$$

Note that this trick is also used for $q_{ij}$ in SNE and for $p_{ij}$ and $q_{ij}$ in t-SNE (and its variants), as we will see later.

It is noteworthy that the mentioned trick is also used in other methods such as the Continuous Bag-of-Words (CBOW) model of Word2Vec (Mikolov et al., 2013a; Rong, 2014), Euclidean Embedding (Globerson et al., 2007), and Parametric Embedding (Iwata et al., 2005). In this trick, the summation in the denominator can get very time-consuming, especially when the dataset (or corpus, for Word2Vec) is large. This, plus the slow pace of gradient descent (Boyd & Vandenberghe, 2004), is the reason that SNE, t-SNE, and Word2Vec are very slow and even infeasible for large datasets. Word2Vec tackled the slow pace by introducing the negative sampling skip-gram model (Mikolov et al., 2013b; Goldberg & Levy, 2014), which uses the logistic function similar to the approach of logistic regression (Kleinbaum et al., 2002). In the logistic function, we deal with inner products (similarities) of data points rather than distances, and there is no summation in the denominator. The negative sampling skip-gram model also uses Newton's method, which is much faster than gradient descent (Boyd & Vandenberghe, 2004), similar to logistic regression.

The $\sigma_i^2$ is the variance which we consider for the Gaussian distribution used for $\boldsymbol{x}_i$. It can be set to a fixed number or determined by a binary search so that the entropy of the distribution attains some specific value (Hinton & Roweis, 2003). In this way, the variance of every Gaussian distribution adapts to the density of the data around $\boldsymbol{x}_i$ in the input space.
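As a concrete illustration of this binary search, the following Python sketch (our own, not from the cited papers; the function names, the tolerance, and the default target perplexity of 30 are assumptions) tunes $\sigma_i^2$ for a single point so that the perplexity $2^{H(P_i)}$ of the conditional distribution matches a chosen target:

```python
import numpy as np

def conditional_p_and_entropy(sq_dists_i, sigma2):
    # p_{j|i} for one point, given its squared distances to all other points
    # (Eqs. (2)-(3); the point itself is excluded by the caller).
    logits = -sq_dists_i / (2.0 * sigma2)
    logits -= logits.max()                       # numerical stability of the softmax
    p = np.exp(logits)
    p /= p.sum()
    entropy = -np.sum(p * np.log2(p + 1e-12))    # Shannon entropy in bits
    return p, entropy

def binary_search_sigma2(sq_dists_i, target_perplexity=30.0, tol=1e-4, max_iter=50):
    # Find sigma_i^2 such that 2^entropy matches the target perplexity.
    target_entropy = np.log2(target_perplexity)
    low, high = 1e-10, 1e10
    sigma2 = 1.0
    for _ in range(max_iter):
        _, entropy = conditional_p_and_entropy(sq_dists_i, sigma2)
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:             # distribution too flat -> shrink sigma
            high = sigma2
            sigma2 = (low + high) / 2.0
        else:                                    # distribution too peaked -> grow sigma
            low = sigma2
            sigma2 = (low + high) / 2.0 if high < 1e10 else sigma2 * 2.0
    return sigma2
```

The bisection works because the entropy of $p_{j|i}$ is monotonically increasing in $\sigma_i^2$, so the target value can be bracketed and halved toward.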

In the low-dimensional embedding space, we again consider a Gaussian probability distribution for the point $\boldsymbol{y}_i \in \mathbb{R}^p$ to take $\boldsymbol{y}_j \in \mathbb{R}^p$ as its neighbor:

$$\mathbb{R} \ni q_{ij} := \frac{\exp(-z_{ij}^{2})}{\sum_{k\neq i}\exp(-z_{ik}^{2})}, \qquad (4)$$

where:

$$\mathbb{R} \ni z_{ij}^{2} := \|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2}. \qquad (5)$$

It is noteworthy that the variance of this distribution is not used (or, equivalently, is set to $\sigma_i^2 = 0.5$ to cancel the $2$ in the denominator) because the variance of the distribution in the embedding space is the choice of the algorithm.

We want the probability distributions in both the input and embedded spaces to be as similar as possible; therefore, the cost function to be minimized can be the summation of the Kullback-Leibler (KL) divergences (Kullback, 1997) over the $n$ points:

$$\mathbb{R} \ni c_{1} := \sum_{i=1}^{n}\text{KL}(P_{i}\,\|\,Q_{i}) = \sum_{i=1}^{n}\sum_{j=1, j\neq i}^{n} p_{ij}\log\Big(\frac{p_{ij}}{q_{ij}}\Big), \qquad (6)$$

where $p_{ij}$ and $q_{ij}$ are given by Eqs. (2) and (4). Note that divergences other than the KL divergence can be used for the SNE optimization; e.g., see (Im et al., 2018).

Proposition 1.

The gradient of $c_1$ with respect to $\boldsymbol{y}_i$ is:

$$\mathbb{R}^{p} \ni \frac{\partial c_{1}}{\partial\boldsymbol{y}_{i}} = 2\sum_{j=1}^{n}(p_{ij}-q_{ij}+p_{ji}-q_{ji})(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}), \qquad (7)$$

where $p_{ij}$ and $q_{ij}$ are given by Eqs. (2) and (4), and $p_{ii} = q_{ii} = 0$.

Proof.

Proof is inspired by (van der Maaten & Hinton, 2008). Let:

$$\mathbb{R} \ni r_{ij} := z_{ij}^{2} = \|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2}. \qquad (8)$$

By changing $\boldsymbol{y}_i$, we only impact $z_{ij}$ and $z_{ji}$ (or $r_{ij}$ and $r_{ji}$) for all $j$'s. According to the chain rule, we have:

$$\mathbb{R}^{p} \ni \frac{\partial c_{1}}{\partial\boldsymbol{y}_{i}} = \sum_{j}\Big(\frac{\partial c_{1}}{\partial r_{ij}}\frac{\partial r_{ij}}{\partial\boldsymbol{y}_{i}} + \frac{\partial c_{1}}{\partial r_{ji}}\frac{\partial r_{ji}}{\partial\boldsymbol{y}_{i}}\Big).$$

According to Eq. (8), we have:

$$r_{ij}=\|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2} \implies \frac{\partial r_{ij}}{\partial\boldsymbol{y}_{i}}=2(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}),$$
$$r_{ji}=\|\boldsymbol{y}_{j}-\boldsymbol{y}_{i}\|_{2}^{2}=\|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2} \implies \frac{\partial r_{ji}}{\partial\boldsymbol{y}_{i}}=2(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}).$$

Therefore:

$$\therefore~~~ \frac{\partial c_{1}}{\partial\boldsymbol{y}_{i}} = 2\sum_{j}\Big(\frac{\partial c_{1}}{\partial r_{ij}}+\frac{\partial c_{1}}{\partial r_{ji}}\Big)(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}). \qquad (9)$$

The cost function can be re-written as:

$$c_{1} = \sum_{k}\sum_{l\neq k} p_{kl}\log\Big(\frac{p_{kl}}{q_{kl}}\Big) = \sum_{k\neq l} p_{kl}\log\Big(\frac{p_{kl}}{q_{kl}}\Big) = \sum_{k\neq l}\big(p_{kl}\log(p_{kl}) - p_{kl}\log(q_{kl})\big),$$

whose first term is a constant with respect to $q_{kl}$ and thus to $r_{kl}$. We have:

$$\mathbb{R} \ni \frac{\partial c_{1}}{\partial r_{ij}} = -\sum_{k\neq l} p_{kl}\frac{\partial\big(\log(q_{kl})\big)}{\partial r_{ij}}.$$

According to Eq. (4), the $q_{kl}$ is:

$$q_{kl} := \frac{\exp(-z_{kl}^{2})}{\sum_{f\neq k}\exp(-z_{kf}^{2})} = \frac{\exp(-r_{kl})}{\sum_{f\neq k}\exp(-r_{kf})}.$$

We take the denominator of $q_{kl}$ as:

$$\beta := \sum_{f\neq k}\exp(-z_{kf}^{2}) = \sum_{f\neq k}\exp(-r_{kf}). \qquad (10)$$

We have $\log(q_{kl}) = \log(q_{kl}) + \log\beta - \log\beta = \log(q_{kl}\beta) - \log\beta$. Therefore:

$$\therefore~~~ \frac{\partial c_{1}}{\partial r_{ij}} = -\sum_{k\neq l} p_{kl}\frac{\partial\big(\log(q_{kl}\beta)-\log\beta\big)}{\partial r_{ij}} = -\sum_{k\neq l} p_{kl}\bigg[\frac{\partial\big(\log(q_{kl}\beta)\big)}{\partial r_{ij}} - \frac{\partial\big(\log\beta\big)}{\partial r_{ij}}\bigg] = -\sum_{k\neq l} p_{kl}\bigg[\frac{1}{q_{kl}\beta}\frac{\partial\big(q_{kl}\beta\big)}{\partial r_{ij}} - \frac{1}{\beta}\frac{\partial\beta}{\partial r_{ij}}\bigg].$$

The $q_{kl}\beta$ is:

$$q_{kl}\beta = \frac{\exp(-r_{kl})}{\sum_{f\neq k}\exp(-r_{kf})} \times \sum_{f\neq k}\exp(-r_{kf}) = \exp(-r_{kl}).$$

Therefore, we have:

$$\therefore~~~ \frac{\partial c_{1}}{\partial r_{ij}} = -\sum_{k\neq l} p_{kl}\bigg[\frac{1}{q_{kl}\beta}\frac{\partial\big(\exp(-r_{kl})\big)}{\partial r_{ij}} - \frac{1}{\beta}\frac{\partial\beta}{\partial r_{ij}}\bigg].$$

The $\partial\big(\exp(-r_{kl})\big)/\partial r_{ij}$ is non-zero only for $k=i$ and $l=j$; therefore:

$$\frac{\partial\big(\exp(-r_{ij})\big)}{\partial r_{ij}} = -\exp(-r_{ij}), \qquad \frac{\partial\beta}{\partial r_{ij}} = \frac{\partial\sum_{f\neq k}\exp(-r_{kf})}{\partial r_{ij}} = \frac{\partial\exp(-r_{ij})}{\partial r_{ij}} = -\exp(-r_{ij}).$$

Therefore:

$$\therefore~~~ \frac{\partial c_{1}}{\partial r_{ij}} = -\bigg(p_{ij}\Big[\frac{-1}{q_{ij}\beta}\exp(-r_{ij})\Big] + 0 + \dots + 0\bigg) - \sum_{k\neq l} p_{kl}\Big[\frac{1}{\beta}\exp(-r_{ij})\Big].$$

We have $\sum_{k\neq l} p_{kl} = 1$ because the summation of all possible probabilities is one. Thus:

$$\frac{\partial c_{1}}{\partial r_{ij}} = -p_{ij}\Big[\frac{-1}{q_{ij}\beta}\exp(-r_{ij})\Big] - \Big[\frac{1}{\beta}\exp(-r_{ij})\Big] = \underbrace{\frac{\exp(-r_{ij})}{\beta}}_{=q_{ij}}\Big[\frac{p_{ij}}{q_{ij}}-1\Big] = p_{ij}-q_{ij}. \qquad (11)$$

Similarly, we have:

$$\frac{\partial c_{1}}{\partial r_{ji}} = p_{ji}-q_{ji}. \qquad (12)$$

Substituting the obtained derivatives in Eq. (9) gives us:

$$\frac{\partial c_{1}}{\partial\boldsymbol{y}_{i}} = 2\sum_{j}(p_{ij}-q_{ij}+p_{ji}-q_{ji})(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}),$$

which is the gradient mentioned in the proposition. Q.E.D. ∎

The update of the embedded point $\boldsymbol{y}_i$ is done by gradient descent. Every iteration is:

$$\Delta\boldsymbol{y}_{i}^{(t)} := -\eta\,\frac{\partial c_{1}}{\partial\boldsymbol{y}_{i}} + \alpha(t)\,\Delta\boldsymbol{y}_{i}^{(t-1)}, \qquad (13)$$
$$\boldsymbol{y}_{i}^{(t)} := \boldsymbol{y}_{i}^{(t-1)} + \Delta\boldsymbol{y}_{i}^{(t)},$$

where momentum is used for better convergence (Qian, 1999). The $\alpha(t)$ is the momentum; it can be smaller for the initial iterations and larger for the later iterations. For example, we can have (van der Maaten & Hinton, 2008):

$$\alpha(t) := \left\{\begin{array}{ll} 0.5 & t < 250, \\ 0.8 & t \geq 250. \end{array}\right. \qquad (16)$$

In the original paper of SNE (Hinton & Roweis, 2003), the momentum term is not mentioned but it is suggested in (van der Maaten & Hinton, 2008).

The $\eta$ is the learning rate, which can be a small positive constant (e.g., $\eta = 0.1$) or can be updated adaptively according to (Jacobs, 1988).

Moreover, in both (Hinton & Roweis, 2003) and (van der Maaten & Hinton, 2008), it is mentioned that, in SNE, we should add some Gaussian noise (random jitter) to the solutions of the first iterations before proceeding to the next iterations. This helps avoid local optima.
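To make the optimization concrete, here is a minimal Python sketch (our own illustration; the function names, the learning rate, and the jitter schedule are assumptions) of the SNE updates of Eqs. (7), (13), and (16), given a precomputed matrix of $p_{ij}$'s and an initial embedding:

```python
import numpy as np

def sne_q(Y):
    # q_{ij} of Eq. (4): row-wise softmax of negative squared distances.
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dists, np.inf)            # exclude q_{ii}
    Q = np.exp(-sq_dists)
    Q /= Q.sum(axis=1, keepdims=True)
    return Q

def sne_gradient(P, Q, Y):
    # Eq. (7): dc1/dy_i = 2 * sum_j (p_ij - q_ij + p_ji - q_ji)(y_i - y_j).
    coeff = P - Q + P.T - Q.T
    return 2.0 * (coeff.sum(axis=1)[:, None] * Y - coeff @ Y)

def sne_optimize(P, Y0, eta=0.1, n_iter=1000):
    Y = Y0.copy()
    delta = np.zeros_like(Y)                      # momentum buffer of Eq. (13)
    for t in range(n_iter):
        alpha = 0.5 if t < 250 else 0.8           # momentum schedule of Eq. (16)
        grad = sne_gradient(P, sne_q(Y), Y)
        delta = -eta * grad + alpha * delta
        Y = Y + delta
        if t < 50:                                # random jitter in the early iterations
            Y += np.random.normal(scale=1e-2, size=Y.shape)
    return Y
```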

3 Symmetric Stochastic Neighbor Embedding

In symmetric SNE (van der Maaten & Hinton, 2008), we consider a Gaussian probability distribution around every point $\boldsymbol{x}_i$. The probability that the point $\boldsymbol{x}_i \in \mathbb{R}^d$ takes $\boldsymbol{x}_j \in \mathbb{R}^d$ as its neighbor is:

$$\mathbb{R} \ni p_{ij} := \frac{\exp(-d_{ij}^{2})}{\sum_{k\neq l}\exp(-d_{kl}^{2})}, \qquad (17)$$

where:

$$\mathbb{R} \ni d_{ij}^{2} := \frac{\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\|_{2}^{2}}{2\sigma_{i}^{2}}. \qquad (18)$$

Note that the denominator of Eq. (17) is the same for all points and thus it is symmetric in $i$ and $j$. Compare this with Eq. (2), which is not symmetric.

The $\sigma_i^2$ is the variance which we consider for the Gaussian distribution used for $\boldsymbol{x}_i$. It can be set to a fixed number or determined by a binary search so that the entropy of the distribution attains some specific value (Hinton & Roweis, 2003).

Eq. (17) has a problem with outliers. If the point $\boldsymbol{x}_i$ is an outlier, its $p_{ij}$ will be extremely small, because the denominator is fixed for every point while the numerator is small for the outlier. However, if we use Eq. (2) for $p_{ij}$, the denominator is not the same for all points; therefore, the denominator for an outlier will also be small, alleviating the problem of the small numerator. Because of this problem, we do not use Eq. (17) but rather:

$$\mathbb{R} \ni p_{ij} := \frac{p_{i|j}+p_{j|i}}{2n}, \qquad (19)$$

where:

$$\mathbb{R} \ni p_{j|i} := \frac{\exp(-d_{ij}^{2})}{\sum_{k\neq i}\exp(-d_{ik}^{2})}, \qquad (20)$$

is the probability that $\boldsymbol{x}_i \in \mathbb{R}^d$ takes $\boldsymbol{x}_j \in \mathbb{R}^d$ as its neighbor.

In the low-dimensional embedding space, we consider a Gaussian probability distribution for the point $\boldsymbol{y}_i \in \mathbb{R}^p$ to take $\boldsymbol{y}_j \in \mathbb{R}^p$ as its neighbor, and we make it symmetric (with a fixed denominator for all points):

$$\mathbb{R} \ni q_{ij} := \frac{\exp(-z_{ij}^{2})}{\sum_{k\neq l}\exp(-z_{kl}^{2})}, \qquad (21)$$

where:

$$\mathbb{R} \ni z_{ij}^{2} := \|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2}. \qquad (22)$$

Note that Eq. (21) does not have the outlier problem of Eq. (17) because, even for an outlier, the embedded points are initialized close together and not far apart.

We want the probability distributions in both the input and embedded spaces to be as similar as possible; therefore, the cost function to be minimized can be the summation of the Kullback-Leibler (KL) divergences (Kullback, 1997) over the $n$ points:

$$\mathbb{R} \ni c_{2} := \sum_{i=1}^{n}\text{KL}(P_{i}\,\|\,Q_{i}) = \sum_{i=1}^{n}\sum_{j=1, j\neq i}^{n} p_{ij}\log\Big(\frac{p_{ij}}{q_{ij}}\Big), \qquad (23)$$

where $p_{ij}$ and $q_{ij}$ are given by Eqs. (19) and (21).

Proposition 2.

The gradient of $c_2$ with respect to $\boldsymbol{y}_i$ is:

$$\mathbb{R}^{p} \ni \frac{\partial c_{2}}{\partial\boldsymbol{y}_{i}} = 4\sum_{j=1}^{n}(p_{ij}-q_{ij})(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}), \qquad (24)$$

where $p_{ij}$ and $q_{ij}$ are given by Eqs. (19) and (21), and $p_{ii} = q_{ii} = 0$.

Proof.

Proof is inspired by (van der Maaten & Hinton, 2008).

Similar to Eq. (9), we have:

$$\frac{\partial c_{2}}{\partial\boldsymbol{y}_{i}} = 2\sum_{j}\Big(\frac{\partial c_{2}}{\partial r_{ij}} + \frac{\partial c_{2}}{\partial r_{ji}}\Big)(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}). \qquad (25)$$

Similar to the derivation of Eqs. (11) and (12), we can derive:

$$\frac{\partial c_{2}}{\partial r_{ij}} = p_{ij}-q_{ij}, \quad\text{and}\quad \frac{\partial c_{2}}{\partial r_{ji}} = p_{ji}-q_{ji}, \qquad (26)$$

respectively. In the symmetric SNE, we have:

$$\frac{\partial c_{2}}{\partial r_{ji}} = p_{ji}-q_{ji} \overset{(a)}{=} p_{ij}-q_{ij}, \qquad (27)$$

where $(a)$ is because, in symmetric SNE, $p_{ij}$ and $q_{ij}$ are symmetric in $i$ and $j$ according to Eqs. (19) and (21).

Substituting Eqs. (26) and (27) in Eq. (25) gives us:

$$\frac{\partial c_{2}}{\partial\boldsymbol{y}_{i}} = 4\sum_{j}(p_{ij}-q_{ij})(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}),$$

which is the gradient mentioned in the proposition. Q.E.D. ∎

The update of the embedded point $\boldsymbol{y}_i$ is done by gradient descent, whose iterations are as in Eq. (13) with $c_1$ replaced by $c_2$. Note that the momentum term can be omitted in the symmetric SNE. As in SNE, in symmetric SNE, we should add some Gaussian noise (random jitter) to the solutions of the first iterations before proceeding to the next iterations. This helps avoid local optima.
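As a small illustration, the following Python sketch (our own; the function names are assumptions, and the per-point variances are supposed to have been found beforehand, e.g., by the binary search sketched in Section 2) computes the symmetric $p_{ij}$ of Eqs. (19)-(20), the $q_{ij}$ of Eq. (21), and the gradient of Eq. (24):

```python
import numpy as np

def symmetric_p(X, sigma2):
    # p_{j|i} of Eq. (20), then symmetrized as in Eq. (19).
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2.0 * sigma2[:, None])
    np.fill_diagonal(logits, -np.inf)             # p_{i|i} = 0
    P_cond = np.exp(logits)
    P_cond /= P_cond.sum(axis=1, keepdims=True)   # each row sums to one
    return (P_cond + P_cond.T) / (2.0 * n)        # Eq. (19)

def symmetric_q(Y):
    # q_{ij} of Eq. (21): Gaussian kernel normalized by the sum over all pairs.
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = np.exp(-sq_dists)
    np.fill_diagonal(num, 0.0)
    return num / num.sum()

def symmetric_sne_gradient(P, Q, Y):
    # Eq. (24): dc2/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j).
    coeff = P - Q
    return 4.0 * (coeff.sum(axis=1)[:, None] * Y - coeff @ Y)
```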

4 t-distributed Stochastic Neighbor Embedding (t-SNE)

4.1 The Crowding Problem

In SNE (Hinton & Roweis, 2003), we consider Gaussian distributions for both the input and embedded spaces. That is fine for the input space because it already has a high dimensionality. However, when we embed the high-dimensional data into a low-dimensional space, it is very hard to fit the information of all the points into the same small neighborhood area. As an analogy, suppose dimensionality is like the size of a room. In high dimensionality, we have a large hall containing a huge crowd of people. Now, we want to fit all the people into a small room; of course, we cannot! This problem is referred to as the crowding problem.

The main idea of t-SNE (van der Maaten & Hinton, 2008) is to address the crowding problem which exists in SNE (Hinton & Roweis, 2003). In the example of fitting people in a room, t-SNE enlarges the room to solve the crowding problem. Therefore, in the formulation of t-SNE, we use the Student-t distribution (Gosset (1908), Student) rather than the Gaussian distribution for the low-dimensional embedding space. This is because the Student-t distribution has heavier tails than the Gaussian distribution, which is like a larger room, and can fit the information of the high-dimensional data in the low-dimensional embedding space.

As we will see later, the $q_{ij}$ in t-SNE is:

$$q_{ij} = \frac{(1+z_{ij}^{2})^{-1}}{\sum_{k\neq l}(1+z_{kl}^{2})^{-1}},$$

which is based on the standard Cauchy distribution:

$$f(z) = \frac{1}{\pi(1+z^{2})}, \qquad (28)$$

where the $\pi$ is canceled between the numerator and the normalizing denominator in $q_{ij}$ (see the explanation of this trick in Section 2).

If the Student-t distribution (Gosset (1908), Student) with a general degree of freedom $\delta$ is used, we would have:

$$f(z) = \frac{\Gamma(\frac{\delta+1}{2})}{\sqrt{\delta\pi}\,\Gamma(\frac{\delta}{2})}\Big(1+\frac{z^{2}}{\delta}\Big)^{-\frac{\delta+1}{2}}, \qquad (29)$$

where $\Gamma$ is the gamma function. Cancelling out the scaling factors between the numerator and denominator, we would have (van der Maaten, 2009):

$$q_{ij} = \frac{(1+z_{ij}^{2}/\delta)^{-(\delta+1)/2}}{\sum_{k\neq l}(1+z_{kl}^{2}/\delta)^{-(\delta+1)/2}}. \qquad (30)$$

However, as the first degree of freedom has the heaviest tails amongst all degrees of freedom, it is the most suitable for the crowding problem; hence, we use the first degree of freedom, which gives the Cauchy distribution. Note that the t-SNE algorithm, which uses the Cauchy distribution, may also be called Cauchy-SNE. Later, t-SNE with general degrees of freedom was proposed (van der Maaten, 2009), which we explain in Section 5.

4.2 t-SNE Formulation

In t-SNE (van der Maaten & Hinton, 2008), we consider a Gaussian probability distribution around every point $\boldsymbol{x}_i$ in the input space because the crowding problem does not exist for the high-dimensional data. The probability that the point $\boldsymbol{x}_i \in \mathbb{R}^d$ takes $\boldsymbol{x}_j \in \mathbb{R}^d$ as its neighbor is:

$$\mathbb{R} \ni p_{j|i} := \frac{\exp(-d_{ij}^{2})}{\sum_{k\neq i}\exp(-d_{ik}^{2})}, \qquad (31)$$

where:

$$\mathbb{R} \ni d_{ij}^{2} := \frac{\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\|_{2}^{2}}{2\sigma_{i}^{2}}. \qquad (32)$$

Note that Eq. (31) is not symmetric in $i$ and $j$ because of its denominator. We take the symmetric $p_{ij}$ as the scaled average of $p_{i|j}$ and $p_{j|i}$:

$$\mathbb{R} \ni p_{ij} := \frac{p_{i|j}+p_{j|i}}{2n}. \qquad (33)$$

In the low-dimensional embedding space, we consider a Student-t distribution with one degree of freedom (the Cauchy distribution) for the point $\boldsymbol{y}_i \in \mathbb{R}^p$ to take $\boldsymbol{y}_j \in \mathbb{R}^p$ as its neighbor:

$$\mathbb{R} \ni q_{ij} := \frac{(1+z_{ij}^{2})^{-1}}{\sum_{k\neq l}(1+z_{kl}^{2})^{-1}}, \qquad (34)$$

where:

$$\mathbb{R} \ni z_{ij}^{2} := \|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2}. \qquad (35)$$

We want the probability distributions in both the input and embedded spaces to be as similar as possible; therefore, the cost function to be minimized can be the summation of the Kullback-Leibler (KL) divergences (Kullback, 1997) over the $n$ points:

$$\mathbb{R} \ni c_{3} := \sum_{i=1}^{n}\text{KL}(P_{i}\,\|\,Q_{i}) = \sum_{i=1}^{n}\sum_{j=1, j\neq i}^{n} p_{ij}\log\Big(\frac{p_{ij}}{q_{ij}}\Big), \qquad (36)$$

where $p_{ij}$ and $q_{ij}$ are given by Eqs. (33) and (34).

Proposition 3.

The gradient of $c_3$ with respect to $\boldsymbol{y}_i$ is:

$$\frac{\partial c_{3}}{\partial\boldsymbol{y}_{i}} = 4\sum_{j=1}^{n}(p_{ij}-q_{ij})(1+\|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2})^{-1}(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}), \qquad (37)$$

where $p_{ij}$ and $q_{ij}$ are given by Eqs. (33) and (34), and $p_{ii} = q_{ii} = 0$.

Proof.

Proof is according to (van der Maaten & Hinton, 2008). Let:

$$\mathbb{R} \ni r_{ij} := z_{ij}^{2} = \|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2}. \qquad (38)$$

By changing $\boldsymbol{y}_i$, we only impact $z_{ij}$ and $z_{ji}$ for all $j$'s. According to the chain rule, we have:

$$\mathbb{R}^{p} \ni \frac{\partial c_{3}}{\partial\boldsymbol{y}_{i}} = \sum_{j}\Big(\frac{\partial c_{3}}{\partial r_{ij}}\frac{\partial r_{ij}}{\partial\boldsymbol{y}_{i}} + \frac{\partial c_{3}}{\partial r_{ji}}\frac{\partial r_{ji}}{\partial\boldsymbol{y}_{i}}\Big).$$

According to Eq. (38), we have:

$$r_{ij}=\|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2} \implies \frac{\partial r_{ij}}{\partial\boldsymbol{y}_{i}}=2(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}),$$
$$r_{ji}=\|\boldsymbol{y}_{j}-\boldsymbol{y}_{i}\|_{2}^{2}=\|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2} \implies \frac{\partial r_{ji}}{\partial\boldsymbol{y}_{i}}=2(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}).$$

Therefore:

$$\therefore~~~ \frac{\partial c_{3}}{\partial\boldsymbol{y}_{i}} = 2\sum_{j}\Big(\frac{\partial c_{3}}{\partial r_{ij}}+\frac{\partial c_{3}}{\partial r_{ji}}\Big)(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}). \qquad (39)$$

The cost function can be re-written as:

$$c_{3} = \sum_{k}\sum_{l\neq k} p_{kl}\log\Big(\frac{p_{kl}}{q_{kl}}\Big) = \sum_{k\neq l} p_{kl}\log\Big(\frac{p_{kl}}{q_{kl}}\Big) = \sum_{k\neq l}\big(p_{kl}\log(p_{kl}) - p_{kl}\log(q_{kl})\big),$$

whose first term is a constant with respect to $q_{kl}$ and thus to $r_{kl}$. We have:

$$\mathbb{R} \ni \frac{\partial c_{3}}{\partial r_{ij}} = -\sum_{k\neq l} p_{kl}\frac{\partial\big(\log(q_{kl})\big)}{\partial r_{ij}}.$$

According to Eq. (34), the $q_{kl}$ is:

$$q_{kl} := \frac{(1+z_{kl}^{2})^{-1}}{\sum_{m\neq f}(1+z_{mf}^{2})^{-1}} = \frac{(1+r_{kl})^{-1}}{\sum_{m\neq f}(1+r_{mf})^{-1}}.$$

We take the denominator of $q_{kl}$ as:

$$\beta := \sum_{m\neq f}(1+z_{mf}^{2})^{-1} = \sum_{m\neq f}(1+r_{mf})^{-1}. \qquad (40)$$

We have $\log(q_{kl}) = \log(q_{kl}) + \log\beta - \log\beta = \log(q_{kl}\beta) - \log\beta$. Therefore:

$$\therefore~~~ \frac{\partial c_{3}}{\partial r_{ij}} = -\sum_{k\neq l} p_{kl}\frac{\partial\big(\log(q_{kl}\beta)-\log\beta\big)}{\partial r_{ij}} = -\sum_{k\neq l} p_{kl}\bigg[\frac{\partial\big(\log(q_{kl}\beta)\big)}{\partial r_{ij}} - \frac{\partial\big(\log\beta\big)}{\partial r_{ij}}\bigg] = -\sum_{k\neq l} p_{kl}\bigg[\frac{1}{q_{kl}\beta}\frac{\partial\big(q_{kl}\beta\big)}{\partial r_{ij}} - \frac{1}{\beta}\frac{\partial\beta}{\partial r_{ij}}\bigg].$$

The $q_{kl}\beta$ is:

$$q_{kl}\beta = \frac{(1+r_{kl})^{-1}}{\sum_{m\neq f}(1+r_{mf})^{-1}} \times \sum_{m\neq f}(1+r_{mf})^{-1} = (1+r_{kl})^{-1}.$$

Therefore, we have:

$$\therefore~~~ \frac{\partial c_{3}}{\partial r_{ij}} = -\sum_{k\neq l} p_{kl}\bigg[\frac{1}{q_{kl}\beta}\frac{\partial\big((1+r_{kl})^{-1}\big)}{\partial r_{ij}} - \frac{1}{\beta}\frac{\partial\beta}{\partial r_{ij}}\bigg].$$

The $\partial\big((1+r_{kl})^{-1}\big)/\partial r_{ij}$ is non-zero only for $k=i$ and $l=j$; therefore:

$$\frac{\partial\big((1+r_{ij})^{-1}\big)}{\partial r_{ij}} = -(1+r_{ij})^{-2}, \qquad \frac{\partial\beta}{\partial r_{ij}} = \frac{\partial\sum_{m\neq f}(1+r_{mf})^{-1}}{\partial r_{ij}} = \frac{\partial(1+r_{ij})^{-1}}{\partial r_{ij}} = -(1+r_{ij})^{-2}.$$

Therefore:

$$\therefore~~~ \frac{\partial c_{3}}{\partial r_{ij}} = -\bigg(p_{ij}\Big[\frac{-1}{q_{ij}\beta}(1+r_{ij})^{-2}\Big] + 0 + \dots + 0\bigg) - \sum_{k\neq l} p_{kl}\Big[\frac{1}{\beta}(1+r_{ij})^{-2}\Big].$$

We have $\sum_{k\neq l} p_{kl} = 1$ because the summation of all possible probabilities is one. Thus:

$$\frac{\partial c_{3}}{\partial r_{ij}} = -p_{ij}\Big[\frac{-1}{q_{ij}\beta}(1+r_{ij})^{-2}\Big] - \Big[\frac{1}{\beta}(1+r_{ij})^{-2}\Big] = (1+r_{ij})^{-1}\underbrace{\frac{(1+r_{ij})^{-1}}{\beta}}_{=q_{ij}}\Big[\frac{p_{ij}}{q_{ij}}-1\Big] = (1+r_{ij})^{-1}(p_{ij}-q_{ij}).$$

Similarly, we have:

$$\frac{\partial c_{3}}{\partial r_{ji}} = (1+r_{ji})^{-1}(p_{ji}-q_{ji}) \overset{(a)}{=} (1+r_{ij})^{-1}(p_{ij}-q_{ij}),$$

where $(a)$ is because, in t-SNE, $p_{ij}$, $q_{ij}$, and $r_{ij}$ are symmetric in $i$ and $j$ according to Eqs. (33), (34), and (38).

Substituting the obtained derivatives in Eq. (39) gives us:

$$\frac{\partial c_{3}}{\partial\boldsymbol{y}_{i}} = 4\sum_{j}(p_{ij}-q_{ij})(1+r_{ij})^{-1}(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}),$$

which is the gradient mentioned in the proposition. Q.E.D.

Note that in (van der Maaten & Hinton, 2008), the proof uses $z_{ij}$ rather than $r_{ij} = z_{ij}^{2}$ in Eq. (38) and the rest of the proof. In our opinion, it is better to use $z_{ij}^{2}$ rather than $z_{ij}$ for the proof. ∎

The update of the embedded point $\boldsymbol{y}_i$ is done by gradient descent, whose iterations are as in Eq. (13) with $c_1$ replaced by $c_3$. For t-SNE, there is no need to add jitter to the solutions of the initial iterations (van der Maaten & Hinton, 2008) because it is more robust than SNE. The $\alpha(t)$ is the momentum, which can be updated according to Eq. (16). The $\eta$ is the learning rate, which can be a small positive constant (e.g., $\eta = 0.1$) or can be updated according to (Jacobs, 1988) (in (van der Maaten & Hinton, 2008), the initial $\eta$ is $100$).

Note that in (van der Maaten & Hinton, 2008), the update of $\boldsymbol{y}_i^{(t)}$ is $\Delta\boldsymbol{y}_{i}^{(t)} := +\eta\,\frac{\partial c_{3}}{\partial\boldsymbol{y}_{i}} + \alpha(t)\,\Delta\boldsymbol{y}_{i}^{(t-1)}$, which we think is a typo in that paper, because the positive direction of the gradient is used in gradient ascent for maximizing, and not minimizing, the objective function. We also checked the implementation of t-SNE in the Python scikit-learn library (https://github.com/scikit-learn/scikit-learn/blob/7b136e9/sklearn/manifold/t_sne.py) and it uses gradient descent and not gradient ascent.
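Putting the pieces together, the following Python sketch (our own illustration; the function names, the constants, and the small random initialization are assumptions, and early exaggeration and adaptive learning rates are omitted) runs the t-SNE gradient descent of Eqs. (13), (16), and (37) on a precomputed matrix of $p_{ij}$'s from Eq. (33):

```python
import numpy as np

def tsne_q(Y):
    # q_{ij} of Eq. (34): Cauchy (Student-t with one degree of freedom) kernel.
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(num, 0.0)                    # q_{ii} = 0
    return num / num.sum(), num                   # normalized Q and the unnormalized kernel

def tsne_gradient(P, Y):
    # Eq. (37): dc3/dy_i = 4 * sum_j (p_ij - q_ij)(1 + ||y_i - y_j||^2)^{-1}(y_i - y_j).
    Q, num = tsne_q(Y)
    coeff = (P - Q) * num
    return 4.0 * (coeff.sum(axis=1)[:, None] * Y - coeff @ Y)

def tsne(P, p=2, eta=100.0, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.standard_normal((P.shape[0], p))   # small random initialization
    delta = np.zeros_like(Y)
    for t in range(n_iter):
        alpha = 0.5 if t < 250 else 0.8           # momentum schedule of Eq. (16)
        delta = -eta * tsne_gradient(P, Y) + alpha * delta   # Eq. (13), gradient descent
        Y = Y + delta
    return Y
```

In practice, one can simply call the scikit-learn implementation, e.g., `sklearn.manifold.TSNE(n_components=2, perplexity=30).fit_transform(X)`, which implements the same formulation together with tree-based (Barnes-Hut) acceleration.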

4.3 Early Exaggeration

In t-SNE, it is better to multiply all $p_{ij}$'s by a constant (e.g., $4$) in the initial iterations:

$$p_{ij} := p_{ij}\times 4, \qquad (41)$$

which is called early exaggeration. This heuristic makes the optimization focus more on the large $p_{ij}$'s (close neighbors) in the early iterations, because large $p_{ij}$'s are affected more by the multiplication by $4$ than the small $p_{ij}$'s. After the neighbors are embedded close to one another, we can drop this multiplication and let far-away points be embedded using the probabilities without the multiplication. Note that early exaggeration is optional and not mandatory.
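As a small illustration (our own sketch, continuing the variables P, Y, delta, eta, and n_iter and the tsne_gradient helper of the previous code block; the factor 4 and the cutoff of 100 iterations are assumptions), early exaggeration only changes which matrix of $p_{ij}$'s is fed to the gradient in the early iterations:

```python
# Early exaggeration: use 4 * P for the first iterations, then the original P.
P_exaggerated = 4.0 * P
for t in range(n_iter):
    P_t = P_exaggerated if t < 100 else P        # drop the exaggeration after 100 iterations
    alpha = 0.5 if t < 250 else 0.8
    delta = -eta * tsne_gradient(P_t, Y) + alpha * delta
    Y = Y + delta
```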

5 General Degrees of Freedom in t-SNE

We can have general degrees of freedom for the Student-t distribution in t-SNE (van der Maaten, 2009). As we saw in Eqs. (29) and (30), we can have any degree of freedom for $q_{ij}$ (note that $\delta$ is a positive integer). We repeat Eq. (30) here for convenience:

$$q_{ij} = \frac{(1+z_{ij}^{2}/\delta)^{-(\delta+1)/2}}{\sum_{k\neq l}(1+z_{kl}^{2}/\delta)^{-(\delta+1)/2}}. \qquad (42)$$

If $\delta \rightarrow \infty$, the Student-t distribution formulated in Eq. (29) tends to the Gaussian distribution used in SNE (Hinton & Roweis, 2003). SNE and t-SNE use the degrees $\delta \rightarrow \infty$ and $\delta = 1$ in Eq. (42), respectively. Note that the kernel $q_{ij}$ in the low-dimensional space does not necessarily need to be a probability density; it suffices for it to be a decaying function. It has been shown in (Kobak et al., 2019) that degrees $\delta < 1$ also work properly for embedding.
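A minimal Python sketch of Eq. (42) (our own; the function name is an assumption), which recovers the Cauchy kernel of the original t-SNE for $\delta = 1$:

```python
import numpy as np

def q_general_dof(Y, dof=1.0):
    # q_{ij} of Eq. (42) with a general degree of freedom; dof = 1 gives the
    # Cauchy kernel of the original t-SNE, and dof -> infinity tends to SNE.
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = (1.0 + sq_dists / dof) ** (-(dof + 1.0) / 2.0)
    np.fill_diagonal(num, 0.0)
    return num / num.sum()
```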

There are three ways to determine $\delta$ (van der Maaten, 2009):

  1. We can set $\delta$ to be fixed. For example, $\delta = 1$ is used in the original t-SNE (van der Maaten & Hinton, 2008), which uses the Cauchy distribution of Eq. (34).

  2. The problem of the first approach is that it does not consider the relation of the crowding problem with the dimensionality of the embedding space. Recall the crowding problem discussed in Section 4.1. On the one hand, as Eq. (29) shows, the degree of freedom appears in the exponent, so the tail thickness of the Student-t distribution decreases exponentially with $\delta$. On the other hand, the volume of a hyper-sphere grows exponentially with the dimension; for example, in two and three dimensions, the volume is $\pi r^2$ and $(4/3)\pi r^3$, respectively, where $r$ is the radius. The crowding volume available in the embedding space to store the embedded data points is $\propto \pi r^h$ and grows exponentially with $h$ (the dimensionality of the embedding space). Therefore, the relation of $\delta$ and $h$ is linear, i.e., $h \propto \delta$. In order to be consistent with the original t-SNE (van der Maaten & Hinton, 2008), we take $\delta = h - 1$, which gives $\delta = 1$ for $h = 2$ (van der Maaten, 2009).

  3. The problem of the second approach is that $\delta$ might not "only" depend on $h$. In this approach, we find the best $\delta$ which minimizes the cost $c_3$, i.e., Eq. (36), (van der Maaten, 2009), where $p_{ij}$ is obtained using Eqs. (31) and (33) and $q_{ij}$ is Eq. (30). We use gradient descent (Boyd & Vandenberghe, 2004) for the optimization of both $\delta$ and $\{\boldsymbol{y}_i\}_{i=1}^n$, where the gradients are given by Eqs. (48) and (49) below. The parametric t-SNE (van der Maaten, 2009) has used the restricted Boltzmann machine (Hinton & Salakhutdinov, 2006; Hinton, 2012; Ghojogh et al., 2021b) to learn the optimal $\delta$ and $\{\boldsymbol{y}_i\}_{i=1}^n$ by a neural network. One can also use an alternating optimization approach (Jain & Kar, 2017) to solve for both $\delta$ and $\{\boldsymbol{y}_i\}_{i=1}^n$: the $\{\boldsymbol{y}_i\}_{i=1}^n$ are updated with gradient descent using Eq. (49); then, the degree $\delta$ is updated with gradient descent using Eq. (48), and this procedure is repeated until convergence.

     Note that the degree $\delta$ is an integer greater than or equal to one, whereas the gradient in Eq. (48) is a real number. For updating the degree using gradient descent in the alternating optimization approach, one can update the degree using the sign of the gradient, i.e.:

     $$\delta := \delta - \text{sign}\Big(\frac{\partial c_{3}}{\partial\delta}\Big), \qquad (43)$$

     because the direction of the update is opposite to the gradient direction.

For convenience, we list $p_{j|i}$, $p_{ij}$, $q_{ij}$, and $c_3$ here again:

$$\mathbb{R} \ni p_{j|i} := \frac{\exp(-d_{ij}^{2})}{\sum_{k\neq i}\exp(-d_{ik}^{2})}, \qquad (44)$$
$$\mathbb{R} \ni p_{ij} := \frac{p_{i|j}+p_{j|i}}{2n}, \qquad (45)$$
$$\mathbb{R} \ni q_{ij} = \frac{(1+z_{ij}^{2}/\delta)^{-(\delta+1)/2}}{\sum_{k\neq l}(1+z_{kl}^{2}/\delta)^{-(\delta+1)/2}}, \qquad (46)$$
$$\mathbb{R} \ni c_{3} := \sum_{i}\text{KL}(P_{i}\,\|\,Q_{i}) = \sum_{i}\sum_{j\neq i} p_{ij}\log\Big(\frac{p_{ij}}{q_{ij}}\Big). \qquad (47)$$
Proposition 4.

The gradient of $c_3$ with respect to $\delta$ is:

$$\frac{\partial c_{3}}{\partial\delta} = \sum_{i\neq j}\bigg(\frac{-(1+\delta)\,z_{ij}^{2}}{2\delta^{2}\big(1+\frac{z_{ij}^{2}}{\delta}\big)} + \frac{1}{2}\log\Big(1+\frac{z_{ij}^{2}}{\delta}\Big)\bigg)(p_{ij}-q_{ij}), \qquad (48)$$

where $p_{ij}$ and $q_{ij}$ are given by Eqs. (45) and (46), respectively, and $z_{ij}^{2} := \|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2}$.

No matter which of the three ways of determining $\delta$ is used, we need to optimize the cost function $c_3$ (Eq. (47)) using gradient descent.

Proposition 5.

The gradient of $c_3$ with respect to $\boldsymbol{y}_i$ is:

$$\frac{\partial c_{3}}{\partial\boldsymbol{y}_{i}} = \frac{2\delta+2}{\delta}\sum_{j}(p_{ij}-q_{ij})\Big(1+\frac{\|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2}}{\delta}\Big)^{-1}(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}), \qquad (49)$$

where $p_{ij}$ and $q_{ij}$ are given by Eqs. (45) and (46), respectively.

Proof.

Let:

$$\mathbb{R} \ni r_{ij} := z_{ij}^{2} = \|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2}. \qquad (50)$$

By changing $\boldsymbol{y}_i$, we only impact $z_{ij}$ and $z_{ji}$ for all $j$'s. Considering Eq. (50) and according to the chain rule, we have:

$$\mathbb{R}^{p} \ni \frac{\partial c_{3}}{\partial\boldsymbol{y}_{i}} = \sum_{j}\Big(\frac{\partial c_{3}}{\partial r_{ij}}\frac{\partial r_{ij}}{\partial\boldsymbol{y}_{i}} + \frac{\partial c_{3}}{\partial r_{ji}}\frac{\partial r_{ji}}{\partial\boldsymbol{y}_{i}}\Big).$$

According to Eq. (50), we have:

$$r_{ij}=\|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2} \implies \frac{\partial r_{ij}}{\partial\boldsymbol{y}_{i}}=2(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}),$$
$$r_{ji}=\|\boldsymbol{y}_{j}-\boldsymbol{y}_{i}\|_{2}^{2}=\|\boldsymbol{y}_{i}-\boldsymbol{y}_{j}\|_{2}^{2} \implies \frac{\partial r_{ji}}{\partial\boldsymbol{y}_{i}}=2(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}).$$

Therefore:

$$\therefore~~~ \frac{\partial c_{3}}{\partial\boldsymbol{y}_{i}} = 2\sum_{j}\Big(\frac{\partial c_{3}}{\partial r_{ij}}+\frac{\partial c_{3}}{\partial r_{ji}}\Big)(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}). \qquad (51)$$

The cost function can be re-written as:

$$c_{3} = \sum_{k}\sum_{l\neq k} p_{kl}\log\Big(\frac{p_{kl}}{q_{kl}}\Big) = \sum_{k\neq l} p_{kl}\log\Big(\frac{p_{kl}}{q_{kl}}\Big) = \sum_{k\neq l}\big(p_{kl}\log(p_{kl}) - p_{kl}\log(q_{kl})\big),$$

whose first term is a constant with respect to $q_{kl}$ and thus to $r_{kl}$. We have:

$$\mathbb{R} \ni \frac{\partial c_{3}}{\partial r_{ij}} = -\sum_{k\neq l} p_{kl}\frac{\partial\big(\log(q_{kl})\big)}{\partial r_{ij}}.$$

According to Eq. (46), the $q_{kl}$ is:

$$q_{kl} = \frac{(1+r_{kl}/\delta)^{-(\delta+1)/2}}{\sum_{m\neq f}(1+r_{mf}/\delta)^{-(\delta+1)/2}}.$$

We take the denominator of $q_{kl}$ as:

$$\beta := \sum_{m\neq f}(1+r_{mf}/\delta)^{-(\delta+1)/2}. \qquad (52)$$

We have $\log(q_{kl}) = \log(q_{kl}) + \log\beta - \log\beta = \log(q_{kl}\beta) - \log\beta$. Therefore:

$$\therefore~~~ \frac{\partial c_{3}}{\partial r_{ij}} = -\sum_{k\neq l} p_{kl}\frac{\partial\big(\log(q_{kl}\beta)-\log\beta\big)}{\partial r_{ij}} = -\sum_{k\neq l} p_{kl}\bigg[\frac{\partial\big(\log(q_{kl}\beta)\big)}{\partial r_{ij}} - \frac{\partial\big(\log\beta\big)}{\partial r_{ij}}\bigg] = -\sum_{k\neq l} p_{kl}\bigg[\frac{1}{q_{kl}\beta}\frac{\partial\big(q_{kl}\beta\big)}{\partial r_{ij}} - \frac{1}{\beta}\frac{\partial\beta}{\partial r_{ij}}\bigg].$$

The $q_{kl}\beta$ is:

$$q_{kl}\beta = \frac{(1+r_{kl}/\delta)^{-(\delta+1)/2}}{\sum_{m\neq f}(1+r_{mf}/\delta)^{-(\delta+1)/2}} \times \sum_{m\neq f}(1+r_{mf}/\delta)^{-(\delta+1)/2} = (1+r_{kl}/\delta)^{-(\delta+1)/2}.$$

The $\partial\big((1+r_{kl}/\delta)^{-(\delta+1)/2}\big)/\partial r_{ij}$ is non-zero only for $k=i$ and $l=j$; therefore:

$$\frac{\partial\big(q_{kl}\beta\big)}{\partial r_{ij}} = \frac{\partial\big((1+r_{ij}/\delta)^{-(\delta+1)/2}\big)}{\partial r_{ij}} = -\frac{\delta+1}{2\delta}\Big(1+\frac{r_{ij}}{\delta}\Big)^{-\frac{\delta+3}{2}},$$
$$\frac{\partial\beta}{\partial r_{ij}} = \frac{\partial\sum_{m\neq f}(1+r_{mf}/\delta)^{-(\delta+1)/2}}{\partial r_{ij}} = \frac{\partial\big((1+r_{ij}/\delta)^{-(\delta+1)/2}\big)}{\partial r_{ij}} = -\frac{\delta+1}{2\delta}\Big(1+\frac{r_{ij}}{\delta}\Big)^{-\frac{\delta+3}{2}}.$$

Therefore:

$$\frac{\partial c_{3}}{\partial r_{ij}} = -\bigg(p_{ij}\Big[\frac{-1}{q_{ij}\beta}\frac{\delta+1}{2\delta}\Big(1+\frac{r_{ij}}{\delta}\Big)^{-\frac{\delta+3}{2}}\Big] + 0 + \dots + 0\bigg) - \sum_{k\neq l} p_{kl}\Big[\frac{1}{\beta}\frac{\delta+1}{2\delta}\Big(1+\frac{r_{ij}}{\delta}\Big)^{-\frac{\delta+3}{2}}\Big].$$

We have $\sum_{k\neq l} p_{kl} = 1$ because the summation of all possible probabilities is one. Thus:

$$\frac{\partial c_{3}}{\partial r_{ij}} = -p_{ij}\Big[\frac{-1}{q_{ij}\beta}\frac{\delta+1}{2\delta}\Big(1+\frac{r_{ij}}{\delta}\Big)^{-\frac{\delta+3}{2}}\Big] - \Big[\frac{1}{\beta}\frac{\delta+1}{2\delta}\Big(1+\frac{r_{ij}}{\delta}\Big)^{-\frac{\delta+3}{2}}\Big] = \frac{\delta+1}{2\delta}\Big(1+\frac{r_{ij}}{\delta}\Big)^{-\frac{\delta+3}{2}}\frac{1}{\beta}\Big(\frac{p_{ij}}{q_{ij}}-1\Big) = \frac{\delta+1}{2\delta}\underbrace{\frac{(1+\frac{r_{ij}}{\delta})^{-\frac{\delta+1}{2}}}{\beta}}_{=q_{ij}}\Big(1+\frac{r_{ij}}{\delta}\Big)^{-1}\Big(\frac{p_{ij}}{q_{ij}}-1\Big) = \frac{\delta+1}{2\delta}\Big(1+\frac{r_{ij}}{\delta}\Big)^{-1}(p_{ij}-q_{ij}). \qquad (53)$$

Similarly, we have:

$$\frac{\partial c_{3}}{\partial r_{ji}} = \frac{\delta+1}{2\delta}\Big(1+\frac{r_{ji}}{\delta}\Big)^{-1}(p_{ji}-q_{ji}) \overset{(a)}{=} \frac{\delta+1}{2\delta}\Big(1+\frac{r_{ij}}{\delta}\Big)^{-1}(p_{ij}-q_{ij}),$$

where $(a)$ is because, in t-SNE with general degrees of freedom, $p_{ij}$, $q_{ij}$, and $r_{ij}$ are symmetric in $i$ and $j$ according to Eqs. (45), (46), and (50).

Substituting the obtained derivatives in Eq. (51) gives us:

$$\frac{\partial c_{3}}{\partial\boldsymbol{y}_{i}} = 2\sum_{j}\frac{\delta+1}{\delta}\Big(1+\frac{r_{ij}}{\delta}\Big)^{-1}(p_{ij}-q_{ij})(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}),$$

which is the gradient mentioned in the proposition. Q.E.D.

Note that in (van der Maaten, 2009), the gradient is mentioned to be:

$$\frac{\partial c_{3}}{\partial\boldsymbol{y}_{i}} = 2\sum_{j}\frac{\delta+1}{\delta}\Big(1+\frac{r_{ij}}{\delta}\Big)^{-\frac{\delta+1}{2}}(p_{ij}-q_{ij})(\boldsymbol{y}_{i}-\boldsymbol{y}_{j}), \qquad (54)$$

which we think is incorrect. We conjecture that, in (van der Maaten, 2009), there might have been a small mistake in the derivation of Eq. (53), where possibly $(1+\frac{r_{ij}}{\delta})^{-1}/\beta$, rather than $(1+\frac{r_{ij}}{\delta})^{-\frac{\delta+1}{2}}/\beta$, was taken to be $q_{ij}$, out of the habit of the original t-SNE (van der Maaten & Hinton, 2008). ∎

Comparing Eqs. (37) and (49) shows that the original t-SNE (van der Maaten & Hinton, 2008) is a special case with $\delta = 1$.
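As an illustration of the third approach of Section 5, the following Python sketch (our own; the function names, the learning rate, and the clamping of $\delta$ to be at least one are assumptions) alternates between updating the embedding with the gradient of Eq. (49) and updating the degree of freedom with the sign of the gradient of Eq. (48), as in Eq. (43):

```python
import numpy as np

def q_and_kernel(Y, dof):
    # q_{ij} of Eq. (46) and the kernel (1 + ||y_i - y_j||^2 / dof)^{-1}.
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = (1.0 + sq / dof) ** (-(dof + 1.0) / 2.0)
    np.fill_diagonal(num, 0.0)
    return num / num.sum(), 1.0 / (1.0 + sq / dof), sq

def grad_y(P, Y, dof):
    # Eq. (49), vectorized over all embedded points.
    Q, kernel, _ = q_and_kernel(Y, dof)
    coeff = (P - Q) * kernel
    return (2.0 * dof + 2.0) / dof * (coeff.sum(axis=1)[:, None] * Y - coeff @ Y)

def grad_dof(P, Y, dof):
    # Eq. (48).
    Q, _, sq = q_and_kernel(Y, dof)
    term = -(1.0 + dof) * sq / (2.0 * dof**2 * (1.0 + sq / dof)) \
           + 0.5 * np.log(1.0 + sq / dof)
    return np.sum(term * (P - Q))

def alternating_tsne(P, Y0, dof=1, eta=100.0, n_iter=500):
    Y = Y0.copy()
    for _ in range(n_iter):
        Y = Y - eta * grad_y(P, Y, dof)                           # update the embedding
        dof = max(1, dof - int(np.sign(grad_dof(P, Y, dof))))     # Eq. (43), keep dof >= 1
    return Y, dof
```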

6 Out-of-sample Embedding

Recall that we have $n$ high-dimensional data points $\{\boldsymbol{x}_i\}_{i=1}^n$ and we want to embed them into the lower-dimensional data $\{\boldsymbol{y}_i\}_{i=1}^n$, where $\boldsymbol{x}_i \in \mathbb{R}^d$ and $\boldsymbol{y}_i \in \mathbb{R}^p$. Assume we now have $n_t$ out-of-sample data points $\{\boldsymbol{x}_i^{(t)}\}_{i=1}^{n_t}$ and we want to embed them into the lower-dimensional data $\{\boldsymbol{y}_i^{(t)}\}_{i=1}^{n_t}$, where $\boldsymbol{x}_i^{(t)} \in \mathbb{R}^d$ and $\boldsymbol{y}_i^{(t)} \in \mathbb{R}^p$. There are several methods for the out-of-sample extension of SNE and t-SNE. One approach, which we do not cover in this manuscript, is based on optimization (Bunte et al., 2012). Another method is based on kernel mapping (Gisbrecht et al., 2012, 2015), which we explain in the following.

We define a map which maps any data point as $\boldsymbol{x} \mapsto \boldsymbol{y}(\boldsymbol{x})$, where:

$$\mathbb{R}^{p} \ni \boldsymbol{y}(\boldsymbol{x}) := \sum_{j=1}^{n}\boldsymbol{\alpha}_{j}\,\frac{k(\boldsymbol{x},\boldsymbol{x}_{j})}{\sum_{\ell=1}^{n}k(\boldsymbol{x},\boldsymbol{x}_{\ell})}, \qquad (55)$$

and $\boldsymbol{\alpha}_j \in \mathbb{R}^p$, and $\boldsymbol{x}_j$ and $\boldsymbol{x}_\ell$ denote the $j$-th and $\ell$-th training data points, respectively. The $k(\boldsymbol{x},\boldsymbol{x}_j)$ is a kernel such as the Gaussian kernel:

$$k(\boldsymbol{x},\boldsymbol{x}_{j}) = \exp\Big(\frac{-\|\boldsymbol{x}-\boldsymbol{x}_{j}\|_{2}^{2}}{2\,\sigma_{j}^{2}}\Big), \qquad (56)$$

where $\sigma_j$ is calculated as (Gisbrecht et al., 2015):

$$\sigma_{j} := \gamma \times \min_{i}(\|\boldsymbol{x}_{j}-\boldsymbol{x}_{i}\|_{2}), \qquad (57)$$

where $\gamma$ is a small positive number.

Assume we have already embedded the training data points using SNE or t-SNE; therefore, the set $\{\boldsymbol{y}_i\}_{i=1}^n$ is available. If we map the training data points, we want to minimize the following least-squares cost function so that $\boldsymbol{y}(\boldsymbol{x}_i)$ is close to $\boldsymbol{y}_i$ for the $i$-th training point:

$$\underset{\boldsymbol{\alpha}_{j}\text{'s}}{\text{minimize}} ~~ \sum_{i=1}^{n}\|\boldsymbol{y}_{i}-\boldsymbol{y}(\boldsymbol{x}_{i})\|_{2}^{2}, \qquad (58)$$

where the summation is over the training data points. We can write this cost function in matrix form as:

$$\underset{\boldsymbol{A}}{\text{minimize}} ~~ \|\boldsymbol{Y}-\boldsymbol{K}\boldsymbol{A}\|_{F}^{2}, \qquad (59)$$

where $\mathbb{R}^{n\times p} \ni \boldsymbol{Y} := [\boldsymbol{y}_{1},\dots,\boldsymbol{y}_{n}]^{\top}$ and $\mathbb{R}^{n\times p} \ni \boldsymbol{A} := [\boldsymbol{\alpha}_{1},\dots,\boldsymbol{\alpha}_{n}]^{\top}$. The $\boldsymbol{K} \in \mathbb{R}^{n\times n}$ is the kernel matrix whose $(i,j)$-th element is:

$$\boldsymbol{K}(i,j) := \frac{k(\boldsymbol{x}_{i},\boldsymbol{x}_{j})}{\sum_{\ell=1}^{n}k(\boldsymbol{x}_{i},\boldsymbol{x}_{\ell})}. \qquad (60)$$

The objective of Eq. (59) is always non-negative; thus, its smallest possible value is zero. Therefore, the solution is:

$$\boldsymbol{Y}-\boldsymbol{K}\boldsymbol{A} = \boldsymbol{0} \implies \boldsymbol{Y} = \boldsymbol{K}\boldsymbol{A} \overset{(a)}{\implies} \boldsymbol{A} = \boldsymbol{K}^{\dagger}\,\boldsymbol{Y}, \qquad (61)$$

where $\boldsymbol{K}^{\dagger}$ is the pseudo-inverse of $\boldsymbol{K}$:

$$\boldsymbol{K}^{\dagger} = (\boldsymbol{K}^{\top}\boldsymbol{K})^{-1}\boldsymbol{K}^{\top}, \qquad (62)$$

and $(a)$ is because $\boldsymbol{K}^{\dagger}\,\boldsymbol{K} = \boldsymbol{I}$.

Finally, the mapping of Eq. (55) for the $n_t$ out-of-sample data points is:

$$\boldsymbol{Y}^{(t)} = \boldsymbol{K}^{(t)}\,\boldsymbol{A}, \qquad (63)$$

where $\mathbb{R}^{n_t\times p} \ni \boldsymbol{Y}^{(t)} := [\boldsymbol{y}_{1}^{(t)},\dots,\boldsymbol{y}_{n_t}^{(t)}]^{\top}$ and the $(i,j)$-th element of the out-of-sample kernel matrix $\boldsymbol{K}^{(t)} \in \mathbb{R}^{n_t\times n}$ is:

$$\boldsymbol{K}^{(t)}(i,j) := \frac{k(\boldsymbol{x}_{i}^{(t)},\boldsymbol{x}_{j})}{\sum_{\ell=1}^{n}k(\boldsymbol{x}_{i}^{(t)},\boldsymbol{x}_{\ell})}, \qquad (64)$$

where $\boldsymbol{x}_i^{(t)}$ is the $i$-th out-of-sample data point, and $\boldsymbol{x}_j$ and $\boldsymbol{x}_\ell$ are the $j$-th and $\ell$-th training data points, respectively.

In Eq. (61), if $\boldsymbol{Y}$ is the embedding of the training data using SNE/t-SNE, then the out-of-sample embedding of SNE/t-SNE is obtained. As mentioned in (Gisbrecht et al., 2015), this method can also be used for out-of-sample extension in Isomap (Tenenbaum et al., 2000; Ghojogh et al., 2020), Locally Linear Embedding (LLE) (Roweis & Saul, 2000), and Maximum Variance Unfolding (MVU) (Weinberger & Saul, 2006). The only difference is in obtaining the embedded training points $\boldsymbol{Y}$ using the different non-parametric dimensionality reduction methods (Gisbrecht et al., 2012, 2015).
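The following Python sketch (our own, with assumed function names and an assumed $\gamma = 0.05$) implements the kernel mapping of Eqs. (55) to (64): we fit the coefficient matrix $\boldsymbol{A}$ on the training data and their SNE/t-SNE embedding, and then map out-of-sample points with it:

```python
import numpy as np

def normalized_kernel(X_query, X_train, sigmas):
    # Rows: query points; columns: training points. Gaussian kernel of Eq. (56),
    # normalized per row as in Eqs. (60) and (64).
    sq = np.sum((X_query[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigmas[None, :] ** 2))
    return K / K.sum(axis=1, keepdims=True)

def fit_out_of_sample_map(X_train, Y_train, gamma=0.05):
    # sigma_j of Eq. (57): gamma times the distance to the nearest other training point.
    sq = np.sum((X_train[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq, np.inf)
    sigmas = gamma * np.sqrt(sq.min(axis=1))
    K = normalized_kernel(X_train, X_train, sigmas)
    A = np.linalg.pinv(K) @ Y_train               # Eq. (61)
    return A, sigmas

def embed_out_of_sample(X_test, X_train, A, sigmas):
    K_test = normalized_kernel(X_test, X_train, sigmas)   # Eq. (64)
    return K_test @ A                             # Eq. (63)
```

Using `np.linalg.pinv` handles Eq. (62) robustly even when $\boldsymbol{K}^{\top}\boldsymbol{K}$ is ill-conditioned.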

7 Accelerating SNE and t-SNE

The SNE and t-SNE methods are very slow because of their iterative numerical optimization. Different methods have been proposed for accelerating them (Linderman et al., 2017). Some of these methods are tree-based algorithms (van der Maaten, 2014; Robinson & Pierce-Hoffman, 2020); this type of t-SNE is also referred to as Barnes-Hut t-SNE (Van Der Maaten, 2013; van der Maaten, 2014). We do not cover the tree-based algorithms (van der Maaten, 2014) in this paper for the sake of brevity. Some other methods for accelerating SNE and t-SNE are based on landmarks. In these methods, we randomly sample from the dataset in order to have a subset of the data; the sampled data points are called landmarks. In the following, we review three landmark-based methods for accelerating t-SNE and/or SNE.

7.1 Acceleration Using Out-of-sample Embedding

One way to speed up SNE and t-SNE is the kernel mapping (Gisbrecht et al., 2015) introduced in Section 6. We consider the landmarks as the training data points and train SNE or t-SNE on them. Thereafter, we treat the non-landmark data points as out-of-sample points and use kernel SNE or kernel t-SNE (or Fisher kernel t-SNE in supervised cases) to embed them.

Another method is to again consider the landmarks as the training points and embed them using SNE or t-SNE. Then, the non-landmarks, which are out-of-sample points, are embedded using optimization (Bunte et al., 2012), as was mentioned in Section 6.

7.2 Acceleration Using Random Walk

Another way of accelerating t-SNE is random walks (van der Maaten & Hinton, 2008). First, a $k$-Nearest Neighbor ($k$NN) graph is constructed using all points, including landmarks and non-landmarks. This method has an acceptable robustness to the choice of $k$; for example, $k = 20$ can be used (van der Maaten & Hinton, 2008). Also, note that the calculation of the $k$NN graph is time-consuming for large datasets; however, it is not a major concern as it is done only once. Then, multiple random walks are performed on this $k$NN graph (Spitzer, 2013). For every random walk, we start from a random landmark, repeatedly select edges at random, and terminate as soon as we reach another landmark. After performing all the random walks, the fraction of random walks which pass through the point $\boldsymbol{x}_i$ (either landmark or non-landmark) and then reach the point $\boldsymbol{x}_j$ (either landmark or non-landmark) is a good approximation for $p_{j|i}$. In t-SNE, we use this approximation in place of Eq. (31), which is then used in Eq. (33). The rest of t-SNE is similar to the original t-SNE. Therefore, for $p_{ij}$ in Eq. (37), we use these approximated probabilities, and this makes t-SNE much faster.
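A rough Python sketch of this random-walk approximation (our own; the $k$NN graph is assumed to be given as an adjacency list, the landmarks as a list of indices, and the number and maximum length of walks are assumptions):

```python
import numpy as np

def random_walk_p(neighbors, landmarks, n_walks=10000, rng=None):
    # neighbors: list of lists, neighbors[i] = indices of the kNN of point i.
    # landmarks: list of landmark indices.
    # Returns an approximation of p_{j|i}: for each point i, the fraction of walks
    # that passed through i and then terminated at landmark j.
    rng = np.random.default_rng() if rng is None else rng
    landmark_set = set(landmarks)
    n = len(neighbors)
    counts = np.zeros((n, n))
    for _ in range(n_walks):
        current = int(rng.choice(landmarks))      # start from a random landmark
        visited = [current]
        reached = None
        for _step in range(100 * n):              # cap the walk length as a safeguard
            current = int(rng.choice(neighbors[current]))   # follow a random edge
            visited.append(current)
            if current in landmark_set:           # terminate at the next landmark
                reached = current
                break
        if reached is None:
            continue                              # discard walks that never hit a landmark
        for node in visited[:-1]:                 # the walk passed through node and reached `reached`
            counts[node, reached] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0               # avoid division by zero for unvisited points
    return counts / row_sums
```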

8 Recent Improvements of t-SNE

Here, we just list some of the recent improvements of t-SNE without explaining them in detail, for brevity. Recall that the variance $\sigma_i^2$ is determined for every point $\boldsymbol{x}_i$ using a binary search. This cancels the local density information of points, because a point in a denser region will have a smaller $\sigma_i^2$. Dense t-SNE (Narayan et al., 2021) resolves this problem by a density radius which includes the density information. LargeVis (Tang et al., 2016) and UMAP (McInnes et al., 2018; Ghojogh et al., 2021c) are closely related methods to t-SNE. Parametric t-SNE (van der Maaten, 2009) and parametric kernel t-SNE (Gisbrecht et al., 2015) implement the t-SNE formulation in a neural network structure. The optimization of t-SNE can be seen as optimizing attractive and repulsive forces between points; some discussions on these forces in t-SNE can be found in (Linderman et al., 2017; Sainburg et al., 2020). Many algorithms, such as t-SNE, which are based on these forces can be unified as a family of neighborhood embedding methods (Böhm et al., 2020; Böhm, 2020). Finally, note that a combination of the variational autoencoder (Kingma & Welling, 2014; Ghojogh et al., 2021a) and SNE exists (Graving & Couzin, 2020).

9 Conclusion

This paper was a tutorial and survey on SNE and its variants. These methods have a probabilistic approach in which the neighborhood probabilities in the input space are preserved, as well as possible, in the embedding space. We explained SNE, symmetric SNE, t-SNE (or Cauchy-SNE), and t-SNE with general degrees of freedom. We also covered their out-of-sample extension and acceleration methods. Finally, some simulations were provided for visualization of embeddings.

Some newer variants of SNE and t-SNE were not covered in this manuscript, and we refer the reader to those papers for more information. An example is Fisher kernel t-SNE (Gisbrecht et al., 2015) for supervised embedding using t-SNE; this method uses a $T$-point approximation of the Riemannian distance (Peltonen et al., 2004) in the formulation of the probability. There are also other techniques for heavy-tailed SNE, such as (Yang et al., 2009).

References

  • Böhm (2020) Böhm, Jan Niklas. Dimensionality Reduction with Neighborhood Embeddings. PhD thesis, University of Tübingen, 2020.
  • Böhm et al. (2020) Böhm, Jan Niklas, Berens, Philipp, and Kobak, Dmitry. A unifying perspective on neighbor embeddings along the attraction-repulsion spectrum. arXiv preprint arXiv:2007.08902, 2020.
  • Boyd & Vandenberghe (2004) Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge university press, 2004.
  • Bunte et al. (2012) Bunte, Kerstin, Biehl, Michael, and Hammer, Barbara. A general framework for dimensionality-reducing data visualization mapping. Neural Computation, 24(3):771–804, 2012.
  • Ghojogh et al. (2019) Ghojogh, Benyamin, Samad, Maria N, Mashhadi, Sayema Asif, Kapoor, Tania, Ali, Wahab, Karray, Fakhri, and Crowley, Mark. Feature selection and feature extraction in pattern analysis: A literature review. arXiv preprint arXiv:1905.02845, 2019.
  • Ghojogh et al. (2020) Ghojogh, Benyamin, Ghodsi, Ali, Karray, Fakhri, and Crowley, Mark. Multidimensional scaling, Sammon mapping, and Isomap: Tutorial and survey. arXiv preprint arXiv:2009.08136, 2020.
  • Ghojogh et al. (2021a) Ghojogh, Benyamin, Ghodsi, Ali, Karray, Fakhri, and Crowley, Mark. Factor analysis, probabilistic principal component analysis, variational inference, and variational autoencoder: Tutorial and survey. arXiv preprint arXiv:2101.00734, 2021a.
  • Ghojogh et al. (2021b) Ghojogh, Benyamin, Ghodsi, Ali, Karray, Fakhri, and Crowley, Mark. Restricted Boltzmann machine and deep belief network: Tutorial and survey. arXiv preprint arXiv:2107.12521, 2021b.
  • Ghojogh et al. (2021c) Ghojogh, Benyamin, Ghodsi, Ali, Karray, Fakhri, and Crowley, Mark. Uniform manifold approximation and projection (UMAP) and its variants: Tutorial and survey. arXiv preprint arXiv:2109.02508, 2021c.
  • Gisbrecht et al. (2012) Gisbrecht, Andrej, Lueks, Wouter, Mokbel, Bassam, and Hammer, Barbara. Out-of-sample kernel extensions for nonparametric dimensionality reduction. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, volume 2012, pp. 531–536, 2012.
  • Gisbrecht et al. (2015) Gisbrecht, Andrej, Schulz, Alexander, and Hammer, Barbara. Parametric nonlinear dimensionality reduction using kernel t-SNE. Neurocomputing, 147:71–82, 2015.
  • Globerson et al. (2007) Globerson, Amir, Chechik, Gal, Pereira, Fernando, and Tishby, Naftali. Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 8(Oct):2265–2295, 2007.
  • Goldberg & Levy (2014) Goldberg, Yoav and Levy, Omer. word2vec explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.
  • Goldberger et al. (2005) Goldberger, Jacob, Hinton, Geoffrey E, Roweis, Sam T, and Salakhutdinov, Russ R. Neighbourhood components analysis. In Advances in neural information processing systems, pp. 513–520, 2005.
  • Gosset (Student) (1908) Gosset (Student), William Sealy. The probable error of a mean. Biometrika, pp. 1–25, 1908.
  • Graving & Couzin (2020) Graving, Jacob M and Couzin, Iain D. VAE-SNE: a deep generative model for simultaneous dimensionality reduction and clustering. BioRxiv, 2020.
  • Hinton (2012) Hinton, Geoffrey E. A practical guide to training restricted Boltzmann machines. In Neural networks: Tricks of the trade, pp. 599–619. Springer, 2012.
  • Hinton & Roweis (2003) Hinton, Geoffrey E and Roweis, Sam T. Stochastic neighbor embedding. In Advances in neural information processing systems, pp. 857–864, 2003.
  • Hinton & Salakhutdinov (2006) Hinton, Geoffrey E and Salakhutdinov, Ruslan R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • Im et al. (2018) Im, Daniel Jiwoong, Verma, Nakul, and Branson, Kristin. Stochastic neighbor embedding under f-divergences. arXiv preprint arXiv:1811.01247, 2018.
  • Iwata et al. (2005) Iwata, Tomoharu, Saito, Kazumi, Ueda, Naonori, Stromsten, Sean, Griffiths, Thomas L, and Tenenbaum, Joshua B. Parametric embedding for class visualization. In Advances in neural information processing systems, pp. 617–624, 2005.
  • Jacobs (1988) Jacobs, Robert A. Increased rates of convergence through learning rate adaptation. Neural networks, 1(4):295–307, 1988.
  • Jain & Kar (2017) Jain, Prateek and Kar, Purushottam. Non-convex optimization for machine learning. arXiv preprint arXiv:1712.07897, 2017.
  • Kingma & Welling (2014) Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • Kleinbaum et al. (2002) Kleinbaum, David G, Dietz, K, Gail, M, Klein, Mitchel, and Klein, Mitchell. Logistic regression. Springer, 2002.
  • Kobak & Berens (2019) Kobak, Dmitry and Berens, Philipp. The art of using t-SNE for single-cell transcriptomics. Nature communications, 10(1):1–14, 2019.
  • Kobak et al. (2019) Kobak, Dmitry, Linderman, George, Steinerberger, Stefan, Kluger, Yuval, and Berens, Philipp. Heavy-tailed kernels reveal a finer cluster structure in t-SNE visualisations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp.  124–139. Springer, 2019.
  • Kullback (1997) Kullback, Solomon. Information theory and statistics. Courier Corporation, 1997.
  • Linderman et al. (2017) Linderman, George C, Rachh, Manas, Hoskins, Jeremy G, Steinerberger, Stefan, and Kluger, Yuval. Efficient algorithms for t-distributed stochastic neighborhood embedding. arXiv preprint arXiv:1712.09005, 2017.
  • Liu et al. (2020) Liu, Xueliang, Yang, Xun, Wang, Meng, and Hong, Richang. Deep neighborhood component analysis for visual similarity modeling. ACM Transactions on Intelligent Systems and Technology (TIST), 11(3):1–15, 2020.
  • McInnes et al. (2018) McInnes, Leland, Healy, John, and Melville, James. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
  • Mikolov et al. (2013a) Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.
  • Mikolov et al. (2013b) Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013b.
  • Movshovitz-Attias et al. (2017) Movshovitz-Attias, Yair, Toshev, Alexander, Leung, Thomas K, Ioffe, Sergey, and Singh, Saurabh. No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp.  360–368, 2017.
  • Narayan et al. (2021) Narayan, Ashwin, Berger, Bonnie, and Cho, Hyunghoon. Assessing single-cell transcriptomic variability through density-preserving data visualization. Nature Biotechnology, 39(6):765–774, 2021.
  • Peltonen et al. (2004) Peltonen, Jaakko, Klami, Arto, and Kaski, Samuel. Improved learning of Riemannian metrics for exploratory analysis. Neural Networks, 17(8-9):1087–1100, 2004.
  • Qian (1999) Qian, Ning. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
  • Robinson & Pierce-Hoffman (2020) Robinson, Isaac and Pierce-Hoffman, Emma. Tree-SNE: Hierarchical clustering and visualization using t-SNE. arXiv preprint arXiv:2002.05687, 2020.
  • Rong (2014) Rong, Xin. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738, 2014.
  • Roweis & Saul (2000) Roweis, Sam T and Saul, Lawrence K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
  • Sainburg et al. (2020) Sainburg, Tim, McInnes, Leland, and Gentner, Timothy Q. Parametric UMAP: learning embeddings with deep neural networks for representation and semi-supervised learning. 2020.
  • Saul & Roweis (2003) Saul, Lawrence K and Roweis, Sam T. Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of machine learning research, 4(Jun):119–155, 2003.
  • Spitzer (2013) Spitzer, Frank. Principles of random walk, volume 34. Springer Science & Business Media, 2013.
  • Tang et al. (2016) Tang, Jian, Liu, Jingzhou, Zhang, Ming, and Mei, Qiaozhu. Visualizing large-scale and high-dimensional data. In Proceedings of the 25th international conference on world wide web, pp.  287–297, 2016.
  • Tenenbaum et al. (2000) Tenenbaum, Joshua B, De Silva, Vin, and Langford, John C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
  • van der Maaten (2009) van der Maaten, Laurens. Learning a parametric embedding by preserving local structure. In Artificial Intelligence and Statistics, pp.  384–391, 2009.
  • Van Der Maaten (2013) Van Der Maaten, Laurens. Barnes-Hut-SNE. arXiv preprint arXiv:1301.3342, 2013.
  • van der Maaten (2014) van der Maaten, Laurens. Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research, 15(1):3221–3245, 2014.
  • van der Maaten & Hinton (2008) van der Maaten, Laurens and Hinton, Geoffrey. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • Weinberger & Saul (2006) Weinberger, Kilian Q and Saul, Lawrence K. Unsupervised learning of image manifolds by semidefinite programming. International journal of computer vision, 70(1):77–90, 2006.
  • Yang et al. (2009) Yang, Zhirong, King, Irwin, Xu, Zenglin, and Oja, Erkki. Heavy-tailed symmetric stochastic neighbor embedding. In Advances in neural information processing systems, pp. 2169–2177, 2009.