
Deep Variational Multivariate Information Bottleneck -
A Framework for Variational Losses

Eslam Abdelaleem  [email protected]
Department of Physics
Emory University
Atlanta, GA 30322, USA

Ilya Nemenman  [email protected]
Departments of Physics and Biology
Initiative in Theory and Modeling of Living Systems
Emory University
Atlanta, GA 30322, USA

K. Michael Martini  [email protected]
Department of Physics
Emory University
Atlanta, GA 30322, USA

Equal contribution. Corresponding author.
Abstract

Variational dimensionality reduction methods are known for their high accuracy, generative abilities, and robustness. We introduce a framework to unify many existing variational methods and to design new ones. The framework is based on an interpretation of the multivariate information bottleneck, in which an encoder graph, specifying what information to compress, is traded off against a decoder graph, specifying a generative model. Using this framework, we rederive existing dimensionality reduction methods, including the deep variational information bottleneck and variational auto-encoders. The framework naturally introduces a trade-off parameter, extending the deep variational CCA (DVCCA) family of algorithms to beta-DVCCA. We derive a new method, the deep variational symmetric information bottleneck (DVSIB), which simultaneously compresses two variables to preserve information between their compressed representations. We implement these algorithms and evaluate their ability to produce shared low-dimensional latent spaces on the Noisy MNIST dataset. We show that algorithms that are better matched to the structure of the data (in our case, beta-DVCCA and DVSIB) produce better latent spaces, as measured by classification accuracy, dimensionality of the latent variables, and sample efficiency. We believe that this framework can be used to unify other multi-view representation learning algorithms and to derive and implement novel problem-specific loss functions.

Keywords: Information Bottleneck, Symmetric Information Bottleneck, Variational Methods, Generative Models, Dimensionality Reduction, Data Efficiency

1 Introduction

Large dimensional multi-modal datasets are abundant in multimedia systems utilized for language modeling (Xu et al., 2016; Zhou et al., 2018; Rohrbach et al., 2017; Sanabria et al., 2018; Goyal et al., 2017; Wang et al., 2018; Hendrycks et al., 2020), neural control of behavior studies (Steinmetz et al., 2021; Urai et al., 2022; Krakauer et al., 2017; Pang et al., 2016), multi-omics approaches in systems biology (Clark et al., 2013; Zheng et al., 2017; Svensson et al., 2018; Huntley et al., 2015; Lorenzi et al., 2018), and many other domains. Such data come with the curse of dimensionality, making it hard to learn the relevant statistical correlations from samples. The problem is made even harder by the data often containing information that is irrelevant to the specific questions one asks. To tackle these challenges, a myriad of dimensionality reduction (DR) methods have emerged. By preserving certain aspects of the data while discarding the remainder, DR can decrease the complexity of the problem, yield clearer insights, and provide a foundation for more refined modeling approaches.

DR techniques span linear methods like Principal Component Analysis (PCA) (Hotelling, 1933), Partial Least Squares (PLS) (Wold et al., 2001), Canonical Correlations Analysis (CCA) (Hotelling, 1936), and regularized CCA (Vinod, 1976; Årup Nielsen et al., 1998), as well as nonlinear approaches, including Autoencoders (AE) (Hinton and Salakhutdinov, 2006), Deep CCA (Andrew et al., 2013), Deep Canonical Correlated AE (Wang et al., 2015), Correlational Neural Networks (Chandar et al., 2016), Deep Generalized CCA (Benton et al., 2017), and Deep Tensor CCA (Wong et al., 2021). Of particular interest to us are variational methods, such as Variational Autoencoders (VAE) (Kingma and Welling, 2014), beta-VAE (Higgins et al., 2016), Joint Multimodal VAE (JMVAE) (Suzuki et al., 2016), Deep Variational CCA (DVCCA) (Wang et al., 2016), Deep Variational Information Bottleneck (DVIB) (Alemi et al., 2017), Variational Mixture-of-experts AE (Shi et al., 2019), and Multiview Information Bottleneck (Federici et al., 2020b). These DR methods use deep neural networks and variational approximations to learn robust and accurate representations of the data, while, at the same time, often serving as generative models for creating samples from the learned distributions.

There are many theoretical derivations and justifications for variational DR methods (Kingma and Welling, 2014; Higgins et al., 2016; Suzuki et al., 2016; Wang et al., 2016; Karami and Schuurmans, 2021; Qiu et al., 2022; Alemi et al., 2017; Bao, 2021; Lee and Van der Schaar, 2021; Wang et al., 2019; Wan et al., 2021; Federici et al., 2020a; Huang et al., 2022; Hu et al., 2020). This diversity of derivations, while enabling adaptability, often leaves researchers with no principled ways for choosing a method for a particular application, for designing new methods with distinct assumptions, or for comparing methods to each other.

Here, we introduce the Deep Variational Multivariate Information Bottleneck (DVMIB) framework, offering a unified mathematical foundation for many variational DR methods. Our framework is grounded in the multivariate information bottleneck loss function (Tishby et al., 2000; Friedman et al., 2013). This loss, amenable to approximation through upper and lower variational bounds, provides a system for implementing diverse DR variants using deep neural networks. We demonstrate the framework’s efficacy by deriving the loss functions of many existing variational DR methods starting from the same principles. Furthermore, our framework naturally allows the adjustment of trade-off parameters, leading to generalizations of these existing methods. For instance, we generalize DVCCA to \beta-DVCCA. The framework further allows us to introduce and implement in software novel DR methods. We view the DVMIB framework, with its uniform information bottleneck language, conceptual clarity of translating statistical dependencies in data via graphical models of encoder and decoder structures into variational losses, the ability to unify existing approaches, and easy adaptability to new scenarios as one of the main contributions of our work.

Beyond its unifying role, our framework offers a principled approach for deriving problem-specific loss functions using domain-specific knowledge. Thus, we anticipate its application for multi-view representation learning across diverse fields. To illustrate this, we use the framework to derive a novel dimensionality reduction method, the Deep Variational Symmetric Information Bottleneck (DVSIB), which compresses two random variables into two distinct latent variables that are maximally informative about one another. This new method produces better representations of classic datasets than previous approaches. The introduction of DVSIB is another major contribution of our paper.

In summary, our paper makes the following contributions to the field:

  1.

    Introduction of the Variational Multivariate Information Bottleneck Framework: We provide both intuitive and mathematical insights into this framework, establishing a robust foundation for further exploration.

  2.

    Rederivation and Generalization of Existing Methods within a Common Framework: We demonstrate the versatility of our framework by systematically rederiving and generalizing various existing methods from the literature, showcasing the framework’s ability to unify diverse approaches.

  3.

    Design of a Novel Method — Deep Variational Symmetric Information Bottleneck (DVSIB): Employing our framework, we introduce DVSIB as a new method, contributing to the growing repertoire of techniques in variational dimensionality reduction. The method constructs high-accuracy latent spaces from substantially fewer samples than comparable approaches.

The paper is structured as follows. First, we introduce the underlying mathematics and the implementation of the DVMIB framework. We then explain how to use the framework to generate new DR methods. In Tbl. 1, we present several known and newly derived variational methods, illustrating how easily they can be derived within the framework. As a proof of concept, we then benchmark simple computational implementations of the methods in Tbl. 1 against the Noisy MNIST dataset. The Appendices present a detailed treatment of all terms in the variational losses in our framework, a discussion of multi-view generalizations, and more details, including visualizations, of the performance of many methods on the Noisy MNIST dataset.

2 Multivariate Information Bottleneck Framework

We represent DR problems similar to the Multivariate Information Bottleneck (MIB) of Friedman et al. (2013), which is a generalization of the more traditional Information Bottleneck algorithm (Tishby et al., 2000) to multiple variables. The reduced representation is achieved as a trade-off between two Bayesian networks. Bayesian networks are directed acyclic graphs that provide a factorization of the joint probability distribution, P(X_{1},X_{2},\dots,X_{N})=\prod_{i=1}^{N}P(X_{i}|Pa_{X_{i}}^{G}), where Pa_{X_{i}}^{G} is the set of parents of X_{i} in graph G. The multiinformation (Studenỳ and Vejnarová, 1998) of a Bayesian network is defined as the Kullback-Leibler divergence between the joint probability distribution and the product of the marginals, and it serves as a measure of the total correlations among the variables, I(X_{1},X_{2},\dots,X_{N})=D_{KL}(P(X_{1},X_{2},\dots,X_{N})\|P(X_{1})P(X_{2})\dots P(X_{N})). For a Bayesian network, the multiinformation reduces to the sum of all the local informations, I(X_{1},X_{2},\dots,X_{N})=\sum_{i=1}^{N}I(X_{i};Pa_{X_{i}}^{G}) (Friedman et al., 2013).

The first of the Bayesian networks is an encoder (compression) graph, which models how compressed (reduced, latent) variables are obtained from the observations. The second network is a decoder graph, which specifies a generative model for the data from the compressed variables, i.e., it is an alternate factorization of the distribution. In MIB, the information of the encoder graph is minimized, ensuring strong compression (corresponding to the approximate posterior). The information of the decoder graph is maximized, promoting the most accurate model of the data (corresponding to maximizing the log-likelihood). As in IB (Tishby et al., 2000), the trade-off between the compression and reconstruction is controlled by a trade-off parameter \beta:

L=I_{\text{encoder}}-\beta I_{\text{decoder}}. (1)

In this work, our key contribution is in writing an explicit variational loss for typical information terms found in both the encoder and the decoder graphs. All terms in the decoder graph use samples of the compressed variables as determined from the encoder graph. If there are two terms that correspond to the same information in Eq. (1), one from each of the graphs, they do not cancel each other since they correspond to two different variational expressions. For pedagogical clarity, we do this by first analyzing the Symmetric Information Bottleneck (SIB), a special case of MIB. We derive the bounds for three types of information terms in SIB, which we then use as building blocks for all other variational MIB methods in subsequent Sections.

2.1 Deep Variational Symmetric Information Bottleneck

The Deep Variational Symmetric Information Bottleneck (DVSIB) simultaneously reduces a pair of datasets X and Y into two separate lower dimensional compressed versions Z_{X} and Z_{Y}. These compressions are done at the same time to ensure that the latent spaces are maximally informative about each other. The joint compression is known to decrease data set size requirements compared to individual ones (Martini and Nemenman, 2023). Having distinct latent spaces for each modality usually helps with interpretability. For example, X could be the neural activity of thousands of neurons, and Y could be the recordings of joint angles of the animal. Rather than one latent space representing both, separate latent spaces for the neural activity and the joint angles are sought. By maximizing compression as well as I(Z_{X},Z_{Y}), one constructs the latent spaces that capture only the neural activity pertinent to joint movement and only the movement that is correlated with the neural activity (cf. Pang et al. (2016)). Many other applications could benefit from a similar DR approach.

[Diagram: G_{\text{encoder}} compresses X into Z_{X} and Y into Z_{Y}; G_{\text{decoder}} generates X from Z_{X} and Y from Z_{Y}, with an edge between Z_{X} and Z_{Y}.]

Figure 1: The encoder and decoder graphs for DVSIB.

In Fig. 1, we define two Bayesian networks for DVSIB, G_{\text{encoder}} and G_{\text{decoder}}. G_{\text{encoder}} encodes the compression of X to Z_{X} and Y to Z_{Y}. It corresponds to the factorization p(x,y,z_{x},z_{y})=p(x,y)p(z_{x}|x)p(z_{y}|y) and the resultant I_{\text{encoder}}=I^{E}(X;Y)+I^{E}(X;Z_{X})+I^{E}(Y;Z_{Y}). The I^{E}(X;Y) term does not depend on the compressed variables, does not affect the optimization problem, and hence is discarded in what follows. G_{\text{decoder}} represents a generative model for X and Y given the compressed latent variables Z_{X} and Z_{Y}. It corresponds to the factorization p(x,y,z_{x},z_{y})=p(z_{x})p(z_{y}|z_{x})p(x|z_{x})p(y|z_{y}) and the resultant I_{\text{decoder}}=I^{D}(Z_{X};Z_{Y})+I^{D}(X;Z_{X})+I^{D}(Y;Z_{Y}). Combining the informations from both graphs and using Eq. (1), we find the SIB loss:

L_{\text{SIB}}=I^{E}(X;Z_{X})+I^{E}(Y;Z_{Y})-\beta\left(I^{D}(Z_{X};Z_{Y})+I^{D}(X;Z_{X})+I^{D}(Y;Z_{Y})\right). (2)

Note that information in the encoder terms is minimized, and information in the decoder terms is maximized. Thus, while it is tempting to simplify Eq. (2) by canceling I^{E}(X;Z_{X}) and I^{D}(X;Z_{X}), this would be a mistake. Indeed, these terms come from different factorizations: the encoder corresponds to learning p(z_{x}|x), and the decoder to p(x|z_{x}).

While the DVSIB loss may appear similar to those of previous models, such as the MultiView Information Bottleneck (MVIB) (Federici et al., 2020b) and Barlow Twins (Zbontar et al., 2021), it is distinct both conceptually and in practice. For example, MVIB aims to generate latent variables that are as similar to each other as possible, sharing the same domain. DVSIB, however, endeavors to produce distinct latent representations, which could potentially have different units or dimensions, while maximizing the mutual information between them. The Barlow Twins architecture, on the other hand, appears to have two latent subspaces, while in fact it has a single latent subspace that is optimized by a regular information bottleneck.

We now follow a procedure and notation similar to Alemi et al. (2017) and construct variational bounds on all I^{E} and I^{D} terms. Terms without leaf nodes, i.e., I^{D}(Z_{X};Z_{Y}), require new approaches.

2.2 Variational bounds on DVSIB encoder terms

The information I^{E}(Z_{X};X) corresponds to compressing the random variable X to Z_{X}. Since this is an encoder term, it needs to be minimized in Eq. (2). Thus, we seek a variational bound I^{E}(Z_{X};X)\leq\tilde{I}^{E}(Z_{X};X), where \tilde{I}^{E} is the variational version of I^{E}, which can be implemented using a deep neural network. We find \tilde{I}^{E} by using the positivity of the Kullback-Leibler divergence. We let r(z_{x}) be a variational approximation to p(z_{x}). Then D_{\rm KL}(p(z_{x})\|r(z_{x}))\geq 0, so that -\int dz_{x}\,p(z_{x})\ln p(z_{x})\leq-\int dz_{x}\,p(z_{x})\ln r(z_{x}). Thus, -\int dx\,dz_{x}\,p(z_{x},x)\ln p(z_{x})\leq-\int dx\,dz_{x}\,p(z_{x},x)\ln r(z_{x}). We then add \int dx\,dz_{x}\,p(z_{x},x)\ln p(z_{x}|x) to both sides and find:

I^{E}(Z_{X};X)=\int dx\,dz_{x}\,p(z_{x},x)\ln\left(\frac{p(z_{x}|x)}{p(z_{x})}\right)\leq\int dx\,dz_{x}\,p(z_{x},x)\ln\left(\frac{p(z_{x}|x)}{r(z_{x})}\right)\equiv\tilde{I}^{E}(Z_{X};X). (3)

We further simplify the variational loss by approximating p(x)\approx\frac{1}{N}\sum_{i=1}^{N}\delta(x-x_{i}), so that:

\tilde{I}^{E}(Z_{X};X)\approx\frac{1}{N}\sum_{i=1}^{N}\int dz_{x}\,p(z_{x}|x_{i})\ln\left(\frac{p(z_{x}|x_{i})}{r(z_{x})}\right)=\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z_{x}|x_{i})\|r(z_{x})). (4)

The term I^{E}(Y;Z_{Y}) can be treated in an analogous manner, resulting in:

\tilde{I}^{E}(Z_{Y};Y)\approx\frac{1}{N}\sum_{i=1}^{N}D_{KL}(p(z_{y}|y_{i})\|r(z_{y})). (5)
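As a concrete illustration, the following is a minimal PyTorch sketch (not the authors' released code) of this encoder term: with the Gaussian ansatz introduced later in Sec. 2.5, the KL divergence in Eq. (4) has a closed form (it reappears as Eq. (10)). The function name and tensor shapes are our own choices.

```python
import torch

def encoder_kl_term(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """mu, log_var: (N, k_Z) encoder outputs for a batch x_1, ..., x_N."""
    # Analytic KL( N(mu, diag(exp(log_var))) || N(0, I) ), averaged over the batch;
    # term by term: Tr(Sigma) + ||mu||^2 - k_Z - ln det(Sigma), cf. Eq. (10).
    kl_per_sample = 0.5 * torch.sum(
        torch.exp(log_var) + mu.pow(2) - 1.0 - log_var, dim=1
    )
    return kl_per_sample.mean()
```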

2.3 Variational bounds on DVSIB decoder terms

The term I^{D}(X;Z_{X}) corresponds to a decoder of X from the compressed variable Z_{X}. It is maximized in Eq. (2). Thus, we seek its variational version \tilde{I}^{D}, such that I^{D}\geq\tilde{I}^{D}. Here, q(x|z_{x}) will serve as a variational approximation to p(x|z_{x}). We use the positivity of the Kullback-Leibler divergence, D_{\rm KL}(p(x|z_{x})\|q(x|z_{x}))\geq 0, to find \int dx\,p(x|z_{x})\ln p(x|z_{x})\geq\int dx\,p(x|z_{x})\ln q(x|z_{x}). This gives \int dz_{x}\,dx\,p(x,z_{x})\ln p(x|z_{x})\geq\int dz_{x}\,dx\,p(x,z_{x})\ln q(x|z_{x}). We add the entropy of X to both sides to arrive at the variational bound:

I^{D}(X;Z_{X})=\int dz_{x}\,dx\,p(x,z_{x})\ln\frac{p(x|z_{x})}{p(x)}\geq\int dz_{x}\,dx\,p(x,z_{x})\ln\frac{q(x|z_{x})}{p(x)}\equiv\tilde{I}^{D}(X;Z_{X}). (6)

We further simplify \tilde{I}^{D} by replacing p(x) by samples, p(x)\approx\frac{1}{N}\sum_{i}^{N}\delta(x-x_{i}), and using the p(z_{x}|x) that we learned previously from the encoder:

\tilde{I}^{D}(X;Z_{X})\approx H(X)+\frac{1}{N}\sum_{i=1}^{N}\int dz_{x}\,p(z_{x}|x_{i})\ln(q(x_{i}|z_{x})). (7)

Here H(X) does not depend on p(z_{x}|x) and, therefore, can be dropped from the loss. The variational version of I^{D}(Y;Z_{Y}) is obtained analogously:

\tilde{I}^{D}(Y;Z_{Y})\approx H(Y)+\frac{1}{N}\sum_{i=1}^{N}\int dz_{y}\,p(z_{y}|y_{i})\ln(q(y_{i}|z_{y})). (8)
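A hedged sketch of this decoder term, using the unit-variance Gaussian ansatz for q(x|z_x) adopted later in Sec. 2.5: up to the constant H(X), Eq. (7) then reduces to a negative mean squared reconstruction error (cf. Eq. (11)). One latent sample per data point (M = 1) is assumed for brevity; names are ours.

```python
import torch

def decoder_term(x: torch.Tensor, x_recon_mean: torch.Tensor) -> torch.Tensor:
    """x: (N, d) data batch; x_recon_mean: (N, d) decoder means mu_X(z_x)
    evaluated at samples z_x ~ p(z_x|x_i), one sample per data point."""
    # Monte Carlo estimate of Eq. (7) with H(X) dropped and unit-variance
    # Gaussian q(x|z_x): the log-likelihood is -0.5 * squared error.
    return (-0.5 * (x - x_recon_mean).pow(2).sum(dim=1)).mean()
```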

2.4 Variational Bounds on decoder terms not on a leaf - MINE

The variational bound above cannot be applied to the information terms that do not contain leaves in G_{\rm decoder}. For SIB, this corresponds to the I^{D}(Z_{X};Z_{Y}) term. This information is maximized. To find a variational bound such that I^{D}(Z_{X};Z_{Y})\geq\tilde{I}^{D}(Z_{X};Z_{Y}), we use the MINE mutual information estimator (Belghazi et al., 2018), which samples both Z_{X} and Z_{Y} from their respective variational encoders. Other mutual information estimators, such as I_{\rm InfoNCE} (Poole et al., 2019), can be used as long as they are differentiable. Other estimators might be better suited for different problems, but for our current application, I_{\rm MINE} was sufficient. We variationally approximate p(z_{x},z_{y}) as p(z_{x})p(z_{y})e^{T(z_{x},z_{y})}/\mathcal{Z}_{\text{norm}}, where \mathcal{Z}_{\text{norm}}=\int dz_{x}\,dz_{y}\,p(z_{x})p(z_{y})e^{T(z_{x},z_{y})} is the normalization factor. Here T(z_{x},z_{y}) is parameterized by a neural network that takes in samples of the latent spaces z_{x} and z_{y} and returns a single number. We again use the positivity of the Kullback-Leibler divergence, D_{\rm KL}(p(z_{x},z_{y})\|p(z_{x})p(z_{y})e^{T(z_{x},z_{y})}/\mathcal{Z}_{\text{norm}})\geq 0, which implies \int dz_{x}\,dz_{y}\,p(z_{x},z_{y})\ln p(z_{x},z_{y})\geq\int dz_{x}\,dz_{y}\,p(z_{x},z_{y})\ln\frac{p(z_{x})p(z_{y})e^{T(z_{x},z_{y})}}{\mathcal{Z}_{\text{norm}}}. Subtracting \int dz_{x}\,dz_{y}\,p(z_{x},z_{y})\ln(p(z_{x})p(z_{y})) from both sides, we find:

I^{D}(Z_{X};Z_{Y})\geq\int dz_{x}\,dz_{y}\,p(z_{x},z_{y})\ln\frac{e^{T(z_{x},z_{y})}}{\mathcal{Z}_{\text{norm}}}\equiv\tilde{I}_{\text{MINE}}^{D}(Z_{X};Z_{Y}). (9)
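A minimal sketch of how such a bound can be computed in practice: a small network plays the role of the critic T(z_x, z_y), paired latent samples estimate the expectation under p(z_x, z_y), and shuffling within the batch (the scrambling described in Sec. 2.5) approximates samples from p(z_x)p(z_y). The architecture and names here are illustrative assumptions, not the paper's exact implementation (which uses a (k_{Z_X}+k_{Z_Y}, 1024, 1024, 1) network, cf. Sec. 4).

```python
import math
import torch
import torch.nn as nn

class Critic(nn.Module):
    """T(z_x, z_y): an MLP mapping a pair of latent samples to a scalar."""
    def __init__(self, k_zx: int, k_zy: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(k_zx + k_zy, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, zx: torch.Tensor, zy: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([zx, zy], dim=-1)).squeeze(-1)

def mine_lower_bound(critic: Critic, zx: torch.Tensor, zy: torch.Tensor) -> torch.Tensor:
    """Estimate of Eq. (9): E_joint[T] - ln E_{p(z_x)p(z_y)}[exp(T)]."""
    t_joint = critic(zx, zy)                       # paired samples ~ p(z_x, z_y)
    zy_shuffled = zy[torch.randperm(zy.shape[0])]  # shuffling breaks the pairing,
    t_indep = critic(zx, zy_shuffled)              # approximating p(z_x)p(z_y)
    # ln of the empirical mean of exp(T), computed stably via logsumexp
    log_z_norm = torch.logsumexp(t_indep, dim=0) - math.log(t_indep.shape[0])
    return t_joint.mean() - log_z_norm
```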

2.5 Parameterizing the distributions and the reparameterization trick

H(X), H(Y), and I(X,Y) do not depend on p(z_{x}|x) and p(z_{y}|y) and are dropped from the loss. Further, we can use any ansatz for the variational distributions we introduced. We choose parametric probability distribution families and learn the nearest distribution in these families consistent with the data. We assume p(z_{x}|x) is a normal distribution with mean \mu_{Z_{X}}(x) and a diagonal variance \Sigma_{Z_{X}}(x). We learn the mean and the log variance as neural networks. We also assume that q(x|z_{x}) is normal with a mean \mu_{X}(z_{x}) and a unit variance. In principle, we could also learn the variance for this distribution, but practically we did not find the need for that, and the approach works well as is. Finally, we assume that r(z_{x}) is a standard normal distribution. We use the reparameterization trick to produce samples {z_{x}}_{i,j}=z_{x_{j}}(x_{i})=\mu(x_{i})+\sqrt{\Sigma_{Z_{X}}(x_{i})}\,\eta_{j} from p(z_{x}|x_{i}), where \eta_{j} is drawn from a standard normal distribution (Kingma and Welling, 2014). We choose the same types of distributions for the corresponding z_{y} terms.
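A sketch of this reparameterization step (Kingma and Welling, 2014); names and tensor shapes are our own, and n_samples plays the role of M below.

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor, n_samples: int = 1) -> torch.Tensor:
    """mu, log_var: (N, k_Z) encoder outputs. Returns (n_samples, N, k_Z) samples
    z = mu + sqrt(Sigma) * eta, eta ~ N(0, I), differentiable w.r.t. mu and log_var."""
    std = torch.exp(0.5 * log_var)
    eta = torch.randn(n_samples, *mu.shape, device=mu.device)
    return mu.unsqueeze(0) + std.unsqueeze(0) * eta
```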

To sample from p(z_{x},z_{y}) we use p(z_{x},z_{y})=\int dx\,dy\,p(z_{x},z_{y},x,y)=\int dx\,dy\,p(z_{x}|x)p(z_{y}|y)p(x,y)\approx\frac{1}{N}\sum_{i=1}^{N}p(z_{x}|x_{i})p(z_{y}|y_{i})=\frac{1}{NM^{2}}\sum_{i=1}^{N}(\sum_{j=1}^{M}\delta(z_{x}-{z_{x}}_{i,j}))(\sum_{j=1}^{M}\delta(z_{y}-{z_{y}}_{i,j})), where {z_{x}}_{i,j}\sim p(z_{x}|x_{i}) and {z_{y}}_{i,j}\sim p(z_{y}|y_{i}), and M is the number of new samples being generated. To sample from p(z_{x})p(z_{y}), we generate samples from p(z_{x},z_{y}) and scramble the generated entries z_{x} and z_{y}, destroying all correlations. With this, the components of the loss function become

\tilde{I}^{E}(X;Z_{X}) \approx\frac{1}{2N}\sum_{i=1}^{N}\left[\text{Tr}({\Sigma_{Z_{X}}(x_{i})})+||\vec{\mu}_{Z_{X}}(x_{i})||^{2}-k_{Z_{X}}-\ln\det(\Sigma_{Z_{X}}(x_{i}))\right], (10)
\tilde{I}^{D}(X;Z_{X}) \approx\frac{1}{MN}\sum_{i,j=1}^{N,M}-\frac{1}{2}||x_{i}-\mu_{X}({z_{x}}_{i,j})||^{2}, (11)
\tilde{I}^{D}_{\rm MINE}(Z_{X};Z_{Y}) \approx\frac{1}{M^{2}N}\sum_{i,j_{x},j_{y}=1}^{N,M,M}\left[T({z_{x}}_{i,j_{x}},{z_{y}}_{i,j_{y}})-\ln\mathcal{Z}_{\text{norm}}\right], (12)

where \mathcal{Z}_{\text{norm}}=\mathbb{E}_{z_{x}\sim p(z_{x}),z_{y}\sim p(z_{y})}[e^{T(z_{x},z_{y})}], k_{Z_{X}} is the dimension of Z_{X}, and the corresponding terms for Y are similar. Combining these terms results in the variational loss for DVSIB:

L_{\text{DVSIB}}=\tilde{I}^{E}(X;Z_{X})+\tilde{I}^{E}(Y;Z_{Y})-\beta\left(\tilde{I}^{D}_{\rm MINE}(Z_{X};Z_{Y})+\tilde{I}^{D}(X;Z_{X})+\tilde{I}^{D}(Y;Z_{Y})\right). (13)
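Putting the pieces together, a schematic sketch of how Eq. (13) can be computed for a minibatch, using the helper functions sketched in the previous subsections (one latent sample per data point; all function and argument names are our own illustrative choices):

```python
def dvsib_loss(x, y, encoder_x, encoder_y, decoder_x, decoder_y, critic, beta):
    """Assembles Eq. (13). encoder_* return (mu, log_var), decoder_* return
    reconstruction means, and critic implements T(z_x, z_y)."""
    mu_x, log_var_x = encoder_x(x)
    mu_y, log_var_y = encoder_y(y)
    zx = reparameterize(mu_x, log_var_x)[0]   # one sample per data point
    zy = reparameterize(mu_y, log_var_y)[0]
    compression = encoder_kl_term(mu_x, log_var_x) + encoder_kl_term(mu_y, log_var_y)
    reconstruction = decoder_term(x, decoder_x(zx)) + decoder_term(y, decoder_y(zy))
    shared_info = mine_lower_bound(critic, zx, zy)
    return compression - beta * (shared_info + reconstruction)
```

Minimizing this quantity by gradient descent trains the encoders, the decoders, and the MINE critic jointly.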
Table 1: Method descriptions, variational losses, and the Bayesian Network graphs for each DR method derived in our framework. See Appendix A for details.
[The Bayesian network graphs (G_{\text{encoder}}, G_{\text{decoder}}) of each method are shown in the corresponding figures of Appendix A and are not reproduced here.]

beta-VAE (Kingma and Welling, 2014; Higgins et al., 2016): Two independent Variational Autoencoder (VAE) models trained, one for each view, X and Y (only the X loss shown).
L_{\text{VAE}}=\tilde{I}^{E}(X;Z_{X})-\beta\tilde{I}^{D}(X;Z_{X})

DVIB (Alemi et al., 2017): Two bottleneck models trained, one for each view, X and Y, using the other view as the supervising signal (only the X loss shown).
L_{\text{DVIB}}=\tilde{I}^{E}(X;Z_{X})-\beta\tilde{I}^{D}(Y;Z_{X})

beta-DVCCA: Similar to DVIB (Alemi et al., 2017), but with reconstruction of both views. Two models trained, compressing either X or Y, while reconstructing both X and Y (only the X loss shown). DVCCA (Wang et al., 2016) is \beta-DVCCA with \beta=1.
L_{\text{DVCCA}}=\tilde{I}^{E}(X;Z_{X})-\beta(\tilde{I}^{D}(Y;Z_{X})+\tilde{I}^{D}(X;Z_{X}))

beta-joint-DVCCA: A single model trained using a concatenated variable [X,Y], learning one latent representation Z. joint-DVCCA (Wang et al., 2016) is \beta-jDVCCA with \beta=1.
L_{\text{jDVCCA}}=\tilde{I}^{E}((X,Y);Z)-\beta(\tilde{I}^{D}(Y;Z)+\tilde{I}^{D}(X;Z))

beta-DVCCA-private: Two models trained, compressing either X or Y, while reconstructing both X and Y, and simultaneously learning private information W_{X} and W_{Y} (only the X loss shown). DVCCA-private (Wang et al., 2016) is \beta-DVCCA-p with \beta=1.
L_{\text{DVCCA-p}}=\tilde{I}^{E}(X;Z)+\tilde{I}^{E}(X;W_{X})+\tilde{I}^{E}(Y;W_{Y})-\beta(\tilde{I}^{D}(X;(W_{X},Z))+\tilde{I}^{D}(Y;(W_{Y},Z)))

beta-joint-DVCCA-private: A single model trained using a concatenated variable [X,Y], learning one latent representation Z, and simultaneously learning private information W_{X} and W_{Y}. joint-DVCCA-private (Wang et al., 2016) is \beta-jDVCCA-p with \beta=1.
L_{\text{jDVCCA-p}}=\tilde{I}^{E}((X,Y);Z)+\tilde{I}^{E}(X;W_{X})+\tilde{I}^{E}(Y;W_{Y})-\beta(\tilde{I}^{D}(X;(W_{X},Z))+\tilde{I}^{D}(Y;(W_{Y},Z)))

DVSIB: A symmetric model trained, producing Z_{X} and Z_{Y}.
L_{\text{DVSIB}}=\tilde{I}^{E}(X;Z_{X})+\tilde{I}^{E}(Y;Z_{Y})-\beta\left(\tilde{I}^{D}_{\text{MINE}}(Z_{X};Z_{Y})+\tilde{I}^{D}(X;Z_{X})+\tilde{I}^{D}(Y;Z_{Y})\right)

DVSIB-private: A symmetric model trained, producing Z_{X} and Z_{Y}, while simultaneously learning private information W_{X} and W_{Y}.
L_{\text{DVSIBp}}=\tilde{I}^{E}(X;W_{X})+\tilde{I}^{E}(X;Z_{X})+\tilde{I}^{E}(Y;Z_{Y})+\tilde{I}^{E}(Y;W_{Y})-\beta\left(\tilde{I}^{D}_{\text{MINE}}(Z_{X};Z_{Y})+\tilde{I}^{D}(X;(Z_{X},W_{X}))+\tilde{I}^{D}(Y;(Z_{Y},W_{Y}))\right)

3 Deriving other DR methods

The variational bounds used in DVSIB can be used to implement loss functions that correspond to other encoder-decoder graph pairs and hence to other DR algorithms. The simplest is the beta variational auto-encoder. Here G_{\rm encoder} consists of one term: X compressed into Z_{X}. Similarly, G_{\rm decoder} consists of one term: X decoded from Z_{X} (see Table 1). Using this simple set of Bayesian networks, we find the variational loss:

L_{\text{beta-VAE}}=\tilde{I}^{E}(X;Z_{X})-\beta\tilde{I}^{D}(X;Z_{X}). (14)

Both terms in Eq. (14) are the same as Eqs. (10, 11) and can be approximated and implemented by neural networks.

Similarly, we can re-derive the DVCCA family of losses (Wang et al., 2016). Here G_{\rm encoder} is X compressed into Z_{X}. G_{\rm decoder} reconstructs both X and Y from the same compressed latent space Z_{X}. In fact, our loss function is more general than the DVCCA loss and has an additional compression-reconstruction trade-off parameter \beta. We call this more general loss \beta-DVCCA, and the original DVCCA emerges when \beta=1:

L_{\rm DVCCA}=\tilde{I}^{E}(X;Z_{X})-\beta(\tilde{I}^{D}(Y;Z_{X})+\tilde{I}^{D}(X;Z_{X})). (15)

Using the same library of terms as we found in DVSIB, Eqs. (10, 11), we find:

L_{\text{DVCCA}}\approx\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z_{x}|x_{i})\|r(z_{x}))-\beta\left(\frac{1}{N}\sum_{i=1}^{N}\int dz_{x}\,p(z_{x}|x_{i})\ln(q(y_{i}|z_{x}))+\frac{1}{N}\sum_{i=1}^{N}\int dz_{x}\,p(z_{x}|x_{i})\ln(q(x_{i}|z_{x}))\right). (16)

This is similar to the loss function of the deep variational CCA (Wang et al., 2016), but now it has a trade-off parameter \beta. It trades off the compression into Z_{X} against the reconstruction of X and Y from the compressed variable Z_{X}.
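To illustrate how the same library of terms is reused, a minimal sketch of Eq. (16) assembled from the helper functions sketched in Sec. 2 (the names are ours); note that both decoders take samples of the same latent variable Z_X:

```python
def beta_dvcca_loss(x, y, encoder_x, decoder_x, decoder_y, beta):
    """Sketch of Eq. (16): compress X into Z_X, reconstruct both X and Y from Z_X."""
    mu_x, log_var_x = encoder_x(x)
    zx = reparameterize(mu_x, log_var_x)[0]
    compression = encoder_kl_term(mu_x, log_var_x)
    reconstruction = decoder_term(x, decoder_x(zx)) + decoder_term(y, decoder_y(zx))
    return compression - beta * reconstruction
```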

Table 1 shows how our framework reproduces and generalizes other DR losses (see Appendix A). Our framework naturally extends beyond two variables as well (see Appendix B).

4 Results

Figure 2: Dataset consisting of pairs of digits drawn from MNIST that share an identity. Top row, X: MNIST digits randomly scaled (0.5-1.5) and rotated (0-\pi/2). Bottom row, Y: MNIST digits with a background Perlin noise. t-SNE of the X and Y datasets (left and middle) shows poor separation by digit, and there is a wide range of correlation between X and Y (right).

To test our methods, we created a dataset inspired by the noisy MNIST dataset (LeCun et al., 1998; Wang et al., 2015, 2016), consisting of two distinct views of data, both with dimensions of 28\times 28 pixels, cf. Fig. 2. The first view comprises the original image randomly rotated by an angle uniformly sampled between 0 and \pi/2 and scaled by a factor uniformly distributed between 0.5 and 1.5. The second view consists of the original image with an added background Perlin noise (Perlin, 1985) with the noise factor uniformly distributed between 0 and 1. Both image intensities are scaled to the range of [0,1). The dataset was shuffled within labels, retaining only the shared label identity between two images, while disregarding the view-specific details, i.e., the random rotation and scaling for X, and the correlated background noise for Y. The dataset, totaling 70,000 images, was partitioned into training (80%), testing (10%), and validation (10%) subsets. Visualization via t-SNE (Hinton and Roweis, 2002) plots of the original dataset suggests poor separation by digit, and the two digit views have diverse correlations, making this a sufficiently hard problem.

The DR methods we evaluated include all methods from Tbl. 1. PCA and CCA (Hotelling, 1933, 1936) served as a baseline for linear dimensionality reduction. The Multi-view Information Bottleneck (Federici et al., 2020b) was included for a specific comparison with DVSIB (see Appendix C). We emphasize that none of the algorithms were given labeled data. They had to infer compressed latent representations that presumably should cluster into ten different digits based simply on the fact that images come in pairs, and the (unknown) digit label is the only information that relates the two images.

Each method was trained for 100 epochs using fully connected neural networks with layer sizes (input_dim, 1024, 1024, (k_{Z}, k_{Z})), where k_{Z} is the latent dimension size, employing ReLU activations for the hidden layers. The input dimension (input_dim) was either the size of X (784) or the size of the concatenated [X,Y] (1568). The last two layers of size k_{Z} represented the learned means and log(variance). For the decoders, we employed regular decoders, fully connected neural networks with layer sizes (k_{Z}, 1024, 1024, output_dim), using ReLU activations for the hidden layers and sigmoid activation for the output layer. Again, the output dimension (output_dim) could either be the size of X (784) or the size of the concatenated [X,Y] (1568). The latent dimension (k_{Z}) could be k_{Z_{X}} or k_{Z_{Y}} for regular decoders, or k_{Z_{X}}+k_{W_{X}} or k_{Z_{Y}}+k_{W_{Y}} for decoders with private information. Additionally, another decoder, denoted decoder_MINE and based on the MINE estimator for estimating I(Z_{X},Z_{Y}), was used in DVSIB and DVSIB with private information. The decoder_MINE is a fully connected neural network with layer sizes (k_{Z_{X}}+k_{Z_{Y}}, 1024, 1024, 1) and ReLU activations for the hidden layers. Optimization was conducted using the ADAM optimizer with default parameters.
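A sketch of the encoder and decoder architectures described in this paragraph (the layer sizes follow the text; the class names and all other details are illustrative assumptions rather than the released implementation):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Fully connected encoder (input_dim, 1024, 1024, (k_Z, k_Z)), ReLU hidden layers."""
    def __init__(self, input_dim: int, k_z: int, hidden: int = 1024):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, k_z)        # mean head
        self.log_var = nn.Linear(hidden, k_z)   # log-variance head

    def forward(self, x: torch.Tensor):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    """Fully connected decoder (k_Z, 1024, 1024, output_dim) with a sigmoid output."""
    def __init__(self, k_z: int, output_dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(k_z, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, output_dim), nn.Sigmoid(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)
```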

Table 2: Maximum accuracy from a linear SVM and the optimal k_{Z} and \beta for variational DR methods reported on the Y (above the break) and the joint [X,Y] (below the break) datasets. (A dash indicates a fixed value.)
Method  Acc. %  k_{Z} best  95% k_{Z} range  \beta best  95% \beta range  C best
Baseline  90.8  784  -  -  -  0.1
PCA  90.5  256  [64,256*]  -  -  1
CCA  85.7  256  [32,256*]  -  -  10
\beta-VAE  96.3  256  [64,256*]  32  [2,1024*]  10
DVIB  90.4  256  [16,256*]  512  [8,1024*]  0.003
DVCCA  89.6  128  [16,256*]  1  -  31.623
\beta-DVCCA  95.4  256  [64,256*]  16  [2,1024*]  10
DVCCA-p  92.1  16  [16,256*]  1  -  0.316
\beta-DVCCA-p  95.5  16  [4,256*]  1024  [1,1024*]  0.316
MVIB  97.7  8  [4,64]  1024  [128,1024*]  0.01
DVSIB  97.8  256  [8,256*]  128  [2,1024*]  3.162
DVSIB-p  97.8  256  [8,256*]  32  [2,1024*]  10

jBaseline  91.9  1568  -  -  -  0.003
jDVCCA  92.5  256  [64,265*]  1  -  10
\beta-jDVCCA  96.7  256  [16,265*]  256  [1,1024*]  1
jDVCCA-p  92.5  64  [32,265*]  1  -  10
\beta-jDVCCA-p  92.7  256  [4,265*]  2  [1,1024*]  10
Figure 3: Top: t-SNE plot of the latent space Z_{Y} of DVSIB colored by the identity of digits. Top Right: Classification accuracy of an SVM trained on DVSIB’s Z_{Y} latent space. The accuracy was evaluated for DVSIB with a parameter sweep of the trade-off parameter \beta=2^{-5},\dots,2^{10} and the latent dimension k_{Z}=2^{1},\dots,2^{8}. The max accuracy was 97.8% for \beta=128 and k_{Z}=256. Bottom: Example digits generated by sampling from the DVSIB decoder, X and Y branches.

To evaluate the methods, we trained them on the training portions of X and Y without exposure to the true labels. Subsequently, we utilized the trained encoders to compute Z_{train}, Z_{test}, and Z_{validation} on the respective datasets. To assess the quality of the learned representations, we revealed the labels of Z_{train} and trained a linear SVM classifier with Z_{train} and labels_{train}. Fine-tuning of the classifier was performed to identify the optimal SVM slack parameter (C value), maximizing accuracy on Z_{test}. This best classifier was then used to predict Z_{validation}, yielding the reported accuracy. We also conducted classification experiments using fully connected neural networks, with detailed results available in Appendix D. For both the SVM and the fully connected network, we found the baseline accuracy on the original training data and labels, (X_{train}, labels_{train}) and (Y_{train}, labels_{train}), fine-tuned with the test datasets, and reported the results on the validation datasets. Using a linear SVM enables us to assess the linear separability of the clusters of Z_{X} and Z_{Y} obtained through the DR methods. While neural networks excel at uncovering nonlinear relationships that could result in higher classification accuracy, the comparison with a linear SVM establishes a level playing field. It ensures a fair comparison among different methods and is independent of the success of the classifier used for comparison in detecting nonlinear features in the data, which might have been missed by the DR methods. Here, we focus on the results of the Y datasets (MNIST with correlated noise background); results for X are in Appendix D. A parameter sweep was performed to identify optimal k_{Z} values, ranging from 2^{1} to 2^{8} dimensions on a log_{2} scale, as well as optimal \beta values, ranging from 2^{-5} to 2^{10}. For methods with private information, k_{W_{X}} and k_{W_{Y}} were varied from 2^{1} to 2^{6}. The highest accuracy is reported in Tbl. 2, along with the optimal parameters used to obtain this accuracy. Additionally, for every method we find the range of \beta and the dimensionality k_{Z} of the latent variable Z_{Y} that gives 95% of the method’s maximum accuracy. If the range includes the limits of the parameter, this is indicated by an asterisk.
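A minimal sketch of this evaluation protocol using scikit-learn; the array names and the C grid are our own illustrative choices (the C values reported in Tbl. 2 suggest a logarithmically spaced grid was used):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def evaluate_latents(Z_train, y_train, Z_test, y_test, Z_val, y_val,
                     C_grid=(1e-3, 1e-2, 0.1, 1.0, 10.0, 100.0)):
    """Tune the SVM slack parameter C on the test split; report accuracy on validation."""
    best_C, best_acc = None, -np.inf
    for C in C_grid:
        clf = LinearSVC(C=C).fit(Z_train, y_train)
        acc = accuracy_score(y_test, clf.predict(Z_test))
        if acc > best_acc:
            best_C, best_acc = C, acc
    final = LinearSVC(C=best_C).fit(Z_train, y_train)
    return best_C, accuracy_score(y_val, final.predict(Z_val))
```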

Figure 3 shows a t-SNE plot of DVSIB’s latent space, Z_{Y}, colored by the identity of the digits. The resulting latent space has 10 clusters, each corresponding to one digit. The clusters are well separated and interpretable. Further, DVSIB’s Z_{Y} latent space provides the best classification of digits using a linear method such as an SVM, showing that the latent space is linearly separable. The maximum classification accuracy obtained by DVSIB with the linear SVM is 97.8%. Crucially, DVSIB maintains an accuracy of at least 92.9% (95% of 97.8%) for \beta\in[2,1024^{*}] and k_{Z}\in[8,256^{*}]. This accuracy is high compared to other methods and is maintained over a large range of hyperparameters, across which DVSIB correctly captures information about the identity of the shared digit. DVSIB is a generative method; we provide sample digits generated from the decoders that were trained from the model graph.

Figure 4: The best SVM classification accuracy curves for each method. Here DVSIB and DVSIB-private obtained the best accuracy and, together with \beta-DVCCA-private, they had the best accuracy for low-dimensional latent spaces.
Figure 5: Classification accuracy (A) of DVSIB has a better sample size (n) dependent scaling. Main: a log-log plot of 100%-A vs 1/n. The slopes of the fitted lines are 0.345\pm 0.007 for DVSIB and 0.196\pm 0.013 for \beta-VAE, corresponding to a faster increase of the accuracy of DVSIB with n. Inset: the same data, but plotted as A vs n.

In Fig. 4, we show the highest SVM classification accuracy curves for each method. DVSIB and DVSIB-private tie for the best classification accuracy for Y. Together with \beta-DVCCA-private, they have the highest accuracy for all dimensions of the latent space, k_{Z}. In theory, only one dimension should be needed to capture the identity of a digit, but our data sets also contain information about the rotation and scale for X and the strength of the background noise for Y. Y should then need at least two latent dimensions to be reconstructed, and X should need at least three. Since DVSIB, DVSIB-private, and \beta-DVCCA-private performed with the best accuracy starting with the smallest k_{Z}, we conclude that methods with encoder-decoder graphs that more closely match the structure of the data produce higher accuracy with lower dimensional latent spaces.

Next, in Fig. 5, we compare the sample training efficiency of DVSIB and \beta-VAE by training new instances of these methods on a geometrically increasing number of samples n=[256,339,451,\dots,\sim 42k,\sim 56k], consisting of 20 subsamples of the full training data (X,Y) to get (X_{train_n}, Y_{train_n}), where each larger subsample includes the previous one. Each method was trained for 60 epochs, and we used \beta=1024 (as defined by the DVMIB framework). Further, all reported results are with the latent space size k_{Z}=64. We explored other numbers of training epochs and latent space dimensions (see Appendix D.6), but did not observe qualitative differences. We follow the same procedure as outlined earlier, using the 20 trained encoders for each method to compute Z_{train_n}, Z_{test}, and Z_{validation} for the training, test, and validation datasets. As before, we then train and evaluate the classification accuracy of SVMs for the Z_{Y} representation learned by each method. Fig. 5, inset, shows the classification accuracy of each method as a function of the number of samples used in training. Again, CCA and PCA serve as linear method baselines. PCA is able to capture the linear correlations in the dataset consistently, even at low sample sizes. However, it is unable to capture the nonlinearities of the data, and its accuracy does not improve with the sample size. Because of the iterative nature of the implementation of the PCA algorithm (Pedregosa et al., 2011), it is able to capture some linear correlations in a relatively low number of dimensions, which are sufficiently sampled even with small-sized datasets. Thus the accuracy of PCA barely depends on the training set size. CCA, on the other hand, does not work in the under-sampled regime (see Abdelaleem et al. (2023) for a discussion of this). DVSIB performs uniformly better, at all training set sizes, than the \beta-VAE. Furthermore, DVSIB improves its quality faster, with a different sample size scaling. Specifically, the DVSIB and \beta-VAE accuracy (A, measured in percent) appears to follow the scaling form A=100-c/n^{m}, where c is a constant, and the scaling exponent is m=0.345\pm 0.007 for DVSIB and 0.196\pm 0.013 for \beta-VAE. We illustrate this scaling in Fig. 5 by plotting a log-log plot of 100-A vs 1/n and observing a linear relationship.
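As an illustration of the scaling analysis just described, a small helper (our own, assuming accuracy is measured in percent) that extracts the exponent m in A = 100 - c/n^m from a log-log fit:

```python
import numpy as np

def fit_scaling_exponent(n_samples: np.ndarray, accuracy: np.ndarray) -> float:
    """With A = 100 - c / n^m, log(100 - A) = log(c) + m * log(1/n),
    so m is the slope of log(100 - A) against log(1/n)."""
    slope, _intercept = np.polyfit(np.log(1.0 / n_samples), np.log(100.0 - accuracy), 1)
    return slope  # the scaling exponent m
```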

5 Conclusion

We developed an MIB-based framework for deriving variational loss functions for DR applications. We demonstrated the use of this framework by developing a novel variational method, DVSIB. DVSIB compresses the variables X and Y into latent variables Z_{X} and Z_{Y}, respectively, while maximizing the information between Z_{X} and Z_{Y}. The method generates two distinct latent spaces, a feature highly sought after in various applications, and it accomplishes this with superior data efficiency compared to other methods. The example of DVSIB demonstrates the process of deriving variational bounds for terms present in all examined DR methods. A comprehensive library of typical terms is included in Appendix A for reference, which can be used to derive additional DR methods. Further, we (re)derive several DR methods, as outlined in Table 1. These include well-known techniques such as \beta-VAE, DVIB, DVCCA, and DVCCA-private. MIB naturally introduces a trade-off parameter into the DVCCA family of methods, resulting in what we term the \beta-DVCCA DR methods, of which DVCCA is a special case. We implement this new family of methods and show that it produces better latent spaces than DVCCA at \beta=1, cf. Tbl. 2.

We observe that methods that more closely match the structure of dependencies in the data can give better latent spaces, as measured by the dimensionality of the latent space and the accuracy of reconstruction (see Figure 4). This makes DVSIB, DVSIB-private, and \beta-DVCCA-private perform the best. DVSIB and DVSIB-private both have separate latent spaces for X and Y. The private methods allow us to learn additional aspects about X and Y that are not important for the shared digit label, but allow reconstruction of the rotation and scale for X and the background noise of Y. We also found that DVSIB can make more efficient use of data when producing latent spaces as compared to \beta-VAEs and linear methods.

Our framework may be extended beyond variational approaches. For instance, in the deterministic limit of VAE, autoencoders can be retrieved by defining the encoder/decoder graphs as nonlinear neural networks z=f(x) and x=g(z). Additionally, linear methods like CCA can be viewed as special cases of the information bottleneck (Chechik et al., 2003) and hence must follow from our approach. Similarly, by using specialized encoder and decoder neural networks, e.g., convolutional ones, our framework can implement symmetries and other constraints into the DR process. Overall, the framework serves as a versatile and customizable toolkit, capable of encompassing a wide spectrum of dimensionality reduction methods. With the provided tools and code, we aim to facilitate the adaptation of the approach to diverse problems.


Acknowledgments and Disclosure of Funding

EA and KMM contributed equally to this work. We thank Sean Ridout and Michael Pasek for providing feedback on the manuscript. EA and KMM also thank Ahmed Roman for useful discussions. IN is particularly grateful to the late Tali Tishby for many discussions about life, science, and information bottleneck over the years. This work was funded, in part, by NSF Grants Nos. 2010524 and 2014173, by the Simons Investigator award to IN, and the Simons-Emory International Consortium on Motor Control. We acknowledge support of our work through the use of the HyPER C3 cluster of Emory University’s AI.Humanity Initiative.

Appendix A Deriving and designing variational losses

In the next two sections, we provide a library of typical terms found in encoder graphs, Appendix A.1, and decoder graphs, Appendix A.2. In Appendix A.3, we provide examples of combining these terms to produce variational losses corresponding to beta-VAE, DVIB, beta-DVCCA, beta-DVCCA-joint, beta-DVCCA-private, DVSIB, and DVSIB-private.

A.1 Encoder graph components

We expand Sec. 2.2 and present a range of common components found in encoder graphs across various DR methods, cf. Fig. (6).

[Panels a-d of the encoder graph components: (a) X and Z_{X}; (b) X, W_{X}, and Z_{X}; (c) X, Y, and Z; (d) X and Y.]
Figure 6: Encoder graph components.
  a.

    This graph corresponds to compressing the random variable X to Z_{X}. Variational bounds for encoders of this type were derived in the main text in Sec. 2.2 and correspond to the loss:

    \tilde{I}^{E}(X;Z_{X}) = \frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z_{x}|x_{i})\|r(z_{x})) \approx \frac{1}{2N}\sum_{i=1}^{N}\left[\text{Tr}({\Sigma_{Z_{X}}(x_{i})})+||\vec{\mu}_{Z_{X}}(x_{i})||^{2}-k_{Z_{X}}-\ln\det(\Sigma_{Z_{X}}(x_{i}))\right]. (17)
  b.

    This type of encoder graph is similar to the first, but now with two outputs, Z_{X} and W_{X}. This corresponds to making two encoders, one for Z_{X} and one for W_{X}, \tilde{I}^{E}(Z_{X};X)+\tilde{I}^{E}(W_{X};X), where

    \tilde{I}^{E}(Z_{X};X) \approx\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z_{x}|x_{i})\|r(z_{x})), (18)
    \tilde{I}^{E}(W_{X};X) \approx\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(w_{x}|x_{i})\|r(w_{x})). (19)
  c.

    This type of encoder consists of compressing X and Y into a single variable Z. It corresponds to the information loss I^{E}(Z;(X,Y)). This again has a similar encoder structure to type (a), but X is replaced by a joint variable (X,Y). For this loss, we find a variational version:

    \tilde{I}^{E}(Z;(X,Y))\approx\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z|x_{i},y_{i})\|r(z)). (20)
  d.

    This final type of an encoder term corresponds to the information I^{E}(X,Y), which is constant with respect to our minimization. In practice, we drop terms of this type.

A.2 Decoder graph components

In this section, we elaborate on the decoder graphs that appear in the DR methods we consider, cf. Fig. (7).

[Panels a-d of the decoder graph components: (a) X and Z_{X}; (b) X, W_{X}, and Z_{X}; (c) X, Y, and Z; (d) Z_{X} and Z_{Y}.]
Figure 7: Decoder graph components.

All decoder graphs sample from their methods’ corresponding encoder graph.

  a.

    In this decoder graph, we decode X from the compressed variable Z_{X}. Variational bounds for decoders of this type were derived in the main text, Sec. 2.3, and they correspond to the loss:

    \tilde{I}^{D}(X;Z_{X}) = H(X)+\frac{1}{N}\sum_{i=1}^{N}\int dz_{x}\,p(z_{x}|x_{i})\ln q(x_{i}|z_{x}) \approx H(X)+\frac{1}{MN}\sum_{i,j=1}^{N,M}-\frac{1}{2}||x_{i}-\mu_{X}({z_{x}}_{i,j})||^{2}, (21)

    where H(X) can be dropped from the loss since it doesn’t change in optimization.

  b.

    This type of decoder term is similar to that in part (a), but X is decoded from two variables simultaneously. The corresponding loss term is I^{D}(X;(Z_{X},W_{X})). We find a variational loss by replacing Z_{X} in part (a) by (Z_{X},W_{X}):

    \tilde{I}^{D}(X;(Z_{X},W_{X}))\approx H(X)+\frac{1}{N}\sum_{i=1}^{N}\int dz_{x}\,dw_{x}\,p(z_{x},w_{x}|x_{i})\ln(q(x_{i}|z_{x},w_{x})), (22)

    where, again, the entropy of X can be dropped.

  c.

    This decoder term can be obtained by adding two decoders of type (a) together. In this case, the loss term is I^{D}(X;Z)+I^{D}(Y;Z):

    \tilde{I}^{D}(X;Z)+\tilde{I}^{D}(Y;Z)\approx H(X)+H(Y)+\frac{1}{N}\sum_{i=1}^{N}\int dz\,p(z|x_{i})\ln(q(x_{i}|z))+\frac{1}{N}\sum_{i=1}^{N}\int dz\,p(z|y_{i})\ln(q(y_{i}|z)), (23)

    and the entropy terms can be dropped, again.

  d.

    Decoders of this type were discussed in the main text in Sec. 2.4. They correspond to the information between the latent variables Z_{X} and Z_{Y}. We use the MINE estimator to find variational bounds for such terms:

    \tilde{I}^{D}_{\rm MINE}(Z_{X};Z_{Y}) = \int dz_{x}\,dz_{y}\,p(z_{x},z_{y})\ln\frac{e^{T(z_{x},z_{y})}}{\mathcal{Z}_{\text{norm}}}\approx\frac{1}{NM^{2}}\sum_{i,j_{x},j_{y}=1}^{N,M,M}\left[T({z_{x}}_{i,j_{x}},{z_{y}}_{i,j_{y}})-\ln\mathcal{Z}_{\text{norm}}\right]. (24)

A.3 Detailed method implementations

For completeness, we provide detailed implementations of methods outlined in Tbl. 1.

A.3.1 Beta Variational Auto-Encoder

[Diagram: G_{\text{encoder}} compresses X into Z_{X}; G_{\text{decoder}} reconstructs X from Z_{X}.]
Figure 8: Encoder and decoder graphs for the beta-variational auto-encoder method

A variational autoencoder (Kingma and Welling, 2014; Higgins et al., 2016) compresses X into a latent variable Z_{X} and then reconstructs X from the latent variable, cf. Fig. (8). The overall loss is a trade-off between the compression I^{E}(X;Z_{X}) and the reconstruction I^{D}(X;Z_{X}):

I^{E}(X;Z_{X})-\beta I^{D}(X;Z_{X})\leq\tilde{I}^{E}(X;Z_{X})-\beta\tilde{I}^{D}(X;Z_{X})\lesssim\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z_{x}|x_{i})\|r(z_{x}))-\beta\left(H(X)+\frac{1}{N}\sum_{i=1}^{N}\int dz_{x}\,p(z_{x}|x_{i})\ln(q(x_{i}|z_{x}))\right). (25)

H(X) is a constant with respect to the minimization, and it can be omitted from the loss. Similar to the DVSIB case in the main text, we make ansatzes for the forms of each of the variational distributions. We choose parametric distribution families and learn the nearest distribution in these families consistent with the data. Specifically, we assume p(z_{x}|x) is a normal distribution with mean \mu_{Z_{X}}(x) and variance \Sigma_{Z_{X}}(x). We learn the mean and the log-variance as neural networks. We also assume that q(x|z_{x}) is normal with a mean \mu_{X}(z_{x}) and a unit variance. Finally, we assume that r(z_{x}) is a standard normal distribution. We then use the re-parameterization trick to produce samples z_{x_{j}}(x)=\mu(x)+\sqrt{\Sigma_{Z_{X}}(x)}\,\eta_{j} from p(z_{x}|x), where \eta_{j} is drawn from a standard normal distribution. Overall, this gives:

L_{\text{VAE}}=\frac{1}{2N}\sum_{i=1}^{N}\left[\text{Tr}({\Sigma_{Z_{X}}(x_{i})})+\vec{\mu}_{Z_{X}}(x_{i})^{T}\vec{\mu}_{Z_{X}}(x_{i})-k_{Z_{X}}-\ln\det(\Sigma_{Z_{X}}(x_{i}))\right]-\beta\left(\frac{1}{MN}\sum_{i=1}^{N}\sum_{j=1}^{M}-\frac{1}{2}(x_{i}-\mu_{X}(z_{x_{j}}))^{T}(x_{i}-\mu_{X}(z_{x_{j}}))\right). (26)

This is the same loss as for a beta auto-encoder. However, following the convention in the Information Bottleneck literature (Tishby et al., 2000; Friedman et al., 2013), our \beta is the inverse of the one typically used for beta auto-encoders. A small \beta in our case results in a stronger compression, while a large \beta results in a better reconstruction.

A.3.2 Deep Variational Information Bottleneck

[Diagram: G_{\text{encoder}} compresses X into Z_{X}; G_{\text{decoder}} reconstructs Y from Z_{X}.]
Figure 9: Encoder and decoder graphs for the Deep Variational Information Bottleneck.

Just as in the beta auto-encoder, we immediately write down the loss function for the information bottleneck. Here, the encoder graph compresses X into Z_{X}, while the decoder tries to maximize the information between the compressed variable and the relevant variable Y, cf. Fig. (9). The resulting loss function is:

L_{\text{IB}}=I^{E}(X;Y)+I^{E}(X;Z_{X})-\beta I^{D}(Y;Z_{X}). (27)

Here the information between X and Y does not depend on p(z_{x}|x) and can be dropped in the optimization.

Thus the Deep Variational Information Bottleneck (Alemi et al., 2017) becomes:

L_{\text{DVIB}}\approx\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z_{x}|x_{i})\|r(z_{x}))-\beta\left(\frac{1}{N}\sum_{i=1}^{N}\int dz_{x}\,p(z_{x}|x_{i})\ln(q(y_{i}|z_{x}))\right), (28)

where we dropped H(Y) since it doesn’t change in the optimization.

As before, we choose to parameterize all these distributions by Gaussians, whose means and log variances are learned by neural networks. Specifically, we parameterize p(z_{x}|x)=N(\mu_{z_{x}}(x),\Sigma_{z_{x}}), r(z_{x})=N(0,I), and q(y|z_{x})=N(\mu_{Y},I). Again, we can use the reparameterization trick and sample from p(z_{x}|x_{i}) by z_{x_{j}}(x)=\mu(x)+\sqrt{\Sigma_{z_{x}}(x)}\,\eta_{j}, where \eta_{j} is drawn from a standard normal distribution.

A.3.3 Beta Deep Variational CCA

[Diagram: G_{\text{encoder}} compresses X into Z_{X}; G_{\text{decoder}} reconstructs both X and Y from Z_{X}.]
Figure 10: Encoder and decoder graphs for beta Deep Variational CCA.

beta-DVCCA, cf. Fig. 10, is similar to the traditional information bottleneck, but now X and Y are both used as relevance variables:

L_{\rm DVCCA}=\tilde{I}^{E}(X;Y)+\tilde{I}^{E}(X;Z_{X})-\beta(\tilde{I}^{D}(Y;Z_{X})+\tilde{I}^{D}(X;Z_{X})) (29)

Using the same library of terms as before, we find:

L_{\text{DVCCA}}\approx\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z_{x}|x_{i})\|r(z_{x}))-\beta\left(\frac{1}{N}\sum_{i=1}^{N}\int dz_{x}\,p(z_{x}|x_{i})\ln(q(y_{i}|z_{x}))+\frac{1}{N}\sum_{i=1}^{N}\int dz_{x}\,p(z_{x}|x_{i})\ln(q(x_{i}|z_{x}))\right). (30)

This is similar to the loss function of the deep variational CCA (Wang et al., 2016), but now it has a trade-off parameter $\beta$. It trades off the compression of $X$ into $Z_X$ against the reconstruction of both $X$ and $Y$ from the compressed variable $Z_X$.
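A sketch of Eq. (30) follows the same pattern as above, with one encoder and two unit-variance Gaussian decoders; decoder_mean_x and decoder_mean_y are assumed module names.

```python
# Sketch of the beta-DVCCA loss of Eq. (30): compress X into Z_X, then reconstruct
# both X and Y from the same latent sample.
import torch

def beta_dvcca_loss(x, y, mu_zx, log_var_zx, decoder_mean_x, decoder_mean_y, beta=1.0):
    kl = 0.5 * (log_var_zx.exp() + mu_zx.pow(2) - 1.0 - log_var_zx).sum(dim=1).mean()
    z_x = mu_zx + torch.exp(0.5 * log_var_zx) * torch.randn_like(mu_zx)
    log_q_y = -0.5 * (y - decoder_mean_y(z_x)).pow(2).sum(dim=1).mean()
    log_q_x = -0.5 * (x - decoder_mean_x(z_x)).pow(2).sum(dim=1).mean()
    return kl - beta * (log_q_y + log_q_x)
```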

A.3.4 beta joint-Deep Variational CCA

Figure 11: Encoder and decoder graphs for beta joint-Deep Variational CCA.

Joint deep variational CCA (Wang et al., 2016), cf. Fig. 11, compresses $(X,Y)$ into a single $Z$ and then reconstructs the individual variables $X$ and $Y$,

L_{\rm DVCCA}=I^{E}(X;Y)+I^{E}((X,Y);Z)-\beta\left(I^{D}(Y;Z)+I^{D}(X;Z)\right). (31)

Using the terms we derived, the loss function is:

L_{\text{DVCCA}}\approx\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z|x_{i},y_{i})\|r(z))\\ -\beta\left(\frac{1}{N}\sum_{i=1}^{N}\int dz\,p(z|x_{i},y_{i})\ln(q(y_{i}|z))+\frac{1}{N}\sum_{i=1}^{N}\int dz\,p(z|x_{i},y_{i})\ln(q(x_{i}|z))\right). (32)

The information between $X$ and $Y$ does not change under the minimization and can be dropped.
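For the joint version, the only structural change relative to the beta-DVCCA sketch above is that the encoder sees both views, so the compression and reconstruction terms use the joint posterior $p(z|x,y)$. The sketch below assumes a single encoder_joint module returning the mean and log-variance of this joint posterior; all names are illustrative.

```python
# Sketch of the beta-joint-DVCCA loss of Eq. (32); encoder_joint, decoder_mean_x,
# and decoder_mean_y are assumed module names, not the paper's code.
import torch

def beta_joint_dvcca_loss(x, y, encoder_joint, decoder_mean_x, decoder_mean_y, beta=1.0):
    mu_z, log_var_z = encoder_joint(torch.cat([x, y], dim=1))
    kl = 0.5 * (log_var_z.exp() + mu_z.pow(2) - 1.0 - log_var_z).sum(dim=1).mean()
    z = mu_z + torch.exp(0.5 * log_var_z) * torch.randn_like(mu_z)
    log_q_y = -0.5 * (y - decoder_mean_y(z)).pow(2).sum(dim=1).mean()
    log_q_x = -0.5 * (x - decoder_mean_x(z)).pow(2).sum(dim=1).mean()
    return kl - beta * (log_q_y + log_q_x)
```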

A.3.5 beta (joint) Deep Variational CCA-private

Figure 12: Encoder and decoder graphs for beta Deep Variational CCA-private.

This is a generalization of the Deep Variational CCA (Wang et al., 2016) to include private information, cf. Fig. 12. Here, $X$ is encoded into a shared latent variable $Z$ and a private latent variable $W_X$. Similarly, $Y$ is encoded into the same shared variable $Z$ and a different private latent variable $W_Y$. $X$ is reconstructed from $Z$ and $W_X$, and $Y$ is reconstructed from $Z$ and $W_Y$. In the joint version, $(X,Y)$ are compressed jointly into $Z$, similar to the previous joint methods. What follows is the $X$ version of the beta Deep Variational CCA-private loss:

L_{\rm DVCCAp}=I^{E}(X;Y)+I^{E}((X,Y);Z)+I^{E}(X;W_{X})+I^{E}(Y;W_{Y})\\ -\beta\left(I^{D}(X;(W_{X},Z))+I^{D}(Y;(W_{Y},Z))\right). (33)

After the usual variational manipulations, this becomes:

L_{\text{DVCCAp}}\approx\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z|x_{i})\|r(z))+\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(w_{x}|x_{i})\|r(w_{x}))\\ +\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(w_{y}|y_{i})\|r(w_{y}))-\beta\left(\frac{1}{N}\sum_{i=1}^{N}\int dz\,dw_{x}\,p(w_{x}|x_{i})p(z|x_{i})\ln(q(x_{i}|z,w_{x}))\right.\\ +\left.\frac{1}{N}\sum_{i=1}^{N}\int dz\,dw_{y}\,p(w_{y}|y_{i})p(z|x_{i})\ln(q(y_{i}|z,w_{y}))\right). (34)
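A corresponding minimal sketch has one shared encoder and two private encoders; as above, the Gaussian ansatz is assumed and all module names are placeholders.

```python
# Sketch of the beta-DVCCA-private loss of Eq. (34): shared latent z from X, private
# latents w_x from X and w_y from Y; X is reconstructed from (z, w_x) and Y from (z, w_y).
import torch

def gauss_kl(mu, log_var):
    # KL between N(mu, diag(exp(log_var))) and the standard normal prior, batch-averaged.
    return 0.5 * (log_var.exp() + mu.pow(2) - 1.0 - log_var).sum(dim=1).mean()

def sample(mu, log_var):
    # Reparameterized sample from a diagonal Gaussian.
    return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

def beta_dvcca_private_loss(x, y, enc_z, enc_wx, enc_wy, dec_x, dec_y, beta=1.0):
    mu_z, lv_z = enc_z(x)
    mu_wx, lv_wx = enc_wx(x)
    mu_wy, lv_wy = enc_wy(y)
    kl = gauss_kl(mu_z, lv_z) + gauss_kl(mu_wx, lv_wx) + gauss_kl(mu_wy, lv_wy)
    z, w_x, w_y = sample(mu_z, lv_z), sample(mu_wx, lv_wx), sample(mu_wy, lv_wy)
    log_q_x = -0.5 * (x - dec_x(torch.cat([z, w_x], dim=1))).pow(2).sum(dim=1).mean()
    log_q_y = -0.5 * (y - dec_y(torch.cat([z, w_y], dim=1))).pow(2).sum(dim=1).mean()
    return kl - beta * (log_q_x + log_q_y)
```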

A.3.6 Deep Variational Symmetric Information Bottleneck

This has been analyzed in detail in the main text, Sec. 2.1, and will not be repeated here.

A.3.7 Deep Variational Symmetric Information Bottleneck-private

Figure 13: Encoder and decoder graphs for DVSIB-private.

This is a generalization of the Deep Variational Symmetric Information Bottleneck to include private information. Here, $X$ is encoded into a shared latent variable $Z_X$ and a private latent variable $W_X$. Similarly, $Y$ is encoded into its own shared latent variable $Z_Y$ and a private latent variable $W_Y$. $X$ is reconstructed from $Z_X$ and $W_X$, and $Y$ is reconstructed from $Z_Y$ and $W_Y$. $Z_X$ and $Z_Y$ are constructed to be maximally informative about each other. This results in

L_{\text{DVSIBp}}=I^{E}(X;W_{X})+I^{E}(X;Z_{X})+I^{E}(Y;Z_{Y})+I^{E}(Y;W_{Y})\\ -\beta\left(I^{D}(Z_{X};Z_{Y})+I^{D}(X;(Z_{X},W_{X}))+I^{D}(Y;(Z_{Y},W_{Y}))\right). (35)

After the usual variational manipulations, this becomes (see also main text):

L_{\text{DVSIBp}}\approx\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z_{x}|x_{i})\|r(z_{x}))+\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z_{y}|y_{i})\|r(z_{y}))\\ +\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(w_{x}|x_{i})\|r(w_{x}))+\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(w_{y}|y_{i})\|r(w_{y}))\\ -\beta\left(\int dz_{x}dz_{y}\,p(z_{x},z_{y})\ln\frac{e^{T(z_{x},z_{y})}}{\mathcal{Z}_{\text{norm}}}+\frac{1}{N}\sum_{i=1}^{N}\int dz_{y}dw_{y}\,p(w_{y}|y_{i})p(z_{y}|y_{i})\ln(q(y_{i}|z_{y},w_{y}))\right.\\ \left.+\frac{1}{N}\sum_{i=1}^{N}\int dz_{x}dw_{x}\,p(w_{x}|x_{i})p(z_{x}|x_{i})\ln(q(x_{i}|z_{x},w_{x}))\right), (36)

where

\mathcal{Z}_{\rm norm}=\int dz_{x}dz_{y}\,p(z_{x})p(z_{y})e^{T(z_{x},z_{y})}. (37)
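The first decoder term in Eq. (36) involves a learned critic $T(z_x,z_y)$ and the normalization $\mathcal{Z}_{\rm norm}$ of Eq. (37). One common way to estimate such a term from samples is a Donsker-Varadhan-style bound in the spirit of MINE (Belghazi et al., 2018), sketched below; the paper's exact estimator may differ, and the critic module is an assumed name.

```python
# Sketch of a sample-based estimate of the critic term in Eqs. (36)-(37):
# E_p[T] - ln Z_norm, where joint samples come from paired (z_x, z_y) and Z_norm is
# estimated from shuffled pairs approximating the product of marginals.
import torch

def critic_mi_term(z_x, z_y, critic):
    """z_x, z_y: paired latent samples of shape (N, k); critic maps two batches to (N,) scores."""
    t_joint = critic(z_x, z_y)                      # T evaluated on joint samples
    perm = torch.randperm(z_y.shape[0])
    t_marginal = critic(z_x, z_y[perm])             # T evaluated on product-of-marginals samples
    n = torch.tensor(float(t_marginal.shape[0]))
    log_z_norm = torch.logsumexp(t_marginal, dim=0) - torch.log(n)
    return t_joint.mean() - log_z_norm              # lower-bound estimate of I(Z_X; Z_Y)
```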

Appendix B Multi-variable Losses (More than 2 Views / Variables)

It is possible to rederive several multi-variable losses that have appeared in the literature within our framework.

B.1 Multi-view Total Correlation Auto-encoder

Figure 14: Encoder and decoder graphs for a multi-view auto-encoder.

Here we demonstrate several graphs for multi-variable losses. This first example consists of a structure where all the views $X_1$, $X_2$, and $X_3$ are compressed into the same latent variable $Z$. The corresponding decoder produces reconstructed views from the same latent variable $Z$. This is known in the literature as a multi-view auto-encoder.

L_{\rm MVAE}=\tilde{I}^{E}((X_{1},X_{2},X_{3});Z)-\beta\left(\tilde{I}^{D}(X_{1};Z)+\tilde{I}^{D}(X_{2};Z)+\tilde{I}^{D}(X_{3};Z)\right). (38)

Using the same library of terms as before, we find:

L_{\text{MVAE}}\approx\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z|{x_{1}}_{i},{x_{2}}_{i},{x_{3}}_{i})\|r(z))\\ -\beta\left(\frac{1}{N}\sum_{i=1}^{N}\int dz\,p(z|{x_{1}}_{i},{x_{2}}_{i},{x_{3}}_{i})\ln(q({x_{1}}_{i}|z))+\frac{1}{N}\sum_{i=1}^{N}\int dz\,p(z|{x_{1}}_{i},{x_{2}}_{i},{x_{3}}_{i})\ln(q({x_{2}}_{i}|z))\right.\\ \left.+\frac{1}{N}\sum_{i=1}^{N}\int dz\,p(z|{x_{1}}_{i},{x_{2}}_{i},{x_{3}}_{i})\ln(q({x_{3}}_{i}|z))\right). (39)
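A minimal sketch of Eq. (39), with a single joint encoder over the concatenated views and one unit-variance Gaussian decoder per view; the module names are illustrative.

```python
# Sketch of the multi-view auto-encoder loss of Eq. (39) for three views.
import torch

def mvae_loss(views, enc_joint, decoders, beta=1.0):
    """views: list of three (N, d_i) tensors; enc_joint returns (mu, log_var);
    decoders: list of three mean networks, one per view."""
    mu_z, log_var_z = enc_joint(torch.cat(views, dim=1))
    kl = 0.5 * (log_var_z.exp() + mu_z.pow(2) - 1.0 - log_var_z).sum(dim=1).mean()
    z = mu_z + torch.exp(0.5 * log_var_z) * torch.randn_like(mu_z)
    recon = sum(-0.5 * (v - dec(z)).pow(2).sum(dim=1).mean()
                for v, dec in zip(views, decoders))
    return kl - beta * recon
```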

B.2 Deep Variational Multimodal Information Bottlenecks

Figure 15: Encoder and decoder graphs for Multimodal Information Bottleneck.

This example consists of a structure where all the views $X_1$, $X_2$, and $X_3$ are compressed into separate latent variables $Z_1$, $Z_2$, and $Z_3$, and into one global shared latent variable $Z$. This structure is analogous to DVCCA-private, but it extends to three variables rather than two. It appears in the literature with slightly different variations. In the decoder graph, $X_1$ is reconstructed from both $Z$ and $Z_1$, $X_2$ is reconstructed from both $Z$ and $Z_2$, and $X_3$ is reconstructed from both $Z$ and $Z_3$.

L_{\rm DVAE}=\tilde{I}^{E}((X_{1},X_{2},X_{3});Z)+\tilde{I}^{E}(X_{1};Z_{1})+\tilde{I}^{E}(X_{2};Z_{2})+\tilde{I}^{E}(X_{3};Z_{3})\\ -\beta\left(\tilde{I}^{D}(X_{1};(Z,Z_{1}))+\tilde{I}^{D}(X_{2};(Z,Z_{2}))+\tilde{I}^{D}(X_{3};(Z,Z_{3}))\right). (40)

Using the same library of terms as before, we find:

L_{\text{DVAE}}\approx\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z|{x_{1}}_{i},{x_{2}}_{i},{x_{3}}_{i})\|r(z))+\sum_{j=1}^{3}\frac{1}{N}\sum_{i=1}^{N}D_{\rm KL}(p(z_{j}|{x_{j}}_{i})\|r_{j}(z_{j}))\\ -\beta\left(\frac{1}{N}\sum_{i=1}^{N}\int dz\,dz_{1}dz_{2}dz_{3}\,p(z|{x_{1}}_{i},{x_{2}}_{i},{x_{3}}_{i})p(z_{1}|{x_{1}}_{i})p(z_{2}|{x_{2}}_{i})p(z_{3}|{x_{3}}_{i})\ln(q({x_{1}}_{i}|z,z_{1}))\right.\\ \left.+\frac{1}{N}\sum_{i=1}^{N}\int dz\,dz_{1}dz_{2}dz_{3}\,p(z|{x_{1}}_{i},{x_{2}}_{i},{x_{3}}_{i})p(z_{1}|{x_{1}}_{i})p(z_{2}|{x_{2}}_{i})p(z_{3}|{x_{3}}_{i})\ln(q({x_{2}}_{i}|z,z_{2}))\right.\\ \left.+\frac{1}{N}\sum_{i=1}^{N}\int dz\,dz_{1}dz_{2}dz_{3}\,p(z|{x_{1}}_{i},{x_{2}}_{i},{x_{3}}_{i})p(z_{1}|{x_{1}}_{i})p(z_{2}|{x_{2}}_{i})p(z_{3}|{x_{3}}_{i})\ln(q({x_{3}}_{i}|z,z_{3}))\right). (41)
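A sketch of Eq. (41) adds one private encoder and latent per view on top of the global latent; again, all encoder and decoder modules are assumed names.

```python
# Sketch of the multimodal information-bottleneck loss of Eq. (41): a global latent z
# from all views plus a private latent z_j per view; view j is reconstructed from (z, z_j).
import torch

def multimodal_ib_loss(views, enc_joint, enc_private, decoders, beta=1.0):
    mu, lv = enc_joint(torch.cat(views, dim=1))
    kl = 0.5 * (lv.exp() + mu.pow(2) - 1.0 - lv).sum(dim=1).mean()
    z = mu + torch.exp(0.5 * lv) * torch.randn_like(mu)
    recon = 0.0
    for v, enc_j, dec_j in zip(views, enc_private, decoders):
        mu_j, lv_j = enc_j(v)
        kl = kl + 0.5 * (lv_j.exp() + mu_j.pow(2) - 1.0 - lv_j).sum(dim=1).mean()
        z_j = mu_j + torch.exp(0.5 * lv_j) * torch.randn_like(mu_j)
        recon = recon - 0.5 * (v - dec_j(torch.cat([z, z_j], dim=1))).pow(2).sum(dim=1).mean()
    return kl - beta * recon
```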

B.3 Discussion

There exist many other structures that have been explored in the multi-view representation learning literature, including conditional VIB (Shi et al., 2019; Hwang et al., 2021), which is formulated in terms of conditional information. These types of structures are beyond the current scope of our framework. However, they could be represented by an encoder mapping from all independent views $X_\nu$ to $Z$, subtracted from another encoder mapping from the joint view $\vec{X}$ to $Z$. Coupled with this would be a decoder mapping from $Z$ to the independent views $X_\nu$ (or the joint view $\vec{X}$, analogous to the Joint-DVCCA). Similarly, one can use our framework to represent other multi-view approaches, or their approximations (Lee and Van der Schaar, 2021; Wan et al., 2021; Hwang et al., 2021). This underscores the breadth of methods seeking to address specific questions by exploring known or assumed statistical dependencies within data, and also the generality of our approach, which can re-derive these methods.

Appendix C Multi-view Information Bottleneck

The multi-view information bottleneck (MVIB) (Federici et al., 2020b) attempts to remove redundant information between views $(v_1, v_2)$. This is achieved with the following losses:

L_{1}=I(z_{1};v_{1}|v_{2})-\lambda_{1}I(v_{2};z_{1}), (42)
L_{2}=I(z_{2};v_{2}|v_{1})-\lambda_{1}I(v_{1};z_{2}). (43)

These losses are equivalent to two deep variational information bottlenecks performed in parallel. Within our framework, the same algorithm emerges with an encoder graph that compresses $v_1$ into $z_1$ and $v_2$ into $z_2$, while the decoder graph reconstructs $v_2$ from $z_1$ and $v_1$ from $z_2$.

Federici et al. (2020b) combine these two losses while enforcing the condition that $z_1$ and $z_2$ are the same. They bound the combined loss function to obtain:

L_{\rm MVIB}=D_{\rm SKL}(P(z_{1}|v_{1})\|P(z_{2}|v_{2}))-\beta I(z_{1};z_{2}), (44)

with $z_1$ and $z_2$ being the same latent space in this approximation. Here $D_{\rm SKL}$ is the symmetrized KL divergence, the $v_i$ correspond to the two different views, and the $z_i$ correspond to their two latent, compressed representations. (Here we changed the parameter $\beta$ to be in front of $I(z_1;z_2)$, to be consistent with the definition of $\beta$ we use elsewhere in this work.) While this loss looks similar to the DVSIB loss, it is conceptually different. It attempts to produce latent variables that are as similar to one another as possible (ideally, $z_1=z_2$). In contrast, DVSIB attempts to produce different latent variables that could, in theory, have different units, dimensionalities, and domains, while still being as informative about each other as possible. For example, in the noisy MNIST dataset, $Z_X$ contains information about the labels, the angles, and the scale of the images (all needed for reconstructing $X$) and no information about the noise structure. At the same time, $Z_Y$ contains information about the labels and the noise factor only (both needed to reconstruct $Y$). See Appendix D.4 for 2-d latent spaces colored by these variables, illustrating the difference between $Z_X$ and $Z_Y$ in DVSIB. Further, in practice, the implementation of MVIB uses the same encoder for both views of the data; this is equivalent to encoding different views with the same function and then forcing the outputs to be as close to each other as possible, in contrast to DVSIB.
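For diagonal Gaussian posteriors, the symmetrized KL term in Eq. (44) has a closed form; the short sketch below assumes each encoder outputs a mean and log-variance, with illustrative names.

```python
# Sketch of the MVIB compression term of Eq. (44): the symmetrized KL divergence between
# two diagonal Gaussian posteriors p(z_1|v_1) = N(mu1, e^{lv1}) and p(z_2|v_2) = N(mu2, e^{lv2}).
import torch

def gaussian_kl(mu_p, lv_p, mu_q, lv_q):
    # KL(N(mu_p, diag e^{lv_p}) || N(mu_q, diag e^{lv_q})), averaged over the batch.
    kl = 0.5 * ((lv_p - lv_q).exp() + (mu_q - mu_p).pow(2) / lv_q.exp() - 1.0 + lv_q - lv_p)
    return kl.sum(dim=1).mean()

def symmetrized_kl(mu1, lv1, mu2, lv2):
    return gaussian_kl(mu1, lv1, mu2, lv2) + gaussian_kl(mu2, lv2, mu1, lv1)
```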

We evaluate MVIB on the noisy MNIST dataset and include it in Table 2. The performance is similar to that of DVSIB, but slightly worse.

Moreover, MVIB appears to be highly sensitive to parameters and training conditions. Despite employing initial conditions and parameters identical to those used for training the other methods, the approach often collapsed during training, resulting in infinite losses. Interestingly, in instances where training persisted for a limited set of parameters (usually low $k_Z$ and high $\beta$), MVIB generated good latent spaces, as evidenced by their relatively high classification accuracy.

Appendix D Additional MNIST Results

In this section, we present supplementary results derived from the methods in Table 1.

D.1 Additional results tables for the best parameters

We report classification accuracy using a linear SVM on the $X$ dataset (Table 3), and using feed-forward neural networks on the $Y$ and joint $[X,Y]$ datasets (Table 4) and on the $X$ dataset (Table 5).
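The linear-SVM evaluation could be run, for example, with scikit-learn as sketched below; the variable names are placeholders, and the regularization strength C is swept as reflected in the "$C$ best" column of Table 3.

```python
# Sketch of the linear-SVM evaluation of a learned latent space (cf. Table 3).
from sklearn.svm import LinearSVC

def svm_accuracy(z_train, labels_train, z_test, labels_test, C=1.0):
    clf = LinearSVC(C=C, max_iter=10000)       # linear SVM with regularization strength C
    clf.fit(z_train, labels_train)
    return clf.score(z_test, labels_test)      # classification accuracy on held-out latents
```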

Table 3: Maximum accuracy from a linear SVM and the optimal $k_Z$ and $\beta$ for variational DR methods on the $X$ dataset. ( fixed values)
Method | Acc. % | $k_Z$ best | 95% $k_Z$ range | $\beta$ best | 95% $\beta$ range | $C$ best
Baseline | 57.8 | 784 | - | - | - | 0.01
PCA | 58.0 | 256 | [32, 265*] | - | - | 0.1
CCA | 54.4 | 256 | [8, 265*] | - | - | 0.032
$\beta$-VAE | 84.4 | 256 | [128, 265*] | 4 | [2, 8] | 10
DVIB | 87.3 | 128 | [4, 265*] | 512 | [8, 1024*] | 0.032
DVCCA | 86.1 | 256 | [64, 265*] | 1 | - | 31.623
$\beta$-DVCCA | 88.9 | 256 | [128, 265*] | 4 | [1, 128] | 10
DVCCA-private | 85.3 | 128 | [32, 265*] | 1 | - | 31.623
$\beta$-DVCCA-private | 85.3 | 128 | [32, 265*] | 1 | [1, 8] | 31.623
MVIB | 93.8 | 8 | [8, 16] | 128 | [128, 1024*] | 0.01
DVSIB | 92.9 | 256 | [64, 265*] | 256 | [4, 1024*] | 1
DVSIB-private | 92.6 | 256 | [32, 265*] | 128 | [8, 1024*] | 3.162
Table 4: Maximum accuracy from a feed-forward neural network and the optimal $k_Z$ and $\beta$ for variational DR methods on the $Y$ and the joint $[X,Y]$ datasets. ( fixed values)
Method | Acc. % | $k_Z$ best | 95% $k_Z$ range | $\beta$ best | 95% $\beta$ range
Baseline | 92.8 | 784 | - | - | -
PCA | 97.6 | 128 | [16, 256*] | - | -
CCA | 90.2 | 256 | [32, 256*] | - | -
$\beta$-VAE | 98.4 | 64 | [8, 256*] | 64 | [2, 1024*]
DVIB | 90.4 | 128 | [8, 256*] | 1024 | [8, 1024*]
DVCCA | 91.3 | 16 | [4, 256*] | 1 | -
$\beta$-DVCCA | 97.5 | 128 | [8, 256*] | 512 | [2, 1024*]
DVCCA-private | 93.8 | 16 | [2, 256*] | 1 | -
$\beta$-DVCCA-private | 97.5 | 256 | [2, 256*] | 32 | [1, 1024*]
MVIB | 97.5 | 16 | [8, 16] | 256 | [128, 1024*]
DVSIB | 98.3 | 256 | [4, 256*] | 32 | [2, 1024*]
DVSIB-private | 98.3 | 256 | [4, 256*] | 32 | [2, 1024*]
Baseline-joint | 97.7 | 1568 | - | - | -
joint-DVCCA | 93.7 | 256 | [8, 256*] | 1 | -
$\beta$-joint-DVCCA | 98.9 | 64 | [8, 256*] | 512 | [2, 1024*]
joint-DVCCA-private | 93.5 | 16 | [4, 256*] | 1 | -
$\beta$-joint-DVCCA-private | 95.6 | 32 | [4, 256*] | 512 | [1, 1024*]
Table 5: Maximum accuracy from a neural network and the optimal $k_Z$ and $\beta$ for variational DR methods on the $X$ dataset. ( fixed values)
Method | Acc. % | $k_Z$ best | 95% $k_Z$ range | $\beta$ best | 95% $\beta$ range
Baseline | 92.8 | 784 | - | - | -
PCA | 91.9 | 64 | [32, 256*] | - | -
CCA | 72.6 | 256 | [256, 256*] | - | -
$\beta$-VAE | 93.3 | 256 | [16, 256*] | 256 | [2, 1024*]
DVIB | 87.5 | 4 | [2, 256*] | 1024 | [4, 1024*]
DVCCA | 87.5 | 128 | [8, 256*] | 1 | -
$\beta$-DVCCA | 92.2 | 64 | [8, 256*] | 32 | [2, 1024*]
DVCCA-private | 88.2 | 8 | [8, 256*] | 1 | -
$\beta$-DVCCA-private | 90.7 | 256 | [4, 256*] | 8 | [1, 1024*]
MVIB | 93.6 | 8 | [8, 16] | 256 | [128, 1024*]
DVSIB | 93.9 | 128 | [8, 256*] | 16 | [2, 1024*]
DVSIB-private | 92.8 | 32 | [8, 256*] | 256 | [4, 1024*]

D.2 t-SNE Embeddings at best parameters

Figures 16 and 17 display 2d t-SNE embeddings of the variables $Z_X$ and $Z_Y$ generated by the various DR methods considered.
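Such embeddings can be produced, for instance, with scikit-learn's TSNE as in the brief sketch below; the perplexity value is a typical choice, not necessarily the one used for the figures.

```python
# Sketch of computing a 2-d t-SNE embedding of a learned latent space (cf. Figs. 16-17).
from sklearn.manifold import TSNE

def tsne_embed(z_latent, perplexity=30):
    return TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(z_latent)
```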

Figure 16: t-SNE embeddings of $Z_X$ for the considered DR methods.
Figure 17: t-SNE embeddings of $Z_Y$ for the considered DR methods.

D.3 DVSIB-private reconstructions for best parameters

Figure 18 shows the t-SNE embeddings of the private latent variables constructed by DVSIB-private, colored by the digit label. To the extent that the labels do not cluster, private latent variables do not preserve the label information shared between $X$ and $Y$.

Figure 18: Private embeddings of DVSIB-private colored by labels, rotations, scales, and noise factors for $X$ (top) and $Y$ (middle). Reconstructions of the digits using both shared and private information (bottom) show that the private information makes it possible to produce different backgrounds, scalings, and rotations.

D.4 Additional results at 2 latent dimensions

We now demonstrate how different DR methods behave when the compressed variables are restricted to at most 2 dimensions, cf. Figs. 19 and 20.

Figure 19: Clustering of embeddings when restricting $k_{Z_X}$ to two for DVSIB, DVSIB-private, and $\beta$-DVCCA; results on the $X$ dataset.
Figure 20: Clustering of embeddings when restricting $k_{Z_Y}$ to two for DVSIB, DVSIB-private, and $\beta$-DVCCA; results on the $Y$ dataset.

D.5 DVSIB-private reconstructions at 2 Latent Dimensions

Figure 21 shows the embeddings of the private latent variables constructed by DVSIB-private, colored by the digit labels, rotations, scales, and noise factors for $X$ (top) and $Y$ (bottom). At 2 latent dimensions, the private latent variables preserve only a little of the label information shared between $X$ and $Y$, but clearly preserve the scale information for $X$.

Figure 21: Private embeddings of DVSIB-private colored by labels, rotations, scales, and noise factors for $X$ (top) and $Y$ (bottom).

D.6 Testing Training Efficiency

We tested an SVM's classification accuracy for distinguishing digits based on latent subspaces created by DVSIB, $\beta$-VAE, CCA, and PCA trained using different numbers of samples. Figure 5 in the main text shows the results for 60 epochs of training with latent spaces of dimension $k_{Z_X}=k_{Z_Y}=64$. DVSIB and $\beta$-VAE were trained with $\beta=1024$. Figure 22 shows the SVM's classification accuracy for a range of latent dimensions (from right to left): $k_{Z_X}=k_{Z_Y}=2, 16, 64, 256$. Additionally, it shows the results for different amounts of training time for the encoders, ranging from 20 epochs (top row) to 100 epochs (bottom row). As explained in the main text, we plot a log-log graph of $100-A$ versus $1/n$. Plotted in this way, high accuracy appears at the bottom, and large sample sizes are at the left of the plots. DVSIB, $\beta$-VAE, and CCA often appear linear when plotted this way, implying that they follow the form $A=100-c/n^{m}$. Steeper slopes $m$ on these plots correspond to a faster increase in accuracy with the sample size. This parameter sweep shows that the tested methods have not had time to fully converge at low epoch numbers. Additionally, increasing the number of latent dimensions helps the SVMs untangle the non-linearities present in the data and improves the corresponding classifiers.
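The slopes reported here can be extracted with a simple least-squares fit on the log-log scale, as in the sketch below; the function name and the use of NumPy are illustrative.

```python
# Sketch of fitting the scaling form A = 100 - c / n^m: on a log-log plot of (100 - A)
# versus 1/n this is a straight line with slope m and intercept ln c.
import numpy as np

def fit_accuracy_scaling(n_samples, accuracy):
    """n_samples, accuracy: 1-d arrays of sample sizes and accuracies (in %); returns (m, c)."""
    x = np.log(1.0 / np.asarray(n_samples, dtype=float))
    y = np.log(100.0 - np.asarray(accuracy, dtype=float))
    m, log_c = np.polyfit(x, y, 1)   # y = m * x + ln c
    return m, np.exp(log_c)
```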

Figure 22: Log-log plot of $100-A$ vs $1/n$. DVSIB has a steeper slope than $\beta$-VAE, corresponding to faster convergence with fewer samples for DVSIB. Plots vary $k_Z=2, 16, 64, 256$ and training epochs $20, 40, 60, 80, 100$.

References

  • Abdelaleem et al. (2023) Eslam Abdelaleem, Ahmed Roman, K Michael Martini, and Ilya Nemenman. Simultaneous dimensionality reduction: A data efficient approach for multimodal representations learning. arXiv preprint arXiv:2310.04458, 2023.
  • Alemi et al. (2017) Alex Alemi, Ian Fischer, Josh Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017.
  • Andrew et al. (2013) Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1247–1255, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
  • Bao (2021) Feng Bao. Disentangled variational information bottleneck for multiview representation learning. In Artificial Intelligence: First CAAI International Conference, CICAI 2021, Hangzhou, China, June 5–6, 2021, Proceedings, Part II 1, pages 91–102. Springer, 2021.
  • Belghazi et al. (2018) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In International conference on machine learning, pages 531–540. PMLR, 2018.
  • Benton et al. (2017) Adrian Benton, Huda Khayrallah, Biman Gujral, Dee Ann Reisinger, Sheng Zhang, and Raman Arora. Deep generalized canonical correlation analysis. arXiv preprint arXiv:1702.02519, 2017.
  • Chandar et al. (2016) Sarath Chandar, Mitesh M. Khapra, Hugo Larochelle, and Balaraman Ravindran. Correlational neural networks. Neural Computation, 28(2):257–285, 2016. doi: 10.1162/NECO_a_00801.
  • Chechik et al. (2003) Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for gaussian variables. Advances in Neural Information Processing Systems, 16, 2003.
  • Clark et al. (2013) Kenneth Clark, Bruce Vendt, Kirk Smith, John Freymann, Justin Kirby, Paul Koppel, Stephen Moore, Stanley Phillips, David Maffitt, Michael Pringle, et al. The cancer imaging archive (tcia): maintaining and operating a public information repository. Journal of digital imaging, 26:1045–1057, 2013.
  • Federici et al. (2020a) Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, and Zeynep Akata. Learning robust representations via multi-view information bottleneck. In 8th International Conference on Learning Representations. OpenReview. net, 2020a.
  • Federici et al. (2020b) Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, and Zeynep Akata. Learning robust representations via multi-view information bottleneck. arXiv preprint arXiv:2002.07017, 2020b.
  • Friedman et al. (2013) Nir Friedman, Ori Mosenzon, Noam Slonim, and Naftali Tishby. Multivariate information bottleneck. arXiv preprint arXiv:1301.2270, 2013.
  • Goyal et al. (2017) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  • Higgins et al. (2016) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International conference on learning representations, 2016.
  • Hinton and Roweis (2002) Geoffrey E Hinton and Sam Roweis. Stochastic neighbor embedding. Advances in neural information processing systems, 15, 2002.
  • Hinton and Salakhutdinov (2006) Geoffrey E Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. doi: 10.1126/science.1127647.
  • Hotelling (1933) Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.
  • Hotelling (1936) Harold Hotelling. Relations between two sets of variates. Biometrika, 1936. doi: 10.1007/978-1-4612-4380-9_14.
  • Hu et al. (2020) Shizhe Hu, Zenglin Shi, and Yangdong Ye. Dmib: Dual-correlated multivariate information bottleneck for multiview clustering. IEEE Transactions on Cybernetics, 52(6):4260–4274, 2020.
  • Huang et al. (2022) Teng-Hui Huang, Aly El Gamal, and Hesham El Gamal. On the multi-view information bottleneck representation. In 2022 IEEE Information Theory Workshop (ITW), pages 37–42. IEEE, 2022.
  • Huntley et al. (2015) Rachael P Huntley, Tony Sawford, Prudence Mutowo-Meullenet, Aleksandra Shypitsyna, Carlos Bonilla, Maria J Martin, and Claire O’Donovan. The goa database: gene ontology annotation updates for 2015. Nucleic acids research, 43(D1):D1057–D1063, 2015.
  • Hwang et al. (2021) HyeongJoo Hwang, Geon-Hyeong Kim, Seunghoon Hong, and Kee-Eung Kim. Multi-view representation learning via total correlation objective. Advances in Neural Information Processing Systems, 34:12194–12207, 2021.
  • Karami and Schuurmans (2021) Mahdi Karami and Dale Schuurmans. Deep probabilistic canonical correlation analysis. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):8055–8063, May 2021. doi: 10.1609/aaai.v35i9.16982.
  • Kingma and Welling (2014) Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
  • Krakauer et al. (2017) John W Krakauer, Asif A Ghazanfar, Alex Gomez-Marin, Malcolm A MacIver, and David Poeppel. Neuroscience needs behavior: correcting a reductionist bias. Neuron, 93(3):480–490, 2017.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lee and Van der Schaar (2021) Changhee Lee and Mihaela Van der Schaar. A variational information bottleneck approach to multi-omics data integration. In International Conference on Artificial Intelligence and Statistics, pages 1513–1521. PMLR, 2021.
  • Lorenzi et al. (2018) Marco Lorenzi, Andre Altmann, Boris Gutman, Selina Wray, Charles Arber, Derrek P Hibar, Neda Jahanshad, Jonathan M Schott, Daniel C Alexander, Paul M Thompson, et al. Susceptibility of brain atrophy to trib3 in alzheimer’s disease, evidence from functional prioritization in imaging genetics. Proceedings of the National Academy of Sciences, 115(12):3162–3167, 2018.
  • Martini and Nemenman (2023) K Michael Martini and Ilya Nemenman. Data efficiency, dimensionality reduction, and the generalized symmetric information bottleneck. arXiv preprint arXiv:2309.05649, 2023.
  • Pang et al. (2016) Rich Pang, Benjamin J Lansdell, and Adrienne L Fairhall. Dimensionality reduction in neuroscience. Current Biology, 26(14):R656–R660, 2016.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Perlin (1985) Ken Perlin. An image synthesizer. ACM Siggraph Computer Graphics, 19(3):287–296, 1985.
  • Poole et al. (2019) Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5171–5180. PMLR, 09–15 Jun 2019.
  • Qiu et al. (2022) Lin Qiu, Vernon M Chinchilli, and Lin Lin. Variational interpretable deep canonical correlation analysis. In ICLR2022 Machine Learning for Drug Discovery, 2022.
  • Rohrbach et al. (2017) Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Chris Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. Movie description. International Journal of Computer Vision, 2017.
  • Sanabria et al. (2018) Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. How2: a large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347, 2018.
  • Shi et al. (2019) Yuge Shi, Brooks Paige, Philip Torr, et al. Variational mixture-of-experts autoencoders for multi-modal deep generative models. Advances in neural information processing systems, 32, 2019.
  • Steinmetz et al. (2021) Nicholas A Steinmetz, Cagatay Aydin, Anna Lebedeva, Michael Okun, Marius Pachitariu, Marius Bauza, Maxime Beau, Jai Bhagat, Claudia Böhm, Martijn Broux, et al. Neuropixels 2.0: A miniaturized high-density probe for stable, long-term brain recordings. Science, 372(6539):eabf4588, 2021.
  • Studenỳ and Vejnarová (1998) Milan Studenỳ and Jirina Vejnarová. The multiinformation function as a tool for measuring stochastic dependence. Learning in graphical models, pages 261–297, 1998.
  • Suzuki et al. (2016) Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891, 2016.
  • Svensson et al. (2018) Valentine Svensson, Roser Vento-Tormo, and Sarah A Teichmann. Exponential scaling of single-cell rna-seq in the past decade. Nature protocols, 13(4):599–604, 2018.
  • Tishby et al. (2000) Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
  • Urai et al. (2022) Anne E Urai, Brent Doiron, Andrew M Leifer, and Anne K Churchland. Large-scale neural recordings call for new insights to link brain and behavior. Nature neuroscience, 25(1):11–19, 2022.
  • Vinod (1976) Hrishikesh D Vinod. Canonical ridge and econometrics of joint production. Journal of Econometrics, 1976. doi: 10.1016/0304-4076(76)90010-5.
  • Wan et al. (2021) Zhibin Wan, Changqing Zhang, Pengfei Zhu, and Qinghua Hu. Multi-view information-bottleneck representation learning. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 10085–10092, 2021.
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
  • Wang et al. (2019) Qi Wang, Claire Boudreau, Qixing Luo, Pang-Ning Tan, and Jiayu Zhou. Deep multi-view information bottleneck. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 37–45. SIAM, 2019.
  • Wang et al. (2015) Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view representation learning. In International conference on machine learning, pages 1083–1092. PMLR, 2015.
  • Wang et al. (2016) Weiran Wang, Xinchen Yan, Honglak Lee, and Karen Livescu. Deep variational canonical correlation analysis. arXiv preprint arXiv:1610.03454, 2016.
  • Wold et al. (2001) Svante Wold, Michael Sjöström, and Lennart Eriksson. Pls-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58(2):109–130, 2001. ISSN 0169-7439. doi: https://doi.org/10.1016/S0169-7439(01)00155-1.
  • Wong et al. (2021) Hok Shing Wong, Li Wang, Raymond Chan, and Tieyong Zeng. Deep tensor cca for multi-view learning. IEEE Transactions on Big Data, 8(6):1664–1677, 2021.
  • Xu et al. (2016) Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016.
  • Zbontar et al. (2021) Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pages 12310–12320. PMLR, 2021.
  • Zheng et al. (2017) Grace XY Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, Tobias D Wheeler, Geoff P McDermott, Junjie Zhu, et al. Massively parallel digital transcriptional profiling of single cells. Nature communications, 8(1):14049, 2017.
  • Zhou et al. (2018) Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Årup Nielsen et al. (1998) Finn Årup Nielsen, Lars Kai Hansen, and Stephen C Strother. Canonical ridge analysis with ridge parameter optimization. NeuroImage, 1998. doi: 10.1016/s1053-8119(18)31591-x.