Supplementary Material of
“Explicit Mutual Information Maximization for
Self-Supervised Learning”
-A Proof of Theorem 1
We first recall the result on the invariance property of mutual information. Specifically, if $f$ and $g$ are homeomorphisms, then $I(X;Y)=I\big(f(X);g(Y)\big)$ [kraskov2004estimating]. Denote $Z=[X^{\top},Y^{\top}]^{\top}$. If $X$ and $Y$ are Gaussian distributed, i.e., $X\sim\mathcal{N}(\mu_X,\Sigma_X)$ and $Y\sim\mathcal{N}(\mu_Y,\Sigma_Y)$, then $Z\sim\mathcal{N}(\mu,\Sigma)$ with

$$\Sigma=\begin{bmatrix}\Sigma_X & \Sigma_{XY}\\ \Sigma_{XY}^{\top} & \Sigma_Y\end{bmatrix}.$$

Then, the mutual information is given by

$$I(X;Y)=H(X)+H(Y)-H(X,Y)=\frac{1}{2}\log\frac{\det(\Sigma_X)\,\det(\Sigma_Y)}{\det(\Sigma)},$$

where $H(X)$ and $H(Y)$ are the marginal entropies of $X$ and $Y$, respectively, and $H(X,Y)$ is the joint entropy of $X$ and $Y$. This, together with the invariance property of mutual information under the homeomorphism condition, yields Theorem 1.
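As a concrete illustration of computing MI from second-order statistics only, the following is a minimal NumPy sketch of the Gaussian log-determinant formula recalled above; the function name, the regularizer `eps`, and the toy data are illustrative choices, not the paper's implementation.

```python
import numpy as np

def gaussian_mi(x, y, eps=1e-6):
    """MI estimate I(X;Y) = 0.5 * log(det(Sx) * det(Sy) / det(S)),
    computed from sample covariances (second-order statistics only)."""
    n, dx = x.shape
    xy = np.concatenate([x, y], axis=1)
    S = np.cov(xy, rowvar=False) + eps * np.eye(xy.shape[1])  # joint covariance
    Sx, Sy = S[:dx, :dx], S[dx:, dx:]                          # marginal blocks
    # slogdet is used instead of det() for numerical stability
    return 0.5 * (np.linalg.slogdet(Sx)[1]
                  + np.linalg.slogdet(Sy)[1]
                  - np.linalg.slogdet(S)[1])

# toy check: two correlated Gaussian vectors sharing a common source z
rng = np.random.default_rng(0)
z = rng.standard_normal((10_000, 4))
x = z + 0.1 * rng.standard_normal((10_000, 4))
y = z + 0.1 * rng.standard_normal((10_000, 4))
print(gaussian_mi(x, y))  # large positive value, since x and y share z
```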
-B Proof of Theorem 2
Theorem 1 implies that we can compute the MI based only on second-order statistics even if the distributions of the two variables are not Gaussian. We investigate the MI under the generalized Gaussian distribution (GGD) as defined in (3) of the main paper. The GGD offers a flexible parametric form that can adapt to a wide range of distributions by varying the shape parameter in (3), from super-Gaussian (heavier-tailed than Gaussian) for small shape values to sub-Gaussian (lighter-tailed) for large ones, including the Laplacian and Gaussian distributions as special cases. Figure 1 illustrates the univariate GGD for different values of the shape parameter.
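As a quick numerical illustration of this flexibility, the sketch below evaluates the univariate GGD density with SciPy's `gennorm`; note that `gennorm` uses the convention of a density proportional to $\exp(-|x|^{\beta})$, which may differ from the parameterization in (3) of the main paper, and the shape values chosen here are arbitrary.

```python
import numpy as np
from scipy.stats import gennorm

x = np.linspace(-4, 4, 9)
for beta in (0.5, 1.0, 2.0, 8.0):
    # In SciPy's convention: beta = 2 recovers the Gaussian, beta = 1 the
    # Laplacian, and beta -> infinity approaches a uniform density.
    print(f"beta={beta}:", np.round(gennorm.pdf(x, beta), 3))
```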

Let the two variables be jointly distributed according to the GGD in (3) of the main paper, with the corresponding mean vector and dispersion matrix. The MI between them is given by
(12)
Then, it follows that
where the expectation over the parameter space of such a function is essentially a type-1 Dirichlet integral, which can be converted into an integral over a single positive variable [verdoolaegeGeometryMultivariateGeneralized2012]. Specifically,
where (a) is due to the fact that the density function of the positive variable is given by [verdoolaegeGeometryMultivariateGeneralized2012]
Therefore,
Similarly,
Substituting these expectations into (7) in the main paper leads to
(13)
where we used the following relation
Then, using the relation between the dispersion matrix and covariance matrix
it follows that
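Since the exact constant relating the dispersion and covariance matrices depends on the normalization convention used in (3) of the main paper (not reproduced here), the following sketch only illustrates the idea numerically: it samples a multivariate GGD with density proportional to $\exp(-u^{\beta}/2)$, where $u$ is the Mahalanobis-type quadratic form, via its radial-spherical representation, and checks that the sample covariance is a scalar multiple of the dispersion matrix; the constant used is derived under that assumed convention.

```python
import numpy as np
from scipy.special import gammaln

# Assumed convention: density proportional to exp(-u**beta / 2),
# with u = (x - mu)^T Sigma^{-1} (x - mu) and Sigma the dispersion matrix.
rng = np.random.default_rng(0)
d, beta, n = 4, 0.8, 500_000

A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)      # an SPD dispersion matrix
L = np.linalg.cholesky(Sigma)

# Radial-spherical sampling: under this convention u**beta / 2 ~ Gamma(d/(2*beta)),
# and the direction is uniform on the unit sphere.
s = rng.standard_normal((n, d))
s /= np.linalg.norm(s, axis=1, keepdims=True)
t = rng.gamma(shape=d / (2 * beta), scale=1.0, size=n)
r = (2.0 * t) ** (1.0 / (2.0 * beta))           # radius, r**2 = u
x = (r[:, None] * s) @ L.T

# Under the assumed convention, Cov(X) = c * Sigma with
# c = 2**(1/beta) * Gamma((d+2)/(2*beta)) / (d * Gamma(d/(2*beta))).
c = np.exp(np.log(2.0) / beta
           + gammaln((d + 2) / (2 * beta))
           - gammaln(d / (2 * beta))) / d
emp = np.cov(x, rowvar=False)
rel_err = np.linalg.norm(emp - c * Sigma) / np.linalg.norm(c * Sigma)
print(f"relative error between Cov(X) and c * Sigma: {rel_err:.3f}")  # small (Monte Carlo error only)
```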
-C Ablation Study
-C1 Loss Function
We investigate the effectiveness of each term of the loss function. Specifically, we remove each of the two regularization terms individually, or both of them (w/o both), from the loss function. Additionally, we replace the alignment term with an alternative consistency term, since both aim to align the two representations. Furthermore, we simplify the rescaling operation by using a fixed value instead of the tracked one. Results in Table I show that removing either term alone leads to a performance decrease, yet training still succeeds. However, removing both leads to training failure.
Method | CIFAR-100 | ImageNet-100 |
---|---|---|
Original | 70.5 | 81.1 |
w/o | 66.6 | 76.5 |
w/o | 67.7 | 78.8 |
w/o both | 3.55 | 4.01 |
Using | 69.8 | 79.6 |
w/o in rescaling | 70.6 | 80.3 |
The reason behind this is straightforward. By minimizing the alignment term, we aim to align the two representations, while the other two terms play a crucial role in ensuring that these representations are informative enough to prevent representation collapse. When one of these two terms is removed, the remaining one is expected to partially fulfill this role, but becomes less effective.
It can be seen from Table I that replacing the alignment term with the alternative decreases the performance. Although both terms encourage consistency between the two representations, their mathematical properties differ significantly: our term encourages a holistic consistency in the structural properties of the feature spaces. Meanwhile, our derived loss function obviates the need to tune a balance ratio between the terms. Moreover, the results show that simplifying the rescaling in the Taylor approximation, i.e., adding a fixed value to the three terms in the loss function, does not affect the performance on CIFAR-100 but decreases it on ImageNet-100.
-C2 Projector Hidden Dimension and Projector Output Dimension
We evaluate the effect of the projector's hidden dimension and output dimension in Table II. For both our method and Barlow Twins, increasing the projector hidden dimension generally improves performance on both CIFAR-100 and ImageNet-100. Compared with CIFAR-100, both methods require a larger hidden dimension to achieve high performance on the more complex ImageNet-100 dataset. With a momentum encoder, our method achieves better performance and becomes more robust to the projector hidden dimension. Similarly, both our method and Barlow Twins exhibit a trend in which increasing the projector output dimension generally improves performance. Barlow Twins is particularly sensitive to the projector output dimension, whereas our method performs well even with a projector output dimension as small as 256.
Proj. hidden dim | Ours (CIFAR-100) | Ours-M (CIFAR-100) | Barlow Twins (CIFAR-100) | Ours (ImageNet-100) | Ours-M (ImageNet-100) | Barlow Twins (ImageNet-100)
---|---|---|---|---|---|---
2048 | 70.5 | 70.4 | 70.9 | 81.1 | 81.7 | 80.4
1024 | 70.8 | 70.4 | 70.2 | 80.2 | 81.5 | 79.3
512 | 69.1 | 70.1 | 69.6 | 79.5 | 81.4 | 78.3
256 | 67.9 | 70.0 | 68.0 | 78.7 | 80.6 | 76.9

Proj. output dim | Ours (CIFAR-100) | Ours-M (CIFAR-100) | Barlow Twins (CIFAR-100) | Ours (ImageNet-100) | Ours-M (ImageNet-100) | Barlow Twins (ImageNet-100)
---|---|---|---|---|---|---
2048 | 70.5 | 70.4 | 70.9 | 81.1 | 81.7 | 80.4
1024 | 70.3 | 70.4 | 69.7 | 80.6 | 81.1 | 79.6
512 | 70.6 | 70.5 | 66.5 | 80.4 | 81.7 | 77.4
256 | 70.5 | 71.1 | 62.1 | 80.3 | 81.2 | 73.6
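For reference, a projector head of the kind ablated in Table II (a multi-layer perceptron whose hidden and output widths are the two dimensions varied above) can be sketched as below; this is a generic Barlow Twins-style projector with an assumed 2048-dimensional backbone output, not necessarily the exact architecture used in the paper.

```python
import torch.nn as nn

def make_projector(in_dim=2048, hidden_dim=2048, out_dim=2048):
    """Generic 3-layer projector head (Linear-BN-ReLU twice, then Linear).
    hidden_dim and out_dim correspond to the dimensions ablated in Table II."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim, bias=False),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, hidden_dim, bias=False),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

# e.g., the smallest output dimension considered in Table II
projector = make_projector(in_dim=2048, hidden_dim=2048, out_dim=256)
```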
-D Experiment Implementation Details
For experiments on ImageNet-1K, we use a batch size of 1020 on 3 A100 GPUs for 100, 400, and 800 epochs. Training is conducted in 16-bit precision (FP16) with gradient accumulation over 4 batches to stabilize model updating and accelerate training. We use the LARS optimizer with a base learning rate of 0.8 for backbone pretraining and 0.2 for classifier training. The learning rate is scaled with the batch size. We use a weight decay of 1.5E-6 for the backbone parameters. The linear classifier is trained on top of the frozen backbone. We follow the default settings of the solo-learn benchmark [costaSololearnLibrarySelfsupervised2022] for the remaining training hyper-parameters.
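The FP16 training with gradient accumulation described above can be organized roughly as follows; this is a generic PyTorch automatic-mixed-precision sketch in which `model`, `loader`, `optimizer` (LARS itself would come from a library such as solo-learn), and `loss_fn` are placeholders, not the authors' actual training script.

```python
import torch

accum_steps = 4                       # accumulate gradients over 4 batches
scaler = torch.cuda.amp.GradScaler()  # FP16 loss scaling

def train_epoch(model, loader, optimizer, loss_fn, device="cuda"):
    optimizer.zero_grad(set_to_none=True)
    for step, (x1, x2) in enumerate(loader):
        with torch.cuda.amp.autocast():            # forward pass in FP16
            z1, z2 = model(x1.to(device)), model(x2.to(device))
            loss = loss_fn(z1, z2) / accum_steps   # normalize the accumulated loss
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:          # update every accum_steps batches
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
```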
Recall that in implementing the loss function (9) from the main paper, the three log-determinant terms are expanded as in (10), a Taylor expansion of the log function around 1 of which only an $n$-th order truncation is retained (fourth order in the experiments of this work). As shown in Figure 2, a fourth-order approximation of the log function in (10) is sufficiently accurate around the value of 1.
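As a numerical sanity check of why a low-order truncation suffices near 1, the sketch below compares the standard series truncation $\log\det(\mathbf{C})=\operatorname{tr}\log(\mathbf{C})\approx\sum_{k=1}^{n}\frac{(-1)^{k+1}}{k}\operatorname{tr}\big((\mathbf{C}-\mathbf{I})^{k}\big)$ with the exact log-determinant for a symmetric matrix whose eigenvalues are close to 1; this is the textbook expansion of the logarithm around 1, and both the test matrix and the exact form of (10) in the paper are assumptions here.

```python
import numpy as np

def logdet_taylor(C, order=4):
    """Truncated series log det(C) = tr log(C) ~ sum_k (-1)^(k+1)/k * tr((C-I)^k),
    valid when the eigenvalues of C are close to 1."""
    d = C.shape[0]
    D = C - np.eye(d)
    approx, Dk = 0.0, np.eye(d)
    for k in range(1, order + 1):
        Dk = Dk @ D                                  # D^k
        approx += (-1) ** (k + 1) / k * np.trace(Dk)
    return approx

# symmetric matrix with eigenvalues close to 1 (mimicking a rescaled covariance)
rng = np.random.default_rng(0)
d = 8
B = 0.1 * rng.standard_normal((d, d))
C = np.eye(d) + (B + B.T) / 2
exact = np.linalg.slogdet(C)[1]
for order in (1, 2, 4):
    print(f"order {order}: {logdet_taylor(C, order):.6f}  exact: {exact:.6f}")
```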

For our method with a momentum encoder, we follow the settings of [grillBootstrapYourOwn2020, chenExploringSimpleSiamese2020] and use a two-layer predictor with hidden dimension 1024 for all datasets. For ImageNet-100, we set the base learning rate to 0.2 for backbone pretraining and 0.3 for the classifier, with a weight decay of 0.0001 for the backbone parameters. For CIFAR-100, we set the base learning rate to 0.3 for backbone pretraining and 0.2 for the classifier, with a weight decay of 6E-5. For the rescaling operation, we track the required eigenvalues with an update interval of 100 batches and a moving average coefficient of 0.99. Table III shows the ablation study on the update interval and the moving average coefficient. Generally, a larger update interval should be paired with a smaller moving average coefficient, and vice versa; this is reasonable, as the two parameters together control the speed of the eigenvalue tracking. Overall, our method is insensitive to these two hyperparameters. Table IV and Figure 3 depict the ablation study on the parameter used for the rescaling operation; we adhere to the experimental setup described in Table LABEL:tab:CIFAR_linear of the main paper and report the Top-1 accuracy on CIFAR-100. As shown in Table IV and Figure 3, a smaller value of this parameter results in faster convergence during training. However, excessively small values may lead to training failure. We use the same value for all the experiments.
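The eigenvalue tracking used for the rescaling can be sketched as follows; which covariance the eigenvalues belong to and how they enter the rescaling are specified in the main paper, so the feature covariance, class name, and interface below are illustrative assumptions only.

```python
import torch

class EigenTracker:
    """Tracks eigenvalues of a feature covariance with an exponential moving
    average, refreshing them only every `update_interval` batches."""
    def __init__(self, update_interval=100, momentum=0.99):
        self.update_interval = update_interval
        self.momentum = momentum        # moving average coefficient
        self.step = 0
        self.eigvals = None

    @torch.no_grad()
    def update(self, z):                # z: (batch, dim) projector outputs
        if self.step % self.update_interval == 0:
            zc = z - z.mean(dim=0, keepdim=True)
            cov = zc.T @ zc / (z.shape[0] - 1)      # sample covariance
            new = torch.linalg.eigvalsh(cov)        # current eigenvalues
            if self.eigvals is None:
                self.eigvals = new
            else:                                   # EMA with coefficient 0.99
                self.eigvals = self.momentum * self.eigvals + (1 - self.momentum) * new
        self.step += 1
        return self.eigvals
```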
Update interval | Moving average coefficient | CIFAR-100
---|---|---
1000 | 0 | 69.45 |
1000 | 0.1 | 70.22 |
1000 | 0.99 | 68.69 |
100 | 0.99 | 70.54 |
1 | 0.99 | 69.17 |
1 | 0 | 69.37 |
Rescaling parameter | CIFAR-100
---|---
7 | 69.85
6 | 70.10
5 | 70.17
4 | 70.09
3 | 69.46
1 | n/a
