Supplementary Material of
“Explicit Mutual Information Maximization for
Self-Supervised Learning”
-A Proof of Theorem 1
We first recall the result on the invariance property of mutual information. Specifically, if $f$ and $g$ are homeomorphisms, then $I(X;Y)=I\big(f(X);g(Y)\big)$ [kraskov2004estimating]. Denote $Z=[X^{\top},Y^{\top}]^{\top}$. If $X$ and $Y$ are Gaussian distributed, i.e., $X\sim\mathcal{N}(\mu_X,\Sigma_X)$ and $Y\sim\mathcal{N}(\mu_Y,\Sigma_Y)$, then $Z\sim\mathcal{N}(\mu,\Sigma)$ with

$$\Sigma=\begin{bmatrix}\Sigma_X & \Sigma_{XY}\\ \Sigma_{XY}^{\top} & \Sigma_Y\end{bmatrix}.$$

Then, the mutual information is given by

$$I(X;Y)=H(X)+H(Y)-H(X,Y)=\frac{1}{2}\log\frac{\det(\Sigma_X)\,\det(\Sigma_Y)}{\det(\Sigma)},$$

where $H(X)$ and $H(Y)$ are the marginal entropies of $X$ and $Y$, respectively, and $H(X,Y)$ is the joint entropy of $X$ and $Y$. This, together with the invariance property of mutual information under the homeomorphism condition, yields Theorem 1.
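As a concrete illustration of computing MI from second-order statistics only, the following is a minimal NumPy sketch of the Gaussian log-determinant formula recalled above; the function name, the regularizer `eps`, and the toy data are illustrative choices, not the paper's implementation.

```python
import numpy as np

def gaussian_mi(x, y, eps=1e-6):
    """MI estimate I(X;Y) = 0.5 * log(det(Sx) * det(Sy) / det(S)),
    computed from sample covariances (second-order statistics only)."""
    n, dx = x.shape
    xy = np.concatenate([x, y], axis=1)
    S = np.cov(xy, rowvar=False) + eps * np.eye(xy.shape[1])  # joint covariance
    Sx, Sy = S[:dx, :dx], S[dx:, dx:]                          # marginal blocks
    # slogdet is used instead of det() for numerical stability
    return 0.5 * (np.linalg.slogdet(Sx)[1]
                  + np.linalg.slogdet(Sy)[1]
                  - np.linalg.slogdet(S)[1])

# toy check: two correlated Gaussian vectors sharing a common source z
rng = np.random.default_rng(0)
z = rng.standard_normal((10_000, 4))
x = z + 0.1 * rng.standard_normal((10_000, 4))
y = z + 0.1 * rng.standard_normal((10_000, 4))
print(gaussian_mi(x, y))  # large positive value, since x and y share z
```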
-B Proof of Theorem 2
Theorem 1 implies that we can compute the MI based only on second-order statistics even if the distributions of the two variables are not Gaussian. We investigate the MI under the generalized Gaussian distribution (GGD) as defined in (3) of the main paper. The GGD offers a flexible parametric form that can adapt to a wide range of distributions by varying the shape parameter in (3), from super-Gaussian (heavier-tailed than Gaussian) for small shape values to sub-Gaussian (lighter-tailed) for large ones, including the Laplacian and Gaussian distributions as special cases. Figure 1 illustrates the univariate GGD for different values of the shape parameter.
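As a quick numerical illustration of this flexibility, the sketch below evaluates the univariate GGD density with SciPy's `gennorm`; note that `gennorm` uses the convention of a density proportional to $\exp(-|x|^{\beta})$, which may differ from the parameterization in (3) of the main paper, and the shape values chosen here are arbitrary.

```python
import numpy as np
from scipy.stats import gennorm

x = np.linspace(-4, 4, 9)
for beta in (0.5, 1.0, 2.0, 8.0):
    # In SciPy's convention: beta = 2 recovers the Gaussian, beta = 1 the
    # Laplacian, and beta -> infinity approaches a uniform density.
    print(f"beta={beta}:", np.round(gennorm.pdf(x, beta), 3))
```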

Let the two variables be jointly distributed according to the GGD in (3) of the main paper, with the corresponding mean vector and dispersion matrix. The MI between them is given by
(12)
Then, it follows that
where the expectation over the parameter space of such a function is essentially a type-1 Dirichlet integral, which can be converted into an integral over a single positive variable [verdoolaegeGeometryMultivariateGeneralized2012]. Specifically,
where (a) is due to the fact that the density function of the positive variable is given by [verdoolaegeGeometryMultivariateGeneralized2012]
Therefore,
Similarly,
Substituting these expectations into (7) in the main paper leads to
(13)
where we used the following relation
Then, using the relation between the dispersion matrix and covariance matrix
it follows that
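Since the exact constant relating the dispersion and covariance matrices depends on the normalization convention used in (3) of the main paper (not reproduced here), the following sketch only illustrates the idea numerically: it samples a multivariate GGD with density proportional to $\exp(-u^{\beta}/2)$, where $u$ is the Mahalanobis-type quadratic form, via its radial-spherical representation, and checks that the sample covariance is a scalar multiple of the dispersion matrix; the constant used is derived under that assumed convention.

```python
import numpy as np
from scipy.special import gammaln

# Assumed convention: density proportional to exp(-u**beta / 2),
# with u = (x - mu)^T Sigma^{-1} (x - mu) and Sigma the dispersion matrix.
rng = np.random.default_rng(0)
d, beta, n = 4, 0.8, 500_000

A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)      # an SPD dispersion matrix
L = np.linalg.cholesky(Sigma)

# Radial-spherical sampling: under this convention u**beta / 2 ~ Gamma(d/(2*beta)),
# and the direction is uniform on the unit sphere.
s = rng.standard_normal((n, d))
s /= np.linalg.norm(s, axis=1, keepdims=True)
t = rng.gamma(shape=d / (2 * beta), scale=1.0, size=n)
r = (2.0 * t) ** (1.0 / (2.0 * beta))           # radius, r**2 = u
x = (r[:, None] * s) @ L.T

# Under the assumed convention, Cov(X) = c * Sigma with
# c = 2**(1/beta) * Gamma((d+2)/(2*beta)) / (d * Gamma(d/(2*beta))).
c = np.exp(np.log(2.0) / beta
           + gammaln((d + 2) / (2 * beta))
           - gammaln(d / (2 * beta))) / d
emp = np.cov(x, rowvar=False)
rel_err = np.linalg.norm(emp - c * Sigma) / np.linalg.norm(c * Sigma)
print(f"relative error between Cov(X) and c * Sigma: {rel_err:.3f}")  # small (Monte Carlo error only)
```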
-C Ablation Study
-C1 Loss Function
We investigate the effectiveness of each term of the loss function. Specifically, we remove each of the two regularization terms individually, or both of them (w/o both), from the loss function. Additionally, we replace the alignment term with an alternative consistency term, since both aim to align the two representations. Furthermore, we simplify the rescaling operation by using a fixed value instead of the tracked one. Results in Table I show that removing either term alone leads to a performance decrease, yet training still succeeds. However, removing both leads to training failure.
Method | CIFAR-100 | ImageNet-100 |
---|---|---|
Original | 70.5 | 81.1 |
w/o | 66.6 | 76.5 |
w/o | 67.7 | 78.8 |
w/o both | 3.55 | 4.01 |
Using | 69.8 | 79.6 |
w/o in rescaling | 70.6 | 80.3 |
The reason behind this is straightforward. By minimizing the alignment term, we aim to align the two representations, while the other two terms play a crucial role in ensuring that these representations are informative enough to prevent representation collapse. When one of these two terms is removed, the remaining one is expected to partially fulfill this role, but becomes less effective.
It can be seen from Table I that replacing the alignment term with the alternative decreases the performance. Although both terms encourage consistency between the two representations, their mathematical properties differ significantly: our term encourages a holistic consistency in the structural properties of the feature spaces. Meanwhile, our derived loss function obviates the need to tune a balance ratio between the terms. Moreover, the results show that simplifying the rescaling in the Taylor approximation, i.e., adding a fixed value to the three terms in the loss function, does not affect the performance on CIFAR-100 but decreases it on ImageNet-100.
-C2 Projector Hidden Dimension and Projector Output Dimension
We evaluate the effect of the projector's hidden dimension and output dimension in Table II. For both our method and Barlow Twins, increasing the projector hidden dimension generally improves performance on both CIFAR-100 and ImageNet-100. Compared with CIFAR-100, both methods require a larger hidden dimension to achieve high performance on the more complex ImageNet-100 dataset. With a momentum encoder, our method achieves better performance and becomes more robust to the projector hidden dimension. Similarly, both our method and Barlow Twins exhibit a trend in which increasing the projector output dimension generally improves performance. Barlow Twins is particularly sensitive to the projector output dimension, whereas our method performs well even with a projector output dimension as small as 256.
Proj. hidden dim | Ours (CIFAR-100) | Ours-M (CIFAR-100) | Barlow Twins (CIFAR-100) | Ours (ImageNet-100) | Ours-M (ImageNet-100) | Barlow Twins (ImageNet-100)
---|---|---|---|---|---|---
2048 | 70.5 | 70.4 | 70.9 | 81.1 | 81.7 | 80.4
1024 | 70.8 | 70.4 | 70.2 | 80.2 | 81.5 | 79.3
512 | 69.1 | 70.1 | 69.6 | 79.5 | 81.4 | 78.3
256 | 67.9 | 70.0 | 68.0 | 78.7 | 80.6 | 76.9

Proj. output dim | Ours (CIFAR-100) | Ours-M (CIFAR-100) | Barlow Twins (CIFAR-100) | Ours (ImageNet-100) | Ours-M (ImageNet-100) | Barlow Twins (ImageNet-100)
---|---|---|---|---|---|---
2048 | 70.5 | 70.4 | 70.9 | 81.1 | 81.7 | 80.4
1024 | 70.3 | 70.4 | 69.7 | 80.6 | 81.1 | 79.6
512 | 70.6 | 70.5 | 66.5 | 80.4 | 81.7 | 77.4
256 | 70.5 | 71.1 | 62.1 | 80.3 | 81.2 | 73.6
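For reference, a projector head of the kind ablated in Table II (a multi-layer perceptron whose hidden and output widths are the two dimensions varied above) can be sketched as below; this is a generic Barlow Twins-style projector with an assumed 2048-dimensional backbone output, not necessarily the exact architecture used in the paper.

```python
import torch.nn as nn

def make_projector(in_dim=2048, hidden_dim=2048, out_dim=2048):
    """Generic 3-layer projector head (Linear-BN-ReLU twice, then Linear).
    hidden_dim and out_dim correspond to the dimensions ablated in Table II."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim, bias=False),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, hidden_dim, bias=False),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

# e.g., the smallest output dimension considered in Table II
projector = make_projector(in_dim=2048, hidden_dim=2048, out_dim=256)
```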
-D Experiment Implementation Details
For experiments on ImageNet-1K, we use a batch size of 1020 on 3 A100 GPUs for 100, 400, and 800 epochs. Training is conducted in 16-bit precision (FP16) with gradient accumulation over 4 batches to stabilize model updating and accelerate training. We use the LARS optimizer with a base learning rate of 0.8 for backbone pretraining and 0.2 for classifier training. The learning rate is scaled with the batch size. We use a weight decay of 1.5E-6 for the backbone parameters. The linear classifier is trained on top of the frozen backbone. We follow the default settings of the solo-learn benchmark [costaSololearnLibrarySelfsupervised2022] for the remaining training hyper-parameters.
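The FP16 training with gradient accumulation described above can be organized roughly as follows; this is a generic PyTorch automatic-mixed-precision sketch in which `model`, `loader`, `optimizer` (LARS itself would come from a library such as solo-learn), and `loss_fn` are placeholders, not the authors' actual training script.

```python
import torch

accum_steps = 4                       # accumulate gradients over 4 batches
scaler = torch.cuda.amp.GradScaler()  # FP16 loss scaling

def train_epoch(model, loader, optimizer, loss_fn, device="cuda"):
    optimizer.zero_grad(set_to_none=True)
    for step, (x1, x2) in enumerate(loader):
        with torch.cuda.amp.autocast():            # forward pass in FP16
            z1, z2 = model(x1.to(device)), model(x2.to(device))
            loss = loss_fn(z1, z2) / accum_steps   # normalize the accumulated loss
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:          # update every accum_steps batches
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
```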
Recall that in implementing the loss function (9) from the main paper, the three log-determinant terms are expanded as in (10), a Taylor expansion of the log function around 1 of which only an $n$-th order truncation is retained (fourth order in the experiments of this work). As shown in Figure 2, a fourth-order approximation of the log function in (10) is sufficiently accurate around the value of 1.
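As a numerical sanity check of why a low-order truncation suffices near 1, the sketch below compares the standard series truncation $\log\det(\mathbf{C})=\operatorname{tr}\log(\mathbf{C})\approx\sum_{k=1}^{n}\frac{(-1)^{k+1}}{k}\operatorname{tr}\big((\mathbf{C}-\mathbf{I})^{k}\big)$ with the exact log-determinant for a symmetric matrix whose eigenvalues are close to 1; this is the textbook expansion of the logarithm around 1, and both the test matrix and the exact form of (10) in the paper are assumptions here.

```python
import numpy as np

def logdet_taylor(C, order=4):
    """Truncated series log det(C) = tr log(C) ~ sum_k (-1)^(k+1)/k * tr((C-I)^k),
    valid when the eigenvalues of C are close to 1."""
    d = C.shape[0]
    D = C - np.eye(d)
    approx, Dk = 0.0, np.eye(d)
    for k in range(1, order + 1):
        Dk = Dk @ D                                  # D^k
        approx += (-1) ** (k + 1) / k * np.trace(Dk)
    return approx

# symmetric matrix with eigenvalues close to 1 (mimicking a rescaled covariance)
rng = np.random.default_rng(0)
d = 8
B = 0.1 * rng.standard_normal((d, d))
C = np.eye(d) + (B + B.T) / 2
exact = np.linalg.slogdet(C)[1]
for order in (1, 2, 4):
    print(f"order {order}: {logdet_taylor(C, order):.6f}  exact: {exact:.6f}")
```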

For our method with a momentum encoder, we follow the settings of [grillBootstrapYourOwn2020, chenExploringSimpleSiamese2020] and use a two-layer predictor with hidden dimension 1024 for all datasets. For ImageNet-100, we set the base learning rate to 0.2 for backbone pretraining and 0.3 for the classifier, with a weight decay of 0.0001 for the backbone parameters. For CIFAR-100, we set the base learning rate to 0.3 for backbone pretraining and 0.2 for the classifier, with a weight decay of 6E-5. For the rescaling operation, we track the required eigenvalues with an update interval of 100 batches and a moving average coefficient of 0.99. Table III shows the ablation study on the update interval and the moving average coefficient. Generally, a larger update interval should be paired with a smaller moving average coefficient, and vice versa; this is reasonable, as the two parameters together control the speed of the eigenvalue tracking. Overall, our method is insensitive to these two hyperparameters. Table IV and Figure 3 depict the ablation study on the parameter used for the rescaling operation; we adhere to the experimental setup described in Table LABEL:tab:CIFAR_linear of the main paper and report the Top-1 accuracy on CIFAR-100. As shown in Table IV and Figure 3, a smaller value of this parameter results in faster convergence during training. However, excessively small values may lead to training failure. We use the same value for all the experiments.
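The eigenvalue tracking used for the rescaling can be sketched as follows; which covariance the eigenvalues belong to and how they enter the rescaling are specified in the main paper, so the feature covariance, class name, and interface below are illustrative assumptions only.

```python
import torch

class EigenTracker:
    """Tracks eigenvalues of a feature covariance with an exponential moving
    average, refreshing them only every `update_interval` batches."""
    def __init__(self, update_interval=100, momentum=0.99):
        self.update_interval = update_interval
        self.momentum = momentum        # moving average coefficient
        self.step = 0
        self.eigvals = None

    @torch.no_grad()
    def update(self, z):                # z: (batch, dim) projector outputs
        if self.step % self.update_interval == 0:
            zc = z - z.mean(dim=0, keepdim=True)
            cov = zc.T @ zc / (z.shape[0] - 1)      # sample covariance
            new = torch.linalg.eigvalsh(cov)        # current eigenvalues
            if self.eigvals is None:
                self.eigvals = new
            else:                                   # EMA with coefficient 0.99
                self.eigvals = self.momentum * self.eigvals + (1 - self.momentum) * new
        self.step += 1
        return self.eigvals
```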
Update interval | Moving average coefficient | CIFAR-100
---|---|---
1000 | 0 | 69.45 |
1000 | 0.1 | 70.22 |
1000 | 0.99 | 68.69 |
100 | 0.99 | 70.54 |
1 | 0.99 | 69.17 |
1 | 0 | 69.37 |
Rescaling parameter | CIFAR-100
---|---
7 | 69.85
6 | 70.10
5 | 70.17
4 | 70.09
3 | 69.46
1 | n/a
