
Supplementary Material of
“Explicit Mutual Information Maximization for Self-Supervised Learning”

Lele Chang1, Peilin Liu1, Qinghai Guo2, Fei Wen1
1School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
2Advanced Computing and Storage Laboratory, Huawei Technologies Co., Ltd., Shenzhen, China

-A Proof of Theorem 1

We first recall the invariance property of mutual information: if $Y=F(Z)$ and $Y^{\prime}=G(Z^{\prime})$ with $F$ and $G$ homeomorphisms, then $I(Z;Z^{\prime})=I(Y;Y^{\prime})$ [kraskov2004estimating]. Denote $\tilde{Y}=[Y^{T},Y^{\prime T}]^{T}$. If $Y=F(Z)$ and $Y^{\prime}=G(Z^{\prime})$ are Gaussian distributed, i.e., $Y\sim\mathcal{N}(Y;\mu_{Y},C_{YY})$ and $Y^{\prime}\sim\mathcal{N}(Y^{\prime};\mu_{Y^{\prime}},C_{Y^{\prime}Y^{\prime}})$, then $\tilde{Y}\sim\mathcal{N}(\tilde{Y};\mu_{\tilde{Y}},C_{\tilde{Y}\tilde{Y}})$ with

C_{\tilde{Y}\tilde{Y}}=\begin{bmatrix}C_{YY}&C_{YY^{\prime}}\\ C_{Y^{\prime}Y}&C_{Y^{\prime}Y^{\prime}}\end{bmatrix}.

Then, the mutual information $I(Y;Y^{\prime})=H(Y)+H(Y^{\prime})-H(Y,Y^{\prime})$ is given by

I(Y;Y^{\prime})=\frac{1}{2}\log\frac{\det(C_{YY})\det(C_{Y^{\prime}Y^{\prime}})}{\det(C_{\tilde{Y}\tilde{Y}})},

where $H(Y)$ and $H(Y^{\prime})$ are the marginal entropies of $Y$ and $Y^{\prime}$, respectively, and $H(Y,Y^{\prime})$ is their joint entropy. This, together with the invariance property of mutual information, i.e., $I(Z;Z^{\prime})=I(Y;Y^{\prime})$ under the homeomorphism condition, yields Theorem 1.
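For illustration, the closed-form expression above can be evaluated directly from sample covariances. The following NumPy sketch (illustrative, not the paper's implementation) estimates $I(Y;Y^{\prime})$ for jointly Gaussian data from the blocks of $C_{\tilde{Y}\tilde{Y}}$; the toy data and variable names are our own.

import numpy as np

def gaussian_mi(y, y_prime):
    """Estimate I(Y;Y') in nats from paired samples via the log-det formula."""
    y_tilde = np.concatenate([y, y_prime], axis=1)   # samples of [Y^T, Y'^T]^T
    d = y.shape[1]
    c_tilde = np.cov(y_tilde, rowvar=False)          # joint covariance C_{tilde Y tilde Y}
    c_yy = c_tilde[:d, :d]
    c_pp = c_tilde[d:, d:]
    # I(Y;Y') = 1/2 * log( det(C_YY) det(C_Y'Y') / det(C_tilde) )
    _, logdet_yy = np.linalg.slogdet(c_yy)
    _, logdet_pp = np.linalg.slogdet(c_pp)
    _, logdet_joint = np.linalg.slogdet(c_tilde)
    return 0.5 * (logdet_yy + logdet_pp - logdet_joint)

# Toy example: two correlated 2-D Gaussians (per-dimension correlation 0.8).
rng = np.random.default_rng(0)
z = rng.normal(size=(100000, 2))
y = z
y_prime = 0.8 * z + 0.6 * rng.normal(size=(100000, 2))
print(gaussian_mi(y, y_prime))   # analytic value: -log(0.36) ~ 1.02 nats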

-B Proof of Theorem 2

Theorem 1 implies that the MI can be computed based only on second-order statistics even if the distributions of $Z$ and $Z^{\prime}$ are not Gaussian. We investigate the MI under the generalized Gaussian distribution (GGD) as defined in (3) of the main paper. The GGD offers a flexible parametric form that can adapt to a wide range of distributions by varying the shape parameter $\beta$ in (3), from super-Gaussian when $\beta<1$ to sub-Gaussian when $\beta>1$, including the Gamma, Laplacian, and Gaussian distributions as special cases. Figure 1 illustrates the univariate GGD for different values of the shape parameter.

Figure 1: Univariate generalized Gaussian distribution with different values of the shape parameter.
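As a side note (not part of the paper's pipeline), curves of this kind can be reproduced with SciPy's generalized normal distribution. Note that scipy.stats.gennorm uses a density proportional to $\exp(-|x|^{b})$, whereas the univariate form of (3) is assumed here to have an exponent proportional to $|x|^{2\beta}$, so the paper's $\beta$ corresponds to $b=2\beta$ up to a rescaling of $x$; the plotted values of $\beta$ are illustrative.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gennorm

x = np.linspace(-4, 4, 401)
for beta in [0.5, 1.0, 2.0]:   # super-Gaussian, Gaussian, sub-Gaussian
    # b = 2*beta maps the paper's shape parameter to scipy's parameterization.
    plt.plot(x, gennorm.pdf(x, 2.0 * beta), label=f"beta = {beta}")
plt.legend()
plt.xlabel("x")
plt.ylabel("density")
plt.show()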

Let $\tilde{Z}=[Z^{T},Z^{\prime T}]^{T}\in\mathbb{R}^{2d}$. From (3) in the main paper, the joint distribution $p_{Z,Z^{\prime}}(z,z^{\prime})$ is $\tilde{Z}\sim\mathcal{GN}(\tilde{Z};\mu_{\tilde{Z}},\Sigma_{\tilde{Z}\tilde{Z}},\beta)$, where $\mu_{\tilde{Z}}$ is the mean and $\Sigma_{\tilde{Z}\tilde{Z}}$ is the dispersion matrix. The MI between $Z$ and $Z^{\prime}$ is given by

\begin{aligned}
I(Z,Z^{\prime})
&=\iint p_{Z,Z^{\prime}}(z,z^{\prime})\log\frac{p_{Z,Z^{\prime}}(z,z^{\prime})}{p_{Z}(z)p_{Z^{\prime}}(z^{\prime})}\,dz\,dz^{\prime}\\
&=E\left[\log p_{Z,Z^{\prime}}(z,z^{\prime})\right]-E\left[\log p_{Z}(z)\right]-E\left[\log p_{Z^{\prime}}(z^{\prime})\right]. \qquad (12)
\end{aligned}

Then, it follows that

\begin{aligned}
E\left[\log p_{Z,Z^{\prime}}(z,z^{\prime})\right]
=\log\frac{\Phi(\beta,2n)}{\left[\det(\Sigma_{\tilde{Z}\tilde{Z}})\right]^{1/2}}
-\frac{1}{2}E\left\{\left[(\tilde{Z}-\mu_{\tilde{Z}})^{T}\Sigma_{\tilde{Z}\tilde{Z}}^{-1}(\tilde{Z}-\mu_{\tilde{Z}})\right]^{\beta}\right\},
\end{aligned}

where the expectation of a function $\varphi\big((\tilde{Z}-\mu_{\tilde{Z}})^{T}\Sigma_{\tilde{Z}\tilde{Z}}^{-1}(\tilde{Z}-\mu_{\tilde{Z}})\big)=\varphi(\bar{Z}^{T}\bar{Z})\equiv\varphi(w)$ (with $w>0$ for $\tilde{Z}-\mu_{\tilde{Z}}\neq 0$) over the parameter space $\mathbb{R}^{n}$ is essentially a type 1 Dirichlet integral, which can be converted into an integral over $\mathbb{R}^{+}$ [verdoolaegeGeometryMultivariateGeneralized2012]. Specifically, letting $\varphi(w)=w^{\beta}$, we have

\begin{aligned}
&E\left\{\left[(\tilde{Z}-\mu_{\tilde{Z}})^{T}\Sigma_{\tilde{Z}\tilde{Z}}^{-1}(\tilde{Z}-\mu_{\tilde{Z}})\right]^{\beta}\right\}\\
&=\frac{\Phi(\beta,2n)}{\left[\det(\Sigma_{\tilde{Z}\tilde{Z}})\right]^{1/2}}\int_{\mathbb{R}^{2n}}\left[(\tilde{Z}-\mu_{\tilde{Z}})^{T}\Sigma_{\tilde{Z}\tilde{Z}}^{-1}(\tilde{Z}-\mu_{\tilde{Z}})\right]^{\beta}
\exp\left(-\frac{1}{2}\left[(\tilde{Z}-\mu_{\tilde{Z}})^{T}\Sigma_{\tilde{Z}\tilde{Z}}^{-1}(\tilde{Z}-\mu_{\tilde{Z}})\right]^{\beta}\right)d\tilde{z}\\
&\stackrel{(a)}{=}\frac{\beta}{2^{n/\beta}\Gamma(n/\beta)}\int_{\mathbb{R}^{+}}\varphi(w)\,w^{n-1}\exp\left(-\frac{1}{2}w^{\beta}\right)dw\\
&=\frac{\beta}{2^{n/\beta}\Gamma(n/\beta)}\cdot\frac{2^{2+n/\beta}\Gamma((\beta+n)/\beta)}{2\beta}\\
&=\frac{2\Gamma((\beta+n)/\beta)}{\Gamma(n/\beta)}
=\frac{2n}{\beta},
\end{aligned}

where (a) is due to the fact that the density of the positive variable $w=(\tilde{Z}-\mu_{\tilde{Z}})^{T}\Sigma_{\tilde{Z}\tilde{Z}}^{-1}(\tilde{Z}-\mu_{\tilde{Z}})$ is given by [verdoolaegeGeometryMultivariateGeneralized2012]

p(w;\beta)=\frac{\beta}{\Gamma\left(\frac{n}{\beta}\right)2^{\frac{n}{\beta}}}\,w^{n-1}\exp\left(-\frac{1}{2}w^{\beta}\right).
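As a quick numerical sanity check (not part of the paper's derivation), the value $2n/\beta$ obtained in step (a) can be verified by integrating $w^{\beta}p(w;\beta)$ by quadrature; the pairs $(n,\beta)$ below are arbitrary.

import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def expected_w_beta(n, beta):
    """Compute E[w^beta] under p(w; beta) by numerical quadrature."""
    log_norm = np.log(beta) - gammaln(n / beta) - (n / beta) * np.log(2.0)
    integrand = lambda w: np.exp(log_norm) * w ** (beta + n - 1) * np.exp(-0.5 * w ** beta)
    value, _ = quad(integrand, 0.0, np.inf)
    return value

for n, beta in [(2, 1.0), (4, 2.0), (6, 1.5)]:
    print(n, beta, expected_w_beta(n, beta), 2 * n / beta)   # quadrature vs. 2n/beta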

Therefore,

E\left[\log p_{Z,Z^{\prime}}(z,z^{\prime})\right]=\log\frac{\Phi(\beta,2n)}{\left[\det(\Sigma_{\tilde{Z}\tilde{Z}})\right]^{1/2}}-\frac{2n}{\beta}.

Similarly,

\begin{gathered}
E\left[\log p_{Z}(z)\right]=\log\frac{\Phi(\beta,n)}{\left[\det(\Sigma_{ZZ})\right]^{1/2}}-\frac{n}{\beta},\\
E\left[\log p_{Z^{\prime}}(z^{\prime})\right]=\log\frac{\Phi(\beta,n)}{\left[\det(\Sigma_{Z^{\prime}Z^{\prime}})\right]^{1/2}}-\frac{n}{\beta}.
\end{gathered}

Substituting these expectations into (7) in the main paper leads to

\begin{aligned}
I(Z,Z^{\prime})
&=\frac{1}{2}\log\frac{\det(\Sigma_{ZZ})\det(\Sigma_{Z^{\prime}Z^{\prime}})}{\det(\Sigma_{\tilde{Z}\tilde{Z}})}+\log\frac{\Phi(\beta,2n)}{[\Phi(\beta,n)]^{2}} \qquad (13)\\
&=\frac{1}{2}\log\frac{\det(\Sigma_{ZZ})\det(\Sigma_{Z^{\prime}Z^{\prime}})}{\det(\Sigma_{\tilde{Z}\tilde{Z}})},
\end{aligned}

where we used $\Phi(\beta,n)=\frac{\beta\Gamma(n/2)}{2^{n/(2\beta)}\pi^{n/2}\Gamma(n/(2\beta))}$ and the following relation

\frac{\Phi(\beta,2n)}{[\Phi(\beta,n)]^{2}}
=\frac{\beta\Gamma(n)}{\Gamma(n/\beta)}\frac{\left[\Gamma\left(\frac{n}{2\beta}\right)\right]^{2}}{\left[\beta\Gamma\left(\frac{n}{2}\right)\right]^{2}}
=\frac{1}{\beta}\frac{2^{n-\frac{1}{2}}}{2^{\frac{n}{\beta}-\frac{1}{2}}}\frac{\Gamma\left(\frac{n}{2}+\frac{1}{2}\right)}{\Gamma\left(\frac{n}{2\beta}+\frac{1}{2}\right)}\frac{\Gamma\left(\frac{n}{2\beta}\right)}{\Gamma\left(\frac{n}{2}\right)}=1.

Then, using the relation between the dispersion matrix and covariance matrix

\Sigma_{\tilde{X}\tilde{X}}=\frac{n\Gamma(n/(2\beta))}{2^{1/\beta}\Gamma((n+2)/(2\beta))}C_{\tilde{X}\tilde{X}},

it follows that

I(Z;Z^{\prime})=\frac{1}{2}\log\frac{\det(\Sigma_{ZZ})\det(\Sigma_{Z^{\prime}Z^{\prime}})}{\det(\Sigma_{\tilde{Z}\tilde{Z}})}=\frac{1}{2}\log\frac{\det(C_{ZZ})\det(C_{Z^{\prime}Z^{\prime}})}{\det(C_{\tilde{Z}\tilde{Z}})}.

-C Ablation Study

-C1 Loss Function

We investigate the effectiveness of each term of the loss function. Specifically, we remove the $\log\det C_{ZZ}$ term (w/o $\log\det C_{ZZ}$), the $\log\det C_{Z^{\prime}Z^{\prime}}$ term (w/o $\log\det C_{Z^{\prime}Z^{\prime}}$), or both terms (w/o both) from the loss function. Additionally, we replace the $\log\det(C_{ZZ}-C_{Z^{\prime}Z^{\prime}})$ term with $\|Z-Z^{\prime}\|^{2}$, since both terms aim to align the representations $Z$ and $Z^{\prime}$. Furthermore, we simplify the rescaling operation from $\tilde{M}=\frac{M-\mu_{\lambda}I}{\alpha}+I$ to $\tilde{M}=\frac{M}{\alpha}+I$ (w/o $\mu_{\lambda}$). Results in Table I show that removing either $\log\det C_{ZZ}$ or $\log\det C_{Z^{\prime}Z^{\prime}}$ leads to a performance decrease, yet training still succeeds. However, removing both leads to training failure.

TABLE I: Ablation on the loss function. The experiment follows the same setup as in Table LABEL:tab:CIFAR_linear in the main paper; Top-1 accuracy is reported.
Method CIFAR-100 ImageNet-100
Original 70.5 81.1
w/o $\log\det C_{ZZ}$ 66.6 76.5
w/o $\log\det C_{Z^{\prime}Z^{\prime}}$ 67.7 78.8
w/o both 3.55 4.01
Using $\|Z-Z^{\prime}\|^{2}$ 69.8 79.6
w/o $\mu_{\lambda}$ in rescaling 70.6 80.3

The reason behind this is straightforward. By minimizing the $\log\det(C_{ZZ}-C_{Z^{\prime}Z^{\prime}})$ term, we aim to align the representations $Z$ and $Z^{\prime}$. The $\log\det C_{ZZ}$ and $\log\det C_{Z^{\prime}Z^{\prime}}$ terms play a crucial role in ensuring that these representations are informative enough to prevent representation collapse. When one of these terms is removed, the remaining term can partially fulfill this role, but becomes less effective.

It can be seen from Table I that replacing the $\log\det(C_{ZZ}-C_{Z^{\prime}Z^{\prime}})$ term with $\|Z-Z^{\prime}\|^{2}$ decreases the performance. Although both $\log\det(C_{ZZ}-C_{Z^{\prime}Z^{\prime}})$ and $\|Z-Z^{\prime}\|^{2}$ encourage consistency between $Z$ and $Z^{\prime}$, their mathematical properties differ significantly: the $\log\det(C_{ZZ}-C_{Z^{\prime}Z^{\prime}})$ term encourages a holistic consistency in the structural properties of the feature spaces, whereas $\|Z-Z^{\prime}\|^{2}$ only penalizes sample-wise differences. Moreover, our derived loss function obviates the need to tune a balance ratio between the terms. Finally, the results show that removing the $\mu_{\lambda}$ term in the Taylor approximation, i.e., adding a fixed $I_{d}$ to the three terms in the loss function, does not affect the performance on CIFAR-100 but decreases the performance on ImageNet-100.

-C2 Projector Hidden Dimension and Projector Output Dimension

We evaluate the effect of the projector's hidden and output dimensions in Table II (a minimal sketch of such a projector is given after Table II). For both our method and Barlow Twins, increasing the projector hidden dimension generally improves performance on both CIFAR-100 and ImageNet-100. Compared with CIFAR-100, both methods need a larger hidden dimension to achieve high performance on the more complex ImageNet-100 dataset. With a momentum encoder, our method achieves better performance and becomes more robust to the projector hidden dimension. Similarly, both our method and Barlow Twins benefit from a larger projector output dimension. The performance of Barlow Twins is particularly sensitive to the projector output dimension, whereas our method performs well even with an output dimension as small as 256.

TABLE II: Impact of the projector hidden/output dimension on Top-1 accuracy.
CIFAR-100 ImageNet-100
Proj. hidden dim Ours Ours-M Barlow Twins Ours Ours-M Barlow Twins
2048 70.5 70.4 70.9 81.1 81.7 80.4
1024 70.8 70.4 70.2 80.2 81.5 79.3
512 69.1 70.1 69.6 79.5 81.4 78.3
256 67.9 70.0 68.0 78.7 80.6 76.9
Proj. output dim Ours Ours-M Barlow Twins Ours Ours-M Barlow Twins
2048 70.5 70.4 70.9 81.1 81.7 80.4
1024 70.3 70.4 69.7 80.6 81.1 79.6
512 70.6 70.5 66.5 80.4 81.7 77.4
256 70.5 71.1 62.1 80.3 81.2 73.6
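For reference, the projector dimensions studied above map onto a standard MLP projector. The PyTorch sketch below is an assumption: the Linear-BatchNorm-ReLU blocks and the three-layer depth are borrowed from common Barlow Twins-style implementations, and the exact architecture follows the main paper.

import torch.nn as nn

def build_projector(in_dim=2048, hidden_dim=2048, out_dim=2048):
    """Three-layer MLP projector (assumed structure) with configurable dimensions."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

# e.g., the smallest setting in Table II (in_dim assumes a ResNet-50 backbone):
projector = build_projector(in_dim=2048, hidden_dim=256, out_dim=256)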

-D Experiment Implementation Details

For experiments on ImageNet-1K, we use a batch size of 1020 on 3 A100 GPUs for 100, 400, and 800 epochs. Training is conducted with 16-bit precision (FP16) and gradient accumulation over 4 batches to stabilize model updates and accelerate training. We use the LARS optimizer with a base learning rate of 0.8 for backbone pretraining and 0.2 for classifier training. The learning rate is scaled as $\text{lr}=\text{base\_lr}\times\text{batch\_size}/256\times\text{num\_gpu}$. We use a weight decay of 1.5E-6 for the backbone parameters. The linear classifier is trained on top of the frozen backbone. We follow the default settings of the Solo-learn benchmark [costaSololearnLibrarySelfsupervised2022] for the rest of the training hyper-parameters.

Recall that in implementing the loss function (9) from the main paper, the three log-determinant terms are expanded as

\begin{aligned}
\log\det(M)&=\sum_{i=1}^{n}\log\lambda_{i}(M)\\
&=\sum_{i=1}^{n}\sum_{k=1}^{\infty}(-1)^{k+1}\frac{\left(\lambda_{i}(M)-1\right)^{k}}{k}\\
&=\sum_{k=1}^{\infty}(-1)^{k+1}\frac{\operatorname{tr}\left((M-I)^{k}\right)}{k}
=\operatorname{tr}\left(\sum_{k=1}^{\infty}(-1)^{k+1}\frac{(M-I)^{k}}{k}\right), \qquad (10)
\end{aligned}

where only a $p$-th order approximation is retained, e.g., $p=4$ in the experiments of this work. As shown in Figure 2, a fourth-order approximation of the log function in (10) is sufficiently accurate around the value of 1.
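The truncated expansion is straightforward to implement. The following PyTorch sketch (illustrative, not the paper's code) compares the $p$-th order trace approximation of $\log\det(M)$ with the exact value for a matrix whose eigenvalues lie near 1; the test matrix is our own construction.

import torch

def logdet_taylor(M, p=4):
    """Approximate log det(M) by tr( sum_{k=1}^{p} (-1)^{k+1} (M - I)^k / k )."""
    n = M.shape[0]
    D = M - torch.eye(n, dtype=M.dtype)
    power = torch.eye(n, dtype=M.dtype)
    approx = 0.0
    for k in range(1, p + 1):
        power = power @ D                            # (M - I)^k
        approx = approx + (-1) ** (k + 1) * torch.trace(power) / k
    return approx

# Random symmetric matrix with eigenvalues close to 1, where the expansion is accurate.
torch.manual_seed(0)
A = torch.randn(64, 64, dtype=torch.float64)
M = torch.eye(64, dtype=torch.float64) + 0.02 * (A + A.T) / 2
print(logdet_taylor(M, p=4).item(), torch.logdet(M).item())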

Figure 2: Illustration of the fourth-order approximation of the log function in (10) of the main paper.

For our method with a momentum encoder, we follow the settings of [grillBootstrapYourOwn2020, chenExploringSimpleSiamese2020] and use a two-layer predictor with a hidden dimension of 1024 for all datasets. For ImageNet-100, we set the base learning rate to 0.2 for backbone pretraining and 0.3 for the classifier, with a weight decay of 0.0001 for the backbone parameters. For CIFAR-100, we set the base learning rate to 0.3 for backbone pretraining and 0.2 for the classifier, with a weight decay of 6E-5. For the rescaling operation ($\tilde{M}=\frac{M-\mu_{\lambda}I}{\alpha}+I$), we track the eigenvalues of $C_{ZZ}$ with an update interval of 100 batches and a moving average coefficient $\rho$ of 0.99 (a minimal sketch of this tracking scheme is given after this paragraph). Table III shows the ablation study on the update interval and the moving average coefficient $\rho$. Generally, a larger update interval should be paired with a smaller moving average coefficient, and vice versa, which is reasonable since the two parameters jointly control the speed of the eigenvalue tracking. Overall, our method is insensitive to these two hyperparameters. Table IV and Figure 3 depict the ablation study on the parameter $\beta$ used for the rescaling operation; we adhere to the experimental setup described in Table LABEL:tab:CIFAR_linear in the main paper and report the Top-1 accuracy on CIFAR-100. As shown in Table IV and Figure 3, a smaller $\beta$ results in faster convergence during training, but excessively small values may lead to training failure. We set $\beta=5$ for all the experiments.
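The eigenvalue tracking used for the rescaling can be implemented as a simple exponential moving average that is refreshed every update interval. The sketch below is an assumption about the implementation details (in particular, it tracks the mean eigenvalue $\mu_{\lambda}$ of $C_{ZZ}$ and assumes centered embeddings); the exact tracked statistic and the loss in (9) follow the main paper.

import torch

class EigenvalueTracker:
    """EMA of the mean eigenvalue of C_ZZ, refreshed every `update_interval` batches (assumed scheme)."""

    def __init__(self, update_interval=100, rho=0.99):
        self.update_interval = update_interval
        self.rho = rho
        self.mu_lambda = None
        self.step = 0

    def update(self, z):
        # z: batch of embeddings, shape (batch_size, d); columns assumed centered.
        if self.step % self.update_interval == 0:
            c_zz = (z.T @ z) / (z.shape[0] - 1)                      # sample covariance C_ZZ
            mean_eig = torch.linalg.eigvalsh(c_zz.detach()).mean()   # tracked outside the graph
            if self.mu_lambda is None:
                self.mu_lambda = mean_eig
            else:
                self.mu_lambda = self.rho * self.mu_lambda + (1 - self.rho) * mean_eig
        self.step += 1
        return self.mu_lambda

def rescale(M, mu_lambda, alpha):
    """Rescaling operation M_tilde = (M - mu_lambda * I) / alpha + I."""
    I = torch.eye(M.shape[0], dtype=M.dtype, device=M.device)
    return (M - mu_lambda * I) / alpha + I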

TABLE III: Ablation study on the update interval and moving average coefficient $\rho$ for the eigenvalue tracking used in the rescaling operation.
update interval $\rho$ CIFAR-100
1000 0 69.45
1000 0.1 70.22
1000 0.99 68.69
100 0.99 70.54
1 0.99 69.17
1 0 69.37
TABLE IV: Ablation study on the parameter $\beta$ used for the rescaling operation.
$\beta$ CIFAR-100
7 69.85
6 70.10
5 70.17
4 70.09
3 69.46
1 NaN
Figure 3: The convergence curves of our method on CIFAR-100 for different values of the parameter $\beta$ used for the rescaling operation.