
XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning

Pritam Sarkar1, 2, Ali Etemad1
Abstract

We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled videos. XKD is trained with two pseudo objectives. First, masked data reconstruction is performed to learn modality-specific representations from audio and visual streams. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through a teacher-student setup to learn complementary information. We introduce a novel domain alignment strategy to tackle the domain discrepancy between audio and visual modalities, enabling effective cross-modal knowledge distillation. Additionally, to develop a general-purpose network capable of handling both audio and visual streams, modality-agnostic variants of XKD are introduced, which use the same pretrained backbone for different audio and visual tasks. Our proposed cross-modal knowledge distillation improves video action classification by 8% to 14% on UCF101, HMDB51, and Kinetics400. Additionally, XKD improves multimodal action classification by 5.5% on Kinetics-Sound. XKD shows state-of-the-art performance in sound classification on ESC50, achieving a top-1 accuracy of 96.5%.

1 Introduction

Self-supervised learning aims to learn meaningful representations from unlabelled data with no human supervision (Chen et al. 2020; Chen and He 2021; Misra and Maaten 2020; Grill et al. 2020; He et al. 2021). Using self-supervision, recent multimodal methods (Morgado, Vasconcelos, and Misra 2021; Alwassel et al. 2020; Morgado, Misra, and Vasconcelos 2021; Min et al. 2021; Ma et al. 2020; Sarkar and Etemad 2023) have shown great promise in learning effective representations from videos. In general, multimodal frameworks leverage the information available in multiple data streams to learn better representations for downstream tasks involving either or all modalities. Recent audio-visual frameworks aim to perform information sharing between networks to further enrich the learned representations (Afouras, Chung, and Zisserman 2020; Chen et al. 2021; Ren et al. 2021; Aytar, Vondrick, and Torralba 2016; Albanie et al. 2018). However, effective knowledge sharing between audio and video is particularly challenging due to the inherent diversity, complexity, and domain-specific nature of each modality, as well as the existence of substantial domain gaps between them (Ren et al. 2021; Chen et al. 2021).

In this work, we aim to perform effective information sharing between audio and video streams to obtain more generalized representations for downstream tasks. To this end, we propose a novel self-supervised framework called XKD, which stands for Cross-modal Knowledge Distillation. Our approach consists of two sets of pseudo-tasks: (i) masked data modelling and (ii) cross-modal knowledge distillation. The former is performed to learn modality-specific (MS) information, while the latter distills and transfers knowledge across modalities to further enrich the learned representations. To allow for stable and effective information exchange between modalities, we introduce a domain alignment strategy. The proposed strategy involves two steps: (i) feature refinement, which identifies ‘what to transfer’ based on cross-modal feature relevance, and (ii) minimizing domain discrepancy to align the two representations. Additionally, we introduce modality-agnostic (MA) variants of our method to tackle the challenging task of learning from both modalities using a unified network (shared backbone). Modality-agnostic methods are particularly useful given their ability to accept different data streams and solve a variety of tasks using the same backbone. Moreover, they ease the challenge of designing individual models for each modality, thus attracting attention in recent studies (Girdhar et al. 2022, 2023; Akbari et al. 2021).

Inspired by the recent success of Transformers in different domains (Devlin et al. 2018; Gong, Chung, and Glass 2021a; He et al. 2021; Feichtenhofer et al. 2022), we use ViT (Dosovitskiy et al. 2020) as the backbone of our framework for both audio and visual modalities. We pretrain the framework on three datasets of different sizes: AudioSet, Kinetics400, and Kinetics-Sound. The pretrained backbones are then evaluated on a variety of downstream tasks across multiple datasets. More specifically, UCF101, HMDB51, and Kinetics400 are used for video-related tasks; ESC50 and FSD50K are used for audio-related tasks; and Kinetics-Sound is used for multimodal evaluation.

Our contributions are summarized as follows:

  • We introduce XKD, a self-supervised framework for video representation learning, which uses a novel domain alignment strategy to enable self-supervised cross-modal knowledge distillation between audio and video as two heterogeneous modalities. The proposed domain alignment strategy minimizes the domain gap and identifies the most transferable features between the two domains for effective cross-modal knowledge distillation.

  • Rigorous experiments and thorough ablations are performed to analyse the proposed method. XKD achieves state-of-the-art or competitive performance on a variety of downstream tasks including video action recognition, sound classification, and multimodal fusion.

  • Our proposed modality-agnostic variants achieve very competitive performance compared to the modality-specific ones, enabling the use of a single pretrained encoder for a variety of audio and visual downstream tasks.

The code, pretrained models, and supplementary material are available on the project website.

2 Related Work

Video self-supervised learning. Self-supervised learning has been widely used in a variety of different areas, including the challenging task of video representation learning (Schiappa, Rawat, and Shah 2022). Several prior works have attempted to learn video representations through both uni-modal (Feichtenhofer et al. 2021; Qian et al. 2021; Jing et al. 2018) as well as multimodal (Morgado, Vasconcelos, and Misra 2021; Alayrac et al. 2020; Recasens et al. 2021; Xiao, Tighe, and Modolo 2022; Han, Xie, and Zisserman 2020) pretraining. For example, prior works have explored self-supervised frameworks for learning video representations through contrastive (Morgado, Vasconcelos, and Misra 2021; Morgado, Misra, and Vasconcelos 2021; Alayrac et al. 2020), non-contrastive (Sarkar and Etemad 2023; Recasens et al. 2021), and deep clustering (Alwassel et al. 2020; Asano et al. 2020) techniques.

Masked data modelling. Inspired by the success of BERT (Devlin et al. 2018) in natural language processing, several prior works have attempted to learn meaningful representations through the reconstruction of masked (corrupted) inputs. Such methods employ encoder-decoder setups, where the encoder compresses the inputs into a lower dimension, while the decoder is trained for reconstruction. This simple approach shows promise in different domains including image (Bao, Dong, and Wei 2021; He et al. 2021; Bachmann et al. 2022), video (Wang et al. 2021; Tong et al. 2022; Feichtenhofer et al. 2022), and audio (Niizumi et al. 2022b; Gong, Chung, and Glass 2021a; Chong et al. 2022) among others.

Cross-modal knowledge distillation. The main goal of cross-modal knowledge distillation is to transfer knowledge across different modalities (Afouras, Chung, and Zisserman 2020; Dai, Das, and Bremond 2021; Chen et al. 2021; Ren et al. 2021; Aytar, Vondrick, and Torralba 2016; Albanie et al. 2018; Piergiovanni, Angelova, and Ryoo 2020). For example, (Afouras, Chung, and Zisserman 2020; Ren et al. 2021) attempted knowledge distillation between audio teachers and video students to improve visual representations. Cross-modal knowledge distillation amongst different visual modalities has been performed in (Dai, Das, and Bremond 2021), where a pretrained optical flow teacher is used to improve RGB student’s representation.

Modality-agnostic networks. While modality-specific models have been the preferred approach for developing representation learning solutions due to their strong performance, they cannot learn from multiple modalities using the same backbone, which means a separate model must be designed for each modality. Recently, (Girdhar et al. 2022; Akbari et al. 2021) introduced modality-agnostic models capable of learning different modalities using the same backbones. In our paper, we attempt to develop modality-agnostic variants of our solution which can handle two very different modalities, i.e., audio and video, to solve a variety of downstream tasks.

Figure 1: An overview of our proposed framework, which consists of three steps. Masked reconstruction: autoencoders are used to learn representations from individual modalities through reconstruction of highly masked inputs. Domain alignment: to enable cross-modal knowledge distillation, the two domains are aligned through feature refinement and minimizing domain discrepancies. Cross-modal knowledge distillation: the students distill knowledge from their respective cross-modal teachers.

3 Method

3.1 Overview

Figure 1 presents an overview of our framework. Our method consists of two sets of autoencoders for audio (\theta_{ae}^{a}) and video (\theta_{ae}^{v}). First, we train these autoencoders to reconstruct the masked inputs, which helps the encoders (\theta_{e}^{a} and \theta_{e}^{v}) learn modality-specific information. Next, to transfer knowledge between modalities and learn complementary information, we align the two domains by identifying the most transferable features and minimizing the domain gaps. Finally, the audio (\theta_{t}^{a}) and video (\theta_{t}^{v}) teachers are employed to provide cross-modal supervision to the opposite modalities. The details of our proposed method are presented below.

3.2 Data Embedding

Consider a given video clip x, whose visual frames and audio spectrogram are denoted as x_{v} and x_{a}, respectively. We create ‘global’ and ‘local’ views from each modality, which are used by the teachers and students, respectively. First, we apply augmentations on x_{v} and x_{a} to generate the global views x_{v}^{g} and x_{a}^{g}. Next, we generate n local views x_{v}^{l} and x_{a}^{l} from x_{v} and x_{a} respectively, where x_{v}^{l}=\{x_{v1}^{l},\dots,x_{vn}^{l}\} and x_{a}^{l}=\{x_{a1}^{l},\dots,x_{an}^{l}\}. Specifically, x_{ai}^{l} is an augmented time-frequency crop of x_{a} and x_{vi}^{l} is an augmented spatio-temporal crop of x_{v}; moreover, x_{v}^{l} and x_{a}^{l} are augmented differently than x_{v}^{g} and x_{a}^{g}. Both local and global views of the audio and visual inputs are projected into an embedding space. Specifically, the spectrograms x_{a}\in\mathbb{R}^{F\times T_{a}} are reshaped into N_{a} smaller patches of size f\times t_{a}, where N_{a}=F/f\times T_{a}/t_{a}. Similarly, we reshape the videos x_{v}\in\mathbb{R}^{T_{v}\times H\times W\times C} into N_{v} smaller cuboids of size t_{v}\times h\times w\times C, where N_{v}=T_{v}/t_{v}\times H/h\times W/w. Finally, the spectrogram patches and visual cuboids are flattened into vectors and linearly projected onto the embedding space, which is then fed to the encoders.
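To make the tokenization concrete, the following PyTorch-style sketch (our illustration, not the authors' code; the helper names and default patch sizes are assumptions) splits a spectrogram into patches and a video into cuboids and projects them into the embedding space:

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Flattened patches/cuboids -> embedding space via a linear projection."""
    def __init__(self, patch_dim, embed_dim):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)

    def forward(self, patches):          # (B, N, patch_dim)
        return self.proj(patches)        # (B, N, embed_dim)

def patchify_audio(x_a, f=4, t=16):
    # x_a: (B, F, T_a) -> (B, N_a, f*t) with N_a = (F/f) * (T_a/t)
    B, F, T = x_a.shape
    x = x_a.reshape(B, F // f, f, T // t, t)
    return x.permute(0, 1, 3, 2, 4).reshape(B, (F // f) * (T // t), f * t)

def patchify_video(x_v, tv=4, h=16, w=16):
    # x_v: (B, T_v, H, W, C) -> (B, N_v, tv*h*w*C)
    B, T, H, W, C = x_v.shape
    x = x_v.reshape(B, T // tv, tv, H // h, h, W // w, w, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7)
    return x.reshape(B, (T // tv) * (H // h) * (W // w), tv * h * w * C)

In practice this linear projection is often implemented as a strided convolution with kernel size equal to the patch size, which is equivalent.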

3.3 Masked Data Modelling

Inspired by the recent success and scalability of pretraining with masked reconstruction in different domains (Devlin et al. 2018; Bao, Dong, and Wei 2021; Wang et al. 2021; Gong, Chung, and Glass 2021a; Baevski et al. 2022; He et al. 2021; Niizumi et al. 2022b; Tong et al. 2022), we adopt masked data modelling in our framework to learn modality-specific representations. The masked reconstruction employs an autoencoder \theta_{ae}, which consists of an encoder \theta_{e} and a decoder \theta_{d}. Let x be the input, expressed as a sequence of tokens \{x_{i}\}_{i=1}^{N} as described in Section 3.2. We randomly mask some of the input tokens with m\in\{0,1\}^{N}; hence the masked tokens x^{[m]} are represented as \{x_{i}|m_{i}=1\}_{i=1}^{N}, while the corrupted input \hat{x} is represented as \{x_{i}|m_{i}=0\}_{i=1}^{N}. Further, we drop the masked tokens x^{[m]} before feeding the input to \theta_{ae} for computational efficiency (Akbari et al. 2021; He et al. 2021). We train \theta_{ae} to reconstruct x from the input \hat{x} by minimizing the reconstruction loss \mathcal{L}_{\mathrm{recon}}:

\mathcal{L}_{\mathrm{recon}}(\theta_{ae}(\hat{x}),x^{[m]})=\frac{1}{N_{m}}\sum_{i=1}^{N_{m}}(\theta_{ae}(\hat{x}_{i})-x_{i}^{[m]})^{2}.   (1)

In particular, using Equation 1, we define the video and audio reconstruction losses \mathcal{L}_{\mathrm{recon}}^{v} and \mathcal{L}_{\mathrm{recon}}^{a} for the inputs x_{v}^{g} and x_{a}^{g}, used to train the video and audio autoencoders \theta_{ae}^{v} and \theta_{ae}^{a}, respectively. To jointly train the audio-visual masked autoencoder, we define the final reconstruction loss \mathcal{L}_{\mathrm{ae}} as:

\mathcal{L}_{\mathrm{ae}}=\mathcal{L}_{\mathrm{recon}}(\theta_{ae}^{v}(\hat{x}_{v}^{g}),x_{v}^{g[m]})+\mathcal{L}_{\mathrm{recon}}(\theta_{ae}^{a}(\hat{x}_{a}^{g}),x_{a}^{g[m]}).   (2)
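As a concrete illustration of Equations 1-2, the following PyTorch-style sketch (ours, not the released code; tensor shapes are assumptions, and the encoder-side dropping of masked tokens is omitted) computes the reconstruction loss only over the masked tokens:

import torch

def masked_recon_loss(pred, target, mask):
    """Mean-squared error over masked tokens only (Eq. 1).
    pred, target: (B, N, D) token sequences; mask: (B, N) with 1 = masked."""
    loss = (pred - target) ** 2              # (B, N, D)
    loss = loss.mean(dim=-1)                 # per-token MSE, (B, N)
    return (loss * mask).sum() / mask.sum().clamp(min=1)

# Joint audio-visual objective (Eq. 2), given hypothetical decoder outputs
# and per-modality targets/masks:
# loss_ae = masked_recon_loss(pred_v, target_v, mask_v) \
#         + masked_recon_loss(pred_a, target_a, mask_a)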

3.4 Knowledge Distillation w/ Domain Alignment

To facilitate cross-modal knowledge sharing, we adopt a teacher-student setup (Tarvainen and Valpola 2017). The teacher (\theta_{t}) and student (\theta_{s}) each comprise a backbone and a projector head (\theta_{h}); the teacher and student network architectures are identical but differently parameterized. We parameterize \theta_{s} as \{\theta_{e}, \theta_{h}\}, where \theta_{e} is the same encoder used for reconstruction (explained in the previous subsection) and \theta_{h} is a newly added projector head. We denote \theta_{s}^{v} and \theta_{s}^{a} as the video and audio students, and \theta_{t}^{v} and \theta_{t}^{a} as the video and audio teachers.

As mentioned earlier, cross-modal knowledge distillation between audio and video is particularly difficult due to the inherent diversity, complexity, and domain-specific nature of these two modalities. To tackle this, we perform domain alignment by identifying the most transferable features and minimizing the domain discrepancy. The proposed domain alignment ensures meaningful target distributions are set by the teachers in order to perform successful cross-modal knowledge distillation.

Domain Alignment. Both audio and video carry a rich and diverse set of information about the source. Therefore, we first identify the most transferable features by re-weighting the teachers’ representations based on their cross-modal feature importance through a soft-selection process. Specifically, we compute cross-modal attention maps to estimate cross-modal feature importance with respect to the corresponding modality. To calculate the cross-modal attention maps, we first extract the modality-specific attention maps (A) from the last attention layer of each teacher network as:

A=\mathrm{softmax}(Q^{\mathrm{[CLS]}}\cdot\frac{K^{T}}{\sqrt{d}}).   (3)

Here, Q denotes the query and K the key. Specifically, A\in\mathbb{R}^{H\times N} is calculated as the correlation between the query (Q) embedding of the class token (CLS) and the key (K) embeddings of all the other patches or cuboids, where H denotes the number of attention heads and N the number of patches or cuboids. We obtain the visual attention A_{v}\in\mathbb{R}^{H\times N_{v}} and the audio attention A_{a}\in\mathbb{R}^{H\times N_{a}} as per Equation 3. Next, we obtain the respective cross-modal attention maps A_{v}^{\times} and A_{a}^{\times} as:

A_{v}^{\times}=\mathrm{MeanPool}(A_{v}\cdot A_{a}^{T})/\mathrm{scale_{v}};\quad A_{a}^{\times}=\mathrm{MeanPool}(A_{a}\cdot A_{v}^{T})/\mathrm{scale_{a}}.   (4)

Here, A_{v}\cdot A_{a}^{T}\in\mathbb{R}^{H\times N_{v}\times N_{a}} and A_{a}\cdot A_{v}^{T}\in\mathbb{R}^{H\times N_{a}\times N_{v}}, and \mathrm{MeanPool} is applied across the last dimension. Additionally, \mathrm{scale_{v}} and \mathrm{scale_{a}} are scaling factors, obtained as \frac{1}{N_{v}}\sum_{i=1}^{N_{v}}A_{v} and \frac{1}{N_{a}}\sum_{i=1}^{N_{a}}A_{a}, used to re-scale the computed cross-modal attention maps back to their original range for numerical stability. We obtain the most transferable features from the teachers, f_{t}^{v} and f_{t}^{a}, as \mathrm{refine}(\theta_{t}^{v}(x_{v}^{g}),A_{v}^{\times}) and \mathrm{refine}(\theta_{t}^{a}(x_{a}^{g}),A_{a}^{\times}), respectively, where A_{v}^{\times} and A_{a}^{\times} re-weight the visual and audio representations. We formulate \mathrm{refine} as:

\mathrm{refine}(\theta_{t}(x^{g}),A^{\times})=\theta_{t}(x^{g})\cdot\frac{1}{H}\sum_{h=1}^{H}A_{h}^{\times}\cdot\Omega,   (5)

where A_{h}^{\times} represents the cross-modal attention of each head and \Omega is a non-negative scalar defined as the ratio of prior and posterior energy, expressed as:

\Omega=\frac{\|\theta_{t}(x^{g})\|_{2}^{2}}{\|\theta_{t}(x^{g})\cdot A^{\times}\|_{2}^{2}}.   (6)
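The sketch below (our illustration; the pooling and scaling details that the text leaves ambiguous, such as per-head scaling, are assumptions) outlines how Equations 3-6 could be computed for a single sample:

import torch

def cross_modal_attention(A_v, A_a, eps=1e-6):
    """Sketch of Eq. 4. A_v: (H, N_v), A_a: (H, N_a) are CLS-to-token
    attention maps from the last layer of each teacher (Eq. 3)."""
    outer_v = torch.einsum('hv,ha->hva', A_v, A_a)      # (H, N_v, N_a)
    outer_a = torch.einsum('ha,hv->hav', A_a, A_v)      # (H, N_a, N_v)
    scale_v = A_v.mean(dim=-1, keepdim=True)             # assumed per-head mean
    scale_a = A_a.mean(dim=-1, keepdim=True)
    A_v_x = outer_v.mean(dim=-1) / (scale_v + eps)       # (H, N_v)
    A_a_x = outer_a.mean(dim=-1) / (scale_a + eps)       # (H, N_a)
    return A_v_x, A_a_x

def refine(feat, A_x, eps=1e-6):
    """Sketch of Eqs. 5-6: re-weight teacher tokens by the head-averaged
    cross-modal attention, then rescale by the energy ratio Omega.
    feat: (N, D) teacher tokens, A_x: (H, N)."""
    w = A_x.mean(dim=0).unsqueeze(-1)                    # (N, 1)
    weighted = feat * w
    omega = feat.pow(2).sum() / (weighted.pow(2).sum() + eps)
    return weighted * omega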

Next, to improve the knowledge transferability, we reduce the domain gaps by minimizing the Maximum Mean Discrepancy (MMD) (Gretton et al. 2006) loss, estimated as:

\mathcal{L}_{\mathrm{mmd}}(\mathbb{P}_{a},\mathbb{P}_{v})=\|\mathbb{E}_{x_{a}\sim\mathbb{P}_{a}}[k(\cdot,x_{a})]-\mathbb{E}_{x_{v}\sim\mathbb{P}_{v}}[k(\cdot,x_{v})]\|_{\mathcal{H}_{k}}.   (7)

Here, x_{a} and x_{v} are drawn from the distributions \mathbb{P}_{a} and \mathbb{P}_{v} respectively. Additionally, \|\cdot\|_{\mathcal{H}_{k}} is the RKHS norm (Gretton et al. 2006), and k represents the Gaussian kernel with bandwidth \sigma, written as:

k(x_{a},x_{v})=\exp\left(\frac{-\|x_{a}-x_{v}\|^{2}}{2\sigma^{2}}\right).   (8)
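A minimal sketch of the MMD loss of Equations 7-8, using a biased batch estimator and an illustrative bandwidth (both of which are assumptions rather than the paper's exact choices):

import torch

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between rows of x and y (Eq. 8)."""
    d2 = torch.cdist(x, y, p=2).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd_loss(f_a, f_v, sigma=1.0):
    """Biased squared-MMD estimate between two batches of features (Eq. 7).
    f_a: (B, D) audio features, f_v: (B, D) video features."""
    k_aa = gaussian_kernel(f_a, f_a, sigma).mean()
    k_vv = gaussian_kernel(f_v, f_v, sigma).mean()
    k_av = gaussian_kernel(f_a, f_v, sigma).mean()
    return k_aa + k_vv - 2 * k_av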

Using Equation 7, we define the domain alignment loss \mathcal{L}_{\mathrm{da}} as:

\mathcal{L}_{\mathrm{da}}=\mathcal{L}_{\mathrm{mmd}}(f_{s}^{v},f_{s}^{a})+\mathcal{L}_{\mathrm{mmd}}(f_{t}^{v},f_{t}^{a}).   (9)

Here, f_{s}^{a} and f_{s}^{v} refer to the audio and visual representations obtained from \theta_{s}^{a} and \theta_{s}^{v} respectively.

Knowledge Distillation. We perform knowledge distillation between modalities to provide cross-modal supervision and thereby learn complementary information. We train the student networks to match the distributions of the cross-modal teacher networks: \theta_{s}^{v} is trained to match the distribution of \theta_{t}^{a}, while \theta_{s}^{a} is trained to match the distribution of \theta_{t}^{v}. Following (Caron et al. 2021), we normalize f_{t}^{a} and f_{t}^{v} with a ‘mean’ calculated from the current batch statistics, which helps the teachers’ outputs be more uniformly distributed. Additionally, to prevent abrupt updates of the batch means, we slowly update the ‘means’ using an exponential moving average (EMA). Finally, we minimize the cross-entropy loss H(a,b), formulated as -a\log b, where a and b represent the output probability distributions of \theta_{t} and \theta_{s} respectively. The probability P over the J output dimensions is obtained using a \mathrm{softmax} function with temperature parameter \tau, where 0<\tau<1 is used to sharpen the distribution:

P(f^{(i)})=\frac{\exp(f^{(i)}/\tau)}{\sum_{j=1}^{J}\exp(f^{(j)}/\tau)}.   (10)

Accordingly, we define the cross-modal knowledge distillation loss \mathcal{L}_{\mathrm{kd}} as:

\mathcal{L}_{\mathrm{kd}}=-P(f_{t}^{v})\log(P(f_{s}^{a}))-P(f_{t}^{a})\log(P(f_{s}^{v})).   (11)
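A sketch of the distillation objective of Equations 10-11; the temperature values and the omitted batch-mean centering are assumptions following common DINO-style recipes, not the paper's exact settings:

import torch
import torch.nn.functional as F

def sharpen(f, tau):
    """Temperature-scaled softmax over the projection dimension (Eq. 10)."""
    return F.softmax(f / tau, dim=-1)

def xkd_distillation_loss(f_t_v, f_t_a, f_s_v, f_s_a,
                          tau_teacher=0.04, tau_student=0.1):
    """Sketch of Eq. 11: each student matches the opposite modality's teacher."""
    p_t_v = sharpen(f_t_v, tau_teacher).detach()         # teachers give targets
    p_t_a = sharpen(f_t_a, tau_teacher).detach()
    log_p_s_v = F.log_softmax(f_s_v / tau_student, dim=-1)
    log_p_s_a = F.log_softmax(f_s_a / tau_student, dim=-1)
    loss_v2a = -(p_t_v * log_p_s_a).sum(dim=-1).mean()   # video teacher -> audio student
    loss_a2v = -(p_t_a * log_p_s_v).sum(dim=-1).mean()   # audio teacher -> video student
    return loss_v2a + loss_a2v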

3.5 Final Loss

To train XKD, we define the final loss function as the combination of reconstruction, domain alignment, and cross-modal knowledge distillation losses expressed as:

\mathcal{L}_{\mathrm{xkd}}=\lambda_{\mathrm{ae}}\times\mathcal{L}_{\mathrm{ae}}+\lambda_{\mathrm{da}}\times\mathcal{L}_{\mathrm{da}}+\lambda_{\mathrm{kd}}\times\mathcal{L}_{\mathrm{kd}}.   (12)

Here, \lambda_{\mathrm{ae}}, \lambda_{\mathrm{da}}, and \lambda_{\mathrm{kd}} are the loss coefficients corresponding to the three loss terms. Empirically, we set \lambda_{\mathrm{ae}}, \lambda_{\mathrm{da}}, and \lambda_{\mathrm{kd}} to 5, 1, and 1, respectively. It should be noted that \mathcal{L}_{\mathrm{xkd}} is only used to train \theta_{s} and \theta_{ae}, not \theta_{t}. EMA (Grill et al. 2020; Tarvainen and Valpola 2017; Chen, Xie, and He 2021) is used to slowly update \theta_{t}^{a} and \theta_{t}^{v} as:

\theta_{t}^{a}\leftarrow\lambda_{a}\theta_{t}^{a}+(1-\lambda_{a})\theta_{s}^{a}\quad\text{and}\quad\theta_{t}^{v}\leftarrow\lambda_{v}\theta_{t}^{v}+(1-\lambda_{v})\theta_{s}^{v},   (13)

where \lambda_{a} and \lambda_{v} are the EMA coefficients corresponding to \theta_{t}^{a} and \theta_{t}^{v}. The full procedure is presented in Algorithm 1.

1: Input: x_{v}, x_{a}
2: Initialize: \theta_{d}^{v}, \theta_{d}^{a}, \theta_{s}^{v}, \theta_{s}^{a}, \theta_{t}^{v}, \theta_{t}^{a}
3: for t in iterations do
4:     Obtain x_{v}^{g} and x_{v}^{l} from x_{v}
5:     Obtain x_{a}^{g} and x_{a}^{l} from x_{a}
6:     # masked data reconstruction
7:     f_{sv}^{g} = \theta_{e}^{v}(\hat{x}_{v}^{g}); f_{sa}^{g} = \theta_{e}^{a}(\hat{x}_{a}^{g})
8:     # total masked reconstruction loss
9:     \mathcal{L}_{ae} = \mathcal{L}_{recon}(\theta_{d}^{v}(f_{sv}^{g}), x_{v}^{g[m]}) + \mathcal{L}_{recon}(\theta_{d}^{a}(f_{sa}^{g}), x_{a}^{g[m]})
10:    # cross-modal knowledge distillation w/ domain alignment
11:    with no-gradient:
12:        f_{tv}^{g} = \theta_{t}^{v}(x_{v}^{g}); f_{ta}^{g} = \theta_{t}^{a}(x_{a}^{g})
13:        Obtain A_{v}^{\times}, A_{a}^{\times}  ▷ Eqn. 4
14:        # feature refinement
15:        f_{t}^{v} = refine(f_{tv}^{g}, A_{v}^{\times})  ▷ Eqn. 5
16:        f_{t}^{a} = refine(f_{ta}^{g}, A_{a}^{\times})  ▷ Eqn. 5
17:    f_{sv}^{l} = \theta_{s}^{v}(x_{v}^{l}); f_{sa}^{l} = \theta_{s}^{a}(x_{a}^{l})
18:    f_{s}^{v} = Concat(\theta_{h}^{v}(f_{sv}^{g}), f_{sv}^{l})
19:    f_{s}^{a} = Concat(\theta_{h}^{a}(f_{sa}^{g}), f_{sa}^{l})
20:    # domain alignment loss
21:    \mathcal{L}_{da} = \mathcal{L}_{mmd}(f_{s}^{v}, f_{s}^{a}) + \mathcal{L}_{mmd}(f_{t}^{v}, f_{t}^{a})
22:    # cross-modal knowledge distillation loss
23:    \mathcal{L}_{kd} = -P(f_{t}^{v}) log(P(f_{s}^{a})) - P(f_{t}^{a}) log(P(f_{s}^{v}))
24:    # final loss
25:    \mathcal{L}_{xkd} = \lambda_{ae} \times \mathcal{L}_{ae} + \lambda_{da} \times \mathcal{L}_{da} + \lambda_{kd} \times \mathcal{L}_{kd}
26:    Update \theta_{d}^{v}, \theta_{d}^{a}, \theta_{s}^{v}, \theta_{s}^{a} based on \mathcal{L}_{xkd}
27:    Update \theta_{t}^{v} and \theta_{t}^{a}  ▷ Eqn. 13
28: end for
29: Output: \theta_{s}^{v}, \theta_{s}^{a}, \theta_{t}^{v}, \theta_{t}^{a}
Algorithm 1: XKD Training Algorithm.
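For completeness, a minimal sketch of the EMA teacher update of Equation 13 together with the loss combination of Equation 12; applying the update per-parameter is an assumption about the implementation:

import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.997):
    """Sketch of Eq. 13: exponential moving average update of a teacher
    from its student (momentum value follows the coefficient in Sec. 4.6)."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1 - momentum)

# Final objective (Eq. 12) with the coefficients reported in Sec. 3.5:
# loss = 5.0 * loss_ae + 1.0 * loss_da + 1.0 * loss_kd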

3.6 Modality-agnostic Variant

We take our proposed approach a step forward and attempt to train it in a modality-agnostic fashion with the goal of developing a ‘general’ network capable of handling both modalities. This is a very challenging task in the context of our work given the very diverse nature of audio and video streams. We introduce two modality-agnostic variants, XKD-MATS and XKD-MAS. As the name suggests, in XKD-MATS the audio and visual teachers share their backbones, and so do the audio and visual students. In the case of XKD-MAS, the audio and visual students share their backbones, while the audio and visual teachers use modality-specific backbones. Please note that in all the setups, we use Equation 13 to update the teachers using their respective students. Moreover, given the need to reconstruct different modalities, all of the variants use modality-specific decoders and input projection layers. The rest of the setup remains the same as the original XKD. Both variants are trained with \mathcal{L}_{\mathrm{xkd}} (see Equation 12).
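The snippet below is a minimal sketch (not the authors' code; the class name is a placeholder) of how backbone sharing differs between the modality-specific and modality-agnostic variants:

import torch.nn as nn

class ViTBackbone(nn.Module):
    """Placeholder for the ViT encoder used by teachers and students."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.Identity()      # stands in for the transformer blocks

    def forward(self, tokens):
        return self.blocks(tokens)

def build_students(variant="MS"):
    if variant == "MS":                  # modality-specific: separate backbones
        return ViTBackbone(), ViTBackbone()
    shared = ViTBackbone()               # MAS / MATS: students share one backbone
    return shared, shared                # (video_student, audio_student)

# In XKD-MATS the teachers likewise share a single backbone; decoders and
# input projection layers stay modality-specific in every variant.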

4 Experiments and Results

4.1 Implementation Details

Datasets. We pretrain XKD on 3 datasets of different sizes: the small-scale Kinetics-Sound (Arandjelovic and Zisserman 2017), the large-scale Kinetics400 (Kay et al. 2017), and the very large-scale AudioSet (Gemmeke et al. 2017). We evaluate our proposed method on a variety of downstream tasks including video action recognition, sound classification, and multimodal action classification. We use a total of 6 datasets for downstream tasks, namely Kinetics400 (K400) (Kay et al. 2017), Kinetics-Sound (KS) (Arandjelovic and Zisserman 2017), UCF101 (U101) (Soomro, Zamir, and Shah 2012), HMDB51 (H51) (Kuehne et al. 2011), ESC50 (E50) (Piczak 2015), and FSD50K (F50K) (Fonseca et al. 2022). The dataset details are provided in the supplementary material (Suppl. Mat.) Sec. A.1. Unless mentioned otherwise, Kinetics400 is used for pretraining.

Input setup. During pretraining, to save computation, we downsample the video input to 8 FPS and resize the frames to a resolution of 112×112. Additionally, we re-sample the audio waveforms at 16 kHz and generate mel-spectrograms using 80 mel filters. Next, we create global and local views for both audio and video. We use 4 seconds of audio-visual input for the global views, while the local views are generated by taking random temporal segments of 1 second unless stated otherwise. The final input sizes to the teachers are 3×32×112×112 for video and 80×448 for audio. Similarly, the input sizes to the students are 3×8×96×96 and 80×112, for video and audio respectively. Moreover, the inputs to the encoders for masked reconstruction are the same as the inputs to the teacher networks, except that they are heavily masked. We use a patch size of 4×16 for audio spectrograms and a cuboid size of 4×16×16 for video. Please see additional details in the Suppl. Mat. Sec. A.2 and A.3.
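As a quick sanity check on these shapes (an illustration; the exact axis ordering of the patch sizes is assumed, though the totals are unaffected), the teacher inputs yield the following token counts:

# Worked token counts for the teacher-side inputs:
video_tokens = (32 // 4) * (112 // 16) * (112 // 16)   # 8 * 7 * 7 = 392 cuboids
audio_tokens = (80 * 448) // (4 * 16)                  # 560 spectrogram patches
print(video_tokens, audio_tokens)                      # 392 560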

Architecture. We choose ViT (Dosovitskiy et al. 2020) as the backbone for both audio and video, due to its stable performance across different data streams (Akbari et al. 2021). In particular, we experiment with two ViT variants, ViT-B and ViT-L. By default, ViT-B is used as the video backbone when pretraining on Kinetics400 and Kinetics-Sound, whereas ViT-L is used when pretraining on AudioSet. In all setups, ViT-B is used as the audio backbone. Additional details are in the Suppl. Mat. Sec. A.3 and A.4.

Evaluation. Following (Sarkar and Etemad 2023; Morgado, Vasconcelos, and Misra 2021; Alwassel et al. 2020; Recasens et al. 2021), we evaluate XKD in both linear and finetuning setups. We refer readers to the Suppl. Mat. Sec. A.5 and A.6 for details on the evaluation protocols.

4.2 Effect of Cross-modal Knowledge Distillation

As discussed, the proposed XKD is pretrained to solve masked data modelling and cross-modal knowledge distillation. Therefore, to obtain an accurate understanding of the impact of cross-modal knowledge distillation, we compare XKD (\mathcal{L}_{\mathrm{xkd}}) with the following baselines:

  • masked video reconstruction (\mathcal{L}_{\mathrm{recon}}^{v})

  • masked audio reconstruction (\mathcal{L}_{\mathrm{recon}}^{a})

  • audio-visual masked autoencoder (\mathcal{L}_{\mathrm{ae}}).

We pretrain the above variants for the full training schedule of 800 epochs and report linear evaluation results on split-1 of the downstream benchmarks.

Visual representations. The results presented in Table 1 show that video representations significantly benefit from the cross-modal supervision obtained through knowledge distillation. In comparison to \mathcal{L}_{\mathrm{recon}}^{v}, \mathcal{L}_{\mathrm{xkd}} improves video action recognition by 8.6%, 8.2%, and 13.9% on UCF101, HMDB51, and Kinetics-Sound, respectively. Furthermore, on the more challenging Kinetics400 linear evaluation setup, it improves top-1 accuracy by 15.7%. Interestingly, we notice that the joint audio-visual masked reconstruction (\mathcal{L}_{\mathrm{ae}}) does not yield considerable improvements, e.g., it improves top-1 accuracies by only 0.1%-0.2% on UCF101, HMDB51, and Kinetics-Sound.

Loss UCF101 HMDB51 Kin-Sound Kinetics400
\mathcal{L}_{\mathrm{recon}}^{v} 76.1 (↓8.6) 51.1 (↓8.2) 56.8 (↓13.9) 30.7 (↓15.7)
\mathcal{L}_{\mathrm{ae}} 76.3 (↓8.4) 51.2 (↓8.1) 56.9 (↓13.8) 32.2 (↓14.2)
\mathcal{L}_{\mathrm{xkd}} 84.7 59.3 70.7 46.4
Table 1: Effect of cross-modal knowledge distillation in video action recognition. We report linear eval. top-1 acc.
Loss FSD50K ESC50
\mathcal{L}_{\mathrm{recon}}^{a} 44.6 (↓1.2) 90.0 (↓1.0)
\mathcal{L}_{\mathrm{ae}} 44.3 (↓1.5) 90.0 (↓1.0)
\mathcal{L}_{\mathrm{xkd}} 45.8 91.0
Table 2: Effect of cross-modal knowledge distillation in sound classification. We report mAP for FSD50K and top-1 acc. for ESC50.
Figure 2: Visualizing the distribution of audio and visual representations, obtained from Kinetics400. It shows that audio has a sharper distribution compared to video. We find that a sharper distribution inherently provides better cross-modal supervision.

Audio representations. In Table 2, we observe that cross-modal knowledge distillation improves sound classification by 1.5% and 1% on FSD50K and ESC50 respectively. We note that the improvement is less prominent than for visual representations. Our literature review reveals that a similar phenomenon has been noticed in earlier works (Chen et al. 2021; Ren et al. 2021) that attempted cross-modal knowledge distillation between audio and video in a semi-supervised setting. We conclude that while audio-to-video knowledge distillation is highly effective, video-to-audio provides a less substantial improvement. This is likely because a sharper distribution is preferred for providing supervision (Caron et al. 2021; Chen, Xie, and He 2021). As shown in Figure 2, the distribution of audio is sharper, while the distribution of video is considerably wider. We find that this constraint can be overcome to some extent by applying aggressive sharpening of the visual representations. Please find additional discussion on temperature scheduling in the Suppl. Mat. Sec. B.1. Additionally, readers are referred to the Suppl. Mat. Sec. B.2 to B.4 for studies on the EMA schedule, local views, and mask ratio.

Loss UCF HMDB Kinetics Remarks
\mathcal{L}_{\mathrm{ae}} 76.3 51.2 32.2 multimodal baseline.
+\mathcal{L}_{\mathrm{kd}} 76.3 51.2 32.2 without \mathcal{L}_{\mathrm{da}}, knowledge distillation fails.
+\mathcal{L}_{\mathrm{da}} 77.8 51.4 33.7 without \mathcal{L}_{\mathrm{kd}}, \mathcal{L}_{\mathrm{da}} has marginal impact.
+\mathcal{L}_{\mathrm{kd}}+\mathcal{L}_{\mathrm{da}} 84.7 59.3 46.4 \mathcal{L}_{\mathrm{da}}+\mathcal{L}_{\mathrm{kd}} improves the accuracy by 8-14%.
Table 3: Effect of domain alignment. \mathcal{L}_{\mathrm{da}} and \mathcal{L}_{\mathrm{kd}} are complementary to each other. While they are not effective when applied separately, their combined optimization significantly improves the performance.

4.3 Effect of Domain Alignment

We conduct a thorough ablation study investigating the impact of domain alignment in our proposed framework, presented in Table 3. First, without domain alignment, the model fails to perform cross-modal knowledge distillation due to the domain discrepancy and behaves as an audio-visual masked autoencoder. Second, quite expectedly, naively aligning the two domains without knowledge distillation has only a minor impact. Last, our results show that the proposed \mathcal{L}_{\mathrm{da}} and \mathcal{L}_{\mathrm{kd}} are complementary to each other, and their combined effect significantly improves the performance, e.g., by 8.1-14.2% on UCF101, HMDB51, and Kinetics400. In addition to the absence of domain alignment, we identify two more factors that could cause training instability, discussed in the Suppl. Mat. Sec. B.5 and B.6. Please see alternative design choices for domain alignment in the Suppl. Mat. Sec. B.7.

4.4 Effect of Feature Refinement (refine)

To study the impact of \mathrm{refine} on cross-modal knowledge distillation, we modify \mathcal{L}_{\mathrm{da}} in the final loss function (Equation 12) to \mathcal{L}_{\mathrm{mmd}}(\theta_{t}^{a}(x_{a}^{g}),\theta_{t}^{v}(x_{v}^{g}))+\mathcal{L}_{\mathrm{mmd}}(f_{s}^{v},f_{s}^{a}). The results presented in Table 4 demonstrate that \mathrm{refine} improves downstream performance by 5.8%, 3.7%, and 3.0% on HMDB51, UCF101, and ESC50 respectively. In Figure 3, we visualize the attention from the last layer with and without \mathrm{refine}, which further confirms the ability of \mathrm{refine} to identify the most transferable and key features for enhanced downstream performance. Please see alternative design choices for feature refinement in the Suppl. Mat. Sec. B.8.

without with
HMDB51 53.5 (↓5.8) 59.3
UCF101 81.0 (↓3.7) 84.7
ESC50 88.0 (↓3.0) 91.0
Table 4: Effect of refine. refine significantly improves downstream task performance on both audio and video tasks.
[Figure 3 panels: attention without refine (left) vs. with refine (right), for ‘Crying’ and ‘Stretching’ examples.]
Figure 3: Visualizing the effect of refine in identifying the most transferable and key visual features. Please see more qualitative examples in the Suppl. Mat. Sec. C.1.
MATS MAS MS
HMDB51 53.5 (↓5.8) 55.2 (↓4.1) 59.3
UCF101 80.3 (↓4.4) 81.5 (↓3.2) 84.7
ESC50 90.3 (↓0.7) 89.5 (↓1.5) 91.0
Table 5: Comparison between modality-agnostic (MATS, MAS) and modality-specific (MS) variants.
Student Teacher
HMDB51 57.8 (↓1.5) 59.3
UCF101 84.4 (↓0.3) 84.7
ESC50 90.3 (↓0.7) 91.0
Table 6: Comparison between students and teachers.

4.5 Comparing Modality-Agnostic vs. -Specific

In Table 5, we compare the performance of the modality-agnostic (MA) variants with the modality-specific (MS) one. Amongst the MA variants, XKD-MATS works slightly better on ESC50, whereas XKD-MAS shows better performance on both UCF101 and HMDB51. The results are promising, as these variants show only minor performance drops (e.g., 0.7%-4.1%) in comparison to the default XKD. Please note that such performance drops are expected as we keep the encoder size fixed for both MA and MS variants (Akbari et al. 2021). To further elaborate, while we dedicate 2 separate backbones of 87M parameters to learn audio and visual representations in the MS variant, we use just one backbone of 87M parameters for both audio and visual modalities in the MA variants. Therefore, the network parameters become saturated relatively quickly, limiting performance. A simple solution could be to use a larger backbone for the MA variants, which could be explored in the future.

4.6 Comparing Teacher vs. Student

In Table 6, we present the comparison between teachers and students. Our results show that, due to the slow weight update schedule using EMA (see Equation 13), the teachers become slightly stronger learners than the students. We study the impact of different EMA schedules (presented in the Suppl. Mat. Sec. B.2) and find that EMA coefficients of 0.997 work best in our setup. As presented in Table 6, the video teacher outperforms the video student by 0.3% and 1.5% on UCF101 and HMDB51 respectively, and the audio teacher outperforms the audio student by 0.7% on ESC50. Please note that, by default, we use the teachers to perform downstream tasks. We present a detailed comparison between the modality-specific and modality-agnostic variants using both teacher and student encoders in the Suppl. Mat. Sec. B.9.

4.7 Scalability

We study the scalability of XKD on 3 pretraining datasets of different sizes, similar to (Sarkar and Etemad 2023; Ma et al. 2020), i.e., Kinetics-Sound (22K), Kinetics400 (240K), and AudioSet (1.8M). Additionally, we experiment with 2 ViT variants of different sizes, i.e., ViT-B (87M) and ViT-L (305M). We try both ViT-B and ViT-L as the video backbone when pretraining on AudioSet. We report the linear evaluation top-1 accuracy averaged over all the splits of UCF101, HMDB51, and ESC50. Figure 4 shows that XKD continues to perform better as we increase the number of training samples and/or the size of the network. Such scalability is a much-desired property, which suggests XKD can likely be scaled to even larger datasets like HowTo100M (Miech et al. 2019) or larger networks like ViT-22B (Dehghani et al. 2023). We also study the effect of longer pretraining of XKD, presented in the Suppl. Mat. Sec. B.10.

Figure 4: We study scalability both in terms of the dataset (x-axis) and network (ViT-B/ViT-L) size. Our results exhibit that XKD is scalable to both larger datasets and networks.
Method Pretrain Mod. UCF101 HMDB51 Kinetics400
Lin. FT. Lin. FT. Lin. FT.
AVSlowFast (2020) K400 VA 77.4 87.0 44.1 54.6 - -
SeLaVi (2020) K400 VA - 83.1 - 47.1 - -
XDC (2020) K400 VA - 86.8 - 52.6 - -
CMACC (2020) K400 VA - 90.2 - 61.8 - -
AVID (2021) K400 VA 72.3 87.5 41.4 60.8 44.5 -
CMAC (2021) K400 VA - 90.3 - 61.1 - -
GDT (2021a) K400 VA 70.1 90.9 38.5 62.3 - -
STiCA (2021b) K400 VA - 93.1 - 67.0 - -
CrissCross (2023) K400 VA 83.9 91.5 50.0 64.7 44.5 -
XKD K400 VA 83.8 94.1 57.4 69.0 51.4 77.6
XKD-MAS K400 VA 81.7 93.4 55.1 65.9 50.1 75.9
XKD-MATS K400 VA 80.1 93.1 53.1 65.7 48.8 75.7
XDC (2020) AS VA 85.3 93.0 56.0 63.7 - -
MMV (2020) AS VA 83.9 91.5 60.0 70.1 - -
CM-ACC (2020) AS VA - 93.5 - 67.2 - -
BraVe∗∗ (2021) AS VA 90.0 93.6 63.6 70.8 - -
AVID (2021) AS VA - 91.5 - 64.7 48.9 -
CrissCross (2023) AS VA 87.7 92.4 56.2 67.4 50.1 -
XKD AS VA 88.4 95.8 62.2 75.7 56.5 80.1
VideoMAE (2022) U101/H51 V - 91.3 - 62.6 - -
BEVT (2021) K400 V - - - - - 76.2
VideoMAE (2022) K400 V - - - - - 79.0
VideoMAE∗∗ (2022) K400 V 80.0 96.1 54.3 73.3 - 80.0
CPD (2020) K400 VT - 90.5 - 63.6 - -
CoCLR (2020) K400 VF 74.5 - 46.1 - - -
BraVe∗∗ (2021) AS VFA 93.2 96.9 69.9 79.4 - -
MIL-NCE (2020) HT VT - 91.3 - 61.0 - -
VATT (2021) AS+HT VAT 89.6 - 65.2 - - 81.1
VATT-MA (2021) AS+HT VAT 84.4 - 63.1 - - 79.9
ELo (2020) YT VFA - 93.8 - 67.4 - -
Mod.: Modality, V: Video, A: Audio, T: Text, F: Flow. ∗computed by us using official checkpoints. ∗∗industry-level computation, e.g., VideoMAE uses 64 GPUs vs. our 8, and BraVe pretrains with a very high temporal resolution (128 frames) compared to others (8-32). More comparisons with VideoMAE are in the Suppl. Mat. Sec. B.11.
Table 7: Comparison on video action recognition. XKD outperforms or achieves competitive performance compared to state-of-the-art methods.

4.8 Video Action Recognition

Following the standard practice in (Alwassel et al. 2020; Sarkar and Etemad 2023; Ma et al. 2020), we compare XKD with other leading audio-visual self-supervised frameworks in Table 7, in both linear evaluation (Lin.) and finetuning (FT.) on UCF101, HMDB51, and Kinetics400. We note a large variability in the experiment setups amongst the prior works; however, they are included for a more inclusive and thorough comparison. We report top-1 accuracy averaged over all the splits for both UCF101 and HMDB51.

The results presented in Table 7 show that XKD outperforms or achieves competitive performance amongst the leading audio-visual self-supervised methods. For example, XKD pretrained on Kinetics400 outperforms AVID and CrissCross in linear evaluation on Kinetics400 by a very large margin. Moreover, XKD outperforms powerful state-of-the-art models like STiCA, BraVe, and CrissCross, among several others, when finetuned on UCF101. Additionally, XKD outperforms prior works that are pretrained with massive datasets and three modalities, such as ELo pretrained with video, audio, and optical flow from YouTube8M (YT) (Abu-El-Haija et al. 2016), whereas XKD is pretrained with only audio and video from AudioSet-2M. Our modality-agnostic variants also achieve encouraging results; only a minor performance drop is noticed between XKD-MAS and XKD, e.g., 0.7% on UCF101 and 1.7% on Kinetics400, when finetuned.

Method Pretrain Mod. ESC50 FSD50K
Lin. FT. Lin. FT.
XDC (2020) K400 VA 78.0 - - -
AVID (2021) K400 VA 79.1 - - -
STiCA (2021b) K400 VA 81.1 - - -
CrissCross (2023) K400 VA 86.8 - - -
XKD K400 VA 89.4 93.6 45.8 54.1
XKD-MAS K400 VA 87.3 92.7 43.0 52.3
XKD-MATS K400 VA 88.7 92.9 43.4 53.8
AVTS (2018) AS VA 80.6 - - -
AVID (2021) AS VA 89.2 - - -
GDT (2021a) AS VA 88.5 - - -
CrissCross (2023) AS VA 90.5 - - -
XKD AS VA 93.6 96.5 51.5 58.5
AudioMAE (2022) AS A - 93.6 - -
MaskSpec (2022) AS A - 89.6 - -
MAE-AST (2022) AS A - 90.0 - -
BYOL-A (2022a) AS A - - 44.8 -
Aud. T-former (2021) AS A - - - 53.7
SS-AST (2021a) AS A - 88.8 - -
PSLA (2021b) AS A - - - 55.8
VATT (2021) AS+HT VAT 84.7 - - -
VATT-MA (2021) AS+HT VAT 81.2 - - -
PaSST(SL) (2021) AS+IN AI - 95.5 - 58.4
Here, I: Image, IN: ImageNet (Russakovsky et al. 2015).
Table 8: Comparison on sound classification. XKD outperforms the prior state-of-the-art methods.

4.9 Sound Classification

In Table 8, we compare the performance of our proposed method using linear evaluation (Lin.) and finetuning (FT.) on sound classification using 2 popular audio benchmarks, ESC50 and FSD50K. Following (Fonseca et al. 2022; Piczak 2015), we report top-1 accuracy averaged over all the splits of ESC50 and mean average precision on FSD50K. XKD outperforms prior works such as XDC, AVID, and CrissCross on ESC50 in both linear evaluation and finetuning. Moreover, XKD and its MA variants outperform VATT and VATT-MA on ESC50 by 4.7% and 7.5%, even though VATT is pretrained with 136M videos, compared to XKD which is trained on only 240K samples. Additionally, when evaluated on FSD50K, XKD outperforms BYOL-A, Audio Transformer, and PSLA, among others. Lastly, XKD achieves state-of-the-art performance on ESC50, reaching a top-1 finetuned accuracy of 96.5% when pretrained on AudioSet.

Method Audio + Video
CrissCross 66.7
\mathcal{L}_{\mathrm{ae}} (AV-MAE) 75.7
XKD 81.2
XKD-MAS 78.8
XKD-MATS 78.3
Table 9: Multimodal action classification by late fusion on Kinetics-Sound.

4.10 Multimodal Fusion

Following (Sarkar and Etemad 2023), we evaluate XKD on multimodal action classification using Kinetics-Sound. We extract fixed audio and visual embeddings from the pretrained encoders and concatenate them (i.e., late fusion), then train a linear SVM classifier and report top-1 accuracy. The results presented in Table 9 show that XKD outperforms CrissCross by 14.5% and the baseline audio-visual masked autoencoder (\mathcal{L}_{\mathrm{ae}}) by 5.5%.
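A minimal sketch of this late-fusion protocol (our illustration; the SVM hyperparameter is an illustrative choice), assuming precomputed embeddings from the frozen pretrained encoders:

import numpy as np
from sklearn.svm import LinearSVC

def late_fusion_eval(z_video_tr, z_audio_tr, y_tr, z_video_te, z_audio_te, y_te):
    # Concatenate fixed audio and video embeddings (late fusion).
    x_tr = np.concatenate([z_video_tr, z_audio_tr], axis=1)
    x_te = np.concatenate([z_video_te, z_audio_te], axis=1)
    clf = LinearSVC(C=1.0).fit(x_tr, y_tr)        # linear SVM on fused features
    return (clf.predict(x_te) == y_te).mean()     # top-1 accuracy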

4.11 In-painting

We present reconstruction examples in Figure 5, which show that XKD retains its reconstruction ability for both audio and video even when a very high masking ratio is applied. This makes XKD also suitable for in-painting tasks. More examples are provided in the Suppl. Mat. Sec. C.2.

[Figure 5 panels: Original, Masked, and Reconstructed examples for video (top row) and audio (bottom row).]
Figure 5: Reconstruction examples from highly masked inputs; the video and audio mask ratios are 80% and 70%, respectively.

5 Summary

In this work, we propose XKD, a novel self-supervised framework to improve video representation learning using cross-modal knowledge distillation. To effectively transfer knowledge between audio and video modalities, XKD aligns the two domains by identifying the most transferable features and minimizing the domain gaps. Our study shows that cross-modal knowledge distillation significantly improves video representations on a variety of benchmarks. Additionally, to develop a general network with the ability to process different modalities, we introduce modality-agnostic variants of XKD which show promising results in handling both audio and video using the same backbone. We believe that our approach can be further expanded to perform cross-modal knowledge distillation between other modalities as well (e.g., vision and language), which could be investigated in the future.

Acknowledgments

We are grateful to the Bank of Montreal and Mitacs for funding this research. We are thankful to SciNet HPC Consortium for helping with the computation resources.

References

  • Abu-El-Haija et al. (2016) Abu-El-Haija, S.; Kothari, N.; Lee, J.; Natsev, P.; Toderici, G.; Varadarajan, B.; and Vijayanarasimhan, S. 2016. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.
  • Afouras, Chung, and Zisserman (2020) Afouras, T.; Chung, J. S.; and Zisserman, A. 2020. Asr is all you need: Cross-modal distillation for lip reading. In ICASSP, 2143–2147. IEEE.
  • Akbari et al. (2021) Akbari, H.; Yuan, L.; Qian, R.; Chuang, W.-H.; Chang, S.-F.; Cui, Y.; and Gong, B. 2021. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. NeurIPS, 34.
  • Alayrac et al. (2020) Alayrac, J.-B.; Recasens, A.; Schneider, R.; Arandjelovic, R.; Ramapuram, J.; De Fauw, J.; Smaira, L.; Dieleman, S.; and Zisserman, A. 2020. Self-Supervised MultiModal Versatile Networks. NeurIPS, 2(6): 7.
  • Albanie et al. (2018) Albanie, S.; Nagrani, A.; Vedaldi, A.; and Zisserman, A. 2018. Emotion recognition in speech using cross-modal transfer in the wild. In ACM Multimedia, 292–301.
  • Alwassel et al. (2020) Alwassel, H.; Mahajan, D.; Korbar, B.; Torresani, L.; Ghanem, B.; and Tran, D. 2020. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. NeurIPS, 33.
  • Arandjelovic and Zisserman (2017) Arandjelovic, R.; and Zisserman, A. 2017. Look, listen and learn. In ICCV, 609–617.
  • Asano et al. (2020) Asano, Y. M.; Patrick, M.; Rupprecht, C.; and Vedaldi, A. 2020. Labelling unlabelled videos from scratch with multi-modal self-supervision. In NeurIPS.
  • Aytar, Vondrick, and Torralba (2016) Aytar, Y.; Vondrick, C.; and Torralba, A. 2016. Soundnet: Learning sound representations from unlabeled video. NeurIPS, 29.
  • Baade, Peng, and Harwath (2022) Baade, A.; Peng, P.; and Harwath, D. 2022. Mae-ast: Masked autoencoding audio spectrogram transformer. arXiv preprint arXiv:2203.16691.
  • Bachmann et al. (2022) Bachmann, R.; Mizrahi, D.; Atanov, A.; and Zamir, A. 2022. MultiMAE: Multi-modal Multi-task Masked Autoencoders. arXiv preprint arXiv:2204.01678.
  • Baevski et al. (2022) Baevski, A.; Hsu, W.-N.; Xu, Q.; Babu, A.; Gu, J.; and Auli, M. 2022. Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555.
  • Bao, Dong, and Wei (2021) Bao, H.; Dong, L.; and Wei, F. 2021. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254.
  • Caron et al. (2021) Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; and Joulin, A. 2021. Emerging properties in self-supervised vision transformers. In ICCV, 9650–9660.
  • Chen et al. (2020) Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. A simple framework for contrastive learning of visual representations. In ICML, 1597–1607.
  • Chen and He (2021) Chen, X.; and He, K. 2021. Exploring simple siamese representation learning. In CVPR, 15750–15758.
  • Chen, Xie, and He (2021) Chen, X.; Xie, S.; and He, K. 2021. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9640–9649.
  • Chen et al. (2021) Chen, Y.; Xian, Y.; Koepke, A.; Shan, Y.; and Akata, Z. 2021. Distilling audio-visual knowledge by compositional contrastive learning. In CVPR, 7016–7025.
  • Chong et al. (2022) Chong, D.; Wang, H.; Zhou, P.; and Zeng, Q. 2022. Masked Spectrogram Prediction For Self-Supervised Audio Pre-Training. arXiv preprint arXiv:2204.12768.
  • Cubuk et al. (2020) Cubuk, E. D.; Zoph, B.; Shlens, J.; and Le, Q. V. 2020. Randaugment: Practical automated data augmentation with a reduced search space. In CVPRW, 702–703.
  • Dai, Das, and Bremond (2021) Dai, R.; Das, S.; and Bremond, F. 2021. Learning an augmented rgb representation with cross-modal knowledge distillation for action detection. In ICCV, 13053–13064.
  • Dehghani et al. (2023) Dehghani, M.; Djolonga, J.; Mustafa, B.; Padlewski, P.; Heek, J.; Gilmer, J.; Steiner, A. P.; Caron, M.; Geirhos, R.; Alabdulmohsin, I.; et al. 2023. Scaling vision transformers to 22 billion parameters. In ICML, 7480–7512. PMLR.
  • Devlin et al. (2018) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dosovitskiy et al. (2020) Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  • Feichtenhofer et al. (2022) Feichtenhofer, C.; Fan, H.; Li, Y.; and He, K. 2022. Masked Autoencoders As Spatiotemporal Learners. arXiv preprint arXiv:2205.09113.
  • Feichtenhofer et al. (2021) Feichtenhofer, C.; Fan, H.; Xiong, B.; Girshick, R.; and He, K. 2021. A large-scale study on unsupervised spatiotemporal representation learning. In CVPR, 3299–3309.
  • Fonseca et al. (2022) Fonseca, E.; Favory, X.; Pons, J.; Font, F.; and Serra, X. 2022. FSD50K: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30: 829–852.
  • Gemmeke et al. (2017) Gemmeke, J. F.; Ellis, D. P.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R. C.; Plakal, M.; and Ritter, M. 2017. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP, 776–780.
  • Girdhar et al. (2023) Girdhar, R.; El-Nouby, A.; Liu, Z.; Singh, M.; Alwala, K. V.; Joulin, A.; and Misra, I. 2023. Imagebind: One embedding space to bind them all. In CVPR, 15180–15190.
  • Girdhar et al. (2022) Girdhar, R.; Singh, M.; Ravi, N.; van der Maaten, L.; Joulin, A.; and Misra, I. 2022. Omnivore: A single model for many visual modalities. In CVPR, 16102–16112.
  • Gong, Chung, and Glass (2021a) Gong, Y.; Chung, Y.-A.; and Glass, J. 2021a. Ast: Audio spectrogram transformer. arXiv preprint arXiv:2104.01778.
  • Gong, Chung, and Glass (2021b) Gong, Y.; Chung, Y.-A.; and Glass, J. 2021b. Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 3292–3306.
  • Gretton et al. (2006) Gretton, A.; Borgwardt, K.; Rasch, M.; Schölkopf, B.; and Smola, A. 2006. A kernel method for the two-sample-problem. NeurIPS, 19.
  • Grill et al. (2020) Grill, J.-B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Pires, B.; Guo, Z.; Azar, M.; et al. 2020. Bootstrap Your Own Latent: A new approach to self-supervised learning. In NeurIPS.
  • Han, Xie, and Zisserman (2020) Han, T.; Xie, W.; and Zisserman, A. 2020. Self-supervised Co-training for Video Representation Learning. In NeurIPS.
  • He et al. (2021) He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; and Girshick, R. 2021. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.
  • Huang et al. (2022) Huang, P.-Y.; Xu, H.; Li, J.; Baevski, A.; Auli, M.; Galuba, W.; Metze, F.; and Feichtenhofer, C. 2022. Masked autoencoders that listen. NeurIPS, 35: 28708–28720.
  • Jing et al. (2018) Jing, L.; Yang, X.; Liu, J.; and Tian, Y. 2018. Self-supervised spatiotemporal feature learning via video rotation prediction. arXiv preprint arXiv:1811.11387.
  • Kay et al. (2017) Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
  • Korbar, Tran, and Torresani (2018) Korbar, B.; Tran, D.; and Torresani, L. 2018. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, 7774–7785.
  • Koutini et al. (2021) Koutini, K.; Schlüter, J.; Eghbal-zadeh, H.; and Widmer, G. 2021. Efficient training of audio transformers with patchout. arXiv preprint arXiv:2110.05069.
  • Kuehne et al. (2011) Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; and Serre, T. 2011. HMDB: a large video database for human motion recognition. In ICCV, 2556–2563.
  • Li and Wang (2020) Li, T.; and Wang, L. 2020. Learning spatiotemporal features via video and text pair discrimination. arXiv preprint arXiv:2001.05691.
  • Loshchilov and Hutter (2017) Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Ma et al. (2020) Ma, S.; Zeng, Z.; McDuff, D.; and Song, Y. 2020. Active Contrastive Learning of Audio-Visual Video Representations. In ICLR.
  • Micikevicius et al. (2018) Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. 2018. Mixed Precision Training. In ICLR.
  • Miech et al. (2020) Miech, A.; Alayrac, J.-B.; Smaira, L.; Laptev, I.; Sivic, J.; and Zisserman, A. 2020. End-to-end learning of visual representations from uncurated instructional videos. In CVPR, 9879–9889.
  • Miech et al. (2019) Miech, A.; Zhukov, D.; Alayrac, J.-B.; Tapaswi, M.; Laptev, I.; and Sivic, J. 2019. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV.
  • Min et al. (2021) Min, S.; Dai, Q.; Xie, H.; Gan, C.; Zhang, Y.; and Wang, J. 2021. Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning. arXiv preprint arXiv:2106.06939.
  • Misra and Maaten (2020) Misra, I.; and Maaten, L. v. d. 2020. Self-supervised learning of pretext-invariant representations. In CVPR, 6707–6717.
  • Morgado, Misra, and Vasconcelos (2021) Morgado, P.; Misra, I.; and Vasconcelos, N. 2021. Robust Audio-Visual Instance Discrimination. In CVPR, 12934–12945.
  • Morgado, Vasconcelos, and Misra (2021) Morgado, P.; Vasconcelos, N.; and Misra, I. 2021. Audio-visual instance discrimination with cross-modal agreement. In CVPR, 12475–12486.
  • Niizumi et al. (2022a) Niizumi, D.; Takeuchi, D.; Ohishi, Y.; Harada, N.; and Kashino, K. 2022a. BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations. arXiv preprint arXiv:2204.07402.
  • Niizumi et al. (2022b) Niizumi, D.; Takeuchi, D.; Ohishi, Y.; Harada, N.; and Kashino, K. 2022b. Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation. arXiv preprint arXiv:2204.12260.
  • Patrick et al. (2021a) Patrick, M.; Asano, Y. M.; Kuznetsova, P.; Fong, R.; Henriques, J. F.; Zweig, G.; and Vedaldi, A. 2021a. Multi-modal Self-Supervision from Generalized Data Transformations. ICCV.
  • Patrick et al. (2021b) Patrick, M.; Huang, P.-Y.; Misra, I.; Metze, F.; Vedaldi, A.; Asano, Y. M.; and Henriques, J. F. 2021b. Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning. In ICCV, 10560–10572.
  • Piczak (2015) Piczak, K. J. 2015. ESC: Dataset for Environmental Sound Classification. In ACM Multimedia, 1015–1018.
  • Piergiovanni, Angelova, and Ryoo (2020) Piergiovanni, A.; Angelova, A.; and Ryoo, M. S. 2020. Evolving losses for unsupervised video representation learning. In CVPR, 133–142.
  • Qian et al. (2021) Qian, R.; Meng, T.; Gong, B.; Yang, M.-H.; Wang, H.; Belongie, S.; and Cui, Y. 2021. Spatiotemporal contrastive video representation learning. In CVPR, 6964–6974.
  • Recasens et al. (2021) Recasens, A.; Luc, P.; Alayrac, J.-B.; Wang, L.; Strub, F.; Tallec, C.; Malinowski, M.; Pătrăucean, V.; Altché, F.; Valko, M.; et al. 2021. Broaden your views for self-supervised video learning. In ICCV, 1255–1265.
  • Reddi, Kale, and Kumar (2019) Reddi, S. J.; Kale, S.; and Kumar, S. 2019. On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237.
  • Ren et al. (2021) Ren, S.; Du, Y.; Lv, J.; Han, G.; and He, S. 2021. Learning from the master: Distilling cross-modal advanced knowledge for lip reading. In CVPR, 13325–13333.
  • Russakovsky et al. (2015) Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. IJCV, 115: 211–252.
  • Sarkar and Etemad (2023) Sarkar, P.; and Etemad, A. 2023. Self-supervised audio-visual representation learning with relaxed cross-modal synchronicity. In AAAI, volume 37, 9723–9732.
  • Schiappa, Rawat, and Shah (2022) Schiappa, M. C.; Rawat, Y. S.; and Shah, M. 2022. Self-supervised learning for videos: A survey. arXiv preprint arXiv:2207.00419.
  • Soomro, Zamir, and Shah (2012) Soomro, K.; Zamir, A. R.; and Shah, M. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
  • Szegedy et al. (2016) Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In CVPR, 2818–2826.
  • Tarvainen and Valpola (2017) Tarvainen, A.; and Valpola, H. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. NeurIPS, 30.
  • Tong et al. (2022) Tong, Z.; Song, Y.; Wang, J.; and Wang, L. 2022. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. arXiv preprint arXiv:2203.12602.
  • Verma and Berger (2021) Verma, P.; and Berger, J. 2021. Audio transformers: Transformer architectures for large scale audio understanding. adieu convolutions. arXiv preprint arXiv:2105.00335.
  • Wang et al. (2021) Wang, R.; Chen, D.; Wu, Z.; Chen, Y.; Dai, X.; Liu, M.; Jiang, Y.-G.; Zhou, L.; and Yuan, L. 2021. Bevt: Bert pretraining of video transformers. arXiv preprint arXiv:2112.01529.
  • Xiao et al. (2020) Xiao, F.; Lee, Y. J.; Grauman, K.; Malik, J.; and Feichtenhofer, C. 2020. Audiovisual slowfast networks for video recognition. arXiv preprint arXiv:2001.08740.
  • Xiao, Tighe, and Modolo (2022) Xiao, F.; Tighe, J.; and Modolo, D. 2022. MaCLR: Motion-Aware Contrastive Learning of Representations for Videos. In ECCV, 353–370. Springer.
  • Yun et al. (2019) Yun, S.; Han, D.; Oh, S. J.; Chun, S.; Choe, J.; and Yoo, Y. 2019. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 6023–6032.
  • Zhang et al. (2017) Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.

Supplementary Material

The organization of the supplementary material is as follows:

  • A: Implementation Details

    • A.1 Datasets

    • A.2 Augmentation

    • A.3 Pretraining

    • A.4 Modality-agnostic setups

    • A.5 Linear evaluation

    • A.6 Finetuning

  • B: Additional Experiments and Results

    • B.1 Effect of temperature schedules

    • B.2 Effect of different EMA schedules

    • B.3 Effect of local views

    • B.4 Effect of different mask ratio

    • B.5 Avoiding collapse and training instability

    • B.6 Design choices for projector head

    • B.7 Design choices for domain alignment

    • B.8 Design choices for feature refinement

    • B.9 Comparing different variants of XKD

    • B.10 Training schedule

    • B.11 Detailed comparison with VideoMAE

  • C: Qualitative Analysis

    • C.1 Effect of feature refinement

    • C.2 XKD reconstruction results

Appendix A Implementation Details

A.1 Datasets

The details of the datasets used for pretraining and downstream evaluation are presented below.

Kinetics400. Kinetics400 (Kay et al. 2017) is a large-scale audio-visual action recognition dataset consisting of 400 action classes. It has a total of 240K video clips with an average duration of 10 seconds. Following (Sarkar and Etemad 2023; Alwassel et al. 2020; Morgado, Vasconcelos, and Misra 2021; Tong et al. 2022; Feichtenhofer et al. 2022), we use Kinetics400 for both pretraining and downstream evaluation.

AudioSet. AudioSet (Gemmeke et al. 2017) is a very large-scale audio-visual dataset comprising 1.8M videos collected from YouTube. The samples have an average duration of 10 seconds and span 632 audio events. We use AudioSet to conduct experiments on large-scale pretraining. Please note that none of the labels are used in pretraining.

UCF101. UCF101 (Soomro, Zamir, and Shah 2012) is a popular action recognition dataset used for downstream evaluation. It consists of a total of 13K clips with an average duration of 7 seconds, spread over 101 action classes. We use UCF101 for downstream evaluation on video action recognition.

HMDB51. HMDB51 (Kuehne et al. 2011) is one of the earlier video action recognition datasets. It contains 7K video clips distributed over 51 action classes. HMDB51 is used for downstream evaluation on video action recognition.

Kinetics-Sound. Kinetics-Sound (Arandjelovic and Zisserman 2017) is a subset of Kinetics400 with a total of 22K video clips spanning 32 action classes. Following (Sarkar and Etemad 2023), we use Kinetics-Sound for multimodal action classification by late fusion. We particularly choose Kinetics-Sound for multimodal evaluation as it consists of hand-picked samples from Kinetics400 whose actions are prominently manifested both audibly and visually. In addition to downstream evaluation, Kinetics-Sound is also used to conduct experiments on small-scale pretraining.

ESC50. ESC50 (Piczak 2015) is a popular sound classification dataset that consists of a total of 2K samples covering 50 environmental sound classes, with an average duration of 5 seconds. Following (Akbari et al. 2021; Sarkar and Etemad 2023; Morgado, Vasconcelos, and Misra 2021; Recasens et al. 2021), we use ESC50 for downstream evaluation of audio representations.

FSD50K. FSD50K is a popular audio dataset of human-labeled sound events, consisting of 50K audio samples distributed over 200 classes. Moreover, FSD50K is a multi-label dataset, where samples range from 0.3 to 30 seconds in duration. Following (Niizumi et al. 2022a), we use FSD50K for downstream evaluation of audio representations.

A.2 Augmentation

We adopt standard augmentation strategies (Sarkar and Etemad 2023; Chen et al. 2020; Niizumi et al. 2022a) to create the global and local views from the raw audio and visual streams. Specifically, we apply Multi-scale Crop, Color Jitter, and Random Horizontal Flip to create the global visual views. To create the local views, we apply Gray Scale and Gaussian Blur in addition to the augmentations applied to the global views, with minor changes in the augmentation parameters; for example, different from the global visual views, we apply more aggressive cropping to create the local visual views. The details of the visual augmentation parameters are presented in Table S1. Next, Volume Jitter is applied to the audio stream to create the global audio views, and Random Crop is applied in addition to Volume Jitter to create the local audio views. The details of the audio augmentation parameters are presented in Table S2.
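To make the view construction concrete, below is a minimal sketch of the per-frame visual augmentations and the audio volume jitter using the parameters listed in Tables S1 and S2. It assumes torchvision-style transforms applied frame-wise; the exact ordering and the audio implementation are our assumptions rather than the authors' released code.

    import torch
    import torchvision.transforms as T

    def visual_augmentation(view="global"):
        # Table S1: global views use crop scale [0.2, 1], local views [0.08, 0.4]
        crop_scale = (0.2, 1.0) if view == "global" else (0.08, 0.4)
        size = 112 if view == "global" else 96
        ops = [
            T.RandomResizedCrop(size, scale=crop_scale),   # multi-scale crop
            T.RandomHorizontalFlip(p=0.5),
            T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.2),
        ]
        if view == "local":                                # extra ops for local views only
            ops += [T.RandomGrayscale(p=0.2),
                    T.RandomApply([T.GaussianBlur(kernel_size=9)], p=0.5)]
        return T.Compose(ops)

    def volume_jitter(waveform, scale=0.1):
        # Table S2: scale the raw waveform by a random gain in [1 - scale, 1 + scale]
        gain = 1.0 + (2.0 * torch.rand(1) - 1.0) * scale
        return waveform * gain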

Augmentation Param Global View Local View Linear Eval.
Multi-Scale Crop Crop Scale [0.2, 1] [0.08, 0.4] [0.08, 1]
Horizontal Flip Probability 0.5 0.5 0.5
Color Jitter Brightness 0.4 0.4 0.4
Contrast 0.4 0.4 0.4
Saturation 0.4 0.4 0.4
Hue 0.2 0.2 0.2
Gray Scale Probability n/a 0.2 0.2
Gaussian Blur Probability n/a 0.5 n/a
Table S1: Visual augmentation parameters for pretraining and linear evaluation are presented. n/a: not applied.
Augmentation Param Global View Local View Linear Eval.
Volume Jitter Scale ±0.1 ±0.2 ±0.1
Random Crop Range n/a [0.6, 1.5] n/a
Crop Scale n/a [1, 1.5] n/a
Table S2: Audio augmentation parameters for pretraining and linear evaluation are presented. n/a: not applied.

A.3 Pretraining

We train XKD using an AdamW (Loshchilov and Hutter 2017; Reddi, Kale, and Kumar 2019) optimizer and a warm-up cosine learning rate scheduler. During ablation studies, the models are trained for 400 epochs on Kinetics400. We use ViT-B (Dosovitskiy et al. 2020) as the backbone for both audio and visual modalities. Following (He et al. 2021), we use a shallow decoder for reconstruction and adopt a projector head similar to the one proposed in (Caron et al. 2021). Please see the additional architecture details in Table S3. We perform distributed training with a batch size of 256 using 8 NVIDIA V100 32 GB GPUs in parallel. As mentioned in Table S3, we use visual frames with resolutions of 96×96 and 112×112; such lower resolutions allow us to train the proposed XKD with a limited computation setup. Additionally, we perform mixed-precision training (Micikevicius et al. 2018) to reduce the computation overhead. In our setup, it takes approximately 10 days to train XKD for 800 epochs on Kinetics400. We present additional training parameters in Table S4. Please note that the parameters are tuned on Kinetics400 and the same parameters are used for both AudioSet and Kinetics-Sound; due to resource constraints, we do not further tune the parameters for individual datasets. As AudioSet is very large, we pretrain XKD on it for 400 epochs, whereas the models are pretrained for 800 epochs on Kinetics400 and Kinetics-Sound.
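As a reference for the optimization setup, the following is a minimal sketch of the AdamW optimizer and the warm-up cosine learning-rate schedule with the values from Table S4 (base LR 0.0001, 30 warm-up epochs, weight decay 0.3); computing the schedule per epoch rather than per step is an assumption.

    import math
    import torch

    def build_optimizer(model):
        # AdamW with betas [0.9, 0.95] and weight decay 0.3 (Table S4)
        return torch.optim.AdamW(model.parameters(), lr=1e-4,
                                 betas=(0.9, 0.95), weight_decay=0.3)

    def lr_at_epoch(epoch, base_lr=1e-4, warmup=30, total=800, final_lr=0.0):
        if epoch < warmup:
            return base_lr * epoch / warmup          # linear warm-up from 0 to the base LR
        progress = (epoch - warmup) / max(1, total - warmup)
        return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))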

Name Value
Video (global) 32×112×112
Video (local) 8×96×96
Video Patch 4×16×16
Audio (global) 80×448
Audio (local) 80×112
Audio Patch 4×16
Backbone ViT-Base (87M)
embed dim:768
depth:12
num heads:12
Decoder Transformer
embed dim:384
depth:4
num heads:12
Projector MLP
out dim: 8192
bottleneck dim: 256
hidden dim: 2048
num layers: 3
Table S3: Architecture details of XKD.
Name Value
Epochs 800
Batch Size 256
Optimizer AdamW
Optimizer Betas [0.9, 0.95]
LR Scheduler Cosine
LR Warm-up Epochs 30
LR Warm-up 0
LR Base 0.0001
LR Final 0
Weight Decay 0.3
EMA Scheduler Cosine
EMA Base 0.997
EMA Final 1
Video Mask Ratio 0.85
Audio Mask Ratio 0.80
Video Student Temp. 0.1
Audio Student Temp. 0.1
Video Teacher Temp. 0.1
Audio Teacher Temp. 0.07
Center Momentum 0.9
Table S4: XKD pretraining setups for Kinetics400.

A.4 Modality-agnostic setups

In Table S5, we summarize the detailed network setups of the modality-agnostic variants. The student encoders are shared in XKD-MAS, while both the teachers' and students' backbones are shared in XKD-MATS. Please note that the input projection layers and the decoders are not shared in any of the setups.

Variant Input Projection Teacher Student Decoder
XKD-MAS NS NS S NS
XKD-MATS NS S S NS
XKD NS NS NS NS
Table S5: Summary of parameter sharing in modality-agnostic setups. Here, S and NS refer to shared and not-shared parameters between audio and visual modalities respectively.

A.5 Linear evaluation

UCF101 and HMDB51. To perform linear evaluation, we feed 32 frames with a spatial resolution of 112×112 to the pretrained visual encoder. We extract the fixed features as the mean of all the patches from the last layer of the encoder. We randomly select 25 clips per sample during training, while during testing, we uniformly fetch 3 clips per sample to extract frozen features. Next, the features are used for action classification with a linear SVM. We sweep a range of cost values {0.00001, 0.0001, 0.001, 0.01, 0.1, 1} and report the best top-1 accuracy. We present the augmentation parameters applied to the training set in Table S1. Please note that no augmentations are applied to the test set. We tune the models using split 1 of the datasets and report the top-1 accuracies averaged over all the splits.
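For clarity, a minimal sketch of this SVM-based evaluation is given below; it simply sweeps the listed cost values over pre-extracted frozen features and keeps the best top-1 accuracy, whereas the full protocol tunes the cost on split 1 and averages over all splits.

    import numpy as np
    from sklearn.svm import LinearSVC

    def linear_svm_eval(train_x, train_y, test_x, test_y):
        # train_x/test_x: frozen features (mean of last-layer patch tokens)
        best_acc = 0.0
        for c in [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0]:  # cost values swept above
            clf = LinearSVC(C=c)
            clf.fit(train_x, train_y)
            acc = float(np.mean(clf.predict(test_x) == test_y))
            best_acc = max(best_acc, acc)
        return best_acc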

ESC50. We feed 4 seconds of audio input to the pretrained audio encoder and extract approximately 10 epochs' worth of fixed features. Similar to UCF101 and HMDB51, we extract the features as the mean of all the patches from the last layer of the encoder. During training, we apply Volume Jitter as augmentation, and no augmentation is applied at test time. Similar to the setup of UCF101 and HMDB51, we sweep the same range of cost values to find the best model. We report the final top-1 accuracy averaged over all the splits.

FSD50K. To perform the downstream evaluation on FSD50K, we follow the same setup as ESC50, with the exception of using a linear fully-connected layer instead of an SVM. During training, we randomly extract 10 clips per sample and feed them to the pretrained audio encoder. During testing, we uniformly select 8 clips per sample and report the mean average precision (mAP) (Fonseca et al. 2022). The extracted features are used to train a fully-connected layer using an AdamW (Loshchilov and Hutter 2017; Reddi, Kale, and Kumar 2019) optimizer for 50 epochs at a fixed learning rate of 0.001. Moreover, we apply a weight decay of 1e-5 and a dropout of 0.3 to prevent overfitting. We use a batch size of 256 and train on a single NVIDIA RTX6000 24 GB GPU.

Kinetics-Sound and Kinetics400. We mainly use Kinetics-Sound for multimodal evaluation using late fusion. To perform late fusion, we simply concatenate the corresponding audio and visual features, and the joint representations are then directly used to perform action classification using a linear kernel SVM. Finally, we report the top-1 accuracy.

We follow similar setups to those used for UCF101 and HMDB51 to perform linear evaluation on Kinetics400, with the exception of randomly sampling 3 clips per video during both training and testing to extract the features (we do not sample more clips during training due to memory constraints). Lastly, we report top-1 accuracy in action classification. In addition to the SVM (used in the ablation study and analysis), we also perform FC-tuning (used to compare with the prior works in Table 7) on Kinetics400, considering that Kinetics400 is sufficiently large compared to UCF101 and HMDB51. Similar to the setup mentioned earlier, we extract fixed features, after which a linear fully-connected layer is trained to perform action recognition. We use an AdamW (Loshchilov and Hutter 2017; Reddi, Kale, and Kumar 2019) optimizer with a cosine scheduler, a base learning rate of 0.01, and a batch size of 512, and train for 20 epochs.

A.6 Finetuning

Kinetics400, UCF101, and HMDB51. We use the pretrained visual encoder and add a fully-connected layer to finetune on Kinetics400, UCF101, and HMDB51. Different from our pretraining setup, we use a spatial resolution of 224×224 for finetuning. During training, we apply Random Augmentation (Cubuk et al. 2020) in addition to Multi-scale Crop and Random Horizontal Flip on the visual data, similar to (Tong et al. 2022; Feichtenhofer et al. 2022; He et al. 2021). We use an AdamW (Loshchilov and Hutter 2017; Reddi, Kale, and Kumar 2019) optimizer to train the network for 100 epochs using a cosine learning rate scheduler. In addition to the cosine learning rate scheduler, we employ layer-wise learning rate decay (Bao, Dong, and Wei 2021), which further stabilizes the training. We perform distributed training with a batch size of 96 using 16 NVIDIA V100 32 GB GPUs in parallel. Additionally, we employ MixUp (Zhang et al. 2017), CutMix (Yun et al. 2019), and label smoothing (Szegedy et al. 2016), which boost the finetuning performance. We present the details of the finetuning parameters in Table S6. During training, we randomly sample 1 clip per sample, while during testing, 3 clips are uniformly selected per sample. We report the sample-level prediction (top-1 and top-5 accuracy) averaged over all the clips.
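A minimal sketch of the layer-wise learning-rate decay used during finetuning is given below (decay factors 0.45–0.55, Table S6); the 'blocks.<i>.' naming convention for the ViT layers is an assumption about the backbone implementation, not the authors' exact code.

    import torch

    def layerwise_param_groups(model, base_lr, decay, num_layers=12):
        # assign each parameter a learning rate scaled by decay**(num_layers - layer_id)
        groups = []
        for name, param in model.named_parameters():
            if not param.requires_grad:
                continue
            layer_id = num_layers              # head / non-block parameters keep the base LR
            for i in range(num_layers):
                if f"blocks.{i}." in name:
                    layer_id = i
                    break
            groups.append({"params": [param], "lr": base_lr * decay ** (num_layers - layer_id)})
        return groups

    # e.g., torch.optim.AdamW(layerwise_param_groups(vit, base_lr=5e-4, decay=0.45))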

Name Kinetics400 UCF101 HMDB51
Epochs 100
Batch Size 96
Optimizer AdamW
Optimizer Betas [0.9, 0.999]
LR Scheduler Cosine
LR Warm-up Epochs 30 5 20
LR Warm-up 0
LR Base 0.0005 0.0005 0.0001
LR Final 1.0e-06
Weight Decay 0.05
RandAug (9, .05)
Label Smoothing 0.1
MixUp 0.6 0.6 1.0
CutMix 0.5 0.5 1.0
Drop Path 0
Dropout 0.5
Layer-wise LR Decay 0.45 0.45 0.55
Table S6: Video finetuning parameters.

ESC50 and FSD50K. We use the pretrained audio encoder and add a linear layer on top of it to finetune on ESC50 and FSD50K. During training, we randomly sample 1 audio segment of 4 seconds for both datasets. During validation, however, we uniformly sample 3 and 8 clips per sample for ESC50 and FSD50K, respectively. The number of clips per sample during testing is chosen based on the average length of the samples to ensure the full length of the audio signal is captured. Following (Sarkar and Etemad 2023), we apply strong augmentations including Volume Jitter, Time-frequency Mask, Timewarp, and Random Crop. We use an AdamW optimizer with a batch size of 64 to train the network for 100 epochs. We use the standard categorical cross-entropy loss for ESC50 and the binary cross-entropy loss for FSD50K, as FSD50K is a multi-label multi-class classification dataset. We report top-1 accuracy for ESC50 and mAP for FSD50K. The networks are trained on 4 NVIDIA V100 32 GB GPUs. We list the additional hyperparameters in Table S7.

Name ESC50 FSD50K
Epochs 100 30
Batch Size 64
Optimizer AdamW
Optimizer Betas [0.9, 0.999]
LR Scheduler cosine fixed
LR Warm-up Epochs 10 -
LR Base 0.0001 0.0001
LR Final 0 -
Weight Decay 0.005 1.0e-05
Early stop no yes
Volume Jitter 0.2
Time-mask [0, 20]
Frequency-mask [0, 10]
Num of masks 2
Timewarp window 20
Drop Path 0
Dropout 0.5 0.3
Layer-wise LR Decay 0.65 0.65
Table S7: Audio finetuning parameters.

Appendix B Additional Experiments and Results

In the following, we present an in-depth study analysing the key components of our proposed framework. Our analysis includes the effect of temperature and EMA schedules; the impact of local views and masking ratios; design choices for domain alignment, feature refinement, and the projector head; and the performance of different XKD variants; among others. The models are pretrained for 400 epochs on Kinetics400, and we report linear evaluation top-1 accuracy using split 1 of UCF101, HMDB51, and ESC50 unless stated otherwise.

Audio 0.04 0.06 0.07 0.08 0.07 0.07 0.07 0.07 [0.04, 0.06] [0.04, 0.06]
Video 0.04 0.06 0.07 0.08 0.09 0.10 0.11 0.12 [0.04, 0.06] [0.09, 0.11]
HMDB51 55.3 57.5 57.5 55.6 54.4 55.9 54.5 53.8 58.5 52.7
UCF101 80.0 81.0 82.0 81.6 81.0 81.9 79.5 78.5 82.6 79.0
ESC50 85.8 85.3 84.3 85.8 87.0 89.0 89.0 87.3 88.0 86.5
Table S8: Effect of temperature schedules. We highlight the best with bold and the second best with underline. We find that a temperature schedule of [0.04, 0.06] (increasing from 0.04 to 0.06 using a cosine scheduler) achieves overall stable performance. Interestingly, excessively increasing the video temperature to 0.10 or 0.11 (i.e., sharpening the video distribution) improves the performance of audio downstream tasks; however, it hurts the performance of visual tasks. On the other hand, a fixed temperature of 0.06 or 0.07 shows stable performance on video tasks, but it hurts audio performance. By default, we use [0.04, 0.06] in all the experiments.

B.1 Effect of temperature schedules

We study the effect of a wide range of temperature settings (τ in Equation 10) in our proposed framework. The results presented in Table S8 show that a relatively low temperature works well for the audio teacher (0.04 to 0.07), whereas a higher temperature (0.1 to 0.11) is required for the video teacher. Please note that the students' temperatures are kept fixed at 0.1. To find the best performance, we sweep a range of combinations and notice that for both the audio and video teachers, increasing the temperature from 0.04 to 0.06 works fairly well for both modalities. Additionally, we notice that setting fixed temperatures of 0.07 and 0.1 for the audio and video teachers, respectively, works slightly better on ESC50; however, a performance drop is noticed for video, specifically on HMDB51. We conjecture that the video stream, which carries more information and thus has a wider distribution (see Figure 2 in the main paper), requires a higher temperature in order to further sharpen the distribution. Based on the results in Table S8, we use a scheduler to increase the temperature from 0.04 to 0.06 for both audio and video teachers throughout our experiments, unless mentioned otherwise.

Refer to caption
Figure S1: Effect of EMA schedules. An EMA coefficient of 0.997 achieves the most stable performance in both audio (ESC50) and visual (HMDB51 and UCF101) tasks.

B.2 Effect of different EMA schedules

We update the weights of the teachers using EMA as described in Equation 13. In particular, we set the EMA coefficients using a cosine scheduler with a final value of 1. This means that we update the teachers more rapidly at the beginning of training and slow down their weight updates towards the end of training. This strategy yields a very smooth update of the networks' parameters and leads to stable performance (Tarvainen and Valpola 2017). To find the optimal setup, we vary the EMA coefficient over a wide range between 0.98 and 0.999. The results presented in Figure S1 show that an EMA coefficient of 0.997 works well on both audio and video-related tasks.
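The sketch below illustrates this EMA update with a cosine momentum schedule that ramps the coefficient from its base value (e.g., 0.997) to 1 over training; applying the update per training step is our assumption.

    import math
    import torch

    def ema_momentum(step, total_steps, base=0.997, final=1.0):
        # cosine ramp from `base` at step 0 to `final` at the end of training
        progress = step / max(1, total_steps)
        return final - 0.5 * (final - base) * (1.0 + math.cos(math.pi * progress))

    @torch.no_grad()
    def update_teacher(teacher, student, m):
        # theta_teacher <- m * theta_teacher + (1 - m) * theta_student
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(m).add_(ps, alpha=1.0 - m)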

B.3 Effect of local views

Temporal window size. We study the effect of the size of the temporal windows used to create the local views for training the student networks. For the video stream, we create the local views by taking spatial crops of 96×96 pixels and vary the temporal windows in the range of {1, 2, 4} seconds. Similarly, for the audio stream, we explore temporal windows in the range of {1, 2} seconds. Our results presented in Table S9 reveal some interesting findings. First, local views comprised of audio and visual sequences of just 1 second work the best on all three benchmarks. Interestingly, we notice that the downstream audio classification performance drops significantly when XKD uses long visual local views (see column 4-1 in Table S9), while the downstream video classification performance remains more or less unchanged. On the other hand, we notice considerable performance drops on downstream video classification when using large audio segments to create the local views (see columns 2-2 and 1-2 in Table S9). These observations indicate that local views comprised of relatively shorter temporal windows are most effective when learning from the cross-modal teachers, which in turn results in improved downstream classification performance.

Vid-Aud (sec.) 1-1 2-2 2-1 1-2 4-1
HMDB51 58.5 55.2 57.1 55.6 58.4
UCF101 82.6 81.1 82.3 81.0 82.6
ESC50 88.0 86.8 87.3 86.8 82.3
Table S9: Effect of different temporal window sizes for local views. Here bold shows the best and underline shows the worst results. Our results indicate that local views comprised of relatively shorter temporal windows are most effective to learn from the cross-modal teachers.

Number of local views. We further investigate the effect of varying the number of local views. This experiment shows that single local views for both audio and visual modalities work fairly well. We also find that adding more visual local views (e.g., 2 or 3) when using a single audio local view improves the performance on HMDB51 and UCF101, as shown in columns 2-1 and 3-1 in Table S10. Additionally, we notice that adding more audio local views worsens the model's performance in video classification (see columns 1-2 and 2-2 in Table S10). Lastly, we notice considerable improvements in sound classification when using 2 local views for both audio and video.

Vid-Aud (no.) 1-1 1-2 2-1 2-2 3-1
HMDB51 58.5 56.3 58.8 56.1 57.5
UCF101 82.6 79.5 82.2 79.0 82.9
ESC50 88.0 85.8 87.0 90.8 88.8
Table S10: Effect of additional local views. We highlight the best results in bold and the worst with underline. We find that XKD learns very strong representations just by using a single local view. Additionally, we encounter a few setups which show superior results on different downstream benchmarks. However, we do not see one single variant that performs the best in all setups.

B.4 Effect of different mask ratio

We study the effect of different masking ratios in our proposed framework. Specifically, we test different combinations of audio and video masking ratios to examine whether there is an interdependency that results in optimal performance for both modalities. We sweep a wide range of mask ratios, namely {0.80, 0.85, 0.90} for video masking and {0.75, 0.80, 0.85} for audio masking. The results presented in Figure S2 show that a video mask ratio of 85% and an audio mask ratio of 80% work well on both UCF101 and HMDB51. However, we notice that a slightly lower video mask ratio (e.g., 80%) improves audio performance. We conjecture that this setup benefits from the availability of more visual patches, which provide better supervision through cross-modal knowledge distillation. Please note that unless stated otherwise, we use a visual mask ratio of 85% and an audio mask ratio of 80%.
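Below is a minimal sketch of per-sample random token masking at these ratios (85% of video tokens, 80% of audio tokens), following the common MAE-style shuffle-and-keep scheme; this is an assumed implementation rather than the authors' exact code.

    import torch

    def random_masking(tokens, mask_ratio):
        # tokens: (batch, num_tokens, dim); returns visible tokens and a binary mask
        b, n, d = tokens.shape
        num_keep = int(n * (1.0 - mask_ratio))
        noise = torch.rand(b, n, device=tokens.device)
        ids_keep = noise.argsort(dim=1)[:, :num_keep]       # random subset per sample
        visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
        mask = torch.ones(b, n, device=tokens.device)
        mask.scatter_(1, ids_keep, 0.0)                      # 0 = visible, 1 = masked
        return visible, mask

    # visible_video, video_mask = random_masking(video_tokens, mask_ratio=0.85)
    # visible_audio, audio_mask = random_masking(audio_tokens, mask_ratio=0.80)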

Refer to caption
Figure S2: Effect of different masking ratios. We find that a video mask ratio of 85% works best on both UCF101 and HMDB51. Additionally, an audio mask ratio of 80% achieves superior results on ESC50.
Refer to caption
Figure S3: Visualizing pretraining instability and collapse in cross-modal knowledge distillation.

B.5 Avoiding collapse and training instability

Here we summarize the setups that cause collapse and instability in cross-modal knowledge distillation. We identify three such instances: (i) no normalization layer in the projectors, (ii) no \mathcal{L}_{\mathrm{da}}, and (iii) no normalization of the teacher's representations based on the current batch statistics, i.e., Centering (Caron et al. 2021). To identify collapse, we mainly track (but do not optimize) the Kullback–Leibler divergence between teachers and students (\mathcal{L}_{\mathrm{kld}}) and the knowledge distillation loss (\mathcal{L}_{\mathrm{kd}}), as shown in Figure S3. A collapse can be identified when either \mathcal{L}_{\mathrm{kd}} or \mathcal{L}_{\mathrm{kld}} reaches zero. First, when no normalization layer is added in the projector head, \mathcal{L}_{\mathrm{kld}} becomes zero, which indicates that the models output a constant vector. Second, without \mathcal{L}_{\mathrm{da}}, \mathcal{L}_{\mathrm{kld}} also reaches zero (with minor spikes), which indicates training collapse. Third, we notice that without Centering, \mathcal{L}_{\mathrm{kld}} does not become zero, while \mathcal{L}_{\mathrm{kd}} reaches zero, which also indicates training collapse. In addition to the ablation studies presented in Tables 4.2 and 2 (in the main paper), we attempt to train XKD without the masked reconstruction loss (i.e., setting \lambda_{ae} to 0 in Equation 12) and, quite expectedly, we encounter training collapse. This is because, in order to perform effective cross-modal knowledge distillation, the model first needs to learn meaningful modality-specific representations. To provide more insight into the training process, we present the default training curves in Figure S4.
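As a reference for the third point, the following is a minimal sketch of the centering operation applied to the teacher outputs (center momentum 0.9, Table S4), following our reading of the formulation in Caron et al. (2021); its exact placement in the XKD pipeline is an assumption.

    import torch

    class TeacherCentering(torch.nn.Module):
        def __init__(self, dim, momentum=0.9):
            super().__init__()
            self.momentum = momentum
            self.register_buffer("center", torch.zeros(1, dim))

        @torch.no_grad()
        def update(self, teacher_logits):
            # running mean of the teacher outputs over batches
            batch_center = teacher_logits.mean(dim=0, keepdim=True)
            self.center.mul_(self.momentum).add_(batch_center, alpha=1.0 - self.momentum)

        def forward(self, teacher_logits, temperature):
            # centred and sharpened teacher distribution used as the distillation target
            return torch.softmax((teacher_logits - self.center) / temperature, dim=-1)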

Refer to caption
Figure S4: Stable training curves are presented.

B.6 Design choices for projector head

The configuration of the projector heads plays an important role in the performance of our method. We experiment with a variety of different setups including normalization, output dimension, and the number of layers.

Effect of normalization. We investigate three setups: (a) no normalization, (b) normalization on the last layer only, and (c) normalization on every layer. Our study shows that cross-modal knowledge distillation fails in the absence of normalization layers (please see the additional discussion in B.5). We further observe that adding normalization on the last layer prevents the model from collapsing, while adding normalization on every layer yields significant improvements (2.9% to 4.5%), as presented in Figure S5.
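A minimal sketch of the projector head with the best-performing configuration (3 layers, hidden dimension 2048, bottleneck 256, output 8192; Table S3) and a normalization layer after every linear layer is shown below; using LayerNorm as the specific normalization is an assumption.

    import torch.nn as nn

    def projector_head(in_dim=768, hidden=2048, bottleneck=256, out_dim=8192):
        # normalization applied to every layer, the setup found to avoid collapse here
        return nn.Sequential(
            nn.Linear(in_dim, hidden), nn.LayerNorm(hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.GELU(),
            nn.Linear(hidden, bottleneck), nn.LayerNorm(bottleneck),
            nn.Linear(bottleneck, out_dim),
        )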

Refer to caption
Figure S5: Effect of normalization layers in the projector head. We find that applying normalization to all the layers of the projector head improves the performance across all the downstream tasks.
No. of layers 2 3 4
HMDB51 55.4 58.5 55.7
UCF101 81.0 82.6 81.0
ESC50 83.4 88.0 84.6
Table S11: Effect of different numbers of projector layers. We notice that an MLP projector head with 3 layers achieves the best performance.
Output dim. 4096 8192 16384 32768
HMDB51 53.2 58.5 57.0 54.4
UCF101 78.1 82.6 81.5 81.7
ESC50 87.5 88.0 84.1 83.8
Table S12: Effect of different projector output dimensions. We notice that an output dimension of 8192 achieves the best performance. Please note that in all setups the hidden layers have a fixed dimension of 2048.

Effect of the number of layers and output dimension. We further explore different setups for the projector head. First, we present the effect of varying the number of layers and output dimensions in Tables S11 and S12. The results show that a projector head consisting of 3 layers performs most steadily on all the datasets. We also find that an output dimension of 8192 shows stable performance on all the downstream benchmarks.

\mathcal{L}_{\mathrm{da}}^{1} \mathcal{L}_{\mathrm{da}}^{2} \mathcal{L}_{\mathrm{da}} (default)
HMDB51 56.1 55.4 58.5
UCF101 81.1 82.2 82.6
ESC50 87.3 87.5 88.0
Table S13: Exploring the design choices for domain alignment. We find that minimizing the MMD loss between the two teachers and between the two students, rather than across teacher-student pairs, works best for effective cross-modal knowledge distillation.

B.7 Design choices for domain alignment

We conduct a thorough study exploring the optimal setup for domain alignment (\mathcal{L}_{\mathrm{da}}). Recall that \mathcal{L}_{\mathrm{da}} is defined as:

\mathcal{L}_{\mathrm{da}}=\mathcal{L}_{\mathrm{mmd}}(f_{s}^{v},f_{s}^{a})+\mathcal{L}_{\mathrm{mmd}}(f_{t}^{v},f_{t}^{a})

We explore the following setups. First, instead of minimizing the MMD loss between teachers (f_{t}^{a},f_{t}^{v}) and between students (f_{s}^{a},f_{s}^{v}) as done in \mathcal{L}_{\mathrm{da}}, we minimize the distance between the teachers and students as:

\mathcal{L}_{\mathrm{da}}^{1}=\mathcal{L}_{\mathrm{mmd}}(f_{t}^{a},f_{s}^{v})+\mathcal{L}_{\mathrm{mmd}}(f_{t}^{v},f_{s}^{a}). (S1)

Next, we study the effect of optimizing \mathcal{L}_{\mathrm{da}}^{1} in addition to \mathcal{L}_{\mathrm{da}}, expressed as:

\mathcal{L}_{\mathrm{da}}^{2}=\mathcal{L}_{\mathrm{da}}+\mathcal{L}_{\mathrm{da}}^{1}. (S2)

The results presented in Table S13 indicate drops in performance when using \mathcal{L}_{\mathrm{da}}^{1} and \mathcal{L}_{\mathrm{da}}^{2}. In other words, we find that minimizing the MMD loss only between representations of a similar nature is more effective for domain alignment and for minimizing domain discrepancy. To further clarify, the teachers are updated more slowly than the students, and the teachers provide global representations while the students focus on local representations. We use \mathcal{L}_{\mathrm{da}} by default in all our experiments.
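For reference, a minimal sketch of this domain-alignment objective is given below, using a Gaussian-kernel MMD (cf. Gretton et al. 2006) between the two teachers' features and between the two students' features; the kernel choice and bandwidth are assumptions, as the exact kernel is not restated here.

    import torch

    def gaussian_mmd(x, y, sigma=1.0):
        # x, y: (batch, dim) features from the two modalities; biased MMD estimate
        def kernel(a, b):
            return torch.exp(-torch.cdist(a, b) ** 2 / (2.0 * sigma ** 2))
        return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

    def domain_alignment_loss(f_t_v, f_t_a, f_s_v, f_s_a):
        # L_da = MMD(student_v, student_a) + MMD(teacher_v, teacher_a)
        return gaussian_mmd(f_s_v, f_s_a) + gaussian_mmd(f_t_v, f_t_a)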

B.8 Design choices for feature refinement

Here, we perform additional studies to evaluate our design choices for the refine step.

Cross-modal attention map. To calculate cross-modal feature relevance, we use the cross-modal attention maps as defined in Equation 4. A potential alternative to our default Equation 4 is Equation S3, in which the scaling operation of Equation 4 is simply replaced by a softmax function.

A_{v}^{\prime\times}=\mathrm{softmax}(\mathrm{MeanPool}(A_{v}\cdot A_{a}^{T}));\quad A_{a}^{\prime\times}=\mathrm{softmax}(\mathrm{MeanPool}(A_{a}\cdot A_{v}^{T})) (S3)

We notice that this technique works slightly better on audio representations in comparison to our default setup; however, a considerable performance drop is noticed on visual representations. The results are presented in Table S14. Please note that unless mentioned otherwise, we use the default Equation 4 in all setups.
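For concreteness, a minimal sketch of the softmax variant of Equation S3 is shown below; treating A_v and A_a as per-sample attention vectors over the video and audio tokens is an assumption about their shapes, not the authors' exact formulation.

    import torch

    def cross_modal_attention_softmax(a_v, a_a):
        # a_v: (batch, n_video_tokens), a_a: (batch, n_audio_tokens)
        outer = torch.einsum("bi,bj->bij", a_v, a_a)            # per-sample A_v · A_a^T
        av_cross = torch.softmax(outer.mean(dim=2), dim=-1)     # MeanPool over audio, softmax over video
        aa_cross = torch.softmax(outer.mean(dim=1), dim=-1)     # MeanPool over video, softmax over audio
        return av_cross, aa_cross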

A^{\prime\times} w/o CLS default
HMDB51 54.6 54.5 58.5
UCF101 79.8 79.3 82.6
ESC50 88.3 84.5 88.0
Table S14: Exploring design choices for feature refinement. We find that an alternative strategy to obtain the cross-modal attention maps (A^{\prime\times}) may work marginally better on audio, but it significantly degrades the video performance. Additionally, we notice that the models benefit from the availability of the CLS token in both audio and visual tasks. These results confirm the superiority of our proposed design choices for effective cross-modal knowledge distillation.

CLS token. As mentioned in Equation 3, by default we use the CLS tokens to obtain A_{v} and A_{a}, which are then used to generate A_{v}^{\prime\times} and A_{a}^{\prime\times}. Here, we explicitly study whether CLS tokens are necessary in our framework. To test this, we modify Equation 3 and calculate the intra-modal attention map as the correlation between the mean of the query embeddings and the key embeddings of all other patches. Please note that the rest of the setup remains the same. Table S14 shows that removing the CLS token significantly degrades the model performance on both audio and visual representations.

B.9 Comparing different variants of XKD

We conduct a thorough comparison between XKD and its different variants. In Table S15, we present a detailed comparison between the modality-specific and modality-agnostic variants using both teacher and student encoders. As the teachers and students may converge at different rates due to their different weight update schedules, these variants are trained for the full training schedule of 800 epochs. We report the top-1 linear evaluation accuracy using split 1 of UCF101, HMDB51, and ESC50. As shown in Table S15, the modality-agnostic variants show performance drops (e.g., 3.2% on UCF101 and 0.7% on ESC50) in comparison to the modality-specific one. Moreover, as presented in the main paper, the performance gap between the modality-agnostic and modality-specific variants further reduces when finetuned (e.g., 0.7% on UCF101, 1.7% on Kinetics400, and 0.3% on FSD50K). Amongst the modality-agnostic backbones, XKD-MATS works slightly better on ESC50, whereas XKD-MAS shows better performance on both UCF101 and HMDB51. Lastly, when comparing the performance between the teachers and students, the teachers show slightly better performance on all the benchmarks and in all the setups. Following (Caron et al. 2021), we adopt the 'teacher-student' terminology simply to refer to the stronger network as the teacher and the slightly weaker one as the student.

XKD-MATS XKD-MAS XKD
T (MA) S (MA) T (MS) S (MA) T (MS) S (MS)
HMDB51 53.5 52.4 56.4 55.2 (↓4.1) 59.3 57.8
UCF101 80.3 80.0 81.9 81.5 (↓3.2) 84.7 84.4
ESC50 90.3 (↓0.7) 89.8 91.0 89.5 91.0 90.3
Table S15: Comparison between teachers (T) and students (S) for both MA and MS variants. We highlight the performance drop in the best MA variants with respect to the best MS variant. Moreover, teachers always perform better with respect to the students.

B.10 Training schedule

In Figure S6, we present the effect of longer pretraining schedules on video action classification for both the teacher and student networks. We pretrain XKD for up to 800 epochs and find that pretraining for around 600 epochs shows improvement, while beyond that point, a slight performance drop is noticed on HMDB51 and no change is noticed on UCF101. Moreover, we find that the teacher always outperforms the student in downstream evaluation.

Refer to caption
Figure S6: Effect of longer pretraining schedules on downstream task performance. We notice that the teacher always outperforms the student throughout training. Additionally, little or no improvement is noticed beyond 600 epochs.
Method Backbone Input Patch Size # Tokens UCF101 HMDB51 Kinetics-Sound Kinetics400
VideoMAE∗∗ (224) (Tong et al. 2022) ViT-B 16×224×224 2×16×16 1568 80.0 54.3 - -
VideoMAE (112) ViT-B 16×112×112 2×16×16 392 77.1 51.6 - -
VideoMAE ViT-B 32×112×112 4×16×16 392 78.9 51.5 56.8 30.7/38.4
XKD ViT-B 32×112×112 4×16×16 392 83.8 57.4 70.7 46.4/51.4
Table S16: Comparison between VideoMAE and XKD in linear evaluation. We report top-1 accuracy, averaged over all the splits for UCF101 and HMDB51. As Kinetics400 is sufficiently large, we perform linear evaluation in both setups, SVM and FC-tuning. XKD outperforms VideoMAE by 4.9% on UCF101, 5.9% on HMDB51, 13.9% on Kinetics-Sound, and 15.7% on Kinetics400. VideoMAE∗∗ (224) uses 1568 visual tokens, whereas XKD uses 392 tokens, reducing the pretraining computational cost for the video encoder by 4 times. Note: the plain 'VideoMAE' row uses an identical setup to XKD; ∗∗ industry-level computation; Kinetics400 results are reported as SVM/FC accuracy.

B.11 Detailed comparison with VideoMAE

In addition to the results presented in the main paper, here we perform a thorough comparison of our method against video masked reconstruction, i.e., VideoMAE (Tong et al. 2022), on video action recognition, and present the results in Table S16. In this experiment, we strictly follow the standard linear evaluation protocol (Morgado, Vasconcelos, and Misra 2021; Alwassel et al. 2020; Sarkar and Etemad 2023; Xiao et al. 2020), since it is a more appropriate method for evaluating self-supervised representation learning, as pointed out in (Xiao et al. 2020; Qian et al. 2021). Our comparisons are three-fold.

First, we compare XKD to VideoMAE (Tong et al. 2022) and notice that XKD outperforms VideoMAE (we perform linear evaluations using the official checkpoints released at https://github.com/MCG-NJU/VideoMAE/tree/main) by 3.8% and 3.1% on UCF101 and HMDB51, respectively. These improvements clearly show the benefits of cross-modal knowledge distillation in learning better and more discriminative representations. However, it should be noted that VideoMAE (Tong et al. 2022) is pretrained using industry-level computation (64 NVIDIA V100 32 GB GPUs) with a higher frame resolution of 224×224, whereas we use a lower resolution of 112×112 (using just 8 V100 GPUs). We use a frame resolution of 112×112 to reduce the computation cost, i.e., while VideoMAE (Tong et al. 2022) is pretrained with 1568 visual tokens, XKD uses 392 tokens, reducing the computation cost by almost 4×.

Therefore, to have a fair comparison between the two methods, we pretrain VideoMAE with a frame resolution of 112×112 using the publicly available VideoMAE (Tong et al. 2022) code. We refer to this variant as 'VideoMAE (112)' in Table S16. XKD outperforms VideoMAE (112) by 6.7% and 5.8% on UCF101 and HMDB51, respectively. Additionally, we find another minor difference in the input setup between XKD and VideoMAE: VideoMAE uses 16 frames as input with a temporal patch size of 2, whereas we use 32 frames and a temporal patch size of 4. Therefore, for an even fairer comparison, we pretrain VideoMAE in an identical setup to ours, i.e., with visual inputs of 32×112×112 and a patch size of 4×16, and list it as the plain 'VideoMAE' row in Table S16. As presented in Table S16, XKD outperforms this VideoMAE variant by 4.9% and 5.9% on UCF101 and HMDB51, respectively. Lastly, XKD outperforms VideoMAE by very large margins of 13.9 and 15.7 on Kinetics-Sound and Kinetics400, respectively. Such large performance improvements demonstrate that XKD learns better features compared to video masked reconstruction frameworks.

Appendix C Qualitative Analysis

C.1 Effect of feature refinement

In Figure S7, we present several examples showing the impact of our proposed refine strategy in identifying the key visual features.

C.2 XKD reconstruction results

In Figures S8, S9 and S10, we present several examples showing the reconstruction capability of XKD with respect to varying masked inputs. Additional examples of reconstruction of video frames and audio spectrograms from highly masked inputs are presented in Figures S11 and S12. XKD demonstrates high reconstruction performance even when a high mask ratio is applied.

Figure S7: Visualizing the effect of refine in identifying the key visual features. Each example (Kinetics400 classes: abseiling, arm wrestling, bending metal, doing nails, driving car, eating ice cream, shuffling cards, side kick, passing american football, playing harp, playing organ, playing poker, robot dancing, shining shoes, skiing, and spray painting) compares the raw frames, the result without refinement, and the result with refinement. (Best viewed in color.)
Figure S8: Examples of reconstruction with varying masked inputs, showing the original, masked, and reconstructed video frames and audio spectrograms. We use the pretrained XKD; during inference, the video mask ratio is varied from 60% to 90% and the audio mask ratio from 50% to 80%. XKD demonstrates high reconstruction performance even when a very high mask ratio is applied for both audio and visual modalities.
Figure S9: Additional examples of reconstruction with varying masked inputs, showing the original, masked, and reconstructed video frames and audio spectrograms. We use the pretrained XKD; during inference, the video mask ratio is varied from 60% to 90% and the audio mask ratio from 50% to 80%. XKD demonstrates high reconstruction performance even when a very high mask ratio is applied for both audio and visual modalities.
Figure S10: Additional examples of reconstruction with varying masked inputs, showing the original, masked, and reconstructed video frames and audio spectrograms. We use the pretrained XKD; during inference, the video mask ratio is varied from 60% to 90% and the audio mask ratio from 50% to 80%. XKD demonstrates high reconstruction performance even when a very high mask ratio is applied for both audio and visual modalities.
Figure S11: Examples of the video frame reconstruction ability of XKD from highly masked (80%) inputs, showing the original, masked, and reconstructed frames.
Figure S12: Examples of the audio spectrogram reconstruction ability of XKD from highly masked (70%) inputs, showing the original, masked, and reconstructed spectrograms.