
^1 Georgia Institute of Technology   ^2 GenAI, Meta   ^3 University of Illinois Urbana-Champaign
{bolin.lai,fkryan,wenqi.jia}@gatech.edu  [email protected]  [email protected]

Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation

Bolin Lai^1    Fiona Ryan^1    Wenqi Jia^1    Miao Liu^2*    James M. Rehg^3*
Abstract

Egocentric gaze anticipation serves as a key building block for emerging Augmented Reality capabilities. Notably, gaze behavior is driven by both visual cues and audio signals during daily activities. Motivated by this observation, we introduce the first model that leverages both the video and audio modalities for egocentric gaze anticipation. Specifically, we propose a Contrastive Spatial-Temporal Separable (CSTS) fusion approach that adopts two modules to separately capture audio-visual correlations in the spatial and temporal dimensions, and applies a contrastive loss on the re-weighted audio-visual features from the fusion modules for representation learning. We conduct extensive ablation studies and thorough analysis on two egocentric video datasets, Ego4D and Aria, to validate our model design. We demonstrate that audio improves performance by +2.5% and +2.4% on the two datasets. Our model also outperforms prior state-of-the-art methods by at least +1.9% and +1.6%. Moreover, we provide visualizations of the gaze anticipation results and additional insights into audio-visual representation learning. The code and data split are available on our website (https://bolinlai.github.io/CSTS-EgoGazeAnticipation/).

Keywords:
Egocentric Vision · Gaze Behavior · Audio-Visual Learning
* Equal corresponding author.

1 Introduction

A person’s eye movements during their daily activities are reflective of their intentions and goals (see [22] for a representative cognitive science study). The ability to predict the future gaze targets of the camera-wearer from egocentric videos, known as egocentric gaze anticipation, is therefore a key step towards understanding and modeling cognitive processes and decision making. Furthermore, this capability could enable new applications in Augmented Reality and Wearable Computing, especially in social scenarios – for example, providing memory aids for patients with cognitive impairments, or reducing the latency of content delivery in such AR systems. However, forecasting the gaze fixations of a camera-wearer using only the egocentric view (i.e., without eye tracking at testing time) is very challenging due to the complexity of egocentric scene content and the dynamic nature of gaze behaviors.

Figure 1: The problem setting of egocentric gaze anticipation. $\tau_o$ denotes the observation time, and $\tau_a$ denotes the anticipation time. Given the video frames and audio signals of the Input Video Sequence, the model seeks to predict the gaze fixation distribution for the time steps in the Gaze Anticipation Sequence. Green dots indicate the gaze targets in future frames and the heatmap shows the gaze anticipation result from our model.

We argue that audio signals can serve as an important auxiliary cue for egocentric gaze forecasting. Consider the example in Fig. 1. In the input sequence, the camera view shifts from the paper held by the camera wearer to the standing speaker who asks a question. Then the sitting speaker on the far right answers the question, which is captured by the audio stream. In the anticipation sequence, the camera wearer’s gaze shifts towards the sitting person’s head after hearing her response. In this case, the audio stream (the sitting person’s response) is an important stimulus that triggers this gaze movement. The influence of audio signals on eye movements is also evidenced by neuroscience research [57]. Therefore, we address the problem of forecasting the gaze fixation of the camera-wearer in unseen future frames using a short egocentric video clip and corresponding audio. As shown in Fig. 1, the model’s ability to fuse the audio and visual cues enables it to correctly predict the future attention to the seated subject.

Though many works have addressed egocentric gaze estimation [29, 30, 28, 37, 38, 63, 36], the egocentric gaze anticipation task remains largely understudied [77]. Moreover, how to leverage both the visual and audio modalities for egocentric gaze modeling has not been explored yet. Existing audio-visual learning methods [67, 73, 6, 24, 54, 5, 7] commonly fuse visual and audio embeddings jointly in time and space. However, such a fusion mechanism is not ideal in the egocentric setting, where the camera wearer's reaction to audio stimuli causes drastic changes of camera viewpoint. In Fig. 1, as a reaction to the question and answer, the camera wearer shifts attention from the paper to the standing person and then to the sitting person. The viewpoint and scene also change because of head movement (see the first and last frames). Moreover, due to the natural delay of reaction time, the audio stimulus and the gaze reaction do not occur at the same time. Therefore, predicting future gaze behavior demands a model that can (1) learn the possible viewpoint and scene changes driven by the audio stream over time and (2) locate the potential future gaze target in the visual space. Fusing the two modalities in time and space simultaneously may limit performance on both objectives because of spurious audio-visual correlations. Hence, a spatial-temporal separable fusion model is a better solution for the egocentric gaze anticipation task.

To address the challenges in our task, we propose a novel Contrastive Spatial-Temporal Separable (CSTS) audio-visual fusion method for egocentric gaze anticipation. Specifically, we feed the egocentric video frames and the corresponding audio spectrograms into a video encoder and an audio encoder, respectively. We then develop a spatial fusion module and a temporal fusion module in parallel, based on the self-attention mechanism, to model the spatial and temporal audio-visual correlations separately, directly addressing the aforementioned demands. The output representations of the two branches are merged by channel-wise reweighting and fed into a visual decoder to predict the future gaze target. We also propose a novel strategy that applies a multi-modal contrastive loss [2] on the reweighted representations from the fusion modules (referred to as post-fusion contrastive loss) to facilitate audio-visual correspondence learning. We demonstrate the benefits of our approach on two egocentric video datasets that capture social scenarios and everyday activities: Ego4D [20] and Aria [43]. The proposed model achieves state-of-the-art gaze anticipation performance on both datasets. Our contributions are summarized as follows:

• We introduce the first approach that utilizes visual and audio signals for modeling egocentric gaze behaviors.

• We propose a novel CSTS model that leverages a spatio-temporal separable fusion module and a post-fusion contrastive learning scheme to facilitate audio-visual representation learning for egocentric gaze anticipation.

• We present comprehensive experiment results on the Ego4D [20] and Aria [43] datasets. Our ablation studies show that the audio modality improves the F1 score by +2.5% and +2.4% on Ego4D and Aria, respectively. The experiments also demonstrate that our model outperforms the prior state-of-the-art method by +1.9% and +1.6% in F1 score on the two datasets.

2 Related Work

Egocentric Gaze Modeling. Modeling human gaze behavior in egocentric videos is an important topic in egocentric vision. Most prior efforts target egocentric gaze estimation [38, 36, 29, 30, 37, 28]. Huang et al. [29] propose learning temporal attention transitions from video features that reflect drastic gaze movements. Li et al. [38] and Huang et al. [28] utilize the correlation of gaze behaviors and actions, modeling them jointly with a convolutional network. Lai et al. [36] encode global scene context into a single global token and explicitly model the global-local correlations in the visual embedding for gaze estimation. In contrast, egocentric gaze anticipation, which seeks to predict future gaze targets from past video frames, addresses an understudied dimension of gaze modeling. Zhang et al. [77] introduce this task and utilize a convolutional network and a discriminator to generate future video frames, which are further used to anticipate future gaze targets. They enhance their model by adding an additional branch for gaze forecasting [76]. All previous efforts on both egocentric gaze estimation and anticipation model gaze behavior from only the visual properties of the video stream, and do not consider the relationship between audio signals and gaze behavior. In this work, we introduce the first model that leverages both visual and audio signals for the egocentric gaze anticipation task.

Audio-Visual Saliency Prediction. Audio-visual saliency prediction is a well-studied problem in computer vision [56, 58, 55, 61, 10, 46]. Another related research topic is sound source localization [5, 60, 23, 24, 25, 26], which localizes the sound source in an image or video corresponding to a given audio stream. Here, we mainly discuss previous approaches for fusing audio and visual representations in the saliency prediction problem. Early CNN-based approaches adopt a late-fusion strategy [66, 67, 47, 68, 69] for saliency prediction. Recently, new findings suggest that audio-visual fusion at the intermediate features is a more effective way to leverage the advantages of both modalities [1, 74, 8, 64] for saliency prediction. Jain et al. [31] investigate two fusion methods at the middle level which achieve new state of the art on multiple datasets. Yao et al. [75] propose to incorporate the audio signal at multiple decoder layers using an inner-product operation. Similarly, Chang et al. [6] and Xiong et al. [73] merge audio features into visual features at multiple levels of the visual encoder. Notably, our problem differs from audio-visual saliency prediction in two aspects. First, the goal of our task is forecasting future gaze behavior, while saliency prediction focuses more on studying the human attention mechanism in the current video frame. Second, our problem focuses on egocentric videos that capture the changing viewpoint when people respond to audio and visual stimuli, while saliency prediction uses videos captured from a fixed viewpoint, which fails to reflect gaze reactions to real-time events. Apart from the difference in problem settings, we also want to emphasize that transformer-based fusion methods have not been applied to the audio-visual saliency prediction problem. Moreover, we propose a well-motivated spatio-temporal separable fusion module to address this challenging problem.

Contrastive Audio-Visual Representation Learning. Our work draws from a rich literature on leveraging contrastive learning to learn audio-visual feature representations [4, 35, 50, 48, 49, 52, 44, 3, 2, 19, 21, 45]. These works learn correspondences between audio and visual signals in a self-supervised manner, constructing positive pairs from matching video frames and audio segments, and negative pairs from all other pairwise combinations. We employ a similar contrastive loss to learn correspondences between co-occurring audio and visual features. However, while prior methods calculate the contrastive loss on the raw embeddings from each modality, we propose to apply the contrastive loss on re-weighted audio and visual representations from our proposed spatial and temporal fusion mechanism.

3 Method

The egocentric gaze anticipation problem is illustrated in Fig. 1. Given an egocentric video and audio from time $t-\tau_o$ to $t$, the goal is to predict the future gaze from $t$ to $t+\tau_a$ seconds. We denote the input video and audio as $x$ and $a$, respectively, and model the gaze fixation as a probabilistic distribution on a 2D image plane (following [38, 36]).
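For concreteness, the sketch below (our illustration, not code released by the authors) shows one plausible way to turn a gaze fixation into the 2D probability map referred to above; the grid size and Gaussian width are assumptions.

```python
import torch

def gaze_to_heatmap(gx: float, gy: float, h: int = 64, w: int = 64, sigma: float = 3.0) -> torch.Tensor:
    """gx, gy are normalized fixation coordinates in [0, 1]; returns an (h, w) map that sums to 1."""
    ys = torch.arange(h, dtype=torch.float32).unsqueeze(1)   # (h, 1) row coordinates
    xs = torch.arange(w, dtype=torch.float32).unsqueeze(0)   # (1, w) column coordinates
    cy, cx = gy * (h - 1), gx * (w - 1)
    heat = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return heat / heat.sum()                                 # valid probability distribution over the image plane
```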

Notably, visual and audio signals have correlations in both spatial and temporal dimensions for gaze modeling. Spatially, the visual region (e.g., sound source) that has a stronger correlation with audio signals is more likely to be the potential future gaze target. Temporally, events in the audio signal may drive both egocentric viewpoint change (via head movement) and gaze movements as the camera wearer responds to new sounds. Our key insight is thus separating the connection of audio and visual signals into spatial and temporal correlations.

Figure 2: Overview of the proposed model. The video embeddings $\phi(x)$ and audio embeddings $\psi(a)$ are obtained by two transformer-based encoders. We then model the correlations of visual and audio embeddings using two separate branches – (1) spatial fusion, which learns the spatial co-occurrence of audio signals and visual objects in each frame, and (2) temporal fusion, which captures the temporal correlations and possible gaze movement. A contrastive loss is adopted to facilitate audio-visual representation learning. We input the fused embeddings into a decoder for the final gaze anticipation results.

Fig. 2 shows an overview of our model. We use two transformer-based encoders to obtain the video embeddings $\phi(x)$ and audio embeddings $\psi(a)$ from the video frames $x$ and audio signals $a$. We then employ a Contrastive Spatial-Temporal Separable (CSTS) audio-visual fusion approach. Specifically, a spatial fusion module captures the correlation between audio embeddings and spatial appearance-based features; a temporal fusion module captures the temporal correlation between the visual and audio embeddings; and a contrastive loss is applied on the fused audio-visual embeddings to facilitate representation learning. Finally, the spatially and temporally fused audio-visual features are merged and fed into a decoder for future gaze anticipation.

3.1 Audio and Visual Feature Embedding

Visual Feature Embedding. We adopt the multi-scale vision transformer (MViT) architecture [14] as the video encoder. It splits the 3D video tensor input into multiple non-overlapping patches, and thereby extracts $T\times H\times W$ visual tokens $\phi(x)$ with feature dimension $D$ from $x$.

Audio Feature Embedding. We follow [34] and adopt a sliding-window approach for audio signal preprocessing. Specifically, for a video frame at time step $t_i$, the corresponding audio segment covers the range $[t_i-\frac{1}{2}\Delta t_w, t_i+\frac{1}{2}\Delta t_w]$. We then use the STFT to convert all audio segments into log-spectrograms and feed the processed audio segments into a transformer-based audio encoder. Since the audio stream carries sparser information than the video stream, we adopt a lightweight transformer architecture (inspired by [17, 12]) for the audio encoder. In this way, it extracts $T\times M$ tokens $\psi(a)$ with feature dimension $D$ from the audio input $a$.

3.2 Spatial-Temporal Separable Fusion

Spatial Audio-Visual Fusion. The spatial fusion branch identifies correlations between the audio signal corresponding to a video frame and its visual content in space. We first use convolutional operations to generate the audio representation $z_{a,s}\in\mathbb{R}^{T\times 1\times D}$ for spatial fusion from the audio embedding $\psi(a)$. This allows the model to extract a holistic audio embedding within each sliding window. We then input the visual embedding $\phi(x)$ and the pooled audio embedding $z_{a,s}$ into an in-frame self-attention layer $\sigma$. In this layer, we mask out all cross-frame connections and only compute the correlations among the visual tokens within each frame and the corresponding single audio token. Therefore, the input to the spatial fusion consists of $T$ groups of visual tokens and $T$ single audio embeddings. Formally, we have:

$\phi(x)=\left[\phi(x)^{(1)},\dots,\phi(x)^{(T)}\right]$, (1)
$z_{a,s}=\left[z_{a,s}^{(1)},\dots,z_{a,s}^{(T)}\right]$, (2)

where $\phi(x)^{(i)}\in\mathbb{R}^{1\times N\times D}$, $z_{a,s}^{(i)}\in\mathbb{R}^{1\times 1\times D}$ with $i\in\{1,\dots,T\}$, and $N=H\times W$. Hence, the input from each time step is denoted as:

$z_s^{(i)}=\left[\phi(x)^{(i)},z_{a,s}^{(i)}\right]\in\mathbb{R}^{1\times(N+1)\times D}$ (3)

The in-frame self-attention operation for time step ii can be written as:

$\sigma(z_s^{(i)})=\mathrm{Softmax}\left(\bm{Q}_s^{(i)}{\bm{K}_s^{(i)}}^{T}/\sqrt{D}\right)\bm{V}_s^{(i)}\in\mathbb{R}^{1\times(N+1)\times D}$, (4)

where $\bm{Q}_s^{(i)},\bm{K}_s^{(i)},\bm{V}_s^{(i)}$ refer to the query, key, and value of the spatial self-attention at time step $i$, respectively. We apply Eq. 4 independently for each time step $i$ and have the following overall in-frame self-attention:

$\sigma(z_s)=\left[\sigma(z_s^{(1)}),\dots,\sigma(z_s^{(T)})\right]\in\mathbb{R}^{T\times(N+1)\times D}$. (5)

In practice, we input all tokens to the in-frame self-attention layer simultaneously, mask out cross-frame correlations, and calculate Eq. 4 in one shot to speed up training. We further add two linear layers after the self-attention outputs $\sigma(z_s)$, following the standard self-attention layer design. The output of the spatial module is finally denoted as $u_s\in\mathbb{R}^{T\times(N+1)\times D}$.
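A minimal sketch of the in-frame spatial fusion, assuming single-head attention and omitting layer normalization and dropout. Keeping the time axis as a batch-like dimension makes the attention block-diagonal over frames, which is equivalent to masking out all cross-frame connections as described above.

```python
import torch
import torch.nn as nn

class SpatialFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (B, T, N, D) visual tokens; aud: (B, T, 1, D) pooled per-frame audio tokens z_{a,s}
        z = torch.cat([vis, aud], dim=2)                        # (B, T, N+1, D) -- Eq. (3) for every frame
        q, k, v = self.qkv(z).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) / z.shape[-1] ** 0.5   # (B, T, N+1, N+1), in-frame only
        out = attn.softmax(dim=-1) @ v                          # Eq. (4), applied to all time steps at once
        return self.mlp(out)                                    # u_s: (B, T, N+1, D)
```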

Temporal Audio-Visual Fusion. The temporal fusion branch models relationships between audio and visual content across time. We apply two convolutional layers to integrate the embedding of each modality at each time step into a single token. The resulting visual and audio tokens are denoted as $z_{v,t}\in\mathbb{R}^{T\times 1\times D}$ and $z_{a,t}\in\mathbb{R}^{T\times 1\times D}$, respectively. Then we feed $z_t=\left[z_{v,t},z_{a,t}\right]\in\mathbb{R}^{2T\times 1\times D}$ into a cross-frame self-attention layer $\pi$ that can be formulated as:

$\pi(z_t)=\mathrm{Softmax}\left(\bm{Q}_t\bm{K}_t^{T}/\sqrt{D}\right)\bm{V}_t\in\mathbb{R}^{2T\times 1\times D}$, (6)

where $\bm{Q}_t,\bm{K}_t,\bm{V}_t$ are the query, key and value matrices with dimension $2T\times 1\times D$. Similar to the spatial fusion, two additional linear layers are added after $\pi(z_t)$, resulting in the final temporal fusion output $u_t\in\mathbb{R}^{2T\times 1\times D}$.
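A matching sketch of the temporal fusion branch under the same assumptions: the per-time-step visual and audio tokens are stacked into one length-2T sequence and mixed by a single cross-frame self-attention layer.

```python
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, z_vt: torch.Tensor, z_at: torch.Tensor) -> torch.Tensor:
        # z_vt, z_at: (B, T, D) pooled visual / audio tokens, one per time step
        z = torch.cat([z_vt, z_at], dim=1)                      # (B, 2T, D)
        q, k, v = self.qkv(z).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) / z.shape[-1] ** 0.5   # (B, 2T, 2T) attention across time and modality
        out = attn.softmax(dim=-1) @ v                          # Eq. (6)
        return self.mlp(out)                                    # u_t: (B, 2T, D)
```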

Merging of Two Fusion Modules. After obtaining audio-visual representations from the two fusion modules, we merge the two branches by reweighting the output from spatial fusion with the temporal weights from temporal fusion in each channel, which produces a new representation for each modality that has been refined by multimodal spatial and temporal correlation. Specifically, we break down the output from spatial fusion $u_s\in\mathbb{R}^{T\times(N+1)\times D}$ into $u_{v,s}\in\mathbb{R}^{T\times N\times D}$ and $u_{a,s}\in\mathbb{R}^{T\times 1\times D}$, and the output from temporal fusion $u_t\in\mathbb{R}^{2T\times 1\times D}$ into $u_{v,t}\in\mathbb{R}^{T\times 1\times D}$ and $u_{a,t}\in\mathbb{R}^{T\times 1\times D}$. The reweighted visual representation is formulated as

$u_v=u_{v,s}\otimes u_{v,t}\in\mathbb{R}^{T\times N\times D}$, (7)

where $\otimes$ denotes element-wise multiplication with the broadcast mechanism. $u_v$ is then fed into a decoder to generate the final prediction of the future gaze target. We follow [36] to add skip connections from the video encoder to the decoder and optimize the network with a KL-divergence loss $\mathcal{L}_{kld}$.
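The merging step of Eq. (7) and the KL-divergence objective reduce to a few lines; the sketch below is our paraphrase with assumed tensor layouts (the decoder and the skip connections from [36] are omitted).

```python
import torch
import torch.nn.functional as F

def merge_branches(u_vs: torch.Tensor, u_vt: torch.Tensor) -> torch.Tensor:
    # u_vs: (B, T, N, D) visual output of spatial fusion; u_vt: (B, T, D) visual part of temporal fusion
    return u_vs * u_vt.unsqueeze(2)                 # Eq. (7): channel-wise reweighting, broadcast over N

def kld_loss(pred_logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred_logits, target: (B, T_out, H, W); each target map is a valid gaze distribution
    log_p = F.log_softmax(pred_logits.flatten(2), dim=-1)
    return F.kl_div(log_p, target.flatten(2), reduction="batchmean")
```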

3.3 Contrastive Learning for Audio-Visual Fusion

In addition to using the KL-divergence loss to supervise gaze anticipation, we propose to leverage the intrinsic alignment of the visual and audio modalities to learn a more robust audio-visual representation through a contrastive learning scheme. Multi-modal contrastive losses have been shown to be effective in self-supervised learning [3, 2]. Rather than calculating the contrastive loss directly on the raw embedded features, we propose to use the reweighted video and audio representations from the spatial and temporal fusion modules, which has not been studied in prior works. In our experiments, we show this is a more effective representation learning method for egocentric gaze anticipation.

To this end, we reweight the raw audio embedding $\psi(a)\in\mathbb{R}^{T\times M\times D}$ from the audio encoder by the temporal weights $u_{a,t}$ from the temporal fusion module in a similar way to Eq. 7. We then get the reweighted audio feature as

$u_a=\psi(a)\otimes u_{a,t}\in\mathbb{R}^{T\times M\times D}$ (8)

Unlike prior works [3, 2, 39], we do not use an additional learnable token to aggregate information from the other tokens. We instead average all tokens of $u_v$ and $u_a$ respectively to obtain the single-vector representations $u'_v, u'_a\in\mathbb{R}^{1\times D}$, and then map them to a low-dimensional common space using linear layers followed by L2 normalization. This can be formulated as $w_v=\mathrm{Norm}\left(f_1(u'_v)\right)$ and $w_a=\mathrm{Norm}\left(f_2(u'_a)\right)$, where $f_1(\cdot), f_2(\cdot)$ are linear layers. The resulting visual and audio vectors are denoted as $w_v, w_a\in\mathbb{R}^{1\times D'}$, where $D'$ is the dimension of the common space. Within each mini-batch, corresponding audio and visual embeddings are considered as positive pairs, and all other pairwise combinations are considered as negative pairs. Following [39], we calculate the video-to-audio loss and the audio-to-video loss separately. The video-to-audio contrastive loss is defined as

$\mathcal{L}^{v2a}_{cntr}=-\frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\log\frac{\exp({w_v^{(i)}}^{T}w_a^{(i)}/\mathcal{T})}{\sum_{j\in\mathcal{B}}\exp({w_v^{(i)}}^{T}w_a^{(j)}/\mathcal{T})}$, (9)

where $\mathcal{B}=\{1,2,\dots,n\}$ is the training batch and $\mathcal{T}$ is the temperature factor. Superscripts $(i)$ and $(j)$ denote the $i$-th and $j$-th samples in the batch. The audio-to-video loss is defined in a symmetric way. Finally, the contrastive loss is defined as $\mathcal{L}_{cntr}=\mathcal{L}_{cntr}^{v2a}+\mathcal{L}_{cntr}^{a2v}$. $\mathcal{L}_{kld}$ and $\mathcal{L}_{cntr}$ are linearly combined with a weight $\alpha$ for the final training loss, i.e., $\mathcal{L}=\mathcal{L}_{kld}+\alpha\mathcal{L}_{cntr}$.
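A compact sketch of the post-fusion contrastive objective (Eq. 9 and its symmetric counterpart), assuming batch-first tensors; the projection dimension and temperature value are assumptions, since the paper does not state them here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PostFusionContrastive(nn.Module):
    def __init__(self, dim: int, proj_dim: int = 256, temperature: float = 0.07):
        super().__init__()
        self.f1 = nn.Linear(dim, proj_dim)   # visual projection f_1
        self.f2 = nn.Linear(dim, proj_dim)   # audio projection f_2
        self.tau = temperature               # temperature T (value assumed for illustration)

    def forward(self, u_v: torch.Tensor, u_a: torch.Tensor) -> torch.Tensor:
        # u_v: (B, T, N, D) reweighted visual tokens; u_a: (B, T, M, D) reweighted audio tokens
        w_v = F.normalize(self.f1(u_v.mean(dim=(1, 2))), dim=-1)   # (B, D') L2-normalized visual vectors
        w_a = F.normalize(self.f2(u_a.mean(dim=(1, 2))), dim=-1)   # (B, D') L2-normalized audio vectors
        logits = w_v @ w_a.t() / self.tau                          # (B, B) pairwise similarities
        labels = torch.arange(w_v.size(0), device=w_v.device)      # positives lie on the diagonal
        return F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)  # L_v2a + L_a2v
```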

3.4 Implementation Details

In our experiments, we set the observation time $\tau_o$ to 3 seconds and the anticipation time $\tau_a$ to 2 seconds. For the video input, we sample 8 frames from the observable segment and resize them to a spatial size of 256×256. For the audio input, following [34], we first resample the audio signal to 24kHz and use a time window of $\Delta t_w=1.28$s to crop the audio segment corresponding to each video frame. We then convert each segment to a log-spectrogram using an STFT with a window size of 10ms and a hop length of 5ms. The number of frequency bands is set to 256, resulting in a spectrogram matrix of size 256×256. The output of the decoder is the gaze distribution on 8 frames uniformly sampled from the 2-second anticipation time. More details about the model architecture and training hyper-parameters can be found in the supplementary material.
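A minimal sketch of the audio preprocessing above using torchaudio; the window, hop and sampling rate follow the stated values, while the FFT size and the exact cropping to 256×256 are our assumptions about details the paper does not spell out.

```python
import torch
import torchaudio

SR = 24_000                                   # target sampling rate (24 kHz)
WIN = int(0.010 * SR)                         # 10 ms analysis window -> 240 samples
HOP = int(0.005 * SR)                         # 5 ms hop -> 120 samples
N_FFT = 512                                   # assumption: enough bins to keep 256 frequency bands

spectrogram = torchaudio.transforms.Spectrogram(n_fft=N_FFT, win_length=WIN, hop_length=HOP, power=2.0)

def frame_log_spectrogram(audio: torch.Tensor, t_i: float, delta_tw: float = 1.28) -> torch.Tensor:
    """Crop the 1.28 s window centered at frame time t_i (audio already resampled to 24 kHz, mono)."""
    start = max(int((t_i - delta_tw / 2) * SR), 0)
    seg = audio[start: start + int(delta_tw * SR)]
    s = spectrogram(seg)                      # (freq_bins, time_frames)
    return torch.log(s + 1e-6)[:256, :256]    # 256x256 log-spectrogram fed to the audio encoder
```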

4 Experiments

We first introduce the datasets and evaluation metrics used in our experiments. We then present detailed ablation studies to validate the contribution of each component in our method, and demonstrate the performance improvement over prior state-of-the-art methods for gaze anticipation as well as gaze estimation models applied to the gaze anticipation task. Finally, we visualize the predictions and weights of our model to provide qualitative insight into our method.

4.1 Experiment Setup

Datasets. We conduct experiments on two egocentric datasets that contain aligned video and audio streams and gaze tracking data – Ego4D [20] and Aria [43]. Note that another widely used gaze estimation benchmark, EGTEA Gaze+ [38], does not release audio data and thus cannot be used in our study. Other popular egocentric video datasets, such as Epic-Kitchens [11] and Charades-Ego [62], are also not applicable to our task because they do not have eye-tracking labels. Ego4D and Aria are the two largest public datasets that provide all necessary data and labels (i.e., egocentric videos, aligned audio streams and eye-tracking data) for egocentric audio-visual gaze anticipation.

The Ego4D eye-tracking subset is collected in social settings (i.e., the social interaction benchmark) and totals 31 hours of egocentric video from 80 participants. All videos have a fixed 30 fps frame rate and a spatial resolution of 1088×1080, and the audio streams are recorded with a sampling rate of 44.1kHz. We use the train/test split released in [36], i.e., 15310 video segments for training and the remaining 5202 video segments for testing.

The Aria dataset contains 143 egocentric videos (totaling 7.3 hours) collected with Project Aria glasses. It covers a variety of indoor everyday activities, including cooking, exercising and spending time with friends. All videos have a fixed 20 fps frame rate and a spatial resolution of 1408×1408. A sliding window is used to trim long videos into 5-second segments with a stride of 2 seconds. We use 107 videos (10456 segments) for training and 36 videos (2901 segments) for testing. We will release our split to facilitate future studies in this direction.
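For clarity, the sliding-window trimming amounts to the following trivial sketch (variable names are ours):

```python
def trim_segments(video_len_s: float, seg_len: float = 5.0, stride: float = 2.0):
    """Return (start, end) times in seconds of all 5-second segments taken with a 2-second stride."""
    segments, t = [], 0.0
    while t + seg_len <= video_len_s:
        segments.append((t, t + seg_len))
        t += stride
    return segments
```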

Evaluation Metrics. As suggested in recent work on egocentric gaze estimation [36], the AUC score easily saturates due to the long-tailed distribution of gaze on 2D video frames. Therefore, we follow [36, 38] and adopt F1 score (primary), recall and precision as our evaluation metrics.
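The exact evaluation protocol follows [36, 38] and is not restated in this paper; one plausible pixel-level implementation, with the binarization threshold below chosen purely for illustration, is:

```python
import torch

def f1_recall_precision(pred: torch.Tensor, gt: torch.Tensor, thr: float = 0.5):
    """pred, gt: (H, W) heatmaps in [0, 1]; both are binarized before counting pixels."""
    p, g = (pred >= thr).float(), (gt >= thr).float()
    tp = (p * g).sum()
    precision = tp / p.sum().clamp(min=1)
    recall = tp / g.sum().clamp(min=1)
    f1 = 2 * precision * recall / (precision + recall).clamp(min=1e-6)
    return f1.item(), recall.item(), precision.item()
```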

4.2 Ablation Study

We first quantify the performance contribution of each key module of our proposed method. Specifically, we denote the model using only our proposed spatial fusion module as S-fusion, the model using only our proposed temporal fusion module as T-fusion, and the model using both modules and our spatial-temporal separable fusion strategy without the contrastive learning scheme as STS. We finally present the performance of our full CSTS model (i.e., STS + contrastive learning). As demonstrated in Tab. 1, compared with the model trained solely on RGB frames (Vision only), S-fusion and T-fusion boost the F1 score by +1.4% and +1.5% on Ego4D, and +1.1% and +1.1% on Aria. Moreover, the STS model further achieves an F1 score of 39.2% on Ego4D and 59.3% on Aria. These results suggest that both the spatial and the temporal correlations between the video and audio signals play a vital role in egocentric gaze anticipation. The contrastive loss further improves the F1 score by +0.5% and +0.6%, suggesting its contribution to audio-visual representation learning. We also observe that the full model does not achieve the best recall. This is because some incomplete baselines do not leverage the audio modality as effectively as the full model and thus produce more uncertainty in their outputs, resulting in higher recall and lower precision. Therefore, we consider F1 as the primary metric. A similar phenomenon is also observed in the following experiments.

Table 1: Ablations on each key component of our proposed model. CSTS (highlighted in green) refers to the complete model of our approach. The best results are highlighted with boldface. Please refer to Sec. 4.2 for more discussions.
Methods Ego4D Aria
F1 Score Recall Precision F1 Score Recall Precision
Vision only 37.2 54.1 28.3 57.5 62.4 53.3
S-fusion 38.6 54.1 30.1 58.6 67.1 52.0
T-fusion 38.7 53.8 30.1 58.6 65.9 52.8
STS 39.2 53.7 30.8 59.3 66.8 53.3
CSTS 39.7 53.3 31.6 59.9 66.8 54.3
Table 2: Analysis on proposed fusion strategies. The best results are highlighted with boldface. STS (highlighted in green) refers to the proposed spatial-temporal separable fusion method (without contrastive learning). More discussions are in Sec. 4.3.
Methods Ego4D Aria
F1 Score Recall Precision F1 Score Recall Precision
Vision only 37.2 54.1 28.3 57.5 62.4 53.3
Linear 38.2 53.0 29.9 58.1 65.9 51.9
Bilinear 37.6 52.8 29.2 57.7 66.8 50.8
Concat. 38.1 53.6 29.5 58.0 66.8 51.2
Vanilla SA 38.5 53.3 30.1 58.0 67.2 51.1
STS 39.2 53.7 30.8 59.3 66.8 53.3

4.3 Analysis on Fusion and Contrastive Learning Strategies

Directly feeding all visual and audio tokens into a fusion layer (i.e., joint fusion) is a widely used approach in audio-visual saliency prediction [67, 73, 6] and action recognition [34, 17, 70]. To show the advantage of the proposed spatial-temporal separable (STS) fusion approach in handling the unique challenges of our task, we additionally compare with four joint fusion strategies that are widely used in audio-visual saliency prediction and audio-visual action recognition. Specifically, the four strategies are (1) fusing the two modalities with a few linear layers [17] (denoted as Linear); (2) feeding the video and audio embeddings into a single bilinear layer [31, 75] (denoted as Bilinear); (3) concatenating the audio and visual embeddings along the channel dimension (denoted as Concat.) as in [34, 31]; and (4) feeding all embedded video and audio tokens together into a standard self-attention layer (denoted as Vanilla SA), inspired by [40, 73]; a minimal sketch of this last variant is shown below. We replace our fusion modules with these four strategies in our framework for a fair comparison. We elaborate the implementation details of each baseline in the supplementary material.
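The Vanilla SA baseline roughly corresponds to the following sketch (our re-implementation for illustration; the number of heads is an assumption): all T·N visual tokens and T·M audio tokens are flattened into one sequence and mixed by a single standard self-attention layer, coupling space and time at once.

```python
import torch
import torch.nn as nn

class VanillaJointFusion(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (B, T, N, D) visual tokens; aud: (B, T, M, D) audio tokens
        B, T, N, D = vis.shape
        z = torch.cat([vis.flatten(1, 2), aud.flatten(1, 2)], dim=1)   # (B, T*N + T*M, D)
        out, _ = self.attn(z, z, z)                                    # joint space-time fusion
        return out[:, :T * N].view(B, T, N, D)                         # fused visual tokens for the decoder
```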

As shown in Tab. 2, the Linear, Bilinear, Concat. and Vanilla SA methods bring limited improvement over the vision-only baseline, suggesting that previous fusion strategies for audio-visual saliency prediction and general action recognition are sub-optimal for our problem setting. In contrast, our proposed fusion strategy (STS) yields a larger performance boost (+2.0% on Ego4D and +1.8% on Aria) even without the contrastive loss, which shows the benefit of the spatial-temporal separable fusion mechanism. A possible reason is that prior joint fusion methods are designed for third-person videos without drastic viewpoint changes. However, forecasting gaze in the egocentric view has unique challenges caused by camera movement and the latency of the gaze response to audio stimuli. Our approach fuses the two modalities in space and time separately and hence avoids the spurious correlations that may arise in the joint fusion baselines.

Table 3: Analysis on the proposed contrastive learning schema. Post Contr denotes our post-fusion contrastive learning. STS + Post Contr refers to the complete CSTS model. The best results are highlighted with boldface. More discussions are in Sec. 4.3.
Methods Ego4D Aria
F1 Score Recall Precision F1 Score Recall Precision
Vanilla SA 38.5 53.3 30.1 58.0 67.2 51.1
SA + Vanilla Contr 38.5 52.4 30.5 58.4 67.0 51.8
SA + Post Contr 38.9 54.4 30.3 58.8 66.4 52.8
STS 39.2 53.7 30.8 59.3 66.8 53.3
STS + Vanilla Contr 39.0 53.7 30.6 59.1 66.5 53.1
STS + Post Contr 39.7 53.3 31.6 59.9 66.8 54.3

We also evaluate the benefits of our proposed post-fusion contrastive learning scheme in Tab. 3. Here, we consider another baseline (denoted as Vanilla Contr) that calculates the contrastive loss using the raw video and audio embeddings (i.e., $\phi(x)$ and $\psi(a)$ in Fig. 2), as is typical in prior work [65, 44, 19, 21]. Our strategy of adding the contrastive loss on fused features is denoted as Post Contr. Vanilla Contr makes only a minor difference on the Vanilla SA model and even slightly reduces performance when combined with our proposed STS mechanism. In contrast, our proposed Post Contr scheme improves the performance of Vanilla SA by +0.4% and +0.8% and improves STS by +0.5% and +0.6% on the two datasets. These results further suggest that post-fusion contrastive learning is more robust for audio-visual learning in our task. More experiments with different contrastive learning strategies are provided in the supplementary material.

4.4 Comparison with State-of-the-art Methods

Table 4: Comparison with state-of-the-art models on egocentric gaze anticipation. We also adapt previous egocentric gaze estimation approaches to the anticipation setting for a more thorough comparison. The best results are highlighted with boldface. The green row shows our model performance. Please refer to Sec. 4.4 for more discussions.
Methods Ego4D Aria
F1 Score Recall Precision F1 Score Recall Precision
Center Prior 13.6 9.4 24.2 24.9 17.3 44.4
GazeMLE [38] 36.3 52.5 27.8 56.8 64.1 51.0
AttnTrans [29] 37.0 55.0 27.9 57.4 65.5 51.0
I3D-R50 [15] 36.9 52.1 28.6 57.4 63.6 52.2
MViT [14] 37.2 54.1 28.3 57.5 62.4 53.3
GLC [36] 37.8 52.9 29.4 58.3 65.4 52.6
DFG [77] 37.2 53.2 28.6 57.4 63.6 52.3
DFG+ [76] 37.3 52.3 29.0 57.6 65.5 51.3
CSTS 39.7 53.3 31.6 59.9 66.8 54.3
Figure 3: The performance of gaze anticipation in each frame. Our model (CSTS) consistently outperforms all prior methods by a notable margin.

Most existing works on egocentric gaze modeling target egocentric gaze estimation rather than anticipation. In order to provide a thorough comparison, in addition to comparing against the SOTA egocentric gaze anticipation models (DFG [77], DFG+ [76]), we also adapt the recent SOTA egocentric gaze estimation model GLC [36] and all baselines from [36] (I3D-Res50 [71], MViT [14], GazeMLE [38] and AttnTrans [29]) to the anticipation task.

As presented in Tab. 4, our method outperforms its direct competitor DFG+, the previous SOTA model for egocentric gaze anticipation, by +2.4% F1 on Ego4D and +2.3% F1 on Aria. Note that the original DFG and DFG+ used a less powerful backbone encoder, so for a fair comparison, we reimplement their method using the same MViT backbone as our method. We also observe that methods originally designed for egocentric gaze estimation still serve as strong baselines for the egocentric gaze anticipation task. Our proposed CSTS model also outperforms these methods, surpassing the recent SOTA for egocentric gaze estimation, GLC, by +1.9% F1 on Ego4D and +1.6% F1 on Aria. In addition, we also incorporate the audio stream into the strongest baseline (GLC) by a straightforward concatenation, which yields an F1 score of 38.1% on Ego4D and 58.5% on Aria. The marginal gain over GLC (+0.3%/+0.2%) suggests that simply using the audio stream in a strong baseline without a dedicated design leads to a sub-optimal solution for the egocentric gaze anticipation problem, which in turn validates the effectiveness and necessity of our approach.

Figure 4: Egocentric gaze anticipation results from our model and other baselines. We show the results of four future time steps uniformly sampled from the anticipation segments. Green dots indicate the ground truth gaze location.

In addition, we evaluate gaze anticipation at each anticipation time step independently and compare with previous methods in Fig. 3. Unsurprisingly, the anticipation problem becomes more challenging as the anticipation time step extends farther into the future. Our CSTS method consistently outperforms all baselines at all future time steps. Moreover, we note that our model also produces new SOTA results on egocentric gaze estimation, demonstrating the generalizability and robustness of our approach across gaze modeling tasks. We include these results in the supplementary material.

Figure 5: Visualization of the spatial correlation weights. All video frames are sorted in chronological order, indexed by the numbers in the top-right corner.

4.5 Visualization of Predictions

We visually showcase the anticipation results of CSTS and the baselines in Fig. 4. We can see that GazeMLE [38] and AttnTrans [29] produce more uncertainty in their prediction heatmaps. The other methods fail to anticipate the true gaze target and are likely to be misled by other salient objects. Our CSTS approach produces the best gaze anticipation results among all methods. We attribute this improvement to our model design, which effectively addresses the unique challenges of forecasting gaze targets in the egocentric view.

4.6 Visualization of Learned Correlations

We provide further insight into our model by visualizing the audio-visual correlations from the spatial fusion module. For each time step $t$, we calculate the correlation of each visual token with the single audio token and map it back to the input frames. The correlation heatmaps are shown in Fig. 5. In the first example, the speaker in the middle speaks, then turns her head around to talk with a social partner in the background (frames 1-3). We observe that our model captures that the audio signal has the highest correlation with the spatial region of the speaker while she is speaking. Then, when she stops talking and turns her head back, the correlation is highest in the background regions, indicating the potential location of her social partner. The second example illustrates a similar phenomenon: the model captures the speaker at the beginning when she is talking, then attends to background locations when she stops. These examples suggest our model has the capability to model audio-visual correlations in the spatial dimension and learn a robust audio-visual representation.
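A sketch of how such heatmaps can be extracted (our guess at the plotting code; whether the audio token's attention row or the visual-to-audio column is read out is an assumption):

```python
import torch

def audio_visual_correlation(attn: torch.Tensor, H: int, W: int) -> torch.Tensor:
    # attn: (T, N+1, N+1) softmax attention from the in-frame spatial fusion layer,
    # with the single audio token stored at the last index of each frame.
    row = attn[:, -1, :-1]                       # (T, N) audio-to-visual attention weights
    row = row / row.sum(dim=-1, keepdim=True)    # renormalize over the N spatial locations
    return row.view(-1, H, W)                    # (T, H, W) maps; upsample to frame size for display
```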

5 Conclusion

In this paper, we propose a novel contrastive spatial-temporal separable fusion approach (CSTS) for egocentric gaze anticipation. Our key contribution is breaking the fusion of the audio and visual modalities down into a spatial fusion module, which learns the spatial co-occurrence of visual features and audio signals, and a temporal fusion module, which models the changing viewpoint and scene driven by audio stimuli. We further adopt a contrastive loss on the reweighted audio-visual representations from the fusion modules to facilitate multimodal representation learning. We demonstrate the benefits of our proposed model design on two egocentric video datasets: Ego4D and Aria. Our work is a key step toward probing human cognitive processes with computational models, and provides important insights into multimodal representation learning, visual forecasting and egocentric video understanding.

References

  • [1] Agrawal, R., Jyoti, S., Girmaji, R., Sivaprasad, S., Gandhi, V.: Does audio help in deep audio-visual saliency prediction models? In: Proceedings of the 2022 International Conference on Multimodal Interaction. pp. 48–56 (2022)
  • [2] Akbari, H., Yuan, L., Qian, R., Chuang, W.H., Chang, S.F., Cui, Y., Gong, B.: Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems 34, 24206–24221 (2021)
  • [3] Alayrac, J.B., Recasens, A., Schneider, R., Arandjelović, R., Ramapuram, J., De Fauw, J., Smaira, L., Dieleman, S., Zisserman, A.: Self-supervised multimodal versatile networks. Advances in Neural Information Processing Systems 33, 25–37 (2020)
  • [4] Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: Proceedings of the IEEE international conference on computer vision. pp. 609–617 (2017)
  • [5] Arandjelovic, R., Zisserman, A.: Objects that sound. In: Proceedings of the European conference on computer vision (ECCV). pp. 435–451 (2018)
  • [6] Chang, Q., Zhu, S.: Temporal-spatial feature pyramid for video saliency detection. Cognitive Computation (2021)
  • [7] Chen, H., Xie, W., Afouras, T., Nagrani, A., Vedaldi, A., Zisserman, A.: Localizing visual sounds the hard way. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16867–16876 (2021)
  • [8] Cheng, S., Gao, X., Song, L., Xiahou, J.: Audio-visual salieny network with audio attention module. In: 2021 2nd International Conference on Artificial Intelligence and Information Systems. pp. 1–5 (2021)
  • [9] Chudasama, V., Kar, P., Gudmalwar, A., Shah, N., Wasnik, P., Onoe, N.: M2fnet: Multi-modal fusion network for emotion recognition in conversation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4652–4661 (2022)
  • [10] Coutrot, A., Guyader, N.: Multimodal saliency models for videos. From Human Attention to Computational Attention: A Multidisciplinary Approach pp. 291–304 (2016)
  • [11] Damen, D., Doughty, H., Farinella, G.M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al.: Scaling egocentric vision: The epic-kitchens dataset. In: Proceedings of the European conference on computer vision (ECCV). pp. 720–736 (2018)
  • [12] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations (2020)
  • [13] Fan, H., Li, Y., Xiong, B., Lo, W.Y., Feichtenhofer, C.: Pyslowfast. https://github.com/facebookresearch/slowfast (2020)
  • [14] Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C.: Multiscale vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6824–6835 (2021)
  • [15] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 6202–6211 (2019)
  • [16] Gao, R., Feris, R., Grauman, K.: Learning to separate object sounds by watching unlabeled video. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 35–53 (2018)
  • [17] Gao, R., Oh, T.H., Grauman, K., Torresani, L.: Listen to look: Action recognition by previewing audio. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10457–10467 (2020)
  • [18] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics. pp. 249–256. JMLR Workshop and Conference Proceedings (2010)
  • [19] Gong, Y., Rouditchenko, A., Liu, A.H., Harwath, D., Karlinsky, L., Kuehne, H., Glass, J.: Contrastive audio-visual masked autoencoder. International Conference on Learning Representations (2022)
  • [20] Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al.: Ego4d: Around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18995–19012 (2022)
  • [21] Gurram, S., Fang, A., Chan, D., Canny, J.: Lava: Language audio vision alignment for contrastive video pre-training. arXiv preprint arXiv:2207.08024 (2022)
  • [22] Hayhoe, M., Ballard, D.: Eye movements in natural behavior. Trends in cognitive sciences 9(4), 188–194 (2005)
  • [23] Hu, D., Nie, F., Li, X.: Deep multimodal clustering for unsupervised audiovisual learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9248–9257 (2019)
  • [24] Hu, D., Qian, R., Jiang, M., Tan, X., Wen, S., Ding, E., Lin, W., Dou, D.: Discriminative sounding objects localization via self-supervised audiovisual matching. Advances in Neural Information Processing Systems 33, 10077–10087 (2020)
  • [25] Hu, X., Chen, Z., Owens, A.: Mix and localize: Localizing sound sources in mixtures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10483–10492 (2022)
  • [26] Huang, C., Tian, Y., Kumar, A., Xu, C.: Egocentric audio-visual object localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22910–22921 (2023)
  • [27] Huang, P.Y., Sharma, V., Xu, H., Ryali, C., Li, Y., Li, S.W., Ghosh, G., Malik, J., Feichtenhofer, C., et al.: Mavil: Masked audio-video learners. Advances in Neural Information Processing Systems 36 (2024)
  • [28] Huang, Y., Cai, M., Li, Z., Lu, F., Sato, Y.: Mutual context network for jointly estimating egocentric gaze and action. IEEE Transactions on Image Processing 29, 7795–7806 (2020)
  • [29] Huang, Y., Cai, M., Li, Z., Sato, Y.: Predicting gaze in egocentric video by learning task-dependent attention transition. In: Proceedings of the European conference on computer vision (ECCV). pp. 754–769 (2018)
  • [30] Huang, Y., Cai, M., Sato, Y.: An ego-vision system for discovering human joint attention. IEEE Transactions on Human-Machine Systems 50(4), 306–316 (2020)
  • [31] Jain, S., Yarlagadda, P., Jyoti, S., Karthik, S., Subramanian, R., Gandhi, V.: Vinet: Pushing the limits of visual modality for audio-visual saliency prediction. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 3520–3527. IEEE (2021)
  • [32] Jia, W., Liu, M., Rehg, J.M.: Generative adversarial network for future hand segmentation from egocentric video. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIII. pp. 639–656. Springer (2022)
  • [33] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
  • [34] Kazakos, E., Nagrani, A., Zisserman, A., Damen, D.: Epic-fusion: Audio-visual temporal binding for egocentric action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5492–5501 (2019)
  • [35] Korbar, B., Tran, D., Torresani, L.: Cooperative learning of audio and video models from self-supervised synchronization. Advances in Neural Information Processing Systems 31 (2018)
  • [36] Lai, B., Liu, M., Ryan, F., Rehg, J.: In the eye of transformer: Global-local correlation for egocentric gaze estimation. British Machine Vision Conference (2022)
  • [37] Li, Y., Fathi, A., Rehg, J.M.: Learning to predict gaze in egocentric video. In: Proceedings of the IEEE international conference on computer vision. pp. 3216–3223 (2013)
  • [38] Li, Y., Liu, M., Rehg, J.: In the eye of the beholder: Gaze and actions in first person video. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021)
  • [39] Lin, K.Q., Wang, A.J., Soldan, M., Wray, M., Yan, R., Xu, E.Z., Gao, D., Tu, R., Zhao, W., Kong, W., et al.: Egocentric video-language pretraining. Advances in Neural Information Processing Systems (2022)
  • [40] Lin, Y.B., Sung, Y.L., Lei, J., Bansal, M., Bertasius, G.: Vision transformers are parameter-efficient audio-visual learners. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2023)
  • [41] Liu, Y., Tan, Y., Lan, H.: Self-supervised contrastive learning for audio-visual action recognition. In: 2023 IEEE International Conference on Image Processing (ICIP). pp. 1000–1004. IEEE (2023)
  • [42] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  • [43] Lv, Z., Miller, E., Meissner, J., Pesqueira, L., Sweeney, C., Dong, J., Ma, L., Patel, P., Moulon, P., Somasundaram, K., Parkhi, O., Zou, Y., Raina, N., Saarinen, S., Mansour, Y.M., Huang, P.K., Wang, Z., Troynikov, A., Artal, R.M., DeTone, D., Barnes, D., Argall, E., Lobanovskiy, A., Kim, D.J., Bouttefroy, P., Straub, J., Engel, J.J., Gupta, P., Yan, M., Nardi, R.D., Newcombe, R.: Aria pilot dataset. https://about.facebook.com/realitylabs/projectaria/datasets (2022)
  • [44] Ma, S., Zeng, Z., McDuff, D., Song, Y.: Active contrastive learning of audio-visual video representations. International Conference on Learning Representations (2020)
  • [45] Ma, S., Zeng, Z., McDuff, D., Song, Y.: Contrastive learning of global-local video representations. arXiv preprint arXiv:2104.05418 (2021)
  • [46] Min, X., Zhai, G., Gu, K., Yang, X.: Fixation prediction through multimodal analysis. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 13(1), 1–23 (2016)
  • [47] Min, X., Zhai, G., Zhou, J., Zhang, X.P., Yang, X., Guan, X.: A multimodal saliency model for videos with high audio-visual correspondence. IEEE Transactions on Image Processing 29, 3805–3819 (2020)
  • [48] Morgado, P., Li, Y., Nvasconcelos, N.: Learning representations from audio-visual spatial alignment. Advances in Neural Information Processing Systems 33, 4733–4744 (2020)
  • [49] Morgado, P., Misra, I., Vasconcelos, N.: Robust audio-visual instance discrimination. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12934–12945 (2021)
  • [50] Morgado, P., Vasconcelos, N., Misra, I.: Audio-visual instance discrimination with cross-modal agreement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12475–12486 (2021)
  • [51] Nagrani, A., Yang, S., Arnab, A., Jansen, A., Schmid, C., Sun, C.: Attention bottlenecks for multimodal fusion. Advances in neural information processing systems 34, 14200–14213 (2021)
  • [52] Patrick, M., Asano, Y., Kuznetsova, P., Fong, R., Henriques, J.F., Zweig, G., Vedaldi, A.: Multi-modal self-supervision from generalized data transformations. arXiv preprint (2020)
  • [53] Praveen, R.G., de Melo, W.C., Ullah, N., Aslam, H., Zeeshan, O., Denorme, T., Pedersoli, M., Koerich, A.L., Bacon, S., Cardinal, P., et al.: A joint cross-attention model for audio-visual fusion in dimensional emotion recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2486–2495 (2022)
  • [54] Qian, R., Hu, D., Dinkel, H., Wu, M., Xu, N., Lin, W.: Multiple sound sources localization from coarse to fine. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16. pp. 292–308. Springer (2020)
  • [55] Ratajczak, R., Pellerin, D., Labourey, Q., Garbay, C.: A fast audiovisual attention model for human detection and localization on a companion robot. In: VISUAL 2016-The First International Conference on Applications and Systems of Visual Paradigms (VISUAL 2016) (2016)
  • [56] Ruesch, J., Lopes, M., Bernardino, A., Hornstein, J., Santos-Victor, J., Pfeifer, R.: Multimodal saliency-based bottom-up attention a framework for the humanoid robot icub. In: 2008 IEEE International Conference on Robotics and Automation. pp. 962–967. IEEE (2008)
  • [57] Schaefer, K., Süss, K., Fiebig, E.: Acoustic-induced eye movements. Annals of the New York Academy of Sciences 374, 674–688 (1981)
  • [58] Schauerte, B., Kühn, B., Kroschel, K., Stiefelhagen, R.: Multimodal saliency-based attention for object-based scene analysis. In: 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 1173–1179. IEEE (2011)
  • [59] Senocak, A., Kim, J., Oh, T.H., Li, D., Kweon, I.S.: Event-specific audio-visual fusion layers: A simple and new perspective on video understanding. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2237–2247 (2023)
  • [60] Senocak, A., Oh, T.H., Kim, J., Yang, M.H., Kweon, I.S.: Learning to localize sound source in visual scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4358–4366 (2018)
  • [61] Sidaty, N., Larabi, M.C., Saadane, A.: Toward an audiovisual attention model for multimodal video content. Neurocomputing 259, 94–111 (2017)
  • [62] Sigurdsson, G.A., Gupta, A., Schmid, C., Farhadi, A., Alahari, K.: Charades-ego: A large-scale dataset of paired third and first person videos. arXiv preprint arXiv:1804.09626 (2018)
  • [63] Soo Park, H., Shi, J.: Social saliency prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4777–4785 (2015)
  • [64] Tavakoli, H.R., Borji, A., Rahtu, E., Kannala, J.: Dave: A deep audio-visual embedding for dynamic saliency prediction. arXiv preprint arXiv:1905.10693 (2019)
  • [65] Tian, Y., Shi, J., Li, B., Duan, Z., Xu, C.: Audio-visual event localization in unconstrained videos. In: Proceedings of the European Conference on Computer Vision (ECCV) (September 2018)
  • [66] Tsiami, A., Koutras, P., Katsamanis, A., Vatakis, A., Maragos, P.: A behaviorally inspired fusion approach for computational audiovisual saliency modeling. Signal Processing: Image Communication 76, 186–200 (2019)
  • [67] Tsiami, A., Koutras, P., Maragos, P.: Stavis: Spatio-temporal audiovisual saliency network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4766–4776 (2020)
  • [68] Wang, G., Chen, C., Fan, D.P., Hao, A., Qin, H.: From semantic categories to fixations: A novel weakly-supervised visual-auditory saliency detection approach. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15119–15128 (2021)
  • [69] Wang, G., Chen, C., Fan, D.P., Hao, A., Qin, H.: Weakly supervised visual-auditory fixation prediction with multigranularity perception. arXiv preprint arXiv:2112.13697 (2021)
  • [70] Wang, W., Tran, D., Feiszli, M.: What makes training multi-modal classification networks hard? In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12695–12705 (2020)
  • [71] Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7794–7803 (2018)
  • [72] Xiao, F., Lee, Y.J., Grauman, K., Malik, J., Feichtenhofer, C.: Audiovisual slowfast networks for video recognition. arXiv preprint arXiv:2001.08740 (2020)
  • [73] Xiong, J., Wang, G., Zhang, P., Huang, W., Zha, Y., Zhai, G.: Casp-net: Rethinking video saliency prediction from an audio-visual consistency perceptual perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6441–6450 (2023)
  • [74] Yang, Q., Li, Y., Li, C., Wang, H., Yan, S., Wei, L., Dai, W., Zou, J., Xiong, H., Frossard, P.: Svgc-ava: 360-degree video saliency prediction with spherical vector-based graph convolution and audio-visual attention. IEEE Transactions on Multimedia (2023)
  • [75] Yao, S., Min, X., Zhai, G.: Deep audio-visual fusion neural network for saliency estimation. In: 2021 IEEE International Conference on Image Processing (ICIP). pp. 1604–1608. IEEE (2021)
  • [76] Zhang, M., Ma, K.T., Lim, J.H., Zhao, Q., Feng, J.: Anticipating where people will look using adversarial networks. IEEE transactions on pattern analysis and machine intelligence 41(8), 1783–1796 (2018)
  • [77] Zhang, M., Teck Ma, K., Hwee Lim, J., Zhao, Q., Feng, J.: Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4372–4381 (2017)

Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation

Supplementary Material

This is the supplementary material for the paper titled "Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation". We organize the content as follows:

• A – Comparison with Prior Audio-Visual Learning Strategies

• B – Additional Experiment Results

  ⋄ B.1 – Experiments about Model Generalization Capability

  ⋄ B.2 – Experiments on Egocentric Gaze Estimation

  ⋄ B.3 – Additional Experiments on Contrastive Learning

  ⋄ B.4 – Additional Visualization

• C – More Implementation Details

  ⋄ C.1 – Implementation Details of Our Model

  ⋄ C.2 – Implementation Details of Baseline Fusion Strategies

• D – Limitation and Future Work

• E – Code and License

Appendix A Comparison with Prior Audio-Visual Learning Strategies

We have specified the key differences between the egocentric gaze anticipation task and the saliency prediction task in the second paragraph of Sec. 2 of the main paper. The experiment results also validate that our proposed spatial-temporal separable fusion strategy performs better on our task than other fusion strategies designed for saliency prediction and action recognition (please refer to Tab. 2 in the main paper). In this section, we further compare our model with typical audio-visual learning methods for saliency prediction and recognition tasks in terms of model design.

In Tab. 5, all prior methods are designed for exocentric videos (i.e., third-person videos) that have a fixed camera viewpoint across all frames. Though various fusion approaches are used in these methods, they all fuse audio-visual embeddings jointly in time and space. In contrast, the egocentric gaze anticipation task poses the unique challenges of a moving viewpoint together with the latency between audio stimuli and human reactions. To address these challenges, our model uses a novel spatial-temporal separable fusion strategy that has not been studied in prior work. The experiments in Tab. 2 of the main paper show that our method achieves the best performance on the egocentric gaze anticipation task compared with prior audio-visual learning strategies. In addition, using contrastive learning to boost audio-visual representations for a specific task is still an understudied area. Huang et al. [27] use inter- and intra-contrastive losses to learn aligned audio and visual embeddings. However, they apply the contrastive loss directly on the raw embeddings right after the encoders. In our model, we instead propose to apply the contrastive loss on the embeddings after the fusion layers (i.e., post-fusion contrastive learning), and we validate its advantage in Tab. 3 of the main paper. These key differences consolidate our contributions and clearly distinguish our model from other audio-visual learning methods.

Table 5: Comparison with typical audio-visual learning methods for audio-visual saliency prediction and recognition. If more than one fusion strategy was explored in a method, we only show the strategy leading to the best performance.
Methods View Fusion Contrastive Architecture Task
Tavakoli et al. [64] Exo Concatenation w/o CNN Saliency Prediction
Min et al. [47] Exo Correlation Analysis w/o CNN Saliency Prediction
Tsiami et al. [67] Exo Bilinear w/o CNN Saliency Prediction
Yao et al. [75] Exo Inner Product w/o CNN Saliency Prediction
Chang et al. [6] Exo Bilinear w/o CNN Saliency Prediction
Jain et al. [31] Exo Bilinear w/o CNN Saliency Prediction
Wang et al. [68] Exo Concatenation w/o CNN Saliency Prediction
Xiong et al. [73] Exo Self-Attention w/o CNN Saliency Prediction
Nagrani et al. [51] Exo Attention Bottleneck w/o Transformer Video Classification
Huang et al. [27] Exo Self-Attention w/ Transformer Video Classification
Gao et al. [16] Exo Linear w/o LSTM Action Recognition
Kazakos et al. [34] Exo Linear w/o CNN Action Recognition
Wang et al. [70] Exo Weighted Sum w/o CNN Action Recognition
Xiao et al. [72] Exo Self-Attention w/o CNN Action Recognition
Liu et al. [41] Exo Linear w/o CNN Action Recognition
Senocak et al. [59] Exo Linear w/o CNN Action Recognition
Praveen et al. [53] Exo Self-Attention w/o CNN Emotion Recognition
Chudasama et al. [9] Exo Self-Attention w/o Transformer Emotion Recognition
CSTS (Ours) Ego Spatial-Temporal Separable w/ Transformer Gaze Anticipation

Appendix B Additional Experiment Results

Table 6: Zero-shot experiments on the Aria dataset. All baselines and our model are trained only on the Ego4D training set. We consider the F1 score as the primary metric in our experiments. The green row refers to our model, and the best results are highlighted with boldface. See Sec. B.1 for further discussion.
Methods F1 Score Recall Precision
GazeMLE [38] 44.0 59.0 35.0
AttnTransit [29] 43.1 57.5 34.5
I3D-R50 [15] 41.5 77.2 28.4
MViT [14] 44.1 59.7 35.0
GLC [36] 46.9 72.8 34.6
DFG [77] 39.3 80.4 26.0
DFG+ [76] 43.1 76.4 30.0
CSTS 50.8 62.2 42.9

B.1 Experiments about Model Generalization Capability

To validate the generalization capability of our model, we compare it with prior state-of-the-art models in a zero-shot setting. Specifically, we train our model and all baselines on the Ego4D training set and test them on the Aria test set. Note that the Aria data is unseen by all models during training. The results are presented in Tab. 6. Our model outperforms the best egocentric gaze anticipation model (DFG+) by +7.7% and exceeds the strongest baseline (GLC) by +3.9% in F1 score (the primary metric). This remarkable improvement suggests that, with our novel fusion and contrastive learning approaches, our model generalizes better to unseen data, which is critical for real-world applications.

Table 7: Comparison with prior state-of-the-art models on egocentric gaze estimation. The green row refers to our model. The best results are highlighted with boldface. See Sec. B.2 for further discussion.
Methods Ego4D Aria
F1 Score Recall Precision F1 Score Recall Precision
Center Prior 14.9 21.9 11.3 28.9 21.7 43.1
GazeMLE [38] 35.4 49.7 27.5 58.7 63.4 54.7
AttnTransit [29] 36.4 47.6 29.5 59.2 60.2 58.3
I3D-R50 [15] 37.5 52.5 29.2 60.9 69.5 54.2
MViT [14] 40.9 57.4 31.7 61.7 71.2 54.5
GLC [36] 43.1 57.0 34.7 63.2 67.4 59.5
CSTS 43.7 58.0 35.1 64.5 69.6 60.1

B.2 Experiments on Egocentric Gaze Estimation

In addition to egocentric gaze anticipation, we also evaluate our model on another gaze modeling problem – egocentric gaze estimation. Instead of forecasting future gaze, egocentric gaze estimation requires predicting the gaze target in the current video frames. We use the same experimental setup as the recent state-of-the-art method [36].

As demonstrated in Tab. 7, the prior work [36] has shown the superiority of transformer-based architectures for egocentric gaze estimation. By incorporating the audio modality, CSTS surpasses its vision-only backbone MViT [14] by +2.8% in F1 score on both Ego4D and Aria. These results indicate that the audio modality also makes important contributions to egocentric gaze estimation. Furthermore, our model outperforms GLC [36] by +0.6% and +1.3% on Ego4D and Aria, respectively, achieving a new state-of-the-art performance for this problem. However, our method yields a smaller improvement on gaze estimation than on gaze anticipation. A possible reason is that the audio stream has a stronger connection with future gaze targets than with current gaze behaviors because of the natural latency between audio stimuli and human reactions.

B.3 Additional Experiments on Contrastive Learning

Table 8: Study of different strategies for implementing the contrastive loss. Post Contr refers to our proposed post-fusion contrastive learning strategy, and the green row refers to the complete CSTS model. The best results are highlighted with boldface. See Sec. B.3 for further discussion.
Methods Ego4D Aria
F1 Score Recall Precision F1 Score Recall Precision
STS + Vanilla Contr 39.0 53.7 30.6 59.1 66.5 53.1
STS + S-Contr 38.5 53.5 30.0 59.0 66.3 53.1
STS + T-Contr 38.9 54.0 30.5 59.0 66.7 53.0
STS + Cross Contr 38.9 54.4 30.2 59.3 66.8 53.3
STS + Post Contr 39.7 53.3 31.6 59.9 66.8 54.3

In our model, we propose to use the audio-visual representations obtained after fusion (i.e., $u_v$ and $u_a$) to calculate the contrastive loss (i.e., post-fusion contrastive learning). As a comparison, we also implement a baseline by feeding the raw embeddings from the encoders (i.e., $\phi(x)$ and $\psi(a)$) into the contrastive loss, which is denoted as Vanilla Contr. To further investigate the contribution of contrastive learning, we also conduct experiments with three additional strategies:

Cross Contr. In our final model (CSTS), we use the new visual representation $u_v = u_{v,s} \otimes u_{v,t}$ and the new audio representation $u_a = \psi(a) \otimes u_{a,t}$ as input to the contrastive loss. In Cross Contr, we still use $u_v$ but replace $u_a$ by reweighting the audio representation $u_{a,s}$ from the spatial fusion with the weight $u_{a,t}$ from the temporal fusion, i.e., $u_a^* = u_{a,s} \otimes u_{a,t}$, as input to the contrastive loss. Please refer to Fig. 2 in the main paper for the meaning of each notation.

S-Contr. We use the output of the spatial fusion module ($u_{v,s}$, $u_{a,s}$) to calculate the contrastive loss.

T-Contr. We use the output of the temporal fusion module ($u_{v,t}$, $u_{a,t}$) to calculate the contrastive loss.

We implement all the contrastive learning baselines above on our proposed model architecture and fusion strategy (i.e., STS). The results are summarized in Tab. 8. Both S-Contr and T-Contr lag behind or perform on par with Vanilla Contr. One possible reason is that conducting contrastive learning using features from only one fusion branch may compromise the representation learning of the other branch. Additionally, Cross Contr works on par with Vanilla Contr on Ego4D but performs better on Aria, and it consistently outperforms S-Contr and T-Contr. This result validates our claim that implementing the contrastive loss with reweighted representations from both spatial and temporal fusion leads to larger gains for egocentric gaze anticipation. Moreover, our proposed strategy (reweighting the raw audio embedding $\psi(a)$ rather than the fused embedding after spatial fusion) outperforms Cross Contr. This is because, in Cross Contr, $u_{a,s}$ is derived from spatial fusion, where each audio token is fused with 64 visual tokens, diluting the audio features. All results further demonstrate the benefits of our proposed contrastive learning strategy.
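For concreteness, the sketch below illustrates how a post-fusion contrastive objective of this form can be computed: a symmetric InfoNCE loss over the re-weighted embeddings $u_v$ and $u_a$, using the $D'=256$ projection and temperature 0.05 described in Sec. C.1. The mean pooling and projection-head details are illustrative assumptions, not the exact released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PostFusionContrastiveLoss(nn.Module):
    """Minimal sketch of post-fusion contrastive learning: symmetric InfoNCE
    between the re-weighted visual and audio embeddings u_v and u_a."""

    def __init__(self, dim=768, proj_dim=256, temperature=0.05):
        super().__init__()
        self.proj_v = nn.Linear(dim, proj_dim)  # projection heads are assumptions
        self.proj_a = nn.Linear(dim, proj_dim)
        self.temperature = temperature

    def forward(self, u_v, u_a):
        # u_v: (B, tokens, D) re-weighted visual tokens; u_a: (B, tokens, D) re-weighted audio tokens
        z_v = F.normalize(self.proj_v(u_v.mean(dim=1)), dim=-1)   # (B, proj_dim)
        z_a = F.normalize(self.proj_a(u_a.mean(dim=1)), dim=-1)   # (B, proj_dim)
        logits = z_v @ z_a.t() / self.temperature                 # (B, B) cosine similarities
        targets = torch.arange(z_v.size(0), device=z_v.device)    # matched pairs on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage with illustrative token shapes (8 frames x 64 tokens, D = 768):
loss_fn = PostFusionContrastiveLoss()
u_v, u_a = torch.randn(4, 8 * 64, 768), torch.randn(4, 8 * 64, 768)
print(loss_fn(u_v, u_a))
```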

Figure 6: Additional egocentric gaze anticipation results from our model and other baselines. Green dots indicate the ground truth gaze location. The first two examples are from the Ego4D dataset, and the last example is from the Aria dataset.
Figure 7: Failure cases of our model and baselines. Green dots indicate the ground truth gaze location. The first example is from the Ego4D dataset, and the second example is from the Aria dataset.

B.4 Additional Visualization

We showcase more qualitative comparisons with all the baselines for egocentric gaze anticipation in Fig. 6. We observe that CSTS makes the most accurate predictions. We also illustrate typical failure cases in Fig. 7. In the first example, our model makes an accurate prediction in the first frame but fails at the following time steps due to the gaze movement. In the second example, the camera view and gaze target move from left to right. This drastic change causes errors in our model's predictions, and similar failures occur in the predictions of all baselines. Notably, existing deep models tend to successfully anticipate only steady gaze fixations or small gaze movements in the near future, and cannot effectively capture large gaze shifts. This is a common limitation shared by many existing works on future anticipation in egocentric videos [32].

Appendix C More Implementation Details

C.1 Implementation Details of Our Model

Architecture. Inspired by [17], we use a lightweight audio encoder composed of four self-attention blocks from MViT [14]. The model architecture is further detailed in Tab. 9. We initialize the video encoder with Kinetics-400 pretraining [33] and initialize the audio encoder using Xavier initialization [18]. The resulting video embeddings $\phi(x)$ have dimensions $T=4$, $H=8$, $W=8$, $D=768$, and the resulting audio embeddings $\psi(a)$ have dimensions $T=4$, $M=64$, $D=768$. We follow [39] to map the audio-visual representation vectors to dimension $D'=256$ for the contrastive loss. The output of the decoder is a downsampled heatmap, which is upsampled to match the input size using trilinear interpolation. Following [36], we add intermediate features from each video encoder block to the corresponding decoder block output via skip connections to compensate for the loss of low-level textures.
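As a small illustration of the last step, the snippet below upsamples a decoder heatmap of the size listed in Tab. 9 to the input resolution with trilinear interpolation; the tensor shapes follow the table, while the code itself is only a sketch, not the released implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative only: upsample the decoder's downsampled heatmap (8x64x64x1 in Tab. 9)
# back to the 256x256 input resolution with trilinear interpolation.
logits = torch.randn(2, 1, 8, 64, 64)  # (B, C, T, H, W) output of the decoder head
heatmap = F.interpolate(logits, size=(8, 256, 256), mode="trilinear", align_corners=False)
print(heatmap.shape)  # torch.Size([2, 1, 8, 256, 256])
```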

Training. We set both the temperature factor $\mathcal{T}$ of the contrastive loss and the re-weighting parameter $\alpha$ to 0.05. Following [38, 36], we use a Gaussian distribution with a kernel size of 19, centered on the gaze location in each frame, as the ground-truth gaze heatmap during training. The model is trained with the AdamW optimizer [42] for 15 epochs. The momentum and weight decay are set to 0.9 and 0.05, respectively. The initial learning rate is $10^{-4}$ and decays following a cosine schedule. The model is trained with a batch size of 8 across 4 GPUs.
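The sketch below summarizes this recipe: a Gaussian ground-truth heatmap centered on the gaze location and the AdamW/cosine-decay configuration. The sigma-from-kernel-size rule and the stand-in model are illustrative assumptions; only the hyperparameter values follow the paragraph above.

```python
import torch
import torch.nn as nn

def make_gaze_heatmap(gaze_xy, size=64, kernel_size=19):
    """Ground-truth heatmap: a Gaussian of the given kernel size centered on the gaze location."""
    sigma = kernel_size / 6.0  # assumed rule of thumb, not from the paper
    ys, xs = torch.meshgrid(torch.arange(size, dtype=torch.float32),
                            torch.arange(size, dtype=torch.float32), indexing="ij")
    gx, gy = gaze_xy
    heatmap = torch.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2 * sigma ** 2))
    return heatmap / heatmap.sum()

model = nn.Conv3d(3, 1, kernel_size=1)  # stand-in for the CSTS network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=0.05)  # momentum 0.9, weight decay 0.05
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15)  # cosine decay over 15 epochs

print(make_gaze_heatmap((32.0, 40.0)).shape)  # torch.Size([64, 64])
```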

C.2 Implementation Details of Baseline Fusion Strategies

We compare with multiple audio-visual fusion strategies in Tab. 2 of the main paper. The details of each baseline are listed as follows:

Linear. We reshape the video embedding and audio embedding to the shape $\hat{N}\times D$. We concatenate the two reshaped embeddings (resulting in dimension $\hat{N}\times 2D$) and feed the result into two linear layers. The output has dimension $\hat{N}\times D$ and is reshaped back to $T\times H\times W\times D$ before being fed into the decoder.
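A minimal sketch of this baseline under the token shapes in Tab. 9 is given below; the activation between the two linear layers is an assumption.

```python
import torch
import torch.nn as nn

# Sketch of the Linear fusion baseline with the shapes from Tab. 9 (T=4, H=W=8, M=64, D=768).
B, T, H, W, M, D = 2, 4, 8, 8, 64, 768
video = torch.randn(B, T, H, W, D)
audio = torch.randn(B, T, M, D)

n_hat = T * H * W                       # 256 video tokens; also 256 audio tokens since T*M = 256
v = video.reshape(B, n_hat, D)
a = audio.reshape(B, n_hat, D)
fuse = nn.Sequential(nn.Linear(2 * D, D), nn.GELU(), nn.Linear(D, D))  # two linear layers
out = fuse(torch.cat([v, a], dim=-1)).reshape(B, T, H, W, D)           # fed into the decoder
print(out.shape)  # torch.Size([2, 4, 8, 8, 768])
```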

Bilinear. We reduce the length of the video tokens and audio tokens to 256 using a linear layer for each modality. We then feed the resulting video and audio tokens into a bilinear layer, and the output is fed into the decoder for gaze forecasting.
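The sketch below shows one possible reading of this baseline, treating the length reduction as a projection to 256 channels before a token-wise bilinear layer; this interpretation is an assumption, not the exact baseline implementation.

```python
import torch
import torch.nn as nn

# One possible reading of the Bilinear baseline (an assumption): project both token
# streams to 256 channels, then fuse aligned video/audio tokens with a bilinear layer.
B, T, H, W, M, D, D_low = 2, 4, 8, 8, 64, 768, 256
video = torch.randn(B, T * H * W, D)    # 256 flattened video tokens
audio = torch.randn(B, T * M, D)        # 256 flattened audio tokens

proj_v, proj_a = nn.Linear(D, D_low), nn.Linear(D, D_low)
bilinear = nn.Bilinear(D_low, D_low, D)
fused = bilinear(proj_v(video), proj_a(audio)).reshape(B, T, H, W, D)  # fed into the decoder
print(fused.shape)  # torch.Size([2, 4, 8, 8, 768])
```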

Concat. We reshape the audio embedding $\psi(a)\in\mathbb{R}^{T\times M\times D}$ to the same shape as the video embedding $\phi(x)\in\mathbb{R}^{T\times H\times W\times D}$ and concatenate them along the channel dimension to obtain an audio-visual representation of dimension $T\times H\times W\times 2D$. This representation is fed into the decoder for gaze forecasting.
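A minimal sketch of this baseline is shown below; it relies only on the fact that $M = H\times W = 64$, so the audio tokens can be laid out on the 8×8 spatial grid.

```python
import torch

# Sketch of the Concat baseline: audio tokens are laid out on the 8x8 spatial grid
# (valid because M = H*W = 64) and concatenated with the video tokens channel-wise.
B, T, H, W, M, D = 2, 4, 8, 8, 64, 768
video = torch.randn(B, T, H, W, D)
audio = torch.randn(B, T, M, D)

audio_grid = audio.reshape(B, T, H, W, D)
fused = torch.cat([video, audio_grid], dim=-1)  # (B, T, H, W, 2D), fed into the decoder
print(fused.shape)  # torch.Size([2, 4, 8, 8, 1536])
```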

Vanilla SA. In this baseline, we flatten the video embedding and audio embedding into a list of tokens and thereby obtain $T\times(N+M)$ tokens in total, where $N=H\times W$. Then we input all tokens to a standard self-attention layer followed by multiple linear layers to perform fusion in the spatial and temporal dimensions simultaneously. We split the output into a new visual embedding incorporating audio information, with dimension $T\times N\times D$, and a new audio embedding incorporating visual information, with dimension $T\times M\times D$. The new visual embedding is input into the decoder.
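The sketch below illustrates this joint space-time fusion; the residual/MLP arrangement and the number of attention heads are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Sketch of the Vanilla SA baseline: all video and audio tokens form one joint
# space-time token list and attend to each other in a single self-attention layer.
B, T, H, W, M, D = 2, 4, 8, 8, 64, 768
N = H * W
video = torch.randn(B, T * N, D)                 # flattened video tokens
audio = torch.randn(B, T * M, D)                 # flattened audio tokens

tokens = torch.cat([video, audio], dim=1)        # (B, T*(N+M), D)
attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

h, _ = attn(tokens, tokens, tokens)              # joint spatial-temporal self-attention
h = tokens + h                                   # residual connection
h = h + mlp(h)                                   # linear layers with residual
v_out, a_out = h.split([T * N, T * M], dim=1)    # the visual stream is fed into the decoder
print(v_out.shape, a_out.shape)  # torch.Size([2, 256, 768]) torch.Size([2, 256, 768])
```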

STS. This is a baseline using the same fusion strategy as our method but without using the contrastive loss for training.

Appendix D Limitation and Future Work

In this paper, we propose a novel contrastive spatial-temporal separable fusion model for audio-visual egocentric gaze anticipation. Our method is validated on the Ego4D [20] and Aria [43] datasets. Our method achieves a larger performance improvement on the Aria dataset than on the Ego4D dataset. We believe this is because the multi-person social interaction setting of the Ego4D dataset incurs additional challenges for audio representation learning, such as multiple people and speakers being present. Our current model design does not explicitly address this challenging nature of multi-speaker social interactions. Another limitation is that our model fails to anticipate drastic gaze movements (see the failure cases in Fig. 7). In addition, in this work we do not explore the spatial geometry context provided by multi-channel audio signals. Our approach and experiments suggest several important future research directions:

  • The proposed CSTS model can be applied to other video understanding tasks related to the audio modality, such as action recognition, action localization, and video question answering. We hope to further investigate our proposed approach on these problem settings.
  • A model explicitly designed for audio-visual representation learning in multi-person, multi-speaker environments merits further investigation.
  • A model that learns better temporal representations for anticipating large gaze shifts remains to be explored.
  • The visualization of correlation weights in the spatial fusion module indicates the potential of our model for weakly-supervised/self-supervised sound localization and active speaker detection, which can be investigated in future work.

Appendix E Code and License

The usage of the Aria dataset is under the Apache 2.0 License (https://github.com/facebookresearch/vrs/blob/main/LICENSE), and the usage of the Ego4D dataset is under its license agreement (https://ego4d-data.org/pdfs/Ego4D-Licenses-Draft.pdf). Our implementation is built on top of [13], which is under the Apache License (https://github.com/facebookresearch/SlowFast/blob/main/LICENSE). Our code and the train/test split on the Aria dataset will be available at: https://bolinlai.github.io/CSTS-EgoGazeAnticipation/.

Stage | Operators | Output Size
Video Encoder φ(x):
video frames | – | 8×256×256×3
video token embedding | Conv(3×7×7, 96), stride 2×4×4 | 4×64×64×96
tokenization | flattening | (4×64×64)×96
video encoder block1 | [MSA(96); MLP(384)] ×1 | (4×64×64)×192
video encoder block2 | [MSA(192); MLP(768)] ×2 | (4×32×32)×384
video encoder block3 | [MSA(384); MLP(1536)] ×11 | (4×16×16)×768
video encoder block4 | [MSA(768); MLP(3072)] ×2 | (4×8×8)×768
Audio Encoder ψ(a):
audio spectrograms | – | 8×256×256×1
audio token embedding | Conv(3×7×7, 96), stride 2×4×4 | 4×64×64×96
tokenization | flattening | (4×64×64)×96
audio encoder block1 | [MSA(96); MLP(384)] ×1 | (4×4096)×192
audio encoder block2 | [MSA(192); MLP(768)] ×1 | (4×1024)×384
audio encoder block3 | [MSA(384); MLP(1536)] ×1 | (4×256)×768
audio encoder block4 | [MSA(768); MLP(3072)] ×1 | (4×64)×768
Fusion Modules:
conv1 | Conv(768×1×8×8, 768), stride 1×1×1 | 4×1×768
in-frame self-attention σ(·) | [MSA(768); MLP(3072)] ×1 | 4×(64+1)×768
conv2 | Conv(768×1×8×8, 768), stride 1×1×1 | 4×1×768
conv3 | Conv(768×1×8×8, 768), stride 1×1×1 | 4×1×768
cross-frame self-attention π(·) | [MSA(768); MLP(3072)] ×1 | 8×1×768
reweighting | u_{v,s} ⊗ u_{v,t} | 8×64×768
reweighting | ψ(a) ⊗ u_{a,t} | 8×64×768
Decoder:
decoder block1 | [MSA(1536); MLP(3072)] ×1 | (4×16×16)×768
decoder block2 | [MSA(768); MLP(1536)] ×1 | (4×32×32)×384
decoder block3 | [MSA(384); MLP(768)] ×1 | (4×64×64)×192
decoder block4 | [MSA(192); MLP(384)] ×1 | (8×64×64)×96
head | Conv(1×1×1, 1), stride 1×1×1 | 8×64×64×1
Table 9: Architecture of the proposed model. Convolutional layers are denoted as Conv(kernel size, output channels). The number of input channels of each multi-head self-attention layer is shown in the parentheses of MSA. The dimension of the hidden layer of each multi-layer perceptron is listed in the parentheses of MLP. conv1 is the convolutional layer in the spatial fusion module. conv2 and conv3 are the convolutional layers in the temporal fusion module.