Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering

Haopeng Li, Qiuhong Ke, Mingming Gong, and Tom Drummond

Haopeng Li and Tom Drummond are with the School of Computing and Information Systems, University of Melbourne. E-mail: [email protected], [email protected]. Qiuhong Ke is with the Department of Data Science & AI, Monash University and the School of Computing and Information Systems, University of Melbourne. E-mail: [email protected]. Mingming Gong is with the School of Mathematics and Statistics, University of Melbourne. E-mail: [email protected].
Abstract

While significant advancements have been made in video question answering (VideoQA), the potential benefits of enhancing model generalization through tailored difficulty scheduling have been largely overlooked in existing research. This paper seeks to bridge that gap by incorporating VideoQA into a curriculum learning (CL) framework that progressively trains models from simpler to more complex data. Recognizing that conventional self-paced CL methods rely on training loss for difficulty measurement, which might not accurately reflect the intricacies of video-question pairs, we introduce the concept of uncertainty-aware CL. Here, uncertainty serves as the guiding principle for dynamically adjusting the difficulty. Furthermore, we address the challenge posed by uncertainty by presenting a probabilistic modeling approach for VideoQA. Specifically, we conceptualize VideoQA as a stochastic computation graph, where the hidden representations are treated as stochastic variables. This yields two distinct types of uncertainty: one related to the inherent uncertainty in the data and another pertaining to the model’s confidence. In practice, we seamlessly integrate the VideoQA model into our framework and conduct comprehensive experiments. The findings affirm that our approach not only achieves enhanced performance but also effectively quantifies uncertainty in the context of VideoQA.

Index Terms:
Video question answering, curriculum learning, uncertainty, stochastic computation graph.

Video question answering (VideoQA) has garnered increasing attention from researchers in recent years [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. Significant efforts have been dedicated to enhancing various aspects of this task, including video encoding [13, 4], interaction between videos and questions [14, 6], and feature fusion [2, 5]. Nevertheless, existing works train VideoQA models in a random order, overlooking the fact that optimizing a VideoQA model is essentially a teaching process, during which the model learns to answer questions of varying difficulty. It has been proven that presenting training examples in a meaningful order, as opposed to a random order, can enhance the generalization capacity of models across a wide range of tasks [15, 16, 17]. Such strategies are referred to as curriculum learning (CL) [18], wherein the model is gradually exposed to basic knowledge before advancing to more complex concepts, mimicking human learning. In this work, our goal is to incorporate VideoQA into CL, aiming to enhance the performance of the models.

A major challenge in integrating CL with VideoQA is quantifying data difficulty. Many existing self-paced curriculum learning (SPL) approaches utilize the training loss as a measure of difficulty quantification [19, 20, 21, 22]. However, this approach has its shortcomings. Firstly, the training loss generally measures the discrepancy between predictions and ground truth but cannot precisely reflect the inherent difficulty of data. For instance, a math problem could be difficult even if it has a simple answer. Secondly, the training loss varies significantly during training and across different tasks, necessitating a meticulous design of the training scheduler for stable optimization and improved performance [23, 20, 24]. In response to these challenges, we propose an enhancement to CL for VideoQA by incorporating the principle of uncertainty into the dynamic scheduling of difficulty. We term this approach uncertainty-aware curriculum learning (UCL) for VideoQA. Compared to the training loss, uncertainty offers the advantage of being independent of the ground truth and better reflecting the inherent difficulty of the data. Intuitively, a high-uncertainty video-question pair indicates the presence of potential noise or the model’s lack of confidence in its prediction, making it more challenging to handle.

To quantify data uncertainty and alleviate its negative impact, we propose the utilization of probabilistic modeling for VideoQA. Specifically, we treat VideoQA as a stochastic computation graph [25], wherein a video and a question serve as inputs, subsequently undergoing encoding into stochastic representations by a visual encoder and a text encoder. The final predictive distribution of the answer is derived through a combination of the video-question interaction module and the answer prediction module, employing variational inference [26, 27]. Within the framework of probabilistic modeling, we define two forms of uncertainty: feature uncertainty, which gauges the intrinsic uncertainty in the data, and predictive uncertainty, which quantifies the model’s confidence in its predictions. Notably, our approach to probabilistic modeling and uncertainty quantification remains applicable to both classification and regression tasks. Our contributions can be summarized as follows:

  • We develop a self-paced CL framework for VideoQA, where the difficulty of data is measured by the uncertainty that reflects the inherent characteristic of data.

  • We propose probabilistic modeling for VideoQA by considering VideoQA as a stochastic computation graph to capture the data uncertainty and mitigate its impact.

  • We integrate VideoQA into our uncertainty-aware curriculum learning framework and conduct extensive experiments. The results show that our method achieves better performance and valid uncertainty quantification.

I Related Work

I-A Video Question Answering

Video question answering (VideoQA) generalizes visual question answering [28, 29, 30, 31, 32] from the image domain to the video domain. It requires temporal reasoning over a sequence of events in videos, and various techniques have been exploited for it, such as the attention mechanism [33, 34], graph neural networks [2, 35, 36], memory networks [5, 4], and hierarchical structures [37, 13]. For example, a dual-LSTM-based approach with both spatial and temporal attention is proposed in [38]. MASN [2] models each object as a graph node and captures the spatial and temporal dependencies of all objects with graph neural networks. HQGA [13] models the video as a conditional graph hierarchy to align with the multi-granular nature of questions and achieves strong results on MSVD and MSRVTT [39]. Besides the efforts in improving model structures, many works develop new frameworks or methodologies for this task [3, 40]. For example, invariant grounding is exploited in [3] for VideoQA, which aims to find the question-critical scenes whose causal relations with answers are invariant. The atemporal probe (ATP) [40] reduces the video-language task to image-level understanding, providing a stronger image-level baseline in the video-language setting than randomly selected frames. Recently, large-scale video-text pretraining has shown great power in promoting multimodal video understanding [41, 42, 43]. For instance, MERLOT [41], VIOLET [42], and All-in-One [43] attain state-of-the-art performance on several VideoQA benchmarks. However, they require large-scale data and computational resources for training. In this work, we follow [2, 13, 40, 3] and compare only with methods without pretraining, so as to show the effectiveness of the modeling approach rather than that of more training data. Despite the progress made in VideoQA, existing works do not consider the impact of the order of training samples or the uncertainty in the data. In this work, we propose a new training framework for VideoQA that performs appropriate difficulty scheduling based on uncertainty.

I-B Curriculum Learning

Curriculum learning (CL) [18] emulates the human learning process by starting with easier tasks and gradually progressing to more challenging ones. Two central components of CL are the difficulty measure and the training scheduler. In the case of self-paced CL (SPL, where difficulty is measured during training), the loss function is often used as the difficulty measure [19, 20, 21, 22]. Initially, during training, samples with higher losses are excluded from optimization. As training advances, the threshold is gradually increased to incorporate more complex data into the optimization process. However, relying solely on loss might not accurately represent the inherent difficulty of data, as difficulty is an intrinsic attribute of samples and should be independent of ground truth labels. To overcome this limitation, we propose employing uncertainty as the difficulty measure for SPL. To the best of our knowledge, [44] is the only work that also uses uncertainty for CL. However, our method is essentially different from it: 1) We derive uncertainty by probabilistic modeling, while it obtains data uncertainty by a pretrained language model (predefined CL); 2) We perform CL by re-weighting the data, while it adopts baby step [18] to arrange data; 3) We focus on VideoQA, while it addresses neural machine translation. The pretrained model and the training scheduler based on baby step make [44] more complex to implement than ours.

I-C Uncertainty Modeling

In Bayesian modeling, there exist two main types of uncertainty: model (epistemic) uncertainty and data (aleatoric) uncertainty [45, 46]. Specifically, model uncertainty accounts for uncertainty in the model parameters and comes from our ignorance about which model generated the data. This type of uncertainty can be reduced by giving more training data. As for data uncertainty, it captures the inherent noise in our observations such as blur in images and videos. This type of uncertainty is an inherent characteristic of data and cannot be alleviated with more collected data. A great number of methods exploit data uncertainty and achieve considerable improvements in various tasks [47, 48, 49, 50]. For example, DeNet [51] is proposed to resolve query uncertainty and label uncertainty in temporal grounding, where a decoupling module and a de-bias mechanism are designed for the probabilistic language encoding and diverse temporal regression. UGPT [52] is presented for complex action recognition, where the attention scores in Transformer are modeled as probabilistic variables to capture the complex and long-term interaction of actions. Besides, uncertainty has also been applied to CL [53, 44]. For instance, in the case of [44], they incorporated data and model uncertainty into the CL process for neural machine translation, focusing on pre-computed data uncertainty to facilitate a baby-step-based CL approach. However, this approach introduces complexity due to the need for prior data uncertainty computation. In contrast, our method stands out by dynamically learning uncertainty during training, which is then utilized to adjust sample weights, leading to a more straightforward and practical implementation. Moreover, the work presented in [53] utilized snippet-level uncertainty to assign varying weights to different snippets in the context of weakly-supervised temporal action localization. This was accomplished through the lens of evidential deep learning [54, 55], with a focus on addressing intra-action variation. Our approach, however, distinguishes itself by introducing probabilistic modeling for uncertainty quantification, with the primary objective of enhancing the generalization capacity of VideoQA models. More importantly, our framework is designed to be a versatile tool applicable to diverse VideoQA models and to improve their performance.

II Uncertainty-Aware Curriculum Learning for Video Question Answering

II-A Uncertainty-Based Curriculum Learning

II-A1 Self-Paced Curriculum Learning Revisit

Curriculum learning (CL) is a training strategy in which networks are trained from easy data to hard data [19, 20, 21, 22]. It mimics the organization of curricula for humans, i.e., starting from basic knowledge and progressing to complex concepts. We first revisit self-paced curriculum learning (SPL) as follows. Given the training set {(𝒙i,yi)}i=1D\left\{(\bm{x}_{i},y_{i})\right\}_{i=1}^{D}, where 𝒙i\bm{x}_{i} is the observation, yiy_{i} is the ground truth, and DD is the number of training samples, SPL aims to minimize the following loss in epoch ee,

(𝝃,𝒘;e)=1Di=1Dwil(f𝝃(𝒙i),yi)+R(𝒘;e)\mathcal{L}(\bm{\xi},\bm{w};e)=\frac{1}{D}\sum_{i=1}^{D}w_{i}l\left(f_{\bm{\xi}}(\bm{x}_{i}),y_{i}\right)+R\left(\bm{w};e\right) (1)

where f𝝃f_{\bm{\xi}} is a network parameterized by 𝝃\bm{\xi}, l(,)l(\cdot,\cdot) is a loss function (such as MSE and cross-entropy), 𝒘={wi}i=1D\bm{w}=\left\{w_{i}\right\}_{i=1}^{D} is the set of weights ranging from 0 to 1 for training data at epoch ee, and R(𝒘;e)R(\bm{w};e) is the regularization preventing wiw_{i} from dropping to 0. In the original SPL [19], l1l_{1} norm is exploited as the regularization, i.e.,

R(𝒘;e)=λ(e)i=1Dwi,R\left(\bm{w};e\right)=-\lambda(e)\sum_{i=1}^{D}w_{i}, (2)

where λ(e)\lambda(e) is a scheduler (increasing function) determining the difficulty changing during training. The optimal 𝒘\bm{w} can be analytically solved and given as follows,

wi={1,ifl(f𝝃(𝒙i),yi)<λ(e),0,otherwise.w_{i}^{*}=\left\{\begin{matrix}1,&\mathrm{if}\ l\left(f_{\bm{\xi}}(\bm{x}_{i}),y_{i}\right)<\lambda(e),\\ 0,&\mathrm{otherwise.}\\ \end{matrix}\right. (3)

An intuitive explanation of the original SPL is that, in epoch ee, only the training data whose losses are less than λ(e)\lambda(e) are used for optimization, where the loss can be regarded as a difficulty measure and λ(e)\lambda(e) is the threshold. As training proceeds, λ(e)\lambda(e) is gradually increased to include more difficult data.
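As a concrete illustration, the hard weighting of Eq. 3 amounts to a simple threshold on the per-sample losses. A minimal PyTorch sketch (the helper name is ours, not from any released code):

```python
import torch

def spl_hard_weights(losses: torch.Tensor, lam: float) -> torch.Tensor:
    """Binary SPL weights (Eq. 3): keep only samples whose loss is below
    the threshold lambda(e)."""
    return (losses < lam).float()

# Example: with lambda(e) = 0.5, only the first two samples join the update.
losses = torch.tensor([0.2, 0.4, 0.9, 1.5])
print(spl_hard_weights(losses, lam=0.5))  # tensor([1., 1., 0., 0.])
```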

Recognizing that hard/binary weighting of samples might restrict the flexibility of SPL, soft regularizers have been developed [20, 22]. Nonetheless, these SPL methods still rely on the training loss to determine difficulty [19, 20, 21, 22], which has limitations in two aspects: 1) The training loss typically measures the distance between predictions and ground truth, failing to precisely reflect the inherent data difficulty that should remain independent of ground truth; 2) The training loss varies significantly throughout training and across different tasks, necessitating a more meticulous design of λ(e)\lambda(e) for stable optimization and improved performance.

II-A2 Uncertainty-Based Curriculum Learning

To tackle the limitations of conventional SPL approaches, we introduce an approach called uncertainty-based curriculum learning. In this method, we leverage uncertainty to quantify the level of difficulty within the curriculum learning framework. In contrast to relying on the training loss, the utilization of uncertainty offers a measure that remains independent of the ground truth and effectively captures the intricacies of the data. Intuitively, a sample with high uncertainty signifies the potential presence of noise or indicates that the model lacks confidence in its prediction, thereby classifying the sample as more challenging to handle.

Concretely, in our SPL, the loss function in epoch ee is defined as follows,

(𝝃;e)=1Di=1Dwil(f𝝃(𝒙i),yi),\mathcal{L}(\bm{\xi};e)=\frac{1}{D}\sum_{i=1}^{D}w_{i}l\left(f_{\bm{\xi}}(\bm{x}_{i}),y_{i}\right), (4)

where wiw_{i} is computed as wi=1σ(Uiλ(e))w_{i}=1-\sigma\left(\frac{U_{i}}{\lambda(e)}\right), and UiU_{i} represents the (normalized) uncertainty of the sample 𝒙i\bm{x}_{i}, which will be explained in more detail later. σ()\sigma(\cdot) denotes the Sigmoid function, and λ(e)\lambda(e) is a monotonically increasing function. At the outset of training, samples with low uncertainty (easier data) are assigned higher weights compared to samples with high uncertainty (more challenging examples). As training progresses (with increasing ee), the weights of low-uncertainty samples and high-uncertainty samples converge towards equality, ultimately resulting in the involvement of all data.

Essentially, the uncertainty UiU_{i} (and the weight wiw_{i} based on it) proposed in this work is a function of the network parameters. Consequently, it would also receive gradients with respect to these parameters during backward propagation. However, this must be prevented: if no constraint is applied to the weight wiw_{i}, the model would eventually learn to produce predictions with no uncertainty. Previous SPL methods prevent this by applying a regularization to the weight, as shown in Eq. 1. Nevertheless, we lack knowledge of the uncertainty prior across the training set, making it challenging to determine an appropriate weight regularization. To overcome this hurdle, we take an alternative approach: we detach the weight from the computation graph and nullify its gradients with respect to the parameters, i.e.,

𝝃wi:=𝟎,i=1,,D.\nabla_{\bm{\xi}}w_{i}:=\bm{0},\forall i=1,\cdots,D. (5)
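The soft weighting of Eq. 4 together with the gradient stopping of Eq. 5 can be realized in a few lines. The following PyTorch sketch (function name is illustrative) uses detach() to cut the weights out of the computation graph:

```python
import torch

def ucl_weights(uncertainty: torch.Tensor, lam_e: float) -> torch.Tensor:
    """Soft curriculum weights w_i = 1 - sigmoid(U_i / lambda(e)) (Eq. 4).
    detach() realizes Eq. 5: no gradient flows from the weights back into
    the parameters that produced the uncertainty."""
    return (1.0 - torch.sigmoid(uncertainty / lam_e)).detach()

# Low-uncertainty (easy) samples get larger weights early in training.
u = torch.tensor([-1.5, 0.0, 2.0])        # normalized uncertainties
print(ucl_weights(u, lam_e=1.0))          # roughly [0.82, 0.50, 0.12]
```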

II-B Probabilistic Modeling for VideoQA

A remaining challenge for our UCL is the uncertainty quantification. To tackle this challenge, we propose probabilistic modeling for VideoQA. Specifically, we consider VideoQA as a stochastic computation graph, where the input nodes are the video VV and the question QQ. Following video encoding (parameterized by ϕ\phi) and question encoding (parameterized by ψ\psi), we obtain the stochastic video representation MM and the stochastic question representation NN. The distribution of the answer YY can be deduced by variational inference:

pϕ,ψ,θ(y|V,Q)\displaystyle p_{\phi,\psi,\theta}(y|V,Q) =m,npϕ,ψ,θ(y,m,n|V,Q)dmdn\displaystyle=\int_{m,n}p_{\phi,\psi,\theta}(y,m,n|V,Q)\mathrm{d}m\mathrm{d}n
=m,npθ(y|m,n)qϕ(m|V)qψ(n|Q)dmdn.\displaystyle=\int_{m,n}p_{\theta}(y|m,n)q_{\phi}(m|V)q_{\psi}(n|Q)\mathrm{d}m\mathrm{d}n.

Since the integral over (m,n)(m,n) is intractable, we sample mm and nn from qϕ(m|V)q_{\phi}(m|V) and qψ(n|Q)q_{\psi}(n|Q), respectively, for KK times to approximate the predictive distribution of YY, i.e.,

pϕ,ψ,θ(y|V,Q)1Kk=1Kpθ(y|mk,nk),p_{\phi,\psi,\theta}(y|V,Q)\approx\frac{1}{K}\sum_{k=1}^{K}p_{\theta}(y|m_{k},n_{k}), (6)

where mkqϕ(m|V)m_{k}\sim q_{\phi}(m|V) and nkqψ(n|Q)n_{k}\sim q_{\psi}(n|Q). In practice, we use the reparameterization trick [56] to make the sampling differentiable. pθ(y|mk,nk)p_{\theta}(y|m_{k},n_{k}) can be specified for classification and regression as follows,

clspθ(y|mk,nk)=softmax(gk),\displaystyle\textnormal{{cls}: }p_{\theta}(y|m_{k},n_{k})=\mathrm{softmax}\left(g_{k}\right), (7)
regpθ(y|mk,nk)=𝒩(μk,σk2),\displaystyle\textnormal{{reg}: }p_{\theta}(y|m_{k},n_{k})=\mathcal{N}(\mu_{k},\sigma^{2}_{k}), (8)

where gk=g(mk,nk;θ)Cg_{k}=g(m_{k},n_{k};\theta)\in\mathbb{R}^{C} is the predicted logits (CC is the number of classes), and μk=μ(mk,nk;θ)\mu_{k}=\mu(m_{k},n_{k};\theta) and σk2=σ2(mk,nk;θ)\sigma_{k}^{2}=\sigma^{2}(m_{k},n_{k};\theta) are the predicted expectation and variance.
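The Monte Carlo approximation of Eq. 6 with the reparameterization trick can be sketched as follows. The `head` module and its return convention (logits for classification, mean and log-variance for regression) are our assumptions for illustration, not the paper's exact interface:

```python
import torch
import torch.nn.functional as F

def predictive_distribution(mu_m, sigma_m, mu_n, sigma_n, head, K=5, task="cls"):
    """Approximate p(y|V,Q) by averaging over K reparameterized samples (Eq. 6)."""
    outs = []
    for _ in range(K):
        m = mu_m + sigma_m * torch.randn_like(sigma_m)   # m_k ~ q_phi(m|V)
        n = mu_n + sigma_n * torch.randn_like(sigma_n)   # n_k ~ q_psi(n|Q)
        outs.append(head(m, n))
    if task == "cls":
        # Eq. 7: average the softmax distributions of the K samples.
        probs = torch.stack([F.softmax(g, dim=-1) for g in outs])
        return probs.mean(dim=0)
    else:
        # Eq. 8: each sample yields a Gaussian (mu_k, sigma_k^2); we assume the
        # head returns (mu_k, log_var_k) and report the averaged statistics.
        mus = torch.stack([o[0] for o in outs])
        vars_ = torch.stack([o[1].exp() for o in outs])
        return mus.mean(dim=0), vars_.mean(dim=0)
```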

The proposed stochastic computation graph for VideoQA is optimized by maximizing the evidence lower bound (ELBO), which is derived as follows (the full derivation can be found in the Appendix),

logpθ,ϕ,ψ(y|V,Q)\displaystyle\log p_{\theta,\phi,\psi}(y|V,Q)
𝔼(m,n)qϕ,ψ(m,n|V,Q)[logpθ,ϕ,ψ(y,m,n|V,Q)qϕ,ψ(m,n|V,Q)]\displaystyle\geq\mathbb{E}_{(m,n)\sim q_{\phi,\psi}({m,n}|V,Q)}\left[\log\frac{p_{\theta,\phi,\psi}(y,m,n|V,Q)}{q_{\phi,\psi}(m,n|V,Q)}\right]
=𝔼(m,n)qϕ,ψ(m,n|V,Q)[logpθ(y|m,n)]logp(V)p(Q)\displaystyle=\mathbb{E}_{(m,n)\sim q_{\phi,\psi}({m,n}|V,Q)}\left[\log p_{\theta}(y|m,n)\right]-\log p(V)p(Q)
DKL(qϕ(m|V)||p(m))DKL(qψ(n|Q)||p(n)),\displaystyle-D_{\mathrm{KL}}\left(q_{\phi}(m|V)||p(m)\right)-D_{\mathrm{KL}}\left(q_{\psi}(n|Q)||p(n)\right),

where DKL(||)D_{\mathrm{KL}}\left(\cdot||\cdot\right) is the Kullback–Leibler (KL) divergence. The KL divergence can also be regarded as regularization on MM and NN, preventing them from degenerating into deterministic representations. Since p(V)p(V) and p(Q)p(Q) are irrelevant to the optimization, once we place a standard Gaussian prior on MM and NN and adopt Gaussian posteriors (qϕ(m|V)=𝒩(μm,σm2),qψ(n|Q)=𝒩(μn,σn2)q_{\phi}(m|V)=\mathcal{N}(\mu_{m},\sigma^{2}_{m}),q_{\psi}(n|Q)=\mathcal{N}(\mu_{n},\sigma^{2}_{n})), the objective can be rewritten and approximated as follows,

(θ,ϕ,ψ)=\displaystyle\mathcal{L(\theta,\phi,\psi)}= 1Kk=1Klogpθ(y|mk,nk)\displaystyle-\frac{1}{K}\sum_{k=1}^{K}\log p_{\theta}(y|m_{k},n_{k})
-α2d(1+log(σm)d2(μm)d2(σm)d2)\displaystyle-\frac{\alpha}{2}\sum_{d}(1+\log(\sigma_{m})_{d}^{2}-(\mu_{m})_{d}^{2}-(\sigma_{m})_{d}^{2})
-α2d(1+log(σn)d2(μn)d2(σn)d2),\displaystyle-\frac{\alpha}{2}\sum_{d}(1+\log(\sigma_{n})_{d}^{2}-(\mu_{n})_{d}^{2}-(\sigma_{n})_{d}^{2}),

where α\alpha is a hyper-parameter. For classification and regression, logpθ(y|mk,nk)\log p_{\theta}(y|m_{k},n_{k}) is specified as

clslogpθ(y|mk,nk)=logpck,\displaystyle\textnormal{{cls}: }\log p_{\theta}(y|m_{k},n_{k})=\log p^{k}_{c^{*}},
reglogpθ(y|mk,nk)=1σk2(μky)2logσk2log2π,\displaystyle\textnormal{{reg}: }\log p_{\theta}(y|m_{k},n_{k})=-\frac{1}{\sigma_{k}^{2}}(\mu_{k}-y)^{2}-\log\sigma_{k}^{2}-\log 2\pi,

where pckp^{k}_{c^{*}} is the predictive probability for the correct class cc^{*}, and yy is the ground truth value.

We then define two types of uncertainty for difficulty quantification based on our probabilistic modeling: feature uncertainty and predictive uncertainty. Feature uncertainty measures inherent uncertainty in data such as noise, blur and occlusion in videos. This type of uncertainty is computed based on the variance of pθ(y|mk,nk)p_{\theta}(y|m_{k},n_{k}) across different sampling results. Concretely, the feature uncertainty for classification is defined as the variance of predicted logits, while the feature uncertainty for regression is computed as the variance of predicted expectations, i.e.,

clsUF=1CKk=1K(gkg¯)21,\displaystyle\textnormal{{cls}: }U_{F}=\frac{1}{CK}\left\lVert\sum_{k=1}^{K}(g_{k}-\bar{g})^{2}\right\rVert_{1}, (9)
regUF=1Kk=1K(μkμ¯)2,\displaystyle\textnormal{{reg}: }U_{F}=\frac{1}{K}\sum_{k=1}^{K}(\mu_{k}-\bar{\mu})^{2}, (10)

where g¯=1Kk=1KgkC\bar{g}=\frac{1}{K}\sum_{k=1}^{K}g_{k}\in\mathbb{R}^{C} and μ¯=1Kk=1Kμk\bar{\mu}=\frac{1}{K}\sum_{k=1}^{K}\mu_{k}\in\mathbb{R} are the averages of the predicted logits and of the predicted expectations, respectively, and the squaring in Eq. 9 is performed element-wise. Feature uncertainty essentially measures the disagreement among the predictions obtained from different sampled features. That is to say, if the variances of the features mm and nn are zero (i.e., there is no uncertainty in them), the computed feature uncertainty is also zero because the random sampling degenerates to a deterministic process. As for the predictive uncertainty, it measures the confidence of the model in its final outputs. Concretely, this type of uncertainty for classification is defined as the entropy of the output distribution, while for regression, we use the predicted variance as the predictive uncertainty, i.e.,

clsUP=c=1Cpclogpc,\displaystyle\textnormal{{cls}: }U_{P}=-\sum_{c=1}^{C}p_{c}\log p_{c}, (11)
regUP=1Kk=1Kσk2,\displaystyle\textnormal{{reg}: }U_{P}=\frac{1}{K}\sum_{k=1}^{K}\sigma^{2}_{k}, (12)

where pc=1Kk=1Kpckp_{c}=\frac{1}{K}\sum_{k=1}^{K}p^{k}_{c} is the predicted probability of class cc. A lower predictive uncertainty means the model has more confidence in its prediction. The proposed feature uncertainty and predictive uncertainty depict inherent characteristics of the data and are agnostic to the ground truth, which makes them more reasonable difficulty measures for SPL.
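For reference, the four uncertainty definitions in Eqs. 9-12 reduce to a few tensor operations; the sketch below assumes per-sample tensors of shape (K, C) or (K,) and is not taken from any released code:

```python
import torch

def feature_uncertainty_cls(logits_k):
    """Eq. 9: variance of the K sets of predicted logits, averaged over the
    C classes. logits_k has shape (K, C)."""
    return logits_k.var(dim=0, unbiased=False).mean()

def predictive_uncertainty_cls(probs_k):
    """Eq. 11: entropy of the averaged class distribution. probs_k is (K, C)."""
    p = probs_k.mean(dim=0)
    return -(p * p.clamp_min(1e-12).log()).sum()

def feature_uncertainty_reg(mu_k):
    """Eq. 10: variance of the K predicted expectations. mu_k is (K,)."""
    return mu_k.var(dim=0, unbiased=False)

def predictive_uncertainty_reg(var_k):
    """Eq. 12: mean of the K predicted variances. var_k is (K,)."""
    return var_k.mean(dim=0)
```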

Our uncertainty serves a purpose distinct from aleatoric and epistemic uncertainty: feature uncertainty assesses the uncertainty linked to high-level video features, while predictive uncertainty conveys the model's confidence in its predictions. In contrast, aleatoric uncertainty quantifies uncertainty within observations, while epistemic uncertainty pertains to uncertainty in model parameters. Notably, aleatoric and epistemic uncertainty can also be incorporated into our UCL framework; we substantiate this by comparing various types of uncertainty in the supplementary material.

Although the above two types of uncertainty can measure the difficulty of data, they cannot be directly applied to SPL because their ranges are intractable, which makes it difficult to determine suitable schedulers (similar to the design of λ(e)\lambda(e) in the original SPL). To address this issue, we assume they are Gaussian distributed and normalize them to the standard Gaussian distribution, i.e.,

U¯=UE[U]Var[U],\bar{U}=\frac{U-\mathrm{E}[U]}{\sqrt{\mathrm{Var}[U]}},\\ (13)

where UU represents UFU_{F} or UPU_{P}, and E[U]\mathrm{E}[U] and Var[U]\mathrm{Var}[U] are the mean and the variance of the uncertainty UU, respectively. Consequently, U¯F,U¯P𝒩(0,1)\bar{U}_{F},\bar{U}_{P}\sim\mathcal{N}(0,1) and they are statistically bounded. Since the mean and variance for the whole dataset are intractable, we use batch normalization [57] as a practical means to estimate the normalized uncertainty. In this work, we exploit the two types of normalized uncertainty for SPL. To maintain simplicity, we still use UU to denote the normalized UFU_{F} or UPU_{P}. Fig. 1 demonstrates the overview of our framework.
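A minimal sketch of the batch-normalization-based estimate of Eq. 13, assuming a one-dimensional uncertainty per sample; the non-affine BatchNorm1d layer is our implementational choice:

```python
import torch
import torch.nn as nn

# A BatchNorm1d layer without affine parameters standardizes the per-batch
# uncertainties to roughly zero mean and unit variance, approximating Eq. 13;
# its running statistics serve as a practical stand-in for the (intractable)
# dataset-level mean and variance. Note that training mode requires batch size > 1.
uncertainty_bn = nn.BatchNorm1d(1, affine=False)

def normalize_uncertainty(U: torch.Tensor) -> torch.Tensor:
    """U has shape (B,); returns the batch-normalized uncertainty of Eq. 13."""
    return uncertainty_bn(U.unsqueeze(1)).squeeze(1)
```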

Our uncertainty-aware curriculum learning framework is designed to be versatile and adaptable, fitting seamlessly into various model structures. Alg. 1 outlines the pseudo code for our uncertainty-aware curriculum learning applied to VideoQA. It’s worth noting that the algorithm encompasses both classification-based (CB) VideoQA and regression-based (RB) VideoQA. For CB VideoQA, retain the blue lines (the tops of Line 11, 13, 14, 16) and disregard the red ones (the bottoms of Line 11, 13, 14, 16). Conversely, for RB VideoQA, follow the opposite steps. Furthermore, we have chosen predictive uncertainty for CB VideoQA and feature uncertainty for RB VideoQA in the examples provided. It’s important to understand that these uncertainties are interchangeable. The KL divergence regularization for the feature distributions within the loss is omitted here for the sake of simplicity.

Figure 1: The overview of our UCLQA framework. Note that only probabilistic modeling for the video is applied, and the number of sampling times is set to 3 for simplicity.
Require: VideoQA training set {(Vi,Qi,ai)}i=1D\left\{(V_{i},Q_{i},a_{i})\right\}_{i=1}^{D}.
Require: Video encoder FϕF_{\phi}, question encoder GψG_{\psi}, VQ interaction and answer decoder HθH_{\theta}.
Require: Number of epoch EE, scheduler λ(e)\lambda(e), batch size BB, learning rate γ\gamma, sampling times KK.
1 for e=1..Ee=1..E do
2       while not done do
3             Sample batch of data {(Vb,Qb,ab)}b=1B\left\{(V_{b},Q_{b},a_{b})\right\}_{b=1}^{B}
4             for b=1..Bb=1..B do
5                   (μm,σm)=Fϕ(Vb)(\mu_{m},\sigma_{m})=F_{\phi}(V_{b})
6                   (μn,σn)=Gψ(Qb)(\mu_{n},\sigma_{n})=G_{\psi}(Q_{b})
7                   for k=1..Kk=1..K do
8                         Sample ϵm,ϵn\epsilon_{m},\epsilon_{n} from 𝒩(0,1)\mathcal{N}(0,1)
9                         mk=μm+σmϵmm_{k}=\mu_{m}+\sigma_{m}\epsilon_{m}
10                         nk=μn+σnϵnn_{k}=\mu_{n}+\sigma_{n}\epsilon_{n}
11                         {cls:pk=Hθ(mk,nk)reg:μk,σk2=Hθ(mk,nk)\begin{cases}\color[rgb]{0,0,1}\textnormal{{cls}:}&\color[rgb]{0,0,1}p^{k}=H_{\theta}(m_{k},n_{k})\\ \color[rgb]{1,0,0}\textnormal{{reg}:}&\color[rgb]{1,0,0}\mu_{k},\sigma^{2}_{k}=H_{\theta}(m_{k},n_{k})\end{cases}
12                        
13                   end for
14                  
15                  {cls:pb=1Kkpkreg:μb=1Kkμk\begin{cases}\color[rgb]{0,0,1}\textnormal{{cls}:}&\color[rgb]{0,0,1}p^{b}=\frac{1}{K}\sum_{k}p^{k}\\ \color[rgb]{1,0,0}\textnormal{{reg}:}&\color[rgb]{1,0,0}\mu_{b}=\frac{1}{K}\sum_{k}\mu_{k}\end{cases}
16                  
17                  {cls:Ub=cpcblogpcbreg:Ub=1Kk(μkμb)2\begin{cases}\color[rgb]{0,0,1}\textnormal{{cls}:}&\color[rgb]{0,0,1}U_{b}=-\sum_{c}p^{b}_{c}\log p^{b}_{c}\\ \color[rgb]{1,0,0}\textnormal{{reg}:}&\color[rgb]{1,0,0}U_{b}=\frac{1}{K}\sum_{k}(\mu_{k}-\mu_{b})^{2}\end{cases}
18                  
19                  wb=Detach(1σ(BN(Ub)λ(e)))w_{b}=\mathrm{Detach}\left(1-\sigma\left(\frac{\mathrm{BN}(U_{b})}{\lambda(e)}\right)\right)
20                  
21                  {cls:lb=1Kklogpabkreg:lb=1Kk((μkab)2σk2+logσk2)\begin{cases}\color[rgb]{0,0,1}\textnormal{{cls}:}&\color[rgb]{0,0,1}l_{b}=-\frac{1}{K}\sum_{k}\log p^{k}_{a_{b}}\\ \color[rgb]{1,0,0}\textnormal{{reg}:}&\color[rgb]{1,0,0}l_{b}=\frac{1}{K}\sum_{k}\left(\frac{(\mu_{k}-a_{b})^{2}}{\sigma_{k}^{2}}+\log\sigma_{k}^{2}\right)\end{cases}
22                  
23             end for
24            =1Bbwblb\mathcal{L}=\frac{1}{B}\sum_{b}w_{b}l_{b}
25             (ϕ,ψ,θ)=(ϕ,ψ,θ)γ(ϕ,ψ,θ)(\phi,\psi,\theta)=(\phi,\psi,\theta)-\gamma\nabla_{(\phi,\psi,\theta)}\mathcal{L}
26            
27       end while
28      
29 end for
Algorithm 1 Uncertainty-aware Curriculum Learning for Classification/Regression-based VideoQA

II-C Uncertainty-Aware Curriculum Learning for Video Question Answering

The proposed framework is agnostic to the essential modules in VideoQA models (i.e., video encoder, question encoder, video-question interaction module, and answer decoder). Therefore, it can be applied to existing VideoQA methods as a plug-and-play method. In this section, we show how a VideoQA model can be adapted into our framework. Specifically, we choose MASN [2] for our purpose because 1) it is a typical VideoQA model with clear VideoQA modules; 2) it uses various types of visual features that may contain more uncertainty in the hidden space.

MASN exploits four types of visual features, i.e., the global/local appearance/motion feature. In this paper, we extract these features as described in [2]. Concretely, for a video of TT frames, we can obtain the global appearance features {𝒈at}t=1T\left\{\bm{g}_{a}^{t}\right\}_{t=1}^{T}, global motion features {𝒈mt}t=1T\left\{\bm{g}_{m}^{t}\right\}_{t=1}^{T}, local appearance features {𝒍ait}i=1,t=1N,T\left\{\bm{l}_{a}^{it}\right\}_{i=1,t=1}^{N,T}, and local motion features {𝒍mit}i=1,t=1N,T\left\{\bm{l}_{m}^{it}\right\}_{i=1,t=1}^{N,T}, where 𝒈at1024\bm{g}_{a}^{t}\in\mathbb{R}^{1024}, 𝒈mt,𝒍ait,𝒍mit2048\bm{g}_{m}^{t},\bm{l}_{a}^{it},\bm{l}_{m}^{it}\in\mathbb{R}^{2048} and NN is the number of objects in each frame.

There are two parallel streams in the video encoder of MASN for appearance and motion feature encoding, respectively. Since the two streams are identical in structure, we elaborate on them without specifying appearance or motion. The visual encoder is essentially a graph convolution network (GCN) [58] where the object features are modeled as a graph. Specifically, the graph node 𝒏itd\bm{n}^{it}\in\mathbb{R}^{d} (dd is the dimension of the node representations) is computed as follows,

𝒍^it=ReLU(𝑾l[𝒍it,𝒃it,𝒔(t)]),\displaystyle\hat{\bm{l}}^{it}=\mathrm{ReLU}\left(\bm{W}_{l}\left[\bm{l}^{it},\bm{b}^{it},\bm{s}(t)\right]\right), (14)
𝒈^t=𝑾g𝒈t+𝒔(t),\displaystyle\hat{\bm{g}}^{t}=\bm{W}_{g}\bm{g}^{t}+\bm{s}(t), (15)
𝒏it=ReLU(𝑾n[𝒍^it,𝒈^t]),\displaystyle\bm{n}^{it}=\mathrm{ReLU}\left(\bm{W}_{n}\left[\hat{\bm{l}}^{it},\hat{\bm{g}}^{t}\right]\right), (16)

where 𝒃it4\bm{b}^{it}\in\mathbb{R}^{4} is the coordinate of the bounding box of the ii-th object in the tt-th frame, 𝒔(t)\bm{s}(t) is the sinusoidal function for time-step encoding [59], [][\cdot] represents concatenation, and 𝑾l,𝑾g,𝑾n\bm{W}_{l},\bm{W}_{g},\bm{W}_{n} are parameters. For simplicity, we flatten the two indexes of 𝒏it\bm{n}^{it} and denote them as {𝒏z}z=1Z\left\{\bm{n}^{z}\right\}_{z=1}^{Z}, where Z=NTZ=NT is the number of all objects.
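A sketch of the node construction in Eqs. 14-16; the layer dimensions, the module name, and the use of nn.Linear for the weight matrices are assumptions rather than the exact MASN implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectNodeEncoder(nn.Module):
    """Eqs. 14-16: fuse a local object feature with its box coordinates and a
    sinusoidal time code, project the global feature, and concatenate both
    into a graph node of dimension d."""
    def __init__(self, d_local=2048, d_global=2048, d=512):
        super().__init__()
        self.W_l = nn.Linear(d_local + 4 + d, d)   # Eq. 14
        self.W_g = nn.Linear(d_global, d)          # Eq. 15
        self.W_n = nn.Linear(2 * d, d)             # Eq. 16

    def forward(self, l_it, b_it, g_t, s_t):
        l_hat = F.relu(self.W_l(torch.cat([l_it, b_it, s_t], dim=-1)))
        g_hat = self.W_g(g_t) + s_t
        return F.relu(self.W_n(torch.cat([l_hat, g_hat], dim=-1)))
```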

We combine all the nodes {𝒏z}z=1Z\left\{\bm{n}^{z}\right\}_{z=1}^{Z} to form an object matrix 𝑵Z×d\bm{N}\in\mathbb{R}^{Z\times d}. The adjacency matrix of the graph is computed based on the dot-product of the projected nodes, i.e.,

𝑨=softmax((𝑵𝑾1)(𝑵𝑾2)T),\bm{A}=\mathrm{softmax}\left((\bm{N}\bm{W}_{1})(\bm{N}\bm{W}_{2})^{\mathrm{T}}\right), (17)

where 𝑾1,𝑾2\bm{W}_{1},\bm{W}_{2} are parameters. The visual features are then encoded by a GCN, i.e., 𝑴=GCN(𝑵,𝑨)\bm{M}=\mathrm{GCN}(\bm{N},\bm{A}), where 𝑴Z×d\bm{M}\in\mathbb{R}^{Z\times d} contains interaction information among objects.
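The graph construction of Eq. 17 and the subsequent GCN encoding can be sketched as follows; the single linear GCN layer is a stand-in for the GCN of [58], not the exact architecture:

```python
import torch
import torch.nn as nn

class ObjectGraph(nn.Module):
    """Eq. 17 plus one graph-convolution step M = GCN(N, A)."""
    def __init__(self, d=512):
        super().__init__()
        self.W1 = nn.Linear(d, d, bias=False)
        self.W2 = nn.Linear(d, d, bias=False)
        self.gcn = nn.Linear(d, d)   # simplified one-layer GCN (assumption)

    def forward(self, N):            # N: (Z, d) object matrix
        A = torch.softmax(self.W1(N) @ self.W2(N).T, dim=-1)   # Eq. 17
        return torch.relu(self.gcn(A @ N))                     # M: (Z, d)
```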

In this work, we encode the video to a probabilistic representation, which is defined as the combination of all encoded objects ={𝒓z}z=1Z\mathcal{R}=\left\{\bm{r}^{z}\right\}_{z=1}^{Z}. 𝒓z\bm{r}^{z} is the encoded zz-th object in the video, which is modeled as a multivariate Gaussian distribution, i.e., 𝒓z𝒩(𝝁z,𝝈z𝑰d)\bm{r}^{z}\sim\mathcal{N}(\bm{\mu}^{z},\bm{\sigma}^{z}\bm{I}_{d}), where 𝝁z,𝝈zd\bm{\mu}^{z},\bm{\sigma}^{z}\in\mathbb{R}^{d} are the expectation and the variance, and 𝑰d\bm{I}_{d} is the dd-order identity matrix (we assume the dimensions in 𝒓z\bm{r}^{z} are independent). The expectation and variance are obtained from 𝑴\bm{M} with a two-layer perceptron (MLP) as follows,

𝝁z,𝝈z=MLP(𝑴z),\bm{\mu}^{z},\bm{\sigma}^{z}=\mathrm{MLP}(\bm{M}^{z}), (18)

where 𝑴zd\bm{M}^{z}\in\mathbb{R}^{d} is the zz-th row in 𝑴\bm{M}. In practice, we use the reparameterization trick to sample KK sets of visual representations {k}k=1K\left\{\mathcal{R}_{k}\right\}_{k=1}^{K}, where k={𝒓kz}z=1Z\mathcal{R}_{k}=\left\{\bm{r}_{k}^{z}\right\}_{z=1}^{Z} and 𝒓kz=𝝁z+𝝈zϵkz\bm{r}_{k}^{z}=\bm{\mu}^{z}+\bm{\sigma}^{z}\odot\bm{\epsilon}_{k}^{z} (ϵkz𝒩(𝟎,𝑰d)\bm{\epsilon}_{k}^{z}\sim\mathcal{N}(\bm{0},\bm{I}_{d}), and \odot represents element-wise multiplication). Note that the process above is parallel for both appearance and motion encoding, and thus two types of visual representations are obtained. {k}k=1K\left\{\mathcal{R}_{k}\right\}_{k=1}^{K} is then exploited for video-question interaction.
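A sketch of the probabilistic head of Eq. 18 with reparameterized sampling; predicting the log-variance (rather than the variance directly) is our assumption for numerical stability:

```python
import torch
import torch.nn as nn

class ProbabilisticVideoHead(nn.Module):
    """Map each encoded object M^z to a Gaussian (mu^z, sigma^z) via a
    two-layer MLP (Eq. 18) and draw K reparameterized samples {R_k}."""
    def __init__(self, d=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 2 * d))

    def forward(self, M, K=5):                     # M: (Z, d)
        mu, log_var = self.mlp(M).chunk(2, dim=-1)
        sigma = (0.5 * log_var).exp()
        samples = [mu + sigma * torch.randn_like(sigma) for _ in range(K)]
        return mu, sigma, samples                  # samples: list of K (Z, d) tensors
```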

The question encoder comprises word embedding and LSTM [60]. Specifically, the words are transformed into 300D vectors using GloVe [61]. Subsequently, a linear projection and LSTM are utilized to process the word sequence. To summarize, the LSTM encodes each question into a sequence of hidden states. We do not apply probabilistic modeling to question encoding as our prior experiments show it brings no improvement for MASN.

After obtaining the appearance/motion representations and the question encoding, we apply appearance/motion-question interaction to obtain the corresponding interacted representations as follows. For the sake of simplicity, we will not explicitly denote the sampling index kk since the processing is the same across the KK sets of visual features. Upon obtaining ={𝒓z}z=1Z\mathcal{R}=\left\{\bm{r}^{z}\right\}_{z=1}^{Z} (applicable to either appearance or motion features) and question encoding {𝒘j}j=1J\left\{\bm{w}^{j}\right\}_{j=1}^{J} (𝒘jd\bm{w}^{j}\in\mathbb{R}^{d} and JJ represents the number of words in the question), we combine each of these sets into a matrix. These matrices are represented as 𝑹Z×d\bm{R}\in\mathbb{R}^{Z\times d} and 𝑾J×d\bm{W}\in\mathbb{R}^{J\times d}, respectively. The video-question interaction is performed using the bilinear attention network (BAN) [14] as follows,

𝑩i=𝟏BANi(𝑩i1,𝑹;𝑪i)T+𝑩i1,i=1,,4,\bm{B}_{i}=\bm{1}\cdot{\rm BAN}_{i}(\bm{B}_{i-1},\bm{R};\bm{C}_{i})^{\mathrm{T}}+\bm{B}_{i-1},i=1,\cdots,4, (19)

where 𝑩0=𝑾\bm{B}_{0}=\bm{W}, 𝟏=[1,1,,1]TJ\bm{1}=[1,1,\cdots,1]^{\mathrm{T}}\in\mathbb{R}^{J}, and 𝑪i\bm{C}_{i} is the attention map. Four blocks of BAN are applied, and the final result is the video-question interacted representation, which is denoted as 𝑸J×d\bm{Q}\in\mathbb{R}^{J\times d}.

The above stream is applied to appearance features and motion features in parallel. Then two types of video-question representation, 𝑸a\bm{Q}_{a} for appearance and 𝑸m\bm{Q}_{m} for motion, are obtained. 𝑸a\bm{Q}_{a} and 𝑸m\bm{Q}_{m} are fused with the motion-appearance-centered attention. Specifically, three types of attention are computed as follows,

𝑷a=Attn(𝑼,𝑸a,𝑸a),\displaystyle\bm{P}_{a}=\mathrm{Attn}(\bm{U},\bm{Q}_{a},\bm{Q}_{a}), (20)
𝑷m=Attn(𝑼,𝑸m,𝑸m),\displaystyle\bm{P}_{m}=\mathrm{Attn}(\bm{U},\bm{Q}_{m},\bm{Q}_{m}), (21)
𝑷mix=Attn(𝑼,𝑼,𝑼),\displaystyle\bm{P}_{mix}=\mathrm{Attn}(\bm{U},\bm{U},\bm{U}), (22)

where 𝑼=[𝑸a;𝑸m]2J×d\bm{U}=[\bm{Q}_{a};\bm{Q}_{m}]\in\mathbb{R}^{2J\times d}, and the function Attn(,,)\mathrm{Attn}(\cdot,\cdot,\cdot) refers to scaled dot-product attention [59], taking as inputs query, key, and value. Then, a residual connection [62] and a LayerNorm [63] are further applied as follows,

𝒁x=LayerNorm(𝑷x+𝑼),\bm{Z}_{x}=\mathrm{LayerNorm}\left(\bm{P}_{x}+\bm{U}\right), (23)

where x{a,m,mix}x\in\{a,m,mix\}, and 𝒁a/𝒁m/𝒁mix2J×d\bm{Z}_{a}/\bm{Z}_{m}/\bm{Z}_{mix}\in\mathbb{R}^{2J\times d} is the appearance-centered/motion-centered/mixed attention. We then fuse the results under the guidance of the question as follows,

𝒁=x{a,m,mix}softmaxx(𝒛xT𝒘Jd)𝒁x,\displaystyle\bm{Z}=\sum_{x\in\{a,m,mix\}}\mathrm{softmax}_{x}\left(\frac{\bm{z}_{x}^{\mathrm{T}}\bm{w}^{J}}{\sqrt{d}}\right)\bm{Z}_{x}, (24)
𝑶=LayerNorm(𝒁+MLP(𝒁)),\displaystyle\bm{O}=\mathrm{LayerNorm}\left(\bm{Z}+\mathrm{MLP}(\bm{Z})\right), (25)

where 𝒛xd\bm{z}_{x}\in\mathbb{R}^{d} is the sum of 𝒁x\bm{Z}_{x} along the first dimension. Finally, the video-question feature is computed by aggregating 𝑶2J×d\bm{O}\in\mathbb{R}^{2J\times d} along the first dimension as follows,

𝒔=jsoftmaxj(MLP(𝑶j))𝑶j,\bm{s}=\sum_{j}\mathrm{softmax}_{j}\left(\mathrm{MLP}(\bm{O}^{j})\right)\bm{O}^{j}, (26)

where 𝑶jd\bm{O}^{j}\in\mathbb{R}^{d} is the jj-th row of 𝑶\bm{O}, and MLP()\mathrm{MLP}(\cdot) projects 𝑶j\bm{O}^{j} to a scalar. 𝒔d\bm{s}\in\mathbb{R}^{d} is the fused feature for answer prediction. Given that we have KK sets of appearance/motion representations, we compute KK features {𝒔k}k=1K\left\{\bm{s}_{k}\right\}_{k=1}^{K}.
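For clarity, the fusion pipeline of Eqs. 20-26 can be condensed into the following sketch; the module arguments (two LayerNorms and two MLPs) and the single-head attention implementation are assumptions rather than the exact MASN code:

```python
import torch
import torch.nn.functional as F

def fuse_appearance_motion(Q_a, Q_m, w_J, mlp_z, mlp_o, ln1, ln2):
    """Q_a, Q_m: (J, d) question-interacted features; w_J: (d,) last question state.
    Returns the fused feature s of Eq. 26."""
    d = Q_a.size(-1)
    U = torch.cat([Q_a, Q_m], dim=0)                                   # (2J, d)
    attn = lambda q, k, v: F.softmax(q @ k.T / d ** 0.5, dim=-1) @ v   # scaled dot-product
    Z = {"a": ln1(attn(U, Q_a, Q_a) + U),                              # Eqs. 20, 23
         "m": ln1(attn(U, Q_m, Q_m) + U),                              # Eqs. 21, 23
         "mix": ln1(attn(U, U, U) + U)}                                # Eqs. 22, 23
    scores = torch.stack([Z[x].sum(0) @ w_J / d ** 0.5 for x in Z])    # Eq. 24
    weights = F.softmax(scores, dim=0)
    Z_fused = sum(w * Z[x] for w, x in zip(weights, Z))                # (2J, d)
    O = ln2(Z_fused + mlp_z(Z_fused))                                  # Eq. 25
    s = (F.softmax(mlp_o(O).squeeze(-1), dim=0).unsqueeze(-1) * O).sum(0)  # Eq. 26
    return s
```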

The answer decoder varies according to the task at hand. For the regression task, the answer is predicted by projecting 𝒔\bm{s} to two scalars (predicted expectation and variance). For the classification task, the answer distribution is computed by projecting 𝒔\bm{s} to logits and applying Softmax to the logits. For the multi-choice task, we first concatenate the question and each answer, and then model VideoQA as binary classification. The prediction is the answer with the highest estimated probability of being correct. Importantly, {𝒔k}k=1K\left\{\bm{s}_{k}\right\}_{k=1}^{K} yields KK predictions, and the ultimate outcome is their average. The optimization follows our uncertainty-aware CL.

III Experiments

III-A Implementation Details

In the probabilistic modeling, the visual representations are sampled five times during training and ten times during testing. As for the CL scheduler λ(e)\lambda(e), we employ a linear function with values increasing over epochs from 3 to 7 (concave/convex functions yield similar results to the linear one after tuning hyperparameters). The model is trained with a learning rate of 1×1041\times 10^{-4} and a batch size of 32, utilizing the Adam optimizer [64]. All experiments are conducted using PyTorch with NVIDIA A100 GPUs.

We evaluate our methods on four datasets: TGIF-QA [38], NExT-QA [65], MSVD-QA [39], and MSRVTT-QA [39]. Specifically, TGIF-QA consists of four sub-tasks: Count, FrameQA, Action, and Transition. Count is formulated as regression, FrameQA is an open-ended task (multi-class classification), and Action and Transition are multi-choice tasks (binary classification for each option). NExT-QA is a multi-choice dataset encompassing description, chronology, and causality. MSVD-QA and MSRVTT-QA are open-ended VideoQA datasets. Regarding evaluation metrics, we employ mean square error (MSE) for Count and accuracy for the other sub-tasks and datasets.

III-B Comparisons with Existing Methods

TABLE I: The results on TGIF-QA. The metric is MSE (the lower the better) for Count and accuracy (%, the higher the better) for the others. Marked results are our re-implementations with the official codes. UCLQAF/UCLQAP represents our model trained with feature/predictive-uncertainty-aware CL.
Models Count↓ FrameQA↑ Action↑ Trans.↑
B2A [36] 3.71 57.5 75.9 82.6
HAIR [66] 3.88 60.2 77.8 82.3
HOSTER [67] 4.13 58.2 75.6 82.1
MASN [2] 3.64 58.5 82.1 85.7
HQGA [13] 3.97 61.3 76.9 85.6
IGV [3] 3.67 52.8 78.5 85.7
UCLQAF 3.22 60.5 84.0 87.7
UCLQAP 3.22 60.4 84.0 87.8

In this section, we compare our model with existing methods. As discussed in Related Work, we follow HQGA [13], IGV [3], and ATP [40] and compare only with methods without large-scale video-text pretraining for fair comparisons. We selected the compared methods based on two criteria: 1) state-of-the-art methods of recent years, such as MASN [2], IGV [3], and HQGA [13]; 2) methods that report results on the corresponding datasets in their original papers. In summary, we compare against the strongest methods with available results on each dataset.

III-B1 Results on TGIF-QA

We first compare our methods with the existing ones on TGIF-QA. The results are shown in Table I. Note that UCLQAF/UCLQAP in the table represents our model trained with feature/predictive-uncertainty-aware CL. As we can see from the results, both UCLQAF and UCLQAP achieve state-of-the-art performance on all sub-tasks except for FrameQA, and the improvements on the other three sub-tasks are significant compared with previous methods. As for FrameQA, HQGA [13] achieves the best results thanks to its hierarchical structure, which is effective in capturing the global dependencies that are crucial to FrameQA. Note that the state-of-the-art method IGV [3] is not effective on FrameQA and achieves the worst performance. Besides, UCLQAF and UCLQAP have similar performance on all sub-tasks of the TGIF-QA dataset.

III-B2 Results on NExT-QA

We compare our methods with previous ones on NExT-QA. The results (accuracy on the validation/testing set) are shown in Table II. As we can see from the results, our UCLQAF achieves better performance than the state-of-the-art methods, i.e., IGV [3], ATP [40], and HQGA [13]. Specifically, the improvement in accuracy on descriptive questions is significant compared to previous results. As for UCLQAP, it further improves the performance to a noticeable extent over HQGA and UCLQAF, and its improvement on causal questions is noteworthy. Nevertheless, our methods obtain inferior results on temporal questions. We assume the reason is that MASN models the relations among objects across frames, which diminishes the impact of temporal information on answer prediction.

III-B3 Results on MSVD-QA and MSRVTT-QA

We have conducted a comparative analysis of our methodologies against previous approaches using the MSVD-QA [39] and MSRVTT-QA [39] datasets, with the results presented in Table III. Notably, UCLQAP and IGV [3] jointly achieve the highest performance on MSVD-QA, whereas UCLQAF exhibits slightly lower results. Conversely, when considering MSRVTT-QA, both UCLQAP and UCLQAF trail behind IGV. This discrepancy may be attributed to the datasets’ inherent properties. MSRVTT-QA features clear causal relations, which IGV capitalizes on by effectively leveraging invariant grounding to identify pivotal scenes for causal relation reasoning.

TABLE II: The results on the validation/testing sets of NExT-QA. The metric is accuracy (%). The MASN results are our re-implementation.
Models Val. Testing (Causal / Temp. / Descrip. / All)
CoMem [5] 44.2 45.9 50.0 54.4 48.5
HCRN [37] 48.2 47.1 49.3 54.0 48.8
HME [4] 48.1 46.8 48.9 57.4 49.2
MASN [2] 50.8 47.7 49.4 57.8 49.8
HGA [6] 49.7 48.1 49.1 57.8 50.0
IGV [3] – 48.6 51.7 59.6 51.3
ATP [40] – 48.6 49.3 65.0 51.5
HQGA [13] 51.4 49.0 52.3 59.4 51.8
UCLQAF 51.6 49.4 50.6 61.9 51.8
UCLQAP 52.3 50.3 50.5 61.8 52.2
TABLE III: The accuracy (%) on MSVD-QA/MSRVTT-QA.
Method MSVD-QA MSRVTT-QA
B2A [36] 37.2 36.9
HAIR [66] 37.5 36.9
MASN [2] 38.0 35.3
DualVGR [35] 39.0 35.5
HOSTER [67] 39.4 35.9
IGV [3] 40.8 38.3
UCLQAF 40.6 37.3
UCLQAP 40.8 37.3

III-C More Analysis

III-C1 Ablation Study

We conduct ablation studies to show the impact of the proposed probabilistic modeling and uncertainty-aware CL. The baseline model is constructed by discarding probabilistic modeling, so that the visual representations (including appearance and motion) are deterministic rather than random variables, and CL is not applied in the training process. Besides, we also evaluate the model with probabilistic modeling but trained without CL. The results on TGIF-QA are shown in Table V. As we can see from the results, the baseline model already achieves respectable results. Probabilistic modeling further improves the performance on all sub-tasks of TGIF-QA, which means modeling the uncertainty in the visual feature space is beneficial to capturing robust spatial-temporal dependencies in videos for question answering. Moreover, feature-uncertainty-aware CL (CLF) and predictive-uncertainty-aware CL (CLP) both improve the performance on all sub-tasks. Specifically, the improvements on Count, FrameQA, and Action are significant. We assume the reason for the less significant improvement on Transition is that the variance of the uncertainty distribution on this sub-task is small, which diminishes the impact of CL.

TABLE IV: The results of the model trained by different CL methods. CLH is the original SPL [19]. CLL is the SPL with linear regularizer [20]. CLF/CLP represents feature/predictive-uncertainty-aware CL. The improvements/declines in performance are highlighted in green/red.
CL TGIF-QA (Count↓ / FrameQA↑ / Action↑ / Trans.↑) MSVD↑ MSRVTT↑
/ 3.33 59.7 82.8 87.2 38.9 36.1
CLH 3.59 (+0.26) 59.9 (+0.2) 83.0 (+0.2) 86.6 (-0.6) 38.8 (-0.1) 36.4 (+0.3)
CLL 3.46 (+0.13) 60.0 (+0.3) 82.8 (+0.0) 87.1 (-0.1) 39.4 (+0.5) 36.4 (+0.3)
CLF 3.22 (-0.11) 60.5 (+0.8) 84.0 (+1.2) 87.7 (+0.5) 40.6 (+1.7) 37.3 (+1.2)
CLP 3.22 (-0.11) 60.4 (+0.7) 84.0 (+1.2) 87.8 (+0.6) 40.8 (+1.9) 37.3 (+1.2)
TABLE V: The results of ablation studies on TGIF-QA. PM is the abbreviation for probabilistic modeling. CLF/CLP represents feature/predictive-uncertainty-aware curriculum learning. The improvements are highlighted in green.
PM CLF CLP Count↓ FrameQA↑ Action↑ Trans.↑
✗ ✗ ✗ 3.64 58.5 82.1 85.7
✓ ✗ ✗ 3.33 (-0.31) 59.7 (+1.2) 82.8 (+0.7) 87.2 (+1.5)
✓ ✓ ✗ 3.22 (-0.42) 60.5 (+2.0) 84.0 (+1.9) 87.7 (+2.0)
✓ ✗ ✓ 3.22 (-0.42) 60.4 (+1.9) 84.0 (+1.9) 87.8 (+2.1)

III-C2 Curriculum Learning

We compare our uncertainty-aware curriculum learning with the original SPL (CLH) [19] and the SPL with linear regularizer (CLL) [20]. The results on three datasets are shown in Table IV. The baseline model is our model with only probabilistic modeling (the second row in Table V). For fair comparisons, we also apply probabilistic modeling to the model trained with CLH and CLL. As we can see from the results, CLH makes little difference to the performance on MSVD-QA, and it has a negative impact on Count and Transition. Besides, the improvements on FrameQA, Action, and MSRVTT-QA are marginal. As for CLL, it brings obvious improvement only on MSVD-QA, while the impact on other datasets is either little or negative. In contrast, the proposed CLP and CLF both achieve obvious improvements on all sub-tasks and datasets, which shows the superiority of our methods over traditional SPL.

III-C3 Uncertainty Quantification

We have conducted a comparison of the uncertainty-aware curriculum learning, employing distinct forms of uncertainty quantification. Specifically, we have juxtaposed the uncertainty quantification technique proposed in [46] (UTU_{T}) with our two variations of uncertainty (UFU_{F} and UPU_{P}) for both regression and classification tasks on Count and FrameQA, respectively. The results, as depicted in Table VI, are contrasted against a baseline model trained without curriculum learning. From the outcomes, it is evident that curriculum learning enhanced with UTU_{T} yields improvements in performance for both the regression task (Count) and the classification task (FrameQA). This underscores the efficacy of our uncertainty-aware curriculum learning framework, demonstrating its adaptability to different uncertainty quantification methodologies. Furthermore, our proposed types of uncertainty provide even more significant performance gains compared to UTU_{T}, largely attributed to the added benefit of probabilistic modeling.

TABLE VI: The results of the uncertainty-aware CL equipped with different types of uncertainty quantification. The improvements are highlighted in green.
Uncertainty Count↓ FrameQA↑
Baseline 3.64 58.5
UTU_{T} 3.50 (-0.14) 59.7 (+1.2)
UFU_{F} 3.22 (-0.42) 60.5 (+2.0)
UPU_{P} 3.22 (-0.42) 60.4 (+1.9)
TABLE VII: The comparisons (accuracy %) between the models trained without/with our UCL framework. IGV is trained by simple cross-entropy loss for fair comparisons.
Models TGIF-Action MSVD-QA
w/o UCL w/ UCL w/o UCL w/ UCL
HGA 76.0 78.5 (+2.5) 33.1 34.7 (+1.6)
HQGA 76.9 78.8 (+1.9) 39.7 41.5 (+1.8)
IGV 78.5 79.6 (+1.1) 35.6 37.0 (+1.4)
MASN 82.1 84.0 (+1.9) 38.0 40.8 (+2.8)
Figure 2: The Uncertainty-Accuracy curves for TGIF-Action and TGIF-Transition.

III-C4 Generalization Ability

Demonstrating the robustness of our approach in terms of generalization across various VideoQA models, we also applied it to HGA, HQGA, and IGV. The outcomes, presented in Table VII, consistently showcase substantial enhancements across these models, both for TGIF-Action (multi-choice) and MSVD-QA (open-ended) datasets. Notably, a straightforward modification results in accuracy improvements exceeding 1.5% in the majority of cases. These findings underscore the versatility of our UCL framework, showcasing its applicability across a diverse range of VideoQA models to achieve superior performance.

III-C5 Quality of Uncertainty

To quantitatively analyze the accuracy of our uncertainty, we discretize the predictive uncertainty (normalized to [0,1]) into ten levels and compute the accuracy within each level. The Uncertainty-Accuracy curves on TGIF-Action and TGIF-Transition are shown in Fig. 2. We observe a negative correlation between uncertainty and accuracy, which supports the validity of our uncertainty estimation. Therefore, a promising application of our model is that it can quantify the uncertainty in VideoQA data and assess the difficulty of videos and QA pairs, which can greatly accelerate the collection of challenging data.

(a) Examples of high predictive uncertainty. The uncertainty of each option is also provided (normalized to [0,1][0,1]).
(b) Examples of high feature uncertainty.
Figure 3: Examples of high uncertainty. The predictions are in orange, while the correct answers are in green.

III-C6 Uncertainty Visualization

Fig. 3(a) shows some examples of high predictive uncertainty in TGIF-Transition, where the uncertainty of each option is provided. As the figure shows, the wrong predictions are of high predictive uncertainty, which means our model has less confidence in these predictions and thus makes incorrect choices. Furthermore, there exists obvious ambiguity in the high-uncertainty videos; e.g., in the right video of Fig. 3(a), while the man is dancing, he also performs a very obvious pointing action, so it makes sense that our model has low confidence in both “Dance” and “Point”. Fig. 3(b) illustrates some examples of high feature uncertainty in TGIF-Transition. As we can see from the figure, the videos of high feature uncertainty are generally of poor visual quality. For example, the question of the left video in Fig. 3(b) asks about the actions on the TV, which are not clear enough to tell what actually happens. Besides, in the right video, the man moves very quickly, and it is hard to tell whether he is laughing or smiling. These examples demonstrate that poor visual quality results in high feature uncertainty, which can have a negative impact on the accuracy of the predictions.

III-D Hyper-parameter Analysis

III-D1 KL Divergence Weight α\alpha

In our probabilistic modeling, we apply KL divergence regularization to the stochastic representations to prevent their degradation into deterministic ones, with the regularization strength balanced by a weight denoted as α\alpha. The impact of varying α\alpha on the model trained with predictive-uncertainty-aware CL for NExT-QA is presented in Table VIII. The reported results include both validation and testing accuracy. As observed from the table, performance peaks at α=1×104\alpha=1\times 10^{-4}, while larger values of α\alpha lead to decreased accuracy. This pattern is likely attributed to the high-quality videos in the NExT-QA dataset, where minor uncertainty exists in the feature space. Introducing strong regularization to the features might inadvertently compromise information integrity. Similar analyses were conducted on TGIF-QA, MSVD, and MSRVTT datasets, yielding optimal results around α=1×101\alpha=1\times 10^{-1} or α=1×102\alpha=1\times 10^{-2} across various datasets or sub-tasks within TGIF-QA. This trend suggests that larger α\alpha values are advantageous for datasets with visually unsatisfactory videos (such as TGIF-QA), facilitating the learning of more robust representations.

TABLE VIII: The accuracy (%) of different α\alpha on NExT-QA.
α\alpha Val. Testing
Causal Temp. Descrip. All
1×1051\times 10^{-5} 52.2 48.4 49.9 62.6 51.2
5×1055\times 10^{-5} 52.2 49.7 50.7 61.7 52.0
1×1041\times 10^{-4} 52.3 50.3 50.5 61.8 52.2
5×1045\times 10^{-4} 51.8 49.7 49.4 61.9 51.6
1×1031\times 10^{-3} 51.1 48.8 49.5 60.9 51.0
TABLE IX: The results of different training schedulers on the Count task of TGIF-QA. The MSE on the validation/testing set is presented.
S1S_{1}\S2S_{2} 5 6 7
1 3.24 / 3.22 3.33 / 3.35 3.26 / 3.30
2 3.29 / 3.30 3.28 / 3.30 3.25 / 3.29
3 3.26 / 3.25 3.31 / 3.30 3.32 / 3.33

III-D2 Training Scheduler λ(e)\lambda(e)

The training scheduler λ(e)\lambda(e) plays a pivotal role in controlling the rate at which the difficulty level changes in the curriculum learning process. In this study, a linear increasing function with respect to the epoch is adopted, given by:

λ(e)=S2S1E1e+S1,\lambda(e)=\frac{S_{2}-S_{1}}{E-1}e+S_{1}, (27)

where EE corresponds to the total number of training epochs (e=0,1,,E1e=0,1,\cdots,E-1), and S1S_{1} and S2S_{2} (S1<S2S_{1}<S_{2}) serve as hyper-parameters that govern the rate of difficulty adaptation. The results for different settings of S1S_{1} (1, 2, 3) and S2S_{2} (5, 6, 7) on the Count task within TGIF-QA are displayed in Table IX, which reports the mean squared error (MSE) on the validation/testing sets. The outcomes reveal that the optimal performance is achieved when S1=1S_{1}=1 and S2=5S_{2}=5. In our experimental procedures, the selection of the optimal S1S_{1} and S2S_{2} values was based on validation performance across all datasets (or sub-tasks within TGIF-QA). Generally, for smaller-scale datasets (or sub-tasks) like Count and FrameQA, smaller and more rapidly increasing scheduler values (e.g., 151\rightarrow 5) tend to yield better results. Conversely, for larger-scale datasets such as Transition and NExT-QA, a larger, slower-increasing scheduler (e.g., 373\rightarrow 7) is preferred. This distinction arises from the observation that larger-scale data necessitate a longer duration to adapt to the evolving difficulty level effectively.
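For completeness, the linear scheduler of Eq. 27 is a one-line function; the example values below follow the 1→5 setting reported for Count:

```python
def linear_scheduler(e: int, E: int, S1: float = 1.0, S2: float = 5.0) -> float:
    """Linear curriculum scheduler of Eq. 27: lambda(e) grows from S1 at
    epoch 0 to S2 at the last epoch E-1."""
    return (S2 - S1) / (E - 1) * e + S1

# Example: a 20-epoch run with the 1 -> 5 setting used for Count.
print([round(linear_scheduler(e, 20), 2) for e in (0, 10, 19)])  # [1.0, 3.11, 5.0]
```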

TABLE X: The results of models with different sampling times. C/F/A/T represents the subset of TGIF-QA: Count / FrameQA / Action / Transition. UCLQAK represents the model with KK sampling times. The numbers of parameters are also reported.
Model C↓ F↑ A↑ T↑ #Param. (M)
MASN [2] 3.64 58.5 82.1 85.7 25.7
HQGA [13] 3.97 61.3 76.9 85.6 11.0
IGV [3] 3.67 52.8 78.5 85.7 34.5
UCLQA1 3.22 60.4 83.4 87.5 27.3
UCLQA3 3.24 60.4 83.6 87.6
UCLQA7 3.21 60.4 83.9 87.6
UCLQA10 3.22 60.4 84.0 87.8

III-D3 Sampling Times

Table X presents the outcomes of our model with varied numbers of sampling times during inference on TGIF-QA (trained with predictive-uncertainty-aware CL). Throughout training, the number of sampling times is fixed to 5 across all sub-tasks. The results indicate that, during the testing phase, different sampling times have minimal impact on predictions across all sub-tasks except for Action. This observation likely stems from the enhanced robustness of our model due to probabilistic modeling. Consequently, minor disruptions in the encoded visual representations exert limited influence on video-question interaction and answer prediction. For Action, performance improves with more sampling times. However, the enhancement becomes marginal when the number of sampling times is large. Table X also provides insights into model parameters. Notably, HQGA features the fewest parameters, while IGV has the most. In comparison, our model and MASN exhibit similar parameter counts. Importantly, the sampling is performed after video encoding, and the subsequent repeated computation can be implemented in parallel rather than serially. Consequently, the inference speed remains relatively unaffected. For instance, the inference time (in seconds) of UCLQA1(MASN)/UCLQA3/UCLQA7/UCLQA10 on Count is 201/208/215/221 on an NVIDIA A100 GPU.

IV Conclusion

In this paper, we propose a novel uncertainty-aware CL framework for VideoQA, in which difficulty is gauged by uncertainty. To capture data uncertainty and mitigate its negative impact, we present a probabilistic modeling approach for VideoQA: the task is reformulated as a stochastic computation graph in which the hidden representations of videos and questions are stochastic variables. Within this probabilistic framework, we define feature uncertainty and predictive uncertainty to guide curriculum learning. In practice, we seamlessly integrate the VideoQA model into our framework and conduct experiments that demonstrate the advantages of our method. The results show that our approach achieves state-of-the-art performance across multiple datasets while providing meaningful uncertainty quantification for VideoQA.

V Future Work

Current VideoQA methods primarily rely on deep neural networks to encode videos and text into deterministic features [2, 13, 3]. However, data contain inherent uncertainty due to factors such as noise, blur, and occlusion in videos. To address this, we propose probabilistic modeling, in which representations are treated as random variables, so as to reduce the influence of uncertainty and quantify it effectively. On the other hand, a model with feature-level probabilistic modeling may become less sensitive to subtle changes in videos; that is, a model trained within our framework may be less able to differentiate visually similar concepts, potentially impacting fine-grained video comprehension. Addressing this concern will be a focus of our future work.

Appendix: Derivation of ELBO

The detailed derivation of the evidence lower bound (ELBO) is given in Eq. (28).

\begin{aligned}
\log p_{\theta,\phi,\psi}(y|V,Q)
&=\log\int_{m,n}p_{\theta,\phi,\psi}(y,m,n|V,Q)\,\mathrm{d}m\,\mathrm{d}n\\
&=\log\int_{m,n}p_{\theta,\phi,\psi}(y,m,n|V,Q)\,\frac{q_{\phi,\psi}(m,n|V,Q)}{q_{\phi,\psi}(m,n|V,Q)}\,\mathrm{d}m\,\mathrm{d}n\\
&=\log\mathbb{E}_{(m,n)\sim q_{\phi,\psi}(m,n|V,Q)}\left[\frac{p_{\theta,\phi,\psi}(y,m,n|V,Q)}{q_{\phi,\psi}(m,n|V,Q)}\right]\\
&\geq\mathbb{E}_{(m,n)\sim q_{\phi,\psi}(m,n|V,Q)}\left[\log\frac{p_{\theta,\phi,\psi}(y,m,n|V,Q)}{q_{\phi,\psi}(m,n|V,Q)}\right]\quad\text{(Jensen's inequality)}\\
&=\mathbb{E}_{(m,n)\sim q_{\phi,\psi}(m,n|V,Q)}\left[\log\frac{p_{\theta}(y|m,n)\,p(m,n)}{q_{\phi,\psi}(m,n|V,Q)\,p(V,Q)}\right]\\
&=\mathbb{E}_{(m,n)\sim q_{\phi,\psi}(m,n|V,Q)}\left[\log p_{\theta}(y|m,n)\right]-\mathbb{E}_{(m,n)\sim q_{\phi,\psi}(m,n|V,Q)}\left[\log\frac{q_{\phi,\psi}(m,n|V,Q)}{p(m,n)}\right]-\log p(V,Q)\\
&=\mathbb{E}_{(m,n)\sim q_{\phi,\psi}(m,n|V,Q)}\left[\log p_{\theta}(y|m,n)\right]-D_{\mathrm{KL}}\left(q_{\phi,\psi}(m,n|V,Q)\,\|\,p(m,n)\right)-\log p(V,Q)\\
&=\mathbb{E}_{(m,n)\sim q_{\phi,\psi}(m,n|V,Q)}\left[\log p_{\theta}(y|m,n)\right]-D_{\mathrm{KL}}\left(q_{\phi}(m|V)\,q_{\psi}(n|Q)\,\|\,p(m)\,p(n)\right)-\log p(V)\,p(Q)\\
&=\mathbb{E}_{(m,n)\sim q_{\phi,\psi}(m,n|V,Q)}\left[\log p_{\theta}(y|m,n)\right]-D_{\mathrm{KL}}\left(q_{\phi}(m|V)\,\|\,p(m)\right)-D_{\mathrm{KL}}\left(q_{\psi}(n|Q)\,\|\,p(n)\right)-\log p(V)\,p(Q)
\end{aligned}
\qquad(28)
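As a numerical counterpart to Eq. (28), the sketch below evaluates the ELBO under the common assumptions of factorized Gaussian posteriors for m and n with standard normal priors, dropping the log p(V)p(Q) term since it is constant with respect to the parameters. The Gaussian-prior assumption and all variable names are illustrative, not taken from the paper.

```python
import torch

def kl_std_normal(mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over features and averaged over the batch."""
    return 0.5 * (sigma.pow(2) + mu.pow(2) - 1.0 - 2.0 * sigma.log()).sum(dim=-1).mean()

def elbo(log_likelihood: torch.Tensor,
         mu_m: torch.Tensor, sigma_m: torch.Tensor,
         mu_n: torch.Tensor, sigma_n: torch.Tensor) -> torch.Tensor:
    """Monte Carlo ELBO of Eq. (28), up to the constant -log p(V)p(Q).

    log_likelihood: per-sample estimate of E_q[log p_theta(y|m,n)], shape [B].
    (mu_m, sigma_m), (mu_n, sigma_n): Gaussian posterior parameters of m and n.
    """
    return log_likelihood.mean() - kl_std_normal(mu_m, sigma_m) - kl_std_normal(mu_n, sigma_n)
```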

References

  • [1] D. Huang, P. Chen, R. Zeng, Q. Du, M. Tan, and C. Gan, “Location-aware graph convolutional networks for video question answering,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 021–11 028.
  • [2] A. Seo, G.-C. Kang, J. Park, and B.-T. Zhang, “Attend what you need: Motion-appearance synergistic networks for video question answering,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 6167–6177.
  • [3] Y. Li, X. Wang, J. Xiao, W. Ji, and T.-S. Chua, “Invariant grounding for video question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2928–2937.
  • [4] C. Fan, X. Zhang, S. Zhang, W. Wang, C. Zhang, and H. Huang, “Heterogeneous memory enhanced multimodal attention model for video question answering,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1999–2007.
  • [5] J. Gao, R. Ge, K. Chen, and R. Nevatia, “Motion-appearance co-memory networks for video question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6576–6585.
  • [6] P. Jiang and Y. Han, “Reasoning with heterogeneous graph alignment for video question answering,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 109–11 116.
  • [7] J. Jiang, Z. Liu, and N. Zheng, “Livlr: A lightweight visual-linguistic reasoning framework for video question answering,” IEEE Transactions on Multimedia, vol. 25, pp. 5002–5013, 2023.
  • [8] F. Zhang, R. Wang, F. Zhou, Y. Luo, and J. Li, “Psam: Parameter-free spatiotemporal attention mechanism for video question answering,” IEEE Transactions on Multimedia, pp. 1–16, 2023.
  • [9] W. Zhang, S. Tang, Y. Cao, S. Pu, F. Wu, and Y. Zhuang, “Frame augmented alternating attention network for video question answering,” IEEE Transactions on Multimedia, vol. 22, no. 4, pp. 1032–1041, 2020.
  • [10] Z. Guo, J. Zhao, L. Jiao, X. Liu, and F. Liu, “A universal quaternion hypergraph network for multimodal video question answering,” IEEE Transactions on Multimedia, vol. 25, pp. 38–49, 2023.
  • [11] J. Wang, B.-K. Bao, and C. Xu, “Dualvgr: A dual-visual graph reasoning unit for video question answering,” IEEE Transactions on Multimedia, vol. 24, pp. 3369–3380, 2022.
  • [12] T. Qian, R. Cui, J. Chen, P. Peng, X. Guo, and Y.-G. Jiang, “Locate before answering: Answer guided question localization for video question answering,” IEEE Transactions on Multimedia, pp. 1–10, 2023.
  • [13] J. Xiao, A. Yao, Z. Liu, Y. Li, W. Ji, and T.-S. Chua, “Video as conditional graph hierarchy for multi-granular question answering,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  • [14] J.-H. Kim, J. Jun, and B.-T. Zhang, “Bilinear attention networks,” Advances in neural information processing systems, vol. 31, 2018.
  • [15] L. Yang, Y. Shen, Y. Mao, and L. Cai, “Hybrid curriculum learning for emotion recognition in conversation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, 2022, pp. 11 595–11 603.
  • [16] Z. Zhou, X. Ning, Y. Cai, J. Han, Y. Deng, Y. Dong, H. Yang, and Y. Wang, “Close: Curriculum learning on the sharing extent towards better one-shot nas,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XX.   Springer, 2022, pp. 578–594.
  • [17] P. Soviany, R. T. Ionescu, P. Rota, and N. Sebe, “Curriculum learning: A survey,” International Journal of Computer Vision, vol. 130, no. 6, pp. 1526–1565, 2022.
  • [18] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 41–48.
  • [19] M. Kumar, B. Packer, and D. Koller, “Self-paced learning for latent variable models,” Advances in neural information processing systems, vol. 23, 2010.
  • [20] L. Jiang, D. Meng, T. Mitamura, and A. G. Hauptmann, “Easy samples first: Self-paced reranking for zero-example multimedia search,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 547–556.
  • [21] Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, and A. G. Hauptmann, “Self-paced learning for matrix factorization,” in Twenty-ninth AAAI conference on artificial intelligence, 2015.
  • [22] M. Gong, H. Li, D. Meng, Q. Miao, and J. Liu, “Decomposition-based evolutionary multiobjective optimization to self-paced learning,” IEEE Transactions on Evolutionary Computation, vol. 23, no. 2, pp. 288–302, 2018.
  • [23] X. Wang, Y. Chen, and W. Zhu, “A survey on curriculum learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 4555–4576, 2021.
  • [24] H. Li and M. Gong, “Self-paced convolutional neural networks.” in IJCAI, 2017, pp. 2110–2116.
  • [25] J. Schulman, N. Heess, T. Weber, and P. Abbeel, “Gradient estimation using stochastic computation graphs,” Advances in Neural Information Processing Systems, vol. 28, 2015.
  • [26] M. J. Wainwright, M. I. Jordan et al., “Graphical models, exponential families, and variational inference,” Foundations and Trends® in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.
  • [27] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Machine learning, vol. 37, no. 2, pp. 183–233, 1999.
  • [28] B. Qin, H. Hu, and Y. Zhuang, “Deep residual weight-sharing attention network with low-rank attention for visual question answering,” IEEE Transactions on Multimedia, vol. 25, pp. 4282–4295, 2023.
  • [29] H. Zhong, J. Chen, C. Shen, H. Zhang, J. Huang, and X.-S. Hua, “Self-adaptive neural module transformer for visual question answering,” IEEE Transactions on Multimedia, vol. 23, pp. 1264–1273, 2021.
  • [30] T. Qian, J. Chen, S. Chen, B. Wu, and Y.-G. Jiang, “Scene graph refinement network for visual question answering,” IEEE Transactions on Multimedia, vol. 25, pp. 3950–3961, 2023.
  • [31] J. Yu, W. Zhang, Y. Lu, Z. Qin, Y. Hu, J. Tan, and Q. Wu, “Reasoning on the relation: Enhancing visual representation for visual question answering and cross-modal retrieval,” IEEE Transactions on Multimedia, vol. 22, no. 12, pp. 3196–3209, 2020.
  • [32] Y. Liu, W. Wei, D. Peng, X.-L. Mao, Z. He, and P. Zhou, “Depth-aware and semantic guided relational attention network for visual question answering,” IEEE Transactions on Multimedia, vol. 25, pp. 5344–5357, 2023.
  • [33] J. Jiang, Z. Chen, H. Lin, X. Zhao, and Y. Gao, “Divide and conquer: Question-guided spatio-temporal contextual attention for video question answering,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 101–11 108.
  • [34] Y. Jang, Y. Song, C. D. Kim, Y. Yu, Y. Kim, and G. Kim, “Video question answering with spatio-temporal reasoning,” International Journal of Computer Vision, vol. 127, no. 10, pp. 1385–1412, 2019.
  • [35] J. Wang, B.-K. Bao, and C. Xu, “Dualvgr: A dual-visual graph reasoning unit for video question answering,” IEEE Transactions on Multimedia, vol. 24, pp. 3369–3380, 2022.
  • [36] J. Park, J. Lee, and K. Sohn, “Bridge to answer: Structure-aware graph interaction network for video question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 526–15 535.
  • [37] T. M. Le, V. Le, S. Venkatesh, and T. Tran, “Hierarchical conditional relation networks for video question answering,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9972–9981.
  • [38] Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim, “Tgif-qa: Toward spatio-temporal reasoning in visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2758–2766.
  • [39] D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang, “Video question answering via gradually refined attention over appearance and motion,” in Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 1645–1653.
  • [40] S. Buch, C. Eyzaguirre, A. Gaidon, J. Wu, L. Fei-Fei, and J. C. Niebles, “Revisiting the ‘video’ in video-language understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2917–2927.
  • [41] R. Zellers, X. Lu, J. Hessel, Y. Yu, J. S. Park, J. Cao, A. Farhadi, and Y. Choi, “Merlot: Multimodal neural script knowledge models,” Advances in Neural Information Processing Systems, vol. 34, pp. 23 634–23 651, 2021.
  • [42] T.-J. Fu, L. Li, Z. Gan, K. Lin, W. Y. Wang, L. Wang, and Z. Liu, “Violet: End-to-end video-language transformers with masked visual-token modeling,” arXiv preprint arXiv:2111.12681, 2021.
  • [43] Y. Zeng, X. Zhang, H. Li, J. Wang, J. Zhang, and W. Zhou, “X2-vlm: All-in-one pre-trained model for vision-language tasks,” arXiv preprint arXiv:2211.12402, 2022.
  • [44] Y. Zhou, B. Yang, D. F. Wong, Y. Wan, and L. S. Chao, “Uncertainty-aware curriculum learning for neural machine translation,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 6934–6944.
  • [45] A. Der Kiureghian and O. Ditlevsen, “Aleatory or epistemic? does it matter?” Structural safety, vol. 31, no. 2, pp. 105–112, 2009.
  • [46] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” Advances in neural information processing systems, vol. 30, 2017.
  • [47] J. Chang, Z. Lan, C. Cheng, and Y. Wei, “Data uncertainty learning in face recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5710–5719.
  • [48] Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang, “Bounding box regression with uncertainty for accurate object detection,” in Proceedings of the ieee/cvf conference on computer vision and pattern recognition, 2019, pp. 2888–2897.
  • [49] W. Yang, T. Zhang, X. Yu, T. Qi, Y. Zhang, and F. Wu, “Uncertainty guided collaborative training for weakly supervised temporal action detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 53–63.
  • [50] Y. Shi and A. K. Jain, “Probabilistic face embeddings,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6902–6911.
  • [51] H. Zhou, C. Zhang, Y. Luo, Y. Chen, and C. Hu, “Embracing uncertainty: Decoupling and de-bias for robust temporal grounding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8445–8454.
  • [52] H. Guo, H. Wang, and Q. Ji, “Uncertainty-guided probabilistic transformer for complex action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20 052–20 061.
  • [53] M. Chen, J. Gao, S. Yang, and C. Xu, “Dual-evidential learning for weakly-supervised temporal action localization,” in European Conference on Computer Vision.   Springer, 2022, pp. 192–208.
  • [54] A. Amini, W. Schwarting, A. Soleimany, and D. Rus, “Deep evidential regression,” Advances in Neural Information Processing Systems, vol. 33, pp. 14 927–14 937, 2020.
  • [55] M. Sensoy, L. Kaplan, and M. Kandemir, “Evidential deep learning to quantify classification uncertainty,” Advances in neural information processing systems, vol. 31, 2018.
  • [56] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [57] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning.   PMLR, 2015, pp. 448–456.
  • [58] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations (ICLR), 2017.
  • [59] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [60] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [61] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
  • [62] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [63] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
  • [64] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [65] J. Xiao, X. Shang, A. Yao, and T.-S. Chua, “Next-qa: Next phase of question-answering to explaining temporal actions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9777–9786.
  • [66] F. Liu, J. Liu, W. Wang, and H. Lu, “Hair: Hierarchical visual-semantic relational reasoning for video question answering,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1698–1707.
  • [67] L. H. Dang, T. M. Le, V. Le, and T. Tran, “Hierarchical object-oriented spatio-temporal reasoning for video question answering,” arXiv preprint arXiv:2106.13432, 2021.
[Uncaptioned image] Haopeng Li is a Ph.D. student in the School of Computing and Information Systems, University of Melbourne. He received his B.S. degree from the School of Science, Northwestern Polytechnical University, and his Master's degree from the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University. His research interests include computer vision, video understanding, and artificial intelligence.
[Uncaptioned image] Qiuhong Ke received her PhD degree from The University of Western Australia in 2018. She is a Lecturer (Assistant Professor) at Monash University. Before that, she was a Postdoctoral Researcher at the Max Planck Institute for Informatics and a Lecturer at the University of Melbourne. Her thesis “Deep Learning for Action Recognition and Prediction” received a “Dean’s List Honourable Mention” from The University of Western Australia in 2018. She was awarded the “1962 Medal” for her work in video recognition technology by the Australian Computer Society in 2019, and the Early Career Researcher Award by the Australian Pattern Recognition Society in 2020. Her research interests include computer vision and machine learning.
[Uncaptioned image] Mingming Gong is a Lecturer (Assistant Professor) in data science with the School of Mathematics and Statistics, the University of Melbourne. His research interests include causal reasoning, machine learning, and computer vision. He has authored or co-authored 40+ research papers at top venues such as ICML, NeurIPS, UAI, AISTATS, IJCAI, AAAI, CVPR, ICCV, and ECCV, with 10+ oral/spotlight presentations and a best paper finalist at CVPR 2019. He has served as an area chair for NeurIPS’21 and ICLR’21, a senior program committee member for AAAI’19-20 and IJCAI’20-21, a program committee member for ICML, NeurIPS, UAI, CVPR, and ICCV, and a reviewer for TPAMI, AIJ, MLJ, TIP, TNNLS, etc. He received the Discovery Early Career Researcher Award from the Australian Research Council in 2021.
[Uncaptioned image] Tom Drummond is the Melbourne Connect Chair of Digital Innovation for Society at the University of Melbourne. Prior to July 2021, he was Head of the Department of Electrical and Computer Systems Engineering at Monash University, where he was also a Chief Investigator and Monash Node Leader for the ARC Centre of Excellence for Robotic Vision; prior to September 2010, he was a University Senior Lecturer at the University of Cambridge. His research interests include high-performance computing, machine learning, and computer vision, with a particular emphasis on real-time systems for augmented reality, robotics, and assistive technologies. He has been awarded the Koenderink Prize and the ISMAR 10-Year Impact Award, and has received ARC and EU Framework research grants totalling in excess of $35M AUD as well as numerous funded industry collaborations.