
Predicting the Next Action by Modeling the Abstract Goal

Debaditya Roy Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), 1 Fusionopolis Way, #16-16 Connexis, Singapore 138632, Republic of Singapore. Basura Fernando Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), 1 Fusionopolis Way, #16-16 Connexis, Singapore 138632, Republic of Singapore. Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), 1 Fusionopolis Way, #16-16 Connexis, Singapore 138632, Republic of Singapore.
Abstract

The problem of anticipating human actions is inherently uncertain. However, we can reduce this uncertainty if we have a sense of the goal that the actor is trying to achieve. Here, we present an action anticipation model that leverages goal information to reduce the uncertainty in future predictions. Since we do not have access to goal information or the observed actions during inference, we resort to visual representations to encapsulate information about both actions and goals. From this, we derive a novel concept called the abstract goal, which is conditioned on observed sequences of visual features for action anticipation. We design the abstract goal as a distribution whose parameters are estimated using a variational recurrent network. We sample multiple candidates for the next action and introduce a goal consistency measure to determine the best candidate that follows from the abstract goal. Our method obtains impressive results on the challenging Epic-Kitchens55 (EK55), EK100, and EGTEA Gaze+ datasets. We obtain absolute improvements of +13.69, +11.24, and +5.19 for Top-1 verb, Top-1 noun, and Top-1 action anticipation accuracy respectively over prior state-of-the-art methods on the seen kitchens (S1) of EK55. Similarly, we obtain significant improvements on the unseen kitchens (S2) set for Top-1 verb (+10.75), noun (+5.84), and action (+2.87) anticipation. A similar trend is observed on the EGTEA Gaze+ dataset, where absolute improvements of +9.9, +13.1, and +6.8 are obtained for noun, verb, and action anticipation. As of this submission, our method is the new state-of-the-art for action anticipation on EK55 and EGTEA Gaze+ (https://competitions.codalab.org/competitions/20071#results).

Code available at https://github.com/debadityaroy/Abstract_Goal

1 Introduction

Human action anticipation is an important problem with applications in human-robot collaboration, smart homes, assistive robotics, and wearable virtual assistants. Action anticipation models aim to predict the most plausible action that is going to happen in the immediate future. Humans are rational, and the actions we take bring us closer to what we would like to attain in the end, i.e., our goal. Once a goal is identified, humans formulate a plan and execute actions to achieve that goal, as explained in the seven stages of the action cycle [35]. Therefore, to accurately predict what a person is going to do next, it is useful to infer their goal. Moreover, it has been shown that it is possible to infer the goal of a person by observing their actions [4]. However, the action anticipation literature [13, 16, 29, 30, 33, 37, 39, 40] has not made use of this vital information that governs human actions; it seems that goal modeling is not a popular solution.

Goal inference from observed actions (or features) is an extremely challenging task. Two different goals can share the same partial action sequences, while different persons may have different execution plans for the same goal. Therefore, goal modeling is a highly stochastic problem. In this paper, we make use of a stochastic method [7, 12] for goal modeling to improve action anticipation, going beyond the deterministic latent goal representation introduced in [31]. Interestingly, if we know the goal of the person, the uncertainty in predicting the future actions they might take to accomplish that goal is reduced, as backed up by the cognitive science literature [3, 4, 35]. Our approach is also motivated by procedure planning [6], where a sequence of intermediate actions is predicted based on goals provided as the final visual representation or the overall activity [27].

Figure 1: Illustration of the model design for abstract goal-based action anticipation. Yellow ellipses represent distributions and pink boxes represent various variables of the model. Best viewed on a screen.

Explicit goal inference is not trivial in action anticipation, as the goal of the activity is not available as ground truth. To this end, we outline our approach in Figure 1. We learn a latent distribution from the observed visual features using a stochastic RNN [7], which we call the "feature-based abstract goal" distribution. Given the hidden state of the RNN and a sampled feature-based abstract goal ($\mathbf{z}_{T}$ in Figure 1), we obtain a representation for the observed action ($\mathbf{a}_{O}$). Afterward, we model the next-action-representation distribution conditioned on the sampled feature-based abstract goal and the observed action representation, and sample a candidate next action representation ($\mathbf{a}_{N}$) from it. Given the observed and next action representations, we learn the distribution of the "action-based abstract goal" using the generative variational framework. Therefore, we infer two kinds of abstract goals, one using the visual feature sequence and another using action representations, unlike [1, 25] which use a single latent distribution to model past actions.

The setup of next action anticipation assumes that the next action in the sequence can be reliably inferred from the observed action(s). Hence, the feature-based abstract goal distribution derived from observed features and the action-based abstract goal distribution derived from the next action representation should be consistent. The action that is most likely to happen in the future (the "next best action") is the one that maximizes consistency between the two abstract goal distributions. We sample multiple next-action-representation candidates and introduce a new goal consistency criterion to measure the suitability of each candidate when predicting the next action (see Figure 1). During learning, we also use goal consistency as a loss function to obtain a better model. Such a mechanism is not present in previous stochastic approaches [1, 25], which only minimize the KL divergence between prior and posterior latent distributions to obtain the best samples. We show that goal consistency has the biggest impact on action anticipation.

Our approach yields large improvements when predicting the next action in unscripted activities from the Epic-Kitchens55 and EGTEA Gaze+ datasets. We also obtain significantly better results on the unseen kitchens of Epic-Kitchens100. Our contributions are: (1) a novel stochastic abstract goal concept for action anticipation; (2) a novel stochastic model that learns abstract goal distributions and uses them for effective action anticipation in unscripted activities; (3) a novel goal consistency term that measures how well a plausible future action (next action) aligns with the abstract goal distributions. The code is publicly available.

2 Related work

Research in action anticipation has gained popularity in recent years thanks to progress in datasets [9] and challenges [8].

2.1 Goals in action anticipation

The activity label of the entire action sequence is used to anticipate the next action in [33]. In [31], a fixed latent goal is obtained from the observed visual features. However, humans often pursue multiple goals simultaneously [9]. Hence, we propose an abstract goal distribution that serves as a representation for one or more underlying intentions. The final visual representation is considered as the goal in [6]. Our abstract goal allows us to anticipate actions without any knowledge of the overall activity label [33] or the final visual representation [6].

2.2 Features for action anticipation

Multiple representations such as spatial, motion, and bag-of-objects features are used to predict future actions in [13]. The authors show that unrolling an LSTM for multiple time steps into the future is beneficial for next action prediction. In [30], human-object interactions are encoded as features and fed to a transformer encoder-decoder to predict the features of future frames and the corresponding future actions. In [23], spatial attention maps of future human-object interactions are estimated to predict the next action.

In [36], multiple future visual representations are generated from an input frame using a CNN and the future action is predicted. Similarly, in [39], an RNN is used to generate the intermediate frames between the observed frames and the anticipated action. In [20], temporal features are computed using time-conditioned skip connections to anticipate the next action. In [2, 24], RNNs are used to predict future actions conditioned on observed action labels. Similarly, in [17], a transformer is used to encode past actions and durations while another transformer decoder is used to predict both future actions and their durations. In [16], every frame is divided into patches that are combined using the Vision Transformer (ViT) [10] to obtain a frame representation. These frame representations are combined using a temporal transformer to predict future features and action labels. In [38], long-range sequences are summarized by processing smaller temporal sequences using multi-scale ViTs and caching them in memory as context. Each context is attended hierarchically in time using multiple temporal transformers for action anticipation. We use a Gated Recurrent Unit (GRU) to summarize frame features and learn the feature-based abstract goal distribution parameters to generate observed and next action representations.

2.3 Past-Future Correlation for anticipation

An action anticipation model that correlates past observed features with the future using a Jaccard vector similarity is presented in [11]. A neural memory network that compares an input (spatial representation or labels) with the existing memory content to predict future action labels is proposed in [15]. Similarly, [29] proposes an action anticipation framework with a self-regulated learning process that correlates similarities and differences between past and current frames. In [26], a predictive model directly anticipates the future action while a low-rank linear transitional model predicts the current action and correlates it with the predicted future actions. Similarly, counterfactual reasoning is used to improve action anticipation in [41]. Our approach correlates the past and future by enforcing goal consistency between the two abstract goal distributions computed using the observed features and the next action. Although our method can also be used for long-term action forecasting [28] and early action recognition [34, 32], in this work we primarily focus on action anticipation.

3 Action anticipation with abstract goals

In this section, we explain our model design outlined in Figure 1.

3.1 Feature-based abstract goal

In this section we describe how to generate the feature-based abstract goal representation using the variational RNN (VRNN) framework [7, 12]. Let us denote the observed feature sequence by $\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{T}$ where $\mathbf{x}_{t}\in\mathbb{R}^{d_{f}}$. Following the standard VRNN, a Gaussian distribution $q_{t}=\mathcal{N}(\boldsymbol{\mu}_{t,prior},\boldsymbol{\sigma}_{t,prior})$ is used to model the prior distribution of the abstract goal ($\mathbf{z}_{t}$) given the observed feature sequence, $t\in\{1,\cdots,T\}$, where $\boldsymbol{\mu}_{t,prior},\boldsymbol{\sigma}_{t,prior}\in\mathbb{R}^{d_{z}}$. The parameters of the abstract goal distribution $q_{t}$ are estimated by an MLP $\phi_{prior}:\mathbb{R}^{d_{h}}\to\mathbb{R}^{d_{z}}$ from the hidden state of an RNN ($\mathbf{h}_{t-1}\in\mathbb{R}^{d_{h}}$) learned from the previous $t-1$ features as follows:

q(\mathbf{z}_{t}|\mathbf{x}_{1:t-1})\sim\mathcal{N}(\boldsymbol{\mu}_{t,prior},\boldsymbol{\sigma}_{t,prior}),  (1)
\text{where } (\boldsymbol{\mu}_{t,prior},\boldsymbol{\sigma}_{t,prior})=\phi_{prior}(\mathbf{h}_{t-1}).  (2)

Note that there are two separate MLPs, one to obtain $\boldsymbol{\mu}_{t,prior}$ and another to obtain $\boldsymbol{\sigma}_{t,prior}$. We apply a softplus activation to the $\phi_{prior}$ network that estimates the standard deviation ($\boldsymbol{\sigma}_{t,prior}$). Unless otherwise specified, all $\phi_{<>}(\cdot)$ used in our model are two-layer neural networks with ReLU activation.
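For concreteness, the prior parameter networks can be sketched in PyTorch as follows (class and variable names here are illustrative, not the released implementation; dimensions follow the defaults in Section 4.1):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PriorNet(nn.Module):
    """phi_prior of Eq. (2): maps the GRU hidden state h_{t-1} to the parameters
    (mu, sigma) of the feature-based abstract goal prior q(z_t | x_{1:t-1})."""
    def __init__(self, d_in=256, d_z=128):
        super().__init__()
        # two separate two-layer MLPs with ReLU, one per distribution parameter
        self.mu_net = nn.Sequential(nn.Linear(d_in, d_z), nn.ReLU(), nn.Linear(d_z, d_z))
        self.sigma_net = nn.Sequential(nn.Linear(d_in, d_z), nn.ReLU(), nn.Linear(d_z, d_z))

    def forward(self, h_prev):
        mu = self.mu_net(h_prev)
        sigma = F.softplus(self.sigma_net(h_prev))  # softplus keeps the standard deviation positive
        return mu, sigma
```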

The posterior distribution of the abstract goal ($r$) computes the effect of observing the incoming new feature $\mathbf{x}_{t}$ as follows:

r(\mathbf{z}_{t}|\mathbf{x}_{1:t})\sim\mathcal{N}(\boldsymbol{\mu}_{t,pos},\boldsymbol{\sigma}_{t,pos}).  (3)

As before, the parameters of $r$ are computed by a two-layer MLP $\phi_{pos}:\mathbb{R}^{2\times d_{z}}\to\mathbb{R}^{d_{z}}$ using both the last hidden state of the RNN ($\mathbf{h}_{t-1}$) and the incoming feature ($\mathbf{x}_{t}$) as follows:

\boldsymbol{\mu}_{t,pos},\boldsymbol{\sigma}_{t,pos}=\phi_{pos}([\phi_{x}(\mathbf{x}_{t}),\phi_{h}(\mathbf{h}_{t-1})]),  (4)

where $\phi_{x}:\mathbb{R}^{d_{f}}\to\mathbb{R}^{d_{z}}$ and $\phi_{h}:\mathbb{R}^{d_{h}}\to\mathbb{R}^{d_{z}}$ are linear layers and $[\cdot,\cdot]$ denotes vector concatenation. We use the reparameterization trick [21] to sample an abstract goal ($\mathbf{z}_{t}\in\mathbb{R}^{d_{z}}$) from the prior distribution $q(\mathbf{z}_{t}|\mathbf{x}_{1:t-1})$ as follows:

\mathbf{z}_{t}=\boldsymbol{\mu}_{t,prior}+\boldsymbol{\sigma}_{t,prior}\odot\boldsymbol{\epsilon},  (5)

where $\boldsymbol{\epsilon}\in\mathbb{R}^{d_{z}}$ is sampled from a standard Gaussian $\mathcal{N}(\mathbf{0},\mathbf{1})$. The sampled $\mathbf{z}_{t}$ is then used to obtain the next hidden state of the RNN (a standard GRU cell) as follows:

\mathbf{h}_{t}=RNN(\mathbf{h}_{t-1},[\phi_{x}(\mathbf{x}_{t}),\phi_{z}(\mathbf{z}_{t})]),\ \forall t\in\{1,\cdots,T\},  (6)

where $\phi_{z}:\mathbb{R}^{d_{z}}\to\mathbb{R}^{d_{z}}$ acts as a feature extractor over $\mathbf{z}_{t}$. The sampled abstract goal ($\mathbf{z}_{t}$) can be used to reconstruct (or generate) the feature sequence as done in the VRNN framework [7, 12]. However, we use it to represent the feature-based abstract goal. Our intuition comes from the fact that humans derive action plans from goals, and videos are a realization of this action plan. Therefore, by construction, the goal determines the video (feature evolution in our case). Since the abstract goal latent variable encapsulates the video feature generation process, by analogy we propose that the latent variable ($\mathbf{z}_{t}$) represents the notion of a feature-based abstract goal.
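A minimal sketch of one recurrence step (Eqs. 1-6), reusing the PriorNet module sketched above, is shown below; the feature dimension d_f and the choice of linear layers for the extractors are our assumptions:

```python
import torch
import torch.nn as nn

class VRNNStep(nn.Module):
    """One recurrence of Eqs. (1)-(6): prior, posterior, sampled abstract goal z_t, GRU update.
    A sketch under our own naming; PriorNet is the module sketched above."""
    def __init__(self, d_f=1024, d_h=256, d_z=128):
        super().__init__()
        self.phi_x = nn.Linear(d_f, d_z)         # phi_x: feature extractor
        self.phi_h = nn.Linear(d_h, d_z)         # phi_h: hidden-state extractor
        self.phi_z = nn.Linear(d_z, d_z)         # phi_z: abstract-goal extractor
        self.prior = PriorNet(d_h, d_z)          # Eq. (2)
        self.posterior = PriorNet(2 * d_z, d_z)  # Eq. (4): takes [phi_x(x_t), phi_h(h_{t-1})]
        self.rnn = nn.GRUCell(2 * d_z, d_h)      # Eq. (6)

    def forward(self, x_t, h_prev):
        mu_p, sig_p = self.prior(h_prev)                                   # prior q(z_t | x_{1:t-1})
        mu_q, sig_q = self.posterior(
            torch.cat([self.phi_x(x_t), self.phi_h(h_prev)], dim=-1))     # posterior r(z_t | x_{1:t})
        z_t = mu_p + sig_p * torch.randn_like(sig_p)                       # reparameterized sample, Eq. (5)
        h_t = self.rnn(torch.cat([self.phi_x(x_t), self.phi_z(z_t)], dim=-1), h_prev)  # Eq. (6)
        return z_t, h_t, (mu_p, sig_p), (mu_q, sig_q)
```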

Therefore, we denote the "feature-based abstract goal" distribution as follows:

p(\mathbf{z}_{T})=q(\mathbf{z}_{T}|\mathbf{x}_{1:T-1}).  (7)

The abstract goal distribution represents all abstract goals with respect to a particular observed feature sequence. Any observed action may lead to more than one goal, and our abstract goal representation captures these variations.

3.2 Action representations

Human actions are causal in nature, and the next action in a sequence depends on the earlier actions. For example, washing vegetables is followed by cutting vegetables when the goal is "making a salad". We capture the causality between observed and next actions using the "observed action representation" and the "next action representation". We obtain the observed action representation ($\mathbf{a}_{O}$) using the feature-based abstract goal and the hidden state of the RNN as follows:

\mathbf{a}_{O}=\phi_{O}([\phi_{z}(\mathbf{z}_{T}),\phi_{h}(\mathbf{h}_{T})]).  (8)

Here $\phi_{O}:\mathbb{R}^{2\times d_{z}}\to\mathbb{R}^{d_{h}}$ and $\mathbf{z}_{T}$ is sampled from the abstract goal distribution $p(\mathbf{z}_{T})$ using Equation 5.

Then we obtain the distribution of the next action representation ($\mathbf{a}_{N}$) conditioned on the hidden state of the RNN and the observed action representation, denoted by $p(\mathbf{a}_{N}|\mathbf{h}_{T},\mathbf{a}_{O})$. The reason for modeling the next action representation as a distribution conditioned on the hidden state and the observed action representation is twofold. First, a particular observed action may lead to different next actions depending on the context and goal. Note that in our model, both the observed action representation $\mathbf{a}_{O}$ and the RNN hidden state $\mathbf{h}_{T}$ depend on the feature-based abstract goal representation. Second, there can be variations in human behavior when executing the same task. The next action representations are generated using a Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}_{\mathbf{a}_{N}},\boldsymbol{\sigma}^{2}_{\mathbf{a}_{N}})$ where $\boldsymbol{\mu}_{\mathbf{a}_{N}},\boldsymbol{\sigma}^{2}_{\mathbf{a}_{N}}\in\mathbb{R}^{d_{z}}$, whose parameters are estimated as follows:

p(\mathbf{a}_{N}|\mathbf{h}_{T},\mathbf{a}_{O})\sim\mathcal{N}(\boldsymbol{\mu}_{\mathbf{a}_{N}},\boldsymbol{\sigma}^{2}_{\mathbf{a}_{N}}),
\text{where } (\boldsymbol{\mu}_{\mathbf{a}_{N}},\boldsymbol{\sigma}_{\mathbf{a}_{N}})=\phi_{N}([\phi_{h}(\mathbf{h}_{T}),\phi_{a}(\mathbf{a}_{O})]),  (9)

where $\phi_{a}:\mathbb{R}^{d_{h}}\to\mathbb{R}^{d_{z}}$ and $\phi_{N}:\mathbb{R}^{2\times d_{z}}\to\mathbb{R}^{d_{z}}$ are two separate MLPs. Now we sample multiple next action representations from the next-action-representation distribution using the reparameterization trick as in Equation 10,

\mathbf{a}_{N}=\boldsymbol{\mu}_{\mathbf{a}_{N}}+\boldsymbol{\sigma}_{\mathbf{a}_{N}}\odot\boldsymbol{\epsilon},  (10)

where $\boldsymbol{\epsilon}\in\mathbb{R}^{d_{z}}$ is sampled from a standard Gaussian $\mathcal{N}(\mathbf{0},\mathbf{1})$.
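Putting Eqs. 8-10 together, a sketch of the next-action sampler could look as follows (our own naming; the $\phi$ MLPs are simplified to single linear layers and the softplus on the standard deviation is our assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextActionSampler(nn.Module):
    """Eqs. (8)-(10): observed action representation a_O, the conditional distribution
    p(a_N | h_T, a_O), and K reparameterized candidate samples of a_N."""
    def __init__(self, d_h=256, d_z=128):
        super().__init__()
        self.phi_z = nn.Linear(d_z, d_z)
        self.phi_h = nn.Linear(d_h, d_z)
        self.phi_O = nn.Linear(2 * d_z, d_h)      # Eq. (8)
        self.phi_a = nn.Linear(d_h, d_z)
        self.phi_N = nn.Linear(2 * d_z, 2 * d_z)  # heads for (mu_aN, sigma_aN), Eq. (9)

    def forward(self, z_T, h_T, K=10):
        a_O = self.phi_O(torch.cat([self.phi_z(z_T), self.phi_h(h_T)], dim=-1))          # Eq. (8)
        mu, sig = self.phi_N(torch.cat([self.phi_h(h_T), self.phi_a(a_O)], dim=-1)).chunk(2, dim=-1)
        sig = F.softplus(sig)                                                             # keep the std positive
        eps = torch.randn(K, *mu.shape, device=mu.device)
        a_N = mu.unsqueeze(0) + sig.unsqueeze(0) * eps                                    # Eq. (10), K candidates
        return a_O, a_N, (mu, sig)
```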

3.3 Action-based abstract goal representation

Now, we obtain the action-based abstract goal from the observed and next action representations using the generative variational framework [21]. The action-based abstract goal is modeled with a Gaussian distribution conditioned on the next action representation, denoted by $q(\mathbf{z}_{N}|\mathbf{a}_{N})$, whose parameters are computed as:

q(\mathbf{z}_{N}|\mathbf{a}_{N})\sim\mathcal{N}(\boldsymbol{\mu}_{Nq},\boldsymbol{\sigma}_{Nq}),
\text{where } (\boldsymbol{\mu}_{Nq},\boldsymbol{\sigma}_{Nq})=\phi_{Nq}(\phi_{a}(\mathbf{a}_{N})),  (11)

where $\boldsymbol{\mu}_{Nq},\boldsymbol{\sigma}_{Nq}\in\mathbb{R}^{d_{z}}$ and $\phi_{Nq}:\mathbb{R}^{d_{h}}\to\mathbb{R}^{d_{z}}$ is implemented with two MLPs. On the other hand, the parameters of the action-based abstract goal distribution conditioned on both the observed and next action representations are given as follows:

r(\mathbf{z}_{N}|\mathbf{a}_{N},\mathbf{a}_{O})\sim\mathcal{N}(\boldsymbol{\mu}_{Nr},\boldsymbol{\sigma}_{Nr}),
\text{where } (\boldsymbol{\mu}_{Nr},\boldsymbol{\sigma}_{Nr})=\phi_{Nr}([\phi_{a}(\mathbf{a}_{N}),\phi_{a}(\mathbf{a}_{O})]),  (12)

where $\boldsymbol{\mu}_{Nr},\boldsymbol{\sigma}_{Nr}\in\mathbb{R}^{d_{z}}$ and $\phi_{Nr}:\mathbb{R}^{d_{h}}\to\mathbb{R}^{d_{z}}$ is a dual-headed MLP. Finally, the action-based abstract goal distribution for the next action, $p(\mathbf{z}_{N})$, is given by

p(\mathbf{z}_{N})=q(\mathbf{z}_{N}|\mathbf{a}_{N}).  (13)

We use both the feature-based and action-based abstract goal representations to find the best candidate for the next action, as explained in the next section. Note that while $q(\mathbf{z}_{N}|\mathbf{a}_{N})$ only depends on $\mathbf{a}_{N}$, $r(\mathbf{z}_{N}|\mathbf{a}_{N},\mathbf{a}_{O})$ depends on both $\mathbf{a}_{N}$ and $\mathbf{a}_{O}$.
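A sketch of the two parameter networks of Eqs. 11 and 12 is shown below; note that this sketch uses two separate input projections purely to keep tensor shapes consistent, whereas the text shares a single $\phi_{a}$, and the single-layer heads stand in for the MLPs of the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionGoalNets(nn.Module):
    """Parameter networks for the action-based abstract goal: q(z_N | a_N) of Eq. (11)
    and r(z_N | a_N, a_O) of Eq. (12). Names and layer shapes are our own simplification."""
    def __init__(self, d_h=256, d_z=128):
        super().__init__()
        self.phi_aN = nn.Linear(d_z, d_z)          # projects the sampled next-action representation
        self.phi_aO = nn.Linear(d_h, d_z)          # projects the observed action representation
        self.q_head = nn.Linear(d_z, 2 * d_z)      # heads for (mu_Nq, sigma_Nq)
        self.r_head = nn.Linear(2 * d_z, 2 * d_z)  # heads for (mu_Nr, sigma_Nr)

    def q_params(self, a_N):                       # q(z_N | a_N), Eq. (11)
        mu, sig = self.q_head(self.phi_aN(a_N)).chunk(2, dim=-1)
        return mu, F.softplus(sig)

    def r_params(self, a_N, a_O):                  # r(z_N | a_N, a_O), Eq. (12)
        mu, sig = self.r_head(torch.cat([self.phi_aN(a_N), self.phi_aO(a_O)], dim=-1)).chunk(2, dim=-1)
        return mu, F.softplus(sig)
```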

3.4 Next action anticipation with goal consistency

Given a sampled feature-based abstract goal $\mathbf{z}_{T}$, we select the best next action representation $\mathbf{a}_{N}^{*}$ using the divergence between the $p(\mathbf{z}_{T})$ distribution (Eq. 7) and the $p(\mathbf{z}_{N})$ distribution (Eq. 13). We call this divergence the goal consistency criterion. Given $\mathbf{z}_{T}$, the observed action $\mathbf{a}_{O}$, and the sampled next action $\mathbf{a}_{N}$, the goal consistency criterion is derived from the symmetric KL divergence between $p(\mathbf{z}_{T})$ and $p(\mathbf{z}_{N})$ as follows:

D(\mathbf{a}_{N})=\frac{D_{KL}(p(\mathbf{z}_{T})||p(\mathbf{z}_{N}))+D_{KL}(p(\mathbf{z}_{N})||p(\mathbf{z}_{T}))}{2}.  (14)

We choose the best next action candidate (i.e., the anticipated action candidate representation) $\mathbf{a}_{N}^{*}$ that minimizes the goal consistency criterion. The rationale is that the best anticipated action should have an action-based abstract goal distribution $p(\mathbf{z}_{N})$ that aligns with the feature-based abstract goal distribution $p(\mathbf{z}_{T})$. We use the following algorithm to find the best next action candidate $\mathbf{a}_{N}^{*}$.

1: Sample a feature-based abstract goal $\mathbf{z}_{T}$ from Eq. 7: $\mathbf{z}_{T}\sim q(\mathbf{z}_{T}|\mathbf{x}_{1:T-1})$
2: Get the observed action representation $\mathbf{a}_{O}$ (Eq. 8)
3: Get the next-action-representation distribution $p(\mathbf{a}_{N}|\mathbf{h}_{T},\mathbf{a}_{O})$ (Eq. 9)
4: Sample $K$ next action representations $\mathcal{N}=\{\mathbf{a}_{N}^{1},\cdots,\mathbf{a}_{N}^{K}\}\sim p(\mathbf{a}_{N}|\mathbf{h}_{T},\mathbf{a}_{O})$
5: Best next action $\mathbf{a}_{N}^{*}=\text{argmin}_{\mathbf{a}_{N}^{k}\in\mathcal{N}}D(\mathbf{a}_{N}^{k})$
Algorithm 1: Best next action selection
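The goal consistency criterion of Eq. 14 has a closed form for diagonal Gaussians, and line 5 of Algorithm 1 reduces to an argmin over the sampled candidates. A minimal, unbatched PyTorch sketch is shown below (goal_nets refers to the ActionGoalNets sketch above; function names are ours):

```python
import torch

def kl_diag_gauss(mu0, sig0, mu1, sig1):
    """KL( N(mu0, sig0^2) || N(mu1, sig1^2) ) for diagonal Gaussians, summed over dimensions."""
    return (torch.log(sig1 / sig0) + (sig0 ** 2 + (mu0 - mu1) ** 2) / (2 * sig1 ** 2) - 0.5).sum(-1)

def goal_consistency(p_zT, p_zN):
    """Symmetric KL divergence D(a_N) of Eq. (14); each argument is a (mu, sigma) pair."""
    return 0.5 * (kl_diag_gauss(*p_zT, *p_zN) + kl_diag_gauss(*p_zN, *p_zT))

def select_best_next_action(candidates, p_zT, goal_nets):
    """Algorithm 1, line 5: among the sampled candidates, keep the a_N whose action-based
    abstract goal q(z_N | a_N) is most consistent with the feature-based abstract goal p(z_T)."""
    divergences = torch.stack([goal_consistency(p_zT, goal_nets.q_params(a)) for a in candidates])
    return candidates[int(divergences.argmin())]
```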

Finally, we predict the anticipated action from the selected next action representation as follows:

\hat{\mathbf{y}}=\phi_{c}(\mathbf{a}_{N}^{*}),  (15)

where $\phi_{c}:\mathbb{R}^{d_{z}}\to\mathbb{R}^{d_{c}}$ is the MLP classifier and $\hat{\mathbf{y}}$ is the class score vector. Note that in Algorithm 1, we sample only one feature-based abstract goal in line 1. During training, however, we sample $Q$ feature-based abstract goals and, for each of them, $K$ next action representations. In this case, we select the best candidate from all $K\times Q$ next-action-representation candidates using Equation 14. Hence, the next best action is consistent and does not rely too heavily on sampling, as long as we sample a sufficient number of candidate next actions.

Although the feature-based abstract goal $p(\mathbf{z}_{T})$ is obtained from the VRNN framework [7, 12], the formulation of the action representations $\mathbf{a}_{O}$ and $\mathbf{a}_{N}$, the action-based abstract goal $p(\mathbf{z}_{N})$, and the goal consistency criterion is drastically different from [1, 25]. In [31], goal consistency is defined between latent goals before and after the action using a hard threshold. Instead, our goal consistency is a symmetric KL divergence between the $p(\mathbf{z}_{T})$ and $p(\mathbf{z}_{N})$ distributions, which aims to align the two abstract goal distributions. This also results in a massive improvement in next action anticipation performance, as shown in the experiments.

3.5 Loss functions and training of our model

Our anticipation network is trained using a number of losses. In contrast to prior stochastic methods [1, 25], we introduce three KL divergence losses: (a) the feature-based abstract goal loss ($\mathcal{L}_{OG}$), (b) the action-based abstract goal loss ($\mathcal{L}_{NG}$), and (c) the goal consistency loss ($\mathcal{L}_{GC}$). The first loss is used to learn the parameters of the feature-based abstract goal distribution. We compute the KL divergence between the conditional prior and posterior distributions for every feature in the observed feature sequence and minimize the sum:

\mathcal{L}_{OG}=\sum_{t=1}^{T}D_{KL}(r(\mathbf{z}_{t}|\mathbf{x}_{1:t})||q(\mathbf{z}_{t}|\mathbf{x}_{1:t-1})).  (16)

This is based on the intuition that the abstract goal should not change due to a newly observed feature. Our second loss arises when we learn the action-based abstract goal distribution. As in Equation 16, we compute the KL divergence between the $r(\mathbf{z}_{N}|\mathbf{a}_{N}^{*},\mathbf{a}_{O})$ and $q(\mathbf{z}_{N}|\mathbf{a}_{N}^{*})$ distributions of the action-based abstract goal as follows:

\mathcal{L}_{NG}=D_{KL}(r(\mathbf{z}_{N}|\mathbf{a}_{N}^{*},\mathbf{a}_{O})||q(\mathbf{z}_{N}|\mathbf{a}_{N}^{*})).  (17)

We denote the corresponding best action-based abstract goal distribution by $p(\mathbf{z}^{*}_{N})=q(\mathbf{z}_{N}|\mathbf{a}_{N}^{*})$. The intuition is the same as before: the goal should not change because of the next best action $\mathbf{a}_{N}^{*}$. Furthermore, the feature-based and action-based abstract goal distributions should be aligned with respect to the selected next best action $\mathbf{a}_{N}^{*}$. Therefore, we minimize the symmetric KL divergence between the feature-based and best-action-based abstract goal distributions as follows:

\mathcal{L}_{GC}=\frac{D_{KL}(p(\mathbf{z}_{T})||p(\mathbf{z}^{*}_{N}))+D_{KL}(p(\mathbf{z}^{*}_{N})||p(\mathbf{z}_{T}))}{2}.  (18)

We call this loss the goal consistency loss. It is based on $D(\mathbf{a}_{N})$ in Equation 14, with the only difference being that $p(\mathbf{z}^{*}_{N})=q(\mathbf{z}_{N}|\mathbf{a}_{N}^{*})$ is computed with respect to the selected best next action representation $\mathbf{a}_{N}^{*}$. Finally, we have the cross-entropy loss that compares the model's prediction $\hat{\mathbf{y}}$ with the ground-truth one-hot label $\mathbf{y}$:

\mathcal{L}_{NA}=-\sum\mathbf{y}\odot\log(\hat{\mathbf{y}}).  (19)

The loss function to train the model is a combination of all losses given as follows:

\mathcal{L}_{total}=\mathcal{L}_{OG}+\mathcal{L}_{NG}+\mathcal{L}_{GC}+\mathcal{L}_{NA}.  (20)

We experimented with different weights for each loss but found no significant difference in learning; therefore, we weight them equally.
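Assembling the four terms, the training objective can be sketched as follows (PyTorch's Normal/kl_divergence are used for the Gaussian KL terms; the batch reduction by mean is our assumption):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def total_loss(prior_seq, post_seq, q_star, r_star, p_zT, logits, target):
    """Eq. (20) = L_OG + L_NG + L_GC + L_NA with equal weights.
    Each distribution argument is a (mu, sigma) pair; logits are the class scores
    and target is the ground-truth class index."""
    def kl(p, q):  # KL(N(p) || N(q)) for diagonal Gaussians, summed over latent dims
        return kl_divergence(Normal(*p), Normal(*q)).sum(-1)

    l_og = sum(kl(r_t, q_t) for q_t, r_t in zip(prior_seq, post_seq))  # Eq. (16)
    l_ng = kl(r_star, q_star)                                          # Eq. (17)
    l_gc = 0.5 * (kl(p_zT, q_star) + kl(q_star, p_zT))                 # Eq. (18)
    l_na = F.cross_entropy(logits, target)                             # Eq. (19)
    return (l_og + l_ng + l_gc).mean() + l_na                          # Eq. (20)
```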

4 Experiments and results

4.1 Datasets, features, and training details

We use three well-known action anticipation datasets for evaluation: EPIC-KITCHENS55 [8] (EK55), EPIC-KITCHENS100 [9] (EK100), and EGTEA Gaze+ [22]. We validate our models using the TSN features obtained from RGB and optical flow videos, and the bag-of-objects features provided by [13], for a fair comparison with existing approaches. Our base model has the following parameters: observed duration of 2 seconds, frame rate of 3 fps, RNN (GRU) hidden dimension $d_{h}=256$, abstract goal dimension $d_{z}=128$, number of sampled feature-based abstract goals $Q=3$, number of next-action-representation candidates $K=10$, and a fixed anticipation time of 1s (following the EK55 and EK100 evaluation server criteria), unless specified otherwise. We use a batch size of 128 videos and train for 15 epochs with a learning rate of 0.001 using the Adam optimizer with weight decay (AdamW) in PyTorch. All our MLPs have 256 hidden dimensions.
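For reference, these base-model hyper-parameters can be collected into a small configuration sketch (variable names are ours, not an official config file; the weight-decay value is PyTorch's default):

```python
import torch

# Base-model hyper-parameters from Section 4.1.
CONFIG = dict(observed_seconds=2, fps=3, d_h=256, d_z=128, Q=3, K=10,
              anticipation_seconds=1.0, batch_size=128, epochs=15, lr=1e-3)

def make_optimizer(model: torch.nn.Module):
    """AdamW optimizer with the learning rate stated above."""
    return torch.optim.AdamW(model.parameters(), lr=CONFIG["lr"])
```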

4.2 Comparison with state-of-the-art

We compare the performance of Abstract Goal (our method) with current state-of-the-art approaches on both the seen and unseen test sets of the EK55 dataset in Table 1, using a late fusion of TSN-RGB, TSN-Flow, and Object features as in most prior work. We train separate models for verb and noun anticipation and combine their predictions to obtain action anticipation accuracy. Our method outperforms all prior state-of-the-art methods by a significant margin for both seen kitchens (S1) and unseen kitchens (S2). Notably, we outperform the Transformer-based AVT [16] and Temporal Aggregation [33] in all measures in both seen and unseen kitchens, except for Top-5 accuracy on unseen kitchens. A similar trend can be seen for the EGTEA Gaze+ dataset in Table 2, where our method outperforms all compared methods. We believe this improvement is due to two factors: (i) stochastic modeling is massively important for action anticipation, and (ii) effective use of goal information is paramount for better action anticipation.

Method Seen Kitchens (S1) Unseen Kitchens (S2)
Top-1 accuracy Top-5 accuracy Top-1 accuracy Top-5 accuracy
VERB NOUN ACT. VERB NOUN ACT. VERB NOUN ACT. VERB NOUN ACT.
RU-LSTM [13] 33.04 22.78 14.39 79.55 50.95 33.73 27.01 15.19 08.16 69.55 34.38 21.10
Lat. Goal [31] 27.96 27.40 8.10 78.09 55.98 26.46 22.40 19.12 04.78 72.07 42.68 16.97
SRL [29] 34.89 22.84 14.24 79.59 52.03 34.61 27.42 15.47 08.88 71.90 36.80 22.06
ImagineRNN [39] 35.44 22.79 14.66 79.72 52.09 34.98 29.33 15.50 09.25 70.67 35.78 22.19
Temp. Agg. [33] 37.87 24.10 16.64 79.74 53.98 36.06 29.50 16.52 10.04 70.13 37.83 23.42
MM-Trans [30] 28.59 27.18 10.85 78.64 57.66 30.83 26.80 18.40 06.76 70.40 44.18 20.04
MM-TCN [40] 37.16 23.75 15.45 79.48 51.86 34.37 30.66 14.92 08.91 72.00 36.67 21.68
AVT [16] 34.36 20.16 16.84 80.03 51.57 36.52 30.66 15.64 10.41 72.17 40.76 24.27
Abstract Goal 51.56 35.34 22.03 82.56 58.01 38.29 41.41 22.36 13.28 73.10 41.62 24.24
Table 1: Comparison of anticipation accuracy with state-of-the-art on EK55 evaluation server. ACT. is action.
Method Top-1 accuracy Mean Class accuracy
VERB NOUN ACT. VERB NOUN ACT.
I3D-Res50 [5] 48.0 42.1 34.8 31.3 30.0 23.2
FHOI [23] 49.0 45.5 36.6 32.5 32.7 25.3
RU-LSTM [13] 50.3 48.1 38.6 - - -
AVT [16] 54.9 52.2 43.0 49.9 48.3 35.2
Abstract Goal 64.8 65.3 49.8 63.4 55.6 37.4
Table 2: Comparison of anticipation performance on EGTEA Gaze+. All methods are evaluated at a fixed anticipation time of 0.5s following [16].
Method Overall Unseen Kitchens
VERB NOUN ACT. VERB NOUN ACT.
RU-LSTM [9] 25.25 26.69 11.19 19.36 26.87 09.65
Temp. Agg. [33] 21.76 30.59 12.55 17.86 27.04 10.46
AVT [16] 26.69 32.33 16.74 21.03 27.64 12.89
TransAction [18] 36.15 32.20 13.39 27.60 24.24 10.05
MeMViT [38] 32.20 37.00 17.70 28.60 27.40 15.20
Abstract goal 31.40 30.10 14.29 31.36 35.56 17.34
Table 3: Comparison on EK100 dataset on evaluation server using test set. Accuracy measured by class-mean recall@5 (%) following the standard protocol.

Despite these excellent results on both the EK55 and EGTEA Gaze+ datasets, our overall results on EK100 are not state-of-the-art (see Table 3). The overall performance is affected by tail classes, where our method does not perform as well as recent transformer methods (see Supplementary Table 1) that are heavily trained on external data [16, 38]. The EK100 dataset has a long-tailed distribution in which 228 out of 300 noun classes and 86 out of 97 verb classes are tail classes. In our model, the next action representation is modeled with a Gaussian distribution (Equation 11) and therefore cannot cater to such an exceptionally long-tailed class distribution. This is a limitation of our method. We do not witness the tail-class issue on EK55 because the performance measure there is accuracy, as opposed to mean recall on EK100: accuracy is influenced heavily by frequent classes, whereas mean recall treats all classes equally. However, our model shows excellent generalization on the unseen kitchens of the EK100 dataset, outperforming the best Transformer model [38].

4.3 Impact of goal consistency criterion and loss

In this section, we evaluate the impact of the goal consistency (GC) criterion and loss ($\mathcal{L}_{GC}$) using the validation sets of the EK55 and EK100 datasets. We train separate models for verb and noun anticipation using TSN-RGB (RGB) and Object (OBJ) features, respectively. As mean and median sampling are used in prior variational prediction models [1], we use them as two baselines to show the effect of GC. After sampling $Q\times K$ next-action representations ($\mathbf{a}_{N}$), instead of selecting the best next-action candidate using GC (Algorithm 1), we take the mean/median vector of all sampled candidates and then make the prediction using the classifier (e.g., mean vector $=\frac{\sum\mathbf{a}_{N}}{Q\times K}$). We also experimented with a majority/median class prediction baseline: we take all $Q\times K$ predictions from the classifier (one per next-action-representation candidate) and pick the majority/median class as the final prediction. Everything else stays the same for these mean/majority/median baselines, except that we do not use the GC loss ($\mathcal{L}_{GC}$) or the GC criterion (Equation 14). Results are reported in Table 4.
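The mean and majority-class baselines can be summarized as follows (a sketch with our own function names; classifier stands for the MLP classifier $\phi_{c}$ and candidates is the list of $Q\times K$ sampled representations):

```python
import torch

def mean_baseline(candidates, classifier):
    """Mean-sampling baseline of Table 4: average all sampled next-action representations
    and classify the mean vector; no goal-consistency selection is applied."""
    return classifier(torch.stack(candidates).mean(dim=0)).argmax(dim=-1)

def majority_class_baseline(candidates, classifier):
    """Majority-class baseline: classify every candidate and return the most frequent prediction."""
    preds = torch.stack([classifier(a).argmax(dim=-1) for a in candidates])
    return preds.mode(dim=0).values
```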

Goal candidate (Q) & Action candidate (K) EK55 EK100
V@1 V@5 N@1 N@5 V@1 V@5 N@1 N@5
Mean Q=1, K=10 41.79 72.23 25.79 49.50 44.51 76.89 22.72 50.78
Median 41.16 71.32 24.30 48.31 45.44 77.91 22.15 51.23
Majority class 41.98 72.89 25.98 50.01 42.98 74.56 24.13 53.45
Median class 41.02 72.11 22.88 49.87 44.19 77.00 22.97 51.98
Our model 45.18 77.30 28.16 51.08 48.84 80.52 27.50 55.83
Mean Q=3, K=10 39.40 72.23 24.22 48.96 45.90 77.88 22.41 50.87
Median 41.32 71.32 26.60 51.70 45.63 77.02 24.33 52.87
Majority class 38.39 69.42 24.70 48.22 45.72 78.61 22.61 50.89
Median class 40.43 71.43 26.52 52.33 45.84 78.09 23.78 52.33
Our model 44.68 77.14 28.29 53.78 49.02 80.86 28.52 54.91
Without $\mathcal{L}_{GC}$ Q=1, K=1 38.31 70.77 19.74 43.11 43.82 77.45 21.25 51.99
With $\mathcal{L}_{GC}$ 40.88 71.43 22.09 46.29 46.80 78.41 26.80 53.32
Table 4: The impact of the goal consistency criterion and loss. @1 and @5 denote Top-1 and Top-5 accuracy; V stands for verb and N stands for noun.

As can be seen from the results, GC has a significant impact. In particular, there is an improvement of 3.39 and 2.37 in top-1 verb and noun accuracy, respectively, with our GC model on the EK55 dataset for $Q=1,K=10$ over the mean-sampling baseline. A similar trend can be seen for EK100 and for $Q=3,K=10$. Our model also outperforms the majority- and median-class sampling baselines for both the [$Q=1,K=10$] and [$Q=3,K=10$] configurations, indicating the effectiveness of goal consistency. Overall, our method with the GC loss and criterion performs better than all other variants. Perhaps this is because the GC criterion allows the model to regularize the candidate selection, while the GC loss enforces this during training. This clearly shows the impact of the goal consistency formulation of our model for action anticipation. To understand goal consistency qualitatively, we measured the KL divergence between the abstract goals of different videos and found that videos with similar goals have a lower KLD ($6.41\times 10^{-5}$) than those with dissimilar goals ($2\times 10^{-4}$). Details are in Supplementary Table 2.

We perform a more controlled experiment to further evaluate the impact of the GC loss, where we set $Q=1$ and $K=1$ and train our model with and without the GC loss ($\mathcal{L}_{GC}$). Note that when $Q=1$ and $K=1$, the GC criterion has no effect. To obtain a statistically meaningful result, we repeat this experiment 10 times and report the mean performance. As can be seen from the last two rows of Table 4, the GC loss has a positive impact even when we sample only a single action candidate from our stochastic model. Adding the GC loss improves top-1 verb prediction by 2.57 and top-1 noun prediction by 2.35 on the EK55 dataset. On the EK100 dataset, the improvement is 2.98 and 4.55 for top-1 verb and noun prediction, respectively. This shows that not only the GC criterion but also the GC loss improves the performance of the model. We also see that, compared to our model variant [$Q=1,K=1$ with $\mathcal{L}_{GC}$], the [$Q=1,K=10$ with $\mathcal{L}_{GC}$] model performs significantly better (last row vs. row 5 of Table 4). This indicates the impact of next-action-representation sampling (Equation 11) even for a single sampled feature-based abstract goal ($Q=1$). We conclude that the goal consistency loss, the goal consistency criterion, and the next-action-representation distribution modeling (all novel concepts introduced in this paper) are effective for action anticipation.

parameter value EK55 EK100
V@1 V@5 N@1 N@5 V@1 V@5 N@1 N@5
num. feature-based abstract goals (Q) (K = 10) 1 45.18 77.30 28.16 51.08 48.84 80.52 27.50 55.83
2 44.44 76.19 28.47 52.38 49.25 80.44 28.41 55.65
3 44.68 77.14 28.29 53.78 49.02 80.86 28.52 54.91
4 45.31 77.91 26.28 50.33 48.86 80.46 28.16 55.11
5 45.80 77.40 26.95 51.93 49.71 80.40 28.04 55.16
num. next action candidates (K) (Q=3) 1 39.81 72.31 21.48 44.96 44.24 75.67 20.06 42.56
3 40.49 74.20 22.60 46.22 44.37 76.11 21.07 44.51
5 41.32 74.26 23.17 48.23 45.61 78.91 22.91 45.12
10 44.68 77.14 28.29 53.78 49.02 80.86 28.52 54.91
20 43.79 79.00 27.07 51.10 49.01 80.36 28.13 55.40
30 44.56 77.81 27.80 51.00 49.18 81.20 27.44 53.42
Table 5: Ablation on the sensitivity to the number of sampled feature-based abstract goals ($Q$) and next-action-representation candidates ($K$) on the EK55 and EK100 validation sets.

4.4 Sensitivity of model on QQ and KK

Next, we evaluate how sensitive our model is to the number of sampled feature-based abstract goals $Q$ and the number of sampled next-action-representation candidates $K$. Results are shown in Table 5. Our model is not very sensitive to the number of sampled feature-based abstract goals $Q$, especially when $K>1$. However, $Q=3$ is better than $Q=1$ for the majority of the performance measures, except for top-5 noun accuracy. As the model is not very sensitive to $Q$, we select $Q=3$ as the default. The performance of the model increases with increasing $K$; we select $K=10$ as our default based on the results in Table 5. Our model uses both the GC loss and criterion during training: the GC criterion helps the model pick the most suitable next action candidate, while the GC loss encourages the model to generate good candidates. Therefore, our model is regularized to predict the most likely candidates accurately and does not need to rely too heavily on the sampling process. We also notice that the model is not sensitive to different starting seeds in the sampling process (see Supplementary Table 4).

Losses V@1 V@5 N@1 N@5
$\mathcal{L}_{NA}$ 21.36 69.69 27.76 51.89
$\mathcal{L}_{NA}+\mathcal{L}_{OG}$ 44.42 77.79 28.41 51.31
$\mathcal{L}_{NA}+\mathcal{L}_{NG}$ 46.01 77.94 29.05 52.32
$\mathcal{L}_{NA}+\mathcal{L}_{GC}$ 43.83 77.43 28.06 51.87
$\mathcal{L}_{NA}+\mathcal{L}_{OG}+\mathcal{L}_{NG}$ 44.47 77.12 28.51 51.34
$\mathcal{L}_{NA}+\mathcal{L}_{OG}+\mathcal{L}_{GC}$ 45.47 77.42 28.61 52.34
$\mathcal{L}_{NA}+\mathcal{L}_{OG}+\mathcal{L}_{NG}+\mathcal{L}_{GC}$ 46.37 77.97 29.86 52.74
Table 6: Loss ablation on the EK55 validation set. $\mathcal{L}_{NA}$: next-action cross-entropy loss, $\mathcal{L}_{OG}$: feature-based abstract goal loss, $\mathcal{L}_{NG}$: action-based abstract goal loss, $\mathcal{L}_{GC}$: goal consistency loss.
Model V@1 V@5 N@1 N@5
VRNN (Mean) 27.76 61.23 22.34 46.78
VRNN (Median) 38.13 68.94 23.85 47.56
Abstract Goal 44.68 77.14 28.29 53.78
Table 7: Ablation on architecture. We compare our model with standard VRNN classification on EK55 dataset.

4.5 Evaluating the impact of all loss functions

We also study the impact of each loss function described in Section 3.5 and report the results in Table 6. If we use only the supervised cross-entropy loss ($\mathcal{L}_{NA}$), the performance is the worst, especially for verbs. Both $\mathcal{L}_{OG}$ and $\mathcal{L}_{NG}$ help regularize the abstract goal representations ($\mathbf{z}_{t}$ and $\mathbf{a}_{N}$), and therefore results improve significantly. In particular, $\mathcal{L}_{NA}+\mathcal{L}_{NG}$ is the best combination of two losses. When we combine all four losses, we get the best results. While $\mathcal{L}_{NA}+\mathcal{L}_{NG}$ regularizes the learning of abstract goal representations, $\mathcal{L}_{GC}$, which minimizes the divergence between the feature-based and action-based goal distributions, improves the choice of the next verb or noun among the plausible candidates. We conclude that all four losses are important for our model.

4.6 Ablation on model architecture

We also compare our model with standard variational RNN classification. We obtain the latent variable $\mathbf{z}_{T}$ and the observed action representation $\mathbf{a}_{O}$ from Equation 8, and classify $\mathbf{a}_{O}$ to obtain the future action. In this case, we train the VRNN model with the cross-entropy loss and the KL divergence (Equation 16). We call this baseline VRNN, and we sample 30 candidates for $\mathbf{a}_{O}$ to match our architecture with 30 next action candidates ($Q=3,K=10$). As shown in Table 7, our method performs much better than the VRNN baseline (with either mean or median prediction).

4.7 Ablation on anticipation horizon

In Table 8, we see that our model produces consistent anticipation performance for various anticipation time horizons on the EK55 dataset. One model is trained for each anticipation duration. For all anticipation times, Abstract Goal outperforms existing approaches substantially (by 5-12%). Even when the anticipation time is increased to 2 seconds, the performance of Abstract Goal does not drop significantly compared to RU-LSTM [14] or Temp. Agg. [33]. We conclude that our model can produce better predictions over longer anticipation times, due to the stochastic modeling of action anticipation with goal consistency. Additional ablation studies are in Supplementary Tables 3-9.

Top-5 Action Accuracy Anticipation Time
2 1.5 1 0.5
EL [19] 24.68 26.41 28.56 31.50
RU-LSTM [14] 29.44 32.24 35.32 37.37
Temp. Agg. [33] 30.90 33.70 36.40 39.50
Abstract Goal 42.90 43.61 43.91 44.26
Table 8: Comparing Action Anticipation performance for different anticipation times on EK55 validation set. Our model consistently outperforms other approaches by a large margin for all anticipation times.

5 Limitations, Discussion, and Conclusions

We present a novel approach for action anticipation where abstract goals are learned with a stochastic recurrent model. We significantly outperform existing approaches on EPIC-KITCHENS55 and EGTEA Gaze+ datasets. Our model generalizes to unseen kitchens in both EPIC-KITCHENS55 and EPIC-KITCHENS100 datasets, even outperforming Transformer-based models by a large margin. We also show the importance of goal consistency criterion, goal consistency loss, next-action representation modeling, and architecture. One limitation of the current work is the inability to directly interpret the abstract goal representation learned by our model. Second, our method is not able to tackle long-tail-class distribution issues. In the future, we aim to address these limitations of our model.

Acknowledgment

This research/project is supported in part by the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-RP-2020-016) and the National Research Foundation, Singapore under its AI Singapore Program (AISG Award Number: AISG-RP-2019-010).

References

  • [1] Yazan Abu Farha and Juergen Gall. Uncertainty-aware anticipation of activities. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
  • [2] Yazan Abu Farha, Alexander Richard, and Juergen Gall. When will you do what?-anticipating temporal occurrences of activities. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5343–5352, 2018.
  • [3] Chris L Baker, Rebecca Saxe, and Joshua B Tenenbaum. Action understanding as inverse planning. Cognition, 113(3):329–349, 2009.
  • [4] Chris L Baker, Joshua B Tenenbaum, and Rebecca R Saxe. Goal inference as inverse planning. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 29, 2007.
  • [5] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  • [6] Chien-Yi Chang, De-An Huang, Danfei Xu, Ehsan Adeli, Li Fei-Fei, and Juan Carlos Niebles. Procedure planning in instructional videos. In European Conference on Computer Vision, pages 334–350. Springer, 2020.
  • [7] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. Advances in neural information processing systems, 28:2980–2988, 2015.
  • [8] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. The epic-kitchens dataset: Collection, challenges and baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
  • [9] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, pages 1–23, 2021.
  • [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • [11] Basura Fernando and Samitha Herath. Anticipating human actions by correlating past with the future with jaccard similarity measures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13224–13233, 2021.
  • [12] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 2207–2215, 2016.
  • [13] Antonino Furnari and Giovanni Farinella. Rolling-unrolling lstms for action anticipation from first-person video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [14] Antonino Furnari and Giovanni Maria Farinella. What would you expect? anticipating egocentric actions with rolling-unrolling lstms and modality attention. In Proceedings of the IEEE International Conference on Computer Vision, pages 6252–6261, 2019.
  • [15] Harshala Gammulle, Simon Denman, Sridha Sridharan, and Clinton Fookes. Forecasting future action sequences with neural memory networks. In 30th British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, September 9-12, 2019, page 298. BMVA Press, 2019.
  • [16] Rohit Girdhar and Kristen Grauman. Anticipative Video Transformer. In ICCV, 2021.
  • [17] Dayoung Gong, Joonseok Lee, Manjin Kim, Seong Jong Ha, and Minsu Cho. Future transformer for long-term action anticipation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3052–3061, 2022.
  • [18] Xiao Gu, Jianing Qiu, Yao Guo, Benny Lo, and Guang-Zhong Yang. Transaction: ICL-SJTU submission to epic-kitchens action anticipation challenge 2021. CoRR, abs/2107.13259, 2021.
  • [19] Ashesh Jain, Avi Singh, Hema S Koppula, Shane Soh, and Ashutosh Saxena. Recurrent neural networks for driver activity anticipation via sensory-fusion architecture. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3118–3125. IEEE, 2016.
  • [20] Qiuhong Ke, Mario Fritz, and Bernt Schiele. Time-conditioned action anticipation in one shot. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9925–9934, 2019.
  • [21] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [22] Y. Li, M. Liu, and J. Rehg. In the eye of the beholder: Gaze and actions in first person video. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [23] Miao Liu, Siyu Tang, Yin Li, and James M Rehg. Forecasting human-object interaction: joint prediction of motor attention and actions in first person video. In European Conference on Computer Vision, pages 704–721. Springer, 2020.
  • [24] Siyuan Brandon Loh, Debaditya Roy, and Basura Fernando. Long-term action forecasting using multi-headed attention-based variational recurrent neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 2419–2427, June 2022.
  • [25] Nazanin Mehrasa, Akash Abdu Jyothi, Thibaut Durand, Jiawei He, Leonid Sigal, and Greg Mori. A variational auto-encoder model for stochastic point processes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3165–3174, 2019.
  • [26] Antoine Miech, Ivan Laptev, Josef Sivic, Heng Wang, Lorenzo Torresani, and Du Tran. Leveraging the present to anticipate the future in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [27] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2630–2640, 2019.
  • [28] Yan Bin Ng and Basura Fernando. Forecasting future action sequences with attention: a new approach to weakly supervised action forecasting. IEEE Transactions on Image Processing, 2020.
  • [29] Zhaobo Qi, Shuhui Wang, Chi Su, Li Su, Qingming Huang, and Qi Tian. Self-regulated learning for egocentric video activity anticipation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [30] Debaditya Roy and Basura Fernando. Action anticipation using pairwise human-object interactions and transformers. IEEE Transactions on Image Processing, 2021.
  • [31] Debaditya Roy and Basura Fernando. Action anticipation using latent goal learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2745–2753, January 2022.
  • [32] Mohammad Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Basura Fernando, Lars Petersson, and Lars Andersson. Encouraging lstms to anticipate actions very early. In Proceedings of the IEEE International Conference on Computer Vision, pages 280–289, 2017.
  • [33] Fadime Sener, Dipika Singhania, and Angela Yao. Temporal aggregate representations for long-range video understanding. In European Conference on Computer Vision, pages 154–171. Springer, 2020.
  • [34] Yuge Shi, Basura Fernando, and Richard Hartley. Action anticipation with rbf kernelized feature mapping rnn. In Proceedings of the European Conference on Computer Vision (ECCV), pages 301–317, 2018.
  • [35] Elizabeth Shove. The design of everyday life, chapter 2, page 40. Berg, 2007.
  • [36] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 98–106, 2016.
  • [37] Xiaohan Wang, Linchao Zhu, Heng Wang, and Yi Yang. Interactive prototype learning for egocentric action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8168–8177, 2021.
  • [38] Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. arXiv preprint arXiv:2201.08383, 2022.
  • [39] Yu Wu, Linchao Zhu, Xiaohan Wang, Yi Yang, and Fei Wu. Learning to anticipate egocentric actions by imagination. IEEE Transactions on Image Processing, 30:1143–1152, 2021.
  • [40] Olga Zatsarynna, Yazan Abu Farha, and Juergen Gall. Multi-modal temporal convolutional network for anticipating actions in egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2249–2258, 2021.
  • [41] Tianyu Zhang, Weiqing Min, Jiahao Yang, Tao Liu, Shuqiang Jiang, and Yong Rui. What if we could not see? counterfactual analysis for egocentric action anticipation. In IJCAI, 2021.