Learning Temporal Dynamics from Cycles in Narrated Video
Abstract
Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community. We introduce a self-supervised approach to this problem that solves a multi-modal temporal cycle consistency objective jointly in vision and language. This objective requires a model to learn modality-agnostic functions to predict the future and past that undo each other when composed. We hypothesize that a model trained on this objective will discover long-term temporal dynamics in video. We verify this hypothesis by using the resultant visual representations and predictive models as-is to solve a variety of downstream tasks. Our method outperforms state-of-the-art self-supervised video prediction methods on future action anticipation, temporal image ordering, and arrow-of-time classification tasks, without training on target datasets or their labels.
1 Introduction
Prediction is a central problem in computer vision which researchers have been grappling with since the early days of the field [10, 12, 22, 30, 35, 40, 56]. Previous deep learning methods have largely focused on predicting fixed, small offsets into the future. To understand why this formulation is flawed, consider Figure 1. This figure shows a frame (a) from a video at some time $t$ and three frames (b)–(d) from later points in the same video. Which of the three should be the output of a model that predicts the future? Option (d) is closest to the future that humans are likely to imagine. By predicting frames such as option (b), which occur in the immediate future [15, 16, 49, 55], we limit the scope of temporal transitions that can be learned by models and hurt downstream performance.
Motivated by this example, we identify three central challenges in training a model to predict the future. First, manually annotating videos with temporal relationships between frames is prohibitively expensive, and ground truth may be difficult to define. Therefore, models should be able to learn from large unlabeled datasets of in-the-wild action and discover transitions autonomously, to enable practical applications. Second, modeling the complex long-term transitions in the real world requires learning high-level concepts, more naturally found in abstract latent representations than raw pixels. Finally, the duration elapsed by temporal transitions can vary significantly depending on context, and models must be able to make predictions at varied offsets into the future. To satisfy these desiderata, we introduce a new self-supervised training objective, Multi-Modal Temporal Cycle Consistency (MMCC), and a model that learns a representation to solve it.

We show the MMCC objective in Figure 2. Starting from a sampled frame in a narrated video, our model learns to attend among all narration text to retrieve a relevant utterance. Combining both modalities, the model learns a function to predict a latent future, attending over the entire video to retrieve a future frame. This frame’s corresponding utterance is estimated, and a function to predict a past frame is learned in a similar way. The cycle constraint requires that the final model prediction be equal to the starting frame.
MMCC addresses all three challenges discussed above. In Figure 1, only (d) is a viable solution to our cycle formulation. Selecting (c) as a future would not allow the model to return to (a), since the two frames have no clear relationship. On the other hand, because the model does not know which modality its input comes from—and therefore must operate equally on vision and language—it is discouraged from selecting lower-level future frames such as (b), which likely do not accompany a predictable change in text.
We show that our model, trained end-to-end from scratch to solve the MMCC objective on the HowTo100M dataset [38], captures long-term dynamics in its predictive model of the future, and can be used without further training to anticipate future actions, order image collections, and identify salient temporal relationships in long videos. It also learns representations of video and text that contain information relevant to modeling temporal dynamics, which we demonstrate to be crucial to the quality of prediction.
Our main contributions are:
• MMCC, a self-supervised multi-modal temporal cycle consistency objective that requires learning visual representations attuned to temporal dynamics, as well as long-term predictive models of the future and past.
• An attention-based model to solve this objective, which uses cross-modal and temporal cues to discover relationships through time in video.
• Since no previous self-supervised benchmarks exist in this area, a suite of qualitative and quantitative tasks to evaluate learned representations and predictive models. Our model outperforms the self-supervised SOTA in video prediction on all tasks.

2 Related Work
Modeling the future. Building predictive models of the future is a long-studied task in the computer vision community. Early work considers generating or warping pixels or optical flow to synthesize immediate futures [3, 11, 33, 42, 43, 48, 56, 57, 58, 59, 66]. More recent work attempts to model uncertainty in pixel-space, often by learning a distribution of futures that can be sampled [8, 18, 21, 28, 30, 50, 53, 54, 64]. These approaches tend to focus on synthetic or very short-term data, since synthesis is challenging in real video. Rather than predicting pixels, another line of work uses supervision to predict future action labels [19, 25, 29, 37, 46]. Sun et al. [51] also uses narrated video, but quantizes input video using Kinetics supervision, then learns a transformer-based model of vision-and-language sequences. Instead of using supervision, Vondrick et al. [55] predicts representations which are trained to capture abstract concepts but are automatically obtained on large collections of data. Recent work extends this, using contrastive learning or other techniques to predict future representations [13, 15, 16, 49, 62]. With very few exceptions [21], this line of work is concerned with predicting a fixed offset $t+k$ given time $t$. This formulation is highly constraining. Our model can predict arbitrarily far into the future and learns long-term dynamics from unlabeled, narrated video.
Learning from unlabeled narrated video. Self-supervised learning has a long history, even dating back to the early 1990s, where De Sa [6] considered audiovisual data to “derive label[s] from a co-occurring input to another modality”. We join an increasingly popular line of work and leverage automatic textual transcripts extracted from narrated videos uploaded online. Combining video and text has been widely explored in the deep learning era, with datasets largely focusing on manual textual annotation of video [2, 5, 63, 67] or on movies which have provided scripts [44, 45]. Other work instead learns from automatic transcripts of narrations in instructional videos [1, 34, 65]. A main benefit of learning from unlabeled video is that it unlocks unprecedented scales of data; Miech et al. [38] introduces a dataset of over 100 million video clips and their narration transcripts, which is later used to learn strong models of cross-modal correspondence [36]. We are inspired by their success in training vision-and-language models on large collections of narrated video, and build on their data and approach to learn temporal dynamics.
Learning with self-supervised cycles. Cycle consistency was recently proposed [68] as a natural cue for learning from unlabeled data or when ground truth is unavailable. In Zhu et al. [69], cycles are used for unpaired image-to-image translation; Recycle-GAN [4] builds on this in follow-up work that incorporates simple temporal prediction (one timestep into the future) into these cycles. Kulkarni et al. [27] uses cycles to learn mappings between canonical 3D surfaces and 2D images. Dwibedi et al. [9] uses cycles to enforce that moments from two different videos should be mutual nearest neighbors, aligning action sequences and learning features useful for downstream tasks. Another line of work uses cycles to track objects through time [20, 60], tracking a pixel forward and then backward in time and requiring that the final pixel be the same as the start pixel. We are inspired by all these applications and introduce a new type of temporal cycle, one which not only incorporates multi-modal information into its learning, but also predicts dynamically into the future, instead of at a fixed offset. In particular, we draw inspiration from Jabri et al. [20], which casts temporal edges as contrastive comparisons (i.e., attention) among candidate nodes.
3 Learning to Cycle through Narrated Video
Our model learns long-term temporal dynamics by cycling through narrated video. We formulate the cycle consistency problem as follows: given a moment in a start modality $\mathcal{A}$ (either video $\mathcal{V}$ or text $\mathcal{T}$), retrieve a corresponding moment in the other modality $\mathcal{B}$, then use both modalities to select a future moment in $\mathcal{A}$. From this future moment, find a correspondence in $\mathcal{B}$, then select a past moment in $\mathcal{A}$. For the cycle to be complete, this final moment must be the same as the initial moment. We illustrate the cycle in Figure 2. Solving this problem requires learning forward- and backward-in-time predictive functions that invert each other, as well as image and sentence embeddings that capture inter-modal correspondences and temporally relevant information.
3.1 Cycles as repeated soft attention
Let $\mathcal{V}$ and $\mathcal{T}$ be sequences of video and text, respectively, drawn from the same temporal interval. These sequences can be discretized into frames $\{v_i\}_{i=1}^{N_v}$ and utterances $\{t_j\}_{j=1}^{N_t}$, where $N_v$ and $N_t$ are the numbers of instances each sequence is split into. We refer to each instance as a node, which allows viewing the training goal as learning a cyclic path through a graph, as depicted in Figure 2.
In order to differentiate through the cycle generation process, let $e_{x \to y}$ be an edge in the graph shown in Figure 2. We implement edges as soft retrievals of $y$ given $x$, as shown in Figure 3. This soft retrieval operation can be viewed as an application of the well-known attention mechanism [52].
We start by running all visual and textual nodes through embedding networks $f_v$ and $f_t$, initialized with random weights. We use the architecture from [36] for embedding text nodes and a ResNet-18 [17] for visual nodes. This operation yields series of embeddings $\mathbf{v}_i = f_v(v_i)$ and $\mathbf{t}_j = f_t(t_j)$.

We then compute the cycle edges, where each edge is an instance of soft attention as described above. The attention operation accepts sets of query, key, and value vectors $Q$, $K$, and $V$ and returns a set of new values $\hat{V}$, computed (with $\tau$-temperature softmax along the second dimension) as

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\tau}\right)V. \qquad (1)$$
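As a concrete illustration, the following is a minimal PyTorch sketch of this temperature-scaled soft attention; the tensor shapes and temperature value are illustrative assumptions rather than our exact configuration.

```python
import torch
import torch.nn.functional as F

def soft_attention(q, k, v, tau=0.1):
    """Temperature-scaled soft attention (Eq. 1).

    q: (n_q, d) queries, k: (n_k, d) keys, v: (n_k, d_v) values.
    Returns (n_q, d_v): softmax(q k^T / tau) applied to the values.
    """
    logits = q @ k.t() / tau              # (n_q, n_k) similarity scores
    weights = F.softmax(logits, dim=-1)   # normalize over candidate nodes
    return weights @ v                    # soft retrieval of values

# Example: one query attending over 8 candidate nodes in a 128-d space.
q, k, v = torch.randn(1, 128), torch.randn(8, 128), torch.randn(8, 128)
out = soft_attention(q, k, v)             # shape (1, 128)
```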
To cycle through narrated video, we first select a modality $\mathcal{A} \in \{\mathcal{V}, \mathcal{T}\}$ and a start node $a_i$ (we describe the process for selecting $a_i$ in Section 3.3). We find the representation of the corresponding node in the other modality $\mathcal{B}$ with a cross-modal attention edge:

$$\tilde{\mathbf{b}}_i = \text{Attention}\big(g_{\mathcal{A}}(\mathbf{a}_i),\, g_{\mathcal{B}}(\mathbf{B}),\, \mathbf{B}\big), \qquad (2a)$$

where $\mathbf{a}_i = f_{\mathcal{A}}(a_i)$ and $\mathbf{B}$ stacks the embeddings of all nodes in $\mathcal{B}$. We learn to project representations into a shared semantic space, using modality-specific projectors $g_{\mathcal{V}}, g_{\mathcal{T}}$ that output vectors $z \in \mathbb{R}^d$. For notational convenience, we denote by $\text{Attention}_{j \in S}$ the same attention operation restricted to keys (and values) at indices $j \in S$. In this notation we can rewrite the above as:

$$\tilde{\mathbf{b}}_i = \text{Attention}_{j \in \mathcal{B}}\big(z^{\mathcal{A}}_i,\, z^{\mathcal{B}}_j,\, \mathbf{b}_j\big), \qquad (2b)$$

where $z^{\mathcal{A}}_i = g_{\mathcal{A}}(\mathbf{a}_i)$ and $z^{\mathcal{B}}_j = g_{\mathcal{B}}(\mathbf{b}_j)$ are the $z$-space projections.
The representations from both modalities are concatenated and run through a multi-layer perceptron $h$. This operation yields $s_i = h([\mathbf{a}_i; \tilde{\mathbf{b}}_i])$, embedding the joint information back into the shared semantic space. This choice also allows us to train our temporal edges without cross-modal information and accept input from only one modality with some probability $p$, since $g_{\mathcal{V}}$ and $g_{\mathcal{T}}$ also map to $z$-space, in $\mathbb{R}^d$.
Our model must now go from this multi-modal state representation $s_i$ to a future state representation $s_{i'}$. First, we predict an estimated representation of the future in projection ($z$) space, $\hat{z}_{\text{fut}} = \mathcal{F}(s_i)$, with an MLP $\mathcal{F}$. We then retrieve the node in modality $\mathcal{A}$ corresponding to this future state with a forward-in-time attention edge:

$$\tilde{\mathbf{a}}_{i'} = \text{Attention}_{j \in \mathcal{A}}\big(\hat{z}_{\text{fut}},\, z^{\mathcal{A}}_j,\, \mathbf{a}_j\big). \qquad (3)$$
It is important to note that attention is order-invariant in $K$ and $V$, i.e., shuffling the rows of $K$ and $V$ yields the same output $\hat{V}$, since the individual matrix-row multiplications are agnostic to row index. This means that, importantly, the model is not given temporal information about input nodes, which could be used as a shortcut in learning (e.g., the model could cycle by selecting the node it knows sits at a fixed offset from the start index and then returning to it). We then retrieve the corresponding node in $\mathcal{B}$, $\tilde{\mathbf{b}}_{i'}$, with another cross-modal edge, projecting queries and keys into $z$-space:

$$\tilde{\mathbf{b}}_{i'} = \text{Attention}_{j \in \mathcal{B}}\big(g_{\mathcal{A}}(\tilde{\mathbf{a}}_{i'}),\, z^{\mathcal{B}}_j,\, \mathbf{b}_j\big). \qquad (4)$$
As before, these vectors are combined to yield a future state representation $s_{i'} = h([\tilde{\mathbf{a}}_{i'}; \tilde{\mathbf{b}}_{i'}])$.
This process is repeated to predict backward in time. We compute $\hat{z}_{\text{past}} = \mathcal{P}(s_{i'})$, where $\mathcal{P}$ shares its first few layers with $\mathcal{F}$ to allow learning features useful for dynamics in either direction (see Section 3.5 for more details). To close the cycle, we compute the normalized similarity scores between $\hat{z}_{\text{past}}$ and the $z$-space nodes in $\mathcal{A}$:

$$p_j = \frac{\exp\big(\hat{z}_{\text{past}}^\top z^{\mathcal{A}}_j / \tau\big)}{\sum_{k \in \mathcal{A}} \exp\big(\hat{z}_{\text{past}}^\top z^{\mathcal{A}}_k / \tau\big)}. \qquad (5)$$

We train our system with the negative log likelihood loss on the score vector $\mathbf{p}$ cycling back to the location $i$ of the start node, which we denote $\mathcal{L}_{\text{cycle}}$:

$$\mathcal{L}_{\text{cycle}} = -\log p_i. \qquad (6)$$
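To make the cycle concrete, the sketch below walks through one training cycle under the notation above (Eqs. 2–6), starting from a visual node. It is a schematic PyTorch sketch, not our exact implementation: the module names, toy dimensions, and the use of single linear layers for $g_{\mathcal{V}}$, $g_{\mathcal{T}}$, $h$, $\mathcal{F}$, and $\mathcal{P}$ are assumptions, and the constraints of Section 3.4 are omitted.

```python
import torch
import torch.nn.functional as F

def attend(query, keys, values, tau=0.1):
    """softmax(query · keys^T / tau) over candidate nodes, applied to values."""
    w = F.softmax(query @ keys.t() / tau, dim=-1)
    return w @ values

def cycle_loss(V, T, g_v, g_t, h, fwd, bwd, i, tau=0.1):
    """One forward/backward cycle starting from visual node i.

    V: (N_v, d) frame embeddings, T: (N_t, d) utterance embeddings.
    g_v, g_t project into the shared z-space, h fuses the two modalities,
    and fwd / bwd predict latent future / past states in z-space.
    """
    z_v, z_t = g_v(V), g_t(T)                      # project all nodes to z-space
    # Cross-modal edge: retrieve the utterance matching frame i (Eq. 2).
    t_i = attend(z_v[i:i + 1], z_t, T, tau)
    s_i = h(torch.cat([V[i:i + 1], t_i], dim=-1))  # joint state in z-space
    # Forward temporal edge: retrieve a future frame (Eq. 3).
    v_fut = attend(fwd(s_i), z_v, V, tau)
    # Cross-modal edge at the future moment (Eq. 4).
    t_fut = attend(g_v(v_fut), z_t, T, tau)
    s_fut = h(torch.cat([v_fut, t_fut], dim=-1))
    # Backward temporal edge: score every frame as a candidate past (Eq. 5);
    # cross-entropy to the start index i gives the cycle loss (Eq. 6).
    logits = bwd(s_fut) @ z_v.t() / tau
    return F.cross_entropy(logits, torch.tensor([i]))

# Toy usage with equal embedding and z-space dimensions (d = 64):
d = 64
g_v, g_t = torch.nn.Linear(d, d), torch.nn.Linear(d, d)
h = torch.nn.Linear(2 * d, d)
fwd, bwd = torch.nn.Linear(d, d), torch.nn.Linear(d, d)
loss = cycle_loss(torch.randn(20, d), torch.randn(12, d), g_v, g_t, h, fwd, bwd, i=3)
loss.backward()
```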
3.2 Cross-modal correspondence
A key component of our cycle model is the ability to find correspondences between vision and language. Eqs. 2b and 4 crucially rely on this ability in order to incorporate multi-modal information into temporal edges. Recent work has demonstrated remarkable progress in training models on massive datasets for this cross-modal retrieval task: given a moment at time $k$ in one modality of a video (e.g., frame $v_k$), find the matching moment in the other modality (e.g., utterance $t_k$).
We build on the approach presented in [36], which uses a contrastive loss to train representations of vision and language, where temporally co-occurring information is considered ground truth ($v_k$ should retrieve $t_k$), and other vision-language pairs are used as negatives. To handle the common misalignment intrinsic to real-world video, [36] allows representations within a window of nodes around the ground truth node to be considered as positives. We adopt this approach to learn cross-modal correspondence, training it for finer-grained discrimination among a set of candidate moments drawn from the same video as opposed to randomly across the entire dataset. We denote the loss used to train cross-modal correspondence $\mathcal{L}_{\text{cm}}$. For the full cross-modal formulation, please see Supplementary Material.
3.3 Starting the cycle
Our model will be unable to learn semantic transitions between states if the initial input node depicts noisy or unclear data. This is especially probable when training on unconstrained, real-world video datasets. Therefore, instead of randomly sampling start nodes, we sample from a distribution defined by a "concreteness" score $c_i$. We calculate this score for each node as the highest cross-modal similarity between that node and any node in the other modality. Intuitively, this score captures concreteness since frames and utterances that align strongly tend to contain objects or actions which are salient in both modalities:

$$c_i = \max_{j \in \mathcal{B}} \, (z^{\mathcal{A}}_i)^\top z^{\mathcal{B}}_j. \qquad (7)$$

We run the above scores through a temperature softmax, yielding a distribution from which we sample the start node $a_i$.
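A minimal sketch of this start-node sampling is given below, assuming L2-normalized $z$-space projections and an illustrative softmax temperature.

```python
import torch
import torch.nn.functional as F

def sample_start_node(z_a, z_b, temperature=0.1):
    """Sample a start index from the 'concreteness' distribution (Eq. 7).

    z_a: (N_a, d) z-space nodes of the start modality,
    z_b: (N_b, d) z-space nodes of the other modality.
    """
    concreteness = (z_a @ z_b.t()).max(dim=1).values   # best cross-modal match per node
    probs = F.softmax(concreteness / temperature, dim=0)
    return torch.multinomial(probs, 1).item()

start = sample_start_node(F.normalize(torch.randn(20, 64), dim=-1),
                          F.normalize(torch.randn(12, 64), dim=-1))
```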
3.4 Avoiding collapse
Training on the above formulation of $\mathcal{L}_{\text{cycle}}$ may in practice lead to fast collapse to a simple "looping in place" solution, where temporal edges always point to the current node. We propose two strategies to prevent this collapse:
Constraining candidate nodes. We can limit the range of temporal edges during training by removing nodes from the candidate sets in Eqs. 3 and 5. We rewrite Eq. 3 with $j > i$, i.e., since we know the index the cycle starts from, we can consider only those nodes after the start point in the forward edge. We similarly rewrite Eq. 5 with $j < i^*$, where $i^* = \arg\max_j \hat{z}_{\text{fut}}^\top z^{\mathcal{A}}_j$, i.e., the index of the node with highest similarity to the latent predicted future. This constrains the backward edge to only consider nodes that precede the estimated current index. This can also be seen as resolving the sign ambiguity inherent to the unconstrained formulation, which allows the model to go back-then-forward or vice versa. Importantly, we run the model without this constraint at test time.
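One way to realize this constraint, shown in the sketch below, is to mask the attention logits of disallowed candidate nodes before the softmax; the masking-based formulation is an implementation assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def masked_attention(query, keys, values, allowed, tau=0.1):
    """Soft attention restricted to a boolean mask over candidate nodes."""
    logits = query @ keys.t() / tau
    logits = logits.masked_fill(~allowed, float('-inf'))  # drop disallowed candidates
    return F.softmax(logits, dim=-1) @ values

# Forward edge from start index i over n candidate frames: only later nodes
# are allowed; the backward edge would analogously allow only earlier nodes.
i, n, d = 4, 10, 64
allowed_future = (torch.arange(n) > i).unsqueeze(0)
out = masked_attention(torch.randn(1, d), torch.randn(n, d),
                       torch.randn(n, d), allowed_future)
```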
Penalizing visual similarity. Alternatively, we can encourage our model to select visually diverse nodes in its temporal edges:

$$\mathcal{L}_{\text{sim}} = \max\big(0,\ \text{sim}(\mathbf{v}_i, \tilde{\mathbf{v}}_{i'}) - m\big) + \max\big(0,\ \text{sim}(\tilde{\mathbf{v}}_{i'}, \tilde{\mathbf{v}}_{i''}) - m\big), \qquad (8)$$

where $\text{sim}$ denotes cosine similarity, $\tilde{\mathbf{v}}$ is the visual representation retrieved by the corresponding temporal edge (Eq. 5 with the values replaced by visual embeddings), and $m$ is a margin.
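A possible instantiation of this penalty, assuming the hinge-on-cosine form reconstructed above and normalized visual embeddings (the margin value is illustrative):

```python
import torch

def visual_similarity_penalty(v_start, v_future, v_past, margin=0.5):
    """Hinge penalty discouraging temporal edges from retrieving frames that
    look like the frames they start from (inputs are L2-normalized)."""
    sim_fwd = (v_start * v_future).sum(-1)   # cosine similarity, start -> future
    sim_bwd = (v_future * v_past).sum(-1)    # cosine similarity, future -> past
    return (torch.clamp(sim_fwd - margin, min=0) +
            torch.clamp(sim_bwd - margin, min=0)).mean()
```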
In practice, we combine both the above strategies for the strongest results.
3.5 Implementation
We combine $\mathcal{L}_{\text{cycle}}$, $\mathcal{L}_{\text{cm}}$, and $\mathcal{L}_{\text{sim}}$ in our final loss:

$$\mathcal{L} = \lambda_{\text{cycle}} \mathcal{L}_{\text{cycle}} + \lambda_{\text{cm}} \mathcal{L}_{\text{cm}} + \lambda_{\text{sim}} \mathcal{L}_{\text{sim}}. \qquad (9)$$
We embed images using a ResNet-18 [17], and embed text using a word embedding matrix followed by an MLP and global pooling, as in [36].
We implement all remaining modules ($g_{\mathcal{V}}$, $g_{\mathcal{T}}$, $h$, $\mathcal{F}$, $\mathcal{P}$) as MLPs, where each layer is followed by a ReLU and a LayerNorm, except for the final layer, which is followed by normalization if its output is in $z$-space. The projectors are one-layer MLPs; $\mathcal{F}$ and $\mathcal{P}$ are four-layer MLPs, with the weights of their first two layers shared. We randomly sample batches of video segments of a fixed maximum duration. The sparsity at which data is sampled affects the time elapsed by input videos in a batch as well as the granularity of visual information provided to the model. Denser data is less likely to miss key moments, but more likely to contain redundant information. We therefore train models at various image sampling frame rates.
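A sketch of such an MLP block is given below; the layer structure follows the description above, while the specific widths are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPBlock(nn.Module):
    """MLP in the style described above: each hidden layer is followed by a
    ReLU and a LayerNorm; the output is normalized when it lives in z-space."""

    def __init__(self, dims, normalize_output=True):
        super().__init__()
        layers = []
        for d_in, d_out in zip(dims[:-2], dims[1:-1]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.LayerNorm(d_out)]
        layers.append(nn.Linear(dims[-2], dims[-1]))   # final layer, no ReLU/LayerNorm
        self.net = nn.Sequential(*layers)
        self.normalize_output = normalize_output

    def forward(self, x):
        out = self.net(x)
        return F.normalize(out, dim=-1) if self.normalize_output else out

# e.g., a one-layer projector and a four-layer temporal predictor (toy widths):
projector = MLPBlock([512, 256])
predictor = MLPBlock([256, 256, 256, 256, 256])
z = predictor(projector(torch.randn(8, 512)))
```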
Because good cross-modal correspondence is necessary to learn strong, semantic cycles, we initialize $\lambda_{\text{cycle}}$ at a small value and exponentially increase it to its final value across 30 epochs. We peg $\lambda_{\text{sim}}$ to a fixed ratio of $\lambda_{\text{cycle}}$ when using the similarity loss. For further details on training and architecture, please see Supplementary Material.
4 Experiments
This section examines the design choices and learned temporal dynamics of our model. Since most previous benchmarks focus on supervised action anticipation with fixed categories and time offsets [5, 26], we design a suite of qualitative and quantitative experiments to evaluate different approaches.
4.1 Data
We train our model on unconstrained real-world video data. Specifically, we use a subset of the HowTo100M dataset [38], which contains around 1.23 million videos and their automatically extracted audio transcripts. Videos in this dataset are roughly categorized by subject area, and we use only the videos categorized “Recipe”, around a quarter of the dataset. We build a train-validation-test split such that of 338,033 total recipe videos, 80% are in train, 15% in validation, and 5% in test. Recipe videos are rich in complex objects, actions, and state transitions, and the subset allows us to train models faster.
For more controlled testing, we use the CrossTask dataset [70], which contains similar videos along with task-specific annotations. Videos are associated with tasks (e.g., “making pancakes”), where each task has a predefined sequence of high-level subtasks with rich long-term temporal inter-dependencies (e.g., [“pour flour into bowl”, “crack egg into bowl”, …, “drizzle maple syrup”]). Video segments that depict one of these subtasks are annotated as such.
4.2 Previous work and baselines
Baselines: We evaluate purely cross-modal features (Section 3.2), given by the frozen embedding networks $f_v$ and $f_t$, and also use these features as prediction targets for RA and TAP below. We also study ImageNet-supervised features [7].
Representation Anticipation (RA): As a representative of the self-supervised line of work in predicting a fixed offset into the future, we implement RA [55] on our data and architecture, training a model to predict frozen representations of a network trained for cross-modal correspondence. In vision, we train the network to anticipate one second into the future, while in text, we anticipate the subsequent utterance (on average, 2 seconds into the future). We train:

$$\mathcal{L}_{\text{RA}} = \big\| \phi\big(f(x_t)\big) - f(x_{t+\delta}) \big\|_2^2, \qquad (10)$$

where $f$ is the frozen cross-modal embedding network, $\phi$ is the predictive model, and $\delta$ is the fixed offset.
Time-Agnostic Prediction (TAP): Noting the restrictive nature of the fixed-offset formulation, TAP [21] introduces the minimum-across-time formulation to allow the prediction of "bottleneck" predictable moments. We implement their loss, taking the minimum across all future moments in the sampled video segment:

$$\mathcal{L}_{\text{TAP}} = \min_{\delta > 0} \big\| \phi\big(f(x_t)\big) - f(x_{t+\delta}) \big\|_2^2. \qquad (11)$$
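For clarity, the sketch below contrasts the two baseline objectives, assuming the L2 regression form of Eqs. 10 and 11; the prediction head and feature dimensions are illustrative.

```python
import torch

def ra_loss(predictor, feat_now, feat_future):
    """Representation Anticipation: regress the frozen representation at one
    fixed offset into the future (Eq. 10)."""
    return ((predictor(feat_now) - feat_future) ** 2).sum(-1).mean()

def tap_loss(predictor, feat_now, feat_futures):
    """Time-Agnostic Prediction: take the minimum error over all future
    moments in the sampled segment (Eq. 11), so the model may latch onto the
    most predictable 'bottleneck' frame. feat_futures: (T, d)."""
    errors = ((predictor(feat_now) - feat_futures) ** 2).sum(-1)
    return errors.min()

predictor = torch.nn.Linear(64, 64)
print(tap_loss(predictor, torch.randn(64), torch.randn(10, 64)))
```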
While the above two models do not consider the exact same setting as us, we re-implement their approaches as faithfully as possible, training them to predict SOTA features trained for cross-modal correspondence.
MemDPC: In order to efficiently model multiple future hypotheses, MemDPC [16] casts future prediction as estimation of convex combinations of memories stored in a codebook, and achieves SOTA performance on tasks of interest. We evaluate their trained visual future prediction model, which does not take textual information as input.
4.3 Evaluating cycle consistency
Central to our formulation is the model’s ability to learn dynamic predictions of the future and past that undo each other, as well as finding strong cross-modal correspondences. Thus, we begin by evaluating how well different model variants are able to solve our self-supervised objective on the Recipes test set. We ablate various design choices, including multi-modal information usage, cycle edge order, and temporal constraints on edges.
Table 1: Ablation of design choices, measured by average percentile rank (higher is better) of ground-truth cycle and cross-modal nodes on the Recipes test set.

Choice | Variant | Cycle | Cross-modal
Temporal constraint | None* | – | –
 | Similarity loss | 93.1 | 74.4
 | Max-index | 92.6 | 74.3
 | Max-index + sim. loss | 93.6 | 75.7
Multi-modal info. | | 89.8 | 74.3
 | | 93.6 | 75.7
 | | 96.5 | 75.9
Start point selection | Cross-modal similarity | 93.6 | 75.7
 | Random | 88.7 | 74.5
Input embedding | Fine-tuned | 93.6 | 75.7
 | Frozen cross-modal [36] | 67.5 | 76.8
Cycle path | Within modalities | 93.6 | 75.7
 | Across modalities | 85.0 | 73.2
Chance | | 50.0 | 50.0
Multi-modal information: As an alternative to defining the state as a learned combination of visual and textual representations , we can use only one modality at a time, giving . The frequency at which only unimodal information is used can be controlled by a hyperparameter .
Cycle path: The above formulation navigates between moments in the start modality $\mathcal{A}$, optionally using information from $\mathcal{B}$ to augment representations. We denote this variant Within modalities. The order of these edges can also be permuted, such that cycles start in $\mathcal{A}$, retrieve a moment in $\mathcal{B}$, find a future moment in $\mathcal{B}$, then cycle back through $\mathcal{A}$. This variant is denoted Across modalities.
Evaluating variants: To compare between different variants, we measure the average percentile rank (e.g. 100 = ground truth is ranked first among all candidates, 50 = ranked in the middle, 0 = ranked last) assigned by our model to ground truth cross-modal and cycle nodes. We show this ablation study in Table 1, observing significant gains using our cycle configuration. We hypothesize that across-modality cycles perform worse since switching modalities acts as a bottleneck, forcing the model to discard information that would be useful for subsequent edges.
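The percentile rank metric as we use it here can be computed as in the short sketch below (linear interpolation between last = 0 and first = 100 is an assumption of the sketch):

```python
import numpy as np

def percentile_rank(scores, gt_index):
    """Percentile rank of the ground-truth candidate among all candidates:
    100 = ranked first, 50 = middle, 0 = ranked last."""
    order = np.argsort(-np.asarray(scores))          # best-scoring candidate first
    rank = int(np.where(order == gt_index)[0][0])    # 0-based rank of ground truth
    return 100.0 * (1.0 - rank / (len(scores) - 1))

# Ground truth scored second-best among 5 candidates -> 75.0
print(percentile_rank([0.1, 0.9, 0.4, 0.7, 0.2], gt_index=3))
```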
Visualizing cycles: We show examples of cycles discovered by the trained model in Figure 4. Our model correctly cycles back around 66% of the time (chance is 4%). The model appears to traverse video according to long-term dynamics, as hypothesized. Note that these transitions occur up to one minute apart, highlighting the importance of allowing dynamic prediction offsets.

4.4 Zero-shot prediction

Ranking transitions by likelihood: To directly evaluate the learned representations and the predictive functions $\mathcal{F}$ and $\mathcal{P}$, we can visualize the pairs of frames $(v_i, v_j)$ for which the probability of $v_j$ being the future of $v_i$ is highest. We model this probability as the product of the likelihood of states $v_i$ and $v_j$ and the forward and backward likelihoods of the transition:

$$P(v_i \to v_j) = c_i\, c_j \cdot \mathrm{softmax}_j\big(\mathcal{F}(s_i)^\top z^{\mathcal{V}}_j / \tau\big) \cdot \mathrm{softmax}_i\big(\mathcal{P}(s_j)^\top z^{\mathcal{V}}_i / \tau\big), \qquad (12)$$

where $s_i$ and $s_j$ are the result of running $v_i$ and $v_j$ (optionally with cross-modal information) through $h$, and $c$ is the concreteness score defined in Equation 7.
We compute this probability efficiently for all pairs in long clips of continuously sampled video. We then look at the top temporal transitions discovered by the model in each video. We show results on the Recipes test set in Figure 5. The top transitions show clear state transitions such as adding chocolate chips to dough, segmenting an orange, and baking a loaf. These predictions could not be made by a model trained to predict a fixed future, since they occur at varied temporal offsets.
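A vectorized sketch of this pairwise scoring, following the factorization in Eq. 12 (the exact normalization and temperature are assumptions):

```python
import torch
import torch.nn.functional as F

def transition_scores(states, z_v, concreteness, fwd, bwd, tau=0.1):
    """Score every ordered frame pair (i, j) as in Eq. 12: concreteness of
    both endpoints times the forward likelihood of j given i and the backward
    likelihood of i given j."""
    p_fwd = F.softmax(fwd(states) @ z_v.t() / tau, dim=-1)  # row i: future distribution
    p_bwd = F.softmax(bwd(states) @ z_v.t() / tau, dim=-1)  # row j: past distribution
    prior = concreteness[:, None] * concreteness[None, :]   # c_i * c_j
    return prior * p_fwd * p_bwd.t()                        # (N, N), entry (i, j)

n, d = 16, 64
scores = transition_scores(torch.randn(n, d),
                           F.normalize(torch.randn(n, d), dim=-1),
                           torch.rand(n),
                           torch.nn.Linear(d, d), torch.nn.Linear(d, d))
top_pairs = torch.topk(scores.flatten(), k=3).indices       # highest-likelihood transitions
```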
Predicting future actions: Existing benchmarks e.g. [5, 26] focus on predicting action from a few frames in the immediate past or present. Instead, given a few frames, we wish to predict long-term temporal dynamics, which may unfold arbitrarily far into the future. While the former task is more well-defined, the latter is more interesting and relevant. However, ground truth for this task – i.e., per-frame annotation of related future action – is not widely available. We propose using CrossTask task steps as a proxy, since they capture long-term temporal relationships in video.
Table 2: Future action anticipation on CrossTask (recall and percentile rank).

Model | Recall @1 | Recall @5 | Recall @10 | Perc. rank (worst) | Perc. rank (mean) | Perc. rank (best)
MemDPC* [16] | 2.9 | 15.8 | 27.4 | 25.6 | 48.4 | 71.4
Cross-modal [36] | 2.9 | 14.2 | 24.3 | 28.2 | 47.9 | 68.2
Repr. Ant. [55] | 3.0 | 13.3 | 26.0 | 25.7 | 47.7 | 71.4
TAP [21] | 4.5 | 17.1 | 27.9 | 28.3 | 50.1 | 71.6
MMCC (ours) | 5.4 | 19.9 | 33.8 | 33.0 | 55.0 | 76.9
For a video belonging to a task with $n$ predefined subtasks, let $x$ be a clip annotated with the $k$-th subtask in the predefined sequence. We would like to predict future actions from $x$. For example, given a short clip of eggs being added to a bowl with flour, the model should assign high likelihoods to subtasks such as "mix batter" and low likelihoods to "crack eggs" or "season steak". Formally, we define a future likelihood score given a video segment $x$ and a candidate future subtask description $y$. We first sample frames from the video segment and compute their average embedding. The likelihood score uses our learned representations and predictive model, scoring candidate $y$ by the similarity between the predicted future of this averaged visual state and the embedded subtask description. We compute likelihood scores for all subtask descriptions in the CrossTask validation set, and consider the model's prediction correct if any of the future actions (subtasks $k+1$ through $n$) are predicted, since not all future subtasks are necessarily related to the given visual state.
Table 2 shows recall and percentile rank statistics for this task. We compare our model to [16, 21, 55], replacing $\mathcal{F}$ with each method's predictive model. Since [16] is vision-only, we set the subtask representation to the average visual representation of all video segments with the given subtask label. We also define a cross-modal similarity score as a strong baseline, taking advantage of contextual similarities in video and text. Our model outperforms all baselines and the self-supervised state of the art on detecting the temporal relationships between visual states and future actions.
4.5 Further analysis

Unshuffling bags of frames: The ability to order a shuffled set of states is used to evaluate human language and common-sense reasoning skills, and has been explored as a learning signal in NLP [14, 31, 32]. This same ability can also be used to discover temporal structure and summaries of events from large image datasets, as in [23]. We solve this problem by finding the optimal explanation of the shuffled video given by iterative application of our temporal dynamics model: out of all possible orderings, we select the one for which the product of pairwise transition likelihoods (Eq. 12) along the ordering is highest.
Given scores computed by Eq. 12, we induce a fully-connected directed graph with sampled frames as nodes and edge weights derived from these scores, so that likely transitions have low cost. Adding a special null node connected to all other nodes with edge weight 0 allows running this graph through an off-the-shelf traveling salesperson problem (TSP) solver (https://pypi.org/project/elkai/). The optimal TSP solution then represents the lowest-cost (ordered) path through all video clips, effectively unshuffling the input.
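A sketch of this unshuffling step is shown below. It assumes the pairwise likelihoods of Eq. 12 as input, a negative-log-likelihood cost, and elkai's older solve_float_matrix interface; all three are assumptions of the sketch rather than details of our exact pipeline.

```python
import numpy as np
import elkai  # off-the-shelf TSP solver; assumes the solve_float_matrix interface

def unshuffle(pair_scores):
    """Order segments by solving a TSP over transition costs.

    pair_scores: (N, N) matrix of transition likelihoods (Eq. 12). A null
    node with zero-cost edges lets the closed tour act as an open path.
    """
    n = pair_scores.shape[0]
    cost = -np.log(pair_scores + 1e-9)      # likely transitions become cheap edges
    full = np.zeros((n + 1, n + 1))
    full[1:, 1:] = cost                     # node 0 is the null node
    tour = elkai.solve_float_matrix(full)   # e.g. [0, 3, 1, 2, ...]
    return [i - 1 for i in tour if i != 0]  # drop the null node, re-index

predicted_order = unshuffle(np.random.rand(6, 6))
```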
We run this experiment on CrossTask, where videos are annotated with ordered steps and their associated temporal segments. We treat each segment as a node by computing its average visual representation, as before. We then use these representations to find scores between labeled segments and solve for an optimal path. We run this experiment both in vision only (by passing projected visual representations directly into the predictive functions in Eq. 12) as well as with ground truth vision-text pairings, and show results in Table 3. We show example predicted orderings in Figure 6. Again, we can replace our predictive model with the future prediction model of other methods and run the same algorithm. Our model outperforms previous work on all evaluation metrics.
Model | Kendall’s () | Spearman’s () | Edit dist. () |
Chance | 0.0000 | 0.0000 | 6.5822 |
Repr. Ant. [56] | 0.3383 | 0.4132 | 5.4596 |
MemDPC [16] | 0.3492 | 0.4206 | 5.3398 |
TAP [21] | 0.3344 | 0.4107 | 5.4178 |
MMCC (ours) | 0.3632 | 0.4420 | 5.3343 |
MMCC (vision only) | 0.3530 | 0.4328 | 5.3370 |
Table 4: Arrow-of-time classification accuracy (%). Rows: features used to train the classifier; columns: strategy used to sample frame pairs.

Features \ Sampling strategy | Rand | Cos sim | TAP | RA | Model | Avg
Random | 50.4 | 50.7 | 51.3 | 51.4 | 51.2 | 51.0
ImageNet | 51.5 | 52.1 | 50.8 | 50.9 | 53.4 | 51.7
Cross-modal | 52.6 | 53.3 | 50.9 | 50.6 | 55.8 | 52.6
Repr. Ant. [55] | 50.7 | 51.4 | 51.2 | 51.2 | 51.7 | 51.2
TAP [21] | 50.8 | 51.4 | 51.4 | 51.3 | 51.8 | 51.3
MMCC (ours) | 52.3 | 53.4 | 50.5 | 50.7 | 69.2 | 55.2
Average | 51.4 | 52.0 | 51.0 | 51.0 | 55.5 | 52.2
Chance | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0
From scratch | 51.1 | 52.1 | 51.8 | 51.4 | 62.5 | 53.8
Discovering the arrow of time: To further examine whether our model has learned to discover meaningful transitions between states, we explore the arrow of time classification task, introduced in [39, 41, 61]. In [61], a network is trained on short videos (on the order of seconds) to predict whether input is being played forward or backward.
We consider the more challenging task of predicting the temporal relationship between two far-apart input frames – which one comes first? For frames which depict unrelated moments, this task is perhaps near-impossible, even for humans. But if frames show semantically related states, the direction of likely transition provides a useful signal for solving the arrow-of-time task.
We train linear classifiers on top of frozen features, as well as a full network from scratch, to solve the arrow-of-time task on randomly shuffled pairs of frames. We sample pairs of frames using our learned predictive model by selecting the highest-probability futures of start frames selected with the concreteness prior (Eq. 7). We demonstrate in Table 4 that the temporal ordering of frames mined by our model is much more classifiable than that of frames sampled using the predictive models in previous work. Further, our learned features are much better able to classify a given pair of frames, since they must capture temporal information in training. This confirms that a strong understanding of dynamics emerges from the cycle consistency task.
5 Conclusion
We introduce a self-supervised method to learn temporal dynamics by cycling through narrated video. Despite the simplicity of our architecture, our model is able to discover long-term state transitions in vision and language. We show that this model can be applied without further training to challenging downstream tasks such as anticipating far-away action and ordering collections of image data.
Acknowledgements: This work was done while Dave Epstein was a student researcher at Google. We thank Alexei Efros, Mia Chiquier, and Shiry Ginosar for their feedback, and Allan Jabri for inspiration in figure design. Dave would like to thank Dídac Surís and Carl Vondrick for insightful early discussions on cycling through time in video.
6 Appendix

A Additional Experiments
Dissecting neuron activations: To solve our cycle consistency problem, our model must be able to attend to relevant parts of the visual input which inform predictions of the future and past, and lead to strong cross-modal correspondences. We probe our learned representation for discovered objects and actions and visualize this in Figure 7.
First, we compute the correlation between the activation of each neuron in the visual representation given by $f_v$ and the presence of words in the corresponding text, using the Spearman rank correlation test run for all neuron-word pairs (one test per word in the vocabulary and neuron in the embedding). We select the neuron-word pairs with the highest correlation scores and visualize examples that maximally activate these neurons, along with a GradCAM [47] localization of image regions that caused the activation. We generate the visualization by randomly selecting among images that yield the top 0.01% of neuron activation values. Note that the images in Figure 7 have been selected without considering their corresponding textual information, and are filtered only by how much they excite the neuron in question. Our model appears to learn localization of common actions and objects in the training data, despite training without any supervision.
B Implementation Details
B.1 Cross-modal correspondence
A key component of our cycle model is the ability to find correspondences between vision and language. Recent work has demonstrated remarkable progress in training models on massive datasets for this cross-modal retrieval task: given a moment at time $k$ in one modality of a video (e.g., frame $v_k$), find the matching moment in the other modality (e.g., utterance $t_k$).
We build on the approach presented in [36], which uses a contrastive loss to train representations of vision and language, where temporally co-occurring information is considered ground truth ($v_k$ should retrieve $t_k$). To handle the common misalignment intrinsic to real-world video, [36] allows representations within a window of nodes around the ground truth node to be considered as positives. We modify this approach to improve performance on the case where retrieval must discriminate among a set of candidate moments drawn from the same video as opposed to randomly across the entire dataset.
The concrete formulation in [36], which shares its general structure with the soft attention formulation used for cycle edges, is:

$$\mathcal{L}_{\text{cm}} = -\log \frac{\sum_{(v, t) \in \mathcal{P}_k} \exp\big(f_v(v)^\top f_t(t)\big)}{\sum_{(v, t) \in \mathcal{P}_k} \exp\big(f_v(v)^\top f_t(t)\big) + \sum_{(v', t') \in \mathcal{N}_k} \exp\big(f_v(v')^\top f_t(t')\big)}, \qquad (13)$$

where $f_v$ and $f_t$ are the modality-specific embedder networks, $\mathcal{P}_k$ is the set of positive (temporally close) vision-language pairs for moment $k$, and $\mathcal{N}_k$ is a set of negative pairs. We make three modifications to this method which we found to improve the performance of cross-modal retrieval (a sketch of the resulting objective follows the list below):
1. We consider a window of nodes in both directions of retrieval, allowing positives both before and after the temporally aligned node. This adds a symmetric term to the numerator of the above equation.
2. We use a Gaussian kernel to weight each positive pair based on the temporal distance between the two moments, whereas in [36] all positive pairs have equal weight. This adds a weighting term ahead of each summand in the numerator of the above equation.
3. We require our cross-modal retrieval to successfully compute correspondences among many candidates from the same video. Therefore, we augment the set of negatives with many moments from the same target video, as opposed to [36], which randomly samples negatives from throughout the dataset. This encourages the model to distinguish between moments using more fine-grained temporal cues and fewer higher-level topic cues.
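Putting the three modifications together, a sketch of the resulting objective is shown below; the window size, Gaussian width, and temperature are illustrative assumptions, and the candidates are taken from the same video as described in item 3.

```python
import torch
import torch.nn.functional as F

def cross_modal_loss(z_v, z_t, window=2, sigma=1.0, tau=0.1):
    """Contrastive cross-modal loss with a symmetric window of Gaussian-
    weighted positives; all other same-video pairs act as negatives.

    z_v: (N, d) frame projections, z_t: (N, d) utterance projections,
    assumed temporally aligned index-for-index and L2-normalized.
    """
    n = z_v.shape[0]
    sim = torch.exp(z_v @ z_t.t() / tau)                     # (N, N) pairwise scores
    offsets = torch.arange(n)[None, :] - torch.arange(n)[:, None]
    pos_mask = offsets.abs() <= window                       # symmetric positive window
    weights = torch.exp(-offsets.float() ** 2 / (2 * sigma ** 2)) * pos_mask
    numerator = (weights * sim).sum(dim=1)
    denominator = numerator + (sim * ~pos_mask).sum(dim=1)   # add same-video negatives
    return -(numerator / denominator).log().mean()

loss = cross_modal_loss(F.normalize(torch.randn(10, 64), dim=-1),
                        F.normalize(torch.randn(10, 64), dim=-1))
```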
B.2 Training
Because good cross-modal correspondence is necessary to learn strong, semantic cycles, we initialize $\lambda_{\text{cycle}}$ at a small value and exponentially increase it to its final value (which depends on the experiment) across 30 epochs. We peg $\lambda_{\text{sim}}$ to be a fixed ratio of $\lambda_{\text{cycle}}$ when using the similarity loss. We find that this schedule maintains high performance on the cross-modal correspondence task even when updating model weights to solve cycles. We train with the Adam optimizer [24] until convergence.
We set the unimodal probability $p$ to balance between leveraging cross-modal information and learning to function on only one modality. Cycle edges and the cycle loss are computed within each video separately, whereas the cross-modal loss uses negatives from across the entire batch.
For the text architecture, each word in an input utterance is first mapped to a vector by running it through an embedding matrix and another linear layer. These vectors are then run through a ReLU and max-pooled to yield a single vector, which is run through a final linear layer to give the utterance embedding.
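A sketch of this text branch is given below; the vocabulary size and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UtteranceEmbedder(nn.Module):
    """Word embedding matrix -> linear layer -> ReLU -> max-pool over words
    -> final linear layer, as described above."""

    def __init__(self, vocab_size=20000, word_dim=300, out_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.word_fc = nn.Linear(word_dim, out_dim)
        self.out_fc = nn.Linear(out_dim, out_dim)

    def forward(self, token_ids):
        # token_ids: (num_words,) integer word ids for one utterance
        words = torch.relu(self.word_fc(self.embed(token_ids)))
        pooled = words.max(dim=0).values      # max-pool over words
        return self.out_fc(pooled)            # final utterance embedding

utterance_vec = UtteranceEmbedder()(torch.randint(0, 20000, (7,)))
```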
C Effect of loss terms
Our model relies on cross-modal correspondence to use information from both modalities, but this can be learned in a separate pre-training stage, which allows setting $\lambda_{\text{cm}} = 0$ during cycle training. This decreases cycle accuracy by 7%. Co-training with this loss term (and annealing its weight, Sup. Mat. L152-156) prevents catastrophic forgetting of cross-modal correspondences, which are used to learn cycles. The visual similarity loss is optional, since the constrained temporal attention functions well on its own, so we can set $\lambda_{\text{sim}} = 0$, slightly impacting performance (e.g., -1% on cycle rank). The model does not learn temporal dynamics at all with $\lambda_{\text{cycle}} = 0$; this can be thought of as the cross-modal baseline.
References
- [1] Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste-Julien. Unsupervised learning from narrated instruction videos. In CVPR, 2016.
- [2] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In ICCV, 2017.
- [3] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv:1710.11252, 2017.
- [4] Aayush Bansal, Shugao Ma, Deva Ramanan, and Yaser Sheikh. Recycle-GAN: Unsupervised video retargeting. In ECCV, 2018.
- [5] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The EPIC-Kitchens dataset. In ECCV, 2018.
- [6] Virginia R de Sa. Learning classification with unlabeled data. In NeurIPS, 1994.
- [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- [8] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. arXiv:1802.07687, 2018.
- [9] Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Temporal cycle-consistency learning. In CVPR, 2019.
- [10] Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. arXiv:1710.05268, 2017.
- [11] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. arXiv:1605.07157, 2016.
- [12] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In ICRA, 2017.
- [13] Antonino Furnari and Giovanni Maria Farinella. What would you expect? anticipating egocentric actions with rolling-unrolling lstms and modality attention. In CVPR, 2019.
- [14] Jingjing Gong, Xinchi Chen, Xipeng Qiu, and Xuanjing Huang. End-to-end neural sentence ordering using pointer network. arXiv:1611.04953, 2016.
- [15] Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In ICCV Workshops, 2019.
- [16] Tengda Han, Weidi Xie, and Andrew Zisserman. Memory-augmented dense predictive coding for video representation learning. arXiv:2008.01065, 2020.
- [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [18] Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction. In NeurIPS, 2018.
- [19] De-An Huang and Kris M Kitani. Action-reaction: Forecasting the dynamics of human interaction. In ECCV, 2014.
- [20] Allan Jabri, Andrew Owens, and Alexei A Efros. Space-time correspondence as a contrastive random walk. In NeurIPS, 2020.
- [21] Dinesh Jayaraman, Frederik Ebert, Alexei A Efros, and Sergey Levine. Time-agnostic prediction: Predicting predictable video frames. In ICLR, 2019.
- [22] Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to ego-motion. In ICCV, 2015.
- [23] Gunhee Kim, Leonid Sigal, and Eric P Xing. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In CVPR, 2014.
- [24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
- [25] Kris M Kitani, Brian D Ziebart, James Andrew Bagnell, and Martial Hebert. Activity forecasting. In ECCV, 2012.
- [26] Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In CVPR, 2014.
- [27] Nilesh Kulkarni, Abhinav Gupta, and Shubham Tulsiani. Canonical surface mapping via geometric cycle consistency. In ICCV, 2019.
- [28] Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. Videoflow: A conditional flow-based model for stochastic video generation. arXiv:1903.01434, 2019.
- [29] Tian Lan, Tsung-Chuan Chen, and Silvio Savarese. A hierarchical representation for future action prediction. In ECCV, 2014.
- [30] Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv:1804.01523, 2018.
- [31] Haejun Lee, Drew A Hudson, Kangwook Lee, and Christopher D Manning. SLM: Learning a discourse language representation with sentence unshuffling. arXiv:2010.16249, 2020.
- [32] Lajanugen Logeswaran, Honglak Lee, and Dragomir Radev. Sentence ordering and coherence modeling using recurrent neural networks. arXiv:1611.02654, 2016.
- [33] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv:1605.08104, 2016.
- [34] Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nick Johnston, Andrew Rabinovich, and Kevin Murphy. What’s cookin’? interpreting cooking videos using text, speech and vision. arXiv:1503.01558, 2015.
- [35] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv:1511.05440, 2015.
- [36] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In CVPR, 2020.
- [37] Antoine Miech, Ivan Laptev, Josef Sivic, Heng Wang, Lorenzo Torresani, and Du Tran. Leveraging the present to anticipate the future in videos. In CVPR Workshops, 2019.
- [38] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.
- [39] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, 2016.
- [40] Nemanja Petrovic, Aleksandar Ivanovic, and Nebojsa Jojic. Recursive estimation of generative models of video. In CVPR, 2006.
- [41] Lyndsey C Pickup, Zheng Pan, Donglai Wei, YiChang Shih, Changshui Zhang, Andrew Zisserman, Bernhard Scholkopf, and William T Freeman. Seeing the arrow of time. In CVPR, 2014.
- [42] Silvia L Pintea, Jan C van Gemert, and Arnold WM Smeulders. Déja vu. In ECCV, 2014.
- [43] MarcAurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv:1412.6604, 2014.
- [44] Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. A dataset for movie description. In CVPR, 2015.
- [45] Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Chris Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. Movie description. IJCV, 2017.
- [46] Mohammad Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Basura Fernando, Lars Petersson, and Lars Andersson. Encouraging LSTMs to anticipate actions very early. In ICCV, 2017.
- [47] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
- [48] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
- [49] Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Learning video representations using contrastive bidirectional transformer. arXiv:1906.05743, 2019.
- [50] Chen Sun, Per Karlsson, Jiajun Wu, Joshua B Tenenbaum, and Kevin Murphy. Stochastic prediction of multi-agent interactions from partial observations. arXiv:1902.09641, 2019.
- [51] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In ICCV, 2019.
- [52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
- [53] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. arXiv:1706.08033, 2017.
- [54] Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee. Learning to generate long-term future via hierarchical prediction. arXiv:1704.05831, 2017.
- [55] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In CVPR, 2016.
- [56] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NeurIPS, 2016.
- [57] Carl Vondrick and Antonio Torralba. Generating the future with adversarial transformers. In CVPR, 2017.
- [58] Jacob Walker, Abhinav Gupta, and Martial Hebert. Patch to the future: Unsupervised visual prediction. In CVPR, 2014.
- [59] Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow prediction from a static image. In ICCV, 2015.
- [60] Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In CVPR, 2019.
- [61] Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. Learning and using the arrow of time. In CVPR, 2018.
- [62] Yu Wu, Linchao Zhu, Xiaohan Wang, Yi Yang, and Fei Wu. Learning to anticipate egocentric actions by imagination. IEEE Transactions on Image Processing, 30:1143–1152, 2020.
- [63] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
- [64] Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NeurIPS, 2016.
- [65] Shoou-I Yu, Lu Jiang, and Alexander Hauptmann. Instructional videos for unsupervised harvesting and learning of action examples. In ACM MM, 2014.
- [66] Jenny Yuen and Antonio Torralba. A data-driven approach for event prediction. In ECCV, 2010.
- [67] Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018.
- [68] Tinghui Zhou, Philipp Krähenbühl, Mathieu Aubry, Qixing Huang, and Alexei A. Efros. Learning dense correspondence via 3D-guided cycle consistency. In CVPR, 2016.
- [69] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
- [70] Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross-task weakly supervised learning from instructional videos. In CVPR, 2019.