

Improving Sequential Latent Variable Models
with Autoregressive Flows

Joseph Marino
California Institute of Technology
[email protected]

Lei Chen
Simon Fraser University
[email protected]

Jiawei He
Simon Fraser University
[email protected]

Stephan Mandt
University of California, Irvine
[email protected]
Abstract

We propose an approach for improving sequence modeling based on autoregressive normalizing flows. Each autoregressive transform, acting across time, serves as a moving frame of reference, removing temporal correlations and simplifying the modeling of higher-level dynamics. This technique provides a simple, general-purpose method for improving sequence modeling, with connections to existing and classical techniques. We demonstrate the proposed approach both with standalone flow-based models and as a component within sequential latent variable models. Results are presented on three benchmark video datasets and three other time series datasets, where autoregressive flow-based dynamics improve log-likelihood performance over baseline models. Finally, we illustrate the decorrelation and improved generalization properties of using flow-based dynamics.

1 Introduction

Data often contain sequential structure, providing a rich signal for learning models of the world. Such models are useful for representing sequences [47, 27] and planning actions [28, 10]. Recent advances in deep learning have facilitated learning sequential probabilistic models directly from high-dimensional data [26], like audio and video. A variety of techniques have emerged for learning deep sequential models, including memory units [32] and stochastic latent variables [11, 6]. These techniques have enabled sequential models to capture increasingly complex dynamics. In this paper, we explore the complementary direction, asking: can we simplify the dynamics of the data to meet the capacity of the model? To do so, we aim to learn a frame of reference to assist in modeling the data.

Frames of reference are an important consideration in sequence modeling, as they can simplify dynamics by removing redundancy. For instance, in a physical system, the frame of reference that moves with the system’s center of mass removes the redundancy in displacement. Frames of reference are also more widely applicable to arbitrary sequences. Indeed, video compression schemes use predictions as a frame of reference to remove temporal redundancy [52, 2, 76]. By learning and applying a similar type of temporal normalization for sequence modeling, the model can focus on aspects that are not predicted by the low-level frame of reference, thereby simplifying dynamics modeling.

We formalize this notion of temporal normalization through the framework of autoregressive normalizing flows [44, 54]. In the context of sequences, these flows form predictions across time, attempting to remove temporal dependencies [66]. Thus, autoregressive flows can act as a pre-processing technique to simplify dynamics. We preview this approach in Figure 1, where an autoregressive flow modeling the data (top) creates a transformed space for modeling dynamics (bottom). The transformed space is largely invariant to absolute pixel value, focusing instead on capturing deviations and motion.

We empirically demonstrate this modeling technique, both with standalone autoregressive normalizing flows, as well as within sequential latent variable models. While normalizing flows have been applied in sequential contexts previously, our main contributions are 1) showing how these models can act as a general pre-processing technique to improve dynamics modeling, and 2) empirically demonstrating log-likelihood and generalization improvements on three benchmark video datasets and on time series data from the UCI Machine Learning Repository. This technique also connects to previous work in dynamics modeling, probabilistic models, and sequence compression, enabling directions for further investigation.

Figure 1: Sequence Modeling with Autoregressive Flows. Top: Pixel values (solid) for a particular pixel location in a video sequence. An autoregressive flow models the pixel sequence using an affine shift (dashed) and scale (shaded), acting as a frame of reference. Middle: Frames of the data sequence (top) and the resulting “noise” (bottom) from applying the shift and scale. The redundant, static background has been largely removed. Bottom: The noise values (solid) are modeled using a base distribution (dashed and shaded) provided by a higher-level model. By removing temporal redundancy from the data sequence, the autoregressive flow simplifies dynamics modeling.

2 Background

2.1 Autoregressive Models

Consider modeling discrete sequences of observations, $\mathbf{x}_{1:T}\sim p_{\textrm{data}}(\mathbf{x}_{1:T})$, using a probabilistic model, $p_{\theta}(\mathbf{x}_{1:T})$, with parameters $\theta$. Autoregressive models [21, 7] use the chain rule of probability to express the joint distribution over time steps as the product of $T$ conditional distributions. These models are often formulated in forward temporal order:

$$p_{\theta}(\mathbf{x}_{1:T})=\prod_{t=1}^{T}p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t}). \tag{1}$$

Each conditional distribution, $p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t})$, models the dependence between time steps. For continuous variables, it is often assumed that each distribution takes a simple form, such as a diagonal Gaussian: $p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t})=\mathcal{N}(\mathbf{x}_{t};\bm{\mu}_{\theta}(\mathbf{x}_{<t}),\operatorname{diag}(\bm{\sigma}^{2}_{\theta}(\mathbf{x}_{<t})))$, where $\bm{\mu}_{\theta}(\cdot)$ and $\bm{\sigma}_{\theta}(\cdot)$ are functions denoting the mean and standard deviation. These functions may take past observations as input through a recurrent network or a convolutional window [69]. When applied to spatial data [70], autoregressive models excel at capturing local dependencies. However, due to their restrictive forms, such models often struggle to capture more complex structure.
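For concreteness, a minimal sketch of evaluating Eq. 1 with diagonal-Gaussian conditionals is given below. This is illustrative only, not the paper's implementation: `cond_net` is an assumed, hypothetical network mapping a zero-padded window of past observations to a per-dimension mean and log-scale.

```python
import torch
import torch.distributions as D

def gaussian_ar_log_likelihood(x, cond_net, context=3):
    """Evaluate Eq. 1 for a diagonal-Gaussian autoregressive model.

    x        : (T, D) sequence of observations.
    cond_net : assumed callable mapping a (context, D) window of past
               observations to (mu, log_sigma), each of shape (D,).
    """
    T, D_obs = x.shape
    log_prob = x.new_zeros(())
    for t in range(T):
        # Window of past observations, zero-padded at the start of the sequence.
        past = x[max(0, t - context):t]
        if past.shape[0] < context:
            past = torch.cat([x.new_zeros(context - past.shape[0], D_obs), past], dim=0)
        mu, log_sigma = cond_net(past)
        # Accumulate log p(x_t | x_{<t}) under the diagonal Gaussian conditional.
        log_prob = log_prob + D.Normal(mu, log_sigma.exp()).log_prob(x[t]).sum()
    return log_prob
```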

2.2 Autoregressive (Sequential) Latent Variable Models

Autoregressive models can be improved by incorporating latent variables [50], often represented as a corresponding sequence, $\mathbf{z}_{1:T}$. The joint distribution, $p_{\theta}(\mathbf{x}_{1:T},\mathbf{z}_{1:T})$, has the form:

$$p_{\theta}(\mathbf{x}_{1:T},\mathbf{z}_{1:T})=\prod_{t=1}^{T}p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t},\mathbf{z}_{\leq t})\,p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t}). \tag{2}$$

Unlike the Gaussian form, evaluating $p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t})$ now requires integrating over the latent variables,

$$p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t})=\int p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t},\mathbf{z}_{\leq t})\,p_{\theta}(\mathbf{z}_{\leq t}|\mathbf{x}_{<t})\,d\mathbf{z}_{\leq t}, \tag{3}$$

yielding a more flexible distribution. However, performing this integration in practice is typically intractable, requiring approximate inference techniques, like variational inference [38], or invertible models [45]. Recent works have parameterized these models with deep neural networks, e.g. [11, 24, 20, 39], using amortized variational inference [42, 60]. Typically, the conditional likelihood, $p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t},\mathbf{z}_{\leq t})$, and the prior, $p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t})$, are Gaussian densities, with temporal conditioning handled through recurrent networks. Such models have demonstrated success in audio [11, 20] and video modeling [75, 25, 15, 30, 47]. However, as noted by [45], such models can be difficult to train with standard log-likelihood objectives, often struggling to capture dynamics.

Figure 2: Affine Autoregressive Transform. Computational diagram for an affine autoregressive transform [54]. Each $\mathbf{y}_{t}$ is an affine transform of $\mathbf{x}_{t}$, with the affine parameters potentially non-linear functions of $\mathbf{x}_{<t}$. The inverse transform, shown here, is capable of converting a correlated input, $\mathbf{x}_{1:T}$, into an uncorrelated output, $\mathbf{y}_{1:T}$.

2.3 Autoregressive Flows

Our approach is based on affine autoregressive normalizing flows [44, 54]. Here, we continue with the perspective of temporal sequences; however, these flows were initially developed and demonstrated in static settings. [44] noted that sampling from an autoregressive Gaussian model is an invertible transform, resulting in a normalizing flow [63, 16, 17, 59]. Flow-based models transform simple base probability distributions into more complex ones while maintaining exact likelihood evaluation. To see their connection to autoregressive models, we can express sampling a Gaussian random variable using the reparameterization trick [42, 60]:

$$\mathbf{x}_{t}=\bm{\mu}_{\theta}(\mathbf{x}_{<t})+\bm{\sigma}_{\theta}(\mathbf{x}_{<t})\odot\mathbf{y}_{t}, \tag{4}$$

where $\mathbf{y}_{t}\sim\mathcal{N}(\mathbf{y}_{t};\mathbf{0},\mathbf{I})$ is an auxiliary random variable and $\odot$ denotes element-wise multiplication. Thus, $\mathbf{x}_{t}$ is an invertible transform of $\mathbf{y}_{t}$, with the inverse given as

$$\mathbf{y}_{t}=\frac{\mathbf{x}_{t}-\bm{\mu}_{\theta}(\mathbf{x}_{<t})}{\bm{\sigma}_{\theta}(\mathbf{x}_{<t})}, \tag{5}$$

where division is element-wise. The inverse transform in Eq. 5, shown in Figure 2, normalizes (hence, normalizing flow) $\mathbf{x}_{1:T}$, removing statistical dependencies. Given the functional mapping between $\mathbf{y}_{t}$ and $\mathbf{x}_{t}$ in Eq. 4, the change of variables formula converts between probabilities in each space:

$$\log p_{\theta}(\mathbf{x}_{1:T})=\log p_{\theta}(\mathbf{y}_{1:T})-\log\left|\det\left(\frac{\partial\mathbf{x}_{1:T}}{\partial\mathbf{y}_{1:T}}\right)\right|. \tag{6}$$

By the construction of Eqs. 4 and 5, the Jacobian in Eq. 6 is triangular, enabling efficient evaluation as the product of diagonal terms:

$$\log\left|\det\left(\frac{\partial\mathbf{x}_{1:T}}{\partial\mathbf{y}_{1:T}}\right)\right|=\sum_{t=1}^{T}\sum_{i}\log\sigma_{\theta,i}(\mathbf{x}_{<t}), \tag{7}$$

where $i$ denotes the observation dimension, e.g. pixel. For a Gaussian autoregressive model, the base distribution is $p_{\theta}(\mathbf{y}_{1:T})=\mathcal{N}(\mathbf{y}_{1:T};\mathbf{0},\mathbf{I})$. We can improve upon this simple set-up by chaining transforms together, i.e. parameterizing $p_{\theta}(\mathbf{y}_{1:T})$ as a flow, resulting in hierarchical models.
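To make Eqs. 5–7 concrete, the following minimal sketch (not the released code) normalizes a sequence with a single affine autoregressive transform and applies the change of variables of Eq. 6 under a standard-normal base distribution; `shift_scale_net` is an assumed network returning the shift and log-scale from a window of past observations.

```python
import torch
import torch.distributions as D

def affine_flow_log_likelihood(x, shift_scale_net, context=3):
    """Normalize x (Eq. 5), accumulate the log-det term (Eq. 7), and apply Eq. 6.

    x               : (T, D) sequence.
    shift_scale_net : assumed callable mapping a (context, D) window of past
                      observations to (mu, log_sigma), each of shape (D,).
    """
    T, D_obs = x.shape
    y = torch.zeros_like(x)
    log_det = x.new_zeros(())
    for t in range(T):
        past = x[max(0, t - context):t]
        if past.shape[0] < context:
            past = torch.cat([x.new_zeros(context - past.shape[0], D_obs), past], dim=0)
        mu, log_sigma = shift_scale_net(past)
        y[t] = (x[t] - mu) / log_sigma.exp()     # inverse transform (Eq. 5)
        log_det = log_det + log_sigma.sum()      # contribution to log|det dx/dy| (Eq. 7)
    base_log_prob = D.Normal(0.0, 1.0).log_prob(y).sum()
    return base_log_prob - log_det               # change of variables (Eq. 6)
```

Replacing the standard-normal base distribution with a learned sequence model over $\mathbf{y}_{1:T}$ recovers the setup developed in Section 3.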

2.4 Related Work

Autoregressive flows were initially considered in the contexts of variational inference [44] and generative modeling [54]. These approaches are generalizations of previous approaches with affine transforms [16, 17]. While autoregressive flows are well-suited for sequential data, these approaches, as well as many recent approaches [34, 51, 43], were initially applied to static data, such as images.

Recent works have started applying flow-based models to sequential data. [71] and [55] distill autoregressive speech models into flow-based models. [57] and [40] instead train these models directly. [45] use a flow to model individual video frames, with an autoregressive prior modeling dynamics across time steps. [61, 62] use autoregressive flows for modeling vehicle motion, and [31] use flows for motion synthesis with motion-capture data. [77] model discrete observations (e.g., text) by using flows to model dynamics of continuous latent variables. Like these recent works, we apply flow-based models to sequences. However, we demonstrate that autoregressive flows can serve as a general-purpose technique for improving dynamics models. To the best of our knowledge, our work is the first to use flows to pre-process sequences to improve sequential latent variable models.

We utilize affine flows (Eq. 4), a family that includes methods like NICE [16], RealNVP [17], IAF [44], MAF [54], and Glow [43]. However, there has been recent work in non-affine flows [34, 37, 18], which offer further flexibility. We chose to investigate affine flows because they are commonly employed and relatively simple; however, non-affine flows could result in additional improvements.

Autoregressive dynamics models are also prominent in other related areas. Within the statistics and econometrics literature, autoregressive integrated moving average (ARIMA) is a standard technique [8, 29], calculating differences with an autoregressive prediction to remove non-stationary components of a temporal signal. Such methods simplify downstream modeling, e.g., by removing seasonal effects. Low-level autoregressive models are also found in audio [3] and video compression codecs [73, 2, 76], using predictive coding [52] to remove temporal redundancy, thereby improving downstream compression rates. Intuitively, if sequential inputs are highly predictable, it is far more efficient to compress the prediction error rather than each input (e.g., video frame) separately. Finally, we note that autoregressive models are a generic dynamics modeling approach and can, in principle, be parameterized by other techniques, such as LSTMs [32], or combined with other models, such as hidden Markov models (HMMs) [50].

3 Method

We now describe our approach for improving sequence modeling. First, we motivate using autoregressive flows to reduce temporal dependencies, thereby simplifying dynamics. We then show how this simple technique can be incorporated within sequential latent variable models.

Figure 3: Redundancy Reduction. (a) Conditional densities, $p(x_{2}|x_{1})$. (b) The marginal, $p(x_{2})$, differs from the conditional densities; thus, $\mathcal{I}(x_{1};x_{2})>0$. (c) In the normalized space of $y$, the corresponding conditional densities, $p(y_{2}|y_{1})$, are identical. (d) The marginal, $p(y_{2})$, is identical to the conditionals, so $\mathcal{I}(y_{1};y_{2})=0$. Thus, in this case, a conditional affine transform removed the dependencies.

3.1 Motivation: Temporal Redundancy Reduction

Normalizing flows, while often utilized for density estimation, originated from data pre-processing techniques [22, 35, 9], which remove dependencies between dimensions, i.e., redundancy reduction [4]. Removing dependencies simplifies the resulting probability distribution by restricting variation to individual dimensions, generally simplifying downstream tasks [46]. Normalizing flows improve upon these procedures using flexible, non-linear functions [14, 16]. While flows have been used for spatial decorrelation [1, 74] and with other models [33], this capability remains under-explored.

Our main contribution is showing how to utilize autoregressive flows for temporal pre-processing to improve dynamics modeling. Data sequences contain dependencies in time, for example, in the redundancy of video pixels (Figure 1), which are often highly predictable. These dependencies define the dynamics of the data, with the degree of dependence quantified by the multi-information,

$$\mathcal{I}(\mathbf{x}_{1:T})=\sum_{t}\mathcal{H}(\mathbf{x}_{t})-\mathcal{H}(\mathbf{x}_{1:T}), \tag{8}$$

where $\mathcal{H}$ denotes entropy. Normalizing flows are capable of reducing redundancy, arriving at a new sequence, $\mathbf{y}_{1:T}$, with $\mathcal{I}(\mathbf{y}_{1:T})\leq\mathcal{I}(\mathbf{x}_{1:T})$, thereby reducing temporal dependencies. Thus, rather than fit the data distribution directly, we can first simplify the dynamics by pre-processing sequences with a normalizing flow and then fit the resulting sequence. Through training, the flow will attempt to remove redundancies to meet the modeling capacity of the higher-level dynamics model, $p_{\theta}(\mathbf{y}_{1:T})$.

Example

To visualize this procedure for an affine autoregressive flow, consider a one-dimensional input over two time steps, $x_{1}$ and $x_{2}$. For each value of $x_{1}$, there is a conditional density, $p(x_{2}|x_{1})$. Assume that these densities take one of two forms, which are identical but shifted and scaled, shown in Figure 3. Transforming these densities through their conditional means, $\mu_{2}=\mathbb{E}\left[x_{2}|x_{1}\right]$, and standard deviations, $\sigma_{2}=\mathbb{E}\left[(x_{2}-\mu_{2})^{2}|x_{1}\right]^{1/2}$, creates a normalized space, $y_{2}=(x_{2}-\mu_{2})/\sigma_{2}$, where the conditional densities are identical. In this space, the multi-information is

$$\mathcal{I}(y_{1};y_{2})=\mathbb{E}_{p(y_{1},y_{2})}\left[\log p(y_{2}|y_{1})-\log p(y_{2})\right]=0,$$

whereas $\mathcal{I}(x_{1};x_{2})>0$. Indeed, if $p(x_{t}|x_{<t})$ is linear-Gaussian, inverting an affine autoregressive flow exactly corresponds to Cholesky whitening [56, 44], removing all linear dependencies.

In the example above, $\mu_{2}$ and $\sigma_{2}$ act as a frame of reference for estimating $x_{2}$. More generally, in the special case where $\bm{\mu}_{\theta}(\mathbf{x}_{<t})=\mathbf{x}_{t-1}$ and $\bm{\sigma}_{\theta}(\mathbf{x}_{<t})=\mathbf{1}$, we recover $\mathbf{y}_{t}=\mathbf{x}_{t}-\mathbf{x}_{t-1}=\Delta\mathbf{x}_{t}$. Modeling finite differences (or generalized coordinates [23]) is a well-established technique (see, e.g., [10, 45]), which is generalized by affine autoregressive flows.
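The following NumPy sketch illustrates the two-step example numerically; the conditional mean $0.9\,x_{1}$ and standard deviation $0.5$ are assumed values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100_000)

# Assumed linear-Gaussian conditional: x2 | x1 ~ N(0.9 * x1, 0.5^2).
mu2, sigma2 = 0.9 * x1, 0.5
x2 = mu2 + sigma2 * rng.normal(size=x1.shape)

# Inverse affine autoregressive transform (Eq. 5) using the conditional moments.
y2 = (x2 - mu2) / sigma2

print("corr(x1, x2):", np.corrcoef(x1, x2)[0, 1])  # ~0.87: strong linear dependence
print("corr(x1, y2):", np.corrcoef(x1, y2)[0, 1])  # ~0.00: dependence removed
```

Setting the shift to $x_{1}$ and the scale to $1$ instead recovers the finite-difference baseline, $\Delta x$, discussed above.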

Figure 4: Model Diagrams. (a) An autoregressive flow pre-processes a data sequence, $\mathbf{x}_{1:T}$, to produce a new sequence, $\mathbf{y}_{1:T}$, with reduced temporal dependencies. This simplifies dynamics modeling for a higher-level sequential latent variable model, $p_{\theta}(\mathbf{y}_{1:T},\mathbf{z}_{1:T})$. Empty diamond nodes represent deterministic dependencies, not recurrent states. (b) Diagram of the autoregressive flow architecture. Blank white rectangles represent convolutional layers (see Appendix). The three stacks of convolutional layers within the blue region are shared. cat denotes channel-wise concatenation.

3.2 Modeling Dynamics with Autoregressive Flows

We now discuss utilizing autoregressive flows to improve sequence modeling, highlighting use cases for modeling dynamics in the data and latent spaces.

3.2.1 Data Dynamics

The form of an affine autoregressive flow across sequences is given in Eqs. 4 and 5, again, equivalent to a Gaussian autoregressive model. We can stack hierarchical chains of flows to improve the model capacity. Denoting the shift and scale functions at the $m^{\textrm{th}}$ transform as $\bm{\mu}_{\theta}^{m}(\cdot)$ and $\bm{\sigma}_{\theta}^{m}(\cdot)$ respectively, we then calculate $\mathbf{y}^{m}$ using the inverse transform:

$$\mathbf{y}^{m}_{t}=\frac{\mathbf{y}^{m-1}_{t}-\bm{\mu}^{m}_{\theta}(\mathbf{y}^{m-1}_{<t})}{\bm{\sigma}^{m}_{\theta}(\mathbf{y}^{m-1}_{<t})}. \tag{9}$$

After the final ($M^{\textrm{th}}$) transform, we can choose the form of the base distribution, $p_{\theta}(\mathbf{y}^{M}_{1:T})$, e.g. Gaussian. While we could attempt to model $\mathbf{x}_{1:T}$ completely using stacked autoregressive flows, these models are limited to affine element-wise transforms that maintain the data dimensionality. Due to this limited capacity, purely flow-based models often require many transforms to be effective [43].
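A sketch of the stacked inverse transform in Eq. 9 is shown below; each conditioner is an assumed callable with the same interface as in the earlier single-flow sketch, and the accumulated log-scales give the total log-determinant.

```python
import torch

def stacked_inverse(x, conditioners, context=3):
    """Apply M inverse affine transforms (Eq. 9); return y^M and log|det dx/dy^M|.

    conditioners : list of assumed callables; each maps a (context, D) window of
                   the previous level's outputs to (mu, log_sigma) of shape (D,).
    """
    y, log_det = x, x.new_zeros(())
    for cond in conditioners:                       # m = 1, ..., M
        out = torch.zeros_like(y)
        for t in range(y.shape[0]):
            past = y[max(0, t - context):t]
            if past.shape[0] < context:
                past = torch.cat([y.new_zeros(context - past.shape[0], y.shape[1]), past], dim=0)
            mu, log_sigma = cond(past)
            out[t] = (y[t] - mu) / log_sigma.exp()  # Eq. 9 at level m
            log_det = log_det + log_sigma.sum()
        y = out
    return y, log_det
```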

Instead, we can model the base distribution using an expressive sequential latent variable model (SLVM), or, equivalently, we can augment the conditional likelihood of a SLVM using autoregressive flows (Fig. 4(a)). Following the motivation from Section 3.1, the flow can remove temporal dependencies, simplifying the modeling task for the SLVM. With a single flow, the joint probability is

$$p_{\theta}(\mathbf{x}_{1:T},\mathbf{z}_{1:T})=p_{\theta}(\mathbf{y}_{1:T},\mathbf{z}_{1:T})\left|\det\left(\frac{\partial\mathbf{x}_{1:T}}{\partial\mathbf{y}_{1:T}}\right)\right|^{-1}, \tag{10}$$

where the SLVM distribution is given by

$$p_{\theta}(\mathbf{y}_{1:T},\mathbf{z}_{1:T})=\prod_{t=1}^{T}p_{\theta}(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{z}_{\leq t})\,p_{\theta}(\mathbf{z}_{t}|\mathbf{y}_{<t},\mathbf{z}_{<t}). \tag{11}$$

If the SLVM is itself a flow-based model, we can use maximum log-likelihood training. If not, we can resort to variational inference [11, 20, 49]. We derive and discuss this procedure in the Appendix.

3.2.2 Latent Dynamics

We can also consider simplifying latent dynamics modeling using autoregressive flows. This is relevant in hierarchical SLVMs, such as VideoFlow [45], where each latent variable is modeled as a function of past and higher-level latent variables. Using $\mathbf{z}_{t}^{(\ell)}$ to denote the latent variable at the $\ell^{\textrm{th}}$ level at time $t$, we can parameterize the prior as

$$p_{\theta}(\mathbf{z}^{(\ell)}_{t}|\mathbf{z}^{(\ell)}_{<t},\mathbf{z}^{(>\ell)}_{t})=p_{\theta}(\mathbf{u}^{(\ell)}_{t}|\mathbf{u}^{(\ell)}_{<t},\mathbf{z}^{(>\ell)}_{t})\left|\det\left(\frac{\partial\mathbf{z}^{(\ell)}_{t}}{\partial\mathbf{u}^{(\ell)}_{t}}\right)\right|^{-1}, \tag{12}$$

converting $\mathbf{z}^{(\ell)}_{t}$ into $\mathbf{u}^{(\ell)}_{t}$ using the inverse transform $\mathbf{u}^{(\ell)}_{t}=(\mathbf{z}^{(\ell)}_{t}-\bm{\alpha}_{\theta}(\mathbf{z}^{(\ell)}_{<t}))/\bm{\beta}_{\theta}(\mathbf{z}^{(\ell)}_{<t})$. As noted previously, VideoFlow uses a special case of this procedure, setting $\bm{\alpha}_{\theta}(\mathbf{z}^{(\ell)}_{<t})=\mathbf{z}^{(\ell)}_{t-1}$ and $\bm{\beta}_{\theta}(\mathbf{z}^{(\ell)}_{<t})=\mathbf{1}$. Generalizing this procedure further simplifies dynamics throughout the model.
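A per-time-step sketch of Eq. 12 is given below, under assumed interfaces: `base_prior` returns a distribution over $\mathbf{u}_{t}$, and `alpha_beta_net` returns the shift and log-scale from past latents at the same level.

```python
import torch

def latent_flow_prior_log_prob(z_t, z_past, base_prior, alpha_beta_net):
    """Evaluate the latent prior via an affine autoregressive flow (Eq. 12).

    z_t            : (D,) latent at time t for one level of the hierarchy.
    z_past         : (context, D) previous latents at the same level.
    base_prior     : assumed callable returning a torch Distribution over u_t
                     (conditioned on z_past and, in a hierarchical model, higher levels).
    alpha_beta_net : assumed callable mapping z_past to (alpha, log_beta), each (D,).
    """
    alpha, log_beta = alpha_beta_net(z_past)
    u_t = (z_t - alpha) / log_beta.exp()                 # inverse transform
    # Change of variables: log p(z_t | .) = log p(u_t | .) - log|det dz_t / du_t|
    return base_prior(z_past).log_prob(u_t).sum() - log_beta.sum()
```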

4 Evaluation

We demonstrate and evaluate the proposed technique on three benchmark video datasets: Moving MNIST [67], KTH Actions [65], and BAIR Robot Pushing [19]. We also perform experiments on several non-video sequence datasets from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). Specifically, we look at an activity recognition dataset (activity_rec) [53], an indoor localization dataset (smartphone_sensor) [5], and a facial expression recognition dataset (facial_exp) [13]. Experimental setups are described in Section 4.1, followed by a set of analyses in Section 4.2. Further details and results can be found in the Appendix.

Figure 5: Decreased Temporal Correlation. (a) Affine autoregressive flows result in sequences, $\mathbf{y}_{1:T}$, with decreased temporal correlation, $\textrm{corr}_{\mathbf{y}}$, as compared with that of the original data, $\textrm{corr}_{\mathbf{x}}$. The presence of a more powerful base distribution (SLVM) reduces the need for decorrelation. Additional flow transforms further decrease correlation (note: $|\textrm{corr}_{\mathbf{y}}|<0.01$ for 2-AF). (b) For SLVM + 1-AF, $\textrm{corr}_{\mathbf{y}}$ decreases during training on KTH Actions.

4.1 Experimental Setup

We empirically evaluate the improvements to downstream dynamics modeling from temporal pre-processing via autoregressive flows. For data space modeling, we compare four model classes: 1) standalone affine autoregressive flows with one transform (1-AF), 2) standalone flows with two transforms (2-AF), 3) a sequential latent variable model (SLVM), and 4) SLVM with flow-based pre-processing (SLVM + 1-AF). As we are not proposing a specific architecture, but rather a general modeling technique, the SLVM architecture is representative of recurrent convolutional video models with a single latent level [15, 27, 28]. Flows are implemented with convolutional networks, taking in a fixed window of previous frames (Fig. 4(b)). These models allow us to evaluate the benefits of temporal pre-processing (SLVM vs. SLVM + 1-AF) and the benefits of more expressive higher-level dynamics models (2-AF vs. SLVM + 1-AF).

To evaluate latent dynamics modeling with flows, we use the tensor2tensor library [72] to compare 1) VideoFlow (we use a smaller version of the original model architecture, with half of the flow depth, due to GPU memory constraints) and 2) the same model with affine autoregressive flow latent dynamics (VideoFlow + AF). VideoFlow is significantly larger ($3\times$ more parameters) than the one-level SLVM, allowing us to evaluate whether autoregressive flows are beneficial in this high-capacity regime.

To enable a fairer comparison in our experiments, models with autoregressive flow dynamics have comparable or fewer parameters than baseline counterparts. We note that autoregressive dynamics adds only a constant computational cost per time-step, and this computation can be parallelized for training and evaluation. Full architecture, training, and analysis details can be found in the Appendix. Finally, as noted by [45], many previous works do not train SLVMs with proper log-likelihood objectives. Our SLVM results are consistent with previously reported log-likelihood values [49] for the Stochastic Video Generation model [15] trained with a log-likelihood bound objective.

Figure 6: Flow Visualization for SLVM + 1-AF on Moving MNIST (left) and KTH Actions (right).
Figure 7: Improved Generated Samples. Random samples generated from (a) VideoFlow and (b) VideoFlow + AF, each conditioned on the first 3 frames. Using AF produces more coherent samples. The robot arm blurs for VideoFlow in samples 1 and 4 (red boxes), but does not blur for VideoFlow + AF.

4.2 Analyses

Visualization

In Figure 1, we visualize the pre-processing procedure for SLVM + 1-AF on BAIR Robot Pushing. The plots show the RGB values for a pixel before (top) and after (bottom) the transform. The noise sequence is nearly zero throughout, despite large changes in the pixel value. We also see that the noise sequence (center, lower) is invariant to the static background, capturing the moving robotic arm. At some time steps (e.g. the fourth frame), the autoregressive flow incorrectly predicts the next frame; however, the higher-level SLVM compensates for this prediction error.

We also visualize each component of the flow. Figure 4(b) illustrates this for SLVM + 1-AF on an input from BAIR Robot Pushing. We see that $\bm{\mu}_{\theta}$ captures the static background, while $\bm{\sigma}_{\theta}$ highlights regions of uncertainty. In Figure 6 and the Appendix, we present visualizations on full sequences, where we see that different models remove varying degrees of temporal structure.

Temporal Redundancy Reduction

To quantify temporal redundancy reduction, we evaluate the empirical correlation (linear dependence) between frames, denoted as corr, for the data and noise variables. We evaluate $\textrm{corr}_{\mathbf{x}}$ and $\textrm{corr}_{\mathbf{y}}$ for 1-AF, 2-AF, and SLVM + 1-AF. The results are shown in Figure 5(a). In Figure 5(b), we plot $\textrm{corr}_{\mathbf{y}}$ for SLVM + 1-AF during training on KTH Actions. Flows decrease temporal correlation, with additional transforms yielding further decorrelation. Base distributions without temporal structure (1-AF) yield comparatively more decorrelation. Temporal redundancy is progressively removed throughout training. Note that 2-AF almost completely removes temporal correlations ($|\textrm{corr}_{\mathbf{y}}|<0.01$). However, this metric only quantifies linear dependencies; more complex non-linear dependencies may require the use of higher-level dynamics models, as shown through quantitative comparisons.
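As a sketch of this kind of linear-dependence metric (not the exact evaluation code), the function below computes the average per-dimension Pearson correlation between consecutive frames; applying it to $\mathbf{x}_{1:T}$ and $\mathbf{y}_{1:T}$ yields $\textrm{corr}_{\mathbf{x}}$ and $\textrm{corr}_{\mathbf{y}}$, respectively.

```python
import numpy as np

def temporal_corr(seq):
    """Average per-dimension Pearson correlation between consecutive frames.

    seq : (T, D) array, e.g. a video with frames flattened to D pixels.
    """
    a, b = seq[:-1], seq[1:]                       # pairs (x_t, x_{t+1})
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0)) + 1e-8
    return ((a * b).sum(axis=0) / denom).mean()    # mean correlation over dimensions
```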

Performance Comparison

Table 1 reports average test negative log-likelihood results on video datasets. Standalone flow-based models perform surprisingly well. Increasing flow depth from 1-AF to 2-AF generally results in improvement. SLVM + 1-AF outperforms the baseline SLVM despite having fewer parameters. As another baseline, we also consider modeling frame differences, $\Delta\mathbf{x}\equiv\mathbf{x}_{t}-\mathbf{x}_{t-1}$, with SLVM, which can be seen as a special case of 1-AF with $\bm{\mu}_{\theta}=\mathbf{x}_{t-1}$ and $\bm{\sigma}_{\theta}=\mathbf{1}$. On BAIR and KTH Actions, datasets with significant temporal redundancy (Fig. 5(a)), this technique improves performance over SLVM. However, on Moving MNIST, modeling $\Delta\mathbf{x}$ actually decreases performance, presumably by creating more complex spatial patterns. In all cases, the learned temporal transform, SLVM + 1-AF, outperforms this hard-coded transform, SLVM + $\Delta\mathbf{x}$. Finally, incorporating autoregressive flows into VideoFlow results in a modest but noticeable improvement, demonstrating that removing spatial dependencies, through VideoFlow, and temporal dependencies, through autoregressive flows, are complementary techniques.

Table 1: Quantitative Comparison. Average test negative log-likelihood (lower is better) in nats per dimension for Moving MNIST, BAIR Robot Pushing, and KTH Actions.
                   M-MNIST    BAIR     KTH
1-AF                 2.15     3.05     3.34
2-AF                 2.13     2.90     3.35
SLVM                ≤1.92    ≤3.57    ≤4.63
SLVM + Δx           ≤2.45    ≤3.07    ≤2.49
SLVM + 1-AF         ≤1.86    ≤2.35    ≤2.39
VideoFlow              —      1.53       —
VideoFlow + AF         —      1.50       —
Figure 8: Improved Generalization. The low-level reference frame improves generalization to unseen sequences. Train and test negative log-likelihood bound histograms for (a) SLVM and (b) SLVM + 1-AF on KTH Actions. (c) The generalization gap for SLVM + 1-AF remains small for varying amounts of KTH training data, while it becomes worse in the low-data regime for SLVM.
Results on Non-Video Sequence Datasets

In Table 2, we report negative log-density results on non-video data in nats per time step. Note that log-densities can be positive or negative. Again, we see that 2-AF consistently outperforms 1-AF, and both are typically on par with or better than SLVM. However, SLVM + 1-AF outperforms all other model classes, achieving the lowest (best) values across all datasets. With non-video data, the special case of modeling temporal differences (SLVM + $\Delta\mathbf{x}$) actually performs slightly worse than SLVM on all datasets. This, again, highlights the importance of using a learned pre-processing transform in comparison with hard-coded temporal differences.

Table 2: Non-Video Quantitative Comparison. Average test negative log-likelihood (lower is better) in nats per time step on various non-video datasets.
                 activity_rec   smartphone_sensor   facial_exp
1-AF                  2.71            -7.46            -241
2-AF                  2.06            -8.53            -259
SLVM                ≤2.77           ≤-5.21           ≤-164
SLVM + Δx           ≤5.61           ≤-4.02           ≤-154
SLVM + 1-AF         ≤1.46           ≤-9.82           ≤-306
Improved Samples

The quantitative improvement over VideoFlow is less dramatic, as this is already a high-capacity model. However, qualitatively, we observe that incorporating autoregressive flow dynamics improves sample quality (Figure 7). In these randomly selected samples, the robot arm occasionally becomes blurry for VideoFlow (red boxes) but remains clear for VideoFlow + AF.

Improved Generalization

Our temporal normalization technique also improves generalization to unseen examples, a key benefit of normalization schemes, e.g., batch norm [36]. Intuitively, higher-level dynamics are often preserved, whereas lower-level appearance is not. This is apparent on KTH Actions, which contains a substantial degree of train-test mismatch, due to different identities and activities. NLL histograms on KTH are shown in Figure 8, with greater overlap for SLVM + 1-AF. We also train SLVM and SLVM + 1-AF on subsets of KTH Actions. In Figure 8(c), we see that autoregressive flows enable generalization in the low-data regime, whereas SLVM becomes worse.

5 Conclusion

We have presented a technique for improving sequence modeling using autoregressive flows. Learning a frame of reference, parameterized by autoregressive transforms, reduces temporal redundancy in input sequences, simplifying dynamics. Thus, rather than expanding the model, we can simplify the input to meet the capacity of the model. This approach is distinct from previous works with normalizing flows on sequences, yet contains connections to classical modeling and compression. We hope these connections lead to further insights and applications. Finally, we have analyzed and empirically shown how autoregressive pre-processing in both the data and latent spaces can improve sequence modeling and lead to improved sample quality and generalization.

The underlying assumption behind using autoregressive flows for sequence modeling is that sequences contain smooth or predictable temporal dependencies, with more complex, higher-level dependencies as well. In both video and non-video data, we have seen improvements from combining sequential latent variable models with autoregressive flows, suggesting that such assumptions are generally reasonable. Using affine autoregressive flows restricts our approach to sequences of continuous data, but future work could investigate discrete data, such as natural language. Likewise, we assume regularly sampled sequences (i.e., a constant frequency), however, future work could also investigate irregularly sampled event data.

References

  • Agrawal and Dukkipati [2016] Siddharth Agrawal and Ambedkar Dukkipati. Deep variational inference without pixel-wise reconstruction. arXiv preprint arXiv:1611.05209, 2016.
  • Agustsson et al. [2020] Eirikur Agustsson, David Minnen, Nick Johnston, Johannes Balle, Sung Jin Hwang, and George Toderici. Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8503–8512, 2020.
  • Atal and Schroeder [1979] B Atal and M Schroeder. Predictive coding of speech signals and subjective error criteria. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(3):247–254, 1979.
  • Barlow et al. [1961] Horace B Barlow et al. Possible principles underlying the transformation of sensory messages. Sensory communication, 1:217–234, 1961.
  • Barsocchi et al. [2016] Paolo Barsocchi, Antonino Crivello, Davide La Rosa, and Filippo Palumbo. A multisource and multivariate dataset for indoor localization methods based on wlan and geo-magnetic field fingerprinting. In 2016 International Conference on Indoor Positioning and Indoor Navigation (IPIN), pages 1–8. IEEE, 2016.
  • Bayer and Osendorfer [2014] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. In NeurIPS 2014 Workshop on Advances in Variational Inference, 2014.
  • Bengio and Bengio [2000] Yoshua Bengio and Samy Bengio. Modeling high-dimensional discrete data with multi-layer neural networks. In Advances in Neural Information Processing Systems, pages 400–406, 2000.
  • Box et al. [2015] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
  • Chen and Gopinath [2001] Scott Saobing Chen and Ramesh A Gopinath. Gaussianization. In Advances in Neural Information Processing Systems, pages 423–429, 2001.
  • Chua et al. [2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
  • Chung et al. [2015] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information processing Systems, pages 2980–2988, 2015.
  • Clevert et al. [2015] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
  • de Almeida Freitas et al. [2014] Fernando de Almeida Freitas, Sarajane Marques Peres, Clodoaldo Aparecido de Moraes Lima, and Felipe Venancio Barbosa. Grammatical facial expressions recognition with machine learning. In The Twenty-Seventh International Flairs Conference, 2014.
  • Deco and Brauer [1995] Gustavo Deco and Wilfried Brauer. Higher order statistical decorrelation without information loss. In Advances in Neural Information Processing Systems, pages 247–254, 1995.
  • Denton and Fergus [2018] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In International Conference on Machine Learning, pages 1182–1191, 2018.
  • Dinh et al. [2015] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. In International Conference on Learning Representations, 2015.
  • Dinh et al. [2017] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. In International Conference on Learning Representations, 2017.
  • Durkan et al. [2019] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems, 2019.
  • Ebert et al. [2017] Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. In Conference on Robot Learning, 2017.
  • Fraccaro et al. [2016] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pages 2199–2207, 2016.
  • Frey et al. [1996] Brendan J Frey, Geoffrey E Hinton, and Peter Dayan. Does the wake-sleep algorithm produce good density estimators? In Advances in Neural Information Processing Systems, pages 661–667, 1996.
  • Friedman [1987] Jerome H Friedman. Exploratory projection pursuit. Journal of the American statistical association, 82(397):249–266, 1987.
  • Friston [2008] Karl Friston. Hierarchical models in the brain. PLoS computational biology, 4(11):e1000211, 2008.
  • Gan et al. [2015] Zhe Gan, Chunyuan Li, Ricardo Henao, David E Carlson, and Lawrence Carin. Deep temporal sigmoid belief networks for sequence modeling. In Advances in Neural Information Processing Systems, 2015.
  • Gemici et al. [2017] Mevlana Gemici, Chia-Chun Hung, Adam Santoro, Greg Wayne, Shakir Mohamed, Danilo J Rezende, David Amos, and Timothy Lillicrap. Generative temporal models with memory. arXiv preprint arXiv:1702.04649, 2017.
  • Graves [2013] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
  • Ha and Schmidhuber [2018] David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pages 2450–2462, 2018.
  • Hafner et al. [2019] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565, 2019.
  • Hamilton [2020] James Douglas Hamilton. Time series analysis. Princeton university press, 2020.
  • He et al. [2018] Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal. Probabilistic video generation using holistic attribute control. In Proceedings of the European Conference on Computer Vision (ECCV), pages 452–467, 2018.
  • Henter et al. [2019] Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow. Moglow: Probabilistic and controllable motion synthesis using normalising flows. arXiv preprint arXiv:1905.06598, 2019.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Huang et al. [2017] Chin-Wei Huang, Ahmed Touati, Laurent Dinh, Michal Drozdzal, Mohammad Havaei, Laurent Charlin, and Aaron Courville. Learnable explicit density for continuous latent space and variational inference. arXiv preprint arXiv:1710.02248, 2017.
  • Huang et al. [2018] Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In International Conference on Machine Learning, pages 2083–2092, 2018.
  • Hyvärinen and Oja [2000] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411–430, 2000.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
  • Jaini et al. [2019] Priyank Jaini, Kira A Selby, and Yaoliang Yu. Sum-of-squares polynomial flow. In International Conference on Machine Learning, pages 3009–3018, 2019.
  • Jordan et al. [1998] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. NATO ASI SERIES D BEHAVIOURAL AND SOCIAL SCIENCES, 89:105–162, 1998.
  • Karl et al. [2017] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep variational bayes filters: Unsupervised learning of state space models from raw data. In International Conference on Learning Representations, 2017.
  • Kim et al. [2019] Sungwon Kim, Sang-Gil Lee, Jongyoon Song, Jaehyeon Kim, and Sungroh Yoon. Flowavenet: A generative flow for raw audio. In International Conference on Machine Learning, pages 3370–3378, 2019.
  • Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
  • Kingma and Welling [2014] Diederik P Kingma and Max Welling. Stochastic gradient vb and the variational auto-encoder. In Proceedings of the International Conference on Learning Representations, 2014.
  • Kingma and Dhariwal [2018] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
  • Kingma et al. [2016] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
  • Kumar et al. [2020] Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. Videoflow: A flow-based generative model for video. International Conference on Learning Representations, 2020.
  • Laparra et al. [2011] Valero Laparra, Gustavo Camps-Valls, and Jesús Malo. Iterative gaussianization: from ica to random rotations. IEEE transactions on neural networks, 22(4):537–549, 2011.
  • Li and Mandt [2018] Yingzhen Li and Stephan Mandt. A deep generative model for disentangled representations of sequential data. In International Conference on Machine Learning, 2018.
  • Lombardo et al. [2019] Salvator Lombardo, Jun Han, Christopher Schroers, and Stephan Mandt. Deep generative video compression. In Advances in Neural Information Processing Systems, pages 9283–9294, 2019.
  • Marino et al. [2018] Joseph Marino, Milan Cvitkovic, and Yisong Yue. A general method for amortizing variational filtering. In Advances in Neural Information Processing Systems, pages 7857–7868, 2018.
  • Murphy [2012] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
  • Oliva et al. [2018] Junier Oliva, Avinava Dubey, Manzil Zaheer, Barnabas Poczos, Ruslan Salakhutdinov, Eric Xing, and Jeff Schneider. Transformation autoregressive networks. In International Conference on Machine Learning, pages 3895–3904, 2018.
  • Oliver [1952] BM Oliver. Efficient coding. The Bell System Technical Journal, 31(4):724–750, 1952.
  • Palumbo et al. [2016] Filippo Palumbo, Claudio Gallicchio, Rita Pucci, and Alessio Micheli. Human activity recognition using multisensor data fusion based on reservoir computing. Journal of Ambient Intelligence and Smart Environments, 8(2):87–107, 2016.
  • Papamakarios et al. [2017] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
  • Ping et al. [2019] Wei Ping, Kainan Peng, and Jitong Chen. Clarinet: Parallel wave generation in end-to-end text-to-speech. In International Conference on Learning Representations, 2019.
  • Pourahmadi [2011] Mohsen Pourahmadi. Covariance estimation: The glm and regularization perspectives. Statistical Science, pages 369–387, 2011.
  • Prenger et al. [2019] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. Waveglow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621. IEEE, 2019.
  • Radford et al. [2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • Rezende and Mohamed [2015] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
  • Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning, pages 1278–1286, 2014.
  • Rhinehart et al. [2018] Nicholas Rhinehart, Kris M Kitani, and Paul Vernaza. R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 772–788, 2018.
  • Rhinehart et al. [2019] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. Precog: Prediction conditioned on goals in visual multi-agent settings. In Proceedings of the International Conference on Computer Vision (ICCV), 2019.
  • Rippel and Adams [2013] Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125, 2013.
  • Schmidt et al. [2019] Florian Schmidt, Stephan Mandt, and Thomas Hofmann. Autoregressive text generation beyond feedback loops. In Empirical Methods in Natural Language Processing, pages 3391–3397, 2019.
  • Schuldt et al. [2004] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local svm approach. In International Conference on Pattern Recognition, 2004.
  • Srinivasan et al. [1982] Mandyam Veerambudi Srinivasan, Simon Barry Laughlin, and Andreas Dubs. Predictive coding: a fresh view of inhibition in the retina. Proceedings of the Royal Society of London. Series B. Biological Sciences, 216(1205):427–459, 1982.
  • Srivastava et al. [2015a] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In International conference on Machine Learning, pages 843–852, 2015a.
  • Srivastava et al. [2015b] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Advances in neural information processing systems (NIPS), pages 2377–2385, 2015b.
  • van den Oord et al. [2016a] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.
  • van den Oord et al. [2016b] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756, 2016b.
  • van den Oord et al. [2018] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, et al. Parallel wavenet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, pages 3915–3923, 2018.
  • Vaswani et al. [2018] Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416, 2018. URL http://arxiv.org/abs/1803.07416.
  • Wiegand et al. [2003] Thomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.
  • Winkler et al. [2019] Christina Winkler, Daniel Worrall, Emiel Hoogeboom, and Max Welling. Learning likelihoods with conditional normalizing flows. arXiv preprint arXiv:1912.00042, 2019.
  • Xue et al. [2016] Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, 2016.
  • Yang et al. [2021] Ruihan Yang, Yibo Yang, Joseph Marino, and Stephan Mandt. Hierarchical autoregressive modeling for neural video compression. In International Conference on Learning Representations, 2021.
  • Ziegler and Rush [2019] Zachary Ziegler and Alexander Rush. Latent normalizing flows for discrete sequences. In International Conference on Machine Learning, pages 7673–7682, 2019.

Appendix A Lower Bound Derivation

Consider the model defined in Section 3.2.1, with the conditional likelihood parameterized with autoregressive flows. That is, we parameterize

$$\mathbf{x}_{t}=\bm{\mu}_{\theta}(\mathbf{x}_{<t})+\bm{\sigma}_{\theta}(\mathbf{x}_{<t})\odot\mathbf{y}_{t}, \tag{13}$$

yielding

$$p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t},\mathbf{z}_{\leq t})=p_{\theta}(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{z}_{\leq t})\left|\det\left(\frac{\partial\mathbf{x}_{t}}{\partial\mathbf{y}_{t}}\right)\right|^{-1}. \tag{14}$$

The joint distribution over all time steps is then given as

$$p_{\theta}(\mathbf{x}_{1:T},\mathbf{z}_{1:T})=\prod_{t=1}^{T}p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t},\mathbf{z}_{\leq t})\,p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t}) \tag{15}$$
$$=\prod_{t=1}^{T}p_{\theta}(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{z}_{\leq t})\left|\det\left(\frac{\partial\mathbf{x}_{t}}{\partial\mathbf{y}_{t}}\right)\right|^{-1}p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t}). \tag{16}$$

To perform variational inference, we consider a filtering approximate posterior of the form

$$q(\mathbf{z}_{1:T}|\mathbf{x}_{1:T})=\prod_{t=1}^{T}q(\mathbf{z}_{t}|\mathbf{x}_{\leq t},\mathbf{z}_{<t}). \tag{17}$$

We can then plug these expressions into the evidence lower bound:

$$\mathcal{L}\equiv\mathbb{E}_{q(\mathbf{z}_{1:T}|\mathbf{x}_{1:T})}\left[\log p_{\theta}(\mathbf{x}_{1:T},\mathbf{z}_{1:T})-\log q(\mathbf{z}_{1:T}|\mathbf{x}_{1:T})\right] \tag{18}$$
$$=\mathbb{E}_{q(\mathbf{z}_{1:T}|\mathbf{x}_{1:T})}\Bigg[\log\left(\prod_{t=1}^{T}p_{\theta}(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{z}_{\leq t})\left|\det\left(\frac{\partial\mathbf{x}_{t}}{\partial\mathbf{y}_{t}}\right)\right|^{-1}p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t})\right)-\log\left(\prod_{t=1}^{T}q(\mathbf{z}_{t}|\mathbf{x}_{\leq t},\mathbf{z}_{<t})\right)\Bigg] \tag{19}$$
$$=\mathbb{E}_{q(\mathbf{z}_{1:T}|\mathbf{x}_{1:T})}\Bigg[\sum_{t=1}^{T}\log p_{\theta}(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{z}_{\leq t})-\log\frac{q(\mathbf{z}_{t}|\mathbf{x}_{\leq t},\mathbf{z}_{<t})}{p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t})}-\log\left|\det\left(\frac{\partial\mathbf{x}_{t}}{\partial\mathbf{y}_{t}}\right)\right|\Bigg]. \tag{20}$$

Finally, in the filtering setting, we can rewrite the expectation, bringing it inside of the sum (see [25, 49]):

$$\mathcal{L}=\sum_{t=1}^{T}\mathbb{E}_{q(\mathbf{z}_{\leq t}|\mathbf{x}_{\leq t})}\Bigg[\log p_{\theta}(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{z}_{\leq t})-\log\frac{q(\mathbf{z}_{t}|\mathbf{x}_{\leq t},\mathbf{z}_{<t})}{p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t})}-\log\left|\det\left(\frac{\partial\mathbf{x}_{t}}{\partial\mathbf{y}_{t}}\right)\right|\Bigg]. \tag{21}$$

Because there exists a one-to-one mapping between $\mathbf{x}_{1:T}$ and $\mathbf{y}_{1:T}$, we can equivalently condition the approximate posterior and the prior on $\mathbf{y}$, i.e.

$$\mathcal{L}=\sum_{t=1}^{T}\mathbb{E}_{q(\mathbf{z}_{\leq t}|\mathbf{y}_{\leq t})}\Bigg[\log p_{\theta}(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{z}_{\leq t})-\log\frac{q(\mathbf{z}_{t}|\mathbf{y}_{\leq t},\mathbf{z}_{<t})}{p_{\theta}(\mathbf{z}_{t}|\mathbf{y}_{<t},\mathbf{z}_{<t})}-\log\left|\det\left(\frac{\partial\mathbf{x}_{t}}{\partial\mathbf{y}_{t}}\right)\right|\Bigg]. \tag{22}$$
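As a minimal sketch of a single time-step term of Eq. 22 (one reparameterized posterior sample, with the density-ratio term replaced by its expectation, the analytic KL divergence), the snippet below assumes the posterior, prior, and likelihood objects are torch Distributions / callables supplied by the surrounding model.

```python
import torch.distributions as D

def elbo_step(y_t, log_det_t, posterior, prior, likelihood):
    """One time-step term of Eq. 22 (single-sample estimate).

    y_t        : transformed observation at time t.
    log_det_t  : scalar log|det dx_t/dy_t| from the flow (sum of log-scales).
    posterior  : q(z_t | y_<=t, z_<t), a torch Distribution (assumed given).
    prior      : p(z_t | y_<t, z_<t), a torch Distribution (assumed given).
    likelihood : assumed callable mapping a sampled z_t to p(y_t | y_<t, z_<=t).
    """
    z_t = posterior.rsample()                           # reparameterized sample
    recon = likelihood(z_t).log_prob(y_t).sum()         # log p(y_t | y_<t, z_<=t)
    kl = D.kl_divergence(posterior, prior).sum()        # E_q[log q - log p]
    return recon - kl - log_det_t, z_t
```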

Appendix B Experiment Details

B.1 Flow Architecture

The affine autoregressive flow architecture is shown in Figure 9. The shift and scale of the affine transform are conditioned on three previous inputs. For each flow, we first apply 4 convolutional layers with kernel size (3,3), stride 1, and padding 1 on each conditioned observation, preserving the input shape. The outputs are concatenated along the channel dimension and go through another 4 convolutional layers with kernel size (3,3), stride 1, and padding 1. Finally, separate convolutional layers with the same kernel size, stride, and padding are used to output the shift and log-scale. We use ReLU non-linearities for all convolutional layers.
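A PyTorch sketch of this conditioner, following the layer sizes above, is shown below; the hidden width (64), the batch handling, and the absence of a non-linearity on the two output layers are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class AffineFlowConditioner(nn.Module):
    """Shift / log-scale network conditioned on three previous frames (sketch)."""

    def __init__(self, channels=3, hidden=64, context=3):
        super().__init__()

        def conv_stack(in_ch, out_ch, n_layers=4):
            layers = []
            for i in range(n_layers):
                layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                     kernel_size=3, stride=1, padding=1),
                           nn.ReLU()]
            return nn.Sequential(*layers)

        self.per_frame = conv_stack(channels, hidden)      # shared across the three frames
        self.merge = conv_stack(context * hidden, hidden)  # after channel-wise concatenation
        self.to_shift = nn.Conv2d(hidden, channels, kernel_size=3, stride=1, padding=1)
        self.to_log_scale = nn.Conv2d(hidden, channels, kernel_size=3, stride=1, padding=1)

    def forward(self, past_frames):
        # past_frames: (B, context, C, H, W)
        feats = [self.per_frame(f) for f in past_frames.unbind(dim=1)]
        h = self.merge(torch.cat(feats, dim=1))            # concatenate along channels
        return self.to_shift(h), self.to_log_scale(h)
```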

B.2 Sequential Latent Variable Model Architecture

For sequential latent variable models, we use a DC-GAN [58] encoder architecture (Figure 9(d)), with 4 convolutional layers of kernel size (4,4), stride 2, and padding 1, followed by another convolutional layer of kernel size (4,4), stride 1, and no padding. The encoding is sent to one or two LSTM layers [32] followed by separate linear layers to output the mean and log-variance for $q_{\phi}(\mathbf{z}_{t}|\mathbf{x}_{\leq t},\mathbf{z}_{<t})$. We note that for SLVM, we input $\mathbf{x}_{t}$ into the encoder, whereas for SLVM + AF, we input $\mathbf{y}_{t}$. The architecture for the conditional prior, $p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t})$, shown in Figure 9(e), contains two fully-connected layers, which take the previous latent variable as input, followed by one or two LSTM layers, and separate linear layers to output the mean and log-variance. The decoder architecture, shown in Figure 9(c), mirrors the encoder architecture, using transposed convolutions. In SLVM, we use two LSTM layers for modeling the conditional prior and approximate posterior distributions, while in SLVM + 1-AF, we use a single LSTM layer for each. We use leaky ReLU non-linearities for the encoder and decoder architectures and ReLU non-linearities in the conditional prior architecture.

B.3 VideoFlow Architecture

For VideoFlow experiments, we use the official code provided by [45] in the tensor2tensor repository [72]. Due to memory and computational constraints, we use a smaller version of the model architecture used by [45] for the BAIR Robot Pushing dataset. We change depth from 24 to 12 and latent_encoder_width from 256 to 128. This reduces the number of parameters from roughly 67 million to roughly 32 million. VideoFlow contains a hierarchy of latent variables, with the latent variable at level $l$ at time $t$ denoted as $\mathbf{z}_{t}^{(l)}$. The prior on this latent variable is $p_{\theta}(\mathbf{z}_{t}^{(l)}|\mathbf{z}_{<t}^{(l)},\mathbf{z}_{t}^{(>l)})=\mathcal{N}(\mathbf{z}_{t}^{(l)};\bm{\mu}_{t}^{(l)},\textrm{diag}((\bm{\sigma}_{t}^{(l)})^{2}))$, where $\bm{\mu}_{t}^{(l)}$ and $\bm{\sigma}_{t}^{(l)}$ are functions of $\mathbf{z}_{<t}^{(l)}$ and $\mathbf{z}_{t}^{(>l)}$. We note that [45] parameterize $\bm{\mu}_{t}^{(l)}$ as $\bm{\mu}_{t}^{(l)}=\mathbf{z}_{t-1}^{(l)}+\widetilde{\bm{\mu}}_{t}^{(l)}$, where $\widetilde{\bm{\mu}}_{t}^{(l)}$ is the function output. [45] refer to this as latent_skip. This is already a special case of an affine autoregressive flow, with a hard-coded shift of $\mathbf{z}_{t-1}^{(l)}$ and a scale of $\mathbf{1}$. We parameterize an affine autoregressive flow at each latent level, with a shift, $\bm{\alpha}_{t}^{(l)}$, and scale, $\bm{\beta}_{t}^{(l)}$, which are functions of $\mathbf{z}_{<t}^{(l)}$, using the same 5-block ResNet architecture as [45]. In practice, these functions are conditioned on the variables at the past three time steps. The affine autoregressive flow produces a new variable:

\mathbf{u}_{t}^{(l)}=\frac{\mathbf{z}_{t}^{(l)}-\bm{\alpha}_{t}^{(l)}}{\bm{\beta}_{t}^{(l)}},

which we then model using the same prior distribution and architecture as [45]: $p_{\theta}(\mathbf{u}_{t}^{(l)}|\mathbf{z}_{<t}^{(l)},\mathbf{z}_{t}^{(>l)})=\mathcal{N}(\mathbf{u}_{t}^{(l)};\bm{\mu}_{t}^{(l)},\textrm{diag}((\bm{\sigma}_{t}^{(l)})^{2}))$, where $\bm{\mu}_{t}^{(l)}$ and $\bm{\sigma}_{t}^{(l)}$, again, are functions of $\mathbf{z}_{<t}^{(l)}$ (or, equivalently, $\mathbf{u}_{<t}^{(l)}$) and $\mathbf{z}_{t}^{(>l)}$.
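In code, the per-level transform and its contribution to the log-likelihood amount to the following sketch (tensor shapes and function names are ours, not VideoFlow's; the scale is assumed positive, e.g. produced by exponentiating a log-scale):

```python
import torch

def to_noise(z_t, alpha_t, beta_t):
    """Per-level affine autoregressive transform: u_t = (z_t - alpha_t) / beta_t.
    Also returns the log-det-Jacobian term, log|du/dz| = -sum log beta, which is
    added to log p(u_t) under the change of variables."""
    u_t = (z_t - alpha_t) / beta_t
    log_det = -torch.log(beta_t).flatten(start_dim=1).sum(dim=1)
    return u_t, log_det

def from_noise(u_t, alpha_t, beta_t):
    """Inverse transform, used when sampling: z_t = alpha_t + beta_t * u_t."""
    return alpha_t + beta_t * u_t
```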

B.4 Non-Video Sequence Modeling Architecture

We again compare the various model classes in terms of log-likelihood estimation. We use fully-connected networks to parameterize all functions within the prior, approximate posterior, and conditional likelihood of each model. All networks consist of 2 layers of 256 units with highway connectivity [68]. For autoregressive flows, we use ELU non-linearities [12]. For stability, we found it necessary to use tanh non-linearities in the networks for SLVMs (prior, conditional likelihood, and approximate posterior). In SLVMs, the prior is conditioned on $\mathbf{z}_{t-1}$, the approximate posterior is conditioned on $\mathbf{z}_{t-1}$ and $\mathbf{y}_{t}$, and the conditional likelihood is conditioned on $\mathbf{z}_{t}$. We use a latent space dimensionality of 16 for all SLVMs.
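A minimal sketch of one such fully-connected highway network, assuming the standard gating formulation of [68] (the gate parameterization and the placement of the non-linearity are our choices):

```python
import torch
import torch.nn as nn

class HighwayMLP(nn.Module):
    """Two fully-connected highway layers of 256 units, as used for the
    non-video experiments (gating details are an assumption)."""
    def __init__(self, in_dim, hidden=256, n_layers=2, act=nn.ELU()):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)
        self.transforms = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(n_layers)])
        self.gates = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(n_layers)])
        self.act = act

    def forward(self, x):
        h = self.proj(x)
        for transform, gate in zip(self.transforms, self.gates):
            t = torch.sigmoid(gate(h))                     # highway gate
            h = t * self.act(transform(h)) + (1 - t) * h   # gated residual update
        return h
```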

B.5 Training Set-Up

We use the Adam optimizer [41] with a learning rate of $1\times 10^{-4}$ to train all models. For Moving MNIST, we use a batch size of 16 and train for 200,000 iterations for SLVM and 100,000 iterations for 1-AF, 2-AF, and SLVM + 1-AF. For BAIR Robot Pushing, we use a batch size of 8 and train for 200,000 iterations for all models. For KTH Actions, we use a batch size of 8 and train for 90,000 iterations for all models. Batch norm [36] is applied to all convolutional layers that do not output distribution or affine transform parameters. We randomly crop sequences of length 13 from all sequences and evaluate on the last 10 frames. For 2-AF models, we crop sequences of length 16 in order to condition both flows on three previous inputs. For VideoFlow experiments, we use the same hyper-parameters as [45] (with the exception of the two architecture changes mentioned above) and train for 100,000 iterations.
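For concreteness, the cropping and evaluation protocol can be sketched as follows (the array layout and helper name are assumptions for illustration):

```python
import numpy as np

def random_crop(video, crop_len=13, eval_len=10, rng=np.random):
    """Sample a length-13 window from a sequence (16 for 2-AF models); the
    last 10 frames are used for log-likelihood evaluation."""
    T = video.shape[0]
    start = rng.randint(0, T - crop_len + 1)
    clip = video[start:start + crop_len]
    return clip[:-eval_len], clip[-eval_len:]  # (conditioning frames, evaluated frames)
```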

Figure 9: SLVM Architecture. Diagrams are shown for the (a) approximate posterior, (b) prior, and (c) conditional likelihood of the sequential latent variable model (SLVM). In (d) and (e), we show the approximate posterior and prior used with SLVM + AF, respectively. The conditional likelihood architecture is the same in both setups. Note: for SLVM + AF, we input $\mathbf{y}_{t}$ into the approximate posterior encoder, rather than $\mathbf{x}_{t}$. conv denotes a convolutional layer, LSTM denotes a long short-term memory layer, fc denotes a fully-connected layer, and t_conv denotes a transposed convolutional layer. For conv and t_conv layers, the numbers in parentheses respectively denote the number of filters, filter size, stride, and padding of the layer. For fc and LSTM layers, the number in parentheses denotes the number of units. SLVM contains one additional LSTM layer in both the approximate posterior and conditional prior.
Model | 1-AF | 2-AF | SLVM | SLVM + 1-AF
Moving MNIST | 343k | 686k | 11,302k | 10,592k
BAIR Robot Pushing | 363k | 726k | 11,325k | 10,643k
KTH Actions | 343k | 686k | 11,302k | 10,592k
Table 3: Number of parameters for each model on each dataset. Flow-based models contain relatively few parameters compared with the SLVM, as our flows consist primarily of $3\times 3$ convolutions with limited channels. In the SLVM, we use two LSTM layers for modeling the prior and approximate posterior distributions of the latent variable, while in SLVM + 1-AF, we use a single LSTM layer for each.

B.6 Quantifying Decorrelation

To quantify the temporal redundancy reduction resulting from affine autoregressive pre-processing, we evaluate the empirical correlation between successive frames for the data observations and noise variables, averaged over spatial locations and channels. This is an average normalized version of the auto-covariance of each signal at a time delay of 1 time step. Specifically, we estimate the temporal correlation as

\textrm{corr}_{\mathbf{x}}\equiv\frac{1}{HWC}\sum_{i,j,k}^{H,W,C}\mathbb{E}_{x^{(i,j,k)}_{t},\,x^{(i,j,k)}_{t+1}\sim\mathcal{D}}\left[\xi_{t,t+1}(i,j,k)\right], \qquad (23)

where the term inside the expectation is

\xi_{t,t+1}(i,j,k)\equiv\frac{(x^{(i,j,k)}_{t}-\mu^{(i,j,k)})(x^{(i,j,k)}_{t+1}-\mu^{(i,j,k)})}{\left(\sigma^{(i,j,k)}\right)^{2}}. \qquad (24)

Here, $x_{t}^{(i,j,k)}$ denotes the value of the image at spatial location $(i,j)$ and channel $k$ at time $t$, $\mu^{(i,j,k)}$ is the mean of this dimension, and $\sigma^{(i,j,k)}$ is its standard deviation. $H$, $W$, and $C$ respectively denote the height, width, and number of channels of the observations, and $\mathcal{D}$ denotes the dataset. We define an analogous expression for $\mathbf{y}$, denoted $\textrm{corr}_{\mathbf{y}}$.
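A NumPy sketch of this estimator, assuming the observations are stacked in an array of shape (N, T, H, W, C), is given below.

```python
import numpy as np

def temporal_correlation(frames, eps=1e-8):
    """Empirical lag-1 temporal correlation (Eqs. 23-24), averaged over the
    dataset, time steps, spatial locations, and channels.
    `frames` is assumed to have shape (N, T, H, W, C)."""
    mu = frames.mean(axis=(0, 1), keepdims=True)         # per-(i, j, k) mean
    var = frames.var(axis=(0, 1), keepdims=True) + eps   # per-(i, j, k) variance
    centered = frames - mu
    xi = centered[:, :-1] * centered[:, 1:] / var        # Eq. 24 for each (t, t+1) pair
    return xi.mean()
```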

Appendix C Illustrative Example

Figure 10: Motivating Example. Plots are shown for a sample of $\mathbf{x}_{1:T}$ (left), $\mathbf{u}_{1:T}$ (center), and $\mathbf{w}_{1:T}$ (right). Here, $\mathbf{w}_{1:T}\sim\mathcal{N}(\mathbf{w}_{1:T};\mathbf{0},\mathbf{I})$, and $\mathbf{u}$ and $\mathbf{x}$ are initialized at 0. Moving from $\mathbf{x}\rightarrow\mathbf{u}\rightarrow\mathbf{w}$ via affine transforms results in successively less temporal correlation and therefore simpler dynamics.

To build intuition for the benefits of temporal pre-processing (e.g., decorrelation) in downstream dynamics modeling, we present the following simple kinematic example. Consider the discrete dynamical system defined by the following set of equations:

\mathbf{x}_{t} = \mathbf{x}_{t-1}+\mathbf{u}_{t}, \qquad (25)
\mathbf{u}_{t} = \mathbf{u}_{t-1}+\mathbf{w}_{t}, \qquad (26)

where $\mathbf{w}_{t}\sim\mathcal{N}(\mathbf{w}_{t};\mathbf{0},\bm{\Sigma})$. We can express $\mathbf{x}_{t}$ and $\mathbf{u}_{t}$ in probabilistic terms as

\mathbf{x}_{t} \sim\mathcal{N}(\mathbf{x}_{t};\mathbf{x}_{t-1}+\mathbf{u}_{t-1},\bm{\Sigma}), \qquad (27)
\mathbf{u}_{t} \sim\mathcal{N}(\mathbf{u}_{t};\mathbf{u}_{t-1},\bm{\Sigma}). \qquad (28)

Physically, this describes the noisy dynamics of a particle with momentum and mass 1, subject to Gaussian noise. That is, $\mathbf{x}$ represents position, $\mathbf{u}$ represents velocity, and $\mathbf{w}$ represents stochastic forces. If we consider the dynamics at the level of $\mathbf{x}$, we can use the fact that $\mathbf{u}_{t-1}=\mathbf{x}_{t-1}-\mathbf{x}_{t-2}$ to write

p(\mathbf{x}_{t}|\mathbf{x}_{t-1},\mathbf{x}_{t-2})=\mathcal{N}(\mathbf{x}_{t};\mathbf{x}_{t-1}+\mathbf{x}_{t-1}-\mathbf{x}_{t-2},\bm{\Sigma}). \qquad (29)

Thus, we see that in the space of $\mathbf{x}$, the dynamics are second-order Markov, requiring knowledge of the past two time steps. However, at the level of $\mathbf{u}$ (Eq. 28), the dynamics are first-order Markov, requiring only the previous time step. Yet, note that $\mathbf{u}_{t}$ is, in fact, an affine autoregressive transform of $\mathbf{x}_{t}$, because $\mathbf{u}_{t}=\mathbf{x}_{t}-\mathbf{x}_{t-1}$ is a special case of the general form $\frac{\mathbf{x}_{t}-\bm{\mu}_{\theta}(\mathbf{x}_{<t})}{\bm{\sigma}_{\theta}(\mathbf{x}_{<t})}$. In Eq. 25, we see that the Jacobian of this transform is $\partial\mathbf{x}_{t}/\partial\mathbf{u}_{t}=\mathbf{I}$, so, from the change of variables formula, we have $p(\mathbf{x}_{t}|\mathbf{x}_{t-1},\mathbf{x}_{t-2})=p(\mathbf{u}_{t}|\mathbf{u}_{t-1})$. In other words, an affine autoregressive transform has allowed us to convert a second-order Markov system into a first-order Markov system, thereby simplifying the dynamics. Continuing this process to move to $\mathbf{w}_{t}=\mathbf{u}_{t}-\mathbf{u}_{t-1}$, we arrive at a representation that is entirely temporally decorrelated, i.e., no dynamics, because $p(\mathbf{w}_{t})=\mathcal{N}(\mathbf{w}_{t};\mathbf{0},\bm{\Sigma})$. A sample from this system is shown in Figure 10, illustrating this process of temporal decorrelation.
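This decorrelation is easy to reproduce numerically. The sketch below simulates a one-dimensional version of Eqs. 25-26 and recovers $\mathbf{u}$ and $\mathbf{w}$ from $\mathbf{x}$ by differencing (initialization at zero is a simplification):

```python
import numpy as np

def simulate(T=200, sigma=1.0, seed=0):
    """Simulate Eqs. 25-26 (x, u initialized at zero) and recover u and w
    from x via the differencing (affine autoregressive) transforms."""
    rng = np.random.default_rng(seed)
    w = sigma * rng.standard_normal(T)   # stochastic forces
    u = np.cumsum(w)                     # u_t = u_{t-1} + w_t  (velocity)
    x = np.cumsum(u)                     # x_t = x_{t-1} + u_t  (position)
    u_rec = np.diff(x, prepend=0.0)      # u_t = x_t - x_{t-1}
    w_rec = np.diff(u_rec, prepend=0.0)  # w_t = u_t - u_{t-1}
    return x, u_rec, w_rec
```

Computing the lag-1 correlation of each returned signal, e.g. np.corrcoef(x[:-1], x[1:])[0, 1], should show the correlation decreasing from $\mathbf{x}$ to $\mathbf{u}$ and dropping to roughly zero for $\mathbf{w}$, consistent with Figure 10.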

Appendix D Additional Experimental Results

Model | M-MNIST | BAIR | KTH
1-AF | 2.06 | 2.98 | 2.95
2-AF | 2.04 | 2.76 | 2.95
SLVM | $\leq 1.93$ | $\leq 3.46$ | $\leq 3.05$
SLVM + $\Delta\mathbf{x}$ | $\leq 2.47$ | $\leq 3.05$ | $\leq 2.46$
SLVM + 1-AF | $\leq \mathbf{1.85}$ | $\leq \mathbf{2.31}$ | $\leq \mathbf{2.21}$
VF | – | 1.50 | –
VF + AF | – | $\mathbf{1.49}$ | –
Table 4: Training Quantitative Comparison. Average training negative log-likelihood in nats per dimension for Moving MNIST, BAIR Robot Pushing, and KTH Actions. VF results are reported for BAIR Robot Pushing only.

D.1 Additional Qualitative Results

Figure 11: Autoregressive Flow Visualization on KTH Actions. Visualization of the flow component for (a) standalone flow-based models and (b) sequential latent variable models with flow-based conditional likelihoods on KTH Actions. From top to bottom, each figure shows 1) the original frames, $\mathbf{x}_{t}$, 2) the predicted shift, $\bm{\mu}_{\theta}(\mathbf{x}_{<t})$, 3) the predicted scale, $\bm{\sigma}_{\theta}(\mathbf{x}_{<t})$, and 4) the noise, $\mathbf{y}_{t}$, obtained from the inverse transform.
Figure 12: SLVM w/ 2-AF Visualization on Moving MNIST. Visualization of the flow component for a sequential latent variable model with a 2-layer flow-based conditional likelihood on Moving MNIST. On the left side, from top to bottom, the figure shows 1) the original frames, $\mathbf{x}_{t}$, 2) the lower-level predicted shift, $\bm{\mu}_{\theta}^{1}(\mathbf{x}_{<t})$, and 3) the lower-level predicted scale, $\bm{\sigma}_{\theta}^{1}(\mathbf{x}_{<t})$. On the right side, from top to bottom, we have 1) the higher-level predicted shift, $\bm{\mu}_{\theta}^{2}(\mathbf{x}_{<t})$, 2) the higher-level predicted scale, $\bm{\sigma}_{\theta}^{2}(\mathbf{x}_{<t})$, and 3) the noise, $\mathbf{y}_{t}$, obtained from the inverse transform.
Figure 13: Generated Moving MNIST Samples. Sample frame sequences generated from a 2-AF model.
Figure 14: Generated BAIR Robot Pushing Samples. Sample frame sequences generated from SLVM + 1-AF. Sequences remain relatively coherent throughout, but do not display large changes across frames.