

Improving Sequential Latent Variable Models
with Autoregressive Flows

Joseph Marino
California Institute of Technology
[email protected]

Lei Chen
Simon Fraser University
[email protected]

Jiawei He
Simon Fraser University
[email protected]

Stephan Mandt
University of California, Irvine
[email protected]
Abstract

We propose an approach for improving sequence modeling based on autoregressive normalizing flows. Each autoregressive transform, acting across time, serves as a moving frame of reference, removing temporal correlations and simplifying the modeling of higher-level dynamics. This technique provides a simple, general-purpose method for improving sequence modeling, with connections to existing and classical techniques. We demonstrate the proposed approach both with standalone flow-based models and as a component within sequential latent variable models. Results are presented on three benchmark video datasets and three other time series datasets, where autoregressive flow-based dynamics improve log-likelihood performance over baseline models. Finally, we illustrate the decorrelation and improved generalization properties of using flow-based dynamics.

1 Introduction

Data often contain sequential structure, providing a rich signal for learning models of the world. Such models are useful for representing sequences [47, 27] and planning actions [28, 10]. Recent advances in deep learning have facilitated learning sequential probabilistic models directly from high-dimensional data [26], like audio and video. A variety of techniques have emerged for learning deep sequential models, including memory units [32] and stochastic latent variables [11, 6]. These techniques have enabled sequential models to capture increasingly complex dynamics. In this paper, we explore the complementary direction, asking: can we simplify the dynamics of the data to meet the capacity of the model? To do so, we aim to learn a frame of reference to assist in modeling the data.

Frames of reference are an important consideration in sequence modeling, as they can simplify dynamics by removing redundancy. For instance, in a physical system, the frame of reference that moves with the system’s center of mass removes the redundancy in displacement. Frames of reference are also more widely applicable to arbitrary sequences. Indeed, video compression schemes use predictions as a frame of reference to remove temporal redundancy [52, 2, 76]. By learning and applying a similar type of temporal normalization for sequence modeling, the model can focus on aspects that are not predicted by the low-level frame of reference, thereby simplifying dynamics modeling.

We formalize this notion of temporal normalization through the framework of autoregressive normalizing flows [44, 54]. In the context of sequences, these flows form predictions across time, attempting to remove temporal dependencies [66]. Thus, autoregressive flows can act as a pre-processing technique to simplify dynamics. We preview this approach in Figure 1, where an autoregressive flow modeling the data (top) creates a transformed space for modeling dynamics (bottom). The transformed space is largely invariant to absolute pixel value, focusing instead on capturing deviations and motion.

We empirically demonstrate this modeling technique, both with standalone autoregressive normalizing flows, as well as within sequential latent variable models. While normalizing flows have been applied in sequential contexts previously, our main contributions are 1) showing how these models can act as a general pre-processing technique to improve dynamics modeling, and 2) empirically demonstrating log-likelihood and generalization improvements on three benchmark video datasets and on time series data from the UCI Machine Learning Repository. This technique also connects to previous work in dynamics modeling, probabilistic models, and sequence compression, enabling directions for further investigation.

Figure 1: Sequence Modeling with Autoregressive Flows. Top: Pixel values (solid) for a particular pixel location in a video sequence. An autoregressive flow models the pixel sequence using an affine shift (dashed) and scale (shaded), acting as a frame of reference. Middle: Frames of the data sequence (top) and the resulting “noise” (bottom) from applying the shift and scale. The redundant, static background has been largely removed. Bottom: The noise values (solid) are modeled using a base distribution (dashed and shaded) provided by a higher-level model. By removing temporal redundancy from the data sequence, the autoregressive flow simplifies dynamics modeling.

2 Background

2.1 Autoregressive Models

Consider modeling discrete sequences of observations, $\mathbf{x}_{1:T}\sim p_{\textrm{data}}(\mathbf{x}_{1:T})$, using a probabilistic model, $p_{\theta}(\mathbf{x}_{1:T})$, with parameters $\theta$. Autoregressive models [21, 7] use the chain rule of probability to express the joint distribution over time steps as the product of $T$ conditional distributions. These models are often formulated in forward temporal order:

$$p_{\theta}(\mathbf{x}_{1:T})=\prod_{t=1}^{T}p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t}). \tag{1}$$

Each conditional distribution, $p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t})$, models the dependence between time steps. For continuous variables, it is often assumed that each distribution takes a simple form, such as a diagonal Gaussian: $p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t})=\mathcal{N}(\mathbf{x}_{t};\bm{\mu}_{\theta}(\mathbf{x}_{<t}),\operatorname{diag}(\bm{\sigma}^{2}_{\theta}(\mathbf{x}_{<t})))$, where $\bm{\mu}_{\theta}(\cdot)$ and $\bm{\sigma}_{\theta}(\cdot)$ are functions denoting the mean and standard deviation. These functions may take past observations as input through a recurrent network or a convolutional window [69]. When applied to spatial data [70], autoregressive models excel at capturing local dependencies. However, due to their restrictive forms, such models often struggle to capture more complex structure.
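For concreteness, a minimal sketch of evaluating Eq. 1 with diagonal-Gaussian conditionals is given below. This is illustrative only, not the paper's implementation: `cond_net` is an assumed, hypothetical network mapping a zero-padded window of past observations to a per-dimension mean and log-scale.

```python
import torch
import torch.distributions as D

def gaussian_ar_log_likelihood(x, cond_net, context=3):
    """Evaluate Eq. 1 for a diagonal-Gaussian autoregressive model.

    x        : (T, D) sequence of observations.
    cond_net : assumed callable mapping a (context, D) window of past
               observations to (mu, log_sigma), each of shape (D,).
    """
    T, D_obs = x.shape
    log_prob = x.new_zeros(())
    for t in range(T):
        # Window of past observations, zero-padded at the start of the sequence.
        past = x[max(0, t - context):t]
        if past.shape[0] < context:
            past = torch.cat([x.new_zeros(context - past.shape[0], D_obs), past], dim=0)
        mu, log_sigma = cond_net(past)
        # Accumulate log p(x_t | x_{<t}) under the diagonal Gaussian conditional.
        log_prob = log_prob + D.Normal(mu, log_sigma.exp()).log_prob(x[t]).sum()
    return log_prob
```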

2.2 Autoregressive (Sequential) Latent Variable Models

Autoregressive models can be improved by incorporating latent variables [50], often represented as a corresponding sequence, $\mathbf{z}_{1:T}$. The joint distribution, $p_{\theta}(\mathbf{x}_{1:T},\mathbf{z}_{1:T})$, has the form:

$$p_{\theta}(\mathbf{x}_{1:T},\mathbf{z}_{1:T})=\prod_{t=1}^{T}p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t},\mathbf{z}_{\leq t})\,p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t}). \tag{2}$$

Unlike the Gaussian form, evaluating $p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t})$ now requires integrating over the latent variables,

$$p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t})=\int p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t},\mathbf{z}_{\leq t})\,p_{\theta}(\mathbf{z}_{\leq t}|\mathbf{x}_{<t})\,d\mathbf{z}_{\leq t}, \tag{3}$$

yielding a more flexible distribution. However, performing this integration in practice is typically intractable, requiring approximate inference techniques, like variational inference [38], or invertible models [45]. Recent works have parameterized these models with deep neural networks, e.g. [11, 24, 20, 39], using amortized variational inference [42, 60]. Typically, the conditional likelihood, $p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t},\mathbf{z}_{\leq t})$, and the prior, $p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t})$, are Gaussian densities, with temporal conditioning handled through recurrent networks. Such models have demonstrated success in audio [11, 20] and video modeling [75, 25, 15, 30, 47]. However, as noted by [45], such models can be difficult to train with standard log-likelihood objectives, often struggling to capture dynamics.

Figure 2: Affine Autoregressive Transform. Computational diagram for an affine autoregressive transform [54]. Each $\mathbf{y}_{t}$ is an affine transform of $\mathbf{x}_{t}$, with the affine parameters potentially non-linear functions of $\mathbf{x}_{<t}$. The inverse transform, shown here, is capable of converting a correlated input, $\mathbf{x}_{1:T}$, into an uncorrelated output, $\mathbf{y}_{1:T}$.

2.3 Autoregressive Flows

Our approach is based on affine autoregressive normalizing flows [44, 54]. Here, we continue with the perspective of temporal sequences; however, these flows were initially developed and demonstrated in static settings. [44] noted that sampling from an autoregressive Gaussian model is an invertible transform, resulting in a normalizing flow [63, 16, 17, 59]. Flow-based models transform simple base probability distributions into more complex ones while maintaining exact likelihood evaluation. To see their connection to autoregressive models, we can express sampling a Gaussian random variable using the reparameterization trick [42, 60]:

$$\mathbf{x}_{t}=\bm{\mu}_{\theta}(\mathbf{x}_{<t})+\bm{\sigma}_{\theta}(\mathbf{x}_{<t})\odot\mathbf{y}_{t}, \tag{4}$$

where $\mathbf{y}_{t}\sim\mathcal{N}(\mathbf{y}_{t};\mathbf{0},\mathbf{I})$ is an auxiliary random variable and $\odot$ denotes element-wise multiplication. Thus, $\mathbf{x}_{t}$ is an invertible transform of $\mathbf{y}_{t}$, with the inverse given as

$$\mathbf{y}_{t}=\frac{\mathbf{x}_{t}-\bm{\mu}_{\theta}(\mathbf{x}_{<t})}{\bm{\sigma}_{\theta}(\mathbf{x}_{<t})}, \tag{5}$$

where division is element-wise. The inverse transform in Eq. 5, shown in Figure 2, normalizes (hence, normalizing flow) $\mathbf{x}_{1:T}$, removing statistical dependencies. Given the functional mapping between $\mathbf{y}_{t}$ and $\mathbf{x}_{t}$ in Eq. 4, the change of variables formula converts between probabilities in each space:

$$\log p_{\theta}(\mathbf{x}_{1:T})=\log p_{\theta}(\mathbf{y}_{1:T})-\log\left|\det\left(\frac{\partial\mathbf{x}_{1:T}}{\partial\mathbf{y}_{1:T}}\right)\right|. \tag{6}$$

By the construction of Eqs. 4 and 5, the Jacobian in Eq. 6 is triangular, enabling efficient evaluation as the product of diagonal terms:

$$\log\left|\det\left(\frac{\partial\mathbf{x}_{1:T}}{\partial\mathbf{y}_{1:T}}\right)\right|=\sum_{t=1}^{T}\sum_{i}\log\sigma_{\theta,i}(\mathbf{x}_{<t}), \tag{7}$$

where $i$ denotes the observation dimension, e.g. pixel. For a Gaussian autoregressive model, the base distribution is $p_{\theta}(\mathbf{y}_{1:T})=\mathcal{N}(\mathbf{y}_{1:T};\mathbf{0},\mathbf{I})$. We can improve upon this simple set-up by chaining transforms together, i.e. parameterizing $p_{\theta}(\mathbf{y}_{1:T})$ as a flow, resulting in hierarchical models.
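To make Eqs. 5–7 concrete, the following minimal sketch (not the released code) normalizes a sequence with a single affine autoregressive transform and applies the change of variables of Eq. 6 under a standard-normal base distribution; `shift_scale_net` is an assumed network returning the shift and log-scale from a window of past observations.

```python
import torch
import torch.distributions as D

def affine_flow_log_likelihood(x, shift_scale_net, context=3):
    """Normalize x (Eq. 5), accumulate the log-det term (Eq. 7), and apply Eq. 6.

    x               : (T, D) sequence.
    shift_scale_net : assumed callable mapping a (context, D) window of past
                      observations to (mu, log_sigma), each of shape (D,).
    """
    T, D_obs = x.shape
    y = torch.zeros_like(x)
    log_det = x.new_zeros(())
    for t in range(T):
        past = x[max(0, t - context):t]
        if past.shape[0] < context:
            past = torch.cat([x.new_zeros(context - past.shape[0], D_obs), past], dim=0)
        mu, log_sigma = shift_scale_net(past)
        y[t] = (x[t] - mu) / log_sigma.exp()     # inverse transform (Eq. 5)
        log_det = log_det + log_sigma.sum()      # contribution to log|det dx/dy| (Eq. 7)
    base_log_prob = D.Normal(0.0, 1.0).log_prob(y).sum()
    return base_log_prob - log_det               # change of variables (Eq. 6)
```

Replacing the standard-normal base distribution with a learned sequence model over $\mathbf{y}_{1:T}$ recovers the setup developed in Section 3.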

2.4 Related Work

Autoregressive flows were initially considered in the contexts of variational inference [44] and generative modeling [54]. These approaches are generalizations of previous approaches with affine transforms [16, 17]. While autoregressive flows are well-suited for sequential data, these approaches, as well as many recent approaches [34, 51, 43], were initially applied to static data, such as images.

Recent works have started applying flow-based models to sequential data. [71] and [55] distill autoregressive speech models into flow-based models. [57] and [40] instead train these models directly. [45] use a flow to model individual video frames, with an autoregressive prior modeling dynamics across time steps. [61, 62] use autoregressive flows for modeling vehicle motion, and [31] use flows for motion synthesis with motion-capture data. [77] model discrete observations (e.g., text) by using flows to model dynamics of continuous latent variables. Like these recent works, we apply flow-based models to sequences. However, we demonstrate that autoregressive flows can serve as a general-purpose technique for improving dynamics models. To the best of our knowledge, our work is the first to use flows to pre-process sequences to improve sequential latent variable models.

We utilize affine flows (Eq. 4), a family that includes methods like NICE [16], RealNVP [17], IAF [44], MAF [54], and Glow [43]. However, there has been recent work in non-affine flows [34, 37, 18], which offer further flexibility. We chose to investigate affine flows because they are commonly employed and relatively simple; however, non-affine flows could result in additional improvements.

Autoregressive dynamics models are also prominent in other related areas. Within the statistics and econometrics literature, autoregressive integrated moving average (ARIMA) is a standard technique [8, 29], calculating differences with an autoregressive prediction to remove non-stationary components of a temporal signal. Such methods simplify downstream modeling, e.g., by removing seasonal effects. Low-level autoregressive models are also found in audio [3] and video compression codecs [73, 2, 76], using predictive coding [52] to remove temporal redundancy, thereby improving downstream compression rates. Intuitively, if sequential inputs are highly predictable, it is far more efficient to compress the prediction error rather than each input (e.g., video frame) separately. Finally, we note that autoregressive models are a generic dynamics modeling approach and can, in principle, be parameterized by other techniques, such as LSTMs [32], or combined with other models, such as hidden Markov models (HMMs) [50].

3 Method

We now describe our approach for improving sequence modeling. First, we motivate using autoregressive flows to reduce temporal dependencies, thereby simplifying dynamics. We then show how this simple technique can be incorporated within sequential latent variable models.

Figure 3: Redundancy Reduction. (a) Conditional densities, $p(x_{2}|x_{1})$. (b) The marginal, $p(x_{2})$, differs from the conditional densities; thus, $\mathcal{I}(x_{1};x_{2})>0$. (c) In the normalized space of $y$, the corresponding conditional densities, $p(y_{2}|y_{1})$, are identical. (d) The marginal, $p(y_{2})$, is identical to the conditionals, so $\mathcal{I}(y_{1};y_{2})=0$. Thus, in this case, a conditional affine transform removed the dependencies.

3.1 Motivation: Temporal Redundancy Reduction

Normalizing flows, while often utilized for density estimation, originated from data pre-processing techniques [22, 35, 9], which remove dependencies between dimensions, i.e., redundancy reduction [4]. Removing dependencies simplifies the resulting probability distribution by restricting variation to individual dimensions, generally simplifying downstream tasks [46]. Normalizing flows improve upon these procedures using flexible, non-linear functions [14, 16]. While flows have been used for spatial decorrelation [1, 74] and with other models [33], this capability remains under-explored.

Our main contribution is showing how to utilize autoregressive flows for temporal pre-processing to improve dynamics modeling. Data sequences contain dependencies in time, for example, in the redundancy of video pixels (Figure 1), which are often highly predictable. These dependencies define the dynamics of the data, with the degree of dependence quantified by the multi-information,

$$\mathcal{I}(\mathbf{x}_{1:T})=\sum_{t}\mathcal{H}(\mathbf{x}_{t})-\mathcal{H}(\mathbf{x}_{1:T}), \tag{8}$$

where $\mathcal{H}$ denotes entropy. Normalizing flows are capable of reducing redundancy, arriving at a new sequence, $\mathbf{y}_{1:T}$, with $\mathcal{I}(\mathbf{y}_{1:T})\leq\mathcal{I}(\mathbf{x}_{1:T})$, thereby reducing temporal dependencies. Thus, rather than fit the data distribution directly, we can first simplify the dynamics by pre-processing sequences with a normalizing flow and then fit the resulting sequence. Through training, the flow will attempt to remove redundancies to meet the modeling capacity of the higher-level dynamics model, $p_{\theta}(\mathbf{y}_{1:T})$.

Example

To visualize this procedure for an affine autoregressive flow, consider a one-dimensional input over two time steps, $x_{1}$ and $x_{2}$. For each value of $x_{1}$, there is a conditional density, $p(x_{2}|x_{1})$. Assume that these densities take one of two forms, which are identical but shifted and scaled, shown in Figure 3. Transforming these densities through their conditional means, $\mu_{2}=\mathbb{E}\left[x_{2}|x_{1}\right]$, and standard deviations, $\sigma_{2}=\mathbb{E}\left[(x_{2}-\mu_{2})^{2}|x_{1}\right]^{1/2}$, creates a normalized space, $y_{2}=(x_{2}-\mu_{2})/\sigma_{2}$, where the conditional densities are identical. In this space, the multi-information is

$$\mathcal{I}(y_{1};y_{2})=\mathbb{E}_{p(y_{1},y_{2})}\left[\log p(y_{2}|y_{1})-\log p(y_{2})\right]=0,$$

whereas $\mathcal{I}(x_{1};x_{2})>0$. Indeed, if $p(x_{t}|x_{<t})$ is linear-Gaussian, inverting an affine autoregressive flow exactly corresponds to Cholesky whitening [56, 44], removing all linear dependencies.

In the example above, $\mu_{2}$ and $\sigma_{2}$ act as a frame of reference for estimating $x_{2}$. More generally, in the special case where $\bm{\mu}_{\theta}(\mathbf{x}_{<t})=\mathbf{x}_{t-1}$ and $\bm{\sigma}_{\theta}(\mathbf{x}_{<t})=\mathbf{1}$, we recover $\mathbf{y}_{t}=\mathbf{x}_{t}-\mathbf{x}_{t-1}=\Delta\mathbf{x}_{t}$. Modeling finite differences (or generalized coordinates [23]) is a well-established technique (see, e.g., [10, 45]), which is generalized by affine autoregressive flows.
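The following NumPy sketch illustrates the two-step example numerically; the conditional mean $0.9\,x_{1}$ and standard deviation $0.5$ are assumed values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100_000)

# Assumed linear-Gaussian conditional: x2 | x1 ~ N(0.9 * x1, 0.5^2).
mu2, sigma2 = 0.9 * x1, 0.5
x2 = mu2 + sigma2 * rng.normal(size=x1.shape)

# Inverse affine autoregressive transform (Eq. 5) using the conditional moments.
y2 = (x2 - mu2) / sigma2

print("corr(x1, x2):", np.corrcoef(x1, x2)[0, 1])  # ~0.87: strong linear dependence
print("corr(x1, y2):", np.corrcoef(x1, y2)[0, 1])  # ~0.00: dependence removed
```

Setting the shift to $x_{1}$ and the scale to $1$ instead recovers the finite-difference baseline, $\Delta x$, discussed above.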

Figure 4: Model Diagrams. (a) An autoregressive flow pre-processes a data sequence, $\mathbf{x}_{1:T}$, to produce a new sequence, $\mathbf{y}_{1:T}$, with reduced temporal dependencies. This simplifies dynamics modeling for a higher-level sequential latent variable model, $p_{\theta}(\mathbf{y}_{1:T},\mathbf{z}_{1:T})$. Empty diamond nodes represent deterministic dependencies, not recurrent states. (b) Diagram of the autoregressive flow architecture. Blank white rectangles represent convolutional layers (see Appendix). The three stacks of convolutional layers within the blue region are shared. cat denotes channel-wise concatenation.

3.2 Modeling Dynamics with Autoregressive Flows

We now discuss utilizing autoregressive flows to improve sequence modeling, highlighting use cases for modeling dynamics in the data and latent spaces.

3.2.1 Data Dynamics

The form of an affine autoregressive flow across sequences is given in Eqs. 4 and 5, again, equivalent to a Gaussian autoregressive model. We can stack hierarchical chains of flows to improve the model capacity. Denoting the shift and scale functions at the $m^{\textrm{th}}$ transform as $\bm{\mu}_{\theta}^{m}(\cdot)$ and $\bm{\sigma}_{\theta}^{m}(\cdot)$ respectively, we then calculate $\mathbf{y}^{m}$ using the inverse transform:

$$\mathbf{y}^{m}_{t}=\frac{\mathbf{y}^{m-1}_{t}-\bm{\mu}^{m}_{\theta}(\mathbf{y}^{m-1}_{<t})}{\bm{\sigma}^{m}_{\theta}(\mathbf{y}^{m-1}_{<t})}. \tag{9}$$

After the final ($M^{\textrm{th}}$) transform, we can choose the form of the base distribution, $p_{\theta}(\mathbf{y}^{M}_{1:T})$, e.g. Gaussian. While we could attempt to model $\mathbf{x}_{1:T}$ completely using stacked autoregressive flows, these models are limited to affine element-wise transforms that maintain the data dimensionality. Due to this limited capacity, purely flow-based models often require many transforms to be effective [43].
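A sketch of the stacked inverse transform in Eq. 9 is shown below; each conditioner is an assumed callable with the same interface as in the earlier single-flow sketch, and the accumulated log-scales give the total log-determinant.

```python
import torch

def stacked_inverse(x, conditioners, context=3):
    """Apply M inverse affine transforms (Eq. 9); return y^M and log|det dx/dy^M|.

    conditioners : list of assumed callables; each maps a (context, D) window of
                   the previous level's outputs to (mu, log_sigma) of shape (D,).
    """
    y, log_det = x, x.new_zeros(())
    for cond in conditioners:                       # m = 1, ..., M
        out = torch.zeros_like(y)
        for t in range(y.shape[0]):
            past = y[max(0, t - context):t]
            if past.shape[0] < context:
                past = torch.cat([y.new_zeros(context - past.shape[0], y.shape[1]), past], dim=0)
            mu, log_sigma = cond(past)
            out[t] = (y[t] - mu) / log_sigma.exp()  # Eq. 9 at level m
            log_det = log_det + log_sigma.sum()
        y = out
    return y, log_det
```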

Instead, we can model the base distribution using an expressive sequential latent variable model (SLVM), or, equivalently, we can augment the conditional likelihood of a SLVM using autoregressive flows (Fig. 4(a)). Following the motivation from Section 3.1, the flow can remove temporal dependencies, simplifying the modeling task for the SLVM. With a single flow, the joint probability is

$$p_{\theta}(\mathbf{x}_{1:T},\mathbf{z}_{1:T})=p_{\theta}(\mathbf{y}_{1:T},\mathbf{z}_{1:T})\left|\det\left(\frac{\partial\mathbf{x}_{1:T}}{\partial\mathbf{y}_{1:T}}\right)\right|^{-1}, \tag{10}$$

where the SLVM distribution is given by

$$p_{\theta}(\mathbf{y}_{1:T},\mathbf{z}_{1:T})=\prod_{t=1}^{T}p_{\theta}(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{z}_{\leq t})\,p_{\theta}(\mathbf{z}_{t}|\mathbf{y}_{<t},\mathbf{z}_{<t}). \tag{11}$$

If the SLVM is itself a flow-based model, we can use maximum log-likelihood training. If not, we can resort to variational inference [11, 20, 49]. We derive and discuss this procedure in the Appendix.

3.2.2 Latent Dynamics

We can also consider simplifying latent dynamics modeling using autoregressive flows. This is relevant in hierarchical SLVMs, such as VideoFlow [45], where each latent variable is modeled as a function of past and higher-level latent variables. Using $\mathbf{z}_{t}^{(\ell)}$ to denote the latent variable at the $\ell^{\textrm{th}}$ level at time $t$, we can parameterize the prior as

$$p_{\theta}(\mathbf{z}^{(\ell)}_{t}|\mathbf{z}^{(\ell)}_{<t},\mathbf{z}^{(>\ell)}_{t})=p_{\theta}(\mathbf{u}^{(\ell)}_{t}|\mathbf{u}^{(\ell)}_{<t},\mathbf{z}^{(>\ell)}_{t})\left|\det\left(\frac{\partial\mathbf{z}^{(\ell)}_{t}}{\partial\mathbf{u}^{(\ell)}_{t}}\right)\right|^{-1}, \tag{12}$$

converting $\mathbf{z}^{(\ell)}_{t}$ into $\mathbf{u}^{(\ell)}_{t}$ using the inverse transform $\mathbf{u}^{(\ell)}_{t}=(\mathbf{z}^{(\ell)}_{t}-\bm{\alpha}_{\theta}(\mathbf{z}^{(\ell)}_{<t}))/\bm{\beta}_{\theta}(\mathbf{z}^{(\ell)}_{<t})$. As noted previously, VideoFlow uses a special case of this procedure, setting $\bm{\alpha}_{\theta}(\mathbf{z}^{(\ell)}_{<t})=\mathbf{z}^{(\ell)}_{t-1}$ and $\bm{\beta}_{\theta}(\mathbf{z}^{(\ell)}_{<t})=\mathbf{1}$. Generalizing this procedure further simplifies dynamics throughout the model.
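A per-time-step sketch of Eq. 12 is given below, under assumed interfaces: `base_prior` returns a distribution over $\mathbf{u}_{t}$, and `alpha_beta_net` returns the shift and log-scale from past latents at the same level.

```python
import torch

def latent_flow_prior_log_prob(z_t, z_past, base_prior, alpha_beta_net):
    """Evaluate the latent prior via an affine autoregressive flow (Eq. 12).

    z_t            : (D,) latent at time t for one level of the hierarchy.
    z_past         : (context, D) previous latents at the same level.
    base_prior     : assumed callable returning a torch Distribution over u_t
                     (conditioned on z_past and, in a hierarchical model, higher levels).
    alpha_beta_net : assumed callable mapping z_past to (alpha, log_beta), each (D,).
    """
    alpha, log_beta = alpha_beta_net(z_past)
    u_t = (z_t - alpha) / log_beta.exp()                 # inverse transform
    # Change of variables: log p(z_t | .) = log p(u_t | .) - log|det dz_t / du_t|
    return base_prior(z_past).log_prob(u_t).sum() - log_beta.sum()
```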

4 Evaluation

We demonstrate and evaluate the proposed technique on three benchmark video datasets: Moving MNIST [67], KTH Actions [65], and BAIR Robot Pushing [19]. We also perform experiments on several non-video sequence datasets from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). Specifically, we look at an activity recognition dataset (activity_rec) [53], an indoor localization dataset (smartphone_sensor) [5], and a facial expression recognition dataset (facial_exp) [13]. Experimental setups are described in Section 4.1, followed by a set of analyses in Section 4.2. Further details and results can be found in the Appendix.

Figure 5: Decreased Temporal Correlation. (a) Affine autoregressive flows result in sequences, $\mathbf{y}_{1:T}$, with decreased temporal correlation, $\textrm{corr}_{\mathbf{y}}$, as compared with that of the original data, $\textrm{corr}_{\mathbf{x}}$. The presence of a more powerful base distribution (SLVM) reduces the need for decorrelation. Additional flow transforms further decrease correlation (note: $|\textrm{corr}_{\mathbf{y}}|<0.01$ for 2-AF). (b) For SLVM + 1-AF, $\textrm{corr}_{\mathbf{y}}$ decreases during training on KTH Actions.

4.1 Experimental Setup

We empirically evaluate the improvements to downstream dynamics modeling from temporal pre-processing via autoregressive flows. For data space modeling, we compare four model classes: 1) standalone affine autoregressive flows with one transform (1-AF), 2) standalone flows with two transforms (2-AF), 3) a sequential latent variable model (SLVM), and 4) SLVM with flow-based pre-processing (SLVM + 1-AF). As we are not proposing a specific architecture, but rather a general modeling technique, the SLVM architecture is representative of recurrent convolutional video models with a single latent level [15, 27, 28]. Flows are implemented with convolutional networks, taking in a fixed window of previous frames (Fig. 4(b)). These models allow us to evaluate the benefits of temporal pre-processing (SLVM vs. SLVM + 1-AF) and the benefits of more expressive higher-level dynamics models (2-AF vs. SLVM + 1-AF).

To evaluate latent dynamics modeling with flows, we use the tensor2tensor library [72] to compare 1) VideoFlow (we use a smaller version of the original model architecture, with half of the flow depth, due to GPU memory constraints) and 2) the same model with affine autoregressive flow latent dynamics (VideoFlow + AF). VideoFlow is significantly larger ($3\times$ more parameters) than the one-level SLVM, allowing us to evaluate whether autoregressive flows are beneficial in this high-capacity regime.

To enable a fairer comparison in our experiments, models with autoregressive flow dynamics have comparable or fewer parameters than baseline counterparts. We note that autoregressive dynamics adds only a constant computational cost per time-step, and this computation can be parallelized for training and evaluation. Full architecture, training, and analysis details can be found in the Appendix. Finally, as noted by [45], many previous works do not train SLVMs with proper log-likelihood objectives. Our SLVM results are consistent with previously reported log-likelihood values [49] for the Stochastic Video Generation model [15] trained with a log-likelihood bound objective.

Figure 6: Flow Visualization for SLVM + 1-AF on Moving MNIST (left) and KTH Actions (right).
Figure 7: Improved Generated Samples. Random samples generated from (a) VideoFlow and (b) VideoFlow + AF, each conditioned on the first 3 frames. Using AF produces more coherent samples. The robot arm blurs for VideoFlow in samples 1 and 4 (red boxes), but does not blur for VideoFlow + AF.

4.2 Analyses

Visualization

In Figure 1, we visualize the pre-processing procedure for SLVM + 1-AF on BAIR Robot Pushing. The plots show the RGB values for a pixel before (top) and after (bottom) the transform. The noise sequence is nearly zero throughout, despite large changes in the pixel value. We also see that the noise sequence (center, lower) is invariant to the static background, capturing the moving robotic arm. At some time steps (e.g. the fourth frame), the autoregressive flow incorrectly predicts the next frame; however, the higher-level SLVM compensates for this prediction error.

We also visualize each component of the flow. Figure 4(b) illustrates this for SLVM + 1-AF on an input from BAIR Robot Pushing. We see that $\bm{\mu}_{\theta}$ captures the static background, while $\bm{\sigma}_{\theta}$ highlights regions of uncertainty. In Figure 6 and the Appendix, we present visualizations on full sequences, where we see that different models remove varying degrees of temporal structure.

Temporal Redundancy Reduction

To quantify temporal redundancy reduction, we evaluate the empirical correlation (linear dependence) between frames, denoted as corr, for the data and noise variables. We evaluate $\textrm{corr}_{\mathbf{x}}$ and $\textrm{corr}_{\mathbf{y}}$ for 1-AF, 2-AF, and SLVM + 1-AF. The results are shown in Figure 5(a). In Figure 5(b), we plot $\textrm{corr}_{\mathbf{y}}$ for SLVM + 1-AF during training on KTH Actions. Flows decrease temporal correlation, with additional transforms yielding further decorrelation. Base distributions without temporal structure (1-AF) yield comparatively more decorrelation. Temporal redundancy is progressively removed throughout training. Note that 2-AF almost completely removes temporal correlations ($|\textrm{corr}_{\mathbf{y}}|<0.01$). However, this metric only quantifies linear dependencies; more complex non-linear dependencies may require the use of higher-level dynamics models, as shown through quantitative comparisons.
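As a sketch of this kind of linear-dependence metric (not the exact evaluation code), the function below computes the average per-dimension Pearson correlation between consecutive frames; applying it to $\mathbf{x}_{1:T}$ and $\mathbf{y}_{1:T}$ yields $\textrm{corr}_{\mathbf{x}}$ and $\textrm{corr}_{\mathbf{y}}$, respectively.

```python
import numpy as np

def temporal_corr(seq):
    """Average per-dimension Pearson correlation between consecutive frames.

    seq : (T, D) array, e.g. a video with frames flattened to D pixels.
    """
    a, b = seq[:-1], seq[1:]                       # pairs (x_t, x_{t+1})
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0)) + 1e-8
    return ((a * b).sum(axis=0) / denom).mean()    # mean correlation over dimensions
```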

Performance Comparison

Table 1 reports average test negative log-likelihood results on video datasets. Standalone flow-based models perform surprisingly well. Increasing flow depth from 1-AF to 2-AF generally results in improvement. SLVM + 1-AF outperforms the baseline SLVM despite having fewer parameters. As another baseline, we also consider modeling frame differences, $\Delta\mathbf{x}\equiv\mathbf{x}_{t}-\mathbf{x}_{t-1}$, with SLVM, which can be seen as a special case of 1-AF with $\bm{\mu}_{\theta}=\mathbf{x}_{t-1}$ and $\bm{\sigma}_{\theta}=\mathbf{1}$. On BAIR and KTH Actions, datasets with significant temporal redundancy (Fig. 5(a)), this technique improves performance over SLVM. However, on Moving MNIST, modeling $\Delta\mathbf{x}$ actually decreases performance, presumably by creating more complex spatial patterns. In all cases, the learned temporal transform, SLVM + 1-AF, outperforms this hard-coded transform, SLVM + $\Delta\mathbf{x}$. Finally, incorporating autoregressive flows into VideoFlow results in a modest but noticeable improvement, demonstrating that removing spatial dependencies, through VideoFlow, and temporal dependencies, through autoregressive flows, are complementary techniques.

Table 1: Quantitative Comparison. Average test negative log-likelihood (lower is better) in nats per dimension for Moving MNIST, BAIR Robot Pushing, and KTH Actions.
                   M-MNIST    BAIR     KTH
1-AF                 2.15     3.05     3.34
2-AF                 2.13     2.90     3.35
SLVM                ≤1.92    ≤3.57    ≤4.63
SLVM + Δx           ≤2.45    ≤3.07    ≤2.49
SLVM + 1-AF         ≤1.86    ≤2.35    ≤2.39
VideoFlow              —      1.53       —
VideoFlow + AF         —      1.50       —
Figure 8: Improved Generalization. The low-level reference frame improves generalization to unseen sequences. Train and test negative log-likelihood bound histograms for (a) SLVM and (b) SLVM + 1-AF on KTH Actions. (c) The generalization gap for SLVM + 1-AF remains small for varying amounts of KTH training data, while it becomes worse in the low-data regime for SLVM.
Results on Non-Video Sequence Datasets

In Table 2, we report negative log-density results on non-video data in nats per time step. Note that log-densities can be positive or negative. Again, we see that 2-AF consistently outperforms 1-AF, and both are typically on par with or better than SLVM. However, SLVM + 1-AF outperforms all other model classes, achieving the lowest (best) values across all datasets. With non-video data, the special case of modeling temporal differences (SLVM + $\Delta\mathbf{x}$) actually performs slightly worse than SLVM on all datasets. This, again, highlights the importance of using a learned pre-processing transform in comparison with hard-coded temporal differences.

Table 2: Non-Video Quantitative Comparison. Average test negative log-likelihood (lower is better) in nats per time step on various non-video datasets.
                 activity_rec   smartphone_sensor   facial_exp
1-AF                  2.71            -7.46            -241
2-AF                  2.06            -8.53            -259
SLVM                ≤2.77           ≤-5.21           ≤-164
SLVM + Δx           ≤5.61           ≤-4.02           ≤-154
SLVM + 1-AF         ≤1.46           ≤-9.82           ≤-306
Improved Samples

The quantitative improvement over VideoFlow is less dramatic, as this is already a high-capacity model. However, qualitatively, we observe that incorporating autoregressive flow dynamics improves sample quality (Figure 7). In these randomly selected samples, the robot arm occasionally becomes blurry for VideoFlow (red boxes) but remains clear for VideoFlow + AF.

Improved Generalization

Our temporal normalization technique also improves generalization to unseen examples, a key benefit of normalization schemes, e.g., batch norm [36]. Intuitively, higher-level dynamics are often preserved, whereas lower-level appearance is not. This is apparent on KTH Actions, which contains a substantial degree of train-test mismatch, due to different identities and activities. NLL histograms on KTH are shown in Figure 8, with greater overlap for SLVM + 1-AF. We also train SLVM and SLVM + 1-AF on subsets of KTH Actions. In Figure 8(c), we see that autoregressive flows enable generalization in the low-data regime, whereas SLVM becomes worse.

5 Conclusion

We have presented a technique for improving sequence modeling using autoregressive flows. Learning a frame of reference, parameterized by autoregressive transforms, reduces temporal redundancy in input sequences, simplifying dynamics. Thus, rather than expanding the model, we can simplify the input to meet the capacity of the model. This approach is distinct from previous works with normalizing flows on sequences, yet contains connections to classical modeling and compression. We hope these connections lead to further insights and applications. Finally, we have analyzed and empirically shown how autoregressive pre-processing in both the data and latent spaces can improve sequence modeling and lead to improved sample quality and generalization.

The underlying assumption behind using autoregressive flows for sequence modeling is that sequences contain smooth or predictable temporal dependencies, with more complex, higher-level dependencies as well. In both video and non-video data, we have seen improvements from combining sequential latent variable models with autoregressive flows, suggesting that such assumptions are generally reasonable. Using affine autoregressive flows restricts our approach to sequences of continuous data, but future work could investigate discrete data, such as natural language. Likewise, we assume regularly sampled sequences (i.e., a constant frequency), however, future work could also investigate irregularly sampled event data.

References

  • Agrawal and Dukkipati [2016] Siddharth Agrawal and Ambedkar Dukkipati. Deep variational inference without pixel-wise reconstruction. arXiv preprint arXiv:1611.05209, 2016.
  • Agustsson et al. [2020] Eirikur Agustsson, David Minnen, Nick Johnston, Johannes Balle, Sung Jin Hwang, and George Toderici. Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8503–8512, 2020.
  • Atal and Schroeder [1979] B Atal and M Schroeder. Predictive coding of speech signals and subjective error criteria. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(3):247–254, 1979.
  • Barlow et al. [1961] Horace B Barlow et al. Possible principles underlying the transformation of sensory messages. Sensory communication, 1:217–234, 1961.
  • Barsocchi et al. [2016] Paolo Barsocchi, Antonino Crivello, Davide La Rosa, and Filippo Palumbo. A multisource and multivariate dataset for indoor localization methods based on wlan and geo-magnetic field fingerprinting. In 2016 International Conference on Indoor Positioning and Indoor Navigation (IPIN), pages 1–8. IEEE, 2016.
  • Bayer and Osendorfer [2014] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. In NeurIPS 2014 Workshop on Advances in Variational Inference, 2014.
  • Bengio and Bengio [2000] Yoshua Bengio and Samy Bengio. Modeling high-dimensional discrete data with multi-layer neural networks. In Advances in Neural Information Processing Systems, pages 400–406, 2000.
  • Box et al. [2015] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
  • Chen and Gopinath [2001] Scott Saobing Chen and Ramesh A Gopinath. Gaussianization. In Advances in Neural Information Processing Systems, pages 423–429, 2001.
  • Chua et al. [2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
  • Chung et al. [2015] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information processing Systems, pages 2980–2988, 2015.
  • Clevert et al. [2015] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
  • de Almeida Freitas et al. [2014] Fernando de Almeida Freitas, Sarajane Marques Peres, Clodoaldo Aparecido de Moraes Lima, and Felipe Venancio Barbosa. Grammatical facial expressions recognition with machine learning. In The Twenty-Seventh International Flairs Conference, 2014.
  • Deco and Brauer [1995] Gustavo Deco and Wilfried Brauer. Higher order statistical decorrelation without information loss. In Advances in Neural Information Processing Systems, pages 247–254, 1995.
  • Denton and Fergus [2018] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In International Conference on Machine Learning, pages 1182–1191, 2018.
  • Dinh et al. [2015] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. In International Conference on Learning Representations, 2015.
  • Dinh et al. [2017] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. In International Conference on Learning Representations, 2017.
  • Durkan et al. [2019] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems, 2019.
  • Ebert et al. [2017] Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. In Conference on Robot Learning, 2017.
  • Fraccaro et al. [2016] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pages 2199–2207, 2016.
  • Frey et al. [1996] Brendan J Frey, Geoffrey E Hinton, and Peter Dayan. Does the wake-sleep algorithm produce good density estimators? In Advances in Neural Information Processing Systems, pages 661–667, 1996.
  • Friedman [1987] Jerome H Friedman. Exploratory projection pursuit. Journal of the American statistical association, 82(397):249–266, 1987.
  • Friston [2008] Karl Friston. Hierarchical models in the brain. PLoS computational biology, 4(11):e1000211, 2008.
  • Gan et al. [2015] Zhe Gan, Chunyuan Li, Ricardo Henao, David E Carlson, and Lawrence Carin. Deep temporal sigmoid belief networks for sequence modeling. In Advances in Neural Information Processing Systems, 2015.
  • Gemici et al. [2017] Mevlana Gemici, Chia-Chun Hung, Adam Santoro, Greg Wayne, Shakir Mohamed, Danilo J Rezende, David Amos, and Timothy Lillicrap. Generative temporal models with memory. arXiv preprint arXiv:1702.04649, 2017.
  • Graves [2013] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
  • Ha and Schmidhuber [2018] David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pages 2450–2462, 2018.
  • Hafner et al. [2019] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565, 2019.
  • Hamilton [2020] James Douglas Hamilton. Time series analysis. Princeton university press, 2020.
  • He et al. [2018] Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal. Probabilistic video generation using holistic attribute control. In Proceedings of the European Conference on Computer Vision (ECCV), pages 452–467, 2018.
  • Henter et al. [2019] Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow. Moglow: Probabilistic and controllable motion synthesis using normalising flows. arXiv preprint arXiv:1905.06598, 2019.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Huang et al. [2017] Chin-Wei Huang, Ahmed Touati, Laurent Dinh, Michal Drozdzal, Mohammad Havaei, Laurent Charlin, and Aaron Courville. Learnable explicit density for continuous latent space and variational inference. arXiv preprint arXiv:1710.02248, 2017.
  • Huang et al. [2018] Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In International Conference on Machine Learning, pages 2083–2092, 2018.
  • Hyvärinen and Oja [2000] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411–430, 2000.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
  • Jaini et al. [2019] Priyank Jaini, Kira A Selby, and Yaoliang Yu. Sum-of-squares polynomial flow. In International Conference on Machine Learning, pages 3009–3018, 2019.
  • Jordan et al. [1998] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. NATO ASI SERIES D BEHAVIOURAL AND SOCIAL SCIENCES, 89:105–162, 1998.
  • Karl et al. [2017] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep variational bayes filters: Unsupervised learning of state space models from raw data. In International Conference on Learning Representations, 2017.
  • Kim et al. [2019] Sungwon Kim, Sang-Gil Lee, Jongyoon Song, Jaehyeon Kim, and Sungroh Yoon. Flowavenet: A generative flow for raw audio. In International Conference on Machine Learning, pages 3370–3378, 2019.
  • Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
  • Kingma and Welling [2014] Diederik P Kingma and Max Welling. Stochastic gradient vb and the variational auto-encoder. In Proceedings of the International Conference on Learning Representations, 2014.
  • Kingma and Dhariwal [2018] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
  • Kingma et al. [2016] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
  • Kumar et al. [2020] Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. Videoflow: A flow-based generative model for video. International Conference on Learning Representations, 2020.
  • Laparra et al. [2011] Valero Laparra, Gustavo Camps-Valls, and Jesús Malo. Iterative gaussianization: from ica to random rotations. IEEE transactions on neural networks, 22(4):537–549, 2011.
  • Li and Mandt [2018] Yingzhen Li and Stephan Mandt. A deep generative model for disentangled representations of sequential data. In International Conference on Machine Learning, 2018.
  • Lombardo et al. [2019] Salvator Lombardo, Jun Han, Christopher Schroers, and Stephan Mandt. Deep generative video compression. In Advances in Neural Information Processing Systems, pages 9283–9294, 2019.
  • Marino et al. [2018] Joseph Marino, Milan Cvitkovic, and Yisong Yue. A general method for amortizing variational filtering. In Advances in Neural Information Processing Systems, pages 7857–7868, 2018.
  • Murphy [2012] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
  • Oliva et al. [2018] Junier Oliva, Avinava Dubey, Manzil Zaheer, Barnabas Poczos, Ruslan Salakhutdinov, Eric Xing, and Jeff Schneider. Transformation autoregressive networks. In International Conference on Machine Learning, pages 3895–3904, 2018.
  • Oliver [1952] BM Oliver. Efficient coding. The Bell System Technical Journal, 31(4):724–750, 1952.
  • Palumbo et al. [2016] Filippo Palumbo, Claudio Gallicchio, Rita Pucci, and Alessio Micheli. Human activity recognition using multisensor data fusion based on reservoir computing. Journal of Ambient Intelligence and Smart Environments, 8(2):87–107, 2016.
  • Papamakarios et al. [2017] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
  • Ping et al. [2019] Wei Ping, Kainan Peng, and Jitong Chen. Clarinet: Parallel wave generation in end-to-end text-to-speech. In International Conference on Learning Representations, 2019.
  • Pourahmadi [2011] Mohsen Pourahmadi. Covariance estimation: The glm and regularization perspectives. Statistical Science, pages 369–387, 2011.
  • Prenger et al. [2019] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. Waveglow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621. IEEE, 2019.
  • Radford et al. [2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • Rezende and Mohamed [2015] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
  • Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning, pages 1278–1286, 2014.
  • Rhinehart et al. [2018] Nicholas Rhinehart, Kris M Kitani, and Paul Vernaza. R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 772–788, 2018.
  • Rhinehart et al. [2019] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. Precog: Prediction conditioned on goals in visual multi-agent settings. In Proceedings of the International Conference on Computer Vision (ICCV), 2019.
  • Rippel and Adams [2013] Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125, 2013.
  • Schmidt et al. [2019] Florian Schmidt, Stephan Mandt, and Thomas Hofmann. Autoregressive text generation beyond feedback loops. In Empirical Methods in Natural Language Processing, pages 3391–3397, 2019.
  • Schuldt et al. [2004] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local svm approach. In International Conference on Pattern Recognition, 2004.
  • Srinivasan et al. [1982] Mandyam Veerambudi Srinivasan, Simon Barry Laughlin, and Andreas Dubs. Predictive coding: a fresh view of inhibition in the retina. Proceedings of the Royal Society of London. Series B. Biological Sciences, 216(1205):427–459, 1982.
  • Srivastava et al. [2015a] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In International conference on Machine Learning, pages 843–852, 2015a.
  • Srivastava et al. [2015b] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Advances in neural information processing systems (NIPS), pages 2377–2385, 2015b.
  • van den Oord et al. [2016a] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.
  • van den Oord et al. [2016b] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756, 2016b.
  • van den Oord et al. [2018] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, et al. Parallel wavenet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, pages 3915–3923, 2018.
  • Vaswani et al. [2018] Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416, 2018. URL http://arxiv.org/abs/1803.07416.
  • Wiegand et al. [2003] Thomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.
  • Winkler et al. [2019] Christina Winkler, Daniel Worrall, Emiel Hoogeboom, and Max Welling. Learning likelihoods with conditional normalizing flows. arXiv preprint arXiv:1912.00042, 2019.
  • Xue et al. [2016] Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, 2016.
  • Yang et al. [2021] Ruihan Yang, Yibo Yang, Joseph Marino, and Stephan Mandt. Hierarchical autoregressive modeling for neural video compression. In International Conference on Learning Representations, 2021.
  • Ziegler and Rush [2019] Zachary Ziegler and Alexander Rush. Latent normalizing flows for discrete sequences. In International Conference on Machine Learning, pages 7673–7682, 2019.

Appendix A Lower Bound Derivation

Consider the model defined in Section 3.2.1, with the conditional likelihood parameterized with autoregressive flows. That is, we parameterize

$$\mathbf{x}_{t}=\bm{\mu}_{\theta}(\mathbf{x}_{<t})+\bm{\sigma}_{\theta}(\mathbf{x}_{<t})\odot\mathbf{y}_{t}, \tag{13}$$

yielding

$$p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t},\mathbf{z}_{\leq t})=p_{\theta}(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{z}_{\leq t})\left|\det\left(\frac{\partial\mathbf{x}_{t}}{\partial\mathbf{y}_{t}}\right)\right|^{-1}. \tag{14}$$

The joint distribution over all time steps is then given as

$$p_{\theta}(\mathbf{x}_{1:T},\mathbf{z}_{1:T})=\prod_{t=1}^{T}p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{<t},\mathbf{z}_{\leq t})\,p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t}) \tag{15}$$
$$=\prod_{t=1}^{T}p_{\theta}(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{z}_{\leq t})\left|\det\left(\frac{\partial\mathbf{x}_{t}}{\partial\mathbf{y}_{t}}\right)\right|^{-1}p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t}). \tag{16}$$

To perform variational inference, we consider a filtering approximate posterior of the form

$$q(\mathbf{z}_{1:T}|\mathbf{x}_{1:T})=\prod_{t=1}^{T}q(\mathbf{z}_{t}|\mathbf{x}_{\leq t},\mathbf{z}_{<t}). \tag{17}$$

We can then plug these expressions into the evidence lower bound:

$$\mathcal{L}\equiv\mathbb{E}_{q(\mathbf{z}_{1:T}|\mathbf{x}_{1:T})}\left[\log p_{\theta}(\mathbf{x}_{1:T},\mathbf{z}_{1:T})-\log q(\mathbf{z}_{1:T}|\mathbf{x}_{1:T})\right] \tag{18}$$
$$=\mathbb{E}_{q(\mathbf{z}_{1:T}|\mathbf{x}_{1:T})}\Bigg[\log\left(\prod_{t=1}^{T}p_{\theta}(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{z}_{\leq t})\left|\det\left(\frac{\partial\mathbf{x}_{t}}{\partial\mathbf{y}_{t}}\right)\right|^{-1}p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t})\right)-\log\left(\prod_{t=1}^{T}q(\mathbf{z}_{t}|\mathbf{x}_{\leq t},\mathbf{z}_{<t})\right)\Bigg] \tag{19}$$
$$=\mathbb{E}_{q(\mathbf{z}_{1:T}|\mathbf{x}_{1:T})}\Bigg[\sum_{t=1}^{T}\log p_{\theta}(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{z}_{\leq t})-\log\frac{q(\mathbf{z}_{t}|\mathbf{x}_{\leq t},\mathbf{z}_{<t})}{p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t})}-\log\left|\det\left(\frac{\partial\mathbf{x}_{t}}{\partial\mathbf{y}_{t}}\right)\right|\Bigg]. \tag{20}$$

Finally, in the filtering setting, we can rewrite the expectation, bringing it inside of the sum (see [25, 49]):

$$\mathcal{L}=\sum_{t=1}^{T}\mathbb{E}_{q(\mathbf{z}_{\leq t}|\mathbf{x}_{\leq t})}\Bigg[\log p_{\theta}(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{z}_{\leq t})-\log\frac{q(\mathbf{z}_{t}|\mathbf{x}_{\leq t},\mathbf{z}_{<t})}{p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t})}-\log\left|\det\left(\frac{\partial\mathbf{x}_{t}}{\partial\mathbf{y}_{t}}\right)\right|\Bigg]. \tag{21}$$

Because there exists a one-to-one mapping between $\mathbf{x}_{1:T}$ and $\mathbf{y}_{1:T}$, we can equivalently condition the approximate posterior and the prior on $\mathbf{y}$, i.e.

$$\mathcal{L}=\sum_{t=1}^{T}\mathbb{E}_{q(\mathbf{z}_{\leq t}|\mathbf{y}_{\leq t})}\Bigg[\log p_{\theta}(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{z}_{\leq t})-\log\frac{q(\mathbf{z}_{t}|\mathbf{y}_{\leq t},\mathbf{z}_{<t})}{p_{\theta}(\mathbf{z}_{t}|\mathbf{y}_{<t},\mathbf{z}_{<t})}-\log\left|\det\left(\frac{\partial\mathbf{x}_{t}}{\partial\mathbf{y}_{t}}\right)\right|\Bigg]. \tag{22}$$
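As a minimal sketch of a single time-step term of Eq. 22 (one reparameterized posterior sample, with the density-ratio term replaced by its expectation, the analytic KL divergence), the snippet below assumes the posterior, prior, and likelihood objects are torch Distributions / callables supplied by the surrounding model.

```python
import torch.distributions as D

def elbo_step(y_t, log_det_t, posterior, prior, likelihood):
    """One time-step term of Eq. 22 (single-sample estimate).

    y_t        : transformed observation at time t.
    log_det_t  : scalar log|det dx_t/dy_t| from the flow (sum of log-scales).
    posterior  : q(z_t | y_<=t, z_<t), a torch Distribution (assumed given).
    prior      : p(z_t | y_<t, z_<t), a torch Distribution (assumed given).
    likelihood : assumed callable mapping a sampled z_t to p(y_t | y_<t, z_<=t).
    """
    z_t = posterior.rsample()                           # reparameterized sample
    recon = likelihood(z_t).log_prob(y_t).sum()         # log p(y_t | y_<t, z_<=t)
    kl = D.kl_divergence(posterior, prior).sum()        # E_q[log q - log p]
    return recon - kl - log_det_t, z_t
```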

Appendix B Experiment Details

B.1 Flow Architecture

The affine autoregressive flow architecture is shown in Figure 9. The shift and scale of the affine transform are conditioned on three previous inputs. For each flow, we first apply 4 convolutional layers with kernel size (3,3), stride 1, and padding 1 on each conditioned observation, preserving the input shape. The outputs are concatenated along the channel dimension and go through another 4 convolutional layers with kernel size (3,3), stride 1, and padding 1. Finally, separate convolutional layers with the same kernel size, stride, and padding are used to output the shift and log-scale. We use ReLU non-linearities for all convolutional layers.
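A PyTorch sketch of this conditioner, following the layer sizes above, is shown below; the hidden width (64), the batch handling, and the absence of a non-linearity on the two output layers are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class AffineFlowConditioner(nn.Module):
    """Shift / log-scale network conditioned on three previous frames (sketch)."""

    def __init__(self, channels=3, hidden=64, context=3):
        super().__init__()

        def conv_stack(in_ch, out_ch, n_layers=4):
            layers = []
            for i in range(n_layers):
                layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                     kernel_size=3, stride=1, padding=1),
                           nn.ReLU()]
            return nn.Sequential(*layers)

        self.per_frame = conv_stack(channels, hidden)      # shared across the three frames
        self.merge = conv_stack(context * hidden, hidden)  # after channel-wise concatenation
        self.to_shift = nn.Conv2d(hidden, channels, kernel_size=3, stride=1, padding=1)
        self.to_log_scale = nn.Conv2d(hidden, channels, kernel_size=3, stride=1, padding=1)

    def forward(self, past_frames):
        # past_frames: (B, context, C, H, W)
        feats = [self.per_frame(f) for f in past_frames.unbind(dim=1)]
        h = self.merge(torch.cat(feats, dim=1))            # concatenate along channels
        return self.to_shift(h), self.to_log_scale(h)
```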

B.2 Sequential Latent Variable Model Architecture

For sequential latent variable models, we use a DC-GAN [58] encoder architecture (Figure 9(d)), with 4 convolutional layers of kernel size (4,4), stride 2, and padding 1, followed by another convolutional layer of kernel size (4,4), stride 1, and no padding. The encoding is sent to one or two LSTM layers [32] followed by separate linear layers to output the mean and log-variance for $q_{\phi}(\mathbf{z}_{t}|\mathbf{x}_{\leq t},\mathbf{z}_{<t})$. We note that for SLVM, we input $\mathbf{x}_{t}$ into the encoder, whereas for SLVM + AF, we input $\mathbf{y}_{t}$. The architecture for the conditional prior, $p_{\theta}(\mathbf{z}_{t}|\mathbf{x}_{<t},\mathbf{z}_{<t})$, shown in Figure 9(e), contains two fully-connected layers, which take the previous latent variable as input, followed by one or two LSTM layers, and separate linear layers to output the mean and log-variance. The decoder architecture, shown in Figure 9(c), mirrors the encoder architecture, using transposed convolutions. In SLVM, we use two LSTM layers for modeling the conditional prior and approximate posterior distributions, while in SLVM + 1-AF, we use a single LSTM layer for each. We use leaky ReLU non-linearities for the encoder and decoder architectures and ReLU non-linearities in the conditional prior architecture.

B.3 VideoFlow Architecture

For VideoFlow experiments, we use the official code provided by [45] in the tensor2tensor repository [72]. Due to memory and computational constraints, we use a smaller version of the model architecture used by [45] for the BAIR Robot Pushing dataset. We change depth from 24 to 12 and latent_encoder_width from 256 to 128. This reduces the number of parameters from roughly 67 million to roughly 32 million. VideoFlow contains a hierarchy of latent variables, with the latent variable at level $l$ at time $t$ denoted as $\mathbf{z}_{t}^{(l)}$. The prior on this latent variable is $p_{\theta}(\mathbf{z}_{t}^{(l)}|\mathbf{z}_{<t}^{(l)},\mathbf{z}_{t}^{(>l)})=\mathcal{N}(\mathbf{z}_{t}^{(l)};\bm{\mu}_{t}^{(l)},\textrm{diag}((\bm{\sigma}_{t}^{(l)})^{2}))$, where $\bm{\mu}_{t}^{(l)}$ and $\bm{\sigma}_{t}^{(l)}$ are functions of $\mathbf{z}_{<t}^{(l)}$ and $\mathbf{z}_{t}^{(>l)}$. We note that [45] parameterize $\bm{\mu}_{t}^{(l)}$ as $\bm{\mu}_{t}^{(l)}=\mathbf{z}_{t-1}^{(l)}+\widetilde{\bm{\mu}}_{t}^{(l)}$, where $\widetilde{\bm{\mu}}_{t}^{(l)}$ is the function output. [45] refer to this as latent_skip. This is already a special case of an affine autoregressive flow, with a hard-coded shift of $\mathbf{z}_{t-1}^{(l)}$ and a scale of $\mathbf{1}$. We parameterize an affine autoregressive flow at each latent level, with a shift, $\bm{\alpha}_{t}^{(l)}$, and scale, $\bm{\beta}_{t}^{(l)}$, which are functions of $\mathbf{z}_{<t}^{(l)}$, using the same 5-block ResNet architecture as [45]. In practice, these functions are conditioned on the variables at the past three time steps. The affine autoregressive flow produces a new variable:

\mathbf{u}_{t}^{(l)}=\frac{\mathbf{z}_{t}^{(l)}-\bm{\alpha}_{t}^{(l)}}{\bm{\beta}_{t}^{(l)}},

which we then model using the same prior distribution and architecture as [45]: $p_{\theta}(\mathbf{u}_{t}^{(l)}|\mathbf{z}_{<t}^{(l)},\mathbf{z}_{t}^{(>l)})=\mathcal{N}(\mathbf{u}_{t}^{(l)};\bm{\mu}_{t}^{(l)},\textrm{diag}((\bm{\sigma}_{t}^{(l)})^{2}))$, where $\bm{\mu}_{t}^{(l)}$ and $\bm{\sigma}_{t}^{(l)}$, again, are functions of $\mathbf{z}_{<t}^{(l)}$ (or, equivalently, $\mathbf{u}_{<t}^{(l)}$) and $\mathbf{z}_{t}^{(>l)}$.
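In code, the per-level transform and its contribution to the log-likelihood amount to the following sketch (tensor shapes and function names are ours, not VideoFlow's; the scale is assumed positive, e.g. produced by exponentiating a log-scale):

```python
import torch

def to_noise(z_t, alpha_t, beta_t):
    """Per-level affine autoregressive transform: u_t = (z_t - alpha_t) / beta_t.
    Also returns the log-det-Jacobian term, log|du/dz| = -sum log beta, which is
    added to log p(u_t) under the change of variables."""
    u_t = (z_t - alpha_t) / beta_t
    log_det = -torch.log(beta_t).flatten(start_dim=1).sum(dim=1)
    return u_t, log_det

def from_noise(u_t, alpha_t, beta_t):
    """Inverse transform, used when sampling: z_t = alpha_t + beta_t * u_t."""
    return alpha_t + beta_t * u_t
```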

B.4 Non-Video Sequence Modeling Architecture

We again compare the various model classes in terms of log-likelihood estimation. We use fully-connected networks to parameterize all functions within the prior, approximate posterior, and conditional likelihood of each model. All networks consist of 2 layers of 256 units with highway connectivity [68]. For autoregressive flows, we use ELU non-linearities [12]. For stability, we found it necessary to use tanh non-linearities in the networks for SLVMs (prior, conditional likelihood, and approximate posterior). In SLVMs, the prior is conditioned on $\mathbf{z}_{t-1}$, the approximate posterior is conditioned on $\mathbf{z}_{t-1}$ and $\mathbf{y}_{t}$, and the conditional likelihood is conditioned on $\mathbf{z}_{t}$. We use a latent space dimensionality of 16 for all SLVMs.
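A minimal sketch of one such fully-connected highway network, assuming the standard gating formulation of [68] (the gate parameterization and the placement of the non-linearity are our choices):

```python
import torch
import torch.nn as nn

class HighwayMLP(nn.Module):
    """Two fully-connected highway layers of 256 units, as used for the
    non-video experiments (gating details are an assumption)."""
    def __init__(self, in_dim, hidden=256, n_layers=2, act=nn.ELU()):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)
        self.transforms = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(n_layers)])
        self.gates = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(n_layers)])
        self.act = act

    def forward(self, x):
        h = self.proj(x)
        for transform, gate in zip(self.transforms, self.gates):
            t = torch.sigmoid(gate(h))                     # highway gate
            h = t * self.act(transform(h)) + (1 - t) * h   # gated residual update
        return h
```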

B.5 Training Set-Up

We use the Adam optimizer [41] with a learning rate of $1\times 10^{-4}$ to train all models. For Moving MNIST, we use a batch size of 16 and train for 200,000 iterations for SLVM and 100,000 iterations for 1-AF, 2-AF, and SLVM + 1-AF. For BAIR Robot Pushing, we use a batch size of 8 and train for 200,000 iterations for all models. For KTH Actions, we use a batch size of 8 and train for 90,000 iterations for all models. Batch norm [36] is applied to all convolutional layers that do not output distribution or affine transform parameters. We randomly crop sequences of length 13 from all sequences and evaluate on the last 10 frames. For 2-AF models, we crop sequences of length 16 in order to condition both flows on three previous inputs. For VideoFlow experiments, we use the same hyper-parameters as [45] (with the exception of the two architecture changes mentioned above) and train for 100,000 iterations.
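For concreteness, the cropping and evaluation protocol can be sketched as follows (the array layout and helper name are assumptions for illustration):

```python
import numpy as np

def random_crop(video, crop_len=13, eval_len=10, rng=np.random):
    """Sample a length-13 window from a sequence (16 for 2-AF models); the
    last 10 frames are used for log-likelihood evaluation."""
    T = video.shape[0]
    start = rng.randint(0, T - crop_len + 1)
    clip = video[start:start + crop_len]
    return clip[:-eval_len], clip[-eval_len:]  # (conditioning frames, evaluated frames)
```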

Figure 9: SLVM Architecture. Diagrams are shown for the (a) approximate posterior, (b) prior, and (c) conditional likelihood of the sequential latent variable model (SLVM). In (d) and (e), we show the approximate posterior and prior used with SLVM + AF, respectively. The conditional likelihood architecture is the same in both setups. Note: for SLVM + AF, we input $\mathbf{y}_{t}$ into the approximate posterior encoder, rather than $\mathbf{x}_{t}$. conv denotes a convolutional layer, LSTM denotes a long short-term memory layer, fc denotes a fully-connected layer, and t_conv denotes a transposed convolutional layer. For conv and t_conv layers, the numbers in parentheses respectively denote the number of filters, filter size, stride, and padding of the layer. For fc and LSTM layers, the number in parentheses denotes the number of units. SLVM contains one additional LSTM layer in both the approximate posterior and conditional prior.
Model | 1-AF | 2-AF | SLVM | SLVM + 1-AF
Moving MNIST | 343k | 686k | 11,302k | 10,592k
BAIR Robot Pushing | 363k | 726k | 11,325k | 10,643k
KTH Actions | 343k | 686k | 11,302k | 10,592k
Table 3: Number of parameters for each model on each dataset. Flow-based models contain relatively few parameters compared with the SLVM, as our flows consist primarily of $3\times 3$ convolutions with limited channels. In the SLVM, we use two LSTM layers for modeling the prior and approximate posterior distributions of the latent variable, while in SLVM + 1-AF, we use a single LSTM layer for each.

B.6 Quantifying Decorrelation

To quantify the temporal redundancy reduction resulting from affine autoregressive pre-processing, we evaluate the empirical correlation between successive frames for the data observations and noise variables, averaged over spatial locations and channels. This is an average normalized version of the auto-covariance of each signal at a time delay of 1 time step. Specifically, we estimate the temporal correlation as

\textrm{corr}_{\mathbf{x}}\equiv\frac{1}{HWC}\sum_{i,j,k}^{H,W,C}\mathbb{E}_{x^{(i,j,k)}_{t},\,x^{(i,j,k)}_{t+1}\sim\mathcal{D}}\left[\xi_{t,t+1}(i,j,k)\right], \qquad (23)

where the term inside the expectation is

\xi_{t,t+1}(i,j,k)\equiv\frac{(x^{(i,j,k)}_{t}-\mu^{(i,j,k)})(x^{(i,j,k)}_{t+1}-\mu^{(i,j,k)})}{\left(\sigma^{(i,j,k)}\right)^{2}}. \qquad (24)

Here, $x_{t}^{(i,j,k)}$ denotes the value of the image at spatial location $(i,j)$ and channel $k$ at time $t$, $\mu^{(i,j,k)}$ is the mean of this dimension, and $\sigma^{(i,j,k)}$ is its standard deviation. $H$, $W$, and $C$ respectively denote the height, width, and number of channels of the observations, and $\mathcal{D}$ denotes the dataset. We define an analogous expression for $\mathbf{y}$, denoted $\textrm{corr}_{\mathbf{y}}$.
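A NumPy sketch of this estimator, assuming the observations are stacked in an array of shape (N, T, H, W, C), is given below.

```python
import numpy as np

def temporal_correlation(frames, eps=1e-8):
    """Empirical lag-1 temporal correlation (Eqs. 23-24), averaged over the
    dataset, time steps, spatial locations, and channels.
    `frames` is assumed to have shape (N, T, H, W, C)."""
    mu = frames.mean(axis=(0, 1), keepdims=True)         # per-(i, j, k) mean
    var = frames.var(axis=(0, 1), keepdims=True) + eps   # per-(i, j, k) variance
    centered = frames - mu
    xi = centered[:, :-1] * centered[:, 1:] / var        # Eq. 24 for each (t, t+1) pair
    return xi.mean()
```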

Appendix C Illustrative Example

Figure 10: Motivating Example. Plots are shown for a sample of $\mathbf{x}_{1:T}$ (left), $\mathbf{u}_{1:T}$ (center), and $\mathbf{w}_{1:T}$ (right). Here, $\mathbf{w}_{1:T}\sim\mathcal{N}(\mathbf{w}_{1:T};\mathbf{0},\mathbf{I})$, and $\mathbf{u}$ and $\mathbf{x}$ are initialized at 0. Moving from $\mathbf{x}\rightarrow\mathbf{u}\rightarrow\mathbf{w}$ via affine transforms results in successively less temporal correlation and therefore simpler dynamics.

To build intuition for the benefits of temporal pre-processing (e.g., decorrelation) in downstream dynamics modeling, we present the following simple kinematic example. Consider the discrete dynamical system defined by the following set of equations:

\mathbf{x}_{t} = \mathbf{x}_{t-1}+\mathbf{u}_{t}, \qquad (25)
\mathbf{u}_{t} = \mathbf{u}_{t-1}+\mathbf{w}_{t}, \qquad (26)

where $\mathbf{w}_{t}\sim\mathcal{N}(\mathbf{w}_{t};\mathbf{0},\bm{\Sigma})$. We can express $\mathbf{x}_{t}$ and $\mathbf{u}_{t}$ in probabilistic terms as

\mathbf{x}_{t} \sim\mathcal{N}(\mathbf{x}_{t};\mathbf{x}_{t-1}+\mathbf{u}_{t-1},\bm{\Sigma}), \qquad (27)
\mathbf{u}_{t} \sim\mathcal{N}(\mathbf{u}_{t};\mathbf{u}_{t-1},\bm{\Sigma}). \qquad (28)

Physically, this describes the noisy dynamics of a particle with momentum and mass 1, subject to Gaussian noise. That is, $\mathbf{x}$ represents position, $\mathbf{u}$ represents velocity, and $\mathbf{w}$ represents stochastic forces. If we consider the dynamics at the level of $\mathbf{x}$, we can use the fact that $\mathbf{u}_{t-1}=\mathbf{x}_{t-1}-\mathbf{x}_{t-2}$ to write

p(\mathbf{x}_{t}|\mathbf{x}_{t-1},\mathbf{x}_{t-2})=\mathcal{N}(\mathbf{x}_{t};\mathbf{x}_{t-1}+\mathbf{x}_{t-1}-\mathbf{x}_{t-2},\bm{\Sigma}). \qquad (29)

Thus, we see that in the space of $\mathbf{x}$, the dynamics are second-order Markov, requiring knowledge of the past two time steps. However, at the level of $\mathbf{u}$ (Eq. 28), the dynamics are first-order Markov, requiring only the previous time step. Yet, note that $\mathbf{u}_{t}$ is, in fact, an affine autoregressive transform of $\mathbf{x}_{t}$, because $\mathbf{u}_{t}=\mathbf{x}_{t}-\mathbf{x}_{t-1}$ is a special case of the general form $\frac{\mathbf{x}_{t}-\bm{\mu}_{\theta}(\mathbf{x}_{<t})}{\bm{\sigma}_{\theta}(\mathbf{x}_{<t})}$. In Eq. 25, we see that the Jacobian of this transform is $\partial\mathbf{x}_{t}/\partial\mathbf{u}_{t}=\mathbf{I}$, so, from the change of variables formula, we have $p(\mathbf{x}_{t}|\mathbf{x}_{t-1},\mathbf{x}_{t-2})=p(\mathbf{u}_{t}|\mathbf{u}_{t-1})$. In other words, an affine autoregressive transform has allowed us to convert a second-order Markov system into a first-order Markov system, thereby simplifying the dynamics. Continuing this process to move to $\mathbf{w}_{t}=\mathbf{u}_{t}-\mathbf{u}_{t-1}$, we arrive at a representation that is entirely temporally decorrelated, i.e., no dynamics, because $p(\mathbf{w}_{t})=\mathcal{N}(\mathbf{w}_{t};\mathbf{0},\bm{\Sigma})$. A sample from this system is shown in Figure 10, illustrating this process of temporal decorrelation.
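This decorrelation is easy to reproduce numerically. The sketch below simulates a one-dimensional version of Eqs. 25-26 and recovers $\mathbf{u}$ and $\mathbf{w}$ from $\mathbf{x}$ by differencing (initialization at zero is a simplification):

```python
import numpy as np

def simulate(T=200, sigma=1.0, seed=0):
    """Simulate Eqs. 25-26 (x, u initialized at zero) and recover u and w
    from x via the differencing (affine autoregressive) transforms."""
    rng = np.random.default_rng(seed)
    w = sigma * rng.standard_normal(T)   # stochastic forces
    u = np.cumsum(w)                     # u_t = u_{t-1} + w_t  (velocity)
    x = np.cumsum(u)                     # x_t = x_{t-1} + u_t  (position)
    u_rec = np.diff(x, prepend=0.0)      # u_t = x_t - x_{t-1}
    w_rec = np.diff(u_rec, prepend=0.0)  # w_t = u_t - u_{t-1}
    return x, u_rec, w_rec
```

Computing the lag-1 correlation of each returned signal, e.g. np.corrcoef(x[:-1], x[1:])[0, 1], should show the correlation decreasing from $\mathbf{x}$ to $\mathbf{u}$ and dropping to roughly zero for $\mathbf{w}$, consistent with Figure 10.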

Appendix D Additional Experimental Results

Model | M-MNIST | BAIR | KTH
1-AF | 2.06 | 2.98 | 2.95
2-AF | 2.04 | 2.76 | 2.95
SLVM | $\leq 1.93$ | $\leq 3.46$ | $\leq 3.05$
SLVM + $\Delta\mathbf{x}$ | $\leq 2.47$ | $\leq 3.05$ | $\leq 2.46$
SLVM + 1-AF | $\leq \mathbf{1.85}$ | $\leq \mathbf{2.31}$ | $\leq \mathbf{2.21}$
VF | – | 1.50 | –
VF + AF | – | $\mathbf{1.49}$ | –
Table 4: Training Quantitative Comparison. Average training negative log-likelihood in nats per dimension for Moving MNIST, BAIR Robot Pushing, and KTH Actions. VF results are reported for BAIR Robot Pushing only.

D.1 Additional Qualitative Results

Figure 11: Autoregressive Flow Visualization on KTH Actions. Visualization of the flow component for (a) standalone flow-based models and (b) sequential latent variable models with flow-based conditional likelihoods on KTH Actions. From top to bottom, each figure shows 1) the original frames, $\mathbf{x}_{t}$, 2) the predicted shift, $\bm{\mu}_{\theta}(\mathbf{x}_{<t})$, 3) the predicted scale, $\bm{\sigma}_{\theta}(\mathbf{x}_{<t})$, and 4) the noise, $\mathbf{y}_{t}$, obtained from the inverse transform.
Figure 12: SLVM w/ 2-AF Visualization on Moving MNIST. Visualization of the flow component for a sequential latent variable model with a 2-layer flow-based conditional likelihood on Moving MNIST. On the left side, from top to bottom, the figure shows 1) the original frames, $\mathbf{x}_{t}$, 2) the lower-level predicted shift, $\bm{\mu}_{\theta}^{1}(\mathbf{x}_{<t})$, and 3) the lower-level predicted scale, $\bm{\sigma}_{\theta}^{1}(\mathbf{x}_{<t})$. On the right side, from top to bottom, we have 1) the higher-level predicted shift, $\bm{\mu}_{\theta}^{2}(\mathbf{x}_{<t})$, 2) the higher-level predicted scale, $\bm{\sigma}_{\theta}^{2}(\mathbf{x}_{<t})$, and 3) the noise, $\mathbf{y}_{t}$, obtained from the inverse transform.
Figure 13: Generated Moving MNIST Samples. Sample frame sequences generated from a 2-AF model.
Figure 14: Generated BAIR Robot Pushing Samples. Sample frame sequences generated from SLVM + 1-AF. Sequences remain relatively coherent throughout, but do not display large changes across frames.