Low-Rank Constraints for Fast Inference in Structured Models

Justin T. Chiu
Cornell University
[email protected]
&Yuntian Deng¹¹footnotemark: 1
Harvard University
[email protected]
Alexander M. Rush
Cornell University
[email protected]
Equal contribution

Abstract

Structured distributions, i.e. distributions over combinatorial spaces, are commonly used to learn latent probabilistic representations from observed data. However, scaling these models is bottlenecked by the high computational and memory complexity with respect to the size of the latent representations. Common models such as Hidden Markov Models (HMMs) and Probabilistic Context-Free Grammars (PCFGs) require time and space quadratic and cubic in the number of hidden states respectively. This work demonstrates a simple approach to reduce the computational and memory complexity of a large class of structured models. We show that by viewing the central inference step as a matrix-vector product and using a low-rank constraint, we can trade off model expressivity and speed via the rank. Experiments with neural parameterized structured models for language modeling, polyphonic music modeling, unsupervised grammar induction, and video modeling show that our approach matches the accuracy of standard models at large state spaces while providing practical speedups.

1 Introduction

When modeling complex sequential spaces, such as sentences, musical scores, or video frames, a key choice is the internal structural representations of the model. A common choice in recent years is to use neural representations (Bengio et al., 2003; Mikolov et al., 2011; Brown et al., 2020; Boulanger-Lewandowski et al., 2012; Huang et al., 2018; Weissenborn et al., 2020) to store a deterministic history. These models yield strong predictive accuracy but their deterministic, continuous forms provide little insight into the intermediate decisions of the model.^†^†Code is available here.

Latent structured models provide an alternative approach where complex modeling decisions are broken down into a series of probabilistic steps. Structured models provide a principled framework for reasoning about the probabilistic dependencies between decisions and for computing posterior probabilities. The structure of the decision processes and the ability to answer queries through probabilistic inference afford interpretability and controllability that are lacking in neural models (Koller and Friedman, 2009; Levine, 2018).

Despite the benefits of structured models, the computational complexity of training scales asymptotically much worse than for neural models, as inference, and therefore training, requires marginalizing over all possible latent structures. For standard general-purpose models like Hidden Markov Models (HMM) and Probabilistic Context-Free Grammars (PCFG), the runtime of inference scales quadratically and cubically in the number of states respectively, which limits the ability to reach a massive scale. Promisingly, recent work has shown that in specific situations these models can be scaled, and that the increased scale results in commensurate improvements in accuracy – without sacrificing the ability to perform exact inference (Dedieu et al., 2019; Chiu and Rush, 2020; Yang et al., 2021).

In this work, we propose an approach for improving the runtime of a large class of structured latent models by introducing a low-rank constraint. We target the family of models where inference can be formulated through a labeled directed hypergraph, which describes a broad class of dynamic-programming based inference (Klein and Manning, 2004; Huang and Chiang, 2005; Zhou et al., 2006; Javidian et al., 2020; Chiang and Riley, 2020). We show how under low-rank constraints these models allow for more efficient inference. Imposing a low-rank constraint allows for a key step of inference to be rewritten as a fast matrix-vector product. This approach is also inspired by recent advances in computationally efficient neural attention attention (Katharopoulos et al., 2020; Peng et al., 2021; Choromanski et al., 2020), a significantly different task and formulation, that rewrites matrix-vector products as fast low-rank products using approximate kernel techniques.

We evaluate this approach by learning low-rank structured models for the tasks of language modeling, polyphonic music modeling, unsupervised grammar induction, and video modeling. For these tasks we use a variety of models including HMMs, PCFGs, and Hidden Semi-Markov Models (HSMMs). As the application of low-rank constraints is nontrivial in high-dimensional structured models due to reduced expressivity, we demonstrate effective techniques for overcoming several practical challenges of low-rank parameterizations. We find that our approach achieves very similar results to unconstrained models at large state sizes, while the decomposition allows us to greatly increase the speed of inference. Results on HMMs show that we can scale to more than 16,000 states; results on PCFGs achieve a significant perplexity reduction from much larger state spaces compared to past work (Kim et al., 2019); and results on HSMMs show that our formulation enables scaling to much larger state spaces for continuous emissions (Fried et al., 2020).

2 Background: Latent Structure and Hypergraphs

We consider the problem of modeling a sequence of observations $p(x)=p(x_{1},\dots,x_{T})$ . These observations can range in complexity from the words in a sentence to a series of co-occurring musical notes, or to features of video frames, and may be discrete or continuous. We assume these observations are generated by an unobserved (latent) structured representation $z$ , and therefore model the joint $p(x,z)$ . The structure may be sequential or hierarchical, such as latent trees, and the set of structures $\mathcal{Z}$ is combinatorial, i.e. exponential in size with respect to the input sentence. In order to train these models on observations, we must optimize the evidence $p(x)=\sum_{z}p(x,z)$ by marginalizing over $z$ . Scaling this marginalization is the focus of this work.

Hypergraphs are a graphical model formalism for structured distributions that admit tractable inference through dynamic programming (Klein and Manning, 2004; Huang and Chiang, 2005; Zhou et al., 2006; Javidian et al., 2020; Chiang and Riley, 2020).¹¹1While the formalism is similar to undirected factor graphs, it allows us to represent more complex distributions: notably dependency structures with unknown topologies, such as latent trees. A labeled, directed, acyclic hypergraph consists of a set of nodes $\cal V$ , a set of hyperedges $\cal E$ , and a designated root node $S\in{\cal V}$ . Each node $v\in\mathcal{V}$ has a collection of labels ${\cal L}_{v}$ . Each hyperedge $e\in\mathcal{E}$ has a head node $u$ and tuple of tail nodes, $v=(v_{1},\ldots,v_{|e|})$ , where $|e|$ is the number of tail nodes. For simplicity, we will assume at most 2 tail nodes $v_{1},v_{2}$ , and unless noted, a fixed label set $\mathcal{L}$ throughout. Each hyperedge $e$ is associated with a score matrix $\Psi_{e}\in\mathbb{R}^{\mathcal{L}\times\mathcal{L}^{|e|}}$ with a score for all head and tail labels.²²2This formalism can represent inference in both locally and globally normalized models, although we focus on local normalization in this work. We use the notation $[\Psi_{e}]_{z_{u},(z_{1},z_{2})}$ to indicate the score for head label $z_{u}$ and tail labels $z_{1}$ and $z_{2}$ . Finally, we assume we have a topological ordering over the edges.

A hypergraph is used to aggregate scores bottom-up through a dynamic programming (belief propagation) algorithm. Algorithm 1 (left) shows the algorithm. It works by filling in a table vector $\alpha_{v}\in\mathbb{R}^{\mathcal{L}}$ for each node $v$ in order, and is initialized to 1 at the leaf nodes.³³3 The accumulation of scores is denoted by $\stackrel{{\scriptstyle+}}{{\leftarrow}}$ . Multiple hyperedges can have the same head node, whose scores must be added together. It returns the sum over latent structures, $p(x)$ . Counting loops, the worst-case runtime complexity is $O(|{\cal E}|\times L^{|e^{*}|+1})$ where $L=|{\cal L}|$ is the size of the label set and $|e^{*}|$ the max hyperedge tail size. Algorithm 1 (right) shows the same algorithm in matrix form by introducing joined tail vectors $\beta_{v}\in\mathbb{R}^{\mathcal{L}^{|e|}}$ for each group of nodes $v$ . Letting $z_{v}=(z_{1},z_{2})$ , the joined tail vector contains entries $[\beta]_{z_{v}}=[\alpha_{v_{1}}]_{z_{1}}[\alpha_{v_{2}}]_{z_{2}}$ .

[Scalar Form]

for

u\leftarrow v_{1},v_{2}

hyperedge

e

topologically do

for

z_{u}\in{\cal L}_{u}

[\alpha_{u}]_{z_{u}}\stackrel{{\scriptstyle+}}{{\leftarrow}}\sum_{z_{1},z_{2}}[\Psi_{e}]_{z_{u},(z_{1},z_{2})}

\bm{\cdot}\ [\alpha_{v_{1}}]_{z_{1}}\ [\alpha_{v_{2}}]_{z_{2}}

return

\sum_{z}[\alpha_{S}]_{z}

[Matrix Form]

for

u\leftarrow v

hyperedge

e

topologically do

\alpha_{u}\stackrel{{\scriptstyle+}}{{\leftarrow}}\Psi_{e}\beta_{v}

return

\alpha_{S}^{\top}\mathbf{1}

Algorithm 1 Hypergraph marginalization

To make this formalism more concrete, we show how hypergraphs can be used for inference in several structured generative models: hidden Markov models, probabilistic context-free grammars, and hidden semi-Markov models. Inference in these examples are instances of the hypergraph algorithm.

Example: Hidden Markov Models (HMM) HMMs are discrete latent sequence models defined by the following generative process: first, a sequence of discrete latent states $z=(z_{1},\ldots,z_{T})$ with state size $L$ are sampled as a Markov chain. Then each state $z_{t}$ independently emits an observation $x_{t}$ , i.e.

p(x,z)=\prod_{t=1}^{T}p(z_{t}\mid z_{t-1})\ p(x_{t}\mid z_{t}),

(1)

where $p(z_{t}\mid z_{t-1})$ is the transition distribution, $p(x_{t}\mid z_{t})$ the emission distribution, and $p(z_{1}\mid z_{0})$ is the initial distribution with distinguished start symbol $z_{0}$ .

Given a sequence of observations $x=(x_{1},\ldots,x_{n})$ we can compute $p(x)=\sum_{z}p(x,z)$ using a labeled directed hypergraph, with single-tailed edges, nodes corresponding to state positions, labels corresponding to states, and emissions probabilities incorporated into the scoring matrices $\Psi$ . There are $T$ scoring matrices, $\Psi_{t}\in\mathbb{R}^{\mathcal{L}\times\mathcal{L}}$ , with entries $[\Psi_{t}]_{z_{t},z_{t+1}}=p(z_{t+1},x_{t}\mid z_{t})$ corresponding to transitions.⁴⁴4 The left-most scoring matrix for the HMM has entries $[\Psi_{1}]_{z_{1},z_{2}}=p(z_{2},x_{1}\mid z_{1})p(z_{1}\mid z_{0})$ . Algorithm 2 (left) shows the approach. This requires time $O(TL^{2})$ and is identical to the backward algorithm for HMMs.⁵⁵5 In the case of HMMs, the table vectors $\alpha_{t}$ correspond to the backward algorithm’s $\beta$ values.

Algorithm 2 Hypergraph marginalization for HMMs and PCFGs

[HMM - Backward]

for

t\leftarrow(t+1)

in right-to-left order do

for

z_{t+1}\in\mathcal{L}

[\beta_{t+1}]_{z_{t+1}}=[\alpha_{t+1}]_{z_{t+1}}

\alpha_{t}\stackrel{{\scriptstyle+}}{{\leftarrow}}\Psi_{t}\beta_{t+1}

return

\alpha_{0}^{\top}\mathbf{1}

[PCFG - CKY]

for

(i,k)\leftarrow(i,j),(j,k)

in span-size order do

for

z_{1},z_{2}\in\mathcal{L}_{i,j}\times\mathcal{L}_{j,k}

[\beta_{i,j,k}]_{(z_{1},z_{2})}=[\alpha_{i,j}]_{z_{1}}[\alpha_{j,k}]_{z_{2}}

\alpha_{i,k}\stackrel{{\scriptstyle+}}{{\leftarrow}}\Psi\beta_{i,j,k}

return

\alpha_{1,T}^{\top}\mathbf{1}

Example: Context-Free Grammars (CFG) CFGs are a structured model defined by the 5-tuple $\mathcal{G}=(S,\mathcal{N},\mathcal{P},\mathcal{X},\mathcal{R})$ , where $S$ is the distinguished start symbol, $\mathcal{N}$ is a set of nonterminals, $\mathcal{P}$ is a set of preterminals, $\mathcal{X}$ is the token types in the vocabulary, and $\mathcal{R}$ is a set of grammar rules. Production rules for start, nonterminals, and preterminals take the following forms:⁶⁶6 We restrict our attention to grammars in Chomsky normal form.

\displaystyle S

\displaystyle\to A,

\displaystyle A\in\mathcal{N};

\displaystyle A

\displaystyle\to B\ C,

\displaystyle B,C\in\mathcal{N}\cup\mathcal{P};

\displaystyle D

\displaystyle\to x,

\displaystyle D\in\mathcal{P},x\in\mathcal{X}.

(2)

A probabilistic context-free grammar (PCFG) additionally has a probability measure on the set of rules. To compute $p(x_{1},\ldots,x_{T})$ with a hypergraph, we create one node for each contiguous subspan $[i,k)$ in the sentence. Nodes with $i+1<k$ have a nonterminal label set ${\cal L}=\mathcal{N}$ . Nodes with $i+1=k$ have a preterminal label set ${\cal L}_{i,i+1}=\mathcal{P}$ . The main scoring matrix is $\Psi\in\mathbb{R}^{\mathcal{L}\times\mathcal{L}^{2}}$ , with entries $[\Psi]_{z_{u},(z_{1},z_{2})}=p(z_{1},z_{2}\mid z_{u})$ .⁷⁷7We have a separate matrix for terminal production on $x$ which we elide for simplicity. Algorithm 2 (right) shows how for every hyperedge we join the scores from the two tail nodes in $\alpha_{i,j}$ and $\alpha_{j,k}$ into joined tail vector $\beta_{i,j,k}\in\mathbb{R}^{\mathcal{L}^{2}}$ . As there are $O(T^{3})$ hyperedges and the largest ${\cal L}$ is of size $|\mathcal{N}|$ , the runtime of the algorithm is $O(T^{3}{|\mathcal{N}|}^{3})$ . This approach is identical to the CKY algorithm.

Example: Hidden Semi-Markov Models (HSMM) HSMMs are extensions of HMMs that allow for generating a variable-length sequence of observations per state. It defines the following generate process: first, we sample a sequence of discrete latent states $z=(z_{1},\cdots,z_{K})$ with a first-order Markov model. We then use them to generate the length of observations per state. For our experiments we generate independent continuous emissions $x_{t}$ with a Gaussian distribution for $p(x_{i}\mid z_{k})$ . Full details of the inference procedure are given in Appendix E.

3 Rank-Constrained Structured Models

For these structured distributions, hypergraphs provide a general method for inference (and therefore training parameterized versions). However, the underlying algorithms scale poorly with the size of the label sets (quadratic for HMM and HSMM, cubic for CFG). This complexity makes it challenging to scale these models and train versions with very large numbers of states.

In this section, we consider an approach for improving the scalability of these models by reducing the dependence of the computational complexity of inference on the label set size. The main idea is to speed up the matrix-vector product step in inference by using a low-rank decomposition of the scoring matrix $\Psi$ . In the next section we show that this constraint can be easily incorporated into parameterized versions of these models.

3.1 Low-Rank Matrix-Vector Products

The main bottleneck for inference speed is the matrix-vector product $\alpha_{u}\stackrel{{\scriptstyle+}}{{\leftarrow}}\Psi_{e}\beta_{v}$ that must be computed for every edge in the hypergraph. As we saw in Algorithm 1 (left), this step takes time $L^{|e|+1}$ to compute, but it can be sped up by making structural assumptions on $\Psi_{e}$ . In particular, we focus on scoring matrices with low rank.

We note the following elementary property of matrix-vector products. If the scoring matrix can be decomposed as the product of two smaller matrices $\Psi_{e}=U_{e}V_{e}^{\top}$ , where $U_{e}\in\mathbb{R}^{\mathcal{L}\times N}$ and $V_{e}\in\mathbb{R}^{N\times\mathcal{L}^{|e|}}$ , then the matrix-vector products can be computed in time $O(|{\cal E}|\times L^{|e|}\times N)$ as follows:

\Psi_{e}\beta_{v}=\left(U_{e}V_{e}^{\top}\right)\beta_{v}=U_{e}\left(V_{e}^{\top}\beta_{v}\right).

(3)

This reordering of computation exchanges a factor of $L$ for a factor of $N$ . When $N\ll L$ , this method is both faster and more memory-efficient.

We enforce the low-rank constraint by directly parameterizing the factors $U_{e}$ and $V_{e}$ for scoring matrices $\Psi_{e}$ that we would like to constrain. We treat both $U_{e}$ and $V_{e}$ as embedding matrices, where each row corresponds to an embedding of each value of $z_{u}$ and a joint embedding of $(z_{1},z_{2})$ respectively:

[U_{e}]_{z_{u},n}=c_{z_{u}}[\phi(f(z_{u}))]_{n}\qquad[V_{e}]_{(z_{1},z_{2}),n}=c_{z_{1},z_{2}}[\phi(g(z_{1},z_{2}))]_{n},

(4)

where $f$ and $g$ are embedding functions; $c_{z_{u}}$ and $c_{z_{1},z_{2}}$ are constants (used to ensure proper normalization) or clamped potentials (such as conditional probabilities); and $\phi:\mathbb{R}^{D}\to\mathbb{R}^{N}_{+}$ is a function that ensures nonnegativity, necessary for valid probability mass functions. Algorithm 3 shows the role of the low-rank matrix-vector product in marginalization.⁸⁸8 If the normalizing constants are given by $c_{z_{u}}$ , they can be computed from unnormalized $\tilde{U}_{e},\tilde{V}_{e}$ as follows: $c_{z_{u}}=[\tilde{U}_{e}\tilde{V}_{e}^{\top}\mathbf{1}]_{z_{u}}$ in time $O(L^{|e|}N+LN)$ , and similarly for $c_{z_{1},z_{2}}$ .

Algorithm 3 Low-rank marginalization

for

u\leftarrow v_{1},v_{2}

hyperedge

e

topologically do

for

n\in 1,\ldots,N

[\gamma]_{n}=\displaystyle\sum_{z_{v}}c_{v}\ [\phi(g(z_{1},z_{2}))]_{n}\ [\beta_{v}]_{z_{v}}

\vartriangleright

O(L^{|e|})

\alpha_{u}\stackrel{{\scriptstyle+}}{{\leftarrow}}U_{e}\gamma

\vartriangleright

O(LN)

return

\alpha_{S}^{\top}\mathbf{1}

3.2 Application to Structured Models

As enforcing a low-rank factorization of every scoring matrix limits the expressivity of a model, we explicitly target scoring matrices that are involved in computational bottlenecks.⁹⁹9For a discussion of the expressivity of low-rank models compared to models with fewer labels, see Appendix A. For these key scoring matrices, we directly parameterize the scoring matrix with a low-rank factorization, which we call a low-rank parameterization. For other computations, we utilize a standard softmax parameterization and do not factorize the resulting scoring matrix. We refer to this as a mixed parameterization.

Hidden Markov Models Low-rank HMMs (LHMMs) use the following mixed parameterization, which specifically targets the state-state transition bottleneck by using a low-rank parameterization for the transition distribution, but a softmax parameterization for the emission distribution:

\displaystyle p(z_{t}\mid z_{t-1})

\displaystyle\propto\phi(\mathbf{u}_{z_{t-1}})^{\top}\phi(\mathbf{v}_{z_{t}}),

\displaystyle p(x_{t}\mid z_{t})

\displaystyle\propto\exp(\mathbf{u}_{z_{t}}^{\top}\mathbf{v}_{x_{t}}),

(5)

where $\mathbf{u}_{z_{t-1}}=f(z_{t-1})$ and $\mathbf{v}_{z_{t}}=g(z_{t})$ are (possibly neural) embedding functions. The parameterizations of the embedding functions $f,g:\mathcal{L}\to\mathbb{R}^{D}$ , as well as the non-negative mapping $\phi:\mathbb{R}^{D}\to\mathbb{R}^{N}_{+}$ are detailed in Appendix F. When performing inference, we treat the emission probabilities $p(x_{t}\mid z_{t})$ as constants, and absorb them into $c_{u}$ .

This allows inference to be run in time $O(TLN)$ , where $T$ is the length of a sequence, $L$ the size of the label space, and $N$ the feature dimension.

Hidden Semi-Markov Models For low-rank HSMM (LHSMM), we similarly target the transition distribution and keep the standard Gaussian emission distribution:

\displaystyle p(z_{k}\mid z_{k-1})

\displaystyle\propto\mathbf{u}_{z_{k-1}}^{\top}\mathbf{v}_{z_{k}},

\displaystyle p(x_{t}\mid z_{k})

\displaystyle\propto K_{\text{Gauss}}(\mathbf{u}_{z_{k}},\mathbf{x}_{t}),

(6)

where $\mathbf{u}_{z_{k-1}}=\phi(f(z_{k-1}))$ and $\mathbf{v}_{z_{k}}=\phi(g(z_{k}))$ are state embeddings, while $K_{\text{Gauss}}(\cdot,\cdot)$ is the Gaussian kernel used to model continuous $\mathbf{x}_{t}$ . The full parameterization of the embeddings is given in Appendix F. The total inference complexity is $O(TLMN)$ , where $M$ is the maximum length of the observation sequence under any state.

Context-Free Grammars For PCFGs, the inference bottleneck is related to the transition from a nonterminal symbol to two nonterminal symbolss ( $A\to B\ C$ ), and we specifically parameterize it using a low-rank parameterization:

	$\displaystyle p(z_{1,N}\mid S)$	$\displaystyle\propto\exp(\mathbf{u}_{S}^{\top}\mathbf{u}_{z_{1,N}}),$	$\displaystyle p(z_{i,j},z_{j,k}\mid z_{i,k})$	$\displaystyle\propto\begin{cases}\exp(\mathbf{u}_{z_{i,k}}^{\top}\mathbf{v}_{z_{i,j}\ z_{j,k}})&\begin{subarray}{c}i+1=j\lor\\ j+1=k\end{subarray}\\ \phi(\mathbf{u}_{z_{i,k}}^{\prime})^{\top}\phi(\mathbf{v}_{z_{i,j}),z_{j,k}}&\text{o.w.}\\ \end{cases}$		(7)
	$\displaystyle p(x_{i}\mid z_{i})$	$\displaystyle\propto\exp(\mathbf{u}_{z_{i}}^{\top}\mathbf{v}_{x_{i}}),$				(7)

where $\mathbf{u}_{z}$ / $\mathbf{u}_{z}^{\prime}$ is the embedding of $z$ when $z$ is used as head, $\mathbf{v}_{x}$ / $\mathbf{v}_{z_{1},z_{2}}$ is the embedding of $x$ / $(z_{1},z_{2})$ when they are used as tail. See Appendix F for the full parameterization, drawn from Kim et al. (2019). Note that we limit the application of low-rank constraints to nonterminal to nonterminal productions. These productions dominate the runtime as they are applied at $O(T^{3})$ hyperedges. This allows inference to be run in time $O(T^{3}L^{2}N)$ , where $T$ is the length of a sequence, $L$ the size of the label space, and $N$ the feature dimension.

4 Experimental Setup

We evaluate the application of low-rank constraints with four experiments: sequential language modeling with HMMs, polyphonic music modeling with a large observation space, hierarchical language modelings with PCFGs, and video modeling with HSMMs.

Data Our first set of experiments evaluate sequential models on Penn Treebank dataset (Ptb) (Marcus et al., 1993) for the task of word-level language modeling. We use the preprocessing from Mikolov et al. (2011). The second set of experiments is on polyphonic music modeling (Boulanger-Lewandowski et al., 2012). We evaluate on four music datasets: Nottingham (Nott), Piano, MuseData (Muse), and JSB chorales (JSB). Each timestep consists of an 88-dimensional binary vector indicating whether a particular note is played. Since multiple notes may be played at the same time, the effective vocabulary size is extremely large. The third set of experiments use PCFGs for language modeling, we also use Ptb, but with the splits and preprocessing used in unsupervised constituency parsing (Shen et al., 2018, 2019; Kim et al., 2019). The last set of experiments use HSMMs for video modeling, where we use CrossTask (Zhukov et al., 2019) with 10% of the training data for validation. We follow the preprocessing steps in Fried et al. (2020) and apply PCA to project features to vectors of size 200. For the full details on datasets, please see Appendix D.

Models and Hyperparameters For language modeling with HMMs, we experiment with a range of state sizes, $|\mathcal{L}|=L\in\left\{2^{10},2^{11},2^{12},2^{13},2^{14}\right\}$ , and rank $N\in\left\{L/2,L/4,L/8\right\}$ . For polyphonic music modeling with HMMs, we experiment with states sizes $L\in\left\{2^{7},2^{8},2^{9},2^{10},2^{11}\right\}$ . For language modeling with PCFGs, we use a set of nonterminals of size $|\mathcal{N}|\in\left\{30,60,100\right\}$ and preterminals of twice the number of nonterminals $|\mathcal{P}|=2|\mathcal{N}|$ . Our smallest setting ( $|\mathcal{N}|=30$ , $|\mathcal{P}|=60$ ) is the one used in Kim et al. (2019). For video modeling with HSMMs, we use the same model setting as Fried et al. (2020), but we don’t constrain states to the predefined states per task, and we experiment with state sizes $L\in\left\{2^{6},2^{7},2^{8},2^{9},2^{10}\right\}$ and rank $N\in\left\{2^{4},2^{5},2^{6},2^{7}\right\}$ .

We utilize the feature map $\phi(x)=\exp(Wx)$ for the LHMM and LHSMM, and $\phi(x)=\exp(Wx-\|x\|_{2}^{2}/2)$ for the LPCFG. We initialize the parameters of feature maps using orthogonal feature projections (Choromanski et al., 2020), and update it alongside the model parameters. For the full hyperparameter and optimization details, see Appendix G.

Baselines and Evaluation The language modeling experiments are evaluated using perplexity. Baselines are neurally parameterized HMM with a standard softmax transition. We also compare to VL-HMM, which makes a strong structural sparsity assumption on the emission distribution (Chiu and Rush, 2020). We include for reference a state-of-the-art language model, the AWD-LSTM (Merity et al., 2017). For polyphonic music modeling, we compare our LHMM against RNN-NADE (Boulanger-Lewandowski et al., 2012) which models the full joint distribution of notes as well as temporal dependencies; as well as autoregressive neural models such as the R-Transformer (Wang et al., 2019) (as reported by Song et al. (2019)) and an LSTM (as reported by Ziegler and Rush (2019)); models with latent continuous dynamics such as the LV-RNN (Gu et al., 2015) and SRNN (Fraccaro et al., 2016); and finally comparable models with latent discrete dynamics, the TSBN (Gan et al., 2015) and the baseline HMM. We evaluate perplexities of our low-rank PCFG (LPCFG) against a softmax PCFG (PCFG) (Kim et al., 2019). For video modeling, we evaluate the negative log likelihoods on the test set and compare low-rank HSMMs to softmax HSMMs.

5 Results

Hidden Markov Models for Language Modeling

Refer to caption — Figure 1: Validation perplexities on Ptb versus model scale, as well as speed in seconds per batch.

Our main experimental result is that the low-rank models achieve similar accuracy, as measured by perplexity, as our baselines. Fig. 1 shows that perplexity improves as we increase the scale of the HMM, and that the performance of our LHMM also improves at the same rate. At small sizes, the low-rank constraints slightly hinder accuracy; however once the size is large enough, i.e. larger than $2^{12}$ , LHMMs with 8:1 state-to-rank ratios perform comparably. ¹⁰¹⁰10 See Appendix H for an analysis of the ranks of HMMs/LHMMs.

Fig. 1 also contains speed comparisons between HMMs and LHMMs. A state-to-rank ratio of 8:1 matches the accuracy of softmax HMMs at larger state sizes and also gives an empirical speedup of more than 3x at $L=2^{14}$ . As expected, we only see a speedup when the state-to-rank ratio exceeds 2:1, as we replaced the $O(L^{2})$ operation with two $O(LN)$ ones. This implies that the low-rank constraint is most effective with scale, where we observe large computational gains at no cost in accuracy.

Model	Val	Test
AWD-LSTM	60.0	57.3
VL-HMM	128.6	119.5
HMM	144.3	136.8
LHMM	141.4	131.8

Model	$L:N$	Train	Val
HMM	-	95.9	144.3
LHMM	8	97.5	141.4
LHMM+band	8	101.1	143.8
LHMM	16	110.6	146.3
LHMM+band	16	96.9	138.8
LHMM	32	108.4	153.7
LHMM+band	32	110.7	145.0

Table 1: Model perplexities on Ptb. All HMM variants have

L=2^{14}

states. (Left): Validation and test perplexities. The LHMM has a state-to-rank ratio

8:1

. (Right): Further experiments with extending the low-rank structure of LHMMs with a banded transition structure.

HMMs are outperformed by neural models, and also by VL-HMMs (Chiu and Rush, 2020) which offer similar modeling advantages to HMMs, as shown in Tbl. 1 (left). This indicates that some aspects of performance are not strictly tied to scale. We posit this is due to the problem-specific block-sparse emission constraint in VL-HMMs. While very effective for language modeling, the VL-HMM relies on a hard clustering of states for constraining emissions. This is difficult to apply to problems with richer emission models (as in music and video modeling).

Hidden Markov Models for Music Modeling We next apply LHMMs on polyphonic music modeling. This has a max effective vocabulary size of $2^{88}$ , as multiple notes may occur simultaneously. Unlike for language modeling, we use a factored Bernoulli emission model, modeling the presence of each note independently. Fig. 2 (right) shows that HMMs are competitive with many of the models on these datasets, including LSTMs. We find that LHMMs achieve performance slightly worse than but comparable to the unconstrained HMMs overall. Fig. 2 (left) shows that the distinction drops with more states. Both HMMs achieve low negative likelihoods (NLL) on the datasets with shorter sequences, Nottingham and JSB, but relatively poorer NLLs on the datasets with longer sequences (Muse and Piano).

Context-Free Grammars For syntactic language modeling on Ptb, our low-rank PCFG (LPCFG) achieves similar performance to PCFGs, as shown in Table 2 (left), with an improvement in computational complexity. The complexity of inference in PCFGs models is cubic in the number of nonterminals, so even models with $|\mathcal{N}|=30$ nonterminals are relatively costly. Our approach achieves comparable results with $N=8$ features. As we scale up the number of nonterminals to $|\mathcal{N}|=100$ , LPCFG stays competitive with a lower computational complexity (since $N<|\mathcal{N}|$ ). These experiments also demonstrate the importance of scale in syntactic language models with more than 50 point gain in perplexity over a strong starting model.

CFG Speed Once the model is large enough, i.e. $|\mathcal{N}|\geq 60$ nonterminals and $|\mathcal{P}|\geq 120$ preterminals, the LPCFG is faster than PCFG, as shown in Tbl. 2 (left). Note that the LPCFG is faster than the CFG even when the number of features $N>\frac{|\mathcal{N}|}{2}$ , in contrast to the HMM case where a speedup can only be obtained when $N<L/2$ . This is due to the scoring matrix being rectangular: Recall the low-rank matrix product $\Psi\beta=U(V^{\top}\beta)$ , where, when specialized to PCFGs, the left-hand side takes time $O(L^{3})$ and the right-hand side takes $O(L^{2}N+LN)$ . For PCFGs, the term $V^{\top}\beta$ dominates the runtime. This contrasts with HMMs, where both $V^{\top}\beta$ and the subsequent multiplication by $U$ take the same amount of time, $O(LN)$ .

Hidden Semi-Markov Models for Video Modeling Table 2 (right) shows the results of video modeling using HSMMs. In addition to using a different hypergraph for inference, these experiments use a continuous Gaussian emission model. By removing the state constraints from tasks, our HSMM baselines get better video-level NLLs than that from Fried et al. (2020) at the cost of more memory consumption. Due to GPU memory constraints, we can only train HSMMs up to $2^{8}$ states. However, the low-rank parameterization allows models to scale to $2^{10}$ states, yielding an improvement in NLL. Absolute results could likely be improved with more states and by an improved emission parameterization for all models.

Improving Rank Assumptions One potential limitation of all low-rank models is that they cannot learn high-rank structures with low $N$ . We began to see this issue at a ratio of 16:1 states to features for large HMMs. To explore the effects of this limitation, we perform an additional experiment that combines low-rank features with a sparse component. Specifically we add an efficient high-rank sparse banded transition matrix. The full details are in Appendix I. Tbl. 1 (right) shows that combination with the band structure allows for larger ratios than just the low-rank structure alone, while only adding another operation that costs $O(LN)$ .

$\|\mathcal{N}\|$	$\|\mathcal{P}\|$	Model	$N$	PPL	Secs
30	60	PCFG	-	252.60	0.23
		LPCFG	8	247.02	0.27
		LPCFG	16	250.59	0.27
60	120	PCFG	-	234.01	0.33
		LPCFG	16	217.24	0.28
		LPCFG	32	213.81	0.30
100	200	PCFG	-	191.08	1.02
		LPCFG	32	203.47	0.64
		LPCFG	64	194.25	0.81

Model	$L$	$N$	NLL	Secs
HSMM¹¹¹¹11We report the NLL of the baseline model taken from Fried et al. (2020), where every label corresponds to one of 151 possible actions.	151	-	$1.432e5$	-
HSMM	$2^{6}$	-	$1.428e5$	0.78
HSMM	$2^{7}$	-	$1.427e5$	2.22
HSMM	$2^{8}$	-	$1.426e5$	7.69
LHSMM	$2^{7}$	$2^{7}$	$1.427e5$	4.17
LHSMM	$2^{8}$	$2^{6}$	$1.426e5$	5.00
LHSMM	$2^{9}$	$2^{5}$	$1.424e5$	5.56
LHSMM	$2^{10}$	$2^{4}$	$1.423e5$	10.0

Table 2: (Left): Test perplexities and speeds for PCFG models on Ptb. The complexity of PCFG is

O(T^{3}|\mathcal{N}|^{3})

, whereas the complexity of LPCFG is

O(T^{3}|\mathcal{N}|^{2}N)

. Speeds are given in seconds per batch. (Right): Negative log likelihoods (NLL) per video and speeds for HSMM models on CrossTask. We cannot train HSMMs beyond

2^{8}

states due to GPU memory constraints, but we can train LHSMMs with up to

2^{10}

states. Speeds are given in seconds per batch.

6 Related Work

Similar to our work, other approaches target matrix or tensor operations in inference, and impose structural model constraints to improve computational complexity. Many of the works on HMMs in particular take advantage of the transition structure. The Dense-mostly-constant (DMC) HMM assigns a subset of learnable parameters per row of the transition matrix and sets the rest to a constant, leading to a sub-quadratic runtime (Siddiqi and Moore, 2005). Other structures have also been explored, such as aligning the states of an HMM to underlying phenomena that allows inference to be sped up (Felzenszwalb et al., 2004; Roweis, 2000). Additionally, other methods take advantage of emission structure in HMMs in order to scale, such as the Cloned HMM (Dedieu et al., 2019) and VL-HMM (Chiu and Rush, 2020). Compared to these approaches, our method is more flexible and generic, since it can be applied in a non-application-specific manner, and even extended with high-rank components (such as banded structure).

Low-rank structure has been explored in both HMMs (Siddiqi et al., 2009), a generalization of PCFGs called weighted tree automata (Rabusseau et al., 2015), and conditional random fields (Thai et al., 2018). The reduced-rank HMM (Siddiqi et al., 2009) has at most 50 states, and relies on spectral methods for training. The low-rank weighted tree automata (Rabusseau et al., 2015) also trains latent tree models via spectral methods. We extend the low-rank assumption to neural parameterizations, which have been shown to be effective for generalization (Kim et al., 2019; Chiu and Rush, 2020), and directly optimize the evidence via gradient descent. Finally, Thai et al. (2018) do not take advantage of the low-rank parameterization of their CRF potentials for faster inference via low-rank matrix products, a missed opportunity. Instead, the low-rank parameterization is used only as a regularizer, with the full potentials instantiated during inference.

Concurrent work in unsupervised parsing uses a tensor decomposition to scale PCFGs to large state spaces (Yang et al., 2021). Our low-rank decomposition of the flattened head-tail scoring matrix is more general, resulting in worse scaling for the PCFG setting but with wider applicability, as shown by experiments with HMMs and HSMMs.

7 Conclusion

This work improves the scaling of structured models by establishing the effectiveness of low-rank constraints for hypergraph models. We show that viewing a key step of inference in structured models as a matrix-vector product, in combination with a low-rank constraint on relevant parameters, allows for an immediate speedup. Low-rank inference allows us to obtain a reduction in the asymptotic complexity of marginalization at the cost of a constrained model. Our approach applies to a wide class of models, including HMMs, HSMMs, and PCFGs. Through our experiments on language, video, and polyphonic music modeling, we demonstrate an effective approach for overcoming the practical difficulty of applying low-rank constraints in high dimensional, structured spaces by targeting and constraining model components that bottleneck computation. Future work includes exploration of other structural constraints for speeding up matrix-vector products (Dao et al., 2020) performed in inference, as well as application to models where exact inference is intractable.

Acknowledgments and Disclosure of Funding

We thank Nikita Kitaev for the discussion that sparked this project. We thank Sam Wiseman, Jack Morris, and the anonymous reviewers for valuable feedback. Yuntian Deng is sponsored by NSF 1704834, Alexander Rush by NSF CAREER 1845664, and Justin Chiu by an Amazon research award.

References

Bayer and Osendorfer (2015) Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks, 2015.
Bengio et al. (2003) Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March 2003. ISSN 1532-4435.
Boulanger-Lewandowski et al. (2012) Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, page 1881–1888, Madison, WI, USA, 2012. Omnipress. ISBN 9781450312851.
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
Chiang and Riley (2020) David Chiang and Darcey Riley. Factor graph grammars. arXiv preprint arXiv:2010.12048, 2020.
Chiu and Rush (2020) Justin Chiu and Alexander Rush. Scaling hidden Markov language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1341–1349, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.103. URL https://www.aclweb.org/anthology/2020.emnlp-main.103.
Choromanski et al. (2020) Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with performers, 2020.
Dao et al. (2020) Tri Dao, Nimit Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri Rudra, and Christopher Ré. Kaleidoscope: An efficient, learnable representation for all structured linear maps. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BkgrBgSYDS.
Dedieu et al. (2019) Antoine Dedieu, Nishad Gothoskar, Scott Swingle, Wolfgang Lehrach, Miguel Lázaro-Gredilla, and Dileep George. Learning higher-order sequential structure with cloned hmms, 2019.
Felzenszwalb et al. (2004) Pedro Felzenszwalb, Daniel Huttenlocher, and Jon Kleinberg. Fast algorithms for large-state-space hmms with applications to web usage analysis. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems, volume 16. MIT Press, 2004. URL https://proceedings.neurips.cc/paper/2003/file/9407c826d8e3c07ad37cb2d13d1cb641-Paper.pdf.
Fraccaro et al. (2016) Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers, 2016.
Fried et al. (2020) Daniel Fried, Jean-Baptiste Alayrac, Phil Blunsom, Chris Dyer, Stephen Clark, and Aida Nematzadeh. Learning to segment actions from observation and narration. arXiv preprint arXiv:2005.03684, 2020.
Gan et al. (2015) Zhe Gan, Chunyuan Li, Ricardo Henao, David Carlson, and Lawrence Carin. Deep temporal sigmoid belief networks for sequence modeling, 2015.
Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
Gu et al. (2015) Shixiang Gu, Zoubin Ghahramani, and Richard E. Turner. Neural adaptive sequential monte carlo, 2015.
Huang et al. (2018) Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, and Douglas Eck. An improved relative self-attention mechanism for transformer with application to music generation. CoRR, abs/1809.04281, 2018. URL http://arxiv.org/abs/1809.04281.
Huang and Chiang (2005) Liang Huang and David Chiang. Better k-best parsing. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 53–64, 2005.
Javidian et al. (2020) Mohammad Ali Javidian, Zhiyu Wang, Linyuan Lu, and Marco Valtorta. On a hypergraph probabilistic graphical model. Annals of Mathematics and Artificial Intelligence, 88(9):1003–1033, 2020.
Katharopoulos et al. (2020) A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
Kim et al. (2019) Yoon Kim, Chris Dyer, and Alexander Rush. Compound probabilistic context-free grammars for grammar induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2369–2385, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1228. URL https://www.aclweb.org/anthology/P19-1228.
Kingma and Ba (2017) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
Klein and Manning (2004) Dan Klein and Christopher D Manning. Parsing and hypergraphs. In New developments in parsing technology, pages 351–372. Springer, 2004.
Koller and Friedman (2009) Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009. ISBN 0262013193.
Krishnan et al. (2016) Rahul G. Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models, 2016.
Levine (2018) Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018.
Liu et al. (2018) Xiaolei Liu, Zhongliu Zhuo, Xiaojiang Du, Xiaosong Zhang, Qingxin Zhu, and Mohsen Guizani. Adversarial attacks against profile hmm website fingerprinting detection model. Cognitive Systems Research, 54, 12 2018. doi: 10.1016/j.cogsys.2018.12.005.
Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. CoRR, abs/1711.05101, 2017. URL http://arxiv.org/abs/1711.05101.
Marcus et al. (1993) Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Comput. Linguist., 19(2):313–330, June 1993. ISSN 0891-2017.
Merity et al. (2017) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. CoRR, abs/1708.02182, 2017. URL http://arxiv.org/abs/1708.02182.
Mikolov et al. (2011) Tomas Mikolov, Anoop Deoras, Stefan Kombrink, Lukás Burget, and Jan Cernocký. Empirical evaluation and combination of advanced language modeling techniques. pages 605–608, 2011. URL http://dblp.uni-trier.de/db/conf/interspeech/interspeech2011.html#MikolovDKBC11.
Peng et al. (2021) Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. Random feature attention. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=QtTKTdVrFBB.
Rabusseau et al. (2015) Guillaume Rabusseau, Borja Balle, and Shay B. Cohen. Weighted tree automata approximation by singular value truncation. CoRR, abs/1511.01442, 2015. URL http://arxiv.org/abs/1511.01442.
Roweis (2000) Sam Roweis. Constrained hidden markov models. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 2000. URL https://proceedings.neurips.cc/paper/1999/file/84c6494d30851c63a55cdb8cb047fadd-Paper.pdf.
Shen et al. (2018) Yikang Shen, Zhouhan Lin, Chin wei Huang, and Aaron Courville. Neural language modeling by jointly learning syntax and lexicon. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkgOLb-0W.
Shen et al. (2019) Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. Ordered neurons: Integrating tree structures into recurrent neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1l6qiR5F7.
Siddiqi and Moore (2005) Sajid M. Siddiqi and Andrew W. Moore. Fast inference and learning in large-state-space hmms. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, page 800–807, New York, NY, USA, 2005. Association for Computing Machinery. ISBN 1595931805. doi: 10.1145/1102351.1102452. URL https://doi-org.proxy.library.cornell.edu/10.1145/1102351.1102452.
Siddiqi et al. (2009) Sajid M. Siddiqi, Byron Boots, and Geoffrey J. Gordon. Reduced-rank hidden markov models. CoRR, abs/0910.0902, 2009. URL http://arxiv.org/abs/0910.0902.
Song et al. (2019) Kyungwoo Song, JoonHo Jang, Seungjae Shin, and Il-Chul Moon. Bivariate beta LSTM. CoRR, abs/1905.10521, 2019. URL http://arxiv.org/abs/1905.10521.
Stoller et al. (2019) Daniel Stoller, Mi Tian, Sebastian Ewert, and Simon Dixon. Seq-u-net: A one-dimensional causal u-net for efficient sequence modelling. CoRR, abs/1911.06393, 2019. URL http://arxiv.org/abs/1911.06393.
Thai et al. (2018) Dung Thai, Sree Harsha Ramesh, Shikhar Murty, Luke Vilnis, and Andrew McCallum. Embedded-state latent conditional random fields for sequence labeling. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 1–10, Brussels, Belgium, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/K18-1001. URL https://aclanthology.org/K18-1001.
Wang et al. (2019) Zhiwei Wang, Yao Ma, Zitao Liu, and Jiliang Tang. R-transformer: Recurrent neural network enhanced transformer, 2019.
Weissenborn et al. (2020) Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rJgsskrFwH.
Yang et al. (2021) Songlin Yang, Yanpeng Zhao, and Kewei Tu. Pcfgs can do better: Inducing probabilistic context-free grammars with many symbols, 2021.
Yu (2010) Shun-Zheng Yu. Hidden semi-markov models. Artificial intelligence, 174(2):215–243, 2010.
Zhang et al. (2021) Hongpo Zhang, Ning Cheng, Yang Zhang, and Zhanbo Li. Label flipping attacks against naive bayes on spam filtering systems. Applied Intelligence, pages 1–12, 2021.
Zhang et al. (2018) Xinyang Zhang, Ningfei Wang, Shouling Ji, Hua Shen, and Ting Wang. Interpretable deep learning under fire. CoRR, abs/1812.00891, 2018. URL http://arxiv.org/abs/1812.00891.
Zhou et al. (2006) Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning with hypergraphs: Clustering, classification, and embedding. Advances in neural information processing systems, 19:1601–1608, 2006.
Zhukov et al. (2019) Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3537–3545, 2019.
Ziegler and Rush (2019) Zachary M. Ziegler and Alexander M. Rush. Latent normalizing flows for discrete sequences, 2019.

Appendix A Expressivity of Low-Rank Models

We focus on the simplest case of HMMs for an analysis of expressivity. In the case of Gaussian emissions, a model with more states but low rank is more expressive than a model with fewer states because for a single timestep, a larger mixture of Gaussians is more expressive. In the case of discrete emissions, however, the emission distribution for a single timestep (i.e. $\sum_{z}p(x,z)$ ) is not more expressive. Instead, we show that there exists joint marginal distributions of discrete $x$ over multiple timesteps that are captured by large state but low-rank HMMs, but not expressible by models with fewer states.

We construct a counter-example with a sequence of length $T=2$ and emission space of $x_{t}\in\left\{0,1,2\right\}$ . We show that a 3-state HMM with rank 2, HMM-3-2, with manually chosen transitions and emissions, cannot be modeled by any 2-state HMM. The transition probabilities for the HMM-3-2 are given by (rows $z_{t}$ , columns $z_{t+1}$ )

p(z_{t+1}|z_{t})=\begin{bmatrix}\frac{1}{3}&\frac{1}{3}&\frac{1}{3}\\ 0&1&0\\ \frac{1}{2}&0&\frac{1}{2}\end{bmatrix}=\begin{bmatrix}\frac{1}{3}&\frac{2}{3}\\ 1&0\\ 0&1\end{bmatrix}\begin{bmatrix}0&1&0\\ \frac{1}{2}&0&\frac{1}{2}\end{bmatrix}=UV^{T},

emission probabilities by (rows $z_{t}$ , columns $x_{t}$ ):

p(x_{t}|z_{t})=\begin{bmatrix}1&0&0\\ 0&1&0\\ 0&0&1\end{bmatrix},

and starting distribution

P(z_{1}\mid z_{0})=\begin{bmatrix}\frac{1}{3}&\frac{1}{3}&\frac{1}{3}\end{bmatrix}.

This yields the following marginal distribution (row $x_{1}$ , column $x_{2}$ ):

p(x_{1},x_{2})=\begin{bmatrix}\frac{1}{9}&\frac{1}{9}&\frac{1}{9}\\ 0&\frac{1}{3}&0\\ \frac{1}{6}&0&\frac{1}{6}\end{bmatrix}.

Next, we show that there does not exist a 2-state HMM that can have this marginal distribution. Assuming the contrary, that there exists a 2-state HMM that has this marginal distribution, we will first show that there is only one possible emission matrix. We will then use that to further show that the posterior, then transitions also must be sparse, resulting in a marginal emission distribution that contradicts the original assumption.

We start by setting up a system of equations. The marginal distribution over observations is obtained by summing over $z_{1},z_{2}$ :

p(x_{1},x_{2})=\sum_{z_{2}}\left(\sum_{z_{1}}p(x_{1},z_{1},z_{2})\right)p(x_{2}\mid z_{2}).

Let the inner term be $f(x_{1},z_{1})=\sum_{z_{1}}p(x_{1},z_{1},z_{2})=p(x_{1},z_{2})$ . In a small abuse of notation, let $p(x_{2}\mid z_{2}=0)$ be a row vector with entries $[p(x_{2}\mid z_{2}=0)]_{x}=p(x_{2}=x\mid z_{2}=0)$ , and similarly for $p(x_{2}\mid z_{2}=1)$ . We then have, first summing over $z_{1}$ ,

P(x_{1},x_{2})=\begin{bmatrix}\frac{1}{9}&\frac{1}{9}&\frac{1}{9}\\ 0&\frac{1}{3}&0\\ \frac{1}{6}&0&\frac{1}{6}\end{bmatrix}=\begin{bmatrix}f(0,0)&f(0,1)\\ f(1,0)&f(1,1)\\ f(2,0)&f(2,1)\end{bmatrix}\begin{bmatrix}p(x_{2}|z_{2}=0)\\ p(x_{2}|z_{2}=1)\end{bmatrix}.

We can determine the first row of the emission matrix, $p(x_{2}\mid z_{2}=0)$ from the second row of this system of equations, rewritten here:

p(x_{2}\mid x_{1}=1)=f(1,0)p(x_{2}|z_{2}=0)+f(1,1)p(x_{2}|z_{2}=1)=\begin{bmatrix}0&\frac{1}{3}&0\end{bmatrix}.

We can deduce that $f(1,0),f(1,1)>0$ , otherwise $p(x_{1},x_{2}=1)=0\neq\frac{1}{3}$ . Without loss of generality, assume $f(1,0)>0$ , then $p(x_{2}=0|z_{2}=0)=p(x_{2}=2|z_{2}=0)=0$ , since $p(x_{1}=1,x_{2}=0)=p(x_{1}=1,x_{2}=2)=0$ . Therefore,

p(x_{2}|z_{2}=0)=\begin{bmatrix}0&1&0\end{bmatrix}.

We can similarly determine the second row of the emission matrix, $p(x_{2}\mid z_{2}=1)$ , from the last row of the system of equations:

p(x_{2}\mid x_{1}=2)=f(2,0)p(x_{2}|z_{2}=0)+f(2,1)p(x_{2}|z_{2}=1)=\begin{bmatrix}\frac{1}{6}&0&\frac{1}{6}\end{bmatrix}.

As we determined that $p(x_{2}|z_{2}=0)=\begin{bmatrix}0&1&0\end{bmatrix}$ , $f(2,0)$ must be 0, otherwise $p(x_{2}=1|x_{1}=2)>0$ . Therefore $f(2,1)p(x_{2}|z_{2}=1)=\begin{bmatrix}\frac{1}{6}&0&\frac{1}{6}\end{bmatrix}$ , yielding

p(x_{2}|z_{2}=1)=\begin{bmatrix}\frac{1}{2}&0&\frac{1}{2}\end{bmatrix}.

Putting it together, the full emission matrix is given by

p(x_{t}|z_{t})=\begin{bmatrix}0&1&0\\ \frac{1}{2}&0&\frac{1}{2}\end{bmatrix}.

This allows us to find the posterior distribution $p(z_{1}\mid x_{1})$ via Bayes’ rule:

p(z_{1}=1\mid x_{1}=1)=\frac{p(x_{1}=1\mid z_{1}=1)p(z_{1})}{p(x_{1}=1)}=\frac{0\cdot p(z_{1})}{p(x_{1}=1)}=0,

implying $p(z_{1}=0\mid x_{1}=1)=1$ . By similar reasoning, we have $p(z_{1}=1\mid x_{1}=0)=1$ and $p(z_{1}=1\mid x_{1}=2)=1$ .

Given the sparse emissions and posteriors, we will show that the transitions must be similarly sparse, resulting in an overly sparse marginal distribution over emissions (contradiction). We can lower bound

0=p(x_{2}=1\mid x_{1}=2)\geq p(x_{2}=1\mid z_{2}=0)p(z_{2}=0\mid z_{1}=1)p(z_{1}=1\mid x_{1}=2)

by the definition of total probability and nonnegativity of probability. Then, substituting $p(x_{2}=1\mid z_{2}=0)=1$ , we have

0=p(x_{2}=1\mid x_{1}=2)\geq p(z_{2}=0\mid z_{1}=1),

from which we can deduce $p(z_{2}=0\mid z_{1}=1)=0$ .

Now, we will show that $p(x_{2}=1\mid x_{1}=0)=0$ , which contradicts the marginal distribution. We have

	$\displaystyle p(x_{2}=1\mid x_{1}=0)$	$\displaystyle=\sum_{z_{1},z_{2}}p(x_{2}=1\mid z_{2})p(x_{2}\mid z_{1})p(z_{1}\mid x_{1}=0)$
		$\displaystyle=p(x_{2}=1\mid z_{2}=1)p(z_{2}=1\mid z_{1}=1)p(z_{1}=1\mid x_{1}=0),$

where we obtained the second equality because $p(z_{1}=0\mid x_{1}=0)=0$ and $p(z_{2}=0\mid z_{1}=1)$ . As $p(x_{2}=1\mid z_{1}=1)=0$ , we have $p(x_{2}=1\mid z_{1}=1)=0\neq\frac{1}{3}$ . As this is a contradiction, we have shown that there exists a marginal distribution modelable with a 3-state HMM with rank 2, but not a 2-state HMM.

Appendix B Low-Rank Hypergraph Marginalization for HMMs and PCFGs

We provide the low-rank hypergraph marginalization algorithms for HMMs and PCFGs in Alg. 4, with loops over labels $z$ (and products of labels) and feature dimensions $n$ left implicit for brevity. We also assume that the label sets for PCFG are uniform for brevity – in practice, this can easily be relaxed (this was not assumed in Alg. 2). We show how the normalizing constants $c$ are explicitly computed using the unnormalized low-rank factors in each algorithm.

Algorithm 4 Low-rank hypergraph marginalization for HMMs and PCFGs

[HMM - Backward]

[\tilde{V}]_{z,n}=[\phi(g(z))]_{n}

[\tilde{U}]_{z,n}=[\phi(f(z))]_{n}

[c]_{z}=[\tilde{U}\tilde{V}^{\top}\mathbf{1}]_{z}

for

t\leftarrow(t+1)

in right-to-left order do

[\beta_{t+1}]_{z_{t+1}}=[\alpha_{t+1}]_{z_{t+1}}

[V_{t}]_{z_{t+1},n}=[\tilde{V}]_{z_{t+1},n}

[U_{t}]_{z_{t},n}=p(x_{t}\mid z_{t})[c]_{z_{t}}[\tilde{U}]_{z_{t},n}

\alpha_{t}\stackrel{{\scriptstyle+}}{{\leftarrow}}U_{t}(V_{t}^{\top}\beta_{t+1})

return

\alpha_{0}^{\top}\mathbf{1}

[PCFG - CKY]

[\tilde{V}]_{z_{1},z_{2},n}=[\phi(g(z_{1},z_{2}))]_{n}

[\tilde{U}]_{z_{u},n}=[\phi(f(z_{u}))]_{n}

[c]_{z_{u}}=[UV^{\top}\mathbf{1}]_{z_{u}}

for

(i,k)\leftarrow(i,j),(j,k)

in span-size order do

[\beta_{i,j,k}]_{(z_{1},z_{2})}=[\alpha_{i,j}]_{z_{1}}[\alpha_{j,k}]_{z_{2}}

[V_{i,j,k}]_{z_{1},z_{2},n}=[\tilde{V}]_{z_{1},z_{2},n}

[U_{i,j,k}]_{z_{u},n}=[c]_{z_{u}}[\tilde{U}]_{z_{u},n}

\alpha_{i,k}\stackrel{{\scriptstyle+}}{{\leftarrow}}U_{i,j,k}(V_{i,j,k}^{\top}\beta_{i,j,k})

return

\alpha_{1,T}^{\top}\mathbf{1}

Appendix C Extension of the Low-Rank Constraint to Other Semirings

Enforcing low-rank constraints in the scoring matrices $\Psi_{e}$ leads to a speedup for the key step in the hypergraph marginalization algorithm:

\Psi_{e}\beta_{v}=\left(U_{e}V_{e}^{\top}\right)\beta_{v}=U_{e}\left(V_{e}^{\top}\beta_{v}\right),

(8)

where $[\beta_{v}]_{z_{1},z_{2}}=[\alpha_{1}]_{z_{1}}[\alpha_{2}]_{z_{2}}$ . While the low-rank constraint allows for speedups in both the log and probability semirings used for marginal inference, the low-rank constraint does not result in speedups in the tropical semiring, used for MAP inference. To see this, we first review the low-rank speedup in scalar form. The key matrix-vector product step of marginal inference in scalar form is given by

	$\displaystyle\sum_{z_{1},z_{2}}[\Psi_{e}]_{z_{u},(z_{1},z_{2})}[\beta]_{z_{1},z_{2}}$	$\displaystyle=\sum_{z_{1},z_{2}}\sum_{n}[U_{e}]_{z_{u},n}[V_{e}]_{(z_{1},z_{2}),n}[\beta]_{z_{1},z_{2}}$
		$\displaystyle=\sum_{n}\sum_{z_{1},z_{2}}[U_{e}]_{z_{u},n}[V_{e}]_{(z_{1},z_{2}),n}[\beta]_{z_{1},z_{2}}$
		$\displaystyle=\sum_{n}[U_{e}]_{z_{u},n}\sum_{z_{1},z_{2}}[V_{e}]_{(z_{1},z_{2}),n}[\beta]_{z_{1},z_{2}},$

which must be computed for each $z_{u}$ . The first line takes $O(\mathcal{L}^{|e|+1})$ computation, while the last line takes $O(\mathcal{L}^{|e|}N)$ computation. The speedup comes rearranging the sum over $(z_{1},z_{2})$ and $n$ , then pulling out the $U_{e}$ factor, thanks to the distributive propery of multiplication. When performing MAP inference instead of marginal inference, we take a max over $(z_{1},z_{2})$ instead of a sum. Unfortunately, in the case of the max-times semiring used for MAP inference, we cannot rearrange max and sum, preventing low-rank models from obtaining a speedup:

	$\displaystyle\max_{z_{1},z_{2}}[\Psi_{e}]_{z_{u},(z_{1},z_{2})}[\beta]_{z_{1},z_{2}}$	$\displaystyle=\max_{z_{1},z_{2}}\sum_{n}[U_{e}]_{z_{u},n}[V_{e}]_{(z_{1},z_{2}),n}[\beta]_{z_{1},z_{2}}$
		$\displaystyle\neq\sum_{n}\max_{z_{1},z_{2}}[U_{e}]_{z_{u},n}[V_{e}]_{(z_{1},z_{2}),n}[\beta]_{z_{1},z_{2}}.$

Appendix D Data Details

For language modeling on Penn Treebank (Ptb) [Marcus et al., 1993] we use the preprocessing from Mikolov et al. [2011], which lowercases all words and substitutes OOV words with UNKs. The dataset consists of 929k training words, 73k validation words, and 82k test words, with a vocabulary of size 10k. Words outside of the vocabulary are mapped to the UNK token. We insert EOS tokens after each sentence, and model each sentence, including the EOS token, independently.

The four polyphonic music datasets, Nottingham (Nott), Piano, MuseData (Muse), and JSB chorales (JSB), are used with the same splits as Boulanger-Lewandowski et al. [2012]. The data is obtained via the following script. Each timestep consists of an 88-dimensional binary vector indicating whether a particular note is played. Since multiple notes may be played at the same time, the effective vocabulary size is extremely large. The dataset lengths are given in Table 3.

		Total Length
Dataset	Avg Len	Train	Valid	Text
Nott	254.4	176,551	45,513	44,463
Piano	872.5	75,911	8,540	19,036
Muse	467.9	245,202	82,755	64,339
JSB	60.3	64,339	4,602	4,725

Table 3: The lengths for the four polyphonic music datasets. The average length of an example in the training split for each dataset is given.

In experiments with PCFGs for language modeling, we also use Ptb, but with the splits and preprocessing used in unsupervised constituency parsing [Shen et al., 2018, 2019, Kim et al., 2019]. This preprocessing discards punctuation, lowercases all tokens, and uses the 10k most frequent words as the vocabulary. The splits are as follows: sections 2-21 for training, 22 for validation, 23 for test. Performance is evaluated using perplexity.

In experiments with HSMMs for video modeling, we use the primary section of the CrossTask dataset [Zhukov et al., 2019], consisting of about 2.7k instructional videos from 18 different tasks such as “Make Banana Ice Cream” or “Change a Tire”. We use the preprocessing from Fried et al. [2020], where pretrained convolutional neural networks are first applied to extract continuous image and audio features for each frame, followed by PCA to project features to 300 dimensions.¹²¹²12https://github.com/dpfried/action-segmentation We set aside 10% of the training videos for validation.

Appendix E Generative Process of HSMM

We use an HSMM to model the generative process of the sequence of continuous features for each video. The HSMM defines the following generative process: first, we sample a sequence of discrete latent states $z=(z_{1},\cdots,z_{K})$ with a first-order Markov model. Next, we sample the length of observations under each state from a Poisson distribution $l_{k}\sim\text{Poisson}(\lambda_{z_{k}})$ truncated at max length $M$ . The joint distribution is defined as

p(x,z,l)=\prod_{k=1}^{K}p(z_{k}\mid z_{k-1})\ p(l_{k}\mid z_{k})\prod_{i=l_{1}+\cdots+l_{k-1}}^{l_{1}+\cdots+l_{k}}p(x_{i}\mid z_{k}),

(9)

where the sequence length $T$ can be computed as $T=\sum_{k=1}^{K}l_{k}$ . In this work, we only consider modeling continuous $x_{t}$ , so we use a Gaussian distribution for $p(x_{i}\mid z_{k})$ .

To compute $p(x)$ , we can marginalize $l,z$ using dynamic programming similar to HMMs, except that we have an additional factor of $M$ : the overall complexity is $O(T\times M\times L^{2})$ (ignoring the emission part since they are usually not the bottleneck). We refer to Yu [2010] for more details.

Appendix F Full Parameterization of HMMs, PCFGs, and HSMMs

In this section, we present more details on the parameterizations of the HMM, PCFG, and HSMM. The main detail is where and how are neural networks used to parameterize state representations.

For low-rank HMMs (LHMMs) we use the following mixed parameterization that specifically targets the state-state bottleneck:

$\displaystyle p(z_{1}\mid z_{0})$	$\displaystyle\propto\phi(f_{1}(\mathbf{u}_{z_{0}}))^{\top}\phi(\mathbf{v}_{z_{1}})$	(10)
$\displaystyle p(z_{t}\mid z_{t-1})$	$\displaystyle\propto\phi(\mathbf{u}_{z_{t-1}})^{\top}\phi(\mathbf{v}_{z_{t}})$
$\displaystyle p(x_{t}\mid z_{t})$	$\displaystyle\propto\exp(\mathbf{u}_{z_{t}}^{\top}f_{2}(\mathbf{v}_{x_{t}})),$

where $\mathbf{u}_{z}$ is the embedding of $z$ when $z$ is used as head, $\mathbf{v}_{z}$ its embedding when used as tail, $f_{1},f_{2}$ are MLPs with two residual layers, and feature map $\phi(x)=\exp(Wx)$ .

The PCFG uses a similar mixed parameterization. These probabilities correspond to start ( $S\to A$ ), preterminal ( $T\to x$ ), and standard productions ( $A\to B\ C$ ) respectively.

$\displaystyle p(z_{1,N}\mid S)$	$\displaystyle\propto\exp(f_{1}(\mathbf{u}_{S})^{\top}\mathbf{u}_{z_{1,N}})$	(11)
$\displaystyle p(x_{i}\mid z_{i})$	$\displaystyle\propto\exp(\mathbf{u}_{z_{i}}^{\top}f_{2}(\mathbf{v}_{x_{i}}))$
$\displaystyle p(z_{i,j},z_{j,k}\mid z_{i,k})$	$\displaystyle\propto\begin{cases}\exp(\mathbf{u}_{z_{i,k}}^{\top}\mathbf{v}_{z_{i,j},z_{j,k}})&\begin{subarray}{c}i+1=j\lor\\ j+1=k\end{subarray}\\ \phi(\mathbf{u}_{z_{i,k}}^{\prime})^{\top}\phi(\mathbf{v}_{z_{i,j},z_{j,k}})&\text{o.w.}\\ \end{cases}$

where $\mathbf{u}_{z}$ / $\mathbf{u}_{z}^{\prime}$ is the embedding of $z$ when $z$ is used as head, $\mathbf{v}_{x}$ / $\mathbf{v}_{z_{1},z_{2}}$ is the embedding of $x$ / $(z_{1},z_{2})$ when they are used as tail, and $f_{1},f_{2}$ are MLPs with two residual layers as in Kim et al. [2019]. We use the feature map $\phi(x)=\exp(Wx-\|x\|_{2}^{2}/2)$ .

For both HMMs and neural PCFG models, we use the same parameterization of the MLPs $f_{1}$ and $f_{2}$ as Kim et al. [2019]:

	$\displaystyle f_{i}(x)$	$\displaystyle=g_{i,1}(g_{i,2}(W_{i}x)),$		(12)
	$\displaystyle g_{i,j}(y)$	$\displaystyle=\mathrm{ReLU}(U_{i,j}\mathrm{ReLU}(V_{i,j}y))+y,$		(12)

with $i,j\in\left\{1,2\right\}$ , and $W_{i},V_{i,j},U_{i,j}\in\mathbb{R}^{D\times D}$ .

For HSMMs, the baseline (HSMM in Table 2) follows the fully unsupervised setting in Fried et al. [2020] except that we don’t apply any state constraints from the prior knowledge of each task.¹³¹³13We got rid of those constraints to allow for changing the total number of states, since otherwise we can’t make any changes under a predefined state space. The model maintains a log transition probability lookup table for $p(z_{k}\mid z_{k-1})$ , a lookup table for the log of the parameters of the Poisson distribution $\lambda_{z_{k}}$ . We maintain a mean and a diagonal covariance matrix for the Gaussian distribution $p(x_{i}\mid z_{k})$ for each $z_{k}$ . For low-rank HSMMs (LHSMMs), we use the same parameterization for $p(z_{k}\mid z_{k-1})$ as in HMMs:

p(z_{k}\mid z_{k-1})\propto\phi(\mathbf{u}_{z_{t-1}})^{\top}\phi(\mathbf{v}_{z_{t}}),

(13)

where $\mathbf{u}_{z}$ is the embedding of $z$ when $z$ is used as head, $\mathbf{v}_{z}$ its embedding when used as tail, and the feature map $\phi(x)=\exp(Wx)$ . The emission parameterization is the same as in baseline HSMMs, a Gaussian kernel.

Appendix G Initialization and Optimization Hyperparameters

We initialize the parameters $W$ , in $\phi(x)=\exp(Wx)$ and variants, of feature maps using orthogonal feature projections [Choromanski et al., 2020], and update it alongside the model parameters during training.

HMM parameters are initialized with the Xavier initialization [Glorot and Bengio, 2010].¹⁴¹⁴14 For banded experiments, we initialize the band parameters by additionally adding 30 to each element. Without this the band scores were too small compared to the exponentiated scores, and were ignored by the model. We use the AdamW [Loshchilov and Hutter, 2017] optimizer with a learning rate of $0.001$ , $\beta_{1}=0.9,\beta_{2}=0.999$ , weight decay 0.01, and a max grad norm of 5. We use a state dropout rate of 0.1, and additionally have a dropout rate of 0.1 on the feature space of LHMMs. We train for 30 epochs with a max batch size of 256 tokens, and anneal the learning rate by dividing by 4 if the validation perplexity fails to improve after 4 evaluations. Evaluations are performed 4 times per epoch. The sentences, which we model independently from one another, are shuffled after every epoch. Batches of sentences are drawn from buckets containing sentences of similar lengths to minimize padding.

For the polyphonic music datasets, we use the same hyperparameters as the language modeling experiments, except a state dropout rate of 0.5 for JSB and Nottingham, 0.1 for Muse and Piano. We did not use feature space dropout in the LHMMs on the music datasets. For Nottingham and JSB, sentences were batched in length buckets, the same as language modeling. Due to memory constraints, Muse and Piano were processed using BPTT with a batch size of 8 for Muse and 2 for Piano, and a BPTT length of 128. We use $D=256$ for all embeddings and MLPs on all datasets, except Piano, which due to its small size required $D=64$ dimensional embeddings and MLPs.

For PCFGs, parameters are initialized with the Xavier uniform initialization [Glorot and Bengio, 2010]. We follow the experiment setting in Kim et al. [2019] and use the Adam [Kingma and Ba, 2017] optimizer with $\beta_{1}=0.75,\beta_{2}=0.999$ , a max grad norm of 3, and we tune the learning rate from $\{0.001,0.002\}$ using validation perplexity. We train for 15 epochs with a batch size of 4. The learning rate is not annealed over training, but a curriculum learning approach is applied where only sentences of at most length 30 are considered in the first epoch. In each of the following epochs, the longest length of sentences considered is increased by 1.

For HSMMs, we use the same initialization and optimization hyperparameters as Fried et al. [2020]: The Gaussian means and covariances are initialized with empirical means and covariances (the Gaussian parameters for all states are initialized the same way and they only diverge through training). The transition matrix is initialized to be uniform distribution for baseline HSMMs, and the transition embeddings are initialized using the Xavier initialization for LHSMMs. The log of Poisson parameters are initialized to be 0. We train all models for 4 epochs using the Adam optimizer with initial learning rate of 5e-3, and we reduce the learning rate 80% when log likelihood doesn’t improve over the previous epoch. We clamp the learning rate to be at least 1e-4. We use a batch size of 5 following Fried et al. [2020], simulated by accumulating gradients under batch size 1 in order to scale up the number of states as much as we can. Gradient norms are clipped to be at most 10 before updating. Training take 1-2 days depending on the number of states and whether a low-rank constraint is used.

We use the following hardware for our experiments: for HMMs we run experiments on 8 Titan RTX GPUs with 24G of memory on an internal cluster. For PCFGs and HSMMs we run experiments on 1 Nvidia V100 GPU with 32G of memory on an internal cluster.

Appendix H HMM Rank Analysis

Table 4 contains the empirical ranks of trained HMMs and LHMMs, estimated by counting the number of singular values greater than 1e-5. Note that the feature dimension $N$ is the maximum attainable rank for the transition matrix of an LHMM. Although LHMMs often manage to achieve the same validation perplexity as HMMs at relatively small $N$ , the ranks of the transition matrices are much lower than both their HMM counterparts as well as $N$ . At larger state sizes, the ranks of learned matrices are almost half of their max achievable rank. Interestingly, this holds true for HMMs as well, with the empirical rank of the transition matrices significantly smaller than the number of states. Whether this implies that the models can be improved is left to future investigations.

Model	$L$	$N$	rank $(A)$	rank $(O)$	Val PPL
HMM	16384	-	9187	9107	144
LHMM	16384	8192	2572	7487	141
LHMM	16384	4096	2016	7139	144
LHMM	16384	2048	1559	6509	141
LMM	8192	-	5330	5349	152
LHMM	8192	4096	1604	5113	149
LHMM	8192	2048	1020	4980	153
LHMM	8192	1024	791	5033	161
HMM	4096	-	2992	3388	155
LHMM	4096	2048	1171	3300	154
LHMM	4096	1024	790	2940	156
LHMM	4096	512	507	3186	163

Table 4: Ranks and validation perplexities for HMMs and LHMMs. The number of states is given by

L

and the dimensionality of the feature space by

N

. The HMM uses softmax for the emission, and therefore does not have a value for

N

. The transition matrix is denoted by

A

, and the emission matrix by

O

. The rank was estimated by counting the number of singular values greater than 1e-5. Models were trained with 0.1 state and feature dropout.

Appendix I Low-rank and Banded HMM Parameterization

In some scenarios, the low-rank constraint may be too strong. For example, a low-rank model is unable to fit the identity matrix, which would have rank $L$ . In order to overcome this limitation, we extend the low-rank model while preserving the computational complexity of inference. We perform experiments with an additional set of parameters $\theta\in\mathbb{R}^{L\times L}$ which allow the model to learn high-rank structure (the experimental results can be found in Tbl. 1). We constrain $\theta$ to have banded structure, such that $[\theta]_{z_{t-1},z_{t}}=0$ if $|z_{t}-z_{t-1}|>N/2$ . See Fig. 3 for an illustration of banded structure.

Figure 3: An example of a banded matrix with width

N

, which has

N/2

nonzero elements on both sides of the diagonal for each row.

Let band segment $B_{z}=\left\{z^{\prime}:|z-z^{\prime}|\leq N/2\right\}$ . The transition probabilities are then given by

p(z_{t}\mid z_{t-1})=\frac{[\theta]_{z_{t-1},z_{t}}+\phi(\mathbf{u}_{z_{t-1}})^{\top}\phi(\mathbf{v}_{z_{t}})}{Z_{z_{t-1}}},

(14)

with normalizing constants

	$\displaystyle Z_{z_{t-1}}$	$\displaystyle=\sum_{z_{t}}[\theta]_{z_{t-1},z_{t}}+\phi(\mathbf{u}_{z_{t-1}})^{\top}\phi(\mathbf{v}_{z_{t}})$		(15)
		$\displaystyle=\sum_{z_{t}\in B_{z_{t-1}}}[\theta]_{z_{t-1},z_{t}}+\phi(\mathbf{u}_{z_{t-1}})^{\top}\sum_{z_{t}}\phi(\mathbf{v}_{z_{t}}).$		(15)

The normalization constant for each starting state $Z_{z_{t-1}}$ can be computed in time $O(N)$ .

This allows us to perform inference quickly. We can use the above to rewrite the score matrix $\Psi_{t}\propto\theta+UV^{\top}$ , which turns the inner loop of Eqn. 3 (specialized to HMMs) into

\alpha_{t}=\Psi_{t}\beta_{t+1}\propto(\theta+UV^{\top})\beta_{t+1}=\theta\beta_{t+1}+U(V^{\top}\beta_{t+1}),

(16)

omitting constants (i.e. emission probabilities and normalizing constants). Since $\theta$ is banded, the banded matrix-vector product $\theta\beta_{t}$ takes time $O(LN)$ . This update, in combination with the low-rank product, takes $O(LN)$ time total. Each update in the hypergraph marginalization algorithm is now 3 matrix-vector products costing $O(LN)$ each, preserving the runtime of inference.

Appendix J Music Results

The full results on the polyphonic music modeling task can be found in Tbl. 5, with additional models for comparison. Aside from the RNN-NADE [Boulanger-Lewandowski et al., 2012], which models the full joint distribution of notes as well as temporal dependencies; autoregressive neural R-Transformer [Wang et al., 2019] (as reported by Song et al. [2019]) and LSTM (as reported by Ziegler and Rush [2019]); latent continuous LV-RNN [Gu et al., 2015] and SRNN [Fraccaro et al., 2016]; and latent discrete TSBN [Gan et al., 2015] and the baseline HMM; we additionally include the autoregressive Seq-U-Net Stoller et al. [2019], the continuous latent STORN [Bayer and Osendorfer, 2015], DMM [Krishnan et al., 2016] and LNF [Ziegler and Rush, 2019].

Model	Nott	Piano	Muse	JSB
RNN-NADE	2.31	7.05	5.6	5.19
Seq-U-Net	2.97	1.93	6.96	8.173
R-Transformer	2.24	7.44	7.00	8.26
LSTM	3.43	7.77	7.23	8.17
STORN	2.85	7.13	6.16	6.91
LV-RNN	2.72	7.61	6.89	3.99
SRNN	2.94	8.2	6.28	4.74
DMM	2.77	7.83	6.83	6.39
LNF	2.39	8.19	6.92	6.53
TSBN	3.67	7.89	6.81	7.48
HMM	2.43	8.51	7.34	5.74
LHMM	2.60	8.89	7.60	5.80

Table 5: Polyphonic music negative log-likelihood, measured in nats. The HMM models have

\mathcal{L}=2^{11}

states and the LHMM has rank

N=2^{9}

, a 4:1 state:rank ratio.

Appendix K PCFG Analysis

Kernel for $B\ C$				PPL
$\mathcal{N}\times\mathcal{N}$	$\mathcal{N}\times\mathcal{P}$	$\mathcal{P}\times\mathcal{N}$	$\mathcal{P}\times\mathcal{P}$	PPL
SM	SM	SM	SM	243.19
LR	SM	SM	SM	242.72
LR	LR	LR	SM	259.05
LR	LR	LR	LR	278.60

Table 6: Model perplexities evaluated on the validation set of Ptb. Here we use

|\mathcal{N}|=30

|\mathcal{P}|=60

, and

N=16

rank. SM denotes the use of softmax, while LR a low-rank factorization.

Figure 4 shows the entropy distribution of the production rules $H(P(B\ C|A))$ for both using softmax kernel and the approximation. The average entropies of the two distributions are close. Besides, under this setting, $P(B\ C\in\mathcal{N}\times\mathcal{N}|A)$ are close for both kernels as well (softmax 0.20, linear 0.21), eliminating the possibility that the kernel model simply learns to avoid using $B\ C\in\mathcal{N}\times\mathcal{N}$ (such as by using a right-branching tree).

In Table 6, we consider the effects of the mixed parameterization, i.e. of replacing the softmax parameterization with a low-rank parameterization. In particular, we consider different combinations of preterminal / nonterminal tails $B\ C\in\mathcal{N}\times\mathcal{N}$ , $B\ C\in\mathcal{N}\times\mathcal{P}$ , $B\ C\in\mathcal{P}\times\mathcal{N}$ , and $B\ C\in\mathcal{P}\times\mathcal{P}$ (our main model only factorizes nonterminal / nonterminal tails). Table 6 shows that we get the best perplexity when we only use $K$ on $B\ C\in\mathcal{N}\times\mathcal{N}$ , and use softmax kernel $K_{\mathrm{SM}}$ for the rest of the space. This fits with previous observations that when the label space $|\mathcal{L}|$ is large, a model with a very small rank constraint hurts performance.¹⁵¹⁵15In this particular ablation study, the size of $\mathcal{N}\times\mathcal{N}$ is only one-ninth of the total state space size $\{\mathcal{N}\cup\mathcal{P}\}\times\{\mathcal{N}\cup\mathcal{P}\}$ .

Appendix L Speed and Accuracy Frontier Analysis

We provide plots of the speed and accuracy over a range of model sizes for HMMs and PCFGs, in Figure 5 (left and right respectively). Speed is measured in seconds per batch, and accuracy by perplexity. Lower is better for both.

For HMMs, we range over the number of labels $L\in\left\{2^{10},2^{11},2^{12},2^{13},2^{14}\right\}$ . For softmax HMMs, more accurate models are slower, as shown in Figure 5 (left). However, we find that for any given accuracy for a softmax model, there exists a similarly accurate LHMM that outspeeds it. While we saw earlier in Figure 1 that at smaller sizes the low-rank constrain hurt accuracy, a model with a larger state size but lower rank achieves similar accuracy at better speed compared to a small HMM.

For PCFGs, we range over $L\in\left\{90,180,300\right\}$ . We find a similar trend compared to HMMs: accuracy results in slower models, as shown in Figure 5 (right). However, the LPCFG does not dominate the frontier as it did with HMMs. We hypothesize that this is because of the small number of labels in the model. In the case of HMMs, smaller softmax HMMs were more accurate than the faster low-rank versions, but larger LHMMs with low rank were able to achieve similar perplexity at faster speeds. This may be realized by exploring LPCFGs with more state sizes, or simply by scaling further.

Appendix M Potential Negative Impact

While work on interpretable and controllable models is a step towards machine that can more easily be understood by and interact with humans, introducing external-facing components leaves models possibly more vulnerable to adversarial attacks. In particular, the interpretations (in conjunction with the predictions) afforded by interpretable models may be attacked [Zhang et al., 2018]. Additionally, models with simple dependencies may be easier for adversaries to understand and then craft attacks for [Zhang et al., 2021, Liu et al., 2018].