Model Stealing for Any Low-Rank Language Model
Abstract
Model stealing, where a learner tries to recover an unknown model via carefully chosen queries, is a critical problem in machine learning, as it threatens the security of proprietary models and the privacy of data they are trained on. In recent years, there has been particular interest in stealing large language models (LLMs). In this paper, we aim to build a theoretical understanding of stealing language models by studying a simple and mathematically tractable setting. We study model stealing for Hidden Markov Models (HMMs), and more generally low-rank language models.
We assume that the learner works in the conditional query model, introduced by Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24]. Our main result is an efficient algorithm in the conditional query model, for learning any low-rank distribution. In other words, our algorithm succeeds at stealing any language model whose output distribution is low-rank. This improves upon the result in [KKMZ24] which also requires the unknown distribution to have high “fidelity” – a property that holds only in restricted cases. There are two key insights behind our algorithm: First, we represent the conditional distributions at each timestep by constructing barycentric spanners among a collection of vectors of exponentially large dimension. Second, for sampling from our representation, we iteratively solve a sequence of convex optimization problems that involve projection in relative entropy to prevent compounding of errors over the length of the sequence. This is an interesting example where, at least theoretically, allowing a machine learning model to solve more complex problems at inference time can lead to drastic improvements in its performance.
1 Introduction
Proprietary machine learning models are often highly confidential. Not only are their weights not publicly released, but even their architecture and hyperparameters used in training are kept a closely guarded secret. And yet these models are often deployed as a service, allowing users to make queries to the model and receive answers. These answers can take the form of labels or completions of prompts, and sometimes a model will even report additional information such as its confidence scores. This raises a natural question:
Question.
Are these black-box models actually secure, or is it possible to reverse engineer their parameters or replicate their functionality just from query access to them?
This task is called model stealing, and it threatens the security of proprietary models and the privacy of the data they are trained on. Beyond nefarious purposes, it can also be used in model distillation [MCH+21], where we have trained a large and highly complex model and want to transfer its knowledge to a much smaller one. It can also be a useful tool for identifying vulnerabilities, as those are often inherited by stolen models.
In any case, model stealing continues to be a very active area of research. The influential work in [TZJ+16] showed that there are simple and efficient attacks on popular models like logistic regression, decision trees and deep neural networks that often work in practice. Since then, many new attacks and defenses have been formulated [HLXS21, HJB+21, RST19, WG18, JSMA19, OMR23, OSF19, WXGD20]. There are also approaches based on embedding watermarks [JCCCP21, ZWL23, LZJ+22] that make it possible to detect when one model has been stolen from another. In recent years, there has been particular interest in stealing large language models (LLMs). Various works have shown how to steal isolated components of a language model such as the decoding algorithm [NKIH23], prompts used to fine-tune the model [SZ24], and even the entire weight matrix of the last layer (the embedding projection layer) [CPD+24].
In this work, our main interest will be in theoretical foundations for stealing language models. As is all too familiar, proving rigorous end-to-end guarantees when working with modern machine learning models with all their bells and whistles seems to be an extremely difficult task. For example, while we can understand the training dynamics on multilayer neural networks in terms of gradient flow in the Wasserstein space of probability distributions [MMM19, NP23], it has turned out to be quite difficult to analyze these dynamics except in simplified settings with a high degree of symmetry. Even worse, there are strong lower bounds for learning deep neural networks [KS09] even with respect to nice input distributions [GGJ+20, DKKZ20, GGK20]. The task of reasoning about modern language models seems no easier, as they are built on transformers [VSP+17] with many building blocks such as word embeddings, positional encodings, queries, keys, values, attention, masking, feed-forward neural networks and layer normalization.
Nevertheless there are often simplified models that abstract important features of more complex models and give us a sandbox in which to try to find theoretical explanations of empirical phenomena. For example, analyzing the dynamics of gradient descent when training a deep neural network is notoriously difficult. But in an appropriate scaling limit, and when the network is wide enough, it can be approximated through the neural tangent kernel [JGH18]. For recurrent neural networks, a popular approach is to analyze gradient descent on linear dynamical systems instead [HMR18]. Likewise for language models, it is natural to work with Hidden Markov Models (HMMs), which are in some sense the original language model, dating back to the work of Claude Shannon in 1951 [Sha51] and were the basis of other early natural language processing systems including the IBM alignment models. More broadly, we can consider a generalization called low-rank language models (Definition 1.1). This brings us to our main questions:
Question.
Is there an efficient algorithm for stealing HMMs from query access? What about more generally for low-rank language models?
These questions were first introduced and studied in an exciting recent work of Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24]. However their motivation and framing was somewhat different, as we will explain.
1.1 Main Results
Formally, we view a language model as a distribution over for some alphabet and sequence length . For simplicity, we treat the sequence length as fixed. Following Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24], the rank of the distribution generated by a language model is defined as follows:
Definition 1.1.
[Low Rank Distribution] A distribution over for alphabet of size and sequence length is rank if for all , the matrix, , with entries equal to for and has rank at most .
In other words, a distribution is rank- if for any , the information in the prefix of length can be embedded in an -dimensional space such that the distribution of the future tokens can be represented as a linear function of this embedding. We note that low-rank distributions are expressive and encompass distributions generated by a Hidden Markov Model (HMM) with hidden states (see Fact 2.2).
Next, we formalize the setup for studying model stealing. We allow the learner to make conditional queries – that is, the learner can specify a history of observations, and then receives a random sample from the conditional distribution on the future observations. Formally, we have the following definition:
Definition 1.2 (Conditional Query).
The learner may make conditional queries to a distribution by querying a string where . Upon making this query, the learner obtains a string of length drawn from the distribution .
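For concreteness, the conditional query oracle can be pictured via the following brute-force sketch (our own illustrative code; the class and method names are not from the paper, and a real target model would of course not be represented by an explicit table of probabilities):

import random
from typing import Sequence, Tuple

class ConditionalOracle:
    # Brute-force reference implementation of the conditional query model
    # (Definition 1.2) for a small, explicitly given distribution D.
    def __init__(self, support: Sequence[Tuple[int, ...]], probs: Sequence[float]):
        self.support = list(support)   # all length-T strings in the support of D
        self.probs = list(probs)       # their probabilities under D

    def sample_future(self, prefix: Sequence[int]) -> Tuple[int, ...]:
        t = len(prefix)
        # keep only strings consistent with the queried prefix (assumed to
        # have nonzero probability) and sample the remaining characters
        idx = [i for i, s in enumerate(self.support) if s[:t] == tuple(prefix)]
        weights = [self.probs[i] for i in idx]
        i = random.choices(idx, weights=weights, k=1)[0]
        return self.support[i][t:]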
In this model, our goal is to design an algorithm that makes a total number of queries that is polynomial in and learns an efficiently samplable distribution that is -close in total variation distance to . This conditional query model was recently introduced by Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24]. Their motivation was two-fold: First, while learning an HMM from random samples is known to be computationally hard [MR05], in principle one can circumvent these barriers if we are allowed conditional samples. Second, a solution to their problem would generalize Angluin’s classic algorithm which learns deterministic finite automata from membership queries [Ang87]. In terms of results, Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24] introduced a notion that they called fidelity and gave a polynomial time algorithm to learn any low-rank distribution (and thus any HMM) which has high fidelity through conditional queries. However this property does not always hold. Thus, their main question still remains: Is it possible to learn arbitrary low-rank distributions through conditional queries? Here we resolve this question in the affirmative. We show:
Theorem 1.3.
Assume we are given conditional query access to an unknown rank distribution over where . Then given a parameter , there is an algorithm that takes conditional queries and running time and with probability outputs a description of a distribution such that is -close in TV distance to . Moreover there is an algorithm that samples from in time.
Note that crucially, the algorithm only makes conditional queries for learning the representation of . Once we have this learned representation, we may draw as many samples as we want without making any more queries to the original distribution .
1.2 Discussion
Theorem 1.3 shows that we can efficiently learn any low-rank distribution via conditional queries. Thus, we can view our results as showing that in some sense, the rank of a distribution can be a useful proxy for understanding the complexity of model stealing, similar to how complexity measures, such as Bellman rank [JKA+17] and its relatives [FKQR21], are useful for understanding the statistical complexity of learning a near optimal policy in reinforcement learning.
There is a key conceptual insight driving our algorithm. One of the challenges in learning sequential distributions is that the error can grow exponentially in the sequence length . In particular if we imagine sampling from one token at a time, the low-rank structure ensures that we only need to keep track of an -dimensional hidden state at each step. However, each step involves multiplication by a change-of-basis matrix and these repeated multiplications cause the error to grow exponentially. The key to mitigating this error blowup is that we combine each change-of-basis with a projection step, where we solve a convex optimization problem that performs a projection with respect to KL divergence. Crucially, projection in KL divergence has a contractive property (see Fact 3.9) that does not hold for other natural measures of distance between distributions, such as TV distance. We give a more detailed overview of our algorithm in Section 3. This is an interesting example where allowing a machine learning model to solve more complex problems at inference time can lead to drastic improvements in its performance. Of course, phrased in general terms, this is a driving philosophy behind OpenAI’s o1 model. But so far we have little theoretical understanding of the provable benefits of allowing more compute at inference time.
1.3 Related Work
There has been a long line of work on learning HMMs from random samples. Mossel and Roch [MR05] gave the first polynomial time algorithms that work under appropriate full-rankness conditions on the transition and observation matrices. Other works gave spectral [HKZ12] and method of moments based approaches [AHK12]. Learning HMMs can also be thought of as a special case of learning phylogenetic trees [CGG01, MR05]. Other approaches assume the output distributions belong to a parametric family [KNW13], or study quasi-HMMs [HGKD15].
The main computational obstruction in this area is that HMMs, without full rankness conditions, can encode noisy parities [MR05], which are believed to be computationally hard to learn. In the overcomplete setting, where the hidden state space is allowed to be larger than the observation space, there are ways around these lower bounds. First, one can aim to predict the sequence of observations, rather than learning the parameters [SKLV18]. Under a natural condition called multi-step observability one can get quasi-polynomial time algorithms [GMR23]. Second, one can make structural assumptions that the transition matrices are sparse, well-conditioned, and have small probability mass on short cycles [SKLV17]. Alternatively there are polynomial time algorithms that work under the assumption that the transition and observation matrices are smoothed [BCPV19].
There are also related models called linear dynamical systems where the hidden states and observations are represented as vectors. In contrast, in HMMs the hidden states and observations take on only a finite set of possible values. There is a long line of work on learning linear dynamical systems too [HMR18, OO19, TP19, SRD19, SBR19, BLMY23a, CP22, BLMY23b]. However, these algorithms require various structural assumptions, and even the weakest ones, namely condition-number bounds on the so-called observability and controllability matrices, are known to be necessary [BLMY23a].
The conditional query model has also been studied for other statistical problems such as testing discrete distributions and learning juntas [CFGM13, CRS15, BC18, CJLW21]. However, in most of these settings, the goal of studying the conditional query model is to obtain improved statistical rates rather than sidestep computational hardness. We remark that there are also many classic examples of problems in computational learning theory where, when allowed to make queries, there are better algorithms [Jac97] than if we are only given passive samples.
2 Basic Setup
Recall that we have conditional query access to some unknown distribution over for some alphabet . We also assume that has rank (recall Definition 1.1). Our goal will be to learn a description of via conditional queries. Let us first formally define Hidden Markov Models (HMMs):
Definition 2.1 (Hidden Markov Model).
An HMM with state space , , and observation space , , is specified by an initial distribution over the states, a transition matrix , and an emission matrix . For a given sequence length , the probability of generating a given sequence with each is
Next we give a basic observation (see [KKMZ24]) that the class of rank distributions contains the set of all distributions generated by an HMM with hidden states.
Fact 2.2.
[KKMZ24] Let be a distribution over generated by an HMM with hidden states. Then is a rank at most distribution.
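This fact can be checked numerically on small instances. The sketch below (our own code, with made-up toy sizes) builds, for each split t, the matrix of joint probabilities of prefixes and futures under a random HMM, computed via the forward recursion, and verifies that its numerical rank is at most the number of hidden states:

import itertools
import numpy as np

rng = np.random.default_rng(0)
n, O, T = 3, 4, 5                      # hidden states, observation alphabet size, length

# Random HMM parameters: initial distribution, transitions, emissions.
pi = rng.dirichlet(np.ones(n))
A = rng.dirichlet(np.ones(n), size=n)  # A[i, j] = Pr[next state j | state i]
B = rng.dirichlet(np.ones(O), size=n)  # B[i, o] = Pr[emit o | state i]

def joint_prob(seq):
    # Pr[x_1, ..., x_m] under the HMM, via the forward recursion.
    alpha = pi * B[:, seq[0]]
    for o in seq[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

for t in range(1, T):
    prefixes = list(itertools.product(range(O), repeat=t))
    futures = list(itertools.product(range(O), repeat=T - t))
    P = np.array([[joint_prob(p + f) for f in futures] for p in prefixes])
    print(t, np.linalg.matrix_rank(P, tol=1e-9))   # prints a value at most n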
Notation
Throughout this paper, we will use the same global notation for the unknown distribution with rank , observation space of size and sequence length . We also write for sequences of length less than . This will be used to denote the probability that conditioned on the prefix , the observations immediately following are those in .
We use the following notation for prefixes and conditional probabilities. For a distribution on and , we use to denote the distribution induced by on -character prefixes. For any prefix , we use to denote the distribution on futures given by . Finally, for a string and character , we use to denote the string of characters obtained by appending to .
3 Technical Overview
Our algorithm is composed of two main parts: We work with a certain representation of the distribution and show how to estimate it from conditional samples. Then we give an algorithm that takes this learned representation and can generate samples. In most learning problems, estimating the parameters is the challenging part and sampling is often trivial. But in our case, designing the sampling algorithm, which involves solving a sequence of convex optimization problems, is one of the key ingredients.
3.1 Idealized Representation
In this section, we introduce the representation that will be central to our learning algorithm. First we consider an idealized setting where we ignore issues like sampling noise and the fact that we cannot afford to work with exponentially large vectors directly. We include this subsection for pedagogical reasons as many of the ideas are already in [KKMZ24]. Nevertheless, it will be a useful setup for explaining our new contributions in the following subsections.
We slightly abuse notation and for , let denote the vector whose entries are as ranges over all elements of . Now Definition 1.1 tells us that as ranges over , all of the vectors are contained in an -dimensional subspace. Thus, we only need to store of these vectors to “span” the whole space.
Barycentric Spanners
This leads us to the notion of a barycentric spanner, which is the key building block in the representation we will use:
Definition 3.1 (Barycentric Spanner).
For a collection of vectors , we say another set of vectors is a -spanner for them if for all , there are coefficients with such that
Crucially, for any set of vectors that are exactly contained in a -dimensional subspace, there is always a subset of them that form a barycentric spanner for the full collection.
Fact 3.2.
Let be arbitrary vectors in . Then there exists a subset with such that form a spanner for the collection .
Thus ideally, for each , we could store a subset with and such that is a -barycentric spanner for .
Change-of-Basis
In addition to these spanning subsets, we will need to store some additional information about sampling probabilities and “change-of-basis” between these spanning sets. In particular we claim that if, in addition to the barycentric spanners, we also had the following information, then it would be enough to sample:
-
•
The next character probabilities (under ) for all of the elements of
-
•
For each , , , coefficients such that
(1) i.e. we know how to write as a linear combination of the vectors . We refer to this as a “change-of-basis”.
To see why these suffice, let’s consider sampling a string one character at a time. Assume we have sampled so far. Throughout the process we will maintain a set of coefficients such that
(2) |
Since we know the next-character distributions for all of , the above allows us to exactly compute the next-character distribution for . After sampling a character, say from this distribution, we can then re-normalize to obtain
(3) |
We can then substitute (1) into the above to rewrite the right hand side as a linear combination of the vectors . And now we can iterate this process again to sample the next character. The key point is that once we have a barycentric spanner, we do not need to store the coefficients for each history but rather can track how they evolve.
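Putting these pieces together, the idealized sampling loop can be sketched as follows (our own code; S, next_char, and basis store exactly the quantities listed above, exact arithmetic is assumed, and at step 0 the spanner can be taken to be just the empty history):

import numpy as np

def idealized_sample(S, next_char, basis, T, O, rng):
    # alpha expresses the conditional distribution of the current sampled
    # prefix as a linear combination of the spanner histories in S[t]
    alpha = {s: (1.0 if s == () else 0.0) for s in S[0]}
    sampled = []
    for t in range(T):
        # next-character distribution of the current history, as in (2)
        p = sum(alpha[s] * np.asarray(next_char[t][s]) for s in S[t])
        p = np.maximum(p, 0.0)          # guard against tiny numerical negatives
        p = p / p.sum()
        o = int(rng.choice(O, p=p))
        sampled.append(o)
        if t + 1 == T:
            break
        # re-normalize as in (3) and change basis into S[t+1] via (1)
        new_alpha = {s2: 0.0 for s2 in S[t + 1]}
        for s in S[t]:
            weight = alpha[s] * next_char[t][s][o] / p[o]
            for s2, c in basis[t][(s, o)].items():
                new_alpha[s2] += weight * c
        alpha = new_alpha
    return sampled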
Challenges
If we want to turn the above representation into a learning algorithm, there are some major obstacles. The vectors are exponentially large, and we only have approximate (but not exact) access to them. When our estimates are noisy, the sampling approach above has issues as the error may grow multiplicatively after each iteration.
3.2 Learning the Representation
Simulating pdf access to that is close to
Our first step is to show that, by using conditional queries, we can simulate exact p.d.f. access to a distribution that is -close to in TV distance for any inverse-polynomially small .
Lemma 3.3 (Informal, see Lemma 4.6).
There is a distribution such that for all , the conditional distributions and are -close, and we can simulate query access to the exact p.d.f. of all conditional distributions of .
Note that crucially this exact p.d.f. access is only for a distribution that is -close to and not for itself (as in the latter case [KKMZ24] already gives an algorithm for learning). For most of the algorithm, we will work directly with and only use, implicitly in the analysis, that it is close to some low rank distribution .
3.2.1 Dimensionality Reduction
Recall that our goal is to find a barycentric spanner for the set of vectors . These vectors are close to the vectors , so we will try to construct an approximate spanner for this latter collection. The first problem is that these vectors have exponentially many entries. In order to compute with them efficiently, we introduce a type of dimensionality reduction that allows us to subsample a subset of only polynomially many coordinates and only work with the restriction to those coordinates. First, let be some distribution over strings in . Now we sample polynomially many elements from . And for each , we associate with the reduced vector
(4) |
where denotes the density of at . Recall that by assumption, we have exact access to the densities in the numerator. Thus as long as we also have exact density access to , then the above entries can be computed exactly. We first observe that these vectors give unbiased estimates for linear functionals of the original distributions .
Fact 3.4.
For any distribution over , in expectation over the random draws from , for any subset and real coefficients ,
In other words, at least in expectation, the subsampling of the entries preserves the norm of linear combinations of the vectors. Of course, the variance could be prohibitively large depending on the choice of . However, with a careful choice of , we can obtain concentration for the above quantities as well. The goal of our dimensionality reduction procedure is captured by the following statements:
Definition 3.5.
For a subset , we say vectors are -representative for the distributions if
for all sets of coefficients with .
Proposition 3.6.
[Informal, see Lemma 5.10 for more details] For and
If we draw samples from for and set
then with high probability the vectors are -representative for the distributions .
To see why the above holds, note that the entries of are all at most . Thus for any accuracy parameter , by taking , we can union bound over a net and get concentration in Fact 3.4. The main point now is that when we want to compute a barycentric spanner we can work with the representative vectors instead.
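The construction can be sketched as follows (our own code; following Lemma 5.10 we take the reference distribution to be the uniform mixture of the distributions being compressed, and the sanity check at the end illustrates Fact 3.4 and Proposition 3.6):

import numpy as np

def reduced_vectors(dists, m, rng):
    # dists: array of k full probability vectors over a (large) domain.
    # Sample m futures from the uniform mixture E and build, for each
    # distribution, an m-dimensional vector of importance weights
    # D_i(f_j) / (m * E(f_j)).
    dists = np.asarray(dists)                   # shape (k, domain_size)
    E = dists.mean(axis=0)                      # reference distribution
    samples = rng.choice(len(E), size=m, p=E)   # f_1, ..., f_m drawn from E
    return dists[:, samples] / (m * E[samples]) # shape (k, m)

# Sanity check: for bounded coefficients c, the l1 norm of the reduced
# combination concentrates around the l1 norm of the full combination.
rng = np.random.default_rng(0)
k, domain, m = 4, 100_000, 5_000
dists = rng.dirichlet(np.ones(domain), size=k)
V = reduced_vectors(dists, m, rng)
c = rng.uniform(-1, 1, size=k)
print(np.abs(c @ V).sum(), np.abs(c @ dists).sum())   # the two values should be close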
3.2.2 Computing a Spanner in the Reduced Space
Section 3.2.1 deals with the fact that there are exponentially many possible futures. There are also exponentially many histories, but because they are close to some low-dimensional subspace, we show that sampling a polynomial number of them suffices to obtain a representative subset.
Formally, we sample histories from . While we cannot ensure that the collection
contains a barycentric spanner for all of (because there could be some that is sampled with very low probability), we can still use Fact 3.2 to argue that this set contains a subset of size that is a barycentric spanner for most of i.e. with high probability over drawn from , is close to a bounded linear combination of vectors in this subset.
Now to compute the barycentric spanner, by Proposition 3.6, we can construct representative vectors and it suffices to work with these, which have polynomial size. Then we can run an algorithm for computing approximate barycentric spanners (see Lemma 5.6) on this collection. There are additional technical details due to the error i.e. the vectors are close to, but not exactly contained in, an -dimensional subspace. For the algorithm to work, it is crucial that the error is smaller than the dimensionality of the vectors so we need to ensure that (see Section 6.1 for more details). Omitting these technical details, we have now sketched the proof of the following:
Lemma 3.7 (Informal, see Lemma 5.6).
With conditional queries and runtime, we can compute sets such that for each , and the collection is a -spanner for a -fraction of (where the fraction is measured with respect to ).
For simplicity, in this overview, we even assume that is actually an approximate spanner for all of — in the full proof, roughly the “exceptional” possibilities for can be absorbed into the failure probability.
3.2.3 A Reduced Representation
Let consist of all strings in and all strings obtained by taking a string in and appending a character in . In other words . Now the set of information that we store is as follows:
Definition 3.8 (Learned representation (informal)).
Our learned representation consists of the following information:
-
•
Sets for
-
•
The next character probabilities (under ) for all of the elements of
-
•
Vectors that are representative for the distributions (can be constructed using Proposition 3.6).
This is our first main departure from the idealized representation: rather than storing explicit linear combinations as in (1), we simply store the representative vectors and defer the computation of the “change-of-basis” to the sampling algorithm. For technical reasons, the full algorithm requires storing some additional information, but we omit these details in this overview. See Section 6.2 for details.
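Concretely, one can picture the learned representation as the following container (a sketch with our own field names; the extra bookkeeping mentioned above is omitted):

from dataclasses import dataclass
from typing import Dict, List, Tuple
import numpy as np

History = Tuple[int, ...]   # a prefix, encoded as a tuple of characters

@dataclass
class LearnedRepresentation:
    # Sketch of Definition 3.8.  Note there is no explicit change-of-basis:
    # it is recomputed at sampling time via the KL projection of Section 3.3.
    spanners: List[List[History]]              # the spanner sets, one per step
    next_char: List[Dict[History, np.ndarray]] # next-character probabilities
                                               # for each spanner element
    reduced: List[Dict[History, np.ndarray]]   # representative vectors for the
                                               # conditional distributions of the
                                               # spanner elements and their
                                               # one-character extensions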
3.3 Sampling Step
As in the idealized sampling algorithm in Section 3.1, we sample a string one character at a time. At each step , we maintain a linear combination of the strings as in (2), except with conditional distributions with respect to . We then sample a character and re-normalize as in (3). The main difficulty now is the change-of-basis step.
Naive Attempt: Linear Change-of-Basis
Naively, for each and , we could simply pre-compute coefficients such that
and thus (since the vectors are representative for the distribution ),
We can then substitute this into the re-normalized relation (see (3)) and get
(5) |
This gives us an expression for as a linear combination of the vectors for . However, the issue with this approach is that the coefficients and approximation error may grow exponentially with .
Need for Projection
The fact that the coefficients may grow motivates the need for a “projection” step, where we reduce the magnitudes of the coefficients. One natural attempt is to take the coefficients in (5) and “project” them by finding an equivalent set of coefficients of bounded magnitude, say , such that
(6) |
In fact, there exists a set of coefficients of bounded magnitude such that
(7) |
with high probability over , by Lemma 3.7. However, there is a subtle issue with this approach. We only have access to the vectors so we can only compute (succinct representations) of the expressions in (6) but not of .
The issue is captured by the following abstraction. Let be the vector that is equal to the left hand side of (6). Let be the convex set consisting of all possible vectors obtainable on the right hand side of (6) for coefficients bounded by some constant. Let . We know (up to some small error), but we do not know . Now we want to map to a point in . Ideally, we want to guarantee , up to some small additive error. The issue is that we cannot guarantee better than
due to the fact that is unknown, which forces us to use the triangle inequality. In other words, the problem is that when projecting onto a convex set in norm, we may still double the distance to some of the elements of the convex set, causing the error to grow multiplicatively in each iteration.
Projection in KL
The above issue motivates projection in a different distance measure between distributions. Specifically, KL divergence has the following appealing property:
Fact 3.9.
Let be the -dimensional simplex i.e. the convex hull of the standard basis vectors. We view points in as distributions over elements. Let be a convex set and let be a point. Let . Then for all ,
This implies that “projecting” onto a convex set actually decreases the KL-divergence from all elements within the convex set. This is the key to overcoming the error doubling issue above. In particular, in the same abstraction as above, we would just set .
More concretely, since the vectors are succinct representations of , we can solve the projection in KL, which is a convex optimization problem, over these representative vectors. While Proposition 3.6 is stated for TV distance, we can extend it to also preserve the KL divergence between linear combinations of these distributions as well. Technically, this requires a “truncated” notion of KL to deal with possibly negative entries but we will gloss over this for now (see Definition 4.8 and Lemma 5.10 for more details).
To describe our actual algorithm, once we have the coefficients as in (3), we solve the following convex optimization problem. We solve for coefficients that are bounded in magnitude and minimize
(8) |
While we omit the details in this overview (the full analysis is in Section 7), the main point of this projection in KL step is that we can appeal to Fact 3.9 with and for the optima obtained in (8) to get
where the additive error comes only from how well KL-divergences are preserved in the succinct representation . In particular, this completes the “change-of-basis” from to and the error only grows additively. Now to sample the entire string, we simply sample one character at a time and iterate the above.
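A minimal sketch of this projection step is below (our own code; the box bound on the coefficients, the particular floor used in the truncation, and the use of a generic numerical solver are illustrative stand-ins for the precise quantities in Definition 4.8 and the convex program (8)):

import numpy as np
from scipy.optimize import minimize

def truncated_kl(p, q, c):
    # A soft truncation so the objective stays finite even if q has tiny or
    # negative entries (cf. Definition 4.8); the exact form in the paper may differ.
    p = np.maximum(p, 0.0)
    return float(np.sum(p * (np.log(np.maximum(p, c)) - np.log(np.maximum(q, c)))))

def kl_projection(p, Q, C=2.0, c=1e-8):
    # p: succinct representation of the current re-normalized combination.
    # Q: one row per element of the next spanner set (its succinct representation).
    # Returns bounded coefficients beta approximately minimizing the truncated
    # KL divergence between p and beta @ Q.
    k = Q.shape[0]
    obj = lambda beta: truncated_kl(p, beta @ Q, c)
    res = minimize(obj, x0=np.ones(k) / k, bounds=[(-C, C)] * k, method="L-BFGS-B")
    return res.x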
3.4 Organization
In Section 4, we present a few basic observations for simulating density access to a distribution that is close to and working with (truncated) KL divergence. In Section 5, we introduce our machinery for working with barycentric spanners, including the dimensionality reduction procedure. In Section 6, we present our full learning algorithm. In Section 7, we present our sampling algorithm for sampling from our learned representation.
4 Basic Results
In this section, we introduce notation and collect several basic observations that will be used throughout the later sections.
4.1 Algorithmic Primitives
We start by developing a few basic primitives that conditional queries allow us to implement. First, we have a subroutine for computing the probabilities of the next character given a specified history .
Claim 4.1 (Next Character Probabilities).
For and a string and parameters , we can make conditional queries, and with probability, we output a distribution over that is within TV distance of the true distribution of the next character .
Proof.
We can repeatedly make conditional queries with input and look at the next character. We make such queries and output the empirical distribution of the next character. By a Chernoff bound, with probability, the resulting distribution is within TV distance of the true distribution . ∎
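As a concrete sketch of this estimator (our own code; oracle.sample_future stands for the conditional query interface, e.g. the one sketched after Definition 1.2):

import numpy as np

def next_char_probs(oracle, prefix, O, num_queries):
    # Estimate the next-character distribution after `prefix` by making
    # repeated conditional queries and recording the first returned character.
    counts = np.zeros(O)
    for _ in range(num_queries):
        future = oracle.sample_future(prefix)
        counts[future[0]] += 1
    return counts / num_queries   # empirical distribution; TV error roughly
                                  # O(sqrt(O / num_queries)) with high probability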
In light of the above, we make the following definition.
Definition 4.2 (Conditional Closeness).
For two distributions on , we say they are -conditionally close if for any string for , the conditional distributions of the next character and are -close in TV distance.
The following statement shows that conditional closeness implies closeness in TV distance for the entire distribution (and further on any conditional distribution) up to a factor of .
Claim 4.3.
Let be distributions on that are conditionally close. Then for any history for , we must have
where represent the full distribution over the future .
Proof.
We prove by reverse induction on that
whenever has length . When , the above is immediate from the assumption that are conditionally close. Now assume we have proven the claim for . Consider that has length .
where in the above, we first used conditional closeness and then the inductive hypothesis. This completes the proof. ∎
For technical reasons later on, it will be convenient to have the following definition which necessitates that, regardless of the prefix, all possibilities for the next character occur with some non-negligible probability.
Definition 4.4 (Positivity).
Let be some parameter. We say a distribution on is -positive if for any string for , the probabilities are at least for any choice of .
Now we show how to implement sample and exact pdf access to a distribution that is conditionally close to the distribution of . Note that this is slightly stronger than just implementing approximate pdf access to since we need some “consistency” between the responses to pdf queries.
First, we can approximate the pdf of at a given string by iteratively computing the next character probabilities using Claim 4.1 and then multiplying them. The key to ensuring consistency is to estimate the “next character distribution” only once for each possible prefix . Whenever this prefix shows up again in a different computation, we use the same “next character distribution” that has already been computed.
Definition 4.5 (Sample and PDF Access).
For a distribution over , we say that we have sample and pdf access to if we can perform the following operations:
-
•
Given a string for , draw a sample from
-
•
Given a string for , output (where this is the probability that the first characters of the string match )
Lemma 4.6 (PDF Estimation).
Let and be parameters. Assume we are given conditional query access to a distribution . With probability at least , we can respond to sample and pdf queries for some distribution that is -positive and conditionally close to . This uses a total of conditional queries to .
Remark 4.7.
While we don’t specify the whole distribution , the point is that the responses made by the algorithm to the sample and pdf queries are consistent with some distribution that is close to .
Proof.
We define a rooted tree with layers where each intermediate node has exactly children. Now the nodes of the tree are labeled with strings in for . The root is labeled with the empty string and the labels of the children of each node are obtained by appending each of the possible characters – thus the labels of the nodes at level are exactly the strings in .
We first describe an algorithm for answering sample and pdf queries that requires exponential time but then we will give a polynomial time algorithm for answering polynomially many queries that is statistically equivalent to it.
Naive Algorithm:
For every non-leaf node in the tree, say with label , apply Claim 4.1 (using new, independent samples) to compute the next-character probabilities for . We can ensure that with probability , for all . By perturbing the if necessary, we can also ensure and for all . Now assign the edge from to a weight of . Let be the distribution induced by this weighted tree where the probability of sampling a string is equal to the product of the weights along the root-to-leaf path to . Now answer all sample and pdf queries with respect to . It is clear that is a valid distribution, and union bounding over all nodes, with probability, is -positive and conditionally close to .
Efficient Implementation:
We say we visit a node labeled to mean that we apply Claim 4.1 to compute the next-character probabilities and assign edge weights as described above. We will lazily maintain the tree instead, only visiting nodes as necessary. After we have visited a node, we never revisit it, so its edge weights are fixed once it has been visited once. Initially, we have only visited the root. Each time we receive a pdf query for a string , we follow the root-to-leaf path to and visit all nodes along this path. Now the pdf we output is obtained by simply multiplying the edge weights along the path from the root to .
Next, for a conditional sample query at , we begin from the node labeled . If we have visited it already, then we sample one of its children with probabilities proportional to the edge weights. We then move to this child and repeat. If we have not visited a node yet, we first visit it and then sample a child and repeat until we reach a leaf. We output this leaf as the conditional sample.
In this implementation, answering each query only requires time and conditional queries to . Furthermore, this algorithm is statistically equivalent to the naive algorithm described earlier so we are done.
∎
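The lazy construction can be sketched as follows (our own code; estimate stands for any next-character estimator such as the sketch after Claim 4.1, and the additive smoothing that enforces positivity is an illustrative choice):

import numpy as np

class LazyPDFOracle:
    # Sketch of the efficient implementation in Lemma 4.6.  The tree of
    # prefixes is built lazily: a prefix's next-character distribution is
    # estimated the first time the prefix is visited and then frozen, so all
    # pdf/sample answers are consistent with a single distribution D'.
    def __init__(self, estimate, O, T, gamma, rng):
        self.estimate = estimate      # prefix -> estimated next-char distribution
        self.O, self.T, self.gamma, self.rng = O, T, gamma, rng
        self.edge_weights = {}        # prefix (tuple) -> length-O vector

    def _visit(self, prefix):
        if prefix not in self.edge_weights:
            p = np.asarray(self.estimate(prefix), dtype=float)
            # smoothing: every next-character probability is at least
            # gamma / (1 + O * gamma) > 0, so D' is positive
            self.edge_weights[prefix] = (p + self.gamma) / (1.0 + self.O * self.gamma)
        return self.edge_weights[prefix]

    def pdf(self, string):            # probability under D' of this prefix
        prob, prefix = 1.0, ()
        for o in string:
            prob *= self._visit(prefix)[o]
            prefix += (o,)
        return prob

    def sample(self, prefix):         # draw a future from D' conditioned on prefix
        prefix, future = tuple(prefix), []
        while len(prefix) < self.T:
            p = self._visit(prefix)
            o = int(self.rng.choice(self.O, p=p))
            future.append(o)
            prefix += (o,)
        return tuple(future)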
4.2 Truncated KL Divergence
Later on in the analysis, we will need a modified version of KL Divergence that is truncated so that it is defined everywhere and doesn’t blow up at .
Definition 4.8.
For a parameter and real valued inputs , we define
Definition 4.9.
For vectors where has positive entries, we define
When has all entries equal to , we will slightly simplify notation and write
Note that the above is well-defined since we are only evaluating the logarithm when both and are positive. When are distributions and , the above corresponds to the usual definition of KL-divergence. When , this corresponds to a soft truncation of the distributions to elements with nontrivial mass.
One of the key properties of KL divergence is that it allows for projection in the following sense: when projecting a point onto a convex set, the projection is closer to all points in the set.
Fact 4.10.
Let be the -dimensional simplex i.e. the convex hull of the standard basis vectors. Let be a convex set and let be a point. Let . Then for all ,
Proof.
Let and . Then the optimality of implies that for any ,
Since are on the simplex and by non-negativity of KL divergence, we conclude
and this immediately gives the desired inequality. ∎
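This property can also be checked numerically (our own code; the convex set is taken to be the set of mixtures of a few fixed distributions, which is a convex subset of the simplex, and the solver is a generic stand-in):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, k = 6, 3
M = rng.dirichlet(np.ones(d), size=k)       # rows of M span the convex set K
x = rng.dirichlet(np.ones(d))               # the point to project

def kl(p, q):
    p, q = np.clip(p, 1e-12, None), np.clip(q, 1e-12, None)
    return float(np.sum(p * np.log(p / q)))

def project(x):
    # minimize KL(z || x) over z = w @ M with w a probability vector
    obj = lambda w: kl(w @ M, x)
    cons = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]
    res = minimize(obj, np.ones(k) / k, bounds=[(0, 1)] * k,
                   constraints=cons, method="SLSQP")
    return res.x @ M

z = project(x)
for _ in range(5):
    y = rng.dirichlet(np.ones(k)) @ M       # an arbitrary point of K
    print(kl(y, x) - kl(y, z))              # nonnegative up to solver tolerance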
In our analysis later on, we will need a slightly more general version of Fact 4.10, stated below, that deals with truncated KL divergences and points that are not exactly on the simplex.
Corollary 4.11.
Let be vectors. Assume has positive coordinates and assume is a convex set such that all elements have sum of coordinates equal to and entrywise. Let be a point. Let . Then for any ,
Proof.
Note that by the assumptions in the statement, for any ,
where the max on the RHS is taken entrywise and is any positive constant. Now letting be equal to the sum of the coordinates of , note that . Then we can simply apply Fact 4.10 to the vector to get the desired inequality. ∎
We also have the following observation that will allow us to relate truncated KL divergence to TV distance.
Claim 4.12.
For a vector with positive entries and minimum entry and maximum entry , the divergence is Lipschitz with respect to with Lipschitz constant
Proof.
This follows immediately from the definition. ∎
5 Geometric Results
Our learning algorithm will also rely on a few geometric constructions which we introduce below.
5.1 Spanners
First, we recall the standard notion of a barycentric spanner.
Definition 5.1 (Barycentric Spanner).
For a collection of vectors , we say another set of vectors is a -spanner for them if for all , there are coefficients with such that
Note that the error is defined in because later on, we will have representing the density functions of distributions and the error will correspond to TV-closeness.
Next we introduce a distributional notion of a spanner, where we only require that the spanner covers most of the mass of some underlying distribution.
Definition 5.2 (Distribution Spanner).
Let be a distribution over . Let , be parameters. We say a set of points is a -spanner for if for a random point drawn from , with probability, there are coefficients with
It is folklore that any subset of points has an exact barycentric spanner. However, in our setting, it will be important to compute a spanner for a high-dimensional distribution given only polynomially many samples.
Fact 5.3.
[Folklore] Let be arbitrary vectors in . Then there exists a subset with such that form a spanner for the collection .
Lemma 5.4.
Let be a distribution over . Let be parameters. Given a set of independent samples from , with probability , there exists a subset of elements of , , which form a -spanner for .
Proof.
For any vectors in , say , let be the convex body whose vertices are . Let be the pdf of inside this convex body and let be the fraction of points of that are within this body. For fixed, , we have
with probability. Now we can union bound over all subsets and deduce that with probability , for all such subsets,
However, by Fact 5.3, there is a subset of elements of that is a -spanner for and thus the above implies that taking to be this subset gives us a -spanner for . ∎
Now we recall the following standard algorithm for computing a spanner (see e.g. [AK08]). We include a proof for completeness.
Claim 5.5.
Let be vectors and let be the matrix with columns given by and let its singular values be . Then there is an algorithm that runs in time that computes a subset with such that is a spanner for the collection
Proof.
First, we iteratively construct an initial subset as follows. Initially start with and in each step , find a vector such that
First, such a vector always exists for because we know that
where we used the extremal characterization of the th singular value of in terms of the best rank- approximation to .
After this procedure, we have a subset . We will use to denote the volume of the simplex formed by the vectors in . By construction, we have
Next, assume for the sake of contradiction that is not a -spanner. Then there is some vector that cannot be written as a linear combination of with coefficients of magnitude at most . There is a unique way to write
and there must be some coefficient say with . Then we can replace with in and this at least doubles . The maximal possible value of for any subset with is at most and thus we need to iterate this process at most times before we get a set that is a -spanner. This completes the proof. ∎
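For concreteness, here is a sketch of this classical procedure in the non-degenerate case (our own code, following the swapping argument above and the variant in [AK08]; Lemma 5.6 below handles vectors that are only approximately low-dimensional):

import numpy as np

def barycentric_spanner(V, C=2.0):
    # V: array of shape (n, d) whose rows are assumed to span R^d.
    # Returns indices of d rows forming a C-approximate barycentric spanner.
    V = np.asarray(V, dtype=float)
    n, d = V.shape
    B = np.eye(d)                               # placeholder basis columns
    idx = [-1] * d
    # Phase 1: replace each placeholder column by the input vector that
    # maximizes the absolute determinant of the current basis.
    for j in range(d):
        dets = []
        for i in range(n):
            Btry = B.copy(); Btry[:, j] = V[i]
            dets.append(abs(np.linalg.det(Btry)))
        i_best = int(np.argmax(dets))
        B[:, j] = V[i_best]; idx[j] = i_best
    # Phase 2: while some swap multiplies the absolute determinant (the
    # volume proxy in the proof) by more than C, perform the swap.
    improved = True
    while improved:
        improved = False
        for j in range(d):
            for i in range(n):
                Btry = B.copy(); Btry[:, j] = V[i]
                if abs(np.linalg.det(Btry)) > C * abs(np.linalg.det(B)):
                    B[:, j] = V[i]; idx[j] = i
                    improved = True
    return idx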
Claim 5.5 has the restriction that is nontrivial. We will now give a robust version of Claim 5.5 for computing an approximate spanner even when the vectors may be arbitrarily close to degenerate.
Lemma 5.6.
Let be vectors with for all . Let be parameters we are given. Assume that there exists a subspace of dimension such that for all . Then if we run Algorithm 1 on , it runs in time and its output satisfies
-
•
-
•
forms a spanner for the collection
Proof.
Recall that is the largest index such that . Note that since by assumption, there is rank- subspace say such that
and thus .
Next, note that is a matrix with all singular values being at least (and the largest singular value is at most ). Thus, when we apply Claim 5.5 in Line 9, it runs in time and the subset is indeed a -spanner for the . Finally, we show that is a good spanner for the collection of . For any , we can first write
where . Now
and thus the set is a spanner. It is clear that the overall runtime is . ∎
5.2 Dimensionality Reduction
We will also need a type of dimensionality reduction for distributions over exponentially large domains. For distributions over a large domain say , we can represent them as -dimensional vectors. However, we need a more succinct representation. Below we show how, with only polynomially many samples and pdf access to the distributions, we can construct a succinct representation that approximately preserves important properties of the original distributions. We use the following notation for translating between distributions and vectors of their densities.
Definition 5.7.
Let be a distribution on a finite set of elements . For , we write for the density function of at . We define the vector to be . For a multiset of elements in , we define to be a vector whose entries are indexed by elements and the values are equal to .
Definition 5.8.
Given distributions on a set of elements, say , we say vectors for some dimension are -representative for if for all coefficients with such that at most of them are nonzero,
Definition 5.9.
Given distributions on a set of elements, say , we say vectors for some dimension are KL-preserving for if for all coefficients with such that at most of the and of the are nonzero and for any constant with ,
The main result of this section is Lemma 5.10, where we show that for any distributions , after sampling and appropriately re-normalizing, we can construct succinct representations that approximately preserve the TV distance and KL divergence between linear combinations of these distributions.
Lemma 5.10.
Let be distributions on a set of elements, say . Let and let be multisets where is obtained by drawing elements independently from . Let . Define the distribution . Define the vectors for where the division is done entrywise and define (reciprocated entrywise).
Let and be some parameters. If then with probability
-
•
The vectors are -representative for
-
•
The vectors are KL-preserving for
Proof.
We begin by proving the first statement. Consider a fixed set of coefficients . We will first imagine that the sets are sampled after fixing the coefficients and then we will union bound over a net over all possible choices for . We have
Also the quantity inside the expectation on the RHS always has magnitude at most . Now we can draw samples from to approximate the RHS. By a Chernoff bound, if we draw at least samples from the distribution – this corresponds to taking , then with probability at least , the empirical estimate of the RHS is within of the true value i.e.
which is exactly the inequality that we want to show.
Since the failure probability of the above is , we can union bound over a -net of all of the possible choices of . Call the net . Then for any possible such that and at most of the are nonzero, there is some element of the net, say , such that . From the union bound over the net, we know that
and thus
as desired.
We prove the second statement in a similar way. First consider a fixed choice of . Consider fixed coefficients . Then
Also for all possible choices of , the RHS always has magnitude at most . Now, as before in the proof of the first statement, by a Chernoff bound, this implies that for fixed and also , with probability , we have
Again, as before, we next union bound over a -net over all possible choices of and also and we deduce that
for all choices of , as desired. ∎
The following fact will also be useful as we won’t have pdf access to the exact distribution that we get samples from but instead to a distribution that is close in TV distance.
Lemma 5.11.
Let be distributions on a set of elements, say . Let be distributions such that for all . Let and let be multisets where is obtained by drawing elements independently from . With probability , we have for all and
for all .
Proof.
Fix an . For , let be the set of elements such that
Note that by assumption,
and thus we must have
which also implies . Now we can union bound this over all choices of and all of the samples drawn to get that the overall failure probability is at most as desired. ∎
6 Learning Algorithm
In this section, we present our learning algorithm. In light of Lemma 4.6, throughout our learning algorithm, we will assume that we have sample/conditional query and pdf access to a distribution that is conditionally close to the unknown distribution . We will only use conditional queries to to simulate access to and almost all of our reasoning will be with respect to the distribution .
6.1 Finding a Spanner for the State Space
Recall that by Fact 2.2, for a fixed with , the (vectorized) distributions all lie in some -dimensional subspace. Thus, by Fact 5.3, there exists a barycentric spanner consisting of elements corresponding to some histories. Since is conditionally close to , the (vectorized) distributions are all close to some -dimensional subspace. The first step in our algorithm will be to, for each , compute a set of histories such that are an approximate spanner for . The main algorithm in this section is Algorithm 3 and we analyze it in Lemma 6.3.
First, we will need a few preliminaries. It will be convenient to introduce the following notation for arranging possible histories and futures into a matrix.
Definition 6.1.
For a subset for some and subset , let be the matrix with rows indexed by elements and columns indexed by elements and entries equal to . When either or is the full set, we may write or
We will repeatedly make use of the primitive in Algorithm 2 for taking histories and outputting vectors that are succinct representations (in the sense of Lemma 5.10) of the distributions .
Fact 6.2.
Whenever we run Algorithm 2, the output vectors have nonnegative entries and satisfy ,
Proof.
This follows immediately from the construction in Algorithm 2. ∎
Now we are ready to present the algorithm for building the spanners and its analysis. For technical reasons that will be important later, when computing a spanner for histories of length , the algorithm takes as input some strings of length . The goal will be to output a list of strings of length , say , such that the corresponding vectors are an approximate spanner for the distribution of vectors and they also span for all . Note that the first condition is the natural requirement for a spanner and the second will be important later when we compute “transitions” between the length- histories and length- histories.
We now prove guarantees on the spanning set found by Algorithm 3.
Lemma 6.3.
Assume that is conditionally close to a rank distribution where
for some sufficiently large polynomial. With probability at least , if we run Algorithm 3 for arbitrary input strings , the output satisfies the following conditions
-
•
and the vectors form a spanner for
-
•
The rows of form a -spanner for the rows of
-
•
The rows of form a -spanner for the distribution of vectors
Proof.
For , let be vectors with entries indexed by and entries given by and respectively. First, note that the vectors are contained in an -dimensional space by Fact 2.2. Thus, by Lemma 5.11, and the setting of sufficiently small, with probability at least , there is an -dimensional subspace, say , such that all of have projection at most onto . Assuming that this holds, all of the hypotheses of Lemma 5.6 are satisfied (note the condition about the norms of the input vectors is satisfied by the construction in Algorithm 2). We then get that the execution of Algorithm 3 successfully finds a set with that is a spanner for .
For each , we can use the fact that is a spanner for to find a linear combination with for all such that
(9) |
Next, by Lemma 5.10, with probability at least , the vectors are -representative for the distributions . This then implies
which gives the second condition.
Now we prove the last condition. Recall that the vectors are vectors in that are contained in some -dimensional subspace. Let be the distribution of these vectors for . By Lemma 5.4 (since these vectors live in an -dimensional subspace), with probability at least , there is a subset with , such that the vectors are a -spanner for this distribution . For each , we can again use (9) (since ). We can also again use Lemma 5.10 to get that the vectors are -representative for the distributions and thus by the conditional closeness of and and Claim 4.3, they are -representative for . Thus, (9) implies that for all , there exist coefficients with for all such that
(10) |
since . Next, recall that are a -spanner for the distribution of for . Thus, for , with probability, there are coefficients with such that
Finally, we can combine this with (10) (which we can apply for all ) and the closeness between and to get that there are coefficients with such that
which gives the last condition. Note that the overall failure probability for all of the statements is at most . ∎
6.2 Full Learning Algorithm
For our full learning algorithm, we will apply Algorithm 3 to build a sequence of spanning sets for each history-length. We then need to learn the “transitions” between these spanning sets. Rather than explicitly learning these transitions, we instead learn a representation of the distribution of the form described below where the transitions are implicit – the transitions are only computed in Section 7 when we actually want to sample from the learned distribution. The representation of the distribution consists of:
-
•
Subsets for all
-
•
Matrices for all containing the next character probabilities for each of the strings in
-
•
For each , a collection of vectors for all that are succinct representations of the distributions obtained from Algorithm 2
-
•
For each , we also maintain an index set and vector (these are also obtained from Algorithm 2)
Algorithm 5 gives a full description of how we compute this representation.
After running Algorithm 5, we obtain a description of the distribution in terms of
Note that for all , the set is just defined by . In the remainder of this section, we prove that the parameters learned above satisfy certain properties (see Lemma 6.6). These properties will then be used in Section 7 to argue that we can sample from the learned description to get a distribution that is close to the original distribution . We begin with a few definitions.
Definition 6.4.
We say a string for is -representable by if there exists some vector with such that
Definition 6.5.
We say a string for is -positively representable by and if there exists some vector with such that
-
•
-
•
has all entries at least
The main lemma of this section is stated below.
Lemma 6.6.
Assume that is conditionally close to a rank distribution where
for some sufficiently large polynomial. For any parameter , in the execution of Algorithm 5, with probability , we have the following properties:
-
•
For any , .
-
•
For any and , the string is -positively representable by with at least probability
-
•
For all , the vectors are -representative for the distributions
-
•
For all , the vectors are KL-preserving for the distributions
Remark 6.7.
We think of the parameters as being set such that .
Proof.
We apply Lemma 6.3 – with probability , the guarantees of the lemma hold each time we execute Algorithm 3. This immediately implies the first statement that we want to show. For the second statement, Lemma 6.3 also gives us that over the execution of the algorithm, for all , the vectors form a -spanner for the distribution of vectors . Thus, for , with probability, there exists some vector with such that
(11) |
Let be obtained by increasing all entries of by . Since , we have
Now first consider sampling after is fixed. Now let be the set of all strings such that
(12) |
As long as doesn’t contain any of these elements, then would give us a -positive representation of by . Note that
and thus we must have
and using (12), this rearranges into
Also recall Lemma 6.3 implies that for any ,
Now we have
and if this happens, then is -positively representable by . Thus, so far we have shown that over the randomness of and ,
Now by Markov’s inequality, with probability at least over the choice of ,
and this proves the second of the desired statements.
Finally, the last two statements follow from Lemma 5.10. Combining the failure probabilities for all of the parts, the total failure probability is at most , and this completes the proof. ∎
Later on, we will also need the following simple observation.
Claim 6.8.
In the execution of Algorithm 5, assume that for all . Then for any , with probability at least , we have for all .
7 Sampling Procedure
After running Algorithm 5, we have learned a set of parameters
where recall . However, it is not yet clear how these parameters determine a distribution. In this section, we show how these parameters determine a distribution that we can efficiently sample from and argue that if the learned parameters satisfy the properties in Lemma 6.6, then this distribution is close to .
Throughout this section, we assume that we are given the global parameters and a target accuracy parameter . We make the following assumption on the accuracy of the learned parameters. In light of Lemma 6.6 and Claim 6.8, we can show that this assumption holds with high probability as long as we ran Algorithm 5 on a distribution that is -conditionally close to a rank distribution for sufficiently small .
Assumption 7.1.
We have satisfying for some sufficiently large polynomial. Let . The parameters
satisfy the following properties:
-
•
For any , .
-
•
For any and , the string is -positively representable by with at least probability
-
•
For all , the vectors have nonnegative entries and have the same dimensionality and satisfy (where denotes entrywise product)
-
•
For all , the vectors are -representative for the distributions
-
•
For all , the vectors are KL-preserving for the distributions
-
•
for all
We will also make the following assumption on the underlying distribution .
Assumption 7.2.
The distribution is -positive (recall Definition 4.4).
Throughout the rest of this section, we will treat the parameters as fixed. We will describe the sampling procedure and then analyze it, proving that it samples from a distribution close to as long as the assumptions above hold.
The main sampling algorithm is Algorithm 6. First, we define the following operation which rounds a vector to a point on the simplex corresponding to a probability distribution.
Definition 7.3.
For a vector and parameter , we define to be
where . Note that the output of is always a valid probability distribution.
At a high-level, Algorithm 6 works as follows. We sample a string one character at a time. At each step , we maintain a linear combination of the strings that is supposed to approximate the string given by the first characters of in the following sense: we want
as distributions, for some coefficients . Given this linear combination, and since gives us the next character probabilities for all of the strings in , we know what the distribution for the next character should be given the first characters in so we simply sample the next character from this distribution. If the next character is , so , then we can re-normalize the coefficients to get coefficients such that
(13)
The main difficulty is that we now want to rewrite the RHS as a linear combination of the distributions .
Recalling the discussion in Section 3.3, we will solve a convex optimization problem, which roughly corresponds to projection in KL, for each “change-of-basis” operation. In particular, we will use the fact that the vectors (where recall ) are succinct representations of . Then once we have the coefficients in (13), to “change the basis”, we solve a convex optimization problem for coefficients such that are bounded and
is minimized. Recall that the crucial property of KL divergence which ensures the error grows only additively, rather than multiplicatively, at each step is that projection in KL onto a convex set decreases the KL distance to all other points in the set (see Fact 4.10).
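To make one step of this sampling loop concrete, the following Python sketch shows the two ingredients: sampling the next character from the distribution induced by the current coefficients, and the KL-projection (“change-of-basis”) step written as a convex program. This is a schematic sketch only: the helper names (sample_next_char, kl_change_of_basis, M_t, M_next, bound), the cvxpy formulation, the small numerical floors, and the normalization constraint on the mixture are illustrative assumptions and are not the precise program in line 17 of Algorithm 6.

import numpy as np
import cvxpy as cp

def sample_next_char(coeffs: np.ndarray, M_t: np.ndarray):
    # Rows of M_t are the next-character distributions of the spanner strings,
    # so coeffs @ M_t is the predicted next-character distribution.
    p = np.maximum(coeffs @ M_t, 1e-12)  # crude rounding to keep entries positive
    p = p / p.sum()                      # renormalize to a valid distribution
    return np.random.choice(len(p), p=p), p

def kl_change_of_basis(target: np.ndarray, M_next: np.ndarray, bound: float) -> np.ndarray:
    # Schematic change-of-basis step: find bounded coefficients w whose induced
    # distribution M_next.T @ w is as close as possible to `target` in KL.
    k = M_next.shape[0]
    w = cp.Variable(k)
    mix = M_next.T @ w
    objective = cp.Minimize(cp.sum(cp.rel_entr(target, mix)))  # KL(target || mix)
    constraints = [
        mix >= 1e-9,          # numerical floor so the KL objective stays finite
        cp.sum(mix) == 1,     # assumed normalization of the feasible mixtures
        cp.abs(w) <= bound,   # assumed bound on the coefficients
    ]
    cp.Problem(objective, constraints).solve()
    return w.value

The design point mirrors the discussion above: the projection is performed in KL divergence, rather than, say, Euclidean distance, precisely so that the error incurred by each change-of-basis step accumulates only additively over the length of the sequence.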
To describe our algorithm formally, it will be convenient to define the following notation for arranging the relevant vectors as the rows of a matrix.
Definition 7.4.
Let be the matrix whose rows are given by . For , let be the matrix whose rows are given by .
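In code, this bookkeeping is just the usual row-stacking of the stored vectors into a matrix; for example, in numpy (the placeholder vectors below are purely illustrative):

import numpy as np

# One stored vector per spanner string (placeholder values for illustration).
vectors = [np.array([0.5, 0.3, 0.2]), np.array([0.1, 0.6, 0.3])]
M = np.vstack(vectors)  # each stored vector becomes one row of the matrix M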
Remark 7.5.
The analysis will rely on bounding the (truncated) KL divergence between the true distribution and the distribution that our algorithm samples from. We will do this inductively in via a hybridization argument. However, there are certain “bad” histories that we will need to truncate out, and these are precisely those that are not positively representable in the sense of Lemma 6.6.
Definition 7.6.
For , we let be the subset of such that for all , the prefix of of length is -positively representable by .
The key lemma of the analysis is Lemma 7.12, but first we will need to prove a sequence of preliminary claims. We apply Corollary 4.11 to analyze the convex optimization step in Algorithm 6. We get that, up to some small additive error, replacing with the solution that we compute reduces the KL divergence to all vectors in the feasible set of the convex program. We can then use Assumption 7.1 to lift this argument to the original distributions and argue that our change-of-basis step incurs only a small additive error overall.
Claim 7.7.
Proof.
We apply Corollary 4.11, where is the set of all vectors such that is feasible in the convex program. It is clear that this set is convex. Also, if we set in Corollary 4.11, then all of the hypotheses are satisfied. By Assumption 7.1, . Also, by definition, all elements of have sum of entries equal to . Thus, we get
as desired. ∎
We will also need the following observation about Algorithm 6: the vector is always close to being a valid distribution.
Claim 7.8.
Proof.
Corollary 7.9.
Proof.
Now we can take Claim 7.7 and then use Assumption 7.1 to replace the matrix with the matrix of true probabilities as follows.
Claim 7.10.
Fix an (recall Definition 7.6). Let be such that . Assume that in the execution of Algorithm 6, the first characters we have sampled are exactly and the next character is sampled to be , so we just set in line 13. Then after solving the optimization problem in line 17 to compute the next set of coefficients , we have
Proof.
By the assumption that , there exist coefficients such that and
• (14)
• has all entries at least
Note that the first condition implies that
because all rows of the matrix have sum equal to . Let be the all-ones vector of dimension (recall Assumption 7.1). Then define
By Assumption 7.1 (specifically that the vectors are representative for the distributions ), this implies that
(15)
Thus, using Assumption 7.1, the vector must be a feasible solution to the convex program in Line 17 of Algorithm 6. By Claim 7.7,
Next, recall the vector constructed in Algorithm 6 and recall that
Then by the KL-preserving property in Assumption 7.1 (and Corollary 7.9 which bounds the entries of )
Finally, by Claim 4.12 together with (14) and (15), the above implies
(16)
where in the last step we used the assumption that is sufficiently small. This completes the proof. ∎
Now we sum Claim 7.10 over all possibilities for to get the following bound.
Claim 7.11.
Fix an . Assume that in the execution of Algorithm 6, the first characters we have sampled are exactly . Then we have
Proof.
Let be the subset of all strings whose first character is . Then we have
Also note that
Note that all entries of the matrix are at least by Assumption 7.2 and by Corollary 7.9, so using the definition of truncated KL,
(17)
Next, we will apply Claim 7.10 and (17) to bound the sum
However, we can only apply Claim 7.10 when is such that . For other choices of , we can nevertheless use the following trivial version of Claim 7.10
where the above holds simply because both of the KL divergences are bounded in magnitude by . Thus, we have the bound
(18)
Next, by definition and Assumption 7.1,
Observe that the vector is the same as concatenating over all choices of . Thus by Claim 7.8
Combining this with (18) and using nonnegativity of KL gives
∎
Next, we sum Claim 7.11 over all possibilities for . Note that in the execution of Algorithm 6, we can associate to every possible history a unique vector of coefficients, namely the state of the vector conditioned on the prefix being equal to .
Lemma 7.12.
For all , we have
Proof.
Summing Claim 7.11 as ranges over all (with weights ) gives
(19)
Now we can prove the main lemma of this section, where we show that the distribution that our algorithm samples from is close to the desired distribution .
Lemma 7.13.
Proof.
First, for a fixed history , let be the vector
where the maximum is taken entrywise. Clearly has all positive entries and by Claim 7.8 (and the definition of ), the sum of the entries is at most . Now let
i.e., normalized so that the sum of the entries is . Note that all of the entries of the matrix are at least , so
Now combining the above with Lemma 7.12 implies
Recall by Assumption 7.1 that
and thus by Pinsker’s inequality, we deduce
(20)
and when this happens, since
we get that the distributions are -close in TV distance. Now we construct a hybrid distribution that is equal to except that, starting from every prefix where the condition in (20) fails, all future characters are sampled according to . By (20), and by Claim 4.3, . Putting these together yields the desired statement.
∎
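For completeness, the form of Pinsker’s inequality invoked in the proof above is the standard one, stated here for two distributions P and Q on the same finite domain (the precise constants appearing in (20) are as in the original statement):

\[
  \mathrm{TV}(P, Q) \;=\; \frac{1}{2} \sum_{x} \big| P(x) - Q(x) \big| \;\le\; \sqrt{\tfrac{1}{2}\, \mathrm{KL}(P \,\|\, Q)},
\]

so a bound on the KL divergence between the true and sampled conditional distributions directly yields a bound on their total variation distance.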
Now we can put everything together and complete the proof of Theorem 1.3.
Proof of Theorem 1.3.
Let for some sufficiently large polynomial. We will combine the following ingredients:
• Lemma 4.6 to get exact sample and pdf access to a distribution close to
• Algorithm 5 to learn a description of a distribution
• Algorithm 6 to draw samples from the distribution parameterized by this learned description
For all of these algorithms, we will set the failure probability parameter . First, we apply Lemma 4.6 to get exact sample and pdf access to a distribution that is conditionally close to and -positive. Next, we apply Algorithm 5 to learn a set of parameters
where . We apply Lemma 6.6 and Claim 6.8 to these learned parameters. Combined with the definition of the vectors , they imply that with probability , the conditions of Assumption 7.1 are satisfied. Also, Assumption 7.2 is satisfied by construction. Finally, Lemma 7.13 implies that when these conditions are satisfied, the distribution that Algorithm 6 samples from is -close to in TV distance. Note that the sampling algorithm runs in time . Redefining completes the proof. ∎