
Model Stealing for Any Low-Rank Language Model

Allen Liu Email: [email protected] This work was supported in part by an NSF Graduate Research Fellowship, a Hertz Fellowship, and a Citadel GQS Fellowship.    Ankur Moitra Email: [email protected] This work was supported in part by a Microsoft Trustworthy AI Grant, an ONR grant, and a David and Lucile Packard Fellowship.
Abstract

Model stealing, where a learner tries to recover an unknown model via carefully chosen queries, is a critical problem in machine learning, as it threatens the security of proprietary models and the privacy of data they are trained on. In recent years, there has been particular interest in stealing large language models (LLMs). In this paper, we aim to build a theoretical understanding of stealing language models by studying a simple and mathematically tractable setting. We study model stealing for Hidden Markov Models (HMMs), and more generally low-rank language models.

We assume that the learner works in the conditional query model, introduced by Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24]. Our main result is an efficient algorithm in the conditional query model, for learning any low-rank distribution. In other words, our algorithm succeeds at stealing any language model whose output distribution is low-rank. This improves upon the result in [KKMZ24] which also requires the unknown distribution to have high “fidelity” – a property that holds only in restricted cases. There are two key insights behind our algorithm: First, we represent the conditional distributions at each timestep by constructing barycentric spanners among a collection of vectors of exponentially large dimension. Second, for sampling from our representation, we iteratively solve a sequence of convex optimization problems that involve projection in relative entropy to prevent compounding of errors over the length of the sequence. This is an interesting example where, at least theoretically, allowing a machine learning model to solve more complex problems at inference time can lead to drastic improvements in its performance.

1 Introduction

Proprietary machine learning models are often highly confidential. Not only are their weights not publicly released, but even their architecture and hyperparameters used in training are kept a closely guarded secret. And yet these models are often deployed as a service, allowing users to make queries to the model and receive answers. These answers can take the form of labels or completions of prompts, and sometimes a model will even report additional information such as its confidence scores. This raises a natural question:

Question.

Are these black-box models actually secure, or is it possible to reverse engineer their parameters or replicate their functionality just from query access to them?

This task is called model stealing and it threatens the security of proprietary models and the privacy of data they are trained on. Beyond nefarious reasons, it can also be used in model distillation [MCH+21], where we have trained a large and highly complex model and we want to transfer its knowledge to a much smaller model. It can also be a useful tool for identifying vulnerabilities, as those are often inherited by stolen models.

In any case, model stealing continues to be a very active area of research. The influential work in [TZJ+16] showed that there are simple and efficient attacks on popular models like logistic regression, decision trees and deep neural networks that often work in practice. Since then, many new attacks and defenses have been formulated [HLXS21, HJB+21, RST19, WG18, JSMA19, OMR23, OSF19, WXGD20]. There are also approaches based on embedding watermarks [JCCCP21, ZWL23, LZJ+22] that make it possible to detect when one model has been stolen from another. In recent years, there has been particular interest in stealing large language models (LLMs). Various works have shown how to steal isolated components of a language model such as the decoding algorithm [NKIH23], prompts used to fine-tune the model [SZ24], and even the entire weight matrix of the last layer (the embedding projection layer) [CPD+24].

In this work, our main interest will be in theoretical foundations for stealing language models. As is all too familiar, proving rigorous end-to-end guarantees when working with modern machine learning models with all their bells and whistles seems to be an extremely difficult task. For example, while we can understand the training dynamics on multilayer neural networks in terms of gradient flow in the Wasserstein space of probability distributions [MMM19, NP23], it has turned out to be quite difficult to analyze these dynamics except in simplified settings with a high degree of symmetry. Even worse, there are strong lower bounds for learning deep neural networks [KS09] even with respect to nice input distributions [GGJ+20, DKKZ20, GGK20]. The task of reasoning about modern language models seems no easier, as they are built on transformers [VSP+17] with many building blocks such as word embeddings, positional encodings, queries, keys, values, attention, masking, feed-forward neural networks and layer normalization.

Nevertheless there are often simplified models that abstract important features of more complex models and give us a sandbox in which to try to find theoretical explanations of empirical phenomena. For example, analyzing the dynamics of gradient descent when training a deep neural network is notoriously difficult. But in an appropriate scaling limit, and when the network is wide enough, it can be approximated through the neural tangent kernel [JGH18]. For recurrent neural networks, a popular approach is to analyze gradient descent on linear dynamical systems instead [HMR18]. Likewise for language models, it is natural to work with Hidden Markov Models (HMMs), which are in some sense the original language model, dating back to the work of Claude Shannon in 1951 [Sha51] and were the basis of other early natural language processing systems including the IBM alignment models. More broadly, we can consider a generalization called low-rank language models (Definition 1.1). This brings us to our main questions:

Question.

Is there an efficient algorithm for stealing HMMs from query access? What about more generally for low-rank language models?

These questions were first introduced and studied in an exciting recent work of Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24]. However their motivation and framing was somewhat different, as we will explain.

1.1 Main Results

Formally, we view a language model as a distribution \mathbb{H} over \mathcal{O}^{T} for some alphabet \mathcal{O} and sequence length T. For simplicity, we treat the sequence length as fixed. Following Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24], the rank of the distribution generated by a language model is defined as follows:

Definition 1.1.

[Low Rank Distribution] A distribution \mathbb{H} over \mathcal{O}^{T} for alphabet \mathcal{O} of size O and sequence length T is rank S if for all t<T, the O^{T-t}\times O^{t} matrix \mathbf{M}^{(t)}, with entries equal to \Pr_{\mathbb{H}}[f|h] for f\in\mathcal{O}^{T-t} and h\in\mathcal{O}^{t}, has rank at most S.

In other words, a distribution is rank-S if for any t<T, the information in the prefix of length t can be embedded in an S-dimensional space such that the distribution of the future tokens can be represented as a linear function of this embedding. We note that low-rank distributions are expressive and encompass distributions generated by a Hidden Markov Model (HMM) with S hidden states (see Fact 2.2).

Next, we formalize the setup for studying model stealing. We allow the learner to make conditional queries – that is, the learner can specify a history of observations, and then receives a random sample from the conditional distribution on the future observations. Formally, we have the following definition:

Definition 1.2 (Conditional Query).

The learner may make conditional queries to a distribution \mathbb{H} by querying a string h\in\mathcal{O}^{t} where 0\leq t<T. Upon making this query, the learner obtains a string f of length T-t drawn from the distribution \Pr_{\mathbb{H}}[f|h].
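To make the interface concrete, here is a minimal Python sketch of a conditional query oracle backed by an explicitly stored joint distribution. The class and its toy instantiation are our own illustration (a real model would of course not expose its joint distribution) and are not part of [KKMZ24].

```python
import itertools
import random

class ConditionalQueryOracle:
    """Toy oracle for the conditional query model (Definition 1.2).

    The oracle holds an explicit joint distribution over strings in O^T,
    which is infeasible for real models but fine for illustrating the interface.
    """

    def __init__(self, alphabet, T, joint):
        # joint: dict mapping length-T tuples over `alphabet` to probabilities.
        self.alphabet = list(alphabet)
        self.T = T
        self.joint = joint

    def query(self, history):
        """Given a history h in O^t (t < T), sample a future f ~ Pr[f | h]."""
        t = len(history)
        assert t < self.T
        futures, weights = [], []
        for x, p in self.joint.items():
            if x[:t] == tuple(history):
                futures.append(x[t:])
                weights.append(p)
        # Renormalization of the conditional happens inside random.choices.
        return random.choices(futures, weights=weights, k=1)[0]

# Example: the uniform distribution over {0,1}^3.
alphabet, T = (0, 1), 3
joint = {x: 1 / 2 ** T for x in itertools.product(alphabet, repeat=T)}
oracle = ConditionalQueryOracle(alphabet, T, joint)
print(oracle.query((0,)))   # a random 2-character future given history "0"
```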

In this model, our goal is to design an algorithm that makes a total number of queries that is polynomial in S,O,T and learns an efficiently samplable distribution that is \epsilon-close in total variation distance to \mathbb{H}. This conditional query model was recently introduced by Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24]. Their motivation was two-fold: First, while learning an HMM from random samples is known to be computationally hard [MR05], in principle one can circumvent these barriers if we are allowed conditional samples. Second, a solution to their problem would generalize Angluin’s classic L^{*} algorithm which learns deterministic finite automata from membership queries [Ang87]. In terms of results, Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24] introduced a notion that they called fidelity and gave a polynomial time algorithm to learn any low-rank distribution (and thus any HMM) which has high fidelity through conditional queries. However, this property does not always hold. Thus, their main question still remains: Is it possible to learn arbitrary low-rank distributions through conditional queries? Here we resolve this question in the affirmative. We show:

Theorem 1.3.

Assume we are given conditional query access to an unknown rank S distribution \mathbb{H} over \mathcal{O}^{T} where |\mathcal{O}|=O. Then given a parameter 0<\eta<1, there is an algorithm that takes \mathrm{poly}(S,O,T,1/\eta) conditional queries and running time and with probability 1-\eta outputs a description of a distribution \mathbb{H}^{\prime} such that \mathbb{H}^{\prime} is \eta-close in TV distance to \mathbb{H}. Moreover there is an algorithm that samples from \mathbb{H}^{\prime} in \mathrm{poly}(S,O,T,\log(1/\eta)) time.

Note that crucially, the algorithm only makes conditional queries for learning the representation of \mathbb{H}^{\prime}. Once we have this learned representation, we may draw as many samples as we want without making any more queries to the original distribution \mathbb{H}.

1.2 Discussion

Theorem 1.3 shows that we can efficiently learn any low-rank distribution via conditional queries. Thus, we can view our results as showing that in some sense, the rank of a distribution can be a useful proxy for understanding the complexity of model stealing, similar to how complexity measures, such as Bellman rank [JKA+17] and its relatives [FKQR21], are useful for understanding the statistical complexity of learning a near optimal policy in reinforcement learning.

There is a key conceptual insight driving our algorithm. One of the challenges in learning sequential distributions is that the error can grow exponentially in the sequence length TT. In particular if we imagine sampling from \mathbb{H} one token at a time, the low-rank structure ensures that we only need to keep track of an SS-dimensional hidden state at each step. However, each step involves multiplication by a change-of-basis matrix and these repeated multiplications cause the error to grow exponentially. The key to mitigating this error blowup is that we combine each change-of-basis with a projection step, where we solve a convex optimization problem that performs a projection with respect to KL divergence. Crucially, projection in KL divergence has a contractive property (see Fact 3.9) that does not hold for other natural measures of distance between distributions, such as TV distance. We give a more detailed overview of our algorithm in Section 3. This is an interesting example where allowing a machine learning model to solve more complex problems at inference time can lead to drastic improvements in its performance. Of course, phrased in general terms, this is a driving philosophy behind OpenAI’s o1 model. But so far we have little theoretical understanding of the provable benefits of allowing more compute at inference time.

1.3 Related Work

There has been a long line of work on learning HMMs from random samples. Mossel and Roch [MR05] gave the first polynomial time algorithms, which work under appropriate full-rankness conditions on the transition and observation matrices. Other works gave spectral [HKZ12] and method of moments based approaches [AHK12]. Learning HMMs can also be thought of as a special case of learning phylogenetic trees [CGG01, MR05]. Other approaches assume the output distributions belong to a parametric family [KNW13], or study quasi-HMMs [HGKD15].

The main computational obstruction in this area is that HMMs, without full rankness conditions, can encode noisy parities [MR05], which are believed to be computationally hard to learn. In the overcomplete setting, where the hidden state space is allowed to be larger than the observation space, there are ways around these lower bounds. First, one can aim to predict the sequence of observations, rather than learning the parameters [SKLV18]. Under a natural condition called multi-step observability one can get quasi-polynomial time algorithms [GMR23]. Second, one can make structural assumptions that the transition matrices are sparse, well-conditioned, and have small probability mass on short cycles [SKLV17]. Alternatively there are polynomial time algorithms that work under the assumption that the transition and observation matrices are smoothed [BCPV19].

There are also related models called linear dynamical systems where the hidden states and observations are represented as vectors. In contrast, in HMMs the hidden states and observations take on only a finite set of possible values. There is a long line of work on learning linear dynamical systems too [HMR18, OO19, TP19, SRD19, SBR19, BLMY23a, CP22, BLMY23b]. However, these algorithms require various structural assumptions, and even the weakest ones, namely condition-number bounds on the so-called observability and controllability matrices, are known to be necessary [BLMY23a].

The conditional query model has also been studied for other statistical problems such as testing discrete distributions and learning juntas [CFGM13, CRS15, BC18, CJLW21]. However, in most of these settings, the goal of studying the conditional query model is to obtain improved statistical rates rather than sidestep computational hardness. We remark that there are also many classic examples of problems in computational learning theory where, when allowed to make queries, there are better algorithms [Jac97] than if we are only given passive samples.

2 Basic Setup

Recall that we have conditional query access to some unknown distribution \mathbb{H} over 𝒪T\mathcal{O}^{T} for some alphabet 𝒪\mathcal{O}. We also assume that \mathbb{H} has rank SS (recall Definition 1.1). Our goal will be to learn a description of \mathbb{H} via conditional queries. Let us first formally define Hidden Markov Models (HMMs):

Definition 2.1 (Hidden Markov Model).

An HMM \mathbb{H} with state space \mathcal{S}, |\mathcal{S}|=S, and observation space \mathcal{O}, |\mathcal{O}|=O, is specified by an initial distribution \mu over the states, a transition matrix \mathbf{T}\in{\mathbb{R}}^{S\times S}, and an emission matrix \mathbf{O}\in{\mathbb{R}}^{O\times S}. For a given sequence length T, the probability of generating a given sequence x_{1},\dots,x_{T} with each x_{i}\in\mathcal{O} is

\Pr[x_{1},\dots,x_{T}]=\sum_{s_{1},\dots,s_{T+1}\in\mathcal{S}^{T+1}}\mu(s_{1})\prod_{t=1}^{T}\mathbf{O}_{x_{t},s_{t}}\mathbf{T}_{s_{t+1},s_{t}}\,.

Next we give a basic observation (see [KKMZ24]) that the class of rank SS distributions contains the set of all distributions generated by an HMM with SS hidden states.

Fact 2.2.

[KKMZ24] Let \mathbb{H} be a distribution over \mathcal{O}^{T} generated by an HMM with S hidden states. Then \mathbb{H} has rank at most S.
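As an illustration of Definition 2.1, the following Python sketch evaluates the probability of a sequence under a toy HMM via the standard forward recursion, which performs the sum over hidden state sequences without enumerating them; the specific parameters are hypothetical.

```python
import numpy as np

def hmm_sequence_probability(mu, T_mat, O_mat, x):
    """Pr[x_1, ..., x_T] for the HMM of Definition 2.1 via the forward recursion.

    mu: initial distribution over the S states.
    T_mat[s_next, s_prev]: transition probabilities (column-stochastic).
    O_mat[o, s]: emission probabilities; observations are indexed 0..O-1.
    """
    alpha = mu.copy()                         # belief over the current hidden state
    for o in x:
        alpha = T_mat @ (O_mat[o, :] * alpha)  # emit o, then transition
    return alpha.sum()

# A toy 2-state, 2-observation HMM.
mu = np.array([0.6, 0.4])
T_mat = np.array([[0.7, 0.2],
                  [0.3, 0.8]])
O_mat = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
print(hmm_sequence_probability(mu, T_mat, O_mat, [0, 1, 0]))
```

Fact 2.2 can be read off from this recursion: conditioning on a prefix only changes the S-dimensional belief vector (up to normalization), and the distribution of the future is a fixed linear function of this vector, which is exactly the rank-S structure of Definition 1.1.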

Notation

Throughout this paper, we will use the same global notation for the unknown distribution \mathbb{H} with rank SS, observation space 𝒪\mathcal{O} of size OO and sequence length TT. We also write Pr[f|h]\Pr_{\mathbb{H}}[f|h] for sequences ff of length less than TtT-t. This will be used to denote the probability that conditioned on the prefix hh, the observations immediately following hh are those in ff.

We use the following notation for prefixes and conditional probabilities. For a distribution \mathbb{H} on 𝒪T\mathcal{O}^{T} and t<Tt<T, we use [:t]\mathbb{H}[:t] to denote the distribution induced by \mathbb{H} on tt-character prefixes. For any prefix h𝒪th\in\mathcal{O}^{t}, we use Pr[|h]\Pr_{\mathbb{H}}[\cdot|h] to denote the distribution on futures f𝒪Ttf\in\mathcal{O}^{T-t} given by Pr[f|h]\Pr_{\mathbb{H}}[f|h]. Finally, for a string h𝒪th\in\mathcal{O}^{t} and character o𝒪o\in\mathcal{O}, we use hoh\vee o to denote the string of t+1t+1 characters obtained by appending oo to hh.

3 Technical Overview

Our algorithm is composed of two main parts: We work with a certain representation of the distribution and show how to estimate it from conditional samples. Then we give an algorithm that takes this learned representation and can generate samples. In most learning problems, estimating the parameters is the challenging part and sampling is often trivial. But in our case, designing the sampling algorithm, which involves solving a sequence of convex optimization problems, is one of the key ingredients.

3.1 Idealized Representation

In this section, we introduce the representation that will be central to our learning algorithm. First we consider an idealized setting where we ignore issues like sampling noise and the fact that we cannot afford to work with exponentially large vectors directly. We include this subsection for pedagogical reasons as many of the ideas are already in [KKMZ24]. Nevertheless, it will be useful setup for explaining our new contributions in the following subsections.

We slightly abuse notation and for h𝒪th\in\mathcal{O}^{t}, let Pr[|h]\Pr_{\mathbb{H}}[\cdot|h] denote the vector whose entries are Pr[f|h]\Pr_{\mathbb{H}}[f|h] as ff ranges over all elements of 𝒪Tt\mathcal{O}^{T-t}. Now Definition 1.1 tells us that as hh ranges over 𝒪t\mathcal{O}^{t}, all of the vectors Pr[|h]\Pr_{\mathbb{H}}[\cdot|h] are contained in an SS-dimensional subspace. Thus, we only need to store SS of these vectors to “span” the whole space.

Barycentric Spanners

This leads us to the notion of a barycentric spanner, which is the key building block in the representation we will use:

Definition 3.1 (Barycentric Spanner).

For a collection of vectors z1,,zndz_{1},\dots,z_{n}\in{\mathbb{R}}^{d}, we say another set of vectors v1,,vsdv_{1},\dots,v_{s}\in{\mathbb{R}}^{d} is a (C,γ)(C,\gamma)-spanner for them if for all j[n]j\in[n], there are coefficients c1,,csc_{1},\dots,c_{s} with |ci|C|c_{i}|\leq C such that

zj(c1v1++csvs)1γ.\left\lVert z_{j}-(c_{1}v_{1}+\dots+c_{s}v_{s})\right\rVert_{1}\leq\gamma\,.

Crucially, for any set of vectors that is exactly contained in an S-dimensional subspace, there is always a subset of them that forms a (1,0) barycentric spanner for the full collection.

Fact 3.2.

Let v_{1},\dots,v_{n} be arbitrary vectors in {\mathbb{R}}^{d}. Then there exists a subset S\subseteq[n] with |S|\leq d such that \{v_{i}\}_{i\in S} form a (1,0) spanner for the collection (v_{1},\dots,v_{n}).

Thus ideally, for each 0<t<T0<t<T, we could store a subset (t)𝒪t\mathcal{H}^{(t)}\subseteq\mathcal{O}^{t} with |(t)|S|\mathcal{H}^{(t)}|\leq S and such that {Pr[|h]}h(t)\{\Pr_{\mathbb{H}}[\cdot|h]\}_{h\in\mathcal{H}^{(t)}} is a (1,0)(1,0)-barycentric spanner for {Pr[|h]}h𝒪t\{\Pr_{\mathbb{H}}[\cdot|h]\}_{h\in\mathcal{O}^{t}}.

Change-of-Basis

In addition to these spanning subsets, we will need to store some additional information about sampling probabilities and “change-of-basis” between these spanning sets. In particular we claim that if, in addition to the barycentric spanners, we also had the following information, then it would be enough to sample:

  • The next character probabilities (under \mathbb{H}) for all of the elements of (t)\mathcal{H}^{(t)}

  • For each 0t<T0\leq t<T, o𝒪o\in\mathcal{O}, h(t)h\in\mathcal{H}^{(t)}, coefficients {αhho}h(t+1)\{\alpha_{h^{\prime}}^{h\vee o}\}_{h^{\prime}\in\mathcal{H}^{(t+1)}} such that

    Pr[|ho]=h(t+1)αhhoPr[|h]\Pr_{\mathbb{H}}[\cdot|h\vee o]=\sum_{h^{\prime}\in\mathcal{H}^{(t+1)}}\alpha_{h^{\prime}}^{h\vee o}\Pr_{\mathbb{H}}[\cdot|h^{\prime}] (1)

    i.e. we know how to write Pr[|ho]\Pr_{\mathbb{H}}[\cdot|h\vee o] as a linear combination of the vectors {Pr[|h]}h(t+1)\{\Pr_{\mathbb{H}}[\cdot|h^{\prime}]\}_{h^{\prime}\in\mathcal{H}^{(t+1)}}. We refer to this as a “change-of-basis”.

To see why these suffice, let’s consider sampling a string x=o1o2x=o_{1}o_{2}\cdots one character at a time. Assume we have sampled x=o1o2otx=o_{1}o_{2}\cdots o_{t} so far. Throughout the process we will maintain a set of coefficients {αhx}h(t)\{\alpha_{h}^{x}\}_{h\in\mathcal{H}^{(t)}} such that

Pr[|x]h(t)αhxPr[|h].\Pr_{\mathbb{H}}[\cdot|x]\approx\sum_{h\in\mathcal{H}^{(t)}}\alpha_{h}^{x}\Pr_{\mathbb{H}}[\cdot|h]\,. (2)

Since we know the next-character distributions for all of h(t)h\in\mathcal{H}^{(t)}, the above allows us to exactly compute the next-character distribution for xx. After sampling a character, say oo from this distribution, we can then re-normalize to obtain

Pr[|xo]h(t)αhxPr[o|h]Pr[o|x]Pr[|ho].\Pr_{\mathbb{H}}[\cdot|x\vee o]\approx\sum_{h\in\mathcal{H}^{(t)}}\alpha_{h}^{x}\frac{\Pr_{\mathbb{H}}[o|h]}{\Pr_{\mathbb{H}}[o|x]}\Pr_{\mathbb{H}}[\cdot|h\vee o]\,. (3)

We can then substitute (1) into the above to rewrite the right hand side as a linear combination of the vectors {Pr[|h]}h(t+1)\{\Pr_{\mathbb{H}}[\cdot|h^{\prime}]\}_{h^{\prime}\in\mathcal{H}^{(t+1)}}. And now we can iterate this process again to sample the next character. The key point is that once we have a barycentric spanner, we do not need to store the coefficients αhx\alpha_{h}^{x} for each history xx but rather can track how they evolve.
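The following Python sketch summarizes this idealized sampling loop, assuming exact access to the next-character probabilities of the spanner histories and to the change-of-basis coefficients in (1). The data structures and names are ours; the actual algorithm (Section 7) replaces the exact change-of-basis with the KL projection described later.

```python
import random

def idealized_sample(T, alphabet, next_char, basis_coeffs):
    """Idealized sampler from Section 3.1 (exact quantities assumed; notation ours).

    next_char[t][h][o]       : Pr_H[o | h] for each spanner history h in H^(t)
    basis_coeffs[t][h, o][h'] : alpha_{h'}^{h v o} from (1), rewriting Pr_H[. | h v o]
                                in the basis {Pr_H[. | h']}_{h' in H^(t+1)}
    We take H^(0) = {()} (the empty history), so the initial coefficients are trivial.
    """
    alpha = {(): 1.0}
    x = []
    for t in range(T):
        # Next-character distribution of the sampled prefix x, as in (2).
        p = {o: sum(a * next_char[t][h][o] for h, a in alpha.items()) for o in alphabet}
        o = random.choices(list(p), weights=list(p.values()), k=1)[0]
        x.append(o)
        if t == T - 1:
            break
        # Re-normalize as in (3), then change basis to H^(t+1) using (1).
        new_alpha = {}
        for h, a in alpha.items():
            scale = a * next_char[t][h][o] / p[o]
            for h2, b in basis_coeffs[t][h, o].items():
                new_alpha[h2] = new_alpha.get(h2, 0.0) + scale * b
        alpha = new_alpha
    return x
```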

Challenges

If we want to turn the above representation into a learning algorithm, there are some major obstacles. The vectors Pr[|h]\Pr_{\mathbb{H}}[\cdot|h] are exponentially large, and we only have approximate (but not exact) access to them. When our estimates are noisy, the sampling approach above has issues as the error may grow multiplicatively after each iteration.

3.2 Learning the Representation

Simulating pdf access to ^\widehat{\mathbb{H}} that is close to \mathbb{H}

Our first step is to show that, by using conditional queries, we can simulate exact p.d.f. access to a distribution ^\widehat{\mathbb{H}} that is ϵ\epsilon-close to \mathbb{H} in TV distance for any inverse-polynomially small ϵ\epsilon .

Lemma 3.3 (Informal, see Lemma 4.6).

There is a distribution ^\widehat{\mathbb{H}} such that for all h𝒪th\in\mathcal{O}^{t}, the conditional distributions Pr[|h]\Pr_{\mathbb{H}}[\cdot|h] and Pr^[|h]\Pr_{\widehat{\mathbb{H}}}[\cdot|h] are ϵ\epsilon-close, and we can simulate query access to the exact p.d.f. of all conditional distributions of ^\widehat{\mathbb{H}}.

Note that crucially this exact p.d.f. access is only for a distribution ^\widehat{\mathbb{H}} that is ϵ\epsilon-close to \mathbb{H} and not for \mathbb{H} itself (as in the latter case [KKMZ24] already gives an algorithm for learning). For most of the algorithm, we will work directly with ^\widehat{\mathbb{H}} and only use, implicitly in the analysis, that it is close to some low rank distribution \mathbb{H}.

3.2.1 Dimensionality Reduction

Recall that our goal is to find a barycentric spanner for the set of vectors \{\Pr_{\mathbb{H}}[\cdot|h]\}_{h\in\mathcal{O}^{t}}. These vectors are close to the vectors \{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{O}^{t}}, so we will try to construct an approximate spanner for this latter collection. The first problem is that these vectors have exponentially many entries. In order to compute with them efficiently, we introduce a type of dimensionality reduction that allows us to subsample only polynomially many coordinates and work with the restriction to those coordinates. First, let \mathcal{D} be some distribution over strings in \mathcal{O}^{T-t}. We sample polynomially many elements f_{1},\dots,f_{m} from \mathcal{D}, and for each h\in\mathcal{O}^{t}, we associate \Pr_{\widehat{\mathbb{H}}}[\cdot|h] with the reduced vector

vh=(Pr^[f1|h]m𝒟[f1],,Pr^[fm|h]m𝒟[fm])v_{h}=\left(\frac{\Pr_{\widehat{\mathbb{H}}}[f_{1}|h]}{m\mathcal{D}[f_{1}]},\dots,\frac{\Pr_{\widehat{\mathbb{H}}}[f_{m}|h]}{m\mathcal{D}[f_{m}]}\right) (4)

where 𝒟[fi]\mathcal{D}[f_{i}] denotes the density of 𝒟\mathcal{D} at fif_{i}. Recall that by assumption, we have exact access to the densities Pr^[fi|h]\Pr_{\widehat{\mathbb{H}}}[f_{i}|h] in the numerator. Thus as long as we also have exact density access to 𝒟\mathcal{D}, then the above entries can be computed exactly. We first observe that these vectors give unbiased estimates for linear functionals of the original distributions Pr^[|h]\Pr_{\widehat{\mathbb{H}}}[\cdot|h].

Fact 3.4.

For any distribution 𝒟\mathcal{D} over 𝒪Tt\mathcal{O}^{T-t}, in expectation over the random draws from 𝒟\mathcal{D}, for any subset 𝒜𝒪t\mathcal{A}\subseteq\mathcal{O}^{t} and real coefficients {ch}h𝒜\{c_{h}\}_{h\in\mathcal{A}},

𝔼[h𝒜chvh1]=h𝒜chPr^[|h]1.\operatorname{\mathbb{E}}\left[\left\lVert\sum_{h\in\mathcal{A}}c_{h}v_{h}\right\rVert_{1}\right]=\left\lVert\sum_{h\in\mathcal{A}}c_{h}\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\right\rVert_{1}\,.

In other words, at least in expectation, the subsampling of the entries preserves the L1L^{1} norm of linear combinations of the vectors. Of course, the variance could be prohibitively large depending on the choice of 𝒟\mathcal{D}. However, with a careful choice of 𝒟\mathcal{D}, we can obtain concentration for the above quantities as well. The goal of our dimensionality reduction procedure is captured by the following statements:

Definition 3.5.

For a subset 𝒜𝒪t\mathcal{A}\subseteq\mathcal{O}^{t}, we say vectors {vh}h𝒜\{v_{h}\}_{h\in\mathcal{A}} are γ\gamma-representative for the distributions {Pr^[|h]}h𝒜\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{A}} if

|h𝒜chvh1h𝒜chPr^[|h]1|γ\left\lvert\left\lVert\sum_{h\in\mathcal{A}}c_{h}v_{h}\right\rVert_{1}-\left\lVert\sum_{h\in\mathcal{A}}c_{h}\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\right\rVert_{1}\right\rvert\leq\gamma

for all sets of coefficients with |ch|1|c_{h}|\leq 1.

Proposition 3.6.

[Informal, see Lemma 5.10 for more details] For 𝒜={h1,,hk}\mathcal{A}=\{h_{1},\dots,h_{k}\} and

𝒟=Pr^[|h1]++Pr^[|hk]k.\mathcal{D}=\frac{\Pr_{\widehat{\mathbb{H}}}[\cdot|h_{1}]+\dots+\Pr_{\widehat{\mathbb{H}}}[\cdot|h_{k}]}{k}\,.

If we draw samples f1,,fmf_{1},\dots,f_{m} from 𝒟\mathcal{D} for m=poly(k,1/γ)m=\mathrm{poly}(k,1/\gamma) and set

vh=(Pr^[f1|h]m𝒟[f1],,Pr^[fm|h]m𝒟[fm])v_{h}=\left(\frac{\Pr_{\widehat{\mathbb{H}}}[f_{1}|h]}{m\mathcal{D}[f_{1}]},\dots,\frac{\Pr_{\widehat{\mathbb{H}}}[f_{m}|h]}{m\mathcal{D}[f_{m}]}\right)

then with high probability the vectors {vh}h𝒜\{v_{h}\}_{h\in\mathcal{A}} are γ\gamma-representative for the distributions {Pr^[|h]}h𝒜\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{A}}.

To see why the above holds, note that the entries of vh1,,vhkv_{h_{1}},\dots,v_{h_{k}} are all at most kk. Thus for any accuracy parameter γ\gamma, by taking m>poly(k,1/γ)m>\mathrm{poly}(k,1/\gamma), we can union bound over a net and get concentration in Fact 3.4. The main point now is that when we want to compute a barycentric spanner we can work with the representative vectors instead.
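Here is a short Python sketch of this subsampling step, in the spirit of (4) and Proposition 3.6. It assumes black-box routines `sample_future` and `cond_pdf` giving sample and exact pdf access to the conditional distributions of \widehat{\mathbb{H}} (as in Lemma 3.3); the function name is ours.

```python
import random

def representative_vectors(histories, sample_future, cond_pdf, m):
    """Build the reduced vectors v_h from (4) for a collection A of histories.

    sample_future(h): draws a future f ~ Pr_{H_hat}[. | h]
    cond_pdf(f, h)  : returns Pr_{H_hat}[f | h] exactly
    The mixture D = (1/k) * sum_h Pr_{H_hat}[. | h] of Proposition 3.6 is sampled
    by first picking a history uniformly and then sampling a future from it.
    """
    k = len(histories)
    futures = [sample_future(random.choice(histories)) for _ in range(m)]
    # Density of the mixture D at each sampled future.
    D = [sum(cond_pdf(f, h) for h in histories) / k for f in futures]
    # v_h has entries Pr_{H_hat}[f_i | h] / (m * D[f_i]).
    return {h: [cond_pdf(f, h) / (m * d) for f, d in zip(futures, D)]
            for h in histories}
```

By Fact 3.4, L1 norms of linear combinations of the returned vectors are unbiased estimates of the corresponding norms for the full conditional distributions, and the choice of mixture keeps all entries bounded by k so that these estimates concentrate.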

3.2.2 Computing a Spanner in the Reduced Space

Section 3.2.1 deals with the fact that there are exponentially many possible futures. There are also exponentially many histories, but because they are close to some low-dimensional subspace, we show that sampling a polynomial number of them suffices to obtain a representative subset.

Formally, we sample k=poly(S,1/γ)k=\mathrm{poly}(S,1/\gamma) histories h1,,hkh_{1},\dots,h_{k} from ^[:t]\widehat{\mathbb{H}}[:t]. While we cannot ensure that the collection

{Pr^[|h1],,Pr^[|hk]}\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h_{1}],\dots,\Pr_{\widehat{\mathbb{H}}}[\cdot|h_{k}]\}

contains a barycentric spanner for all of {Pr^[|h]}h𝒪t\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{O}^{t}} (because there could be some hh that is sampled with very low probability), we can still use Fact 3.2 to argue that this set contains a subset of size SS that is a barycentric spanner for most of {Pr^[|h]}h𝒪t\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{O}^{t}} i.e. with high probability over hh drawn from ^\widehat{\mathbb{H}}, Pr^[|h]\Pr_{\widehat{\mathbb{H}}}[\cdot|h] is close to a bounded linear combination of vectors in this subset.

Now to compute the barycentric spanner, by Proposition 3.6, we can construct representative vectors v_{h_{1}},\dots,v_{h_{k}}, and it suffices to work with these, which have polynomial size. Then we can run an algorithm for computing approximate barycentric spanners (see Lemma 5.6) on this collection. There are additional technical details due to the error, i.e. the vectors are close to, but not exactly contained in, an S-dimensional subspace. For the algorithm to work, it is crucial that this error is sufficiently small compared to the accuracy of the spanner, so we need to ensure that \epsilon\ll\mathrm{poly}(\gamma) (see Section 6.1 for more details). Omitting these technical details, we have now sketched the proof of the following:

Lemma 3.7 (Informal, see Lemma 5.6).

With \mathrm{poly}(S,O,T,1/\gamma) conditional queries and runtime, we can compute sets \mathcal{H}^{(1)},\dots,\mathcal{H}^{(T-1)} such that for each 1\leq t\leq T-1, |\mathcal{H}^{(t)}|\leq S and the collection \{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{H}^{(t)}} is a (O(1),\gamma)-spanner for a (1-\gamma)-fraction of \{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{O}^{t}} (where the fraction is measured with respect to h\sim\widehat{\mathbb{H}}).

For simplicity, in this overview, we even assume that {Pr^[|h]}h(t)\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{H}^{(t)}} is actually an approximate spanner for all of {Pr^[|h]}h𝒪t\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{O}^{t}} — in the full proof, roughly the “exceptional” possibilities for hh can be absorbed into the failure probability.

3.2.3 A Reduced Representation

Let (t)𝒪t\mathcal{B}^{(t)}\subseteq\mathcal{O}^{t} consist of all strings in (t)\mathcal{H}^{(t)} and all strings obtained by taking a string in (t1)\mathcal{H}^{(t-1)} and appending a character in 𝒪\mathcal{O}. In other words (t)=(t){(t1)o}o𝒪\mathcal{B}^{(t)}=\mathcal{H}^{(t)}\cup\{\mathcal{H}^{(t-1)}\vee o\}_{o\in\mathcal{O}}. Now the set of information that we store is as follows:

Definition 3.8 (Learned representation (informal)).

Our learned representation consists of the following information:

  • Sets (t)\mathcal{H}^{(t)} for t=1,2,,T1t=1,2,\dots,T-1

  • The next character probabilities (under ^\widehat{\mathbb{H}}) for all of the elements of (t)\mathcal{H}^{(t)}

  • Vectors {vh}h(t)\{v_{h}\}_{h\in\mathcal{B}^{(t)}} that are representative for the distributions {Pr^[|h]}h(t)\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{B}^{(t)}} (can be constructed using Proposition 3.6).

This is our first main departure from the idealized representation: rather than storing explicit linear combinations as in the idealized representation (see (1)), we simply store the representative vectors and defer the computation of the “change-of-basis” to the sampling algorithm. For technical reasons, the full algorithm requires storing some additional information but we will omit these in this overview. See Section 6.2 for details.

3.3 Sampling Step

As in the idealized sampling algorithm in Section 3.1, we sample a string xx one character at a time. At each step t<Tt<T, we maintain a linear combination {αhx}h(t)\{\alpha_{h}^{x}\}_{h\in\mathcal{H}^{(t)}} of the strings h(t)h\in\mathcal{H}^{(t)} as in (2), except with conditional distributions with respect to ^\widehat{\mathbb{H}}. We then sample a character and re-normalize as in (3). The main difficulty now is the change-of-basis step.

Naive Attempt: Linear Change-of-Basis

Naively, for each h(t)h\in\mathcal{H}^{(t)} and o𝒪o\in\mathcal{O}, we could simply pre-compute coefficients {βhho}h(t+1)\{\beta_{h^{\prime}}^{h\vee o}\}_{h^{\prime}\in\mathcal{H}^{(t+1)}} such that

vhoh(t+1)βhhovhv_{h\vee o}\approx\sum_{h^{\prime}\in\mathcal{H}^{(t+1)}}\beta_{h^{\prime}}^{h\vee o}v_{h^{\prime}}

and thus (since the vectors vhv_{h} are representative for the distribution Pr^[|h]\Pr_{\widehat{\mathbb{H}}}[\cdot|h]),

Pr^[|ho]h(t+1)βhhoPr^[|h].\Pr_{\widehat{\mathbb{H}}}[\cdot|h\vee o]\approx\sum_{h^{\prime}\in\mathcal{H}^{(t+1)}}\beta_{h^{\prime}}^{h\vee o}\Pr_{\widehat{\mathbb{H}}}[\cdot|h^{\prime}]\,.

We can then substitute this into the re-normalized relation (see (3)) and get

Pr^[|xo]h(t+1)θhxoPr^[|h], where θhxo=h(t)αhxβhhoPr^[o|h]Pr^[o|x]\Pr_{\widehat{\mathbb{H}}}[\cdot|x\vee o]\approx\sum_{h^{\prime}\in\mathcal{H}^{(t+1)}}\theta_{h^{\prime}}^{x\vee o}\Pr_{\widehat{\mathbb{H}}}[\cdot|h^{\prime}]\mbox{, where }\theta_{h^{\prime}}^{x\vee o}=\sum_{h\in\mathcal{H}^{(t)}}\alpha_{h}^{x}\beta_{h^{\prime}}^{h\vee o}\cdot\frac{\Pr_{\widehat{\mathbb{H}}}[o|h]}{\Pr_{\widehat{\mathbb{H}}}[o|x]} (5)

This gives us an expression for Pr^[|xo]\Pr_{\widehat{\mathbb{H}}}[\cdot|x\vee o] as a linear combination of the vectors Pr^[|h]\Pr_{\widehat{\mathbb{H}}}[\cdot|h^{\prime}] for h(t+1)h^{\prime}\in\mathcal{H}^{(t+1)}. However, the issue with this approach is that the coefficients and approximation error may grow exponentially with tt.

Need for Projection

The fact that the coefficients may grow motivates the need for a “projection” step, where we reduce the magnitudes of the coefficients. One natural attempt is to take the coefficients {θhxo}h(t+1)\{\theta_{h^{\prime}}^{x\vee o}\}_{h^{\prime}\in\mathcal{H}^{(t+1)}} in (5) and “project” them by finding an equivalent set of coefficients of bounded magnitude, say {αhxo}h(t+1)\{\alpha_{h^{\prime}}^{x\vee o}\}_{h^{\prime}\in\mathcal{H}^{(t+1)}}, such that

h(t+1)θhxoPr^[|h]h(t+1)αhxoPr^[|h].\sum_{h^{\prime}\in\mathcal{H}^{(t+1)}}\theta_{h^{\prime}}^{x\vee o}\Pr_{\widehat{\mathbb{H}}}[\cdot|h^{\prime}]\approx\sum_{h^{\prime}\in\mathcal{H}^{(t+1)}}\alpha_{h^{\prime}}^{x\vee o}\Pr_{\widehat{\mathbb{H}}}[\cdot|h^{\prime}]\,. (6)

In fact, there exists a set of coefficients of bounded magnitude such that

Pr^[|xo]h(t+1)αhxoPr^[|h]\Pr_{\widehat{\mathbb{H}}}[\cdot|x\vee o]\approx\sum_{h^{\prime}\in\mathcal{H}^{(t+1)}}\alpha_{h^{\prime}}^{x\vee o}\Pr_{\widehat{\mathbb{H}}}[\cdot|h^{\prime}] (7)

with high probability over xx, by Lemma 3.7. However, there is a subtle issue with this approach. We only have access to the vectors {vh}h(t+1)\{v_{h}\}_{h\in\mathcal{B}^{(t+1)}} so we can only compute (succinct representations) of the expressions in (6) but not of Pr^[|xo]\Pr_{\widehat{\mathbb{H}}}[\cdot|x\vee o].

The issue is captured by the following abstraction. Let w be the vector that is equal to the left hand side of (6). Let \mathcal{K} be the convex set consisting of all possible vectors obtainable on the right hand side of (6) for coefficients \{\alpha_{h^{\prime}}^{x\vee o}\}_{h^{\prime}\in\mathcal{H}^{(t+1)}} bounded by some constant. Let z=\Pr_{\widehat{\mathbb{H}}}[\cdot|x\vee o]. We know z\in\mathcal{K} (up to some small error), but we do not know z. Now we want to map w to a point z^{\prime}\in\mathcal{K}. Ideally, we want to guarantee \left\lVert z^{\prime}-z\right\rVert_{1}\leq\left\lVert w-z\right\rVert_{1}, up to some small additive error. The issue is that we cannot guarantee better than

zz12wz1\left\lVert z^{\prime}-z\right\rVert_{1}\leq 2\left\lVert w-z\right\rVert_{1}

due to the fact that zz is unknown and having to use triangle inequality. In other words, the problem is that when projecting onto a convex set in L1L^{1} norm, we may still double the L1L^{1} distance to some of the elements of the convex set, causing the error to grow multiplicatively in each iteration.

Projection in KL

The above issue motivates projection in a different distance measure between distributions. Specifically, KL divergence has the following appealing property:

Fact 3.9.

Let \mathcal{T}\subset{\mathbb{R}}^{N} be the N-dimensional simplex, i.e. the convex hull of the standard basis vectors. We view points in \mathcal{T} as distributions over N elements. Let \mathcal{K}\subseteq\mathcal{T} be a convex set and let w\in\mathcal{T} be a point. Let z^{*}=\operatorname{argmin}_{z\in\mathcal{K}}\textsf{KL}(z\|w). Then for all z\in\mathcal{K},

KL(zz)KL(zw).\textsf{KL}(z\|z^{*})\leq\textsf{KL}(z\|w)\,.

This implies that “projecting” onto a convex set actually decreases the KL-divergence from all elements within the convex set. This is the key to overcoming the error doubling issue above. In particular, in the same abstraction as above, we would just set z=zz^{\prime}=z^{*}.

More concretely, since the vectors {vh}h(t+1)\{v_{h}\}_{h\in\mathcal{B}^{(t+1)}} are succinct representations of Pr^[|h]\Pr_{\widehat{\mathbb{H}}}[\cdot|h], we can solve the projection in KL, which is a convex optimization problem, over these representative vectors. While Proposition 3.6 is stated for TV distance, we can extend it to also preserve the KL divergence between linear combinations of these distributions as well. Technically, this requires a “truncated” notion of KL to deal with possibly negative entries but we will gloss over this for now (see Definition 4.8 and Lemma 5.10 for more details).

To describe our actual algorithm, once we have the coefficients as in (3), we solve the following convex optimization problem. We solve for coefficients {αhxo}h(t+1)\{\alpha_{h^{\prime}}^{x\vee o}\}_{h^{\prime}\in\mathcal{H}^{(t+1)}} that are bounded in magnitude and minimize

KL(h(t+1)αhxovhh(t)αhxPr^[o|h]Pr^[o|x]Pr^[|ho]).\textsf{KL}\left(\sum_{h^{\prime}\in\mathcal{H}^{(t+1)}}\alpha_{h^{\prime}}^{x\vee o}v_{h^{\prime}}\Bigg{\|}\sum_{h\in\mathcal{H}^{(t)}}\alpha_{h}^{x}\frac{\Pr_{\widehat{\mathbb{H}}}[o|h]}{\Pr_{\widehat{\mathbb{H}}}[o|x]}\Pr_{\widehat{\mathbb{H}}}[\cdot|h\vee o]\right)\,. (8)

While we omit the details in this overview (the full analysis is in Section 7), the main point of this projection in KL step is that we can appeal to Fact 3.9 with z=Pr^[|xo]z=\Pr_{\widehat{\mathbb{H}}}[\cdot|x\vee o] and z=h(t+1)αhxovhz^{*}=\sum_{h^{\prime}\in\mathcal{H}^{(t+1)}}\alpha_{h^{\prime}}^{x\vee o}v_{h^{\prime}} for the optima obtained in (8) to get

KL(Pr^[|xo]h(t+1)αhxovh)KL(Pr^[|xo]h(t)αhxPr^[o|h]Pr^[o|x]Pr^[|ho])+γ\textsf{KL}\left(\Pr_{\widehat{\mathbb{H}}}[\cdot|x\vee o]\Bigg{\|}\sum_{h^{\prime}\in\mathcal{H}^{(t+1)}}\alpha_{h^{\prime}}^{x\vee o}v_{h^{\prime}}\right)\leq\textsf{KL}\left(\Pr_{\widehat{\mathbb{H}}}[\cdot|x\vee o]\Bigg{\|}\sum_{h\in\mathcal{H}^{(t)}}\alpha_{h}^{x}\frac{\Pr_{\widehat{\mathbb{H}}}[o|h]}{\Pr_{\widehat{\mathbb{H}}}[o|x]}\Pr_{\widehat{\mathbb{H}}}[\cdot|h\vee o]\right)+\gamma

where the additive error γ\gamma comes only from how well KL-divergences are preserved in the succinct representation Pr^[|h]vh\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\rightarrow v_{h}. In particular, this completes the “change-of-basis” from (t)\mathcal{H}^{(t)} to (t+1)\mathcal{H}^{(t+1)} and the error only grows additively. Now to sample the entire string, we simply sample one character at a time and iterate the above.
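The following Python sketch illustrates one such projection step in the spirit of (8), using a generic convex solver over the reduced (representative) vectors with the truncated KL objective of Definition 4.8. It is a simplified stand-in for the optimization routine analyzed in Section 7, and the helper names and toy data are ours.

```python
import numpy as np
from scipy.optimize import minimize

def kl_change_of_basis(target, basis_vectors, C=2.0, c=1e-6):
    """Find coefficients alpha with |alpha_i| <= C so that sum_i alpha_i * basis_vectors[i]
    is close, in truncated KL, to `target` (the reduced representation of the
    re-normalized mixture on the right-hand side of (8)).
    """
    V = np.asarray(basis_vectors, float)      # shape (S, m): one row per spanner history h'
    w = np.asarray(target, float)             # reduced representation of the target mixture

    def objective(alpha):
        z = alpha @ V
        # truncated KL from Definition 4.8 so the objective is defined everywhere
        return float(np.sum(z * (np.log(np.maximum(z, c)) - np.log(np.maximum(w, c)))))

    alpha0 = np.full(V.shape[0], 1.0 / V.shape[0])
    res = minimize(objective, alpha0, bounds=[(-C, C)] * V.shape[0], method="L-BFGS-B")
    return res.x

# Toy usage: project a perturbed mixture back onto bounded combinations.
rng = np.random.default_rng(0)
V = rng.dirichlet(np.ones(8), size=3)         # 3 "spanner" distributions on 8 futures
w = 0.5 * V[0] + 0.5 * V[1] + 0.01 * rng.normal(size=8)
print(kl_change_of_basis(w, V))
```

The objective is convex in the coefficients since the candidate enters linearly in the first argument of the (truncated) KL divergence, which is what lets the projection be solved efficiently at each step of the sampler.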

3.4 Organization

In Section 4, we present a few basic observations for simulating density access to a distribution ^\widehat{\mathbb{H}} that is close to \mathbb{H} and working with (truncated) KL divergence. In Section 5, we introduce our machinery for working with barycentric spanners, including the dimensionality reduction procedure. In Section 6, we present our full learning algorithm. In Section 7, we present our sampling algorithm for sampling from our learned representation.

4 Basic Results

In this section, we introduce notation and collect several basic observations that will be used throughout the later sections.

4.1 Algorithmic Primitives

We start by developing a few basic primitives that conditional queries allow us to implement. First, we have a subroutine for computing the probabilities of the next character given a specified history hh.

Claim 4.1 (Next Character Probabilities).

For t<T, a string h\in\mathcal{O}^{t}, and parameters \epsilon,\delta>0, we can make \mathrm{poly}(1/\epsilon,O,\log(1/\delta)) conditional queries, and with 1-\delta probability, we output a distribution \mathcal{P}_{h} over \mathcal{O} that is within TV distance \epsilon of the true distribution of the next character \{\Pr_{\mathbb{H}}[o|h]\}_{o\in\mathcal{O}}.

Proof.

We can repeatedly make conditional queries with input hh and look at the next character. We make poly(1/ϵ,O,log(1/δ))\mathrm{poly}(1/\epsilon,O,\log(1/\delta)) such queries and output the empirical distribution of the next character. By a Chernoff bound, with 1δ1-\delta probability, the resulting distribution is within TV distance ϵ\epsilon of the true distribution {Pr[o|h]}o𝒪\{\Pr_{\mathbb{H}}[o|h]\}_{o\in\mathcal{O}}. ∎
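A minimal Python sketch of this estimator is below; it assumes an `oracle.query(history)` routine implementing the conditional query model of Definition 1.2 and returning a sampled future.

```python
from collections import Counter

def estimate_next_char(oracle, history, alphabet, num_queries):
    """Empirical next-character distribution after `history` (sketch of Claim 4.1)."""
    counts = Counter(oracle.query(history)[0] for _ in range(num_queries))
    return {o: counts[o] / num_queries for o in alphabet}
```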

In light of the above, we make the following definition.

Definition 4.2 (Conditional Closeness).

For two distributions ,^\mathbb{H},\widehat{\mathbb{H}} on 𝒪T\mathcal{O}^{T}, we say they are ϵ\epsilon-conditionally close if for any string h𝒪th\in\mathcal{O}^{t} for t<Tt<T, the conditional distributions of the next character {Pr[o|h]}o𝒪\{\Pr_{\mathbb{H}}[o|h]\}_{o\in\mathcal{O}} and {Pr^[o|h]}o𝒪\{\Pr_{\widehat{\mathbb{H}}}[o|h]\}_{o\in\mathcal{O}} are ϵ\epsilon-close in TV distance.

The following statement shows that conditional closeness implies closeness in TV distance for the entire distribution (and further on any conditional distribution) up to a factor of TT.

Claim 4.3.

Let ,^\mathbb{H},\widehat{\mathbb{H}} be distributions on 𝒪T\mathcal{O}^{T} that are ϵ/T\epsilon/T conditionally close. Then for any history h𝒪th\in\mathcal{O}^{t} for tTt\leq T, we must have

dTV(Pr[|h],Pr^[|h])ϵd_{\textsf{TV}}(\Pr_{\mathbb{H}}[\cdot|h],\Pr_{\widehat{\mathbb{H}}}[\cdot|h])\leq\epsilon

where Pr[|h],Pr^[|h]\Pr_{\mathbb{H}}[\cdot|h],\Pr_{\widehat{\mathbb{H}}}[\cdot|h] represent the full distribution over the future f𝒪Ttf\in\mathcal{O}^{T-t}.

Proof.

We prove by reverse induction on tt that

dTV(Pr[|h],Pr^[|h])ϵ(Tt)Td_{\textsf{TV}}(\Pr_{\mathbb{H}}[\cdot|h],\Pr_{\widehat{\mathbb{H}}}[\cdot|h])\leq\frac{\epsilon(T-t)}{T}

whenever hh has length tt. When t=T1t=T-1, the above is immediate from the assumption that ,^\mathbb{H},\widehat{\mathbb{H}} are ϵ/T\epsilon/T conditionally close. Now assume we have proven the claim for tt. Consider hh that has length t1t-1.

dTV(Pr[|h],Pr^[|h])ϵT+o𝒪Pr[o|h]dTV(Pr[|ho],Pr^[|ho])ϵ(Tt+1)Td_{\textsf{TV}}(\Pr_{\mathbb{H}}[\cdot|h],\Pr_{\widehat{\mathbb{H}}}[\cdot|h])\leq\frac{\epsilon}{T}+\sum_{o\in\mathcal{O}}\Pr_{\mathbb{H}}[o|h]d_{\textsf{TV}}(\Pr_{\mathbb{H}}[\cdot|h\vee o],\Pr_{\widehat{\mathbb{H}}}[\cdot|h\vee o])\leq\frac{\epsilon(T-t+1)}{T}

where in the above, we first used conditional closeness and then the inductive hypothesis. This completes the proof. ∎

For technical reasons later on, it will be convenient to have the following definition which necessitates that, regardless of the prefix, all possibilities for the next character occur with some non-negligible probability.

Definition 4.4 (Positivity).

Let 0<η<10<\eta<1 be some parameter. We say a distribution \mathbb{H} on 𝒪T\mathcal{O}^{T} is η\eta-positive if for any string h𝒪th\in\mathcal{O}^{t} for t<Tt<T, the probabilities Pr[o|h]\Pr_{\mathbb{H}}[o|h] are at least η\eta for any choice of o𝒪o\in\mathcal{O}.

Now we show how to implement sample and exact pdf access to a distribution ^\widehat{\mathbb{H}} that is conditionally close to the distribution of \mathbb{H}. Note that this is slightly stronger than just implementing approximate pdf access to \mathbb{H} since we need some “consistency” between the responses to pdf queries.

First, we can approximate the pdf of \mathbb{H} at a given string h𝒪th\in\mathcal{O}^{t} by iteratively computing the next character probabilities using Claim 4.1 and then multiplying them. The key to ensure consistency is to only estimate the “next character distribution” once for each possible prefix hh. Whenever this prefix shows up again in a different computation, we use the same “next character distribution” that has already been computed.

Definition 4.5 (Sample and PDF Access).

For a distribution ^\widehat{\mathbb{H}} over 𝒪T\mathcal{O}^{T}, we say that we have sample and pdf access to ^\widehat{\mathbb{H}} if we can perform the following operations:

  • Given a string h𝒪th\in\mathcal{O}^{t} for t<Tt<T, draw a sample f𝒪Ttf\in\mathcal{O}^{T-t} from Pr^[f|h]\Pr_{\widehat{\mathbb{H}}}[f|h]

  • Given a string h𝒪th\in\mathcal{O}^{t} for tTt\leq T, output Pr^[h]\Pr_{\widehat{\mathbb{H}}}[h] (where this is the probability that the first tt characters of the string match hh)

Lemma 4.6 (PDF Estimation).

Let NN\in\mathbb{N} and ϵ,δ>0\epsilon,\delta>0 be parameters. Assume we are given conditional query access to a distribution \mathbb{H}. With probability at least 1δ1-\delta, we can respond to NN sample and pdf queries for some distribution ^\widehat{\mathbb{H}} that is ϵ/(10O)2\epsilon/(10O)^{2}-positive and ϵ\epsilon conditionally close to \mathbb{H}. This uses a total of poly(1/ϵ,T,O,N,log(1/δ))\mathrm{poly}(1/\epsilon,T,O,N,\log(1/\delta)) conditional queries to \mathbb{H}.

Remark 4.7.

While we don’t specify the whole distribution ^\widehat{\mathbb{H}}, the point is that the responses made by the algorithm to the sample and pdf queries are consistent with some distribution ^\widehat{\mathbb{H}} that is close to \mathbb{H}.

Proof.

We define a rooted tree with TT layers where each intermediate node has exactly OO children. Now the nodes of the tree are labeled with strings in 𝒪t\mathcal{O}^{t} for t=0,1,,Tt=0,1,\dots,T. The root is labeled with the empty string and the labels of the children of each node are obtained by appending each of the OO possible characters – thus the labels of the nodes at level tt are exactly the strings in 𝒪t\mathcal{O}^{t}.

We first describe an algorithm for answering sample and pdf queries that requires exponential time but then we will give a polynomial time algorithm for answering polynomially many queries that is statistically equivalent to it.

Naive Algorithm:

for every non-leaf node in the tree, say with label hh, apply Claim 4.1 (using new, independent samples) to compute the next-character probabilities ph[o]p_{h}[o] for o𝒪o\in\mathcal{O}. We can ensure that with probability 1δ/OT1-\delta/O^{T}, |Pr[oh]ph[o]|ϵ/(10O)|\Pr_{\mathbb{H}}[o|h]-p_{h}[o]|\leq\epsilon/(10O) for all o𝒪o\in\mathcal{O}. By perturbing the ph[o]p_{h}[o] if necessary, we can also ensure ph[o]ϵ/(10O)2p_{h}[o]\geq\epsilon/(10O)^{2} and o𝒪ph[o]=1\sum_{o\in\mathcal{O}}p_{h}[o]=1 for all o𝒪o\in\mathcal{O}. Now assign the edge from hh to hoh\vee o a weight of ph[o]p_{h}[o]. Let ^\widehat{\mathbb{H}} be the distribution induced by this weighted tree where the probability of sampling a string s𝒪Ts\in\mathcal{O}^{T} is equal to the product of the weights along the root-to-leaf path to ss. Now answer all sample and pdf queries with respect to ^\widehat{\mathbb{H}}. It is clear that ^\widehat{\mathbb{H}} is a valid distribution, and union bounding over all nodes, with 1δ1-\delta probability, ^\widehat{\mathbb{H}} is ϵ/(10O)2\epsilon/(10O)^{2}-positive and ϵ\epsilon conditionally close to \mathbb{H}.

Efficient Implementation:

We say we visit a node labeled hh to mean that we apply Claim 4.1 to compute the next-character probabilities and assign edge weights as described above. We will lazily maintain the tree instead, only visiting nodes as necessary. After we have visited a node, we never revisit it, so its edge weights are fixed once it has been visited once. Initially, we have only visited the root. Each time we receive a pdf query for a string hh, we follow the root-to-leaf path to hh and visit all nodes along this path. Now the pdf we output is obtained by simply multiplying the edge weights along the path from the root to hh.

Next, for a conditional sample query at hh, we begin from the node labeled hh. If we have visited it already, then we sample one of its children with probabilities proportional to the edge weights. We then move to this child and repeat. If we have not visited a node yet, we first visit it and then sample a child and repeat until we reach a leaf. We output this leaf as the conditional sample.

In this implementation, answering each query only requires poly(1/ϵ,T,O,log(1/δ))\mathrm{poly}(1/\epsilon,T,O,\log(1/\delta)) time and conditional queries to \mathbb{H}. Furthermore, this algorithm is statistically equivalent to the naive algorithm described earlier so we are done.
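The following Python sketch illustrates the lazy implementation. It assumes a routine `estimate_next_char(h)` that returns the (already perturbed and normalized) next-character distribution p_h from Claim 4.1; the caching of each visited node is what makes all answers consistent with a single distribution \widehat{\mathbb{H}}.

```python
import random

class LazyPDFOracle:
    """Lazy sample-and-pdf access in the spirit of Lemma 4.6 (sketch; names ours)."""

    def __init__(self, estimate_next_char, alphabet, T):
        self.estimate = estimate_next_char
        self.alphabet = list(alphabet)
        self.T = T
        self.cache = {}                     # history (tuple) -> {o: p_h[o]}

    def _next_char(self, h):
        if h not in self.cache:             # visit the node once, then reuse its edge weights
            self.cache[h] = self.estimate(h)
        return self.cache[h]

    def pdf(self, h):
        """Pr_{H_hat}[h]: product of cached edge weights along the root-to-h path."""
        h = tuple(h)
        p = 1.0
        for t in range(len(h)):
            p *= self._next_char(h[:t])[h[t]]
        return p

    def sample(self, h):
        """Draw a future f ~ Pr_{H_hat}[. | h] by walking down the lazy tree."""
        x = tuple(h)
        while len(x) < self.T:
            dist = self._next_char(x)
            o = random.choices(self.alphabet,
                               weights=[dist[a] for a in self.alphabet], k=1)[0]
            x = x + (o,)
        return x[len(h):]
```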

4.2 Truncated KL Divergence

Later on in the analysis, we will need a modified version of KL Divergence that is truncated so that it is defined everywhere and doesn’t blow up at 0.

Definition 4.8.

For a parameter c>0c>0 and real valued inputs x,yx,y, we define

c(x,y)=xlogmax(x,c)xlogmax(y,c).\ell_{\geq c}(x,y)=x\log\max(x,c)-x\log\max(y,c)\,.
Definition 4.9.

For vectors u,v,wNu,v,w\in{\mathbb{R}}^{N} where ww has positive entries, we define

KLw(uv)=i=1Nw[i](u[i],v[i]).\textsf{KL}_{\geq w}(u\|v)=\sum_{i=1}^{N}\ell_{\geq w[i]}(u[i],v[i])\,.

When ww has all entries equal to cc, we will slightly simplify notation and write

KLc(uv)=i=1Nc(u[i],v[i]).\textsf{KL}_{\geq c}(u\|v)=\sum_{i=1}^{N}\ell_{\geq c}(u[i],v[i])\,.

Note that the above is well-defined since we are only evaluating the logarithm when both u[i]u[i] and v[i]v[i] are positive. When u,vu,v are distributions and c=0c=0, the above corresponds to the usual definition of KL-divergence. When c>0c>0, this corresponds to a soft truncation of the distributions to elements with nontrivial mass.
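For concreteness, here is a short Python sketch of the truncated divergence defined above; the sanity check at the end illustrates that for distributions and a negligible threshold it recovers the usual KL divergence.

```python
import numpy as np

def kl_ge_w(u, v, w):
    """KL_{>=w}(u || v) from Definition 4.9, with an entrywise threshold vector w."""
    u, v, w = (np.asarray(a, float) for a in (u, v, w))
    return float(np.sum(u * (np.log(np.maximum(u, w)) - np.log(np.maximum(v, w)))))

# For distributions and a tiny threshold this matches the usual KL divergence.
u = np.array([0.5, 0.3, 0.2])
v = np.array([0.4, 0.4, 0.2])
print(kl_ge_w(u, v, np.full(3, 1e-12)), float(np.sum(u * np.log(u / v))))
```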

One of the key properties of KL divergence is that it allows for projection in the following sense: when projecting a point onto a convex set, the projection is closer to all points in the set.

Fact 4.10.

Let 𝒯N\mathcal{T}\subset{\mathbb{R}}^{N} be the NN-dimensional simplex i.e. the convex hull of the standard basis vectors. Let 𝒦𝒯\mathcal{K}\subseteq\mathcal{T} be a convex set and let v𝒯v\in\mathcal{T} be a point. Let x=argminx𝒦KL(xv)x^{*}=\operatorname{argmin}_{x\in\mathcal{K}}\textsf{KL}(x\|v). Then for all x𝒦x\in\mathcal{K},

KL(xx)KL(xv).\textsf{KL}(x\|x^{*})\leq\textsf{KL}(x\|v)\,.
Proof.

Let x=(x1,,xN)x^{*}=(x_{1}^{*},\dots,x_{N}^{*}) and v=(v1,,vN)v=(v_{1},\dots,v_{N}). Then the optimality of xx^{*} implies that for any x=(x1,,xN)𝒦x=(x_{1},\dots,x_{N})\in\mathcal{K},

j=1N(xjxj)(logxjlogvj+1)0.\sum_{j=1}^{N}(x_{j}-x_{j}^{*})(\log x_{j}^{*}-\log v_{j}+1)\geq 0\,.

Since x,xx,x^{*} are on the simplex and j=1Nxj(logxjlogvj)0\sum_{j=1}^{N}x_{j}^{*}(\log x_{j}^{*}-\log v_{j})\geq 0 by non-negativity of KL divergence, we conclude

j=1Nxjlogvjj=1Nxjlogxj\sum_{j=1}^{N}x_{j}\log v_{j}\leq\sum_{j=1}^{N}x_{j}\log x_{j}^{*}

and this immediately gives the desired inequality. ∎

In our analysis later on, we will need a slightly more general version of Fact 4.10, stated below, that deals with truncated KL divergences and points that are not exactly on the simplex.

Corollary 4.11.

Let w,v\in{\mathbb{R}}^{N} be vectors. Assume w has positive coordinates and assume \mathcal{K}\subset{\mathbb{R}}^{N} is a convex set such that all elements x\in\mathcal{K} have sum of coordinates equal to 1 and x\geq w entrywise. Let x^{*}=\operatorname{argmin}_{x\in\mathcal{K}}\textsf{KL}_{\geq w}(x\|v). Then for any x\in\mathcal{K},

KLw(xx)KLw(xv)+log(v1+w1).\textsf{KL}_{\geq w}(x\|x^{*})\leq\textsf{KL}_{\geq w}(x\|v)+\log(\left\lVert v\right\rVert_{1}+\left\lVert w\right\rVert_{1})\,.
Proof.

Note that by the assumptions in the statement, for any x𝒦x\in\mathcal{K},

KLw(xv)=KL(xmax(w,v))=KL(xmax(w,v)/C)logC\textsf{KL}_{\geq w}(x\|v)=\textsf{KL}(x\|\max(w,v))=\textsf{KL}\left(x\|\max(w,v)/C\right)-\log C

where the max on the RHS is taken entrywise and CC is any positive constant. Now letting CC be equal to the sum of the coordinates of max(w,v)\max(w,v), note that Cv1+w1C\leq\left\lVert v\right\rVert_{1}+\left\lVert w\right\rVert_{1}. Then we can simply apply Fact 4.10 to the vector max(w,v)/C\max(w,v)/C to get the desired inequality. ∎

We also have the following observation that will allow us to relate truncated KL divergence to TV distance.

Claim 4.12.

For a vector w\in{\mathbb{R}}^{N} with positive entries, minimum entry w_{\min}, and maximum entry w_{\max}, the divergence \textsf{KL}_{\geq w}(u\|v) is Lipschitz with respect to u with Lipschitz constant

1+log(1+1/wmin)+log(1+wmax)+log(1+u)+log(1+v).1+\log(1+1/w_{\min})+\log(1+w_{\max})+\log(1+\left\lVert u\right\rVert_{\infty})+\log(1+\left\lVert v\right\rVert_{\infty})\,.
Proof.

This follows from the definition: the partial derivative of \ell_{\geq w[i]}(u[i],v[i]) with respect to u[i] equals \log\max(u[i],w[i])-\log\max(v[i],w[i])+\mathbf{1}\{u[i]\geq w[i]\}. Both arguments of the logarithms lie between w_{\min} and \max(w_{\max},\left\lVert u\right\rVert_{\infty},\left\lVert v\right\rVert_{\infty}), so this derivative is bounded in magnitude by the stated constant. ∎

5 Geometric Results

Our learning algorithm will also rely on a few geometric constructions which we introduce below.

5.1 Spanners

First, we recall the standard notion of a barycentric spanner.

Definition 5.1 (Barycentric Spanner).

For a collection of vectors z1,,zndz_{1},\dots,z_{n}\in{\mathbb{R}}^{d}, we say another set of vectors v1,,vsdv_{1},\dots,v_{s}\in{\mathbb{R}}^{d} is a (C,γ)(C,\gamma)-spanner for them if for all j[n]j\in[n], there are coefficients c1,,csc_{1},\dots,c_{s} with |ci|C|c_{i}|\leq C such that

zj(c1v1++csvs)1γ.\left\lVert z_{j}-(c_{1}v_{1}+\dots+c_{s}v_{s})\right\rVert_{1}\leq\gamma\,.

Note that the error is defined in L1L^{1} because later on, we will have v1,,vsv_{1},\dots,v_{s} representing the density functions of distributions and the error will correspond to TV-closeness.

Next we introduce a distributional notion of a spanner, where we only require that the spanner covers most of the mass of some underlying distribution.

Definition 5.2 (Distribution Spanner).

Let 𝒟\mathcal{D} be a distribution over d{\mathbb{R}}^{d}. Let C>0C>0, 0ϵ,γ10\leq\epsilon,\gamma\leq 1 be parameters. We say a set of points {v1,,vs}d\{v_{1},\dots,v_{s}\}\in{\mathbb{R}}^{d} is a (1ϵ,C,γ)(1-\epsilon,C,\gamma)-spanner for 𝒟\mathcal{D} if for a random point zz drawn from 𝒟\mathcal{D}, with 1ϵ1-\epsilon probability, there are coefficients |c1|,,|cs|C|c_{1}|,\dots,|c_{s}|\leq C with

z(c1v1++csvs)1γ.\left\lVert z-(c_{1}v_{1}+\dots+c_{s}v_{s})\right\rVert_{1}\leq\gamma\,.

It is folklore that any finite set of points has an exact barycentric spanner. However, in our setting, it will be important to compute a spanner for a high-dimensional distribution given only polynomially many samples.

Fact 5.3.

[Folklore] Let v_{1},\dots,v_{n} be arbitrary vectors in {\mathbb{R}}^{d}. Then there exists a subset S\subseteq[n] with |S|\leq d such that \{v_{i}\}_{i\in S} form a (1,0) spanner for the collection (v_{1},\dots,v_{n}).

Lemma 5.4.

Let 𝒟\mathcal{D} be a distribution over d{\mathbb{R}}^{d}. Let 0<ϵ,δ<10<\epsilon,\delta<1 be parameters. Given a set 𝒜\mathcal{A} of (d/ϵ)10log(1/δ)(d/\epsilon)^{10}\cdot\log(1/\delta) independent samples from 𝒟\mathcal{D}, with probability 1δ1-\delta, there exists a subset of dd elements of 𝒜\mathcal{A}, {v1,,vd}\{v_{1},\dots,v_{d}\}, which form a (1ϵ,1,0)(1-\epsilon,1,0)-spanner for 𝒟\mathcal{D}.

Proof.

For any d vectors in {\mathbb{R}}^{d}, say v_{1},\dots,v_{d}, let \mathcal{P}_{v_{1},\dots,v_{d}} be the convex body whose vertices are \pm v_{1}\pm\dots\pm v_{d}. Let \mu(\mathcal{P}_{v_{1},\dots,v_{d}}) be the probability mass of \mathcal{D} inside this convex body and let r(\mathcal{P}_{v_{1},\dots,v_{d}}) be the fraction of points of \mathcal{A} that are within this body. For fixed v_{1},\dots,v_{d}, we have

|μ(𝒫v1,,vd)r(𝒫v1,,vd)|0.5ϵ|\mu(\mathcal{P}_{v_{1},\dots,v_{d}})-r(\mathcal{P}_{v_{1},\dots,v_{d}})|\leq 0.5\epsilon

with 1δ(d/ϵ)21-\delta^{(d/\epsilon)^{2}} probability. Now we can union bound over all subsets {v1,,vd}𝒜\{v_{1},\dots,v_{d}\}\subseteq\mathcal{A} and deduce that with probability 1δ1-\delta, for all such subsets,

|μ(𝒫v1,,vd)r(𝒫v1,,vd)|ϵ.|\mu(\mathcal{P}_{v_{1},\dots,v_{d}})-r(\mathcal{P}_{v_{1},\dots,v_{d}})|\leq\epsilon\,.

However, by Fact 5.3, there is a subset of dd elements of 𝒜\mathcal{A} that is a (1,0)(1,0)-spanner for 𝒜\mathcal{A} and thus the above implies that taking {v1,,vd}\{v_{1},\dots,v_{d}\} to be this subset gives us a (1ϵ,1,0)(1-\epsilon,1,0)-spanner for 𝒟\mathcal{D}. ∎

Now we recall the following standard algorithm for computing a spanner (see e.g. [AK08]). We include a proof for completeness.

Claim 5.5.

Let v1,,vndv_{1},\dots,v_{n}\in{\mathbb{R}}^{d} be vectors, let 𝐌d×n\mathbf{M}\in{\mathbb{R}}^{d\times n} be the matrix with columns given by v1,,vnv_{1},\dots,v_{n}, and let its singular values be σ1σd\sigma_{1}\geq\dots\geq\sigma_{d}. Then there is an algorithm that runs in time poly(n,d)log(σ1/σd)\mathrm{poly}(n,d)\cdot\log(\sigma_{1}/\sigma_{d}) that computes a subset [n]\mathcal{B}\subseteq[n] with ||d|\mathcal{B}|\leq d such that {vi}i\{v_{i}\}_{i\in\mathcal{B}} is a (2,0)(2,0)-spanner for the collection {v1,,vn}\{v_{1},\dots,v_{n}\}.

Proof.

First, we iteratively construct an initial subset 𝒜[n]\mathcal{A}\subseteq[n] as follows. Initially start with 𝒜0=\mathcal{A}_{0}=\emptyset and in each step j=1,2,,dj=1,2,\dots,d, find a vector vajv_{a_{j}} such that

Πspan({vi}i𝒜j1)(vaj)σdn.\left\lVert\Pi_{\textsf{span}\left(\{v_{i}\}_{i\in\mathcal{A}_{j-1}}\right)^{\perp}}(v_{a_{j}})\right\rVert\geq\frac{\sigma_{d}}{\sqrt{n}}\,.

First, such a vector always exists for j=1,2,,dj=1,2,\dots,d because we know that

a[n]Πspan({vi}i𝒜j1))(va)2min𝐔d×(j1),𝐔𝐔=𝐈𝐌𝐔𝐔𝐌F2σj2σd2\sum_{a\in[n]}\left\lVert\Pi_{\textsf{span}\left(\{v_{i}\}_{i\in\mathcal{A}_{j-1}})\right)^{\perp}}(v_{a})\right\rVert^{2}\geq\min_{\mathbf{U}\in{\mathbb{R}}^{d\times(j-1)},\mathbf{U}^{\top}\mathbf{U}=\mathbf{I}}\left\lVert\mathbf{M}-\mathbf{U}\mathbf{U}^{\top}\mathbf{M}\right\rVert_{F}^{2}\geq\sigma_{j}^{2}\geq\sigma_{d}^{2}

where we used the extremal characterization of the jjth singular value of 𝐌\mathbf{M} in terms of the best rank-jj approximation to 𝐌\mathbf{M}.

After this procedure, we have a subset 𝒜={va1,,vad}\mathcal{A}=\{v_{a_{1}},\dots,v_{a_{d}}\}. We will use Vol(𝒜)\textsf{Vol}(\mathcal{A}) to denote the volume of the simplex formed by the vectors in 𝒜\mathcal{A}. By construction, we have

Vol(𝒜)σdd(dn)d.\textsf{Vol}(\mathcal{A})\geq\frac{\sigma_{d}^{d}}{(dn)^{d}}\,.

Next, assume for the sake of contradiction that 𝒜\mathcal{A} is not a (2,0)(2,0)-spanner. Then there is some vector viv_{i} that cannot be written as a linear combination of va1,,vadv_{a_{1}},\dots,v_{a_{d}} with coefficients of magnitude at most 22. There is a unique way to write

vi=c1va1++cdvadv_{i}=c_{1}v_{a_{1}}+\dots+c_{d}v_{a_{d}}

and there must be some coefficient say cjc_{j} with |cj|2|c_{j}|\geq 2. Then we can replace vajv_{a_{j}} with viv_{i} in 𝒜\mathcal{A} and this at least doubles Vol(𝒜)\textsf{Vol}(\mathcal{A}). The maximal possible value of Vol(𝒜)\textsf{Vol}(\mathcal{A}) for any subset 𝒜\mathcal{A} with |𝒜|=d|\mathcal{A}|=d is at most σ1d\sigma_{1}^{d} and thus we need to iterate this process at most poly(n,d)log(σ1/σd)\mathrm{poly}(n,d)\cdot\log(\sigma_{1}/\sigma_{d}) times before we get a set 𝒜\mathcal{A} that is a (2,0)(2,0)-spanner. This completes the proof. ∎
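The proof of Claim 5.5 is constructive and the swapping procedure is simple to implement. Below is a minimal numpy sketch (the function name and the iteration cap are ours); it first builds an initial linearly independent subset greedily and then performs the volume-doubling swaps until every vector has coefficients of magnitude at most C = 2.

```python
import numpy as np

def barycentric_spanner(V, C=2.0, max_iters=100000):
    """Indices of a (C, 0)-spanner for the columns of the d x n matrix V,
    assuming the columns span R^d, following the argument in Claim 5.5."""
    d, n = V.shape
    basis = []
    for _ in range(d):
        # Greedy step: pick the column with the largest component orthogonal
        # to the span of the columns chosen so far.
        P = np.eye(d)
        if basis:
            Q, _ = np.linalg.qr(V[:, basis])
            P = np.eye(d) - Q @ Q.T
        basis.append(int(np.argmax(np.linalg.norm(P @ V, axis=0))))
    for _ in range(max_iters):
        B = V[:, basis]                    # d x d and invertible by construction
        coeffs = np.linalg.solve(B, V)     # column i expresses v_i in the current basis
        j, i = np.unravel_index(np.argmax(np.abs(coeffs)), coeffs.shape)
        if abs(coeffs[j, i]) <= C:
            return basis                   # every v_i now has bounded coefficients
        basis[j] = int(i)                  # swap; the spanned volume at least doubles
    return basis
```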

Claim 5.5 has the restriction that σd\sigma_{d} is nontrivial. We will now give a robust version of Claim 5.5 for computing an approximate spanner even when the vectors may be arbitrarily close to degenerate.

Algorithm 1 Robust Spanner
1:Input: Vectors v1,,vndv_{1},\dots,v_{n}\in{\mathbb{R}}^{d}
2:Input: Parameters s,γs,\gamma
3:Form matrix 𝐌d×n\mathbf{M}\in{\mathbb{R}}^{d\times n} with columns v1,,vnv_{1},\dots,v_{n}
4:Compute SVD of 𝐌=𝐔𝚺𝐕\mathbf{M}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top}
5:Sort the singular values σ1σ2σd\sigma_{1}\geq\sigma_{2}\geq\dots\geq\sigma_{d}
6:Let tt be the largest index with σt>γn\sigma_{t}>\gamma\sqrt{n}
7:Let 𝐘\mathbf{Y} be the matrix formed by the columns of 𝐔\mathbf{U} corresponding to the top-tt singular values
8:Define ui=𝐘viu_{i}=\mathbf{Y}^{\top}v_{i} for all i[n]i\in[n]
9:Apply Claim 5.5 to compute (2,0)(2,0)-spanner {ua1,,uat}\{u_{a_{1}},\dots,u_{a_{t}}\} for {u1,,un}\{u_{1},\dots,u_{n}\}
10:Output: {a1,,at}\{a_{1},\dots,a_{t}\}
Lemma 5.6.

Let v1,,vndv_{1},\dots,v_{n}\in{\mathbb{R}}^{d} be vectors with vipoly(n,d)\left\lVert v_{i}\right\rVert\leq\mathrm{poly}(n,d) for all ii. Let s,γs,\gamma be parameters we are given. Assume that there exists a subspace 𝐕\mathbf{V} of dimension ss such that Π𝐕viγ\left\lVert\Pi_{\mathbf{V}^{\perp}}v_{i}\right\rVert\leq\gamma for all ii. Then if we run Algorithm 1 on v1,,vnv_{1},\dots,v_{n}, it runs in time poly(n,d)log(1/γ)\mathrm{poly}(n,d)\cdot\log(1/\gamma) and its output satisfies

  • tst\leq s

  • {va1,,vat}\{v_{a_{1}},\dots,v_{a_{t}}\} forms a (2,3γsnd)(2,3\gamma s\sqrt{nd}) spanner for the collection {v1,,vn}\{v_{1},\dots,v_{n}\}

Proof.

Recall that tt is the largest index such that σt>γn\sigma_{t}>\gamma\sqrt{n}. Note that tst\leq s since, by assumption, there is a matrix 𝐖d×s\mathbf{W}\in{\mathbb{R}}^{d\times s} whose columns form an orthonormal basis for an ss-dimensional subspace such that

𝐌𝐖𝐖𝐌Fnγ\left\lVert\mathbf{M}-\mathbf{W}\mathbf{W}^{\top}\mathbf{M}\right\rVert_{F}\leq\sqrt{n}\gamma

and thus σs+1nγ\sigma_{s+1}\leq\sqrt{n}\gamma.

Next, note that 𝐘𝐌\mathbf{Y}^{\top}\mathbf{M} is a t×nt\times n matrix with all tt singular values being at least γn\gamma\sqrt{n} (and the largest singular value is at most poly(n,d)\mathrm{poly}(n,d)). Thus, when we apply Claim 5.5 in Line 9, it runs in poly(n,d)log(1/γ)\mathrm{poly}(n,d)\cdot\log(1/\gamma) time and the subset {ua1,,uat}\{u_{a_{1}},\dots,u_{a_{t}}\} is indeed a (2,0)(2,0)-spanner for the uiu_{i}. Finally, we show that {va1,,vat}\{v_{a_{1}},\dots,v_{a_{t}}\} is a good spanner for the collection of viv_{i}. For any viv_{i}, we can first write

𝐘vi=ui=c1ua1++ctuat\mathbf{Y}^{\top}v_{i}=u_{i}=c_{1}u_{a_{1}}+\dots+c_{t}u_{a_{t}}

where |c1|,,|ct|2|c_{1}|,\dots,|c_{t}|\leq 2. Now

vi(c1va1++ctvat)1dvi(c1va1++ctvat)2=d(𝐈𝐘𝐘)(vi(c1va1++ctvat))2(2t+1)d(𝐈𝐘𝐘)𝐌op3tdσt+13γsnd\begin{split}\left\lVert v_{i}-(c_{1}v_{a_{1}}+\dots+c_{t}v_{a_{t}})\right\rVert_{1}&\leq\sqrt{d}\left\lVert v_{i}-(c_{1}v_{a_{1}}+\dots+c_{t}v_{a_{t}})\right\rVert_{2}\\ &=\sqrt{d}\left\lVert(\mathbf{I}-\mathbf{Y}\mathbf{Y}^{\top})(v_{i}-(c_{1}v_{a_{1}}+\dots+c_{t}v_{a_{t}}))\right\rVert_{2}\\ &\leq(2t+1)\sqrt{d}\left\lVert(\mathbf{I}-\mathbf{Y}\mathbf{Y}^{\top})\mathbf{M}\right\rVert_{\textsf{op}}\\ &\leq 3t\sqrt{d}\sigma_{t+1}\\ &\leq 3\gamma s\sqrt{nd}\end{split}

and thus the set {va1,,vat}\{v_{a_{1}},\dots,v_{a_{t}}\} is a (2,3γsnd)(2,3\gamma s\sqrt{nd}) spanner. It is clear that the overall runtime is poly(n,d)log(1/γ)\mathrm{poly}(n,d)\cdot\log(1/\gamma). ∎
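A minimal sketch of Algorithm 1, reusing the barycentric_spanner routine sketched above, is given below for concreteness; the names are ours.

```python
import numpy as np

def robust_spanner(V, s, gamma):
    """Sketch of Algorithm 1: keep only the left singular directions of the d x n
    matrix V with singular value above gamma * sqrt(n), project the columns onto
    them, and compute a (2, 0)-spanner of the projections."""
    d, n = V.shape
    U, sigma, _ = np.linalg.svd(V, full_matrices=False)
    t = int(np.sum(sigma > gamma * np.sqrt(n)))   # number of retained directions
    if t == 0:
        return []
    Y = U[:, :t]
    return barycentric_spanner(Y.T @ V, C=2.0)    # indices a_1, ..., a_t
```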

5.2 Dimensionality Reduction

We will also need a type of dimensionality reduction for distributions over exponentially large domains. For distributions 𝒟1,,𝒟m\mathcal{D}_{1},\dots,\mathcal{D}_{m} over a large domain, say [N][N], we can represent them as NN-dimensional vectors. However, we need a more succinct representation. Below we show how, with only polynomially many samples and pdf access to the distributions, we can construct a succinct representation that approximately preserves important properties of the original distributions. We use the following notation for translating between distributions and vectors of their densities.

Definition 5.7.

Let 𝒟\mathcal{D} be a distribution on a finite set of elements [N][N]. For a[N]a\in[N], we write 𝒟[a]\mathcal{D}[a] for the density function of 𝒟\mathcal{D} at aa. We define the vector vec𝒟N\textsf{vec}_{\mathcal{D}}\in{\mathbb{R}}^{N} to be vec𝒟=(𝒟[1],,𝒟[N])\textsf{vec}_{\mathcal{D}}=(\mathcal{D}[1],\dots,\mathcal{D}[N]). For a multiset 𝒳\mathcal{X} of elements in [N][N], we define vec𝒟[𝒳]|𝒳|\textsf{vec}_{\mathcal{D}}[\mathcal{X}]\in{\mathbb{R}}^{|\mathcal{X}|} to be a vector whose entries are indexed by elements a𝒳a\in\mathcal{X} and the values are equal to 𝒟[a]\mathcal{D}[a].

Definition 5.8.

Given distributions 𝒟1,,𝒟m\mathcal{D}_{1},\dots,\mathcal{D}_{m} on a set of elements, say [N][N], we say vectors u1,,umdu_{1},\dots,u_{m}\in{\mathbb{R}}^{d} for some dimension dd are (r,γ)(r,\gamma)-representative for 𝒟1,,𝒟m\mathcal{D}_{1},\dots,\mathcal{D}_{m} if for all coefficients c1,,cmc_{1},\dots,c_{m}\in{\mathbb{R}} with |ci|r|c_{i}|\leq r such that at most rr of them are nonzero,

|c1vec𝒟1++cmvec𝒟m1c1u1++cmum1|γ.\left\lvert\left\lVert c_{1}\textsf{vec}_{\mathcal{D}_{1}}+\dots+c_{m}\textsf{vec}_{\mathcal{D}_{m}}\right\rVert_{1}-\left\lVert c_{1}u_{1}+\dots+c_{m}u_{m}\right\rVert_{1}\right\rvert\leq\gamma\,.
Definition 5.9.

Given distributions 𝒟1,,𝒟m\mathcal{D}_{1},\dots,\mathcal{D}_{m} on a set of elements, say [N][N], we say vectors w,u1,,umdw,u_{1},\dots,u_{m}\in{\mathbb{R}}^{d} for some dimension dd are (r,γ,τ)(r,\gamma,\tau) KL-preserving for 𝒟1,,𝒟m\mathcal{D}_{1},\dots,\mathcal{D}_{m} if for all coefficients c1,,cm,c1,,cmc_{1},\dots,c_{m},c_{1}^{\prime},\dots,c_{m}^{\prime}\in{\mathbb{R}} with |ci|r,|ci|r/τ|c_{i}|\leq r,|c_{i}^{\prime}|\leq r/\tau such that at most rr of the cic_{i} and rr of the cic_{i}^{\prime} are nonzero and for any constant τ\tau^{*} with ττ1\tau\leq\tau^{*}\leq 1,

|KLτ(c1𝒟1++cm𝒟mc1𝒟1++cm𝒟m)KLτw(c1u1++cmumc1u1++cmum)|γ.\left\lvert\textsf{KL}_{\geq\tau^{*}}(c_{1}\mathcal{D}_{1}+\dots+c_{m}\mathcal{D}_{m}\|c_{1}^{\prime}\mathcal{D}_{1}+\dots+c_{m}^{\prime}\mathcal{D}_{m})-\textsf{KL}_{\geq\tau^{*}w}(c_{1}u_{1}+\dots+c_{m}u_{m}\|c_{1}^{\prime}u_{1}+\dots+c_{m}^{\prime}u_{m})\right\rvert\leq\gamma\,.

The main result of this section is Lemma 5.10, where we show that for any distributions 𝒟1,,𝒟m\mathcal{D}_{1},\dots,\mathcal{D}_{m}, after sampling and appropriately re-normalizing, we can construct succinct representations that approximately preserve the TV distance and KL divergence between linear combinations of these distributions.

Lemma 5.10.

Let 𝒟1,,𝒟m\mathcal{D}_{1},\dots,\mathcal{D}_{m} be distributions on a set of elements, say [N][N]. Let kk\in\mathbb{N} and let 𝒳1,,𝒳m\mathcal{X}_{1},\dots,\mathcal{X}_{m} be multisets where 𝒳i\mathcal{X}_{i} is obtained by drawing kk elements independently from 𝒟i\mathcal{D}_{i}. Let 𝒳=𝒳1𝒳m\mathcal{X}=\mathcal{X}_{1}\cup\dots\cup\mathcal{X}_{m}. Define the distribution 𝒟^=𝒟1++𝒟mm\widehat{\mathcal{D}}=\frac{\mathcal{D}_{1}+\dots+\mathcal{D}_{m}}{m}. Define the vectors ui=vec𝒟i[𝒳]mkvec𝒟^[𝒳]u_{i}=\frac{\textsf{vec}_{\mathcal{D}_{i}}[\mathcal{X}]}{mk\textsf{vec}_{\widehat{\mathcal{D}}}[\mathcal{X}]} for i[m]i\in[m] where the division is done entrywise and define w=1/(mkvec𝒟^[𝒳])w=1/(mk\textsf{vec}_{\widehat{\mathcal{D}}}[\mathcal{X}]) (reciprocated entrywise).

Let 0<δ,γ,τ<10<\delta,\gamma,\tau<1 and rr\in\mathbb{N} be parameters. If k100mr4log4(m/(τδγ))/γ2k\geq 100mr^{4}\log^{4}(m/(\tau\delta\gamma))/\gamma^{2}, then with probability 1δ1-\delta:

  • The vectors u1,,umu_{1},\dots,u_{m} are (r,γ)(r,\gamma)-representative for 𝒟1,,𝒟m\mathcal{D}_{1},\dots,\mathcal{D}_{m}

  • The vectors w,u1,,umw,u_{1},\dots,u_{m} are (r,γ,τ)(r,\gamma,\tau) KL-preserving for 𝒟1,,𝒟m\mathcal{D}_{1},\dots,\mathcal{D}_{m}

Proof.

We begin by proving the first statement. Consider a fixed set of coefficients c1,,cmc_{1},\dots,c_{m}. We will first imagine that the sets 𝒳1,,𝒳m\mathcal{X}_{1},\dots,\mathcal{X}_{m} are sampled after fixing the coefficients and then we will union bound over a net over all possible choices for c1,,cmc_{1},\dots,c_{m}. We have

c1vec𝒟1++cmvec𝒟m1=𝔼a𝒟^[|c1𝒟1[a]++cm𝒟m[a]|𝒟^[a]].\left\lVert c_{1}\textsf{vec}_{\mathcal{D}_{1}}+\dots+c_{m}\textsf{vec}_{\mathcal{D}_{m}}\right\rVert_{1}=\operatorname{\mathbb{E}}_{a\sim\widehat{\mathcal{D}}}\left[\frac{|c_{1}\mathcal{D}_{1}[a]+\dots+c_{m}\mathcal{D}_{m}[a]|}{\widehat{\mathcal{D}}[a]}\right]\,.

Also, the quantity inside the expectation on the RHS always has magnitude at most rmrm. Now we can draw samples from 𝒟^\widehat{\mathcal{D}} to approximate the RHS. By a Chernoff bound, if we draw at least 100m2r4/γ2log(m/(δγ))100m^{2}r^{4}/\gamma^{2}\cdot\log(m/(\delta\gamma)) samples from the distribution 𝒟^\widehat{\cal{D}} (this corresponds to taking k100mr4/γ2log(m/(δγ))k\geq 100mr^{4}/\gamma^{2}\cdot\log(m/(\delta\gamma))), then with probability at least 1(δγ/(10rm))10r1-(\delta\gamma/(10rm))^{10r}, the empirical estimate of the RHS is within γ/2\gamma/2 of the true value, i.e.

|c1vec𝒟1++cmvec𝒟m1c1u1++cmum1|γ2\left\lvert\left\lVert c_{1}\textsf{vec}_{\mathcal{D}_{1}}+\dots+c_{m}\textsf{vec}_{\mathcal{D}_{m}}\right\rVert_{1}-\left\lVert c_{1}u_{1}+\dots+c_{m}u_{m}\right\rVert_{1}\right\rvert\leq\frac{\gamma}{2}

which is exactly the inequality that we want to show.

Since the failure probability of the above is (δγ/(10rm))10r(\delta\gamma/(10rm))^{10r}, we can union bound over a γ/(10rm)3\gamma/(10rm)^{3}-net of all of the possible choices of c1,,cmc_{1},\dots,c_{m}. Call the net 𝒩\mathcal{N}. Then for any possible c1,,cmc_{1},\dots,c_{m} such that |ci|r|c_{i}|\leq r and at most rr of the cic_{i} are nonzero, there is some element of the net, say (c1,,cm)𝒩(c_{1}^{\prime},\dots,c_{m}^{\prime})\in\mathcal{N}, such that maxi[m]|cici|γ/(10rm)3\max_{i\in[m]}|c_{i}-c_{i}^{\prime}|\leq\gamma/(10rm)^{3}. From the union bound over the net, we know that

|c1vec𝒟1++cmvec𝒟m1c1u1++cmum1|γ2\left\lvert\left\lVert c_{1}^{\prime}\textsf{vec}_{\mathcal{D}_{1}}+\dots+c_{m}^{\prime}\textsf{vec}_{\mathcal{D}_{m}}\right\rVert_{1}-\left\lVert c_{1}^{\prime}u_{1}+\dots+c_{m}^{\prime}u_{m}\right\rVert_{1}\right\rvert\leq\frac{\gamma}{2}

and thus

|c1vec𝒟1++cmvec𝒟m1c1u1++cmum1|γ\left\lvert\left\lVert c_{1}\textsf{vec}_{\mathcal{D}_{1}}+\dots+c_{m}\textsf{vec}_{\mathcal{D}_{m}}\right\rVert_{1}-\left\lVert c_{1}u_{1}+\dots+c_{m}u_{m}\right\rVert_{1}\right\rvert\leq\gamma

as desired.

We prove the second statement in a similar way. First consider a fixed choice of τ\tau^{*}. Consider fixed coefficients c1,,cm,c1,,cmc_{1},\dots,c_{m},c_{1}^{\prime},\dots,c_{m}^{\prime}. Then

KLτ(c1𝒟1++cm𝒟mc1𝒟1++cm𝒟m)=𝔼a𝒟^[c1𝒟1[a]++cm𝒟m[a]𝒟^[a]logmax((c1𝒟1[a]++cm𝒟m[a],τ)]𝔼a𝒟^[c1𝒟1[a]++cm𝒟m[a]𝒟^[a]logmax((c1𝒟1[a]++cm𝒟m[a]),τ)]=mk𝔼a𝒟^[c1𝒟1[a]++cm𝒟m[a]mk𝒟^[a]logmax(c1𝒟1[a]++cm𝒟m[a]mk𝒟^[a],τmk𝒟^[a])]mk𝔼a𝒟^[c1𝒟1[a]++cm𝒟m[a]mk𝒟^[a]logmax(c1𝒟1[a]++cm𝒟m[a]mk𝒟^[a],τmk𝒟^[a])].\begin{split}&\textsf{KL}_{\geq\tau^{*}}(c_{1}\mathcal{D}_{1}+\dots+c_{m}\mathcal{D}_{m}\|c_{1}^{\prime}\mathcal{D}_{1}+\dots+c_{m}^{\prime}\mathcal{D}_{m})\\ &=\operatorname{\mathbb{E}}_{a\sim\widehat{\mathcal{D}}}\left[\frac{c_{1}\mathcal{D}_{1}[a]+\dots+c_{m}\mathcal{D}_{m}[a]}{\widehat{\mathcal{D}}[a]}\cdot\log\max((c_{1}\mathcal{D}_{1}[a]+\dots+c_{m}\mathcal{D}_{m}[a],\tau^{*})\right]\\ &-\operatorname{\mathbb{E}}_{a\sim\widehat{\mathcal{D}}}\left[\frac{c_{1}\mathcal{D}_{1}[a]+\dots+c_{m}\mathcal{D}_{m}[a]}{\widehat{\mathcal{D}}[a]}\cdot\log\max((c_{1}^{\prime}\mathcal{D}_{1}[a]+\dots+c_{m}^{\prime}\mathcal{D}_{m}[a]),\tau^{*})\right]\\ &=mk\operatorname{\mathbb{E}}_{a\sim\widehat{\mathcal{D}}}\left[\frac{c_{1}\mathcal{D}_{1}[a]+\dots+c_{m}\mathcal{D}_{m}[a]}{mk\widehat{\mathcal{D}}[a]}\cdot\log\max\left(\frac{c_{1}\mathcal{D}_{1}[a]+\dots+c_{m}\mathcal{D}_{m}[a]}{mk\widehat{\mathcal{D}}[a]},\frac{\tau^{*}}{mk\widehat{\mathcal{D}}[a]}\right)\right]\\ &-mk\operatorname{\mathbb{E}}_{a\sim\widehat{\mathcal{D}}}\left[\frac{c_{1}\mathcal{D}_{1}[a]+\dots+c_{m}\mathcal{D}_{m}[a]}{mk\widehat{\mathcal{D}}[a]}\cdot\log\max\left(\frac{c_{1}^{\prime}\mathcal{D}_{1}[a]+\dots+c_{m}^{\prime}\mathcal{D}_{m}[a]}{mk\widehat{\mathcal{D}}[a]},\frac{\tau^{*}}{mk\widehat{\mathcal{D}}[a]}\right)\right]\,.\end{split}

Also for all possible choices of a𝒟^a\sim\widehat{\mathcal{D}}, the RHS always has magnitude at most 10rmlog(m/τ)10rm\log(m/\tau). Now, as before in the proof of the first statement, by a Chernoff bound, this implies that for fixed c1,,cm,c1,,cmc_{1},\dots,c_{m},c_{1}^{\prime},\dots,c_{m}^{\prime} and also τ\tau^{*}, with probability 1(δγτ/(10rm))10r1-(\delta\gamma\tau/(10rm))^{10r}, we have

|KLτ(c1𝒟1++cm𝒟mc1𝒟1++cm𝒟m)KLτw(c1u1++cmumc1u1++cmum)|γ2.\left\lvert\textsf{KL}_{\geq\tau^{*}}(c_{1}\mathcal{D}_{1}+\dots+c_{m}\mathcal{D}_{m}\|c_{1}^{\prime}\mathcal{D}_{1}+\dots+c_{m}^{\prime}\mathcal{D}_{m})-\textsf{KL}_{\geq\tau^{*}w}(c_{1}u_{1}+\dots+c_{m}u_{m}\|c_{1}^{\prime}u_{1}+\dots+c_{m}^{\prime}u_{m})\right\rvert\leq\frac{\gamma}{2}\,.

Again, as before, we next union bound over a γτ/(10rm)3\gamma\tau/(10rm)^{3}-net over all possible choices of c1,,cm,c1,,cmc_{1},\dots,c_{m},c_{1}^{\prime},\dots,c_{m}^{\prime} and also τ\tau^{*} and we deduce that

|KLτ(c1𝒟1++cm𝒟mc1𝒟1++cm𝒟m)KLτw(c1u1++cmumc1u1++cmum)|γ\left\lvert\textsf{KL}_{\geq\tau^{*}}(c_{1}\mathcal{D}_{1}+\dots+c_{m}\mathcal{D}_{m}\|c_{1}^{\prime}\mathcal{D}_{1}+\dots+c_{m}^{\prime}\mathcal{D}_{m})-\textsf{KL}_{\geq\tau^{*}w}(c_{1}u_{1}+\dots+c_{m}u_{m}\|c_{1}^{\prime}u_{1}+\dots+c_{m}^{\prime}u_{m})\right\rvert\leq\gamma

for all choices of c1,,cm,c1,,cm,τc_{1},\dots,c_{m},c_{1}^{\prime},\dots,c_{m}^{\prime},\tau^{*}, as desired. ∎
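The estimates in the proof above are plain importance-sampling estimates with respect to the uniform mixture of the distributions. As a sanity check, a minimal sketch of the estimator corresponding to the displayed identity for the first statement is below; the oracle names are ours and exact pdf access is assumed.

```python
import numpy as np

def estimate_l1(coeffs, pdfs, mixture_samples):
    """Monte Carlo estimate of || c_1 D_1 + ... + c_m D_m ||_1 via the identity in
    the proof of Lemma 5.10. `pdfs[i](a)` is an assumed oracle for D_i[a] and
    `mixture_samples` are draws from the uniform mixture of D_1, ..., D_m."""
    m = len(coeffs)
    vals = []
    for a in mixture_samples:
        densities = np.array([pdfs[i](a) for i in range(m)])
        vals.append(abs(np.dot(coeffs, densities)) / densities.mean())
    return float(np.mean(vals))
```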

The following fact will also be useful as we won’t have pdf access to the exact distribution that we get samples from but instead to a distribution that is close in TV distance.

Lemma 5.11.

Let 𝒟1,,𝒟m\mathcal{D}_{1},\dots,\mathcal{D}_{m} be distributions on a set of elements, say [N][N]. Let 𝒟1,,𝒟m\mathcal{D}_{1}^{\prime},\dots,\mathcal{D}_{m}^{\prime} be distributions such that dTV(𝒟i,𝒟i)ϵd_{\textsf{TV}}(\mathcal{D}_{i},\mathcal{D}_{i}^{\prime})\leq\epsilon for all i[m]i\in[m]. Let kk\in\mathbb{N} and let 𝒳1,,𝒳m\mathcal{X}_{1},\dots,\mathcal{X}_{m} be multisets where 𝒳i\mathcal{X}_{i} is obtained by drawing kk elements independently from 𝒟i\mathcal{D}_{i}^{\prime}. With probability 12km2ϵ1-2km^{2}\sqrt{\epsilon}, we have for all i[m]i\in[m] and a𝒳ia\in\mathcal{X}_{i}

|𝒟j[a]𝒟j[a]|ϵ𝒟i[a]|\mathcal{D}_{j}[a]-\mathcal{D}_{j}^{\prime}[a]|\leq\sqrt{\epsilon}\mathcal{D}_{i}[a]

for all j[m]j\in[m].

Proof.

Fix an i[m]i\in[m]. For j[m]j\in[m], let 𝒵j,i[N]\mathcal{Z}_{j,i}\subset[N] be the set of elements a[N]a\in[N] such that

|𝒟j[a]𝒟j[a]|ϵ𝒟i[a].|\mathcal{D}_{j}[a]-\mathcal{D}_{j}^{\prime}[a]|\geq\sqrt{\epsilon}\mathcal{D}_{i}[a]\,.

Note that by assumption,

a𝒵j,i|𝒟j[a]𝒟j[a]|ϵ\sum_{a\in\mathcal{Z}_{j,i}}|\mathcal{D}_{j}[a]-\mathcal{D}_{j}^{\prime}[a]|\leq\epsilon

and thus we must have

Pra𝒟i[a𝒵j,i]=a𝒵j,i𝒟i[a]ϵ\Pr_{a\sim\mathcal{D}_{i}}[a\in\mathcal{Z}_{j,i}]=\sum_{a\in\mathcal{Z}_{j,i}}\mathcal{D}_{i}[a]\leq\sqrt{\epsilon}

which also implies Pra𝒟i[a𝒵j,i]2ϵ\Pr_{a\sim\mathcal{D}_{i}^{\prime}}[a\in\mathcal{Z}_{j,i}]\leq 2\sqrt{\epsilon}. Now we can union bound this over all choices of i,ji,j and all of the samples drawn to get that the overall failure probability is at most 2km2ϵ2km^{2}\sqrt{\epsilon} as desired. ∎

6 Learning Algorithm

In this section, we present our learning algorithm. In light of Lemma 4.6, throughout our learning algorithm, we will assume that we have sample/conditional query and pdf access to a distribution ^\widehat{\mathbb{H}} that is conditionally close to the unknown distribution \mathbb{H}. We will only use conditional queries to \mathbb{H} to simulate access to ^\widehat{\mathbb{H}} and almost all of our reasoning will be with respect to the distribution ^\widehat{\mathbb{H}}.

6.1 Finding a Spanner for the State Space

Recall that by Fact 2.2, for a fixed tt with t<Tt<T, the (vectorized) distributions {Pr[|h]}h𝒪t\{\Pr_{\mathbb{H}}[\cdot|h]\}_{h\in\mathcal{O}^{t}} all lie in some SS-dimensional subspace. Thus, by Fact 5.3, there exists a barycentric spanner consisting of SS elements corresponding to some SS histories. Since ^\widehat{\mathbb{H}} is conditionally close to \mathbb{H}, the (vectorized) distributions {Pr^[|h]}h𝒪t\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{O}^{t}} are all close to some SS-dimensional subspace. The first step in our algorithm will be to, for each tt, compute a set 𝒪t\mathcal{B}\subseteq\mathcal{O}^{t} of SS histories such that {Pr^[|h]}h\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{B}} are an approximate spanner for {Pr^[|h]}h𝒪t\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{O}^{t}}. The main algorithm in this section is Algorithm 3 and we analyze it in Lemma 6.3.

First, we will need a few preliminaries. It will be convenient to introduce the following notation for arranging possible histories and futures into a matrix.

Definition 6.1.

For a subset 𝒪t\mathcal{H}\subseteq\mathcal{O}^{t} for some tt and subset 𝒳𝒪Tt\mathcal{X}\subseteq\mathcal{O}^{T-t}, let 𝐌(t)[,𝒳]\mathbf{M}^{(t)}[\mathcal{H},\mathcal{X}] be the matrix with rows indexed by elements hh\in\mathcal{H} and columns indexed by elements x𝒳x\in\mathcal{X} and entries equal to Pr^[x|h]\Pr_{\widehat{\mathbb{H}}}[x|h]. When either 𝒳\mathcal{X} or \mathcal{H} is the full set, we may write 𝐌(t)[,:]\mathbf{M}^{(t)}[\mathcal{H},:] or 𝐌(t)[:,𝒳]\mathbf{M}^{(t)}[:,\mathcal{X}]

We will repeatedly make use of the primitive in Algorithm 2 for taking histories h1,,hs𝒪th_{1},\dots,h_{s}\in\mathcal{O}^{t} and outputting vectors u1,,usu_{1},\dots,u_{s} that are succinct representations (in the sense of Lemma 5.10) of the distributions Pr^[|h1],,Pr^[|hs]\Pr_{\widehat{\mathbb{H}}}[\cdot|h_{1}],\dots,\Pr_{\widehat{\mathbb{H}}}[\cdot|h_{s}].

Algorithm 2 Building Vectors
1:Input: Sample, conditional query, and exact pdf access to a distribution ^\widehat{\mathbb{H}} over 𝒪T\mathcal{O}^{T}
2:Input: Parameter t<Tt<T, strings h1,,hs𝒪th_{1},\dots,h_{s}\in\mathcal{O}^{t}
3:Input: Parameter kk\in\mathbb{N}
4:for i[s]i\in[s] do
5:     Sample kk futures f1,,fk𝒪Ttf_{1},\dots,f_{k}\in\mathcal{O}^{T-t} from Pr^[|hi]\Pr_{\widehat{\mathbb{H}}}[\cdot|h_{i}]
6:     Set 𝒳i={f1,,fk}\mathcal{X}_{i}=\{f_{1},\dots,f_{k}\}
7:end for
8:Set 𝒳=𝒳1𝒳s\mathcal{X}=\mathcal{X}_{1}\cup\dots\cup\mathcal{X}_{s} (as a multiset)
9:for i[s]i\in[s] do
10:     Compute Pr^[f|hi]\Pr_{\widehat{\mathbb{H}}}[f|h_{i}] for all f𝒳f\in\mathcal{X}
11:     Set vi|𝒳|v_{i}\in{\mathbb{R}}^{|\mathcal{X}|} as vi=𝐌(t)[hi,𝒳]v_{i}=\mathbf{M}^{(t)}[h_{i},\mathcal{X}]
12:end for
13:Compute w=1/(k(v1++vs))w=1/(k(v_{1}+\dots+v_{s})) (reciprocated entrywise)
14:Compute ui=viwu_{i}=v_{i}\odot w (multiplication entrywise)
15:Output: 𝒳,{u1,,us},w\mathcal{X},\{u_{1},\dots,u_{s}\},w
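For concreteness, a minimal Python sketch of Algorithm 2 is given below; the two oracles stand in for the conditional sampling and exact pdf access assumed in the input, and the function name is ours.

```python
import numpy as np

def build_vectors(sample_futures, pdf, histories, k):
    """Sketch of Algorithm 2. `sample_futures(h, k)` draws k futures from the
    conditional distribution given h, and `pdf(f, h)` returns the conditional
    probability of the future f given h; both are assumed oracles."""
    X = []                                       # the multiset X = X_1 u ... u X_s
    for h in histories:
        X.extend(sample_futures(h, k))
    V = np.array([[pdf(f, h) for f in X] for h in histories])   # row i is v_i
    w = 1.0 / (k * V.sum(axis=0))                # line 13, reciprocated entrywise
    U = V * w                                    # line 14, u_i = v_i * w entrywise
    return X, U, w
```
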
Fact 6.2.

Whenever we run Algorithm 2, the output vectors u1,,usu_{1},\dots,u_{s} have nonnegative entries and satisfy u1,,us1\left\lVert u_{1}\right\rVert_{\infty},\dots,\left\lVert u_{s}\right\rVert_{\infty}\leq 1.

Proof.

This follows immediately from the construction in Algorithm 2. ∎

Now we are ready to present the algorithm for building the spanners and its analysis. For technical reasons that will be important later, when computing a spanner for histories of length t+1t+1, the algorithm takes as input some strings h1(t),,hs(t)h_{1}^{(t)},\dots,h_{s}^{(t)} of length tt. The goal will be to output a list of strings of length t+1t+1, say 𝒪t+1\mathcal{B}\subseteq\mathcal{O}^{t+1}, such that the corresponding vectors {𝐌(t+1)[h,:]}h\{\mathbf{M}^{(t+1)}[h,:]\}_{h\in\mathcal{B}} are an approximate spanner for the distribution of vectors {𝐌(t+1)[h,:]}h^[:t+1]\{\mathbf{M}^{(t+1)}[h,:]\}_{h\sim\widehat{\mathbb{H}}[:t+1]} and they also span the vectors 𝐌(t+1)[hi(t)o,:]\mathbf{M}^{(t+1)}[h_{i}^{(t)}\vee o,:] for all i[s],o𝒪i\in[s],o\in\mathcal{O}. Note that the first condition is the natural requirement for a spanner and the second will be important later when we compute “transitions” between the length-tt histories and length-t+1t+1 histories.

Algorithm 3 Building Spanners
1:Input: Sample, conditional query, and exact pdf access to a distribution ^\widehat{\mathbb{H}} over 𝒪T\mathcal{O}^{T}
2:Input: Parameter t<Tt<T, strings h1(t),,hs(t)𝒪th_{1}^{(t)},\dots,h_{s}^{(t)}\in\mathcal{O}^{t}
3:Input: Parameters S,SsS\in\mathbb{N},S\geq s, γ,δ>0\gamma,\delta>0
4:Define 𝒜𝒪t+1\mathcal{A}\subseteq\mathcal{O}^{t+1} as 𝒜={hi(t)o}i[s],o𝒪\mathcal{A}=\{h_{i}^{(t)}\vee o\}_{i\in[s],o\in\mathcal{O}}
5:Draw N=poly(S,O,T,1/γ,log(1/δ))N=\mathrm{poly}(S,O,T,1/\gamma,\log(1/\delta)) histories of length t+1t+1 from ^[:t+1]\widehat{\mathbb{H}}[:t+1], say h1,,hNh^{\prime}_{1},\dots,h^{\prime}_{N}
6:Set 𝒜𝒜{h1,,hN}\mathcal{A}^{\prime}\leftarrow\mathcal{A}\cup\{h^{\prime}_{1},\dots,h^{\prime}_{N}\}
7:Run Algorithm 2 on set 𝒜\mathcal{A}^{\prime} with parameters t+1,k=Npoly(S,O,T,1/γ,log(1/δ))t+1,k=N\cdot\mathrm{poly}(S,O,T,1/\gamma,\log(1/\delta))
8:Let the result be 𝒳,{uh}h𝒜,w\mathcal{X},\{u_{h}\}_{h\in\mathcal{A}^{\prime}},w where 𝒳𝒪Tt1,uh|𝒳|\mathcal{X}\subseteq\mathcal{O}^{T-t-1},u_{h}\in{\mathbb{R}}^{|\mathcal{X}|}
9:Run Algorithm 1 on {uh}h𝒜\{u_{h}\}_{h\in\mathcal{A}^{\prime}} with parameters S,γ/(100Sk2)S,\gamma/(100Sk^{2}) and let the output be 𝒜\mathcal{B}\subseteq\mathcal{A}^{\prime}
10:Output: \mathcal{B}
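A minimal sketch of Algorithm 3, composing the build_vectors and robust_spanner sketches above, is as follows; histories are represented as tuples and the history-sampling oracle is an assumption.

```python
def building_spanners(prev_hist, alphabet, sample_hist, sample_futures, pdf, S, gamma, N, k):
    """Sketch of Algorithm 3. `sample_hist(n)` is an assumed oracle drawing n
    histories of length t+1 from the marginal of the first t+1 characters; the
    other oracles are as in build_vectors."""
    A = [h + (o,) for h in prev_hist for o in alphabet]         # line 4
    A_prime = A + [tuple(h) for h in sample_hist(N)]            # lines 5-6
    X, U, w = build_vectors(sample_futures, pdf, A_prime, k)    # lines 7-8
    idx = robust_spanner(U.T, S, gamma / (100 * S * k**2))      # line 9
    return [A_prime[i] for i in idx]                            # the set B
```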

We now prove guarantees on the spanning set \mathcal{B} found by Algorithm 3.

Lemma 6.3.

Assume that ^\widehat{\mathbb{H}} is ϵ\epsilon conditionally close to a rank SS distribution \mathbb{H} where

ϵ1poly(S,O,T,1/γ,log(1/δ))\epsilon\leq\frac{1}{\mathrm{poly}(S,O,T,1/\gamma,\log(1/\delta))}

for some sufficiently large polynomial. With probability at least 1δϵ0.41-\delta-\epsilon^{0.4}, if we run Algorithm 3 for arbitrary input strings h1(t),,hs(t)h_{1}^{(t)},\dots,h_{s}^{(t)}, the output satisfies the following conditions

  • ||S|\mathcal{B}|\leq S and the vectors {uh}h\{u_{h}\}_{h\in\mathcal{B}} form a (2,γ)(2,\gamma) spanner for {uh}h𝒜\{u_{h}\}_{h\in\mathcal{A}^{\prime}}

  • The rows of {𝐌(t+1)[h,:]}h\{\mathbf{M}^{(t+1)}[h,:]\}_{h\in\mathcal{B}} form a (2,Sγ)(2,S\gamma)-spanner for the rows of {𝐌(t+1)[h,:]}h𝒜\{\mathbf{M}^{(t+1)}[h,:]\}_{h\in\mathcal{A}^{\prime}}

  • The rows of {𝐌(t+1)[h,:]}h\{\mathbf{M}^{(t+1)}[h,:]\}_{h\in\mathcal{B}} form a (1γ,2S,2Sγ)(1-\gamma,2S,2S\gamma)-spanner for the distribution of vectors {𝐌(t+1)[h,:]}h^[:t+1]\{\mathbf{M}^{(t+1)}[h,:]\}_{h\sim\widehat{\mathbb{H}}[:t+1]}

Proof.

For h𝒪t+1h\in\mathcal{O}^{t+1}, let νh,νh\nu_{h},\nu^{\prime}_{h} be vectors with entries indexed by x𝒪Tt1x\in\mathcal{O}^{T-t-1} and entries given by Pr[x|h]\Pr_{\mathbb{H}}[x|h] and Pr^[x|h]\Pr_{\widehat{\mathbb{H}}}[x|h] respectively. First, note that the vectors {νh}\{\nu_{h}\} are contained in an SS-dimensional space by Fact 2.2. Thus, by Lemma 5.11, and the setting of ϵ\epsilon sufficiently small, with probability at least 1ϵ0.41-\epsilon^{0.4}, there is an SS-dimensional subspace, say 𝐕\mathbf{V}, such that all of {uh}h𝒜\{u_{h}\}_{h\in\mathcal{A}^{\prime}} have projection at most ϵ0.1\epsilon^{0.1} onto 𝐕\mathbf{V}^{\perp}. Assuming that this holds, all of the hypotheses of Lemma 5.6 are satisfied (note the condition about the norms of the input vectors is satisfied by the construction in Algorithm 2). We then get that the execution of Algorithm 3 successfully finds a set {uh}h\{u_{h}\}_{h\in\mathcal{B}} with ||S|\mathcal{B}|\leq S that is a (2,γ)(2,\gamma) spanner for {uh}h𝒜\{u_{h}\}_{h\in\mathcal{A}^{\prime}}.

For each h𝒜h^{\prime}\in\mathcal{A}^{\prime}, we can use the fact that {uh}h\{u_{h}\}_{h\in\mathcal{B}} is a (2,γ)(2,\gamma) spanner for {uh}h𝒜\{u_{h}\}_{h\in\mathcal{A}^{\prime}} to find a linear combination {ch,h}h\{c_{h^{\prime},h}\}_{h\in\mathcal{B}} with |ch,h|2|c_{h^{\prime},h}|\leq 2 for all hh\in\mathcal{B} such that

uhhch,huh1γ.\left\lVert u_{h^{\prime}}-\sum_{h\in\mathcal{B}}c_{h^{\prime},h}u_{h}\right\rVert_{1}\leq\gamma\,. (9)

Next, by Lemma 5.10, with probability at least 10.1δ1-0.1\delta, the vectors {uh}h𝒜\{u_{h}\}_{h\in\mathcal{A}^{\prime}} are (10S,0.1γ)(10S,0.1\gamma)-representative for the distributions {Pr^[|h]}h𝒜\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{A}^{\prime}}. This then implies

𝐌(t+1)[h,:]hch,h𝐌(t+1)[h,:]1Sγ\left\lVert\mathbf{M}^{(t+1)}[h^{\prime},:]-\sum_{h\in\mathcal{B}}c_{h^{\prime},h}\mathbf{M}^{(t+1)}[h,:]\right\rVert_{1}\leq S\gamma

which gives the second condition.

Now we prove the last condition. Recall that the vectors {νh}\{\nu_{h}\} are vectors in OTt1{\mathbb{R}}^{O^{T-t-1}} that are contained in some SS-dimensional subspace. Let 𝒫\mathcal{P} be the distribution of these vectors for h^[:t+1]h\sim\widehat{\mathbb{H}}[:t+1]. By Lemma 5.4 (since these vectors live in an SS-dimensional subspace), with probability at least 10.1δ1-0.1\delta, there is a subset ^𝒜\widehat{\mathcal{B}}\subseteq\mathcal{A}^{\prime} with |^|S|\widehat{\mathcal{B}}|\leq S, such that the vectors {νh}h^\{\nu_{h}\}_{h\in\widehat{\mathcal{B}}} are a (1γ,1,0)(1-\gamma,1,0)-spanner for this distribution 𝒫\mathcal{P}. For each h^h^{\prime}\in\widehat{\mathcal{B}}, we can again use (9) (since ^𝒜\widehat{\mathcal{B}}\subseteq\mathcal{A}^{\prime}). We can also again use Lemma 5.10 to get that the vectors {uh}h𝒜\{u_{h}\}_{h\in\mathcal{A}^{\prime}} are (10S,0.1γ)(10S,0.1\gamma)-representative for the distributions {Pr^[|h]}h𝒜\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{A}^{\prime}} and thus by the conditional closeness of ^\widehat{\mathbb{H}} and \mathbb{H} and Claim 4.3, they are (10S,0.2γ)(10S,0.2\gamma)-representative for {Pr[|h]}h𝒜\{\Pr_{\mathbb{H}}[\cdot|h]\}_{h\in\mathcal{A}^{\prime}}. Thus, (9) implies that for all h^h^{\prime}\in\widehat{\mathcal{B}}, there exist coefficients {ch,h}h\{c_{h^{\prime},h}\}_{h\in\mathcal{B}} with |ch,h|2|c_{h^{\prime},h}|\leq 2 for all hh\in\mathcal{B} such that

νhhch,hνh12γ\left\lVert\nu_{h^{\prime}}-\sum_{h\in\mathcal{B}}c_{h^{\prime},h}\nu_{h}\right\rVert_{1}\leq 2\gamma (10)

since ||S|\mathcal{B}|\leq S. Next, recall that {νh}h^\{\nu_{h}\}_{h\in\widehat{\mathcal{B}}} are a (1γ,1,0)(1-\gamma,1,0)-spanner for the distribution 𝒫\mathcal{P} of {νh}\{\nu_{h}\} for h^[:t+1]h\sim\widehat{\mathbb{H}}[:t+1]. Thus, for h′′^[:t+1]h^{\prime\prime}\sim\widehat{\mathbb{H}}[:t+1], with 1γ1-\gamma probability, there are coefficients {ch}h^\{c^{\prime}_{h^{\prime}}\}_{h^{\prime}\in\widehat{\mathcal{B}}} with |ch|1|c^{\prime}_{h^{\prime}}|\leq 1 such that

νh′′=h^chνh.\nu_{h^{\prime\prime}}=\sum_{h^{\prime}\in\widehat{\mathcal{B}}}c^{\prime}_{h^{\prime}}\nu_{h^{\prime}}\,.

Finally, we can combine this with (10) (which we can apply for all h^h^{\prime}\in\widehat{\mathcal{B}}) and the closeness between ^\widehat{\mathbb{H}} and \mathbb{H} to get that there are coefficients {ch}h\{c_{h}\}_{h\in\mathcal{B}} with |ch|2S|c_{h}|\leq 2S such that

νh′′hchνh12Sγ\left\lVert\nu^{\prime}_{h^{\prime\prime}}-\sum_{h\in\mathcal{B}}c_{h}\nu_{h}^{\prime}\right\rVert_{1}\leq 2S\gamma

which gives the last condition. Note that the overall failure probability for all of the statements is at most δ+ϵ0.4\delta+\epsilon^{0.4}. ∎

6.2 Full Learning Algorithm

For our full learning algorithm, we will apply Algorithm 3 to build a sequence of spanning sets (1),,(T1)\mathcal{H}^{(1)},\dots,\mathcal{H}^{(T-1)} for each history-length. We then need to learn the “transitions” between these spanning sets. Rather than explicitly learning these transitions, we instead learn a representation of the distribution of the form described below where the transitions are implicit – the transitions are only computed in Section 7 when we actually want to sample from the learned distribution. The representation of the distribution consists of:

  • Subsets (t)𝒪t\mathcal{H}^{(t)}\subseteq\mathcal{O}^{t} for all 0tT10\leq t\leq T-1

  • Matrices 𝐏(t)|(t)|×O\mathbf{P}^{(t)}\in{\mathbb{R}}^{|\mathcal{H}^{(t)}|\times O} for all 0tT10\leq t\leq T-1 containing the next character probabilities for each of the strings in (t)\mathcal{H}^{(t)}

  • For each 0tT0\leq t\leq T, a collection of vectors uh(t)u_{h^{(t)}} for all h(t)(t){(t1)o}o𝒪h^{(t)}\in\mathcal{H}^{(t)}\cup\{\mathcal{H}^{(t-1)}\vee o\}_{o\in\mathcal{O}} that are succinct representations of the distributions Pr^[|h(t)]\Pr_{\widehat{\mathbb{H}}}[\cdot|h^{(t)}] obtained from Algorithm 2

  • For each 0tT0\leq t\leq T, we also maintain an index set 𝒳(t)\mathcal{X}^{(t)} and vector w(t)w^{(t)} (these are also obtained from Algorithm 2)

Algorithm 5 gives a full description of how we compute this representation.

Algorithm 4 Next Character Probabilities
1:Input: Sample, conditional query, and exact pdf access to a distribution ^\widehat{\mathbb{H}} over 𝒪T\mathcal{O}^{T}
2:Input: Parameter t<Tt<T, strings h1(t),,hs(t)𝒪th_{1}^{(t)},\dots,h_{s}^{(t)}\in\mathcal{O}^{t}
3:for o𝒪o\in\mathcal{O} do
4:     Compute probabilities p1,o=Pr^[o|h1(t)],,ps,o=Pr^[o|hs(t)]p_{1,o}=\Pr_{\widehat{\mathbb{H}}}[o|h_{1}^{(t)}],\dots,p_{s,o}=\Pr_{\widehat{\mathbb{H}}}[o|h_{s}^{(t)}]
5:end for
6:Let 𝐏(t)s×O\mathbf{P}^{(t)}\in{\mathbb{R}}^{s\times O} be the matrix whose entries are given by {pi,o}i[s],o𝒪\{p_{i,o}\}_{i\in[s],o\in\mathcal{O}}
7:Output: 𝐏(t)\mathbf{P}^{(t)}
Algorithm 5 Full Learning Algorithm
1:Input: Sample, conditional query, and exact pdf access to a distribution ^\widehat{\mathbb{H}} over 𝒪T\mathcal{O}^{T}
2:Input: Parameters T,ST,S\in\mathbb{N}, η,δ>0\eta,\delta>0
3:Initialize (0)\mathcal{H}^{(0)} to be a set consisting of only the empty string
4:Set k=poly(S,O,T,1/η,log(1/δ))k=\mathrm{poly}(S,O,T,1/\eta,\log(1/\delta))
5:Set γ=1/poly(k)\gamma=1/\mathrm{poly}(k)
6:for t=0,1,,T1t=0,1,\dots,T-1 do
7:     Run Algorithm 4 on (t)\mathcal{H}^{(t)} and let the output be 𝐏(t+1)\mathbf{P}^{(t+1)}
8:     Run Algorithm 3 with input (t)\mathcal{H}^{(t)} and parameters t,S,γ,δt,S,\gamma,\delta
9:     Let the output be (t+1)\mathcal{H}^{(t+1)}
10:     Set (t+1)=(t+1){(t)o}o𝒪\mathcal{B}^{(t+1)}=\mathcal{H}^{(t+1)}\cup\{\mathcal{H}^{(t)}\vee o\}_{o\in\mathcal{O}}
11:     Run Algorithm 2 on set (t+1)\mathcal{B}^{(t+1)} with parameters t+1,kt+1,k
12:     Let the result be 𝒳(t+1),{uh}h(t+1),w(t+1)\mathcal{X}^{(t+1)},\{u_{h}\}_{h\in\mathcal{B}^{(t+1)}},w^{(t+1)}
13:end for
14:Output: {𝐏(t),(t),𝒳(t),{uh(t)}h(t)(t),w(t)}t[T]\{\mathbf{P}^{(t)},\mathcal{H}^{(t)},\mathcal{X}^{(t)},\{u_{h^{(t)}}\}_{h^{(t)}\in\mathcal{B}^{(t)}},w^{(t)}\}_{t\in[T]}
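Putting the pieces together, a minimal sketch of the loop in Algorithm 5 is shown below. It composes the earlier sketches; the oracle names are ours, and the parameter choices in lines 4 and 5 are left to the caller rather than hard-coded.

```python
import numpy as np

def full_learning(alphabet, T, S, k, gamma, N, sample_hist_at, sample_futures, pdf, pdf_next):
    """Sketch of Algorithm 5. `pdf_next(o, h)` returns the next-character probability
    of o given h (as in Algorithm 4) and `sample_hist_at(t, n)` draws n histories of
    length t; both are assumed oracles."""
    H = [()]                                   # H^(0): just the empty string
    representation = []
    for t in range(T):
        # Line 7: next-character probabilities for the current spanning set (Algorithm 4).
        P = np.array([[pdf_next(o, h) for o in alphabet] for h in H])
        # Lines 8-9: spanner for histories of length t+1 (Algorithm 3).
        H_next = building_spanners(H, alphabet, lambda n: sample_hist_at(t + 1, n),
                                   sample_futures, pdf, S, gamma, N, k)
        # Lines 10-12: succinct representations for the set B^(t+1) (Algorithm 2).
        B_next = H_next + [h + (o,) for h in H for o in alphabet]
        X, U, w = build_vectors(sample_futures, pdf, B_next, k)
        representation.append((P, H_next, X, U, w))
        H = H_next
    return representation
```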

After running Algorithm 5, we obtain a description of the distribution in terms of

{𝐏(t),(t),𝒳(t),{uh}h(t),w(t)}t[T].\{\mathbf{P}^{(t)},\mathcal{H}^{(t)},\mathcal{X}^{(t)},\{u_{h}\}_{h\in\mathcal{B}^{(t)}},w^{(t)}\}_{t\in[T]}\,.

Note that for all tt, the set (t)\mathcal{B}^{(t)} is just defined by (t)=(t){(t1)o}o𝒪\mathcal{B}^{(t)}=\mathcal{H}^{(t)}\cup\{\mathcal{H}^{(t-1)}\vee o\}_{o\in\mathcal{O}}. In the remainder of this section, we prove that the parameters learned above satisfy certain properties (see Lemma 6.6). These properties will then be used in Section 7 to argue that we can sample from the learned description to get a distribution that is close to the original distribution ^\widehat{\mathbb{H}}. We begin with a few definitions.

Definition 6.4.

We say a string h𝒪th\in\mathcal{O}^{t} for t<Tt<T is γ\gamma-representable by 𝒪t\mathcal{H}\subseteq\mathcal{O}^{t} if there exists some vector y||y\in{\mathbb{R}}^{|\mathcal{H}|} with y2S\left\lVert y\right\rVert_{\infty}\leq 2S such that

𝐌(t)[h,:]y𝐌(t)[,:]12Sγ.\left\lVert\mathbf{M}^{(t)}[h,:]-y^{\top}\mathbf{M}^{(t)}[\mathcal{H},:]\right\rVert_{1}\leq 2S\gamma\,.
Definition 6.5.

We say a string h𝒪th\in\mathcal{O}^{t} for t<Tt<T is (c,γ)(c,\gamma)-positively representable by 𝒪t\mathcal{H}\subseteq\mathcal{O}^{t} and 𝒳OTt\mathcal{X}\subseteq O^{T-t} if there exists some vector y||y\in{\mathbb{R}}^{|\mathcal{H}|} with y2S\left\lVert y\right\rVert_{\infty}\leq 2S such that

  • 𝐌(t)[h,:]y𝐌(t)[,:]12Sγ\left\lVert\mathbf{M}^{(t)}[h,:]-y^{\top}\mathbf{M}^{(t)}[\mathcal{H},:]\right\rVert_{1}\leq 2S\gamma

  • y𝐌(t)[,𝒳]y^{\top}\mathbf{M}^{(t)}[\mathcal{H},\mathcal{X}] has all entries at least cc

The main lemma of this section is stated below.

Lemma 6.6.

Assume that ^\widehat{\mathbb{H}} is ϵ\epsilon conditionally close to a rank SS distribution \mathbb{H} where

ϵ1poly(S,O,T,1/η,log(1/δ))\epsilon\leq\frac{1}{\mathrm{poly}(S,O,T,1/\eta,\log(1/\delta))}

for some sufficiently large polynomial. For any parameter ce10(OTS)2/ηc\geq e^{-10(OTS)^{2}/\eta}, in the execution of Algorithm 5, with probability 1δγ0.11-\delta-\gamma^{0.1}, we have the following properties:

  • For any tTt\leq T, |(t)|S|\mathcal{H}^{(t)}|\leq S.

  • For any tTt\leq T and h^[:t]h\sim\widehat{\mathbb{H}}[:t], the string hh is (c,4Sγ)(c,4S\sqrt{\gamma})-positively representable by (t),𝒳(t)\mathcal{H}^{(t)},\mathcal{X}^{(t)} with at least 1γ0.1cOTγ1-\gamma^{0.1}-\frac{c\cdot O^{T}}{\gamma} probability

  • For all t[T]t\in[T], the vectors {uh(t)}h(t)\{u_{h}^{(t)}\}_{h\in\mathcal{B}^{(t)}} are (10S,η)(10S,\eta)-representative for the distributions {Pr^[|h]}h(t)\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{B}^{(t)}}

  • For all t[T]t\in[T], the vectors w(t),{uh(t)}h(t)w^{(t)},\{u_{h}^{(t)}\}_{h\in\mathcal{B}^{(t)}} are (10S,η,c)(10S,\eta,c) KL-preserving for the distributions
    {Pr^[|h]}h(t)\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{B}^{(t)}}

Remark 6.7.

We think of the parameters as being set such that ϵpoly(1/γ)1,γpoly(1/η)1\epsilon\ll\mathrm{poly}(1/\gamma)^{-1},\gamma\ll\mathrm{poly}(1/\eta)^{-1}.

Proof.

We apply Lemma 6.3: with probability 1δ/Tϵ0.41-\delta/T-\epsilon^{0.4}, the guarantees of the lemma hold each time we execute Algorithm 3. This immediately implies the first statement that we want to show. For the second statement, Lemma 6.3 also gives us that over the execution of the algorithm, for all t<Tt<T, the vectors {𝐌(t)[h,:]}h(t)\{\mathbf{M}^{(t)}[h,:]\}_{h\in\mathcal{H}^{(t)}} form a (1γ,2S,2Sγ)(1-\gamma,2S,2S\gamma)-spanner for the distribution of vectors {𝐌(t)[h,:]}h^[:t]\{\mathbf{M}^{(t)}[h,:]\}_{h\sim\widehat{\mathbb{H}}[:t]}. Thus, for h^[:t]h\sim\widehat{\mathbb{H}}[:t], with 1γ1-\gamma probability, there exists some vector y||y\in{\mathbb{R}}^{|\mathcal{H}|} with y2S\left\lVert y\right\rVert_{\infty}\leq 2S such that

𝐌(t)[h,:]y𝐌(t)[(t),:]12Sγ.\left\lVert\mathbf{M}^{(t)}[h,:]-y^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]\right\rVert_{1}\leq 2S\gamma\,. (11)

Let yy^{\prime} be obtained by increasing all entries of yy by γ\sqrt{\gamma}. Since |(t)|S|\mathcal{H}^{(t)}|\leq S, we have

𝐌(t)[h,:]y𝐌(t)[(t),:]14Sγ.\left\lVert\mathbf{M}^{(t)}[h,:]-{y^{\prime}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]\right\rVert_{1}\leq 4S\sqrt{\gamma}\,.

Now first consider sampling 𝒳(t)\mathcal{X}^{(t)} after (t)\mathcal{H}^{(t)} is fixed. Now let 𝒵𝒪Tt\mathcal{Z}\subseteq\mathcal{O}^{T-t} be the set of all strings xx such that

y𝐌(t)[(t),x]cγh(t)Pr^[x|h].y^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},x]\leq c-\sqrt{\gamma}\sum_{h\in\mathcal{H}^{(t)}}\Pr_{\widehat{\mathbb{H}}}[x|h]\,. (12)

As long as 𝒳(t)\mathcal{X}^{(t)} doesn’t contain any of these elements, then yy^{\prime} would give us a (c,4Sγ)(c,4S\sqrt{\gamma})-positive representation of hh by (t),𝒳(t)\mathcal{H}^{(t)},\mathcal{X}^{(t)}. Note that

(x𝒵y𝐌(t)[(t),x])𝐌(t)[h,:]y𝐌(t)[(t),:]1-\left(\sum_{x\in\mathcal{Z}}y^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},x]\right)\leq\left\lVert\mathbf{M}^{(t)}[h,:]-y^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]\right\rVert_{1}

and thus we must have

x𝒵y𝐌(t)[(t),x]2Sγ\sum_{x\in\mathcal{Z}}y^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},x]\geq-2S\gamma

and using (12), this rearranges into

x𝒵h(t)Pr^[x|h]c|𝒵|+2SγγcOT+2Sγγ.\sum_{x\in\mathcal{Z}}\sum_{h\in\mathcal{H}^{(t)}}\Pr_{\widehat{\mathbb{H}}}[x|h]\leq\frac{c\cdot|\mathcal{Z}|+2S\gamma}{\sqrt{\gamma}}\leq\frac{c\cdot O^{T}+2S\gamma}{\sqrt{\gamma}}\,.

Also recall Lemma 6.3 implies that for any h(t)h\in\mathcal{B}^{(t)},

x𝒵Pr^[x|h]Sγ+2x𝒵h(t)Pr^[x|h]3(cOT+2Sγ)γ.\sum_{x\in\mathcal{Z}}\Pr_{\widehat{\mathbb{H}}}[x|h]\leq S\gamma+2\sum_{x\in\mathcal{Z}}\sum_{h\in\mathcal{H}^{(t)}}\Pr_{\widehat{\mathbb{H}}}[x|h]\leq\frac{3(c\cdot O^{T}+2S\gamma)}{\sqrt{\gamma}}\,.

Now we have

Pr[𝒳(t)𝒵=]1|(t)|k3(cOT+2Sγ)γ1γ0.3cOTγ0.6\Pr[\mathcal{X}^{(t)}\cap\mathcal{Z}=\emptyset]\geq 1-|\mathcal{B}^{(t)}|k\cdot\frac{3(c\cdot O^{T}+2S\gamma)}{\sqrt{\gamma}}\geq 1-\gamma^{0.3}-\frac{c\cdot O^{T}}{\gamma^{0.6}}

and if this happens, then hh is (c,4Sγ)(c,4S\sqrt{\gamma})-positively representable by (t),𝒳(t)\mathcal{H}^{(t)},\mathcal{X}^{(t)}. Thus, so far we have shown that over the randomness of h^[:t]h\sim\widehat{\mathbb{H}}[:t] and 𝒳(t)\mathcal{X}^{(t)},

Pr[h is (c,4Sγ)-positively representable by (t),𝒳(t)]1γ0.3cOTγ0.6.\Pr\left[h\text{ is }(c,4S\sqrt{\gamma})\text{-positively representable by }\mathcal{H}^{(t)},\mathcal{X}^{(t)}\right]\geq 1-\gamma^{0.3}-\frac{c\cdot O^{T}}{\gamma^{0.6}}\,.

Now by Markov’s inequality, with probability at least 1γ0.11-\gamma^{0.1} over the choice of 𝒳(t)\mathcal{X}^{(t)},

Prh^[:t][h is (c,4Sγ)-positively representable by (t),𝒳(t)|𝒳(t)]1γ0.1cOTγ\Pr_{h\sim\widehat{\mathbb{H}}[:t]}\left[h\text{ is }(c,4S\sqrt{\gamma})\text{-positively representable by }\mathcal{H}^{(t)},\mathcal{X}^{(t)}|\mathcal{X}^{(t)}\right]\geq 1-\gamma^{0.1}-\frac{c\cdot O^{T}}{\gamma}

and this proves the second of the desired statements.

Finally, the last two statements follow from Lemma 5.10. Combining the failure probabilities for all of the parts, the total failure probability is at most δ+γ0.1\delta+\gamma^{0.1}, and this completes the proof. ∎

Later on, we will also need the following simple observation.

Claim 6.8.

In the execution of Algorithm 5, assume that |(t)|S|\mathcal{H}^{(t)}|\leq S for all tTt\leq T. Then for any c>0c>0, with probability at least 1(OS)2Tc1-(OS)^{2T}\sqrt{c}, we have w(t)1/c\left\lVert w^{(t)}\right\rVert_{\infty}\leq 1/\sqrt{c} for all t[T]t\in[T].

Proof.

Consider a fixed t[T]t\in[T]. The desired statement can fail only if, when we execute the subroutine Algorithm 2 in line 11 of Algorithm 5, we sample some f𝒪Ttf\in\mathcal{O}^{T-t} such that

h(t)(t)Pr[f|h(t)]ck.\sum_{h^{(t)}\in\mathcal{B}^{(t)}}\Pr[f|h^{(t)}]\leq\frac{\sqrt{c}}{k}\,.

However, the probability of sampling such an element is at most

ckOTk|(t)|OT(O+1)Sc.\frac{\sqrt{c}}{k}O^{T}k|\mathcal{B}^{(t)}|\leq O^{T}(O+1)S\sqrt{c}\,.

Now union bounding over all t[T]t\in[T] immediately gives the desired conclusion. ∎

7 Sampling Procedure

After running Algorithm 5, we have learned a set of parameters

{𝐏(t),(t),𝒳(t),{uh}h(t),w(t)}t[T]\{\mathbf{P}^{(t)},\mathcal{H}^{(t)},\mathcal{X}^{(t)},\{u_{h}\}_{h\in\mathcal{B}^{(t)}},w^{(t)}\}_{t\in[T]}

where recall (t)=(t){(t1)o}o𝒪\mathcal{B}^{(t)}=\mathcal{H}^{(t)}\cup\{\mathcal{H}^{(t-1)}\vee o\}_{o\in\mathcal{O}}. However, it is not yet clear how these parameters determine a distribution. In this section, we show how these parameters determine a distribution that we can efficiently sample from and argue that if the learned parameters satisfy the properties in Lemma 6.6, then this distribution is close to ^\widehat{\mathbb{H}}.

Throughout this section, we assume that we are given the global parameters O,T,SO,T,S and a target accuracy parameter η\eta. We make the following assumption on the accuracy of the learned parameters. In light of Lemma 6.6 and Claim 6.8, we can show that this assumption holds with high probability as long as we ran Algorithm 5 on a distribution ^\widehat{\mathbb{H}} that is ϵ\epsilon-conditionally close to a rank SS distribution \mathbb{H} for sufficiently small ϵ\epsilon.

Assumption 7.1.

We have η>0\eta>0 satisfying η<1/poly(O,T,S)\eta<1/\mathrm{poly}(O,T,S) for some sufficiently large polynomial. Let c=η10OTSc=\eta^{10OTS}. The parameters

{𝐏(t),(t),𝒳(t),{uh}h(t),w(t)}t[T]\{\mathbf{P}^{(t)},\mathcal{H}^{(t)},\mathcal{X}^{(t)},\{u_{h}\}_{h\in\mathcal{B}^{(t)}},w^{(t)}\}_{t\in[T]}

satisfy the following properties:

  • For any tTt\leq T, |(t)|S|\mathcal{H}^{(t)}|\leq S.

  • For any tTt\leq T and h^[:t]h\sim\widehat{\mathbb{H}}[:t], the string hh is (2c,η)(2\sqrt{c},\eta)-positively representable by (t),𝒳(t)\mathcal{H}^{(t)},\mathcal{X}^{(t)} with at least 1η1-\eta probability

  • For all t[T]t\in[T], the vectors w(t),{uh(t)}h(t)w^{(t)},\{u_{h}^{(t)}\}_{h\in\mathcal{B}^{(t)}} have nonnegative entries and have the same dimensionality d=poly(S,O,T,1/η)d=\mathrm{poly}(S,O,T,1/\eta) and satisfy uh(t)=w(t)𝐌(t)[h,𝒳(t)]u_{h}^{(t)}=w^{(t)}\odot\mathbf{M}^{(t)}[h,\mathcal{X}^{(t)}] (where \odot denotes entrywise product)

  • For all t[T]t\in[T], the vectors {uh(t)}h(t)\{u_{h}^{(t)}\}_{h\in\mathcal{B}^{(t)}} are (10S,η)(10S,\eta)-representative for the distributions {Pr^[|h]}h(t)\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{B}^{(t)}}

  • For all t[T]t\in[T], the vectors w(t),{uh(t)}h(t)w^{(t)},\{u_{h}^{(t)}\}_{h\in\mathcal{B}^{(t)}} are (10S,η,cT)(10S,\eta,c^{T}) KL-preserving for the distributions {Pr^[|h]}h(t)\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{B}^{(t)}}

  • w(t)1/c\left\lVert w^{(t)}\right\rVert_{\infty}\leq 1/\sqrt{c} for all t[T]t\in[T]

We will also make the following assumption on the underlying distribution ^\widehat{\mathbb{H}}.

Assumption 7.2.

The distribution ^\widehat{\mathbb{H}} is cc-positive (recall Definition 4.4).

Throughout the rest of this section, we will treat the parameters {𝐏(t),(t),𝒳(t),{uh}h(t),w(t)}t[T]\{\mathbf{P}^{(t)},\mathcal{H}^{(t)},\mathcal{X}^{(t)},\{u_{h}\}_{h\in\mathcal{B}^{(t)}},w^{(t)}\}_{t\in[T]} as fixed. We will describe the sampling procedure and then analyze it, proving that it samples from a distribution close to ^\widehat{\mathbb{H}} as long as the assumptions above hold.

The main sampling algorithm is Algorithm 6. First, we define the following operation which rounds a vector to a point on the simplex corresponding to a probability distribution.

Definition 7.3.

For a vector vnv\in{\mathbb{R}}^{n} and parameter τ>0\tau>0, we define roundτ(v)\textsf{round}_{\tau}(v) to be

(max(v[1],τ)C,,max(v[n],τ)C)\left(\frac{\max(v[1],\tau)}{C},\dots,\frac{\max(v[n],\tau)}{C}\right)

where C=i=1nmax(v[i],τ)C=\sum_{i=1}^{n}\max(v[i],\tau). Note that the output of round()\textsf{round}() is always a valid probability distribution.
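This rounding operation is elementary; a short numpy sketch is included only to fix the convention (the function name is ours).

```python
import numpy as np

def round_tau(v, tau):
    """round_tau from Definition 7.3: raise every entry below tau up to tau,
    then normalize so that the entries sum to 1."""
    clipped = np.maximum(v, tau)
    return clipped / clipped.sum()
```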

At a high-level, Algorithm 6 works as follows. We sample a string xx one character at a time. At each step t<Tt<T, we maintain a linear combination of the strings h(t)h\in\mathcal{H}^{(t)} that is supposed to approximate the string x[:t]=o1o2otx[:t]=o_{1}o_{2}\dots o_{t} given by the first tt characters of xx in the following sense: we want

Pr^[|x[:t]]h(t)αhx[:t]Pr^[|h]\Pr_{\widehat{\mathbb{H}}}[\cdot|x[:t]]\approx\sum_{h\in\mathcal{H}^{(t)}}\alpha_{h}^{x[:t]}\Pr_{\widehat{\mathbb{H}}}[\cdot|h]

as distributions, for some coefficients {αhx[:t]}h(t)\{\alpha_{h}^{x[:t]}\}_{h\in\mathcal{H}^{(t)}}. Given this linear combination, and since 𝐏(t)\mathbf{P}^{(t)} gives us the next character probabilities for all of the strings in (t)\mathcal{H}^{(t)}, we know what the distribution for the next character should be given the first tt characters in x[:t]x[:t] so we simply sample the next character from this distribution. If the next character is ot+1o_{t+1}, so x[:t+1]=x[:t]ot+1x[:t+1]=x[:t]\vee o_{t+1}, then we can re-normalize the coefficients αhx[:t]\alpha_{h}^{x[:t]} to get coefficients α^hx[:t]\widehat{\alpha}_{h}^{x[:t]} such that

Pr^[|x[:t+1]]h(t)α^hx[:t]Pr^[|hot+1].\Pr_{\widehat{\mathbb{H}}}[\cdot|x[:t+1]]\approx\sum_{h\in\mathcal{H}^{(t)}}\widehat{\alpha}_{h}^{x[:t]}\Pr_{\widehat{\mathbb{H}}}[\cdot|h\vee o_{t+1}]\,. (13)

The main difficulty is that we now want to rewrite the RHS as a linear combination of the distributions
{Pr^[|h]}h(t+1)\{\Pr_{\widehat{\mathbb{H}}}[\cdot|h]\}_{h\in\mathcal{H}^{(t+1)}}.

Recalling the discussion in Section 3.3, we will solve a convex optimization problem, that roughly corresponds to projection in KL, for each “change-of-basis” operation. In particular, we will use the fact that the vectors {uh}h(t+1)\{u_{h}\}_{h\in\mathcal{B}^{(t+1)}} (where recall (t+1)=(t+1){(t)o}o𝒪\mathcal{B}^{(t+1)}=\mathcal{H}^{(t+1)}\cup\{\mathcal{H}^{(t)}\vee o\}_{o\in\mathcal{O}}) are succinct representations of Pr^[|h]\Pr_{\widehat{\mathbb{H}}}[\cdot|h]. Then once we have the coefficients α^hx[:t]\widehat{\alpha}_{h}^{x[:t]} in (13), to “change the basis”, we solve a convex optimization problem for coefficients αhx[:t+1]\alpha_{h}^{x[:t+1]} such that |αhx[:t+1]||\alpha_{h}^{x[:t+1]}| are bounded and

KL(h(t+1)αhx[:t+1]uhh(t)α^hx[:t]uhot+1)\textsf{KL}\left(\sum_{h\in\mathcal{H}^{(t+1)}}\alpha_{h}^{x[:t+1]}u_{h}\Bigg{\|}\sum_{h\in\mathcal{H}^{(t)}}\widehat{\alpha}_{h}^{x[:t]}u_{h\vee o_{t+1}}\right)

is minimized. Recall the crucial property of KL divergence that ensures the error only grows additively but not multiplicatively in each step is that projection in KL onto a convex set decreases the distance (in KL) to all other points in the set (see Fact 4.10).
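To make the projection step concrete, the following is a hedged sketch of this optimization using cvxpy. It anticipates the exact constraints used in Line 17 of Algorithm 6 below, and it assumes the truncated divergence has the entrywise form KL_{>=w}(u||v) = sum_i u_i log(max(u_i, w_i) / max(v_i, w_i)), which is consistent with how the truncated divergence is expanded in the proof of Lemma 5.10. On the feasible set the truncation of the first argument is inactive, which is what makes the objective convex in the coefficients.

```python
import cvxpy as cp
import numpy as np

def change_of_basis(R, z, w, c_pow, S):
    """Sketch of the convex program in Line 17 of Algorithm 6. R has rows u_h for
    h in H^(t+1), z is the target vector, w is w^(t+1), and c_pow is c^(T-t)."""
    m, d = R.shape
    alpha = cp.Variable(m)
    u = R.T @ alpha                         # the vector alpha^T R^(t+1)
    thresh = c_pow * w
    denom = np.maximum(z, thresh)           # truncation of the second argument is a constant
    # On the feasible set u >= thresh entrywise, so the truncated KL reduces to
    # sum_i u_i log u_i - u_i log denom_i, which is convex in alpha.
    objective = cp.sum(-cp.entr(u) - cp.multiply(u, np.log(denom)))
    constraints = [cp.abs(alpha) <= 3 * S, u >= thresh, cp.sum(u) == 1]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return alpha.value
```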

For describing our algorithm formally, it will be convenient to define the following notation for arranging the rows {uh}h(t)\{u_{h}\}_{h\in\mathcal{B}^{(t)}} as the rows of a matrix.

Definition 7.4.

Let 𝐑(t)\mathbf{R}^{(t)} be the matrix whose rows are given by {uh}h(t)\{u_{h}\}_{h\in\mathcal{H}^{(t)}}. For o𝒪o\in\mathcal{O}, let 𝐓o(t)\mathbf{T}^{(t)}_{o} be the matrix whose rows are given by {uh}h(t)o\{u_{h}\}_{h\in\mathcal{H}^{(t)}\vee o}.

Algorithm 6 Sampling Procedure
1:Input: S,O,T,η>0S,O,T\in\mathbb{N},\eta>0
2:Input: Learned representation {𝐏(t),(t),𝒳(t),{uh}h(t),w(t)}t[T]\{\mathbf{P}^{(t)},\mathcal{H}^{(t)},\mathcal{X}^{(t)},\{u_{h}\}_{h\in\mathcal{B}^{(t)}},w^{(t)}\}_{t\in[T]}
3:for t[T]t\in[T], o𝒪o\in\mathcal{O} do
4:     Let 𝐑(t)\mathbf{R}^{(t)} be the matrix whose rows are given by {uh}h(t)\{u_{h}\}_{h\in\mathcal{H}^{(t)}}
5:     Let 𝐓o(t)\mathbf{T}^{(t)}_{o} be the matrix whose rows are given by {uh}h(t)o\{u_{h}\}_{h\in\mathcal{H}^{(t)}\vee o}.
6:end for
7:Set c=η10OTSc=\eta^{10OTS}
8:Set x[:0]x[:0] to be the empty string
9:Let 𝜶x[:0]=1\boldsymbol{\alpha}^{x[:0]}=1
10:for t=0,1,,T1t=0,1,\dots,T-1 do
11:     Compute probabilities {po}o𝒪=round2c0.1(𝜶x[:t]𝐏(t))\{p_{o}\}_{o\in\mathcal{O}}=\textsf{round}_{2c^{0.1}}({\boldsymbol{\alpha}^{x[:t]}}^{\top}\mathbf{P}^{(t)})
12:     Sample next character ot+1𝒪o_{t+1}\in\mathcal{O} with probabilities {po}o𝒪\{p_{o}\}_{o\in\mathcal{O}}
13:     Set x[:t+1]=x[:t]ot+1x[:t+1]=x[:t]\vee o_{t+1}
14:     Let the entries in the column of 𝐏(t)\mathbf{P}^{(t)} indexed by ot+1o_{t+1} be {qhot+1}h(t)\{q^{o_{t+1}}_{h}\}_{h\in\mathcal{H}^{(t)}}
15:     Define the vector 𝝂x[:t+1]=1pot+1𝜶x[:t]diag({qhot+1}h(t))\boldsymbol{\nu}^{x[:t+1]}=\frac{1}{p_{o_{t+1}}}{\boldsymbol{\alpha}^{x[:t]}}^{\top}\textsf{diag}\left(\{q^{o_{t+1}}_{h}\}_{h\in\mathcal{H}^{(t)}}\right)
16:     Let
𝐳x[:t+1]=𝝂x[:t+1]𝐓o(t)\mathbf{z}^{x[:t+1]}={\boldsymbol{\nu}^{x[:t+1]}}^{\top}\mathbf{T}_{o}^{(t)}
17:      Compute vector 𝜶={αh}h(t+1)\boldsymbol{\alpha}=\{\alpha_{h}\}_{h\in\mathcal{H}^{(t+1)}} such that
|αh|3Sh(t+1)𝜶𝐑(t+1)cTtw(t+1) entrywise𝜶𝐑(t+1),𝟏=1 where 𝟏 is the all ones vector\begin{split}&|\alpha_{h}|\leq 3S\quad\quad\forall h\in\mathcal{H}^{(t+1)}\\ &{\boldsymbol{\alpha}}^{\top}\mathbf{R}^{(t+1)}\geq c^{T-t}w^{(t+1)}\text{ entrywise}\\ &\langle{\boldsymbol{\alpha}}^{\top}\mathbf{R}^{(t+1)},\mathbf{1}\rangle=1\text{ where }\mathbf{1}\text{ is the all ones vector}\end{split}
and which minimizes
KLcTtw(t+1)(𝜶𝐑(t+1)𝐳x[:t+1])\textsf{KL}_{\geq c^{T-t}w^{(t+1)}}\left({\boldsymbol{\alpha}}^{\top}\mathbf{R}^{(t+1)}\|\mathbf{z}^{x[:t+1]}\right)
18:     Set 𝜶x[:t+1]={αhx[:t+1]}h(t+1)\boldsymbol{\alpha}^{x[:t+1]}=\{\alpha^{x[:t+1]}_{h}\}_{h\in\mathcal{H}^{(t+1)}} to be the minimizer in the above optimization problem
19:end for
20:Let xx be the string o1o2oTo_{1}o_{2}\cdots o_{T}
21:Output: xx
Remark 7.5.

Note that the constraints in the optimization problem in Line 17 of Algorithm 6 ensure that the KL divergence is a convex function of 𝛂x[:t+1]\boldsymbol{\alpha}^{x[:t+1]} over the entire feasible domain and thus can be optimized efficiently.
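To summarize the bookkeeping in lines 11 through 18, one iteration of Algorithm 6 can be sketched as follows, reusing the round_tau and change_of_basis helpers sketched above; the function and argument names are ours.

```python
def sampling_step(alpha, P_t, R_next, T_next, w_next, c, t, T, S, rng):
    """One iteration of the loop in Algorithm 6. P_t holds next-character
    probabilities for H^(t), T_next[o] has rows u_{h followed by o} for h in H^(t),
    and R_next has rows u_h for h in H^(t+1); rng is e.g. np.random.default_rng()."""
    probs = round_tau(alpha @ P_t, 2 * c**0.1)                       # line 11
    o = rng.choice(len(probs), p=probs)                              # line 12
    nu = alpha * P_t[:, o] / probs[o]                                # line 15
    z = nu @ T_next[o]                                               # line 16
    alpha_next = change_of_basis(R_next, z, w_next, c**(T - t), S)   # lines 17-18
    return o, alpha_next
```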

The analysis will rely on bounding the (truncated) KL divergence between the true distribution and the distribution that our algorithm samples from. We will do this inductively in tt via a hybridization argument. However, there are certain “bad” histories that we will need to truncate out and these are precisely those that are not positively representable in Lemma 6.6.

Definition 7.6.

For tTt\leq T, we let 𝒢(t)𝒪t\mathcal{G}^{(t)}\subseteq\mathcal{O}^{t} be the subset of h𝒪th\in\mathcal{O}^{t} such that for all ttt^{\prime}\leq t, the prefix of hh of length tt^{\prime} is (2c,η)(2\sqrt{c},\eta)-positively representable by (t),𝒳(t)\mathcal{H}^{(t^{\prime})},\mathcal{X}^{(t^{\prime})}.

The key lemma of the analysis is Lemma 7.12 but first we will need to prove a sequence of preliminary claims. We apply Corollary 4.11 to analyze the convex optimization step in Algorithm 6. We get that, up to some small additive error, replacing 𝐳x[:t+1]\mathbf{z}^{x[:t+1]} with the solution 𝜶x[:t+1]𝐑(t+1){\boldsymbol{\alpha}^{x[:t+1]}}^{\top}\mathbf{R}^{(t+1)} that we compute reduces the KL divergence to all vectors in the feasible set of the convex program. We can then use Assumption 7.1 to lift this argument to the original distributions and argue that our change-of-basis step only incurs some small additive error overall.

Claim 7.7.

Let t<Tt<T. Consider the execution of Algorithm 6 and assume that so far we have sampled t+1t+1 characters i.e. x[:t+1]=o1ot+1x[:t+1]=o_{1}\cdots o_{t+1}. Let 𝛂={αh}h(t+1)\boldsymbol{\alpha}=\{{\alpha_{h}}\}_{h\in\mathcal{H}^{(t+1)}} be any feasible set of coefficients in the convex program in Line 17 of Algorithm 6. Then we have

KLcTtw(t+1)(𝜶𝐑(t+1)𝜶x[:t+1]𝐑(t+1))KLcTtw(t+1)(𝜶𝐑(t+1)𝐳x[:t+1])+log(𝐳x[:t+1]1+η).\textsf{KL}_{\geq c^{T-t}w^{(t+1)}}\left(\boldsymbol{\alpha}^{\top}\mathbf{R}^{(t+1)}\|{\boldsymbol{\alpha}^{x[:t+1]}}^{\top}\mathbf{R}^{(t+1)}\right)\leq\textsf{KL}_{\geq c^{T-t}w^{(t+1)}}\left(\boldsymbol{\alpha}^{\top}\mathbf{R}^{(t+1)}\|\mathbf{z}^{x[:t+1]}\right)+\log\left(\left\lVert\mathbf{z}^{x[:t+1]}\right\rVert_{1}+\eta\right)\,.
Proof.

We apply Corollary 4.11 where 𝒦\mathcal{K} is the set of all vectors 𝜶𝐑(t+1)\boldsymbol{\alpha}^{\top}\mathbf{R}^{(t+1)} where 𝜶\boldsymbol{\alpha} is feasible in the convex program. It is clear that this set is convex. Also, if we set wcTtw(t+1)w\leftarrow c^{T-t}w^{(t+1)} in Corollary 4.11 then all of the hypotheses are satisfied. By Assumption 7.1, w1dcη\left\lVert w\right\rVert_{1}\leq d\sqrt{c}\leq\eta. Also by definition, all elements of 𝒦\mathcal{K} have sum of entries equal to 11. Thus, we get

KLcTtw(t+1)(𝜶𝐑(t+1)𝜶x[:t+1]𝐑(t+1))KLcTtw(t+1)(𝜶𝐑(t+1)𝐳x[:t+1])+log(𝐳x[:t+1]1+η)\textsf{KL}_{\geq c^{T-t}w^{(t+1)}}\left(\boldsymbol{\alpha}^{\top}\mathbf{R}^{(t+1)}\|{\boldsymbol{\alpha}^{x[:t+1]}}^{\top}\mathbf{R}^{(t+1)}\right)\leq\textsf{KL}_{\geq c^{T-t}w^{(t+1)}}\left(\boldsymbol{\alpha}^{\top}\mathbf{R}^{(t+1)}\|\mathbf{z}^{x[:t+1]}\right)+\log\left(\left\lVert\mathbf{z}^{x[:t+1]}\right\rVert_{1}+\eta\right)

as desired. ∎

We will also need the following observation about Algorithm 6 that the vector

𝜶x[:t]𝐌(t)[(t),:]{\boldsymbol{\alpha}^{x[:t]}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]

is always close to being a valid distribution.

Claim 7.8.

Throughout the execution of Algorithm 6, for any t<Tt<T, any feasible solution 𝛂={αh}h(t+1)\boldsymbol{\alpha}=\{\alpha_{h}\}_{h\in\mathcal{H}^{(t+1)}} to the convex program in Line 17 must satisfy

  • 13S2ηh(t+1)αh1+3S2η1-3S^{2}\eta\leq\sum_{h\in\mathcal{H}^{(t+1)}}\alpha_{h}\leq 1+3S^{2}\eta

  • 1η𝜶𝐌(t+1)[(t+1),:]11+η1-\eta\leq\left\lVert\boldsymbol{\alpha}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right\rVert_{1}\leq 1+\eta

Proof.

The third constraint of the convex program implies that

1=h(t+1)αhuh,𝟏=h(t+1)αh+h(t+1)αh(uh,𝟏1).1=\sum_{h\in\mathcal{H}^{(t+1)}}\alpha_{h}\langle u_{h},\mathbf{1}\rangle=\sum_{h\in\mathcal{H}^{(t+1)}}\alpha_{h}+\sum_{h\in\mathcal{H}^{(t+1)}}\alpha_{h}(\langle u_{h},\mathbf{1}\rangle-1)\,.

However Assumption 7.1 implies that |uh,𝟏1|η|\langle u_{h},\mathbf{1}\rangle-1|\leq\eta for all hh and thus we conclude

13S2ηh(t+1)αh1+3S2η.1-3S^{2}\eta\leq\sum_{h\in\mathcal{H}^{(t+1)}}\alpha_{h}\leq 1+3S^{2}\eta\,.

Now for the second statement, note that the second and third constraints of the convex program together enforce that h(t+1)αhuh1=1\left\lVert\sum_{h\in\mathcal{H}^{(t+1)}}\alpha_{h}u_{h}\right\rVert_{1}=1 and then Assumption 7.1 implies that

1η𝜶𝐌(t+1)[(t+1),:]11+η1-\eta\leq\left\lVert\boldsymbol{\alpha}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right\rVert_{1}\leq 1+\eta

as desired. ∎

Corollary 7.9.

Throughout the execution of Algorithm 6, the probabilities pop_{o} computed in line 11 are always at least c0.1c^{0.1}.

Proof.

Let the entries of 𝜶x[:t]𝐏(t){\boldsymbol{\alpha}^{x[:t]}}^{\top}\mathbf{P}^{(t)} be {ro}o𝒪\{r_{o}\}_{o\in\mathcal{O}}. By Claim 7.8,

o𝒪|ro|1+η.\sum_{o\in\mathcal{O}}|r_{o}|\leq 1+\eta\,.

Thus,

o𝒪max(2c0.1,ro)1+η+2c0.1O2\sum_{o\in\mathcal{O}}\max(2c^{0.1},r_{o})\leq 1+\eta+2c^{0.1}O\leq 2

and this immediately implies the desired statement. ∎
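
For concreteness, here is a minimal sketch of one way the probabilities in line 11 could be computed, assuming they are obtained by flooring the raw scores $r_{o}$ at $2c^{0.1}$ and renormalizing, which is consistent with the bound just proved; the helper name next_char_probs is ours.

import numpy as np

def next_char_probs(r, c):
    # Floor the raw scores at 2*c**0.1 and renormalize.  The floor guarantees
    # every output probability is at least c**0.1, because (as shown above)
    # the normalizer is at most 2.
    floored = np.maximum(r, 2 * c ** 0.1)
    return floored / floored.sum()

# Toy usage: scores coming from an approximate representation may dip below 0.
r = np.array([0.50, 0.30, 0.21, -0.01])
print(next_char_probs(r, c=1e-10))          # every entry is at least 0.1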

Now we can take Claim 7.7 and then use Assumption 7.1 to replace the matrix $\mathbf{R}^{(t+1)}$ with the matrix of true probabilities $\mathbf{M}^{(t+1)}$ as follows.

Claim 7.10.

Fix an $x[:t]\in\mathcal{G}^{(t)}$ (recall Definition 7.6). Let $o\in\mathcal{O}$ be such that $x[:t]\vee o\in\mathcal{G}^{(t+1)}$. Assume that in the execution of Algorithm 6, the first $t$ characters we have sampled are exactly $x[:t]$ and the $(t+1)$-st character is sampled to be $o_{t+1}=o$, so we just set $x[:t+1]=x[:t]\vee o$ in line 13. Then after solving the optimization problem in line 17 to compute the next set of coefficients $\boldsymbol{\alpha}^{x[:t]\vee o}$, we have

\begin{split}&\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[x[:t]\vee o,:]\|{\boldsymbol{\alpha}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right)\\ &\leq\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[x[:t]\vee o,:]\|{\boldsymbol{\nu}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t)}\vee o,:]\right)+\log\left(\left\lVert\mathbf{z}^{x[:t]\vee o}\right\rVert_{1}+\eta\right)+\eta^{0.5}\,.\end{split}
Proof.

By the assumption that $x[:t]\vee o\in\mathcal{G}^{(t+1)}$, there exist coefficients $\boldsymbol{\beta}^{x[:t]\vee o}=\{\beta_{h}^{x[:t]\vee o}\}_{h\in\mathcal{H}^{(t+1)}}$ such that $\left\lVert\boldsymbol{\beta}^{x[:t]\vee o}\right\rVert_{\infty}\leq 2S$ and

  • $\left\lVert\mathbf{M}^{(t+1)}[x[:t]\vee o,:]-{\boldsymbol{\beta}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right\rVert_{1}\leq 2S\eta$ (14)
  • ${\boldsymbol{\beta}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},\mathcal{X}^{(t+1)}]$ has all entries at least $2\sqrt{c}$

Note that the first condition implies that

1-2S\eta\leq\sum_{h\in\mathcal{H}^{(t+1)}}\beta_{h}^{x[:t]\vee o}\leq 1+2S\eta

because all rows of the matrix $\mathbf{M}^{(t+1)}$ have sum equal to $1$. Let $\mathbf{1}$ be the all-ones vector of dimension $d$ (recall Assumption 7.1). Then define

Q\mathrel{\mathop{:}}=\langle{\boldsymbol{\beta}^{x[:t]\vee o}}^{\top}\mathbf{R}^{(t+1)},\mathbf{1}\rangle=\sum_{h\in\mathcal{H}^{(t+1)}}\beta_{h}^{x[:t]\vee o}\langle u_{h},\mathbf{1}\rangle=\sum_{h\in\mathcal{H}^{(t+1)}}\beta_{h}^{x[:t]\vee o}+\sum_{h\in\mathcal{H}^{(t+1)}}\beta_{h}^{x[:t]\vee o}\left(\langle u_{h},\mathbf{1}\rangle-1\right)\,.

By Assumption 7.1 (specifically that the vectors $u_{h}$ are representative for the distributions $\Pr_{\widehat{\mathbb{H}}}[\cdot|h]$), this implies that

1-5S^{2}\eta\leq Q\leq 1+5S^{2}\eta\,. (15)

Thus, using Assumption 7.1, the vector $\boldsymbol{\beta}^{x[:t]\vee o}/Q=\{\beta_{h}^{x[:t]\vee o}/Q\}_{h\in\mathcal{H}^{(t+1)}}$ must be a feasible solution to the convex program in Line 17 of Algorithm 6. By Claim 7.7,

\begin{split}&\textsf{KL}_{\geq c^{T-t}w^{(t+1)}}\left({\boldsymbol{\beta}^{x[:t]\vee o}}^{\top}\mathbf{R}^{(t+1)}/Q\|{\boldsymbol{\alpha}^{x[:t]\vee o}}^{\top}\mathbf{R}^{(t+1)}\right)\\ &\leq\textsf{KL}_{\geq c^{T-t}w^{(t+1)}}\left({\boldsymbol{\beta}^{x[:t]\vee o}}^{\top}\mathbf{R}^{(t+1)}/Q\|\mathbf{z}^{x[:t]\vee o}\right)+\log\left(\left\lVert\mathbf{z}^{x[:t]\vee o}\right\rVert_{1}+\eta\right)\end{split}

Next, recall the vector $\boldsymbol{\nu}^{x[:t]\vee o}\mathrel{\mathop{:}}=\{q_{h}^{o}\alpha_{h}^{x[:t]}/p_{o}\}_{h\in\mathcal{H}^{(t)}}$ constructed in Algorithm 6 and recall that

\mathbf{z}^{x[:t]\vee o}={\boldsymbol{\nu}^{x[:t]\vee o}}^{\top}\mathbf{T}_{o}^{(t)}\,.

Then by the KL-preserving property in Assumption 7.1 (and Corollary 7.9, which bounds the entries of $\boldsymbol{\nu}^{x[:t]\vee o}$),

\begin{split}&\textsf{KL}_{\geq c^{T-t}}\left({\boldsymbol{\beta}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]/Q\|{\boldsymbol{\alpha}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right)\\ &\leq\textsf{KL}_{\geq c^{T-t}}\left({\boldsymbol{\beta}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]/Q\|{\boldsymbol{\nu}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t)}\vee o,:]\right)+\log\left(\left\lVert\mathbf{z}^{x[:t]\vee o}\right\rVert_{1}+\eta\right)+2\eta\,.\end{split}

Finally, by Claim 4.12 and (14), (15), the above implies

\begin{split}&\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[x[:t]\vee o,:]\|{\boldsymbol{\alpha}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right)\\ &\leq\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[x[:t]\vee o,:]\|{\boldsymbol{\nu}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t)}\vee o,:]\right)+\log\left(\left\lVert\mathbf{z}^{x[:t]\vee o}\right\rVert_{1}+\eta\right)+\eta^{0.5}\end{split} (16)

where in the last step we used the assumption that $\eta$ is sufficiently small. This completes the proof. ∎

Now we sum Claim 7.10 over all possibilities for $o\in\mathcal{O}$ to get the following bound.

Claim 7.11.

Fix an $x[:t]\in\mathcal{G}^{(t)}$. Assume that in the execution of Algorithm 6, the first $t$ characters we have sampled are exactly $x[:t]$. Then we have

\begin{split}&\sum_{o\in\mathcal{O}}\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]\cdot\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[x[:t]\vee o,:]\|{\boldsymbol{\alpha}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right)\\ &\leq\textsf{KL}_{\geq c^{T-t+1}}\left(\mathbf{M}^{(t)}[x[:t],:]\|{\boldsymbol{\alpha}^{x[:t]}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]\right)+2\eta^{0.5}+5T\log(1/c)\cdot\sum_{\begin{subarray}{c}o\in\mathcal{O}\\ x[:t]\vee o\notin\mathcal{G}^{(t+1)}\end{subarray}}\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]\,.\end{split}
Proof.

Let $\mathcal{F}_{o}^{T-t}\subset\mathcal{O}^{T-t}$ be the subset of all strings whose first character is $o$. Then we have

{\boldsymbol{\nu}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t)}\vee o,:]=\frac{1}{p_{o}}\sum_{h\in\mathcal{H}^{(t)}}q_{h}^{o}\alpha_{h}^{x[:t]}\mathbf{M}^{(t+1)}[h\vee o,:]=\frac{1}{p_{o}}{\boldsymbol{\alpha}^{x[:t]}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},\mathcal{F}_{o}^{T-t}]\,.

Also note that

\mathbf{M}^{(t+1)}[x[:t]\vee o,:]=\frac{\mathbf{M}^{(t)}[x[:t],\mathcal{F}_{o}^{T-t}]}{\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]}\,.

Note that all entries of the matrix $\mathbf{M}^{(t)}$ are at least $c^{T-t}$ by Assumption 7.2 and $p_{o}\geq c^{0.1}$ by Corollary 7.9 so using the definition of truncated KL,

\begin{split}&\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]\cdot\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[x[:t]\vee o,:]\|{\boldsymbol{\nu}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t)}\vee o,:]\right)\\ &\leq\textsf{KL}_{\geq c^{T-t+1}}\left(\mathbf{M}^{(t)}[x[:t],\mathcal{F}_{o}^{T-t}]\|{\boldsymbol{\alpha}^{x[:t]}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},\mathcal{F}_{o}^{T-t}]\right)+\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]\log\frac{p_{o}}{\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]}\,.\end{split} (17)
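
For intuition, (17) is an instance of the following elementary identity, stated here without the truncation thresholds: if $\{q_{y}\}_{y\in\mathcal{F}}$ are nonnegative with $p=\sum_{y\in\mathcal{F}}q_{y}>0$ and $\{r_{y}\}_{y\in\mathcal{F}}$ and $s$ are positive, then

p\cdot\textsf{KL}\left(\frac{q}{p}\,\Big\|\,\frac{r}{s}\right)=\sum_{y\in\mathcal{F}}q_{y}\log\frac{q_{y}}{r_{y}}+p\log\frac{s}{p}\,.

Taking $q=\mathbf{M}^{(t)}[x[:t],\mathcal{F}_{o}^{T-t}]$, $r={\boldsymbol{\alpha}^{x[:t]}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},\mathcal{F}_{o}^{T-t}]$, $p=\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]$ and $s=p_{o}$ recovers the untruncated analogue of (17).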

Next, we will apply Claim 7.10 and (17) to bound the sum

\sum_{o\in\mathcal{O}}\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]\cdot\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[x[:t]\vee o,:]\|{\boldsymbol{\alpha}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right)\,.

However, we can only apply Claim 7.10 when $o$ is such that $x[:t]\vee o\in\mathcal{G}^{(t+1)}$. For other choices of $o$, we can nevertheless use the following trivial version of Claim 7.10:

\begin{split}&\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[x[:t]\vee o,:]\|{\boldsymbol{\alpha}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right)\\ &\leq\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[x[:t]\vee o,:]\|{\boldsymbol{\nu}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t)}\vee o,:]\right)+\log\left(\left\lVert\mathbf{z}^{x[:t]\vee o}\right\rVert_{1}+\eta\right)+\eta^{0.5}+5T\log(1/c)\end{split}

where the above holds simply because both of the KL divergences are bounded in magnitude by $2T\log(1/c)$. Thus, we have the bound

\begin{split}&\sum_{o\in\mathcal{O}}\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]\cdot\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[x[:t]\vee o,:]\|{\boldsymbol{\alpha}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right)\\ &\leq\eta^{0.5}+\sum_{o\in\mathcal{O}}\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]\cdot\log\left(\left\lVert\mathbf{z}^{x[:t]\vee o}\right\rVert_{1}+\eta\right)+\sum_{o\in\mathcal{O}}\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]\log\frac{p_{o}}{\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]}\\ &\quad+\sum_{o\in\mathcal{O}}\textsf{KL}_{\geq c^{T-t+1}}\left(\mathbf{M}^{(t)}[x[:t],\mathcal{F}_{o}^{T-t}]\|{\boldsymbol{\alpha}^{x[:t]}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},\mathcal{F}_{o}^{T-t}]\right)+5T\log(1/c)\cdot\sum_{\begin{subarray}{c}o\in\mathcal{O}\\ x[:t]\vee o\notin\mathcal{G}^{(t+1)}\end{subarray}}\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]\\ &=\textsf{KL}_{\geq c^{T-t+1}}\left(\mathbf{M}^{(t)}[x[:t],:]\|{\boldsymbol{\alpha}^{x[:t]}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]\right)\\ &\quad+\eta^{0.5}+\sum_{o\in\mathcal{O}}\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]\cdot\log\left(\frac{\left(\left\lVert\mathbf{z}^{x[:t]\vee o}\right\rVert_{1}+\eta\right)p_{o}}{\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]}\right)+5T\log(1/c)\cdot\sum_{\begin{subarray}{c}o\in\mathcal{O}\\ x[:t]\vee o\notin\mathcal{G}^{(t+1)}\end{subarray}}\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]\,.\end{split} (18)

Next, by definition and Assumption 7.1,

p_{o}\left\lVert\mathbf{z}^{x[:t]\vee o}\right\rVert_{1}=\left\lVert p_{o}{\boldsymbol{\nu}^{x[:t]\vee o}}^{\top}\mathbf{T}_{o}^{(t)}\right\rVert_{1}\leq\left\lVert p_{o}{\boldsymbol{\nu}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t)}\vee o,:]\right\rVert_{1}+\eta\,.

Observe that the vector ${\boldsymbol{\alpha}^{x[:t]}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]$ is the same as concatenating $p_{o}{\boldsymbol{\nu}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t)}\vee o,:]$ over all choices of $o$. Thus, by Claim 7.8,

\sum_{o\in\mathcal{O}}p_{o}\left(\left\lVert\mathbf{z}^{x[:t]\vee o}\right\rVert_{1}+\eta\right)\leq(O+1)\eta+\left\lVert{\boldsymbol{\alpha}^{x[:t]}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]\right\rVert_{1}\leq 1+(O+2)\eta\,.

Combining this with (18) and using nonnegativity of KL (which gives $\sum_{o}P_{o}\log(a_{o}/P_{o})\leq\log\sum_{o}a_{o}$ for any probability vector $\{P_{o}\}$ and positive $\{a_{o}\}$) yields

\begin{split}&\sum_{o\in\mathcal{O}}\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]\cdot\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[x[:t]\vee o,:]\|{\boldsymbol{\alpha}^{x[:t]\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right)\\ &\leq\textsf{KL}_{\geq c^{T-t+1}}\left(\mathbf{M}^{(t)}[x[:t],:]\|{\boldsymbol{\alpha}^{x[:t]}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]\right)+\eta^{0.5}+\log\left(\sum_{o\in\mathcal{O}}p_{o}\left(\left\lVert\mathbf{z}^{x[:t]\vee o}\right\rVert_{1}+\eta\right)\right)\\ &\quad+5T\log(1/c)\cdot\sum_{\begin{subarray}{c}o\in\mathcal{O}\\ x[:t]\vee o\notin\mathcal{G}^{(t+1)}\end{subarray}}\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]\\ &\leq\textsf{KL}_{\geq c^{T-t+1}}\left(\mathbf{M}^{(t)}[x[:t],:]\|{\boldsymbol{\alpha}^{x[:t]}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]\right)+2\eta^{0.5}+5T\log(1/c)\cdot\sum_{\begin{subarray}{c}o\in\mathcal{O}\\ x[:t]\vee o\notin\mathcal{G}^{(t+1)}\end{subarray}}\Pr_{\widehat{\mathbb{H}}}[o|x[:t]]\,.\end{split}

where in the last step we used that $\log(1+(O+2)\eta)\leq\eta^{0.5}$ for sufficiently small $\eta$. ∎

Next, we sum Claim 7.11 over all possibilities for $x[:t]\in\mathcal{G}^{(t)}$. Note that in the execution of Algorithm 6, we can associate to every possible history $h$ a unique vector of coefficients $\boldsymbol{\alpha}^{h}$, namely the value of the vector $\boldsymbol{\alpha}^{x[:t]}$ computed by the algorithm when the sampled prefix $x[:t]$ is equal to $h$.

Lemma 7.12.

For all $0\leq t\leq T-1$, we have

\begin{split}&\sum_{h\in\mathcal{G}^{(t+1)}}\Pr_{\widehat{\mathbb{H}}}[h]\cdot\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[h,:]\|{\boldsymbol{\alpha}^{h}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right)\\ &\leq\sum_{h\in\mathcal{G}^{(t)}}\Pr_{\widehat{\mathbb{H}}}[h]\cdot\textsf{KL}_{\geq c^{T-t+1}}\left(\mathbf{M}^{(t)}[h,:]\|{\boldsymbol{\alpha}^{h}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]\right)+3\eta^{0.5}\,.\end{split}
Proof.

Summing Claim 7.11 as $x[:t]$ ranges over all $h\in\mathcal{G}^{(t)}$ (with weights $\Pr_{\widehat{\mathbb{H}}}[h]$) gives

\begin{split}&\sum_{h\in\mathcal{G}^{(t)},o\in\mathcal{O}}\Pr_{\widehat{\mathbb{H}}}[h\vee o]\cdot\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[h\vee o,:]\|{\boldsymbol{\alpha}^{h\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right)\\ &\leq 5T\log(1/c)\cdot\sum_{h\vee o\notin\mathcal{G}^{(t+1)}}\Pr_{\widehat{\mathbb{H}}}[h\vee o]+2\eta^{0.5}+\sum_{h\in\mathcal{G}^{(t)}}\Pr_{\widehat{\mathbb{H}}}[h]\cdot\textsf{KL}_{\geq c^{T-t+1}}\left(\mathbf{M}^{(t)}[h,:]\|{\boldsymbol{\alpha}^{h}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]\right)\,.\end{split} (19)

Finally, note that

\left\lvert\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[h\vee o,:]\|{\boldsymbol{\alpha}^{h\vee o}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right)\right\rvert\leq 2T\log(1/c)

because $\mathbf{M}^{(t+1)}[h\vee o,:]$ has nonnegative entries summing to $1$ and Claim 7.8 bounds the entries of the second argument. Thus by Assumption 7.1 (specifically the condition about $h\sim\widehat{\mathbb{H}}[:t]$ being positively representable), we conclude

\sum_{h\vee o\notin\mathcal{G}^{(t+1)}}\Pr_{\widehat{\mathbb{H}}}[h\vee o]\leq T\eta\,.

Thus from (19) and using our assumptions on $\eta$, we get

\begin{split}&\sum_{h\in\mathcal{G}^{(t+1)}}\Pr_{\widehat{\mathbb{H}}}[h]\cdot\textsf{KL}_{\geq c^{T-t}}\left(\mathbf{M}^{(t+1)}[h,:]\|{\boldsymbol{\alpha}^{h}}^{\top}\mathbf{M}^{(t+1)}[\mathcal{H}^{(t+1)},:]\right)\\ &\leq\sum_{h\in\mathcal{G}^{(t)}}\Pr_{\widehat{\mathbb{H}}}[h]\cdot\textsf{KL}_{\geq c^{T-t+1}}\left(\mathbf{M}^{(t)}[h,:]\|{\boldsymbol{\alpha}^{h}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]\right)+3\eta^{0.5}\end{split}

as desired. ∎

Now we can prove the main lemma of this section, where we show that the distribution that our algorithm samples from is close to the desired distribution $\widehat{\mathbb{H}}$.

Lemma 7.13.

Under Assumption 7.1 and Assumption 7.2, the distribution that Algorithm 6 samples from, say $\mathbb{H}^{\prime}$, satisfies

d_{\textsf{TV}}(\widehat{\mathbb{H}},\mathbb{H}^{\prime})\leq 10T\eta^{0.1}\,.
Proof.

First, for a fixed history $h\in\mathcal{O}^{(t)}$, let $\boldsymbol{\beta}^{h}$ be the vector

\max({\boldsymbol{\alpha}^{h}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:],c^{T-t+1})

where the maximum is taken entrywise. Clearly $\boldsymbol{\beta}^{h}$ has all positive entries and by Claim 7.8 (and the definition of $c$), the sum of the entries is at most $1+2\eta$. Now let

\boldsymbol{\theta}^{h}=\frac{\boldsymbol{\beta}^{h}}{\left\lVert\boldsymbol{\beta}^{h}\right\rVert_{1}}

i.e., normalizing so that the sum of the entries is $1$. Note that all of the entries of the matrix $\mathbf{M}^{(t)}$ are at least $c^{T-t}$ so

\textsf{KL}_{\geq c^{T-t+1}}\left(\mathbf{M}^{(t)}[h,:]\|{\boldsymbol{\alpha}^{h}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]\right)=\textsf{KL}\left(\mathbf{M}^{(t)}[h,:]\|\boldsymbol{\beta}^{h}\right)\geq\textsf{KL}\left(\mathbf{M}^{(t)}[h,:]\|\boldsymbol{\theta}^{h}\right)-2\eta\,.

Now combining the above with Lemma 7.12 (applied repeatedly for timesteps $0,1,\dots,t-1$) implies

\sum_{h\in\mathcal{G}^{(t)}}\Pr_{\widehat{\mathbb{H}}}[h]\cdot\textsf{KL}\left(\mathbf{M}^{(t)}[h,:]\|\boldsymbol{\theta}^{h}\right)\leq 4t\eta^{0.5}\leq\eta^{0.4}\,.

Recall by Assumption 7.1 that

\Pr_{h\sim\widehat{\mathbb{H}}[:t]}[h\in\mathcal{G}^{(t)}]\geq 1-\eta

and thus by Markov's inequality and Pinsker's inequality, we deduce

\Pr_{h\sim\widehat{\mathbb{H}}[:t]}\left[\left\lVert\boldsymbol{\theta}^{h}-\mathbf{M}^{(t)}[h,:]\right\rVert_{1}\geq\eta^{0.1}\right]\leq 2\eta^{0.2} (20)

Whenever the event in (20) does not occur, i.e., $\left\lVert\boldsymbol{\theta}^{h}-\mathbf{M}^{(t)}[h,:]\right\rVert_{1}<\eta^{0.1}$, since

\left\lVert{\boldsymbol{\alpha}^{h}}^{\top}\mathbf{M}^{(t)}[\mathcal{H}^{(t)},:]-\boldsymbol{\theta}^{h}\right\rVert_{1}\leq 4\eta

we get that the distributions $\{\Pr_{\widehat{\mathbb{H}}}[o|h]\}_{o\in\mathcal{O}},\{\Pr_{\mathbb{H}^{\prime}}[o|h]\}_{o\in\mathcal{O}}$ are $2\eta^{0.1}$-close in TV distance. Now we construct a hybrid distribution $\mathbb{H}^{\prime\prime}$ that is equal to $\mathbb{H}^{\prime}$ except that, starting from every prefix $h$ for which $\left\lVert\boldsymbol{\theta}^{h}-\mathbf{M}^{(t)}[h,:]\right\rVert_{1}\geq\eta^{0.1}$, all future characters are sampled according to $\widehat{\mathbb{H}}$. By (20), $d_{\textsf{TV}}(\mathbb{H}^{\prime\prime},\mathbb{H}^{\prime})\leq 2T\eta^{0.2}$ and by Claim 4.3, $d_{\textsf{TV}}(\mathbb{H}^{\prime\prime},\widehat{\mathbb{H}})\leq 2T\eta^{0.1}$. Putting these together yields the desired statement. ∎
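
To make the analysis device $\boldsymbol{\theta}^{h}$ concrete, here is a small numerical sketch of the entrywise flooring and renormalization, together with the Pinsker bound $d_{\textsf{TV}}\leq\sqrt{\textsf{KL}/2}$ used to pass from the KL bound above to (20); the function names and the numbers are ours and purely illustrative.

import numpy as np

def theta_from_row(row, floor):
    # The device from the proof above: floor the approximate row alpha^T M
    # entrywise and renormalize so that it becomes a genuine distribution.
    beta = np.maximum(row, floor)
    return beta / beta.sum()

def tv_from_pinsker(kl):
    # Pinsker's inequality: d_TV(p, q) <= sqrt(KL(p || q) / 2).
    return np.sqrt(kl / 2)

# Toy usage with an arbitrary, slightly-off row.
row = np.array([0.62, 0.39, -0.005])
theta = theta_from_row(row, floor=1e-6)
true_row = np.array([0.6, 0.4, 0.0])
mask = true_row > 0
kl = float(np.sum(true_row[mask] * np.log(true_row[mask] / theta[mask])))
print(np.round(theta, 4), tv_from_pinsker(kl))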

Now we can put everything together and complete the proof of Theorem 1.3.

Proof of Theorem 1.3.

Let $\epsilon=1/\mathrm{poly}(S,O,T,1/\eta)$ for some sufficiently large polynomial. Now we will combine the following ingredients:

  • Lemma 4.6 to get exact sample and pdf access to a distribution $\widehat{\mathbb{H}}$ close to $\mathbb{H}$

  • Algorithm 5 to learn a description of a distribution

  • Algorithm 6 to draw samples from the distribution parameterized by this learned description

For all of these algorithms, we will set the failure probability parameter $\delta\leftarrow 0.1\eta$. First, we apply Lemma 4.6 to get exact sample and pdf access to a distribution $\widehat{\mathbb{H}}$ that is $\epsilon$-conditionally close to $\mathbb{H}$ and $\epsilon/(10O)^{2}$-positive. Next, we apply Algorithm 5 to learn a set of parameters

\{\mathbf{P}^{(t)},\mathcal{H}^{(t)},\mathcal{X}^{(t)},\{u_{h}\}_{h\in\mathcal{B}^{(t)}},w^{(t)}\}_{t\in[T]}

where $\mathcal{B}^{(t)}=\mathcal{H}^{(t)}\cup\{\mathcal{H}^{(t-1)}\vee o\}_{o\in\mathcal{O}}$. We apply Lemma 6.6 and Claim 6.8 to these learned parameters. These, combined with the definition of the vectors $w^{(t)},\{u_{h}^{(t)}\}_{h\in\mathcal{B}^{(t)}}$, imply that with probability $1-\eta$, the conditions of Assumption 7.1 are satisfied. Also, Assumption 7.2 is satisfied by construction. Finally, Lemma 7.13 implies that when these conditions are satisfied, the distribution that Algorithm 6 samples from is $10T\eta^{0.1}$-close to $\widehat{\mathbb{H}}$ in TV distance. Note that the sampling algorithm runs in time $\mathrm{poly}(S,O,T\log(1/\eta))$. Redefining $\eta\leftarrow(0.1\eta/T)^{10}$ completes the proof. ∎
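
To visualize how these ingredients fit together at inference time, the following is a purely illustrative sketch of the per-step control flow of the sampler: score the next character under the current mixture, floor and renormalize as in line 11 of Algorithm 6, sample a character, and recondition the mixture. To keep the sketch self-contained and runnable we replace the learned spanner representation and the relative-entropy projection of Line 17 with an exact forward update on a toy HMM, so this is not the algorithm analyzed above, only its structure; all names and dimensions are ours.

import numpy as np

rng = np.random.default_rng(1)
S, O, T = 3, 4, 6                        # hidden states, alphabet size, length

A = rng.dirichlet(np.ones(S), size=S)    # toy transition matrix (rows sum to 1)
B = rng.dirichlet(np.ones(O), size=S)    # toy emission matrix (rows sum to 1)
alpha = np.full(S, 1.0 / S)              # current mixture over hidden states
c = 1e-10                                # lower bound parameter, as in the text

def floor_normalize(r):
    # Line-11-style flooring: keeps every probability at least c**0.1.
    floored = np.maximum(r, 2 * c ** 0.1)
    return floored / floored.sum()

sample = []
for t in range(T):
    scores = alpha @ B                   # next-character scores under the mixture
    probs = floor_normalize(scores)
    o = rng.choice(O, p=probs)
    sample.append(int(o))
    # Recondition the mixture.  In Algorithm 6 this is where the convex program
    # of Line 17 (projection in relative entropy onto the learned spanner
    # representation) is solved; on a toy HMM we can recondition exactly.
    alpha = alpha * B[:, o]
    alpha = (alpha / alpha.sum()) @ A
print(sample)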

References

  • [AHK12] Animashree Anandkumar, Daniel Hsu, and Sham M Kakade. A method of moments for mixture models and hidden markov models. In Conference on learning theory, pages 33–1. JMLR Workshop and Conference Proceedings, 2012.
  • [AK08] Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97–114, 2008.
  • [Ang87] Dana Angluin. Learning regular sets from queries and counterexamples. Information and computation, 75(2):87–106, 1987.
  • [BC18] Rishiraj Bhattacharyya and Sourav Chakraborty. Property testing of joint distributions using conditional samples. ACM Transactions on Computation Theory (TOCT), 10(4):1–20, 2018.
  • [BCPV19] Aditya Bhaskara, Aidao Chen, Aidan Perreault, and Aravindan Vijayaraghavan. Smoothed analysis in unsupervised learning via decoupling. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 582–610. IEEE, 2019.
  • [BLMY23a] Ainesh Bakshi, Allen Liu, Ankur Moitra, and Morris Yau. A new approach to learning linear dynamical systems. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 335–348, 2023.
  • [BLMY23b] Ainesh Bakshi, Allen Liu, Ankur Moitra, and Morris Yau. Tensor decompositions meet control theory: learning general mixtures of linear dynamical systems. In International Conference on Machine Learning, pages 1549–1563. PMLR, 2023.
  • [CFGM13] Sourav Chakraborty, Eldar Fischer, Yonatan Goldhirsh, and Arie Matsliah. On the power of conditional samples in distribution testing. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, pages 561–580, 2013.
  • [CGG01] Mary Cryan, Leslie Ann Goldberg, and Paul W Goldberg. Evolutionary trees can be learned in polynomial time in the two-state general markov model. SIAM Journal on Computing, 31(2):375–397, 2001.
  • [CJLW21] Xi Chen, Rajesh Jayaram, Amit Levi, and Erik Waingarten. Learning and testing junta distributions with subcube conditioning. In Conference on Learning Theory, pages 1060–1113. PMLR, 2021.
  • [CP22] Yanxi Chen and H Vincent Poor. Learning mixtures of linear dynamical systems. In International conference on machine learning, pages 3507–3557. PMLR, 2022.
  • [CPD+24] Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, et al. Stealing part of a production language model. arXiv preprint arXiv:2403.06634, 2024.
  • [CRS15] Clément L Canonne, Dana Ron, and Rocco A Servedio. Testing probability distributions using conditional samples. SIAM Journal on Computing, 44(3):540–616, 2015.
  • [DKKZ20] Ilias Diakonikolas, Daniel M Kane, Vasilis Kontonis, and Nikos Zarifis. Algorithms and sq lower bounds for pac learning one-hidden-layer relu networks. In Conference on Learning Theory, pages 1514–1539. PMLR, 2020.
  • [FKQR21] Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
  • [GGJ+20] Surbhi Goel, Aravind Gollakota, Zhihan Jin, Sushrut Karmalkar, and Adam Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. In International Conference on Machine Learning, pages 3587–3596. PMLR, 2020.
  • [GGK20] Surbhi Goel, Aravind Gollakota, and Adam Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33:2147–2158, 2020.
  • [GMR23] Noah Golowich, Ankur Moitra, and Dhruv Rohatgi. Planning and learning in partially observable systems via filter stability. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 349–362, 2023.
  • [HGKD15] Qingqing Huang, Rong Ge, Sham Kakade, and Munther Dahleh. Minimal realization problems for hidden markov models. IEEE Transactions on Signal Processing, 64(7):1896–1904, 2015.
  • [HJB+21] Xinlei He, Jinyuan Jia, Michael Backes, Neil Zhenqiang Gong, and Yang Zhang. Stealing links from graph neural networks. In 30th USENIX security symposium (USENIX security 21), pages 2669–2686, 2021.
  • [HKZ12] Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
  • [HLXS21] Xuanli He, Lingjuan Lyu, Qiongkai Xu, and Lichao Sun. Model extraction and adversarial transferability, your bert is vulnerable! arXiv preprint arXiv:2103.10013, 2021.
  • [HMR18] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. Journal of Machine Learning Research, 19(29):1–44, 2018.
  • [Jac97] Jeffrey C Jackson. An efficient membership-query algorithm for learning dnf with respect to the uniform distribution. Journal of Computer and System Sciences, 55(3):414–440, 1997.
  • [JCCCP21] Hengrui Jia, Christopher A Choquette-Choo, Varun Chandrasekaran, and Nicolas Papernot. Entangled watermarks as a defense against model extraction. In 30th USENIX security symposium (USENIX Security 21), pages 1937–1954, 2021.
  • [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  • [JKA+17] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low bellman rank are pac-learnable. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2017.
  • [JSMA19] Mika Juuti, Sebastian Szyller, Samuel Marchal, and N Asokan. Prada: protecting against dnn model stealing attacks. In 2019 IEEE European Symposium on Security and Privacy (EuroS&P), pages 512–527. IEEE, 2019.
  • [KKMZ24] Sham M. Kakade, Akshay Krishnamurthy, Gaurav Mahajan, and Cyril Zhang. Learning hidden markov models using conditional samples, 2024.
  • [KNW13] Aryeh Kontorovich, Boaz Nadler, and Roi Weiss. On learning parametric-output hmms. In International Conference on Machine Learning, pages 702–710. PMLR, 2013.
  • [KS09] Adam R Klivans and Alexander A Sherstov. Cryptographic hardness for learning intersections of halfspaces. Journal of Computer and System Sciences, 75(1):2–12, 2009.
  • [LZJ+22] Yiming Li, Linghui Zhu, Xiaojun Jia, Yong Jiang, Shu-Tao Xia, and Xiaochun Cao. Defending against model stealing via verifying embedded external features. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pages 1464–1472, 2022.
  • [MCH+21] Haoyu Ma, Tianlong Chen, Ting-Kuei Hu, Chenyu You, Xiaohui Xie, and Zhangyang Wang. Undistillable: Making a nasty teacher that cannot teach students. arXiv preprint arXiv:2105.07381, 2021.
  • [MMM19] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on learning theory, pages 2388–2464. PMLR, 2019.
  • [MR05] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden markov models. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 366–375, 2005.
  • [NKIH23] Ali Naseh, Kalpesh Krishna, Mohit Iyyer, and Amir Houmansadr. Stealing the decoding algorithms of language models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 1835–1849, 2023.
  • [NP23] Phan-Minh Nguyen and Huy Tuan Pham. A rigorous framework for the mean field limit of multilayer neural networks. Mathematical Statistics and Learning, 6(3):201–357, 2023.
  • [OMR23] Daryna Oliynyk, Rudolf Mayer, and Andreas Rauber. I know what you trained last summer: A survey on stealing machine learning models and defences. ACM Computing Surveys, 55(14s):1–41, 2023.
  • [OO19] Samet Oymak and Necmiye Ozay. Non-asymptotic identification of lti systems from a single trajectory. In 2019 American control conference (ACC), pages 5655–5661. IEEE, 2019.
  • [OSF19] Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. Prediction poisoning: Towards defenses against dnn model stealing attacks. arXiv preprint arXiv:1906.10908, 2019.
  • [RST19] Robert Nikolai Reith, Thomas Schneider, and Oleksandr Tkachenko. Efficiently stealing your machine learning models. In Proceedings of the 18th ACM Workshop on Privacy in the Electronic Society, pages 198–210, 2019.
  • [SBR19] Max Simchowitz, Ross Boczar, and Benjamin Recht. Learning linear dynamical systems with semi-parametric least squares. In Conference on Learning Theory, pages 2714–2802. PMLR, 2019.
  • [Sha51] Claude E Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64, 1951.
  • [SKLV17] Vatsal Sharan, Sham M Kakade, Percy S Liang, and Gregory Valiant. Learning overcomplete hmms. Advances in Neural Information Processing Systems, 30, 2017.
  • [SKLV18] Vatsal Sharan, Sham Kakade, Percy Liang, and Gregory Valiant. Prediction with a short memory. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1074–1087, 2018.
  • [SRD19] Tuhin Sarkar, Alexander Rakhlin, and Munther A Dahleh. Nonparametric finite time lti system identification. arXiv preprint arXiv:1902.01848, 2019.
  • [SZ24] Zeyang Sha and Yang Zhang. Prompt stealing attacks against large language models. arXiv preprint arXiv:2402.12959, 2024.
  • [TP19] Anastasios Tsiamis and George J Pappas. Finite sample analysis of stochastic system identification. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3648–3654. IEEE, 2019.
  • [TZJ+16] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In 25th USENIX security symposium (USENIX Security 16), pages 601–618, 2016.
  • [VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [WG18] Binghui Wang and Neil Zhenqiang Gong. Stealing hyperparameters in machine learning. In 2018 IEEE symposium on security and privacy (SP), pages 36–52. IEEE, 2018.
  • [WXGD20] Xinran Wang, Yu Xiang, Jun Gao, and Jie Ding. Information laundering for model privacy. arXiv preprint arXiv:2009.06112, 2020.
  • [ZWL23] Xuandong Zhao, Yu-Xiang Wang, and Lei Li. Protecting language generation models via invisible watermarking. In International Conference on Machine Learning, pages 42187–42199. PMLR, 2023.