Model Stealing for Any Low-Rank Language Model
Abstract
Model stealing, where a learner tries to recover an unknown model via carefully chosen queries, is a critical problem in machine learning, as it threatens the security of proprietary models and the privacy of data they are trained on. In recent years, there has been particular interest in stealing large language models (LLMs). In this paper, we aim to build a theoretical understanding of stealing language models by studying a simple and mathematically tractable setting. We study model stealing for Hidden Markov Models (HMMs), and more generally low-rank language models.
We assume that the learner works in the conditional query model, introduced by Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24]. Our main result is an efficient algorithm in the conditional query model, for learning any low-rank distribution. In other words, our algorithm succeeds at stealing any language model whose output distribution is low-rank. This improves upon the result in [KKMZ24] which also requires the unknown distribution to have high “fidelity” – a property that holds only in restricted cases. There are two key insights behind our algorithm: First, we represent the conditional distributions at each timestep by constructing barycentric spanners among a collection of vectors of exponentially large dimension. Second, for sampling from our representation, we iteratively solve a sequence of convex optimization problems that involve projection in relative entropy to prevent compounding of errors over the length of the sequence. This is an interesting example where, at least theoretically, allowing a machine learning model to solve more complex problems at inference time can lead to drastic improvements in its performance.
1 Introduction
Proprietary machine learning models are often highly confidential. Not only are their weights not publicly released, but even their architecture and hyperparameters used in training are kept a closely guarded secret. And yet these models are often deployed as a service, allowing users to make queries to the model and receive answers. These answers can take the form of labels or completions of prompts, and sometimes a model will even report additional information such as its confidence scores. This raises a natural question:
Question.
Are these black-box models actually secure, or is it possible to reverse engineer their parameters or replicate their functionality just from query access to them?
This task is called model stealing, and it threatens the security of proprietary models and the privacy of the data they are trained on. Beyond nefarious purposes, it can also be used in model distillation [MCH+21], where we have trained a large and highly complex model and want to transfer its knowledge to a much smaller one. It can also be a useful tool for identifying vulnerabilities, as those are often inherited by stolen models.
In any case, model stealing continues to be a very active area of research. The influential work in [TZJ+16] showed that there are simple and efficient attacks on popular models like logistic regression, decision trees and deep neural networks that often work in practice. Since then, many new attacks and defenses have been formulated [HLXS21, HJB+21, RST19, WG18, JSMA19, OMR23, OSF19, WXGD20]. There are also approaches based on embedding watermarks [JCCCP21, ZWL23, LZJ+22] that make it possible to detect when one model has been stolen from another. In recent years, there has been particular interest in stealing large language models (LLMs). Various works have shown how to steal isolated components of a language model such as the decoding algorithm [NKIH23], prompts used to fine-tune the model [SZ24], and even the entire weight matrix of the last layer (the embedding projection layer) [CPD+24].
In this work, our main interest will be in theoretical foundations for stealing language models. As is all too familiar, proving rigorous end-to-end guarantees when working with modern machine learning models with all their bells and whistles seems to be an extremely difficult task. For example, while we can understand the training dynamics on multilayer neural networks in terms of gradient flow in the Wasserstein space of probability distributions [MMM19, NP23], it has turned out to be quite difficult to analyze these dynamics except in simplified settings with a high degree of symmetry. Even worse, there are strong lower bounds for learning deep neural networks [KS09] even with respect to nice input distributions [GGJ+20, DKKZ20, GGK20]. The task of reasoning about modern language models seems no easier, as they are built on transformers [VSP+17] with many building blocks such as word embeddings, positional encodings, queries, keys, values, attention, masking, feed-forward neural networks and layer normalization.
Nevertheless there are often simplified models that abstract important features of more complex models and give us a sandbox in which to try to find theoretical explanations of empirical phenomena. For example, analyzing the dynamics of gradient descent when training a deep neural network is notoriously difficult. But in an appropriate scaling limit, and when the network is wide enough, it can be approximated through the neural tangent kernel [JGH18]. For recurrent neural networks, a popular approach is to analyze gradient descent on linear dynamical systems instead [HMR18]. Likewise for language models, it is natural to work with Hidden Markov Models (HMMs), which are in some sense the original language model, dating back to the work of Claude Shannon in 1951 [Sha51] and were the basis of other early natural language processing systems including the IBM alignment models. More broadly, we can consider a generalization called low-rank language models (Definition 1.1). This brings us to our main questions:
Question.
Is there an efficient algorithm for stealing HMMs from query access? What about more generally for low-rank language models?
These questions were first introduced and studied in an exciting recent work of Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24]. However their motivation and framing was somewhat different, as we will explain.
1.1 Main Results
Formally, we view a language model as a distribution over for some alphabet and sequence length . For simplicity, we treat the sequence length as fixed. Following Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24], the rank of the distribution generated by a language model is defined as follows:
Definition 1.1.
[Low Rank Distribution] A distribution over for alphabet of size and sequence length is rank if for all , the matrix, , with entries equal to for and has rank at most .
In other words, a distribution is rank- if for any , the information in the prefix of length can be embedded in an -dimensional space such that the distribution of the future tokens can be represented as a linear function of this embedding. We note that low-rank distributions are expressive and encompass distributions generated by a Hidden Markov Model (HMM) with hidden states (see Fact 2.2).
Next, we formalize the setup for studying model stealing. We allow the learner to make conditional queries – that is, the learner can specify a history of observations, and then receives a random sample from the conditional distribution on the future observations. Formally, we have the following definition:
Definition 1.2 (Conditional Query).
The learner may make conditional queries to a distribution by querying a string where . Upon making this query, the learner obtains a string of length drawn from the distribution .
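For concreteness, the conditional query oracle can be pictured via the following brute-force sketch (our own illustrative code; the class and method names are not from the paper, and a real target model would of course not be represented by an explicit table of probabilities):

import random
from typing import Sequence, Tuple

class ConditionalOracle:
    # Brute-force reference implementation of the conditional query model
    # (Definition 1.2) for a small, explicitly given distribution D.
    def __init__(self, support: Sequence[Tuple[int, ...]], probs: Sequence[float]):
        self.support = list(support)   # all length-T strings in the support of D
        self.probs = list(probs)       # their probabilities under D

    def sample_future(self, prefix: Sequence[int]) -> Tuple[int, ...]:
        t = len(prefix)
        # keep only strings consistent with the queried prefix (assumed to
        # have nonzero probability) and sample the remaining characters
        idx = [i for i, s in enumerate(self.support) if s[:t] == tuple(prefix)]
        weights = [self.probs[i] for i in idx]
        i = random.choices(idx, weights=weights, k=1)[0]
        return self.support[i][t:]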
In this model, our goal is to design an algorithm that makes a total number of queries that is polynomial in and learns an efficiently samplable distribution that is -close in total variation distance to . This conditional query model was recently introduced by Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24]. Their motivation was two-fold: First, while learning an HMM from random samples is known to be computationally hard [MR05], in principle one can circumvent these barriers if we are allowed conditional samples. Second, a solution to their problem would generalize Angluin’s classic algorithm which learns deterministic finite automata from membership queries [Ang87]. In terms of results, Kakade, Krishnamurthy, Mahajan and Zhang [KKMZ24] introduced a notion that they called fidelity and gave a polynomial time algorithm to learn any low-rank distribution (and thus any HMM) which has high fidelity through conditional queries. However this property does not always hold. Thus, their main question still remains: Is it possible to learn arbitrary low-rank distributions through conditional queries? Here we resolve this question in the affirmative. We show:
Theorem 1.3.
Assume we are given conditional query access to an unknown rank distribution over where . Then given a parameter , there is an algorithm that takes conditional queries and running time and with probability outputs a description of a distribution such that is -close in TV distance to . Moreover there is an algorithm that samples from in time.
Note that crucially, the algorithm only makes conditional queries for learning the representation of . Once we have this learned representation, we may draw as many samples as we want without making any more queries to the original distribution .
1.2 Discussion
Theorem 1.3 shows that we can efficiently learn any low-rank distribution via conditional queries. Thus, we can view our results as showing that in some sense, the rank of a distribution can be a useful proxy for understanding the complexity of model stealing, similar to how complexity measures, such as Bellman rank [JKA+17] and its relatives [FKQR21], are useful for understanding the statistical complexity of learning a near optimal policy in reinforcement learning.
There is a key conceptual insight driving our algorithm. One of the challenges in learning sequential distributions is that the error can grow exponentially in the sequence length . In particular if we imagine sampling from one token at a time, the low-rank structure ensures that we only need to keep track of an -dimensional hidden state at each step. However, each step involves multiplication by a change-of-basis matrix and these repeated multiplications cause the error to grow exponentially. The key to mitigating this error blowup is that we combine each change-of-basis with a projection step, where we solve a convex optimization problem that performs a projection with respect to KL divergence. Crucially, projection in KL divergence has a contractive property (see Fact 3.9) that does not hold for other natural measures of distance between distributions, such as TV distance. We give a more detailed overview of our algorithm in Section 3. This is an interesting example where allowing a machine learning model to solve more complex problems at inference time can lead to drastic improvements in its performance. Of course, phrased in general terms, this is a driving philosophy behind OpenAI’s o1 model. But so far we have little theoretical understanding of the provable benefits of allowing more compute at inference time.
1.3 Related Work
There has been a long line of work on learning HMMs from random samples. Mossel and Roch [MR05] gave the first polynomial time algorithms that work under appropriate full-rankness conditions on the transition and observation matrices. Other works gave spectral [HKZ12] and method of moments based approaches [AHK12]. Learning HMMs can also be thought of as a special case of learning phylogenetic trees [CGG01, MR05]. Other approaches assume the output distributions belong to a parametric family [KNW13], or study quasi-HMMs [HGKD15].
The main computational obstruction in this area is that HMMs, without full rankness conditions, can encode noisy parities [MR05], which are believed to be computationally hard to learn. In the overcomplete setting, where the hidden state space is allowed to be larger than the observation space, there are ways around these lower bounds. First, one can aim to predict the sequence of observations, rather than learning the parameters [SKLV18]. Under a natural condition called multi-step observability one can get quasi-polynomial time algorithms [GMR23]. Second, one can make structural assumptions that the transition matrices are sparse, well-conditioned, and have small probability mass on short cycles [SKLV17]. Alternatively there are polynomial time algorithms that work under the assumption that the transition and observation matrices are smoothed [BCPV19].
There are also related models called linear dynamical systems where the hidden states and observations are represented as vectors. In contrast, in HMMs the hidden states and observations take on only a finite set of possible values. There is a long line of work on learning linear dynamical systems too [HMR18, OO19, TP19, SRD19, SBR19, BLMY23a, CP22, BLMY23b]. However, these algorithms require various structural assumptions, and even the weakest ones, namely condition-number bounds on the so-called observability and controllability matrices, are known to be necessary [BLMY23a].
The conditional query model has also been studied for other statistical problems such as testing discrete distributions and learning juntas [CFGM13, CRS15, BC18, CJLW21]. However, in most of these settings, the goal of studying the conditional query model is to obtain improved statistical rates rather than sidestep computational hardness. We remark that there are also many classic examples of problems in computational learning theory where, when allowed to make queries, there are better algorithms [Jac97] than if we are only given passive samples.
2 Basic Setup
Recall that we have conditional query access to some unknown distribution over for some alphabet . We also assume that has rank (recall Definition 1.1). Our goal will be to learn a description of via conditional queries. Let us first formally define Hidden Markov Models (HMMs):
Definition 2.1 (Hidden Markov Model).
An HMM with state space , , and observation space , , is specified by an initial distribution over the states, a transition matrix , and an emission matrix . For a given sequence length , the probability of generating a given sequence with each is
Next we give a basic observation (see [KKMZ24]) that the class of rank distributions contains the set of all distributions generated by an HMM with hidden states.
Fact 2.2.
[KKMZ24] Let be a distribution over generated by an HMM with hidden states. Then is a rank at most distribution.
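This fact can be checked numerically on small instances. The sketch below (our own code, with made-up toy sizes) builds, for each split t, the matrix of joint probabilities of prefixes and futures under a random HMM, computed via the forward recursion, and verifies that its numerical rank is at most the number of hidden states:

import itertools
import numpy as np

rng = np.random.default_rng(0)
n, O, T = 3, 4, 5                      # hidden states, observation alphabet size, length

# Random HMM parameters: initial distribution, transitions, emissions.
pi = rng.dirichlet(np.ones(n))
A = rng.dirichlet(np.ones(n), size=n)  # A[i, j] = Pr[next state j | state i]
B = rng.dirichlet(np.ones(O), size=n)  # B[i, o] = Pr[emit o | state i]

def joint_prob(seq):
    # Pr[x_1, ..., x_m] under the HMM, via the forward recursion.
    alpha = pi * B[:, seq[0]]
    for o in seq[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

for t in range(1, T):
    prefixes = list(itertools.product(range(O), repeat=t))
    futures = list(itertools.product(range(O), repeat=T - t))
    P = np.array([[joint_prob(p + f) for f in futures] for p in prefixes])
    print(t, np.linalg.matrix_rank(P, tol=1e-9))   # prints a value at most n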
Notation
Throughout this paper, we will use the same global notation for the unknown distribution with rank , observation space of size and sequence length . We also write for sequences of length less than . This will be used to denote the probability that conditioned on the prefix , the observations immediately following are those in .
We use the following notation for prefixes and conditional probabilities. For a distribution on and , we use to denote the distribution induced by on -character prefixes. For any prefix , we use to denote the distribution on futures given by . Finally, for a string and character , we use to denote the string of characters obtained by appending to .
3 Technical Overview
Our algorithm is composed of two main parts: We work with a certain representation of the distribution and show how to estimate it from conditional samples. Then we give an algorithm that takes this learned representation and can generate samples. In most learning problems, estimating the parameters is the challenging part and sampling is often trivial. But in our case, designing the sampling algorithm, which involves solving a sequence of convex optimization problems, is one of the key ingredients.
3.1 Idealized Representation
In this section, we introduce the representation that will be central to our learning algorithm. First we consider an idealized setting where we ignore issues like sampling noise and the fact that we cannot afford to work with exponentially large vectors directly. We include this subsection for pedagogical reasons as many of the ideas are already in [KKMZ24]. Nevertheless, it will be a useful setup for explaining our new contributions in the following subsections.
We slightly abuse notation and for , let denote the vector whose entries are as ranges over all elements of . Now Definition 1.1 tells us that as ranges over , all of the vectors are contained in an -dimensional subspace. Thus, we only need to store of these vectors to “span” the whole space.
Barycentric Spanners
This leads us to the notion of a barycentric spanner, which is the key building block in the representation we will use:
Definition 3.1 (Barycentric Spanner).
For a collection of vectors , we say another set of vectors is a -spanner for them if for all , there are coefficients with such that
Crucially, for any set of vectors that are exactly contained in a -dimensional subspace, there is always a subset of them that form a barycentric spanner for the full collection.
Fact 3.2.
Let be arbitrary vectors in . Then there exists a subset with such that form a spanner for the collection .
Thus ideally, for each , we could store a subset with and such that is a -barycentric spanner for .
Change-of-Basis
In addition to these spanning subsets, we will need to store some additional information about sampling probabilities and “change-of-basis” between these spanning sets. In particular we claim that if, in addition to the barycentric spanners, we also had the following information, then it would be enough to sample:
-
•
The next character probabilities (under ) for all of the elements of
-
•
For each , , , coefficients such that
(1) i.e. we know how to write as a linear combination of the vectors . We refer to this as a “change-of-basis”.
To see why these suffice, let’s consider sampling a string one character at a time. Assume we have sampled so far. Throughout the process we will maintain a set of coefficients such that
(2) |
Since we know the next-character distributions for all of , the above allows us to exactly compute the next-character distribution for . After sampling a character, say from this distribution, we can then re-normalize to obtain
(3) |
We can then substitute (1) into the above to rewrite the right hand side as a linear combination of the vectors . And now we can iterate this process again to sample the next character. The key point is that once we have a barycentric spanner, we do not need to store the coefficients for each history but rather can track how they evolve.
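Putting these pieces together, the idealized sampling loop can be sketched as follows (our own code; S, next_char, and basis store exactly the quantities listed above, exact arithmetic is assumed, and at step 0 the spanner can be taken to be just the empty history):

import numpy as np

def idealized_sample(S, next_char, basis, T, O, rng):
    # alpha expresses the conditional distribution of the current sampled
    # prefix as a linear combination of the spanner histories in S[t]
    alpha = {s: (1.0 if s == () else 0.0) for s in S[0]}
    sampled = []
    for t in range(T):
        # next-character distribution of the current history, as in (2)
        p = sum(alpha[s] * np.asarray(next_char[t][s]) for s in S[t])
        p = np.maximum(p, 0.0)          # guard against tiny numerical negatives
        p = p / p.sum()
        o = int(rng.choice(O, p=p))
        sampled.append(o)
        if t + 1 == T:
            break
        # re-normalize as in (3) and change basis into S[t+1] via (1)
        new_alpha = {s2: 0.0 for s2 in S[t + 1]}
        for s in S[t]:
            weight = alpha[s] * next_char[t][s][o] / p[o]
            for s2, c in basis[t][(s, o)].items():
                new_alpha[s2] += weight * c
        alpha = new_alpha
    return sampled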
Challenges
If we want to turn the above representation into a learning algorithm, there are some major obstacles. The vectors are exponentially large, and we only have approximate (but not exact) access to them. When our estimates are noisy, the sampling approach above has issues as the error may grow multiplicatively after each iteration.
3.2 Learning the Representation
Simulating pdf access to that is close to
Our first step is to show that, by using conditional queries, we can simulate exact p.d.f. access to a distribution that is -close to in TV distance for any inverse-polynomially small .
Lemma 3.3 (Informal, see Lemma 4.6).
There is a distribution such that for all , the conditional distributions and are -close, and we can simulate query access to the exact p.d.f. of all conditional distributions of .
Note that crucially this exact p.d.f. access is only for a distribution that is -close to and not for itself (as in the latter case [KKMZ24] already gives an algorithm for learning). For most of the algorithm, we will work directly with and only use, implicitly in the analysis, that it is close to some low rank distribution .
3.2.1 Dimensionality Reduction
Recall that our goal is to find a barycentric spanner for the set of vectors . These vectors are close to the vectors , so we will try to construct an approximate spanner for this latter collection. The first problem is that these vectors have exponentially many entries. In order to compute with them efficiently, we introduce a type of dimensionality reduction that allows us to subsample a subset of only polynomially many coordinates and only work with the restriction to those coordinates. First, let be some distribution over strings in . Now we sample polynomially many elements from . And for each , we associate with the reduced vector
(4) |
where denotes the density of at . Recall that by assumption, we have exact access to the densities in the numerator. Thus as long as we also have exact density access to , then the above entries can be computed exactly. We first observe that these vectors give unbiased estimates for linear functionals of the original distributions .
Fact 3.4.
For any distribution over , in expectation over the random draws from , for any subset and real coefficients ,
In other words, at least in expectation, the subsampling of the entries preserves the norm of linear combinations of the vectors. Of course, the variance could be prohibitively large depending on the choice of . However, with a careful choice of , we can obtain concentration for the above quantities as well. The goal of our dimensionality reduction procedure is captured by the following statements:
Definition 3.5.
For a subset , we say vectors are -representative for the distributions if
for all sets of coefficients with .
Proposition 3.6.
[Informal, see Lemma 5.10 for more details] For and
If we draw samples from for and set
then with high probability the vectors are -representative for the distributions .
To see why the above holds, note that the entries of are all at most . Thus for any accuracy parameter , by taking , we can union bound over a net and get concentration in Fact 3.4. The main point now is that when we want to compute a barycentric spanner we can work with the representative vectors instead.
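The construction can be sketched as follows (our own code; following Lemma 5.10 we take the reference distribution to be the uniform mixture of the distributions being compressed, and the sanity check at the end illustrates Fact 3.4 and Proposition 3.6):

import numpy as np

def reduced_vectors(dists, m, rng):
    # dists: array of k full probability vectors over a (large) domain.
    # Sample m futures from the uniform mixture E and build, for each
    # distribution, an m-dimensional vector of importance weights
    # D_i(f_j) / (m * E(f_j)).
    dists = np.asarray(dists)                   # shape (k, domain_size)
    E = dists.mean(axis=0)                      # reference distribution
    samples = rng.choice(len(E), size=m, p=E)   # f_1, ..., f_m drawn from E
    return dists[:, samples] / (m * E[samples]) # shape (k, m)

# Sanity check: for bounded coefficients c, the l1 norm of the reduced
# combination concentrates around the l1 norm of the full combination.
rng = np.random.default_rng(0)
k, domain, m = 4, 100_000, 5_000
dists = rng.dirichlet(np.ones(domain), size=k)
V = reduced_vectors(dists, m, rng)
c = rng.uniform(-1, 1, size=k)
print(np.abs(c @ V).sum(), np.abs(c @ dists).sum())   # the two values should be close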
3.2.2 Computing a Spanner in the Reduced Space
Section 3.2.1 deals with the fact that there are exponentially many possible futures. There are also exponentially many histories, but because they are close to some low-dimensional subspace, we show that sampling a polynomial number of them suffices to obtain a representative subset.
Formally, we sample histories from . While we cannot ensure that the collection
contains a barycentric spanner for all of (because there could be some that is sampled with very low probability), we can still use Fact 3.2 to argue that this set contains a subset of size that is a barycentric spanner for most of i.e. with high probability over drawn from , is close to a bounded linear combination of vectors in this subset.
Now to compute the barycentric spanner, by Proposition 3.6, we can construct representative vectors and it suffices to work with these, which have polynomial size. Then we can run an algorithm for computing approximate barycentric spanners (see Lemma 5.6) on this collection. There are additional technical details due to the error i.e. the vectors are close to, but not exactly contained in, an -dimensional subspace. For the algorithm to work, it is crucial that the error is smaller than the dimensionality of the vectors so we need to ensure that (see Section 6.1 for more details). Omitting these technical details, we have now sketched the proof of the following:
Lemma 3.7 (Informal, see Lemma 5.6).
With conditional queries and runtime, we can compute sets such that for each , and the collection is a -spanner for a -fraction of (where the fraction is measured with respect to ).
For simplicity, in this overview, we even assume that is actually an approximate spanner for all of — in the full proof, roughly the “exceptional” possibilities for can be absorbed into the failure probability.
3.2.3 A Reduced Representation
Let consist of all strings in and all strings obtained by taking a string in and appending a character in . In other words . Now the set of information that we store is as follows:
Definition 3.8 (Learned representation (informal)).
Our learned representation consists of the following information:
-
•
Sets for
-
•
The next character probabilities (under ) for all of the elements of
-
•
Vectors that are representative for the distributions (can be constructed using Proposition 3.6).
This is our first main departure from the idealized representation: rather than storing explicit linear combinations as in (1), we simply store the representative vectors and defer the computation of the “change-of-basis” to the sampling algorithm. For technical reasons, the full algorithm requires storing some additional information, but we omit these details in this overview. See Section 6.2 for details.
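Concretely, one can picture the learned representation as the following container (a sketch with our own field names; the extra bookkeeping mentioned above is omitted):

from dataclasses import dataclass
from typing import Dict, List, Tuple
import numpy as np

History = Tuple[int, ...]   # a prefix, encoded as a tuple of characters

@dataclass
class LearnedRepresentation:
    # Sketch of Definition 3.8.  Note there is no explicit change-of-basis:
    # it is recomputed at sampling time via the KL projection of Section 3.3.
    spanners: List[List[History]]              # the spanner sets, one per step
    next_char: List[Dict[History, np.ndarray]] # next-character probabilities
                                               # for each spanner element
    reduced: List[Dict[History, np.ndarray]]   # representative vectors for the
                                               # conditional distributions of the
                                               # spanner elements and their
                                               # one-character extensions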
3.3 Sampling Step
As in the idealized sampling algorithm in Section 3.1, we sample a string one character at a time. At each step , we maintain a linear combination of the strings as in (2), except with conditional distributions with respect to . We then sample a character and re-normalize as in (3). The main difficulty now is the change-of-basis step.
Naive Attempt: Linear Change-of-Basis
Naively, for each and , we could simply pre-compute coefficients such that
and thus (since the vectors are representative for the distribution ),
We can then substitute this into the re-normalized relation (see (3)) and get
(5) |
This gives us an expression for as a linear combination of the vectors for . However, the issue with this approach is that the coefficients and approximation error may grow exponentially with .
Need for Projection
The fact that the coefficients may grow motivates the need for a “projection” step, where we reduce the magnitudes of the coefficients. One natural attempt is to take the coefficients in (5) and “project” them by finding an equivalent set of coefficients of bounded magnitude, say , such that
(6) |
In fact, there exists a set of coefficients of bounded magnitude such that
(7) |
with high probability over , by Lemma 3.7. However, there is a subtle issue with this approach. We only have access to the vectors so we can only compute (succinct representations) of the expressions in (6) but not of .
The issue is captured by the following abstraction. Let be the vector that is equal to the left hand side of (6). Let be the convex set consisting of all possible vectors obtainable on the right hand side of (6) for coefficients bounded by some constant. Let . We know (up to some small error), but we do not know . Now we want to map to a point in . Ideally, we want to guarantee , up to some small additive error. The issue is that we cannot guarantee better than
due to the fact that is unknown, which forces us to use the triangle inequality. In other words, the problem is that when projecting onto a convex set in norm, we may still double the distance to some of the elements of the convex set, causing the error to grow multiplicatively in each iteration.
Projection in KL
The above issue motivates projection in a different distance measure between distributions. Specifically, KL divergence has the following appealing property:
Fact 3.9.
Let be the -dimensional simplex i.e. the convex hull of the standard basis vectors. We view points in as distributions over elements. Let be a convex set and let be a point. Let . Then for all ,
This implies that “projecting” onto a convex set actually decreases the KL-divergence from all elements within the convex set. This is the key to overcoming the error doubling issue above. In particular, in the same abstraction as above, we would just set .
More concretely, since the vectors are succinct representations of , we can solve the projection in KL, which is a convex optimization problem, over these representative vectors. While Proposition 3.6 is stated for TV distance, we can extend it to also preserve the KL divergence between linear combinations of these distributions as well. Technically, this requires a “truncated” notion of KL to deal with possibly negative entries but we will gloss over this for now (see Definition 4.8 and Lemma 5.10 for more details).
To describe our actual algorithm, once we have the coefficients as in (3), we solve the following convex optimization problem. We solve for coefficients that are bounded in magnitude and minimize
(8) |
While we omit the details in this overview (the full analysis is in Section 7), the main point of this projection in KL step is that we can appeal to Fact 3.9 with and for the optima obtained in (8) to get
where the additive error comes only from how well KL-divergences are preserved in the succinct representation . In particular, this completes the “change-of-basis” from to and the error only grows additively. Now to sample the entire string, we simply sample one character at a time and iterate the above.
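A minimal sketch of this projection step is below (our own code; the box bound on the coefficients, the particular floor used in the truncation, and the use of a generic numerical solver are illustrative stand-ins for the precise quantities in Definition 4.8 and the convex program (8)):

import numpy as np
from scipy.optimize import minimize

def truncated_kl(p, q, c):
    # A soft truncation so the objective stays finite even if q has tiny or
    # negative entries (cf. Definition 4.8); the exact form in the paper may differ.
    p = np.maximum(p, 0.0)
    return float(np.sum(p * (np.log(np.maximum(p, c)) - np.log(np.maximum(q, c)))))

def kl_projection(p, Q, C=2.0, c=1e-8):
    # p: succinct representation of the current re-normalized combination.
    # Q: one row per element of the next spanner set (its succinct representation).
    # Returns bounded coefficients beta approximately minimizing the truncated
    # KL divergence between p and beta @ Q.
    k = Q.shape[0]
    obj = lambda beta: truncated_kl(p, beta @ Q, c)
    res = minimize(obj, x0=np.ones(k) / k, bounds=[(-C, C)] * k, method="L-BFGS-B")
    return res.x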
3.4 Organization
In Section 4, we present a few basic observations for simulating density access to a distribution that is close to and working with (truncated) KL divergence. In Section 5, we introduce our machinery for working with barycentric spanners, including the dimensionality reduction procedure. In Section 6, we present our full learning algorithm. In Section 7, we present our sampling algorithm for sampling from our learned representation.
4 Basic Results
In this section, we introduce notation and collect several basic observations that will be used throughout the later sections.
4.1 Algorithmic Primitives
We start by developing a few basic primitives that conditional queries allow us to implement. First, we have a subroutine for computing the probabilities of the next character given a specified history .
Claim 4.1 (Next Character Probabilities).
For and a string and parameters , we can make conditional queries, and with probability, we output a distribution over that is within TV distance of the true distribution of the next character .
Proof.
We can repeatedly make conditional queries with input and look at the next character. We make such queries and output the empirical distribution of the next character. By a Chernoff bound, with probability, the resulting distribution is within TV distance of the true distribution . ∎
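As a concrete sketch of this estimator (our own code; oracle.sample_future stands for the conditional query interface, e.g. the one sketched after Definition 1.2):

import numpy as np

def next_char_probs(oracle, prefix, O, num_queries):
    # Estimate the next-character distribution after `prefix` by making
    # repeated conditional queries and recording the first returned character.
    counts = np.zeros(O)
    for _ in range(num_queries):
        future = oracle.sample_future(prefix)
        counts[future[0]] += 1
    return counts / num_queries   # empirical distribution; TV error roughly
                                  # O(sqrt(O / num_queries)) with high probability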
In light of the above, we make the following definition.
Definition 4.2 (Conditional Closeness).
For two distributions on , we say they are -conditionally close if for any string for , the conditional distributions of the next character and are -close in TV distance.
The following statement shows that conditional closeness implies closeness in TV distance for the entire distribution (and further on any conditional distribution) up to a factor of .
Claim 4.3.
Let be distributions on that are conditionally close. Then for any history for , we must have
where represent the full distribution over the future .
Proof.
We prove by reverse induction on that
whenever has length . When , the above is immediate from the assumption that are conditionally close. Now assume we have proven the claim for . Consider that has length .
where in the above, we first used conditional closeness and then the inductive hypothesis. This completes the proof. ∎
For technical reasons later on, it will be convenient to have the following definition which necessitates that, regardless of the prefix, all possibilities for the next character occur with some non-negligible probability.
Definition 4.4 (Positivity).
Let be some parameter. We say a distribution on is -positive if for any string for , the probabilities are at least for any choice of .
Now we show how to implement sample and exact pdf access to a distribution that is conditionally close to the distribution of . Note that this is slightly stronger than just implementing approximate pdf access to since we need some “consistency” between the responses to pdf queries.
First, we can approximate the pdf of at a given string by iteratively computing the next character probabilities using Claim 4.1 and then multiplying them. The key to ensuring consistency is to estimate the “next character distribution” only once for each possible prefix . Whenever this prefix shows up again in a different computation, we use the same “next character distribution” that has already been computed.
Definition 4.5 (Sample and PDF Access).
For a distribution over , we say that we have sample and pdf access to if we can perform the following operations:
-
•
Given a string for , draw a sample from
-
•
Given a string for , output (where this is the probability that the first characters of the string match )
Lemma 4.6 (PDF Estimation).
Let and be parameters. Assume we are given conditional query access to a distribution . With probability at least , we can respond to sample and pdf queries for some distribution that is -positive and conditionally close to . This uses a total of conditional queries to .
Remark 4.7.
While we don’t specify the whole distribution , the point is that the responses made by the algorithm to the sample and pdf queries are consistent with some distribution that is close to .
Proof.
We define a rooted tree with layers where each intermediate node has exactly children. Now the nodes of the tree are labeled with strings in for . The root is labeled with the empty string and the labels of the children of each node are obtained by appending each of the possible characters – thus the labels of the nodes at level are exactly the strings in .
We first describe an algorithm for answering sample and pdf queries that requires exponential time but then we will give a polynomial time algorithm for answering polynomially many queries that is statistically equivalent to it.
Naive Algorithm:
For every non-leaf node in the tree, say with label , apply Claim 4.1 (using new, independent samples) to compute the next-character probabilities for . We can ensure that with probability , for all . By perturbing the if necessary, we can also ensure and for all . Now assign the edge from to a weight of . Let be the distribution induced by this weighted tree where the probability of sampling a string is equal to the product of the weights along the root-to-leaf path to . Now answer all sample and pdf queries with respect to . It is clear that is a valid distribution, and union bounding over all nodes, with probability, is -positive and conditionally close to .
Efficient Implementation:
We say we visit a node labeled to mean that we apply Claim 4.1 to compute the next-character probabilities and assign edge weights as described above. We will lazily maintain the tree instead, only visiting nodes as necessary. After we have visited a node, we never revisit it, so its edge weights are fixed once it has been visited once. Initially, we have only visited the root. Each time we receive a pdf query for a string , we follow the root-to-leaf path to and visit all nodes along this path. Now the pdf we output is obtained by simply multiplying the edge weights along the path from the root to .
Next, for a conditional sample query at , we begin from the node labeled . If we have visited it already, then we sample one of its children with probabilities proportional to the edge weights. We then move to this child and repeat. If we have not visited a node yet, we first visit it and then sample a child and repeat until we reach a leaf. We output this leaf as the conditional sample.
In this implementation, answering each query only requires time and conditional queries to . Furthermore, this algorithm is statistically equivalent to the naive algorithm described earlier so we are done.
∎
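The lazy construction can be sketched as follows (our own code; estimate stands for any next-character estimator such as the sketch after Claim 4.1, and the additive smoothing that enforces positivity is an illustrative choice):

import numpy as np

class LazyPDFOracle:
    # Sketch of the efficient implementation in Lemma 4.6.  The tree of
    # prefixes is built lazily: a prefix's next-character distribution is
    # estimated the first time the prefix is visited and then frozen, so all
    # pdf/sample answers are consistent with a single distribution D'.
    def __init__(self, estimate, O, T, gamma, rng):
        self.estimate = estimate      # prefix -> estimated next-char distribution
        self.O, self.T, self.gamma, self.rng = O, T, gamma, rng
        self.edge_weights = {}        # prefix (tuple) -> length-O vector

    def _visit(self, prefix):
        if prefix not in self.edge_weights:
            p = np.asarray(self.estimate(prefix), dtype=float)
            # smoothing: every next-character probability is at least
            # gamma / (1 + O * gamma) > 0, so D' is positive
            self.edge_weights[prefix] = (p + self.gamma) / (1.0 + self.O * self.gamma)
        return self.edge_weights[prefix]

    def pdf(self, string):            # probability under D' of this prefix
        prob, prefix = 1.0, ()
        for o in string:
            prob *= self._visit(prefix)[o]
            prefix += (o,)
        return prob

    def sample(self, prefix):         # draw a future from D' conditioned on prefix
        prefix, future = tuple(prefix), []
        while len(prefix) < self.T:
            p = self._visit(prefix)
            o = int(self.rng.choice(self.O, p=p))
            future.append(o)
            prefix += (o,)
        return tuple(future)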
4.2 Truncated KL Divergence
Later on in the analysis, we will need a modified version of KL Divergence that is truncated so that it is defined everywhere and doesn’t blow up at .
Definition 4.8.
For a parameter and real valued inputs , we define
Definition 4.9.
For vectors where has positive entries, we define
When has all entries equal to , we will slightly simplify notation and write
Note that the above is well-defined since we are only evaluating the logarithm when both and are positive. When are distributions and , the above corresponds to the usual definition of KL-divergence. When , this corresponds to a soft truncation of the distributions to elements with nontrivial mass.
One of the key properties of KL divergence is that it allows for projection in the following sense: when projecting a point onto a convex set, the projection is closer to all points in the set.
Fact 4.10.
Let be the -dimensional simplex i.e. the convex hull of the standard basis vectors. Let be a convex set and let be a point. Let . Then for all ,
Proof.
Let and . Then the optimality of implies that for any ,
Since are on the simplex and by non-negativity of KL divergence, we conclude
and this immediately gives the desired inequality. ∎
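This property can also be checked numerically (our own code; the convex set is taken to be the set of mixtures of a few fixed distributions, which is a convex subset of the simplex, and the solver is a generic stand-in):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, k = 6, 3
M = rng.dirichlet(np.ones(d), size=k)       # rows of M span the convex set K
x = rng.dirichlet(np.ones(d))               # the point to project

def kl(p, q):
    p, q = np.clip(p, 1e-12, None), np.clip(q, 1e-12, None)
    return float(np.sum(p * np.log(p / q)))

def project(x):
    # minimize KL(z || x) over z = w @ M with w a probability vector
    obj = lambda w: kl(w @ M, x)
    cons = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]
    res = minimize(obj, np.ones(k) / k, bounds=[(0, 1)] * k,
                   constraints=cons, method="SLSQP")
    return res.x @ M

z = project(x)
for _ in range(5):
    y = rng.dirichlet(np.ones(k)) @ M       # an arbitrary point of K
    print(kl(y, x) - kl(y, z))              # nonnegative up to solver tolerance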
In our analysis later on, we will need a slightly more general version of Fact 4.10, stated below, that deals with truncated KL divergences and points that are not exactly on the simplex.
Corollary 4.11.
Let be vectors. Assume has positive coordinates and assume is a convex set such that all elements have sum of coordinates equal to and entrywise. Let be a point. Let . Then for any ,
Proof.
Note that by the assumptions in the statement, for any ,
where the max on the RHS is taken entrywise and is any positive constant. Now letting be equal to the sum of the coordinates of , note that . Then we can simply apply Fact 4.10 to the vector to get the desired inequality. ∎
We also have the following observation that will allow us to relate truncated KL divergence to TV distance.
Claim 4.12.
For a vector with positive entries and minimum entry and maximum entry , the divergence is Lipschitz with respect to with Lipschitz constant
Proof.
This follows immediately from the definition. ∎
5 Geometric Results
Our learning algorithm will also rely on a few geometric constructions which we introduce below.
5.1 Spanners
First, we recall the standard notion of a barycentric spanner.
Definition 5.1 (Barycentric Spanner).
For a collection of vectors , we say another set of vectors is a -spanner for them if for all , there are coefficients with such that
Note that the error is defined in because later on, we will have representing the density functions of distributions and the error will correspond to TV-closeness.
Next we introduce a distributional notion of a spanner, where we only require that the spanner covers most of the mass of some underlying distribution.
Definition 5.2 (Distribution Spanner).
Let be a distribution over . Let , be parameters. We say a set of points is a -spanner for if for a random point drawn from , with probability, there are coefficients with
It is folklore that any subset of points has an exact barycentric spanner. However, in our setting, it will be important to compute a spanner for a high-dimensional distribution given only polynomially many samples.
Fact 5.3.
[Folklore] Let be arbitrary vectors in . Then there exists a subset with such that form a spanner for the collection .
Lemma 5.4.
Let be a distribution over . Let be parameters. Given a set of independent samples from , with probability , there exists a subset of elements of , , which form a -spanner for .
Proof.
For any vectors in , say , let be the convex body whose vertices are . Let be the pdf of inside this convex body and let be the fraction of points of that are within this body. For fixed, , we have
with probability. Now we can union bound over all subsets and deduce that with probability , for all such subsets,
However, by Fact 5.3, there is a subset of elements of that is a -spanner for and thus the above implies that taking to be this subset gives us a -spanner for . ∎
Now we recall the following standard algorithm for computing a spanner (see e.g. [AK08]). We include a proof for completeness.
Claim 5.5.
Let be vectors and let be the matrix with columns given by and let its singular values be . Then there is an algorithm that runs in time that computes a subset with such that is a spanner for the collection
Proof.
First, we iteratively construct an initial subset as follows. Initially start with and in each step , find a vector such that
First, such a vector always exists for because we know that
where we used the extremal characterization of the th singular value of in terms of the best rank- approximation to .
After this procedure, we have a subset . We will use to denote the volume of the simplex formed by the vectors in . By construction, we have
Next, assume for the sake of contradiction that is not a -spanner. Then there is some vector that cannot be written as a linear combination of with coefficients of magnitude at most . There is a unique way to write
and there must be some coefficient say with . Then we can replace with in and this at least doubles . The maximal possible value of for any subset with is at most and thus we need to iterate this process at most times before we get a set that is a -spanner. This completes the proof. ∎
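For concreteness, here is a sketch of this classical procedure in the non-degenerate case (our own code, following the swapping argument above and the variant in [AK08]; Lemma 5.6 below handles vectors that are only approximately low-dimensional):

import numpy as np

def barycentric_spanner(V, C=2.0):
    # V: array of shape (n, d) whose rows are assumed to span R^d.
    # Returns indices of d rows forming a C-approximate barycentric spanner.
    V = np.asarray(V, dtype=float)
    n, d = V.shape
    B = np.eye(d)                               # placeholder basis columns
    idx = [-1] * d
    # Phase 1: replace each placeholder column by the input vector that
    # maximizes the absolute determinant of the current basis.
    for j in range(d):
        dets = []
        for i in range(n):
            Btry = B.copy(); Btry[:, j] = V[i]
            dets.append(abs(np.linalg.det(Btry)))
        i_best = int(np.argmax(dets))
        B[:, j] = V[i_best]; idx[j] = i_best
    # Phase 2: while some swap multiplies the absolute determinant (the
    # volume proxy in the proof) by more than C, perform the swap.
    improved = True
    while improved:
        improved = False
        for j in range(d):
            for i in range(n):
                Btry = B.copy(); Btry[:, j] = V[i]
                if abs(np.linalg.det(Btry)) > C * abs(np.linalg.det(B)):
                    B[:, j] = V[i]; idx[j] = i
                    improved = True
    return idx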
Claim 5.5 has the restriction that is nontrivial. We will now give a robust version of Claim 5.5 for computing an approximate spanner even when the vectors may be arbitrarily close to degenerate.
Lemma 5.6.
Let be vectors with for all . Let be parameters we are given. Assume that there exists a subspace of dimension such that for all . Then if we run Algorithm 1 on , it runs in time and its output satisfies
-
•
-
•
forms a spanner for the collection
Proof.
Recall that is the largest index such that . Note that since by assumption, there is rank- subspace say such that
and thus .
Next, note that is a matrix with all singular values being at least (and the largest singular value is at most ). Thus, when we apply Claim 5.5 in Line 9, it runs in time and the subset is indeed a -spanner for the . Finally, we show that is a good spanner for the collection of . For any , we can first write
where . Now
and thus the set is a spanner. It is clear that the overall runtime is . ∎
5.2 Dimensionality Reduction
We will also need a type of dimensionality reduction for distributions over exponentially large domains. For distributions over a large domain say , we can represent them as -dimensional vectors. However, we need a more succinct representation. Below we show how, with only polynomially many samples and pdf access to the distributions, we can construct a succinct representation that approximately preserves important properties of the original distributions. We use the following notation for translating between distributions and vectors of their densities.
Definition 5.7.
Let be a distribution on a finite set of elements . For , we write for the density function of at . We define the vector to be . For a multiset of elements in , we define to be a vector whose entries are indexed by elements and the values are equal to .
Definition 5.8.
Given distributions on a set of elements, say , we say vectors for some dimension are -representative for if for all coefficients with such that at most of them are nonzero,
Definition 5.9.
Given distributions on a set of elements, say , we say vectors for some dimension are KL-preserving for if for all coefficients with such that at most of the and of the are nonzero and for any constant with ,
The main result of this section is Lemma 5.10, where we show that for any distributions , after sampling and appropriately re-normalizing, we can construct succinct representations that approximately preserve the TV distance and KL divergence between linear combinations of these distributions.
Lemma 5.10.
Let be distributions on a set of elements, say . Let and let be multisets where is obtained by drawing elements independently from . Let . Define the distribution . Define the vectors for where the division is done entrywise and define (reciprocated entrywise).
Let and be some parameters. If then with probability
-
•
The vectors are -representative for
-
•
The vectors are KL-preserving for
Proof.
We begin by proving the first statement. Consider a fixed set of coefficients . We will first imagine that the sets are sampled after fixing the coefficients and then we will union bound over a net over all possible choices for . We have
Also the quantity inside the expectation on the RHS always has magnitude at most . Now we can draw samples from to approximate the RHS. By a Chernoff bound, if we draw at least samples from the distribution – this corresponds to taking , then with probability at least , the empirical estimate of the RHS is within of the true value i.e.
which is exactly the inequality that we want to show.
Since the failure probability of the above is , we can union bound over a -net of all of the possible choices of . Call the net . Then for any possible such that and at most of the are nonzero, there is some element of the net, say , such that . From the union bound over the net, we know that
and thus
as desired.
We prove the second statement in a similar way. First consider a fixed choice of . Consider fixed coefficients . Then
Also for all possible choices of , the RHS always has magnitude at most . Now, as before in the proof of the first statement, by a Chernoff bound, this implies that for fixed and also , with probability , we have
Again, as before, we next union bound over a -net over all possible choices of and also and we deduce that
for all choices of , as desired. ∎
The following fact will also be useful as we won’t have pdf access to the exact distribution that we get samples from but instead to a distribution that is close in TV distance.
Lemma 5.11.
Let be distributions on a set of elements, say . Let be distributions such that for all . Let and let be multisets where is obtained by drawing elements independently from . With probability , we have for all and
for all .
Proof.
Fix an . For , let be the set of elements such that
Note that by assumption,
and thus we must have
which also implies . Now we can union bound this over all choices of and all of the samples drawn to get that the overall failure probability is at most as desired. ∎
6 Learning Algorithm
In this section, we present our learning algorithm. In light of Lemma 4.6, throughout our learning algorithm, we will assume that we have sample/conditional query and pdf access to a distribution that is conditionally close to the unknown distribution . We will only use conditional queries to to simulate access to and almost all of our reasoning will be with respect to the distribution .
6.1 Finding a Spanner for the State Space
Recall that by Fact 2.2, for a fixed with , the (vectorized) distributions all lie in some -dimensional subspace. Thus, by Fact 5.3, there exists a barycentric spanner consisting of elements corresponding to some histories. Since is conditionally close to , the (vectorized) distributions are all close to some -dimensional subspace. The first step in our algorithm will be to, for each , compute a set of histories such that are an approximate spanner for . The main algorithm in this section is Algorithm 3 and we analyze it in Lemma 6.3.
First, we will need a few preliminaries. It will be convenient to introduce the following notation for arranging possible histories and futures into a matrix.
Definition 6.1.
For a subset for some and subset , let be the matrix with rows indexed by elements and columns indexed by elements and entries equal to . When either or is the full set, we may write or
We will repeatedly make use of the primitive in Algorithm 2 for taking histories and outputting vectors that are succinct representations (in the sense of Lemma 5.10) of the distributions .
Fact 6.2.
Whenever we run Algorithm 2, the output vectors have nonnegative entries and satisfy ,
Proof.
This follows immediately from the construction in Algorithm 2. ∎
Now we are ready to present the algorithm for building the spanners and its analysis. For technical reasons that will be important later, when computing a spanner for histories of length , the algorithm takes as input some strings of length . The goal will be to output a list of strings of length , say , such that the corresponding vectors are an approximate spanner for the distribution of vectors and they also span for all . Note that the first condition is the natural requirement for a spanner and the second will be important later when we compute “transitions” between the length- histories and length- histories.
We now prove guarantees on the spanning set found by Algorithm 3.
Lemma 6.3.
Assume that is conditionally close to a rank distribution where
for some sufficiently large polynomial. With probability at least , if we run Algorithm 3 for arbitrary input strings , the output satisfies the following conditions
-
•
and the vectors form a spanner for
-
•
The rows of form a -spanner for the rows of
-
•
The rows of form a -spanner for the distribution of vectors
Proof.
For , let be vectors with entries indexed by and entries given by and respectively. First, note that the vectors are contained in an -dimensional space by Fact 2.2. Thus, by Lemma 5.11, and the setting of sufficiently small, with probability at least , there is an -dimensional subspace, say , such that all of have projection at most onto . Assuming that this holds, all of the hypotheses of Lemma 5.6 are satisfied (note the condition about the norms of the input vectors is satisfied by the construction in Algorithm 2). We then get that the execution of Algorithm 3 successfully finds a set with that is a spanner for .
For each , we can use the fact that is a spanner for to find a linear combination with for all such that
(9) |
Next, by Lemma 5.10, with probability at least , the vectors are -representative for the distributions . This then implies
which gives the second condition.
Now we prove the last condition. Recall that the vectors are vectors in that are contained in some -dimensional subspace. Let be the distribution of these vectors for . By Lemma 5.4 (since these vectors live in an -dimensional subspace), with probability at least , there is a subset with , such that the vectors are a -spanner for this distribution . For each , we can again use (9) (since ). We can also again use Lemma 5.10 to get that the vectors are -representative for the distributions and thus by the conditional closeness of and and Claim 4.3, they are -representative for . Thus, (9) implies that for all , there exist coefficients with for all such that
(10) |
since . Next, recall that are a -spanner for the distribution of for . Thus, for , with probability, there are coefficients with such that
Finally, we can combine this with (10) (which we can apply for all ) and the closeness between and to get that there are coefficients with such that
which gives the last condition. Note that the overall failure probability for all of the statements is at most . ∎
6.2 Full Learning Algorithm
For our full learning algorithm, we will apply Algorithm 3 to build a sequence of spanning sets for each history-length. We then need to learn the “transitions” between these spanning sets. Rather than explicitly learning these transitions, we instead learn a representation of the distribution of the form described below where the transitions are implicit – the transitions are only computed in Section 7 when we actually want to sample from the learned distribution. The representation of the distribution consists of:
-
•
Subsets for all
-
•
Matrices for all containing the next character probabilities for each of the strings in
-
•
For each , a collection of vectors for all that are succinct representations of the distributions obtained from Algorithm 2
-
•
For each , we also maintain an index set and vector (these are also obtained from Algorithm 2)
Algorithm 5 gives a full description of how we compute this representation.
After running Algorithm 5, we obtain a description of the distribution in terms of
Note that for all , the set is just defined by . In the remainder of this section, we prove that the parameters learned above satisfy certain properties (see Lemma 6.6). These properties will then be used in Section 7 to argue that we can sample from the learned description to get a distribution that is close to the original distribution . We begin with a few definitions.
Definition 6.4.
We say a string for is -representable by if there exists some vector with such that
Definition 6.5.
We say a string for is -positively representable by and if there exists some vector with such that
-
•
-
•
has all entries at least
The main lemma of this section is stated below.
Lemma 6.6.
Assume that is conditionally close to a rank distribution where
for some sufficiently large polynomial. For any parameter , in the execution of Algorithm 5, with probability , we have the following properties:
-
•
For any , .
-
•
For any and , the string is -positively representable by with at least probability
-
•
For all , the vectors are -representative for the distributions
-
•
For all , the vectors are KL-preserving for the distributions
Remark 6.7.
We think of the parameters as being set such that .
Proof.
We apply Lemma 6.3 – with probability , the guarantees of the lemma hold each time we execute Algorithm 3. This immediately implies the first statement that we want to show. For the second statement, Lemma 6.3 also gives us that over the execution of the algorithm, for all , the vectors form a -spanner for the distribution of vectors . Thus, for , with probability, there exists some vector with such that
(11) |
Let be obtained by increasing all entries of by . Since , we have
Now first consider sampling after is fixed. Now let be the set of all strings such that
(12) |
As long as doesn’t contain any of these elements, then would give us a -positive representation of by . Note that
and thus we must have
and using (12), this rearranges into
Also recall Lemma 6.3 implies that for any ,
Now we have
and if this happens, then is -positively representable by . Thus, so far we have shown that over the randomness of and ,
Now by Markov’s inequality, with probability at least over the choice of ,
and this proves the second of the desired statements.
Finally, the last two statements follow from Lemma 5.10. Combining the failure probabilities for all of the parts, the total failure probability is at most , and this completes the proof. ∎
Later on, we will also need the following simple observation.
Claim 6.8.
In the execution of Algorithm 5, assume that for all . Then for any , with probability at least , we have for all .
7 Sampling Procedure
After running Algorithm 5, we have learned a set of parameters
where recall . However, it is not yet clear how these parameters determine a distribution. In this section, we show how these parameters determine a distribution that we can efficiently sample from and argue that if the learned parameters satisfy the properties in Lemma 6.6, then this distribution is close to .
Throughout this section, we assume that we are given the global parameters and a target accuracy parameter . We make the following assumption on the accuracy of the learned parameters. In light of Lemma 6.6 and Claim 6.8, we can show that this assumption holds with high probability as long as we ran Algorithm 5 on a distribution that is -conditionally close to a rank distribution for sufficiently small .
Assumption 7.1.
We have satisfying for some sufficiently large polynomial. Let . The parameters
satisfy the following properties:
-
•
For any , .
-
•
For any and , the string is -positively representable by with at least probability
-
•
For all , the vectors have nonnegative entries and have the same dimensionality and satisfy (where denotes entrywise product)
-
•
For all , the vectors are -representative for the distributions
-
•
For all , the vectors are KL-preserving for the distributions
-
•
for all
We will also make the following assumption on the underlying distribution .
Assumption 7.2.
The distribution is -positive (recall Definition 4.4).
Throughout the rest of this section, we will treat the parameters as fixed. We will describe the sampling procedure and then analyze it, proving that it samples from a distribution close to as long as the assumptions above hold.
The main sampling algorithm is Algorithm 6. First, we define the following operation which rounds a vector to a point on the simplex corresponding to a probability distribution.
Definition 7.3.
For a vector and parameter , we define to be
where . Note that the output of is always a valid probability distribution.
At a high-level, Algorithm 6 works as follows. We sample a string one character at a time. At each step , we maintain a linear combination of the strings that is supposed to approximate the string given by the first characters of in the following sense: we want
as distributions, for some coefficients . Given this linear combination, and since gives us the next character probabilities for all of the strings in , we know what the distribution for the next character should be given the first characters in so we simply sample the next character from this distribution. If the next character is , so , then we can re-normalize the coefficients to get coefficients such that
(13)
The main difficulty is that we now want to rewrite the RHS as a linear combination of the distributions .
Recalling the discussion in Section 3.3, we will solve a convex optimization problem, which roughly corresponds to projection in KL, for each “change-of-basis” operation. In particular, we will use the fact that the vectors (where recall ) are succinct representations of . Then once we have the coefficients in (13), to “change the basis”, we solve a convex optimization problem for coefficients such that are bounded and
is minimized. Recall that the crucial property of KL divergence which ensures the error grows only additively, rather than multiplicatively, at each step is that projection in KL onto a convex set decreases the KL distance to all other points in the set (see Fact 4.10).
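To make one step of this sampling loop concrete, the following Python sketch shows the two ingredients: sampling the next character from the distribution induced by the current coefficients, and the KL-projection (“change-of-basis”) step written as a convex program. This is a schematic sketch only: the helper names (sample_next_char, kl_change_of_basis, M_t, M_next, bound), the cvxpy formulation, the small numerical floors, and the normalization constraint on the mixture are illustrative assumptions and are not the precise program in line 17 of Algorithm 6.

import numpy as np
import cvxpy as cp

def sample_next_char(coeffs: np.ndarray, M_t: np.ndarray):
    # Rows of M_t are the next-character distributions of the spanner strings,
    # so coeffs @ M_t is the predicted next-character distribution.
    p = np.maximum(coeffs @ M_t, 1e-12)  # crude rounding to keep entries positive
    p = p / p.sum()                      # renormalize to a valid distribution
    return np.random.choice(len(p), p=p), p

def kl_change_of_basis(target: np.ndarray, M_next: np.ndarray, bound: float) -> np.ndarray:
    # Schematic change-of-basis step: find bounded coefficients w whose induced
    # distribution M_next.T @ w is as close as possible to `target` in KL.
    k = M_next.shape[0]
    w = cp.Variable(k)
    mix = M_next.T @ w
    objective = cp.Minimize(cp.sum(cp.rel_entr(target, mix)))  # KL(target || mix)
    constraints = [
        mix >= 1e-9,          # numerical floor so the KL objective stays finite
        cp.sum(mix) == 1,     # assumed normalization of the feasible mixtures
        cp.abs(w) <= bound,   # assumed bound on the coefficients
    ]
    cp.Problem(objective, constraints).solve()
    return w.value

The design point mirrors the discussion above: the projection is performed in KL divergence, rather than, say, Euclidean distance, precisely so that the error incurred by each change-of-basis step accumulates only additively over the length of the sequence.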
To describe our algorithm formally, it will be convenient to define the following notation for arranging the relevant vectors as the rows of a matrix.
Definition 7.4.
Let be the matrix whose rows are given by . For , let be the matrix whose rows are given by .
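In code, this bookkeeping is just the usual row-stacking of the stored vectors into a matrix; for example, in numpy (the placeholder vectors below are purely illustrative):

import numpy as np

# One stored vector per spanner string (placeholder values for illustration).
vectors = [np.array([0.5, 0.3, 0.2]), np.array([0.1, 0.6, 0.3])]
M = np.vstack(vectors)  # each stored vector becomes one row of the matrix M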
Remark 7.5.
The analysis will rely on bounding the (truncated) KL divergence between the true distribution and the distribution that our algorithm samples from. We will do this inductively in via a hybridization argument. However, there are certain “bad” histories that we will need to truncate out, and these are precisely those that are not positively representable in the sense of Lemma 6.6.
Definition 7.6.
For , we let be the subset of such that for all , the prefix of of length is -positively representable by .
The key lemma of the analysis is Lemma 7.12, but first we will need to prove a sequence of preliminary claims. We apply Corollary 4.11 to analyze the convex optimization step in Algorithm 6. We get that, up to some small additive error, replacing with the solution that we compute reduces the KL divergence to all vectors in the feasible set of the convex program. We can then use Assumption 7.1 to lift this argument to the original distributions and argue that our change-of-basis step incurs only a small additive error overall.
Claim 7.7.
Proof.
We apply Corollary 4.11, where is the set of all vectors such that is feasible in the convex program. It is clear that this set is convex. Also, if we set in Corollary 4.11, then all of the hypotheses are satisfied. By Assumption 7.1, . Also, by definition, all elements of have sum of entries equal to . Thus, we get
as desired. ∎
We will also need the following observation about Algorithm 6: the vector is always close to being a valid distribution.
Claim 7.8.
Proof.
Corollary 7.9.
Proof.
Now we can take Claim 7.7 and then use Assumption 7.1 to replace the matrix with the matrix of true probabilities as follows.
Claim 7.10.
Fix an (recall Definition 7.6). Let be such that . Assume that in the execution of Algorithm 6, the first characters we have sampled are exactly and the next character is sampled to be , so we just set in line 13. Then after solving the optimization problem in line 17 to compute the next set of coefficients , we have
Proof.
By the assumption that , there exist coefficients such that and
• (14)
• has all entries at least
Note that the first condition implies that
because all rows of the matrix have sum equal to . Let be the all-ones vector of dimension (recall Assumption 7.1). Then define
By Assumption 7.1 (specifically that the vectors are representative for the distributions ), this implies that
(15)
Thus, using Assumption 7.1, the vector must be a feasible solution to the convex program in Line 17 of Algorithm 6. By Claim 7.7,
Next, recall the vector constructed in Algorithm 6 and recall that
Then by the KL-preserving property in Assumption 7.1 (and Corollary 7.9 which bounds the entries of )
Finally, by Claim 4.12 together with (14) and (15), the above implies
(16)
where in the last step we used the assumption that is sufficiently small. This completes the proof. ∎
Now we sum Claim 7.10 over all possibilities for to get the following bound.
Claim 7.11.
Fix an . Assume that in the execution of Algorithm 6, the first characters we have sampled are exactly . Then we have
Proof.
Let be the subset of all strings whose first character is . Then we have
Also note that
Note that all entries of the matrix are at least by Assumption 7.2 and by Corollary 7.9, so using the definition of truncated KL,
(17)
Next, we will apply Claim 7.10 and (17) to bound the sum
However, we can only apply Claim 7.10 when is such that . For other choices of , we can nevertheless use the following trivial version of Claim 7.10
where the above holds simply because both of the KL divergences are bounded in magnitude by . Thus, we have the bound
(18)
Next, by definition and Assumption 7.1,
Observe that the vector is the same as concatenating over all choices of . Thus by Claim 7.8
Combining this with (18) and using nonnegativity of KL gives
∎
Next, we sum Claim 7.11 over all possibilities for . Note that in the execution of Algorithm 6, we can associate to every possible history a unique vector of coefficients, namely the state of the vector conditioned on the prefix being equal to .
Lemma 7.12.
For all , we have
Proof.
Summing Claim 7.11 as ranges over all (with weights ) gives
(19)
Now we can prove the main lemma of this section, where we show that the distribution that our algorithm samples from is close to the desired distribution .
Lemma 7.13.
Proof.
First, for a fixed history , let be the vector
where the maximum is taken entrywise. Clearly has all positive entries and by Claim 7.8 (and the definition of ), the sum of the entries is at most . Now let
i.e., normalized so that the sum of the entries is . Note that all of the entries of the matrix are at least , so
Now combining the above with Lemma 7.12 implies
Recall by Assumption 7.1 that
and thus by Pinsker’s inequality, we deduce
(20)
and when this happens, since
we get that the distributions are -close in TV distance. Now we construct a hybrid distribution that is equal to except that, starting from every prefix where the condition in (20) fails, all future characters are sampled according to . By (20), and by Claim 4.3, . Putting these together yields the desired statement.
∎
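For completeness, the form of Pinsker’s inequality invoked in the proof above is the standard one, stated here for two distributions P and Q on the same finite domain (the precise constants appearing in (20) are as in the original statement):

\[
  \mathrm{TV}(P, Q) \;=\; \frac{1}{2} \sum_{x} \big| P(x) - Q(x) \big| \;\le\; \sqrt{\tfrac{1}{2}\, \mathrm{KL}(P \,\|\, Q)},
\]

so a bound on the KL divergence between the true and sampled conditional distributions directly yields a bound on their total variation distance.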
Now we can put everything together and complete the proof of Theorem 1.3.
Proof of Theorem 1.3.
Let for some sufficiently large polynomial. We will combine the following ingredients:
• Lemma 4.6 to get exact sample and pdf access to a distribution close to
• Algorithm 5 to learn a description of a distribution
• Algorithm 6 to draw samples from the distribution parameterized by this learned description
For all of these algorithms, we will set the failure probability parameter . First, we apply Lemma 4.6 to get exact sample and pdf access to a distribution that is conditionally close to and -positive. Next, we apply Algorithm 5 to learn a set of parameters
where . We apply Lemma 6.6 and Claim 6.8 to these learned parameters. Combined with the definition of the vectors , they imply that with probability , the conditions of Assumption 7.1 are satisfied. Also, Assumption 7.2 is satisfied by construction. Finally, Lemma 7.13 implies that when these conditions are satisfied, the distribution that Algorithm 6 samples from is -close to in TV distance. Note that the sampling algorithm runs in time . Redefining completes the proof. ∎