
Estimation and inference for the Wasserstein distance between mixing measures in topic models

Xin Bing (Department of Statistical Sciences, University of Toronto), Florentina Bunea (Department of Statistics and Data Science, Cornell University), and Jonathan Niles-Weed (Courant Institute of Mathematical Sciences and Center for Data Science, New York University)
Abstract

The Wasserstein distance between mixing measures has come to occupy a central place in the statistical analysis of mixture models. This work proposes a new canonical interpretation of this distance and provides tools to perform inference on the Wasserstein distance between mixing measures in topic models. We consider the general setting of an identifiable mixture model consisting of mixtures of distributions from a set 𝒜\mathcal{A} equipped with an arbitrary metric dd, and show that the Wasserstein distance between mixing measures is uniquely characterized as the most discriminative convex extension of the metric dd to the set of mixtures of elements of 𝒜\mathcal{A}. The Wasserstein distance between mixing measures has been widely used in the study of such models, but without axiomatic justification. Our results establish this metric to be a canonical choice. Specializing our results to topic models, we consider estimation and inference of this distance. Although upper bounds for its estimation have been recently established elsewhere, we prove the first minimax lower bounds for the estimation of the Wasserstein distance between mixing measures, in topic models, when both the mixing weights and the mixture components need to be estimated. Our second main contribution is the provision of fully data-driven inferential tools for estimators of the Wasserstein distance between potentially sparse mixing measures of high-dimensional discrete probability distributions on pp points, in the topic model context. These results allow us to obtain the first asymptotically valid, ready to use, confidence intervals for the Wasserstein distance in (sparse) topic models with potentially growing ambient dimension pp.

Keywords: de-biased MLE, limiting distribution, mixing measure, mixture distribution, optimal rates of convergence, probability distance, sparse mixtures, topic model, Wasserstein distance

1 Introduction

This work is devoted to estimation and inference for the Wasserstein distance between the mixing measures associated with mixture distributions, in topic models.

Our motivation stems from the general observation that standard distances (Hellinger, Total Variation, Wasserstein) between mixture distributions do not capture the possibility that similar distributions may arise from mixing completely different mixture components, and have therefore different mixing measures. Figure 1 illustrates this possibility.

Figure 1: Pearson’s crab mixture [51], in orange, and an alternative mixture, in green. Although these mixtures are close in total variation (left plot), they arise from very different mixture components.

In applications of mixture models in sciences and data analysis, discovering differences between mixture components is crucial for model interpretability. As a result, the statistical literature on finite mixture models largely focuses on the underlying mixing measure as the main object of interest, and compares mixtures by measuring distances between their corresponding mixing measures. This perspective has been adopted in the analysis of estimators for mixture models [42, 49, 31, 17, 48], in Bayesian mixture modeling [45, 44], and in the design of model selection procedures for finite mixtures [47]. The Wasserstein distance is a popular choice for this comparison [49, 31], as it is ideally suited for comparing discrete measures supported on different points of the common space on which they are defined.

Our first main contribution is to give an axiomatic justification for this distance choice, for any identifiable mixture model.

Our main statistical contribution consists in providing fully data driven inferential tools for the Wasserstein distance between mixing measures, in topic models. Optimal estimation of mixing measures in particular mixture models is an area of active interest (see, e.g., [42, 49, 31, 17, 48, 35, 69]). In parallel, estimation and inference of distances between discrete distributions has begun to receive increasing attention [30, 65]. This work complements both strands of literature and we provide solutions to the following open problems:
(i) Minimax lower bounds for the estimation of the Wasserstein distance between mixing measures in topic models, when both the mixture weights and the mixture components are estimated from the data;
(ii) Fully data driven inferential tools for the Wasserstein distance between topic-model based, potentially sparse, mixing measures.

Our solution to (i) complements recently derived upper bounds on this distance [8], thereby confirming that near-minimax optimal estimation of the distance is possible, while the inference problem (ii), for topic models, has not been studied elsewhere, to the best of our knowledge. We give below specific necessary background and discuss our contributions, as well as the organization of the paper.

2 Our contributions

2.1 The Sketched Wasserstein Distance (SWD) between mixture distributions equals the Wasserstein distance between their mixing measures

Given a set 𝒜𝒫(𝒴)\mathcal{A}\subseteq\mathcal{P}(\mathcal{Y}) of probability distributions on a Polish space 𝒴\mathcal{Y}, consider the set 𝒮\mathcal{S} of finite mixtures of elements of 𝒜\mathcal{A}. As long as mixtures consisting of elements of 𝒜\mathcal{A} are identifiable, we may associate any r(1)=k=1Kαk(1)Ak𝒮r^{(1)}=\sum_{k=1}^{K}\alpha_{k}^{(1)}A_{k}\in\mathcal{S} with a unique mixing measure 𝜶(1)=k=1Kαk(1)δAk\boldsymbol{\alpha}^{(1)}=\sum_{k=1}^{K}\alpha_{k}^{(1)}\delta_{A_{k}}, which is a probability measure on 𝒜𝒫(𝒴)\mathcal{A}\subseteq\mathcal{P}(\mathcal{Y}).

We assume that 𝒫(𝒴)\mathcal{P}(\mathcal{Y}) is equipped with a metric dd. This may be one of the classic distances on probability measures, such as the Total Variation (TV) or Hellinger distance, or it may be a more exotic distance, for instance, one computed via a machine learning procedure or tailored to a particular data analysis task. As mentioned above, given mixing measures 𝜶(1)\boldsymbol{\alpha}^{(1)} and 𝜶(2)\boldsymbol{\alpha}^{(2)}, corresponding to mixture distributions r(1)r^{(1)} and r(2)r^{(2)}, respectively, it has become common in the literature to compare them by computing their Wasserstein distance W(𝜶(1),𝜶(2);d)W(\boldsymbol{\alpha}^{(1)},\boldsymbol{\alpha}^{(2)};d) with respect to the distance dd (see, e.g., [49, 31]). It is intuitively clear that computing a distance between the mixing measures will indicate, in broad terms, the similarity or dissimilarity between the mixture distributions; however, it is not clear whether using the Wasserstein distance between the mixing measures best serves this goal, nor whether such a comparison has any connection to a bona fide distance between the mixture distributions themselves.

As our first main contribution, we show that, indeed, the Wasserstein distance between mixing measures is a fundamental quantity to consider and study, but we arrive at it from a different perspective. In Section 3 we introduce a new distance, the Sketched Wasserstein Distance (SWD\operatorname{SWD}) on 𝒮\mathcal{S}, between mixture distributions, which is defined and uniquely characterized as the largest—that is, most discriminative—jointly convex metric on 𝒮\mathcal{S} for which the inclusion ι:(𝒜,d)(𝒮,SWD)\iota:(\mathcal{A},d)\to(\mathcal{S},\operatorname{SWD}) is an isometry. Despite the abstractness of this definition, we show, in Theorem 1 below, that

SWD(r(1),r(2))=W(𝜶(1),𝜶(2);d).\operatorname{SWD}(r^{(1)},r^{(2)})=W(\boldsymbol{\alpha}^{(1)},\boldsymbol{\alpha}^{(2)};d)\,.

This fact justifies our name Sketched Wasserstein Distance for the new metric—it is a distance on mixture distributions that reduces to a Wasserstein distance between mixing measures, which may be viewed as “sketches” of the mixtures. We use the results of Section 3 as further motivation for developing inferential tools for the Wasserstein distance between mixing measures, in the specific setting provided by topic models. We present this below.
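To make this identity concrete, the following minimal sketch (not code from the paper) evaluates its right-hand side as a K x K optimal-transport linear program with cost d(A_k, A_l); the function name, the default choice of d as total variation, and the toy numbers are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_mixing(alpha1, alpha2, A, dist=None):
    """W(alpha1, alpha2; d) for mixing measures supported on the columns of A."""
    K = A.shape[1]
    if dist is None:                         # default base metric d: total variation
        dist = lambda a, b: 0.5 * np.abs(a - b).sum()
    C = np.array([[dist(A[:, k], A[:, l]) for l in range(K)] for k in range(K)])
    A_eq = np.zeros((2 * K, K * K))          # transport plan pi, flattened row-major
    for k in range(K):
        A_eq[k, k * K:(k + 1) * K] = 1.0     # row sums:    sum_l pi[k, l] = alpha1[k]
        A_eq[K + k, k::K] = 1.0              # column sums: sum_k pi[k, l] = alpha2[l]
    b_eq = np.concatenate([alpha1, alpha2])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# toy example: K = 3 mixture components on p = 4 points
A = np.array([[.7, .1, .2], [.1, .7, .2], [.1, .1, .3], [.1, .1, .3]])
print(wasserstein_mixing(np.array([.5, .5, 0.]), np.array([.2, .3, .5]), A))
```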

2.2 Background on topic models

To state our results, we begin by introducing the topic model, as well as notation that will be used throughout the paper. We also give in this section relevant existing estimation results for the model. They serve as partial motivation for the results of this paper, summarized in the following sub-sections.

Topic models are widely used modeling tools for linking nn discrete, typically high dimensional, probability vectors. Their motivation and name stems from text analysis, where they were first introduced [13], but their applications go beyond that, for instance to biology [15, 18]. Adopting a familiar jargon, topic models are used to model a corpus of nn documents, each assumed to have been generated from a maximum of KK topics. Each document i[n]:={1,,n}i\in[n]:=\{1,\ldots,n\} is modeled as a set of NiN_{i} words drawn from a discrete distribution r(i)Δp:={v+p:v𝟙p=1}r^{(i)}\in\Delta_{p}:=\{v\in\mathbb{R}_{+}^{p}:v^{\top}\mathds{1}_{p}=1\} supported on the set of words 𝒴:={y1,,yp}\mathcal{Y}:=\{y_{1},\ldots,y_{p}\}, where pp is the dictionary size. One observes nn independent, pp-dimensional word counts Y(i)Y^{(i)} for i[n]i\in[n] and assumes that

Y(i)Multinomialp(Ni,r(i)).Y^{(i)}\sim\text{Multinomial}_{p}(N_{i},r^{(i)}). (1)

For future reference, we let X(i):=Y(i)/NiX^{(i)}:={Y^{(i)}/N_{i}} denote the associated word frequency vector.

The topic model assumption is that the p×np\times n matrix RR, with columns r(i)r^{(i)}, the expected word frequencies of each document, can be factorized as R=ATR=AT, where the p×Kp\times K matrix A:=[A1,,AK]A:=[A_{1},\ldots,A_{K}] has columns AkA_{k} in the probability simplex Δp\Delta_{p}, corresponding to the conditional probabilities of word occurrences, given a topic. The K×nK\times n matrix T:=[α(1),,α(n)]T:=[\alpha^{(1)},\ldots,\alpha^{(n)}] collects probability vectors α(i)ΔK\alpha^{(i)}\in\Delta_{K} with entries corresponding to the probability with which each topic k[K]k\in[K] is covered by document ii, for each i[n]i\in[n]. The only assumption made in this factorization is that AA is common to the corpus, and therefore document independent: otherwise the matrix factorization above always holds, by a standard application of Bayes’s theorem. Writing out the matrix factorization one column i[n]i\in[n] at a time

r(i)=k=1Kαk(i)Ak,with α(i)ΔK and AkΔp for all k[K],r^{(i)}=\sum_{k=1}^{K}\alpha^{(i)}_{k}A_{k},\quad\text{with }\alpha^{(i)}\in\Delta_{K}\text{ and }A_{k}\in\Delta_{p}\text{ for all }k\in[K], (2)

one recognizes that each (word) probability vector r(i)Δpr^{(i)}\in\Delta_{p} is a discrete mixture of the (conditional) probability vectors AkΔpA_{k}\in\Delta_{p}, with potentially sparse mixture weights α(i)\alpha^{(i)}, as not all topics are expected to be covered by a single document ii. Each probability vector r(i)r^{(i)} can be written uniquely as in (2) if uniqueness of the topic model factorization holds. The latter has been the subject of extensive study and is closely connected to the uniqueness of non-negative matrix factorization. Perhaps the most popular identifiability assumption for topic models, owing in part to its interpretability and constructive nature, is the so called anchor word assumption [25, 4], coupled with the assumption that the matrix TK×nT\in\mathbb{R}^{K\times n} has rank KK. We state these assumptions formally in Section 4.1. For the remainder of the paper, we assume that we can write r(i)r^{(i)} uniquely as in (2), and refer to α(i)\alpha^{(i)} as the true mixture weights, and to AkA_{k}, k[K]k\in[K], as the mixture components. Then the collection of mixture components defined in the previous section becomes 𝒜={A1,,AK}\mathcal{A}=\{A_{1},\dots,A_{K}\} and the mixing measure giving rise to r(i)r^{(i)} is

𝜶(i)=k=1Kαk(i)δAk,\boldsymbol{\alpha}^{(i)}=\sum_{k=1}^{K}\alpha_{k}^{(i)}\delta_{A_{k}},

which we view as a probability measure on the metric space 𝒳:=(Δp,d)\mathcal{X}:=(\Delta_{p},d) obtained by equipping Δp\Delta_{p} with a metric dd.
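For intuition, here is a small simulation sketch of the data-generating mechanism just described; all dimensions and Dirichlet parameters below are arbitrary illustrative choices, not quantities used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, n, N = 1000, 5, 200, 500              # dictionary size, topics, documents, words per document

A = rng.dirichlet(np.full(p, 0.05), size=K).T       # p x K matrix; columns A_k in the simplex
T = rng.dirichlet(np.full(K, 0.3), size=n).T        # K x n matrix of topic weights alpha^(i)
R = A @ T                                           # p x n expected word frequencies r^(i)

Y = np.stack([rng.multinomial(N, R[:, i]) for i in range(n)], axis=1)   # word counts Y^(i), model (1)
X = Y / N                                           # observed word frequencies X^(i)
```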

Computationally tractable estimators A^\widehat{A} of the mixture components collected in AA have been proposed by [4, 3, 61, 54, 9, 10, 67, 38], with finite sample performance guarantees, including minimax adaptive rates relative to A^A1,=max1kKA^kAk1\|\widehat{A}-A\|_{1,\infty}=\max_{1\leq k\leq K}\|\widehat{A}_{k}-A_{k}\|_{1} and A^A1=k=1Kj=1p|A^jkAjk|\|\widehat{A}-A\|_{1}=\sum_{k=1}^{K}\sum_{j=1}^{p}|\widehat{A}_{jk}-A_{jk}|, obtained by [61, 9, 10]. In Section 4.1 below we give further details on these results.

In contrast, minimax optimal, sparsity-adaptive estimation of the potentially sparse mixture weights has only been studied very recently in [8]. Given an estimator A^\widehat{A} with A^kΔp\widehat{A}_{k}\in\Delta_{p} for all k[K]k\in[K], this work advocates the usage of an MLE-type estimator α^(i)\widehat{\alpha}^{(i)},

α^(i)=argmaxαΔKNij=1pXj(i)log(A^jα),for each i[n].\widehat{\alpha}^{(i)}=\operatorname*{argmax}_{\alpha\in\Delta_{K}}~{}N_{i}\sum_{j=1}^{p}X_{j}^{(i)}\log(\widehat{A}_{j\cdot}^{\top}\alpha),\quad\text{for each $i\in[n]$}. (3)

If A^\widehat{A} were treated as known, and replaced by AA, the estimator in (3) reduces to α^A(i)\widehat{\alpha}_{A}^{(i)}, the MLE of α(i)\alpha^{(i)} under the multinomial model (1).
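Because the objective in (3) is concave in the weights over the simplex, any convex solver can compute this estimator. Purely as an illustration, and not as the algorithm analyzed in [8], the sketch below uses the classical EM fixed-point iteration for mixture proportions with fixed components; the function name and its arguments are hypothetical.

```python
import numpy as np

def estimate_weights(X_i, A_hat, n_iter=500, tol=1e-10):
    """Maximize sum_j X_i[j] * log(A_hat[j, :] @ alpha) over the simplex (cf. (3))."""
    K = A_hat.shape[1]
    alpha = np.full(K, 1.0 / K)
    support = X_i > 0                               # words absent from the document do not contribute
    Xs, As = X_i[support], A_hat[support, :]
    for _ in range(n_iter):
        mix = As @ alpha                            # fitted word probabilities A_hat @ alpha
        new = alpha * (As.T @ (Xs / mix))           # EM update for mixture proportions
        new /= new.sum()
        if np.abs(new - alpha).sum() < tol:
            break
        alpha = new
    return alpha

# usage with the simulation sketch from Section 2.2:  alpha_hat_i = estimate_weights(X[:, i], A)
```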

Although α^(i)\widehat{\alpha}^{(i)} is of MLE-type, its analysis, for i[n]i\in[n], is non-standard, and [8] conducted it under the general regime that we will also adopt in this work. We state it below, to facilitate contrast with the classical set-up.

Classical Regime General Regime
(a) AA is known (a’) AA is estimated from nn samples of size NiN_{i} each
(b) pp is independent of NiN_{i} (b’) pp can grow with NiN_{i} (and nn).
(c) 0<αk(i)<10<\alpha_{k}^{(i)}<1, for all k[K]k\in[K] (c’) α(i)\alpha^{(i)} can be sparse.
(d) rj(i)>0r_{j}^{(i)}>0, for all j[p]j\in[p] (d’) r(i)r^{(i)} can be sparse.

The General Regime stated above is natural in the topic model context. The fact that AA must be estimated from the data is embedded in the definition of the topic model. Also, although one can work with a fixed dictionary size, and fixed number of topics, one can expect both to grow as more documents are added to the corpus. As already mentioned, each document in a corpus will typically cover only a few topics, so α(i)\alpha^{(i)} is typically sparse. When AA is also sparse, as it is generally the case [10], then r(i)r^{(i)} can also be sparse. To see this, take K=2K=2 and write r1(i)=α1(i)A11+α2(i)A12r^{(i)}_{1}=\alpha^{(i)}_{1}A_{11}+\alpha^{(i)}_{2}A_{12}. If document ii covers topic 1 (α1(i)>0\alpha^{(i)}_{1}>0), but not topic 2 (α2(i)=0\alpha^{(i)}_{2}=0), and the probability of using word 1, given that topic 1 is covered, is zero (A11=0A_{11}=0), then r1(i)=0r^{(i)}_{1}=0.

Under the General Regime above, [8, Theorem 10] shows that, whenever α(i)\alpha^{(i)} is sparse, the estimator α^(i)\widehat{\alpha}^{(i)} is also sparse, with high probability, and derives the sparsity-adaptive finite-sample rates for α^(i)α(i)1\|\widehat{\alpha}^{(i)}-\alpha^{(i)}\|_{1} summarized in Section 4.1 below. Furthermore, [8] suggests estimating W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d) by W(𝜶^(i),𝜶^(j);d)W(\widehat{\boldsymbol{\alpha}}^{(i)},\widehat{\boldsymbol{\alpha}}^{(j)};d), with

𝜶^(i)=k=1Kα^k(i)δA^k,𝜶^(j)=k=1Kα^k(j)δA^k,\widehat{\boldsymbol{\alpha}}^{(i)}=\sum_{k=1}^{K}\widehat{\alpha}_{k}^{(i)}\delta_{\widehat{A}_{k}},\quad\quad\widehat{\boldsymbol{\alpha}}^{(j)}=\sum_{k=1}^{K}\widehat{\alpha}_{k}^{(j)}\delta_{\widehat{A}_{k}}, (4)

for given estimators A^\widehat{A}, α^(i)\widehat{\alpha}^{(i)} and α^(j)\widehat{\alpha}^{(j)} of their population-level counterparts, and i,j[n]i,j\in[n]. Upper bounds for these distance estimates are derived in [8], and re-stated in a slightly more general form in Theorem 2 of Section 4.2. The work of [8], however, does not study the optimality of this upper bound, which motivates our next contribution.

2.3 Lower bounds for the estimation of the Wasserstein distance between mixing measures in topic models

In Theorem 3 of Section 4.3, we establish a minimax lower bound for estimating W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d), for any pair i,j[n]i,j\in[n], from a collection of nn independent multinomial vectors whose collective cell probabilities satisfy the topic model assumption. Concretely, for any dd on Δp\Delta_{p} that is bi-Lipschitz equivalent to the TV distance (cf. Eq. 26), we have

infW^sup𝜶(i),𝜶(j)𝔼|W^W(𝜶(i),𝜶(j);d)|τNlog2(τ)+pKnNlog2(p)\inf_{\widehat{W}}\sup_{\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)}}\mathbb{E}|\widehat{W}-W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d)|\gtrsim\sqrt{\tau\over N\log^{2}(\tau)}+\sqrt{pK\over nN\log^{2}(p)}

where τ=max{α(i)0,α(j)0}\tau=\max\{\|\alpha^{(i)}\|_{0},\|\alpha^{(j)}\|_{0}\}, and 0\|\cdot\|_{0} denotes the 0\ell_{0} norm of a vector, counting the number of its non-zero elements. For simplicity of presentation we assume that Ni=Nj=NN_{i}=N_{j}=N, although this is not needed for either theory or practice. The two terms in the above lower bound quantify, respectively, the smallest error of estimating the distance when the mixture components are known, and the smallest error when the mixture weights are known.

We also obtain a slightly sharper lower bound for the estimation of the mixing measures 𝜶(i)\boldsymbol{\alpha}^{(i)} and 𝜶(j)\boldsymbol{\alpha}^{(j)} themselves in the Wasserstein distance:

\inf_{\widehat{\boldsymbol{\alpha}}}\sup_{\boldsymbol{\alpha}}\,\mathbb{E}\,W(\widehat{\boldsymbol{\alpha}},\boldsymbol{\alpha};d)\gtrsim\sqrt{\tau\over N}+\sqrt{pK\over nN}\,.

As above, these two terms reflect the inherent difficulty of estimating the mixture weights and mixture components, respectively. As we discuss in Section 4.3, a version of this lower bound with additional logarithmic factors follows directly from the lower bound presented above for the estimation of W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d). For completeness, we give the simple argument leading to this improved bound in Appendix D of [7].

The upper bounds obtained in [8] match the above lower bounds, up to possibly sub-optimal logarithmic terms. Together, these results show that the plug-in approach to estimating the Wasserstein distance is nearly optimal, and complement a rich literature on the performance of plug-in estimators for non-smooth functional estimation tasks [41, 16]. Our lower bounds complement those recently obtained for the minimax-optimal estimation of the TV distance for discrete measures [36], but apply to a wider class of distance measures and differ by an additional logarithmic factor; whether this logarithmic factor is necessary is an interesting open question. The results we present are the first nearly tight lower bounds for estimating the class of distances we consider, and our proofs require developing new techniques beyond those used for the total variation case. More details can be found in Section 4.3. The results of this section are valid for any K,p,n,NK,p,n,N. In particular, the ambient dimension pp, and the number of mixture components KK are allowed to grow with the sample sizes n,Nn,N.

2.4 Inference for the Wasserstein distance between sparse mixing measures in topic models

To the best of our knowledge, despite an active interest in inference for the Wasserstein distance in general [46, 22, 23, 19, 20, 21] and the Wasserstein distance between discrete measures in particular [57, 59], inference for the Wasserstein distance between sparse discrete mixing measures, in general, and in topic models, in particular, has not been studied.

Our main statistical contribution is the provision of inferential tools for W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d), for any pair i,j[n]i,j\in[n], in topic models. To this end:

  (1) In Section 5.2, Theorem 5, we derive a N\sqrt{N} distributional limit for an appropriate estimator of the distance.

  (2) In Section 5.3, Theorem 6, we estimate consistently the parameters of the limit, thereby providing theoretically justified, fully data driven, estimates for inference. We also contrast our suggested approach with a number of possible bootstrap schemes, in our simulation study of Section F.4 in [7], revealing the benefits of our method.

Our approach to (1) is given in Section 5. In Proposition 1 we show that, as soon as the error in estimating AA can be appropriately controlled in probability, for instance by employing the estimator given in Section 4.1, the desired limit of a distance estimator hinges on inference for the potentially sparse mixture weights α(i)\alpha^{(i)}. Perhaps surprisingly, this problem has not been studied elsewhere, under the General Regime introduced in Section 2.2 above. Its detailed treatment is of interest in its own right, and constitutes one of our main contributions to the general problem of inference for the Wasserstein distance between mixing measures in topic models. Since our asymptotic distribution results for the distance estimates rest on determining the distributional limit of KK-dimensional mixture weight estimators, the asymptotic analysis will be conducted for KK fixed. However, we do allow the ambient dimension pp to depend on the sample sizes nn and NN.

2.4.1 Asymptotically normal estimators for sparse mixture weights in topic models, under general conditions

We construct estimators of the sparse mixture weights that admit a N\sqrt{N} Gaussian limit under the General Regime, and are asymptotically efficient under the Classical Regime. The challenges associated with this problem under the General Regime inform our estimation strategy, and can be best seen in contrast to the existing solutions offered in the Classical Regime. In what follows we concentrate on inference of α(i)\alpha^{(i)} for some arbitrarily picked i[n]i\in[n].

In the Classical Regime, the estimation and inference for the MLE of α(i)\alpha^{(i)}, in a low dimensional parametrization r(i)=g(α(i))r^{(i)}=g(\alpha^{(i)}) of a multinomial probability vector r(i)r^{(i)}, when gg is a known function and α(i)\alpha^{(i)} lies in an open set of K\mathbb{R}^{K}, with K<pK<p, have been studied for over eighty years. Early results go back to [52, 53, 11], and are reproduced in updated forms in standard text books, for instance Section 16.2 in [1]. Appealing to these results for g(α(i))=Aα(i)g(\alpha^{(i)})=A\alpha^{(i)}, with AA known, the asymptotic distribution of the MLE α^A(i)\widehat{\alpha}_{A}^{(i)} of α(i)\alpha^{(i)} under the Classical Regime is given by

N(α^A(i)α(i))𝑑𝒩K(0,Γ(i)),as N,\sqrt{N}(\widehat{\alpha}_{A}^{(i)}-\alpha^{(i)})\overset{d}{\to}\mathcal{N}_{K}(0,\Gamma^{(i)}),\quad\text{as }N\to\infty, (5)

where the asymptotic covariance matrix is

Γ(i)=(j=1pAjAjrj(i))1α(i)α(i),\Gamma^{(i)}=\Bigl{(}\sum_{j=1}^{p}{A_{j\cdot}A_{j\cdot}^{\top}\over r_{j}^{(i)}}\Bigr{)}^{-1}-\alpha^{(i)}\alpha^{(i)\top}, (6)

a K×KK\times K matrix of rank K1K-1. By standard MLE theory, under the Classical Regime, any (K1)(K-1) sub-vector of α^A(i)\widehat{\alpha}_{A}^{(i)} is asymptotically efficient, for any interior point of ΔK\Delta_{K}.

However, even if conditions (a) and (b) of the Classical Regime hold, but conditions (c) and (d) are violated, in that the model parameters are on the boundary of their respective simplices, the standard MLE theory, which is devoted to the analysis of estimators of interior points, no longer applies. In general, an estimator α¯(i)\bar{\alpha}^{(i)}, restricted to lie in ΔK\Delta_{K}, cannot be expected to have a Gaussian limit around a boundary point α(i)\alpha^{(i)} (see, e.g., [2] for a review of inference for boundary points).

Since inference for the sparse mixture weights is only an intermediate step towards inference on the Wasserstein distance between mixing distributions, we aim at estimators of the weights that have the simplest possible limiting distribution, Gaussian in this case. Our solution is given in Theorem 4 of Section 5.1. We show, in the General Regime, that an appropriate one-step update α~(i)\widetilde{\alpha}^{(i)}, that removes the asymptotic bias of the (consistent) MLE-type estimator α^(i)\widehat{\alpha}^{(i)} given in (3) above, is indeed asymptotically normal:

N(Σ(i))+12(α~(i)α(i))𝑑𝒩K(0,[\bmIK1000]),as n,N,\sqrt{N}~{}\bigl{(}\Sigma^{(i)}\bigr{)}^{+\frac{1}{2}}\bigl{(}\widetilde{\alpha}^{(i)}-\alpha^{(i)}\bigr{)}\overset{d}{\to}\mathcal{N}_{K}\left(0,\begin{bmatrix}\bm{I}_{K-1}&0\\ 0&0\end{bmatrix}\right),\quad\text{as }n,N\to\infty, (7)

where the asymptotic covariance matrix is

Σ(i)=(jJ¯AjAjrj(i))1α(i)α(i)\Sigma^{(i)}=\Bigl{(}\sum_{j\in\overline{J}}{A_{j\cdot}A_{j\cdot}^{\top}\over r_{j}^{(i)}}\Bigr{)}^{-1}-\alpha^{(i)}\alpha^{(i)\top} (8)

with J¯={j[p]:rj(i)>0}\overline{J}=\{j\in[p]:r_{j}^{(i)}>0\} and (Σ(i))+12(\Sigma^{(i)})^{+\frac{1}{2}} being the square root of its generalized inverse. We note that although the dimension KK of Σ(i)\Sigma^{(i)} is fixed, its entries are allowed to grow with p,np,n and NN, under the General Regime.

We benchmark the behavior of our proposed estimator against that of the MLE, in the Classical Regime. We observe that, under the Classical Regime, we have J¯=[p]\overline{J}=[p], and Σ(i)\Sigma^{(i)} reduces to Γ(i)\Gamma^{(i)}, the limiting covariance matrix of the MLE, given by (6). This shows that, in this regime, our proposed estimator α~(i)\widetilde{\alpha}^{(i)} enjoys the same asymptotic efficiency properties as the MLE.
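As a complement to the displays (6) and (8), the sketch below computes Σ^(i) (and hence Γ^(i), when r^(i) has full support) from a given pair (A, α^(i)), together with the generalized-inverse square root appearing in (7); the function name and the toy inputs are illustrative assumptions, not quantities from the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def asymptotic_covariance(A, alpha):
    """Sigma^(i) in (8); equals Gamma^(i) in (6) when r = A @ alpha has full support."""
    r = A @ alpha
    J_bar = r > 0                                       # the support set J-bar of r^(i)
    M = (A[J_bar].T / r[J_bar]) @ A[J_bar]              # sum over J_bar of A_j A_j^T / r_j
    return np.linalg.inv(M) - np.outer(alpha, alpha)    # a K x K matrix of rank K - 1

# toy example: K = 3 components on p = 4 words; the last word has zero probability under r^(i)
A = np.array([[.7, .1, 0.], [.1, .7, 0.], [.2, .2, .5], [0., 0., .5]])
alpha = np.array([.6, .4, 0.])
Sigma = asymptotic_covariance(A, alpha)
Sigma_sqrt_pinv = np.real(sqrtm(np.linalg.pinv(Sigma)))   # (Sigma)^{+1/2} appearing in (7)
```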

In Remark 5 of Section 5.1 we discuss other natural estimation possibilities. It turns out that a weighted least squares estimator of the mixture weights is also asymptotically normal, however it is not asymptotically efficient in the Classical Regime (cf. Theorem 15 in Appendix H, where we show that the covariance matrix of the limiting distribution of the weighted least squares does not equal Γ(i)\Gamma^{(i)}). As a consequence, the simulation results in Section F.3 show that the resulting distance confidence intervals are slightly wider than those corresponding to our proposed estimator. For completeness, we do however include a full theoretical treatment of this natural estimator in Appendix H.

2.4.2 The limiting distribution of the distance estimates and consistent quantile estimation

Having constructed estimators of the mixture weights admitting N\sqrt{N} Gaussian limits, we obtain as a consequence the limiting distribution of our proposed estimator for W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d). The limiting distribution is non-Gaussian and involves an auxiliary optimization problem based on the Kantorovich–Rubinstein dual formulation of the Wasserstein distance: namely, under suitable conditions we have

N(W~W(𝜶(i),𝜶(j);d))𝑑supfijfZij,as n,N,\sqrt{N}(\widetilde{W}-W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d))\overset{d}{\to}\sup_{f\in\mathcal{F}^{\prime}_{ij}}f^{\top}Z_{ij}\,,\quad\text{as }n,N\to\infty,

where ZijZ_{ij} is a Gaussian vector and ij\mathcal{F}^{\prime}_{ij} is a certain polytope in K\mathbb{R}^{K}. The covariance of ZijZ_{ij} and the polytope ij\mathcal{F}^{\prime}_{ij} depend on the parameters of the problem.

To use this theorem for practical inference, we provide a consistent method of estimating the limit on the right-hand side, and thereby obtain the first theoretically justified, fully data-driven confidence intervals for W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d). We refer to Theorem 6 and Section 5.3.2 for details.
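Schematically, and only as an illustration of the shape of the computation (the consistent construction we actually analyze is given in Theorem 6 and Section 5.3.2), the limit on the right-hand side can be approximated by Monte Carlo once plug-in estimates are available. In the sketch below, Q_hat, D_hat (the matrix of pairwise distances between estimated mixture components), delta_hat (the difference of estimated weight vectors), W_hat and the small slack are hypothetical placeholder inputs.

```python
import numpy as np
from scipy.optimize import linprog

def simulate_limit(Q_hat, D_hat, delta_hat, W_hat, n_draws=2000, slack=1e-6, seed=1):
    """Monte Carlo draws of sup_{f in F'} f^T Z with Z ~ N(0, Q_hat), F' estimated by plug-in."""
    K = Q_hat.shape[0]
    rng = np.random.default_rng(seed)
    rows, rhs = [], []
    for k in range(K):
        for l in range(K):
            if k != l:                                   # Lipschitz constraints f_k - f_l <= D_hat[k, l]
                row = np.zeros(K); row[k], row[l] = 1.0, -1.0
                rows.append(row); rhs.append(D_hat[k, l])
    rows.append(-delta_hat); rhs.append(-(W_hat - slack))  # near-active face f^T delta_hat >= W_hat - slack
    A_ub, b_ub = np.array(rows), np.array(rhs)
    A_eq, b_eq = np.zeros((1, K)), [0.0]
    A_eq[0, 0] = 1.0                                     # pin down the additive constant: f_1 = 0
    draws = rng.multivariate_normal(np.zeros(K), Q_hat, size=n_draws)
    sups = []
    for z in draws:
        res = linprog(-z, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=(None, None), method="highs")
        sups.append(-res.fun)                            # sup over f of f^T z
    return np.array(sups)                                # empirical quantiles give confidence limits
```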

Furthermore, a detailed simulation study presented in Appendix F of [7], shows that, numerically, (i) our method compares favorably with the mm-out-of-NN bootstrap [26], especially in small samples, while also avoiding the delicate choice of mm; (ii) our method has comparable performance to the much more computationally expensive derivative-based bootstrap [28]. This investigation, combined with our theoretical guarantees, provides strong support in favor of the potential of our proposal for inference problems involving the Wasserstein distance between potentially sparse mixing measures in large topic models.

3 A new distance for mixture distributions

In this section, we begin by focusing on a given subset 𝒜={A1,,AK}\mathcal{A}=\{A_{1},\dots,A_{K}\} of probability measures on 𝒴\mathcal{Y}, and consider mixtures relative to this subset. The following assumption guarantees that, for a mixture r=k=1KαkAkr=\sum_{k=1}^{K}\alpha_{k}A_{k}, the mixing weights α=(α1,,αK)\alpha=(\alpha_{1},\dots,\alpha_{K}) are identifiable.

Assumption 1.

The set 𝒜\mathcal{A} is affinely independent.

As discussed in Section 2.2 above, in the context of topic models, it is necessary to simultaneously identify the mixing weights α\alpha and the mixture components A1,,AKA_{1},\dots,A_{K}; for this reason, the conditions for identifiability of topic models (Assumption 3 in Section 4.1) are necessarily stronger than Assumption 1. However, the considerations in this section apply more generally to finite mixtures of distributions on an arbitrary Polish space. In the following sections, we will apply the general theory developed here to the particular case of topic models, where the set 𝒜\mathcal{A} is unknown and has to be estimated along with the mixing weights.

We assume that 𝒫(𝒴)\mathcal{P}(\mathcal{Y}) is equipped with a metric dd. Our goal is to define a metric based on dd and adapted to 𝒜\mathcal{A} on the space 𝒮=conv(𝒜)\mathcal{S}=\mathrm{conv}(\mathcal{A}), the set of mixtures arising from the mixing components A1,,AKA_{1},\dots,A_{K}. That is to say, the metric should retain geometric features of the original metric dd, while also measuring distances between elements of 𝒮\mathcal{S} relative to their unique representation as convex combinations of elements of 𝒜\mathcal{A}. Specifically, to define a new distance SWD(,)\operatorname{SWD}(\cdot,\cdot) between elements of 𝒮\mathcal{S}, we propose that it satisfies two desiderata:

  R1: The function SWD\operatorname{SWD} should be a jointly convex function of its two arguments.

  R2: The function SWD\operatorname{SWD} should agree with the original distance dd on the mixture components, i.e., SWD(Ak,A)=d(Ak,A)\operatorname{SWD}(A_{k},A_{\ell})=d(A_{k},A_{\ell}) for k,[K]k,\ell\in[K].

We do not specifically impose the requirement that SWD\operatorname{SWD} be a metric, since many useful measures of distance between probability distributions, such as the Kullback–Leibler divergence, do not possess this property; nevertheless, in Corollary 1, we show that our construction does indeed give rise to a metric on 𝒮\mathcal{S}.

R1 is motivated by both mathematical and practical considerations. Mathematically speaking, this property is enjoyed by a wide class of well behaved metrics, such as those arising from norms on vector spaces, and, in the specific context of probability distributions, holds for any ff-divergence. Metric spaces with jointly convex metrics are known as Busemann spaces, and possess many useful analytical and geometrical properties [37, 58]. Practically speaking, convexity is crucial for both computational and statistical purposes, as it can imply, for instance, that optimization problems involving SWD\operatorname{SWD} are computationally tractable and that minimum-distance estimators using SWD\operatorname{SWD} exist almost surely. R2 says that the distance should accurately reproduce the original distance dd when restricted to the original mixture components, and therefore that SWD\operatorname{SWD} should capture the geometric structure induced by dd.

To identify a unique function satisfying R1 and R2, note that the set of such functions is closed under taking pointwise suprema. Indeed, the supremum of convex functions is convex [see, e.g., 32, Proposition IV.2.1.2], and given any set of functions satisfying R2, their supremum clearly satisfies R2 as well. For r(1),r(2)𝒮r^{(1)},r^{(2)}\in\mathcal{S}, we therefore define SWD\operatorname{SWD} by

SWD(r(1),r(2))supϕ(,): ϕ satisfies R1&R2ϕ(r(1),r(2)).\operatorname{SWD}(r^{(1)},r^{(2)})\coloneqq\sup_{\phi(\cdot,\cdot):\text{ $\phi$ satisfies {R1}\&{R2}}}\phi(r^{(1)},r^{(2)})\,. (9)

This function is the largest—most discriminative—quantity satisfying both R1 and R2.

Under 1, we can uniquely associate a mixture r=k=1KαkAkr=\sum_{k=1}^{K}\alpha_{k}A_{k} with the mixing measure 𝜶=k=1KαkδAk\boldsymbol{\alpha}=\sum_{k=1}^{K}\alpha_{k}\delta_{A_{k}}, which is a probability measure on the metric space (𝒫(𝒴),d)(\mathcal{P}(\mathcal{Y}),d). Our first main result shows that SWD\operatorname{SWD} agrees precisely with the Wasserstein distance on this space.

Theorem 1.

Under Assumption 1, for all r(1),r(2)𝒮r^{(1)},r^{(2)}\in\mathcal{S},

SWD(r(1),r(2))\displaystyle\operatorname{SWD}(r^{(1)},r^{(2)}) =W(𝜶(1),𝜶(2);d)\displaystyle=W(\boldsymbol{\alpha}^{(1)},\boldsymbol{\alpha}^{(2)};d)

where W(𝛂(1),𝛂(2);d)W(\boldsymbol{\alpha}^{(1)},\boldsymbol{\alpha}^{(2)};d) denotes the Wasserstein distance between the mixing measures 𝛂(1),𝛂(2)\boldsymbol{\alpha}^{(1)},\boldsymbol{\alpha}^{(2)} corresponding to r(1),r(2)r^{(1)},r^{(2)}, respectively.

Proof.

The proof can be found in Section B.1.1. ∎

Theorem 1 reveals a surprising characterization of the Wasserstein distances for mixture models. On a convex set of measures, the Wasserstein distance is uniquely specified as the largest jointly convex function taking prescribed values at the extreme points. This characterization gives an axiomatic justification of the Wasserstein distance for statistical applications involving mixtures.
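As a simple sanity check of this identity (a worked example, not a result taken from the paper), take K=2 and r^{(t)}=\alpha^{(t)}A_{1}+(1-\alpha^{(t)})A_{2} for t\in\{1,2\}: the optimal coupling moves exactly |\alpha^{(1)}-\alpha^{(2)}| of mass between the two atoms, so

\operatorname{SWD}(r^{(1)},r^{(2)})=W\bigl(\alpha^{(1)}\delta_{A_{1}}+(1-\alpha^{(1)})\delta_{A_{2}},\ \alpha^{(2)}\delta_{A_{1}}+(1-\alpha^{(2)})\delta_{A_{2}};d\bigr)=|\alpha^{(1)}-\alpha^{(2)}|\,d(A_{1},A_{2}),

in agreement with the characterization above.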

The study of the relationship between the Wasserstein distance on the mixing measures and the classical probability distances on the mixtures themselves was inaugurated by [49]. Clearly, if the Wasserstein distance between the mixing measures is zero, then the mixture distributions agree as well, but the converse is not generally true. This line of research has therefore sought conditions under which the Wasserstein distance on the mixing measures is comparable with a classical probability distance on the mixtures. In the context of mixtures of continuous densities on d\mathbb{R}^{d}, [49] showed that a strong identifiability condition guarantees that the total variation distance between the mixtures is bounded below by the squared Wasserstein distance between the mixing measures, as long as the number of atoms in the mixing measures is bounded. Subsequent refinements of this result (e.g., [33, 34]) obtained related comparison inequalities under weaker identifiability conditions.

This work adopts a different approach, viewing the Wasserstein distance between mixing measures as a bona fide metric, the Sketched Wasserstein Distance, on the mixture distributions themselves.

Our Theorem 1 can also be viewed as an extension of a similar result in the nonlinear Banach space literature: Weaver’s theorem [66, Theorem 3.3 and Lemma 3.5] identifies the norm defining the Arens–Eells space, a Banach space into which the Wasserstein space isometrically embeds, as the largest seminorm on the space of finitely supported signed measures which reproduces the Wasserstein distance on pairs of Dirac measures.

Corollary 1.

Under 1, (𝒮,SWD)(\mathcal{S},\operatorname{SWD}) is a complete metric space.

The proof appears in Appendix B.1.2.

Remark 1.

If we work with models that are not affinely independent, we pay a price: the quantity defined by (9) is no longer a metric. This happens even if Assumption 1 is only slightly relaxed to

Assumption 2.

For each k[K]k\in[K], Akconv(𝒜Ak)A_{k}\notin\mathrm{conv}(\mathcal{A}\setminus A_{k}).

The proof of Theorem 1 shows that under Assumption 2 we have, for all r(1),r(2)𝒮r^{(1)},r^{(2)}\in\mathcal{S}

SWD(r(1),r(2))\displaystyle\operatorname{SWD}(r^{(1)},r^{(2)}) =inf𝜶(1),𝜶(2)ΔK:r(i)=Aα(i),i{1,2}W(𝜶(1),𝜶(2);d)\displaystyle=\inf_{\boldsymbol{\alpha}^{(1)},\boldsymbol{\alpha}^{(2)}\in\Delta_{K}:r^{(i)}=A\alpha^{(i)},\,\,i\in\{1,2\}}W(\boldsymbol{\alpha}^{(1)},\boldsymbol{\alpha}^{(2)};d) (10)

However, SWD\operatorname{SWD} is only a semi-metric in this case: it is symmetric, and SWD(r(1),r(2))=0\operatorname{SWD}(r^{(1)},r^{(2)})=0 iff r(1)=r(2)r^{(1)}=r^{(2)}, but it no longer satisfies the triangle inequality in general, as we show by example in Appendix B.1.3.

4 Near-Optimal estimation of the Wasserstein distance between mixing measures in topic models

Finite sample upper bounds for estimators of W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d), when dd is either the total variation distance or the Wasserstein distance on discrete probability distributions supported on pp points have been recently established in [8]. In this section we investigate the optimality of these bounds, by deriving minimax lower bounds for estimators of this distance, a problem not previously studied in the literature.

For clarity of exposition, we begin with a review on estimation in topic models in Section 4.1, state existing upper bounds on estimation in Section 4.2, then prove the lower bound in Section 4.3.

4.1 Review of topic model identifiability conditions and of related estimation results

In this section, we begin by reviewing (i) model identifiability and the minimax optimal estimation of the mixing components AA, and (ii) estimation and finite sample error bounds for the mixture weights α(i)\alpha^{(i)} for i[n]i\in[n]. Necessary notation, used in the remainder of the paper, is also introduced here.

4.1.1 Identifiability and estimation of AA

Model identifiability in topic models is a sub-problem of the larger study of the existence and uniqueness of non-negative matrix factorization. The latter has been studied extensively in the past two decades. One popular set of identifiability conditions involves the following assumption on AA:

Assumption 3.

For each k[K]k\in[K], there exists at least one j[p]j\in[p] such that Ajk>0A_{jk}>0 and Ajk=0A_{jk^{\prime}}=0 for all kkk^{\prime}\neq k.

It is informally referred to as the anchor word assumption, as it postulates that every topic k[K]k\in[K] has at least one word j[p]j\in[p] solely associated with it. Assumption 3, in conjunction with rank(T)=K\mathrm{rank}(T)=K, ensures that both AA and TT are identifiable [25], thereby guaranteeing full model identifiability. In particular, Assumption 3 implies Assumption 1. Assumption 3 is sufficient, but not necessary, for model identifiability; see, e.g., [29]. Nevertheless, all known provably fast algorithms make constructive use of it. Furthermore, Assumption 3 improves the interpretability of the matrix factorization, and the extensive empirical study in [24] shows that it is supported by a wide range of data sets where topic models can be employed.

We therefore adopt this condition in our work. Since the paper [13], estimation of AA has been studied extensively from the Bayesian perspective (see [12] for a comprehensive review of this class of techniques). A recent line of papers [4, 3, 6, 61, 9, 10, 38, 67] has proposed computationally fast algorithms for estimating AA, from a frequentist point of view, under Assumption 3. Furthermore, minimax optimal estimation of AA under various metrics is also well understood [61, 9, 10, 38, 67]. In particular, under Assumption 3, letting mm denote the number of words jj that satisfy Assumption 3, the work of [9] established the following minimax lower bound

infA^supAΘA𝔼A^A1,pKnN,\inf_{\widehat{A}}\sup_{A\in\Theta_{A}}\mathbb{E}\|\widehat{A}-A\|_{1,\infty}\gtrsim\sqrt{pK\over nN}, (11)

for all estimators A^\widehat{A}, over the parameter space

ΘA={A:AkΔp,k[K],minj[p]Ajc/p,Assumption 3 holds with mcp}\displaystyle\Theta_{A}=\{A:A_{k}\in\Delta_{p},\forall k\in[K],~{}\min_{j\in[p]}\|A_{j\cdot}\|_{\infty}\geq c^{\prime}/p,~{}\text{Assumption 3 holds with }m\leq c~{}p\} (12)

with c(0,1)c\in(0,1), c>0c^{\prime}>0 being absolute constants. Above, Ni=NN_{i}=N for all i[n]i\in[n] is assumed for convenience of notation, and will be adopted in this paper as well, while noting that all related theoretical results continue to hold in general.

The estimator A^\widehat{A} of AA proposed in [9] is shown to be minimax-rate adaptive, and computationally efficient. Under the conditions collected in [8, Appendix K.1], this estimator A^\widehat{A} satisfies

𝔼A^A1,pKlog(L)nN,\mathbb{E}\|\widehat{A}-A\|_{1,\infty}\lesssim\sqrt{pK\log(L)\over nN}, (13)

and is therefore minimax optimal up to a logarithmic factor of L:=npNL:=n\vee p\vee N. This log(L)\log(L) factor comes from a union bound over the whole corpus and all words. Under the same set of conditions, we also have the following normalized sup-norm control (see, [8, Theorem K.1])

𝔼[maxj[p]A^jAjAj]pKlog(L)nN.\mathbb{E}\left[\max_{j\in[p]}{\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{\infty}\over\|A_{j\cdot}\|_{\infty}}\right]\lesssim\sqrt{pK\log(L)\over nN}. (14)

We will use this estimator of AA as our running example for the remainder of this paper. For the convenience of the reader, Section A of [7] gives its detailed construction. All our theoretical results are still applicable when A^\widehat{A} is replaced by any other minimax-optimal estimator, such as those in [61, 10, 38, 67], coupled with a sup-norm control as in (14).

4.1.2 Mixture weight estimation

We review existing finite sample error bounds for the MLE-type estimator α^(i)\widehat{\alpha}^{(i)} defined above in (3), for i[n]i\in[n], arbitrarily fixed. Recall that r(i)=Aα(i)r^{(i)}=A\alpha^{(i)}. The estimator α^(i)\widehat{\alpha}^{(i)} is defined and analyzed for a generic A^\widehat{A} in [8], and the results pertain, in particular, to all estimators mentioned in Section 4.1. For brevity, we only state below the results for an estimator α^(i)\widehat{\alpha}^{(i)} calculated relative to the estimator A^\widehat{A} in [9].

The conditions under which α^(i)α(i)1\|\widehat{\alpha}^{(i)}-\alpha^{(i)}\|_{1} is analyzed are lengthy, and fully explained in [8]. To improve readability, while keeping this paper self-contained, we opted for stating them in Appendix G, and we introduce here only the minimal amount of notation that will be used first in Section 4.3, and also in the remainder of the paper.

Let 1τK1\leq\tau\leq K be any integer and consider the following parameter space associated with the sparse mixture weights,

Θα(τ)={αΔK:α0τ}.\Theta_{\alpha}(\tau)=\left\{\alpha\in\Delta_{K}:\|\alpha\|_{0}\leq\tau\right\}. (15)

For r=Aαr=A\alpha with any αΘα(τ)\alpha\in\Theta_{\alpha}(\tau), let XX denote the vector of observed frequencies of rr, with

J:=supp(X)={j[p]:Xj>0}.J:={\rm supp}(X)=\{j\in[p]:X_{j}>0\}.

Define

J¯:={j[p]:rj>0},J¯:={j[p]:rj>5log(p)/N}.\overline{J}:=\left\{j\in[p]:r_{j}>0\right\},\qquad\underline{J}:=\left\{j\in[p]:r_{j}>{5\log(p)/N}\right\}. (16)

These sets are chosen such that (see, for instance, [8, Lemma I.1])

{J¯JJ¯}12p1.\mathbb{P}\left\{\underline{J}\subseteq J\subseteq\overline{J}\right\}\geq 1-2p^{-1}.

Let Sα=supp(α)S_{\alpha}={\rm supp}(\alpha) and define

αmin:=minkSααk,ξmaxjJ¯maxkSαAjkkSαAjk.\alpha_{\min}:=\min_{k\in S_{\alpha}}\alpha_{k},\qquad\xi\coloneqq\max_{j\in\overline{J}}{\max_{k\notin S_{\alpha}}A_{jk}\over\sum_{k\in S_{\alpha}}A_{jk}}. (17)

Another important quantity appearing in the results of [8] is the restricted 11\ell_{1}\to\ell_{1} condition number of AA defined, for any integer k[K]k\in[K], as

κ(A,k)minS[K]:|S|kminv𝒞(S)Av1v1,\kappa(A,k)\coloneqq\min_{S\subseteq[K]:|S|\leq k}\min_{v\in\mathcal{C}(S)}{\|Av\|_{1}\over\|v\|_{1}}, (18)

with 𝒞(S){vK{0}:vS1vSc1}.\mathcal{C}(S)\coloneqq\{v\in\mathbb{R}^{K}\setminus\{0\}:\|v_{S}\|_{1}\geq\|v_{S^{c}}\|_{1}\}. Let

κ¯τ=minαΘα(τ)κ(AJ¯,α0),κ¯τ=minαΘα(τ)κ(AJ¯,α0).\displaystyle\overline{\kappa}_{\tau}=\min_{\alpha\in\Theta_{\alpha}(\tau)}\kappa(A_{\overline{J}},\|\alpha\|_{0}),\qquad\underline{\kappa}_{\tau}=\min_{\alpha\in\Theta_{\alpha}(\tau)}\kappa(A_{\underline{J}},\|\alpha\|_{0}). (19)

For future reference, we have

κ¯τκ¯τ,κ¯τκ¯τ,for any ττ.\underline{\kappa}_{\tau}\leq\overline{\kappa}_{\tau},\quad\overline{\kappa}_{\tau^{\prime}}\leq\overline{\kappa}_{\tau},\quad\text{for any $\tau\leq\tau^{\prime}$.} (20)
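The condition number in (18) is defined via a minimum over a cone and is not directly computable by a single convex program. As a rough numerical illustration only (not a procedure used in the paper), one can over-estimate it by Monte Carlo, since evaluating the ratio at sampled feasible directions can only exceed the minimum; the function name below is hypothetical.

```python
import numpy as np
from itertools import combinations

def kappa_upper_estimate(A, k, n_samples=5000, seed=0):
    """Crude Monte Carlo upper estimate of the restricted condition number kappa(A, k) in (18)."""
    rng = np.random.default_rng(seed)
    K = A.shape[1]
    best = np.inf
    for S in combinations(range(K), k):
        Sc = [j for j in range(K) if j not in S]
        for _ in range(n_samples):
            # draw v with ||v||_1 = 1 and ||v_S||_1 = t >= 1/2 >= 1 - t = ||v_{S^c}||_1
            t = rng.uniform(0.5, 1.0) if Sc else 1.0
            v = np.zeros(K)
            v[list(S)] = t * rng.dirichlet(np.ones(k)) * rng.choice([-1, 1], size=k)
            if Sc:
                v[Sc] = (1 - t) * rng.dirichlet(np.ones(len(Sc))) * rng.choice([-1, 1], size=len(Sc))
            best = min(best, np.abs(A @ v).sum())      # = ||Av||_1 / ||v||_1 since ||v||_1 = 1
    return best
```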

Under conditions in Appendix G for α(i)Θα(τ)\alpha^{(i)}\in\Theta_{\alpha}(\tau) and A^\widehat{A}, Theorems 9 & 10 in [8] show that

𝔼α^(i)α(i)11κ¯ττlog(L)N+1κ¯τ2pKlog(L)nN.\mathbb{E}\|\widehat{\alpha}^{(i)}-\alpha^{(i)}\|_{1}~{}\lesssim~{}{1\over\overline{\kappa}_{\tau}}\sqrt{\tau\log(L)\over N}+{1\over\overline{\kappa}_{\tau}^{2}}\sqrt{pK\log(L)\over nN}. (21)

The statement of the above result from [8, Theorems 9 & 10], included in Appendix G, is valid for any generic estimator of AA that is minimax adaptive up to a logarithmic factor in LL.

We note that the first term in (21) reflects the error of estimating the mixture weights if AA were known, whereas the second one reflects the error in estimating AA.

4.2 Review on finite sample upper bounds for W(α(i),α(j);d)W(\alpha^{(i)},\alpha^{(j)};d)

For any estimators α^(i),α^(j)ΔK\widehat{\alpha}^{(i)},\widehat{\alpha}^{(j)}\in\Delta_{K} and A^\widehat{A}, plug-in estimators W(𝜶^(i),𝜶^(j);d)W(\widehat{\boldsymbol{\alpha}}^{(i)},\widehat{\boldsymbol{\alpha}}^{(j)};d) of W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d) with 𝜶^(i)\widehat{\boldsymbol{\alpha}}^{(i)} and 𝜶^(j)\widehat{\boldsymbol{\alpha}}^{(j)} defined in (4) can be analyzed using the fact that

𝔼|W(𝜶^(i),𝜶^(j);d)W(𝜶(i),𝜶(j);d)|\displaystyle\mathbb{E}\left|W(\widehat{\boldsymbol{\alpha}}^{(i)},\widehat{\boldsymbol{\alpha}}^{(j)};d)-W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d)\right| 2ϵA+diam(𝒜)ϵα,A,\displaystyle~{}\leq~{}2\epsilon_{A}+\mathrm{diam}(\mathcal{A})~{}\epsilon_{\alpha,A}, (22)

with diam(𝒜)maxk,kd(Ak,Ak)\mathrm{diam}(\mathcal{A})\coloneqq\max_{k,k^{\prime}}d(A_{k},A_{k^{\prime}}), whenever the following hold

𝔼[α^(i)α(i)1+α^(j)α(j)1]2ϵα,A\mathbb{E}\left[\|\widehat{\alpha}^{(i)}-\alpha^{(i)}\|_{1}+\|\widehat{\alpha}^{(j)}-\alpha^{(j)}\|_{1}\right]\leq 2\epsilon_{\alpha,A} (23)

and

𝔼[maxk[K]d(A^k,Ak)]ϵA,\mathbb{E}\left[\max_{k\in[K]}d(\widehat{A}_{k},A_{k})\right]\leq\epsilon_{A}, (24)

for some deterministic sequences ϵα,A\epsilon_{\alpha,A} and ϵA\epsilon_{A}. The notation ϵα,A\epsilon_{\alpha,A} is mnemonic of the fact that the estimation of mixture weights depends on the estimation of AA.

Inequality (22) follows by several appropriate applications of the triangle inequality and, for completeness, we give the details in Section B.2.1. We also note that (22) holds for 𝜶^(i)\widehat{\boldsymbol{\alpha}}^{(i)} based on any estimator α^(i)ΔK\widehat{\alpha}^{(i)}\in\Delta_{K}.
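For intuition, here is a sketch of that chain (the full argument is in Section B.2.1). Writing \widehat{\boldsymbol{\alpha}}_{A}^{(i)}:=\sum_{k}\widehat{\alpha}_{k}^{(i)}\delta_{A_{k}} for the estimated weights placed on the true atoms,

|W(\widehat{\boldsymbol{\alpha}}^{(i)},\widehat{\boldsymbol{\alpha}}^{(j)};d)-W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d)| \leq W(\widehat{\boldsymbol{\alpha}}^{(i)},\widehat{\boldsymbol{\alpha}}_{A}^{(i)};d)+W(\widehat{\boldsymbol{\alpha}}^{(j)},\widehat{\boldsymbol{\alpha}}_{A}^{(j)};d)+W(\widehat{\boldsymbol{\alpha}}_{A}^{(i)},\boldsymbol{\alpha}^{(i)};d)+W(\widehat{\boldsymbol{\alpha}}_{A}^{(j)},\boldsymbol{\alpha}^{(j)};d).

The first two terms are each at most \max_{k}d(\widehat{A}_{k},A_{k}), by coupling each \widehat{A}_{k} with A_{k}, and the last two are each at most \mathrm{diam}(\mathcal{A})\,\|\widehat{\alpha}-\alpha\|_{1}/2, since the two measures share the atom set \{A_{k}\}; taking expectations and invoking (23) and (24) gives (22).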

When α^(i)\widehat{\alpha}^{(i)} and α^(j)\widehat{\alpha}^{(j)} are the MLE-type estimators given by (3), [8] used this strategy to bound the left hand side in (22) for any estimator A^\widehat{A}, and then for a minimax-optimal estimator of AA, when dd is either the total variation distance or the Wasserstein distance on Δp\Delta_{p}.

We reproduce the result [8, Corollary 13] below, but we state it in the slightly more general framework in which the distance dd on Δp\Delta_{p} only needs to satisfy

2d(u,v)Cduv1,for any u,vΔp2d(u,v)\leq C_{d}\|u-v\|_{1},\qquad\text{for any $u,v\in\Delta_{p}$} (25)

for some Cd>0C_{d}>0. This holds, for instance, if dd is an q\ell_{q} norm with q1q\geq 1 on p\mathbb{R}^{p} (with Cd=2C_{d}=2), or the Wasserstein distance on Δp\Delta_{p} (with Cd=diam(𝒴)C_{d}=\mathrm{diam}(\mathcal{Y})). Recall that ΘA\Theta_{A} and Θα(τ)\Theta_{\alpha}(\tau) are defined in (12) and (15). Further recall κ¯τ\overline{\kappa}_{\tau} from (19) and L=npNL=n\vee p\vee N.

Theorem 2 (Corollary 13, [8]).

Let α^(i)\widehat{\alpha}^{(i)} and α^(j)\widehat{\alpha}^{(j)} be obtained from (3) by using the estimator A^\widehat{A} in [9]. Grant conditions in Appendix G for α(i),α(j)Θα(τ)\alpha^{(i)},\alpha^{(j)}\in\Theta_{\alpha}(\tau) and A^\widehat{A}. Then for any dd satisfying (25) with Cd>0C_{d}>0, we have

𝔼|W(𝜶^(i),𝜶^(j);d)W(𝜶(i),𝜶(j);d)|Cdκ¯τ(τlog(L)N+1κ¯τpKlog(L)nN).\mathbb{E}\left|W(\widehat{\boldsymbol{\alpha}}^{(i)},\widehat{\boldsymbol{\alpha}}^{(j)};d)-W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d)\right|~{}\lesssim~{}{C_{d}\over\overline{\kappa}_{\tau}}\left(\sqrt{\tau\log(L)\over N}+{1\over\overline{\kappa}_{\tau}}\sqrt{pK\log(L)\over nN}\right).

In the next section we investigate the optimality of this upper bound by establishing a matching minimax lower bound on estimating W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d).

4.3 Minimax lower bounds for the Wasserstein distance between topic model mixing measures

Theorem 3 below establishes the first lower bound on estimating W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d), the Wasserstein distance between two mixing measures, under topic models. Our lower bounds are valid for any distance dd on Δp\Delta_{p} satisfying

cduv12d(u,v)Cduv1,for all u,vΔp,c_{d}\|u-v\|_{1}\leq 2d(u,v)\leq C_{d}\|u-v\|_{1},\qquad\text{for all }u,v\in\Delta_{p}, (26)

with some 0<cdCd<0<c_{d}\leq C_{d}<\infty.

Theorem 3.

Grant topic model assumptions and assume 1<τcN1<\tau\leq cN and pKc(nN)pK\leq c^{\prime}(nN) for some universal constants c,c>0c,c^{\prime}>0. Then, for any dd satisfying (26) with some 0<cdCd<0<c_{d}\leq C_{d}<\infty, we have

infW^supAΘAα(i),α(j)Θα(τ)𝔼|W^W(𝜶(i),𝜶(j);d)|cd2Cd(κ¯τ2τNlog2(τ)+pKnNlog2(p)).\displaystyle\begin{split}&\inf_{\widehat{W}}\sup_{\begin{subarray}{c}A\in\Theta_{A}\\ \alpha^{(i)},\alpha^{(j)}\in\Theta_{\alpha}(\tau)\end{subarray}}\mathbb{E}\left|\widehat{W}-W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d)\right|\gtrsim~{}{c_{d}^{2}\over C_{d}}\left(\overline{\kappa}_{\tau}^{2}\sqrt{\tau\over N\log^{2}(\tau)}+\sqrt{pK\over nN\log^{2}(p)}\right).\end{split}

Here the infimum is taken over all distance estimators.

Proof.

The proof can be found in Section B.2.2. ∎

The lower bound in Theorem 3 consists of two terms that quantify, respectively, the error of estimating the distance when the mixture components, A1,,AKA_{1},\ldots,A_{K}, are known, and that when the mixture weights, α(1),,α(n)\alpha^{(1)},\ldots,\alpha^{(n)}, are known. Combined with Theorem 2, this lower bound establishes that the simple plug-in procedure can lead to near-minimax optimal estimators of the Wasserstein distance between mixing measures in topic models, up to the 11\ell_{1}\to\ell_{1} condition numbers of AA, the factor Cd/cdC_{d}/c_{d}, and multiplicative logarithmic factors of LL.

Remark 2 (Optimal estimation of the mixing measure in Wasserstein distance).

The proof of Theorem 2 actually shows that the same upper bound holds for 𝔼W(𝜶^(i),𝜶(i);d)\mathbb{E}W(\widehat{\boldsymbol{\alpha}}^{(i)},\boldsymbol{\alpha}^{(i)};d); that is, that the rate in Theorem 2 is also achievable for the problem of estimating the mixing measure in Wasserstein distance. Our lower bound in Theorem 3 indicates that the plug-in estimator is also nearly minimax optimal for this task. Indeed, if we let 𝜶^(i)\widehat{\boldsymbol{\alpha}}^{(i)} and 𝜶^(j)\widehat{\boldsymbol{\alpha}}^{(j)} be arbitrary estimators of 𝜶(i)\boldsymbol{\alpha}^{(i)} and 𝜶(j)\boldsymbol{\alpha}^{(j)}, respectively, then

|W(𝜶^(i),𝜶^(j);d)W(𝜶(i),𝜶(j);d)|W(𝜶^(i),𝜶(i);d)+W(𝜶^(j),𝜶(j);d).|W(\widehat{\boldsymbol{\alpha}}^{(i)},\widehat{\boldsymbol{\alpha}}^{(j)};d)-W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d)|\leq W(\widehat{\boldsymbol{\alpha}}^{(i)},\boldsymbol{\alpha}^{(i)};d)+W(\widehat{\boldsymbol{\alpha}}^{(j)},\boldsymbol{\alpha}^{(j)};d)\,.

Therefore, the lower bound in Theorem 3 also implies a lower bound for the problem of estimating the measures 𝜶(i)\boldsymbol{\alpha}^{(i)} and 𝜶(j)\boldsymbol{\alpha}^{(j)} in Wasserstein distance. In fact, a slightly stronger lower bound, without logarithmic factors, holds for the latter problem. We give a simple argument establishing this improved lower bound in Appendix D.

Remark 3 (New minimax lower bounds for estimating a general metric DD on discrete probability measures).

The proof of Theorem 3 relies on a new result, stated in Theorem 10 of Appendix C, which is of interest on its own. It establishes minimax lower bounds for estimating a distance D(P,Q)D(P,Q) between two probability measures P,Q𝒫(𝒳)P,Q\in\mathcal{P}(\mathcal{X}) on a finite set 𝒳\mathcal{X} based on NN i.i.d. samples from PP and QQ. Our results are valid for any distance DD satisfying

cTV(P,Q)D(P,Q)CTV(P,Q),for all P,Q𝒫(𝒳),c_{*}~{}\textrm{TV}(P,Q)\leq D(P,Q)\leq C_{*}~{}\textrm{TV}(P,Q),\quad\text{for all }P,Q\in\mathcal{P}(\mathcal{X}), (27)

with some 0<cC<0<c_{*}\leq C_{*}<\infty, including the Wasserstein distance as a particular case.

Prior to our work, no such bound even for estimating the Wasserstein distance was known except in the special case where the metric space associated with the Wasserstein distance is a tree [5, 65]. Proving lower bounds on the rate of estimation of the Wasserstein distance for more general metrics is a long standing problem, and the first nearly-tight bounds in the continuous case were given only recently [50, 43].

More generally, the lower bound in Theorem 10 adds to the growing literature on minimax rates of functional estimation for discrete probability measures (see, e.g., [36, 68]). The rate we prove is smaller by a factor of log|𝒳|\sqrt{\log|\mathcal{X}|} than the optimal rate of estimation of the total variation distance [36]; on the other hand, our lower bounds apply to any metric which is bi-Lipschitz equivalent to the total variation distance. We therefore do not know whether this additional factor of log|𝒳|\sqrt{\log|\mathcal{X}|} is spurious or whether there in fact exist metrics in this class for which the estimation problem is strictly easier. At a technical level, our proof follows a strategy of [50] based on reduction to estimation in total variation, for which we prove a minimax lower bound based on the method of “fuzzy hypotheses” [62]. The novelty of our bounds involves the fact that we must design priors which exhibit a large multiplicative gap in the functional of interest, whereas prior techniques [36, 68] are only able to control the magnitude of the additive gap between the values of the functionals. Though additive control is enough to prove a lower bound for estimating total variation distance, this additional multiplicative control is necessary if we are to prove a bound for any metric satisfying (27).

5 Inference for the Wasserstein distance between mixing measures in topic models, under the General Regime

We follow the program outlined in the Introduction and begin by giving our general strategy towards obtaining a N\sqrt{N} asymptotic limit for a distance estimate. Throughout this section KK is considered fixed, and does not grow with nn or NN. However, the dictionary size pp is allowed to depend on both nn and NN. As in the Introduction, we let Ni=Nj=NN_{i}=N_{j}=N purely for notational convenience. We let α~(i)\widetilde{\alpha}^{(i)} and α~(j)\widetilde{\alpha}^{(j)} be (pseudo) estimators of α(i)\alpha^{(i)} and α(j)\alpha^{(j)}, that we will motivate below, and give formally in (41). Using the Kantorovich-Rubinstein dual formulation of W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d) (see, Eq. 61), we propose distance estimates of the type

W~:=supf^f(α~(i)α~(j)),\widetilde{W}:=\sup_{f\in\widehat{\mathcal{F}}}f^{\top}(\widetilde{\alpha}^{(i)}-\widetilde{\alpha}^{(j)}), (28)

for

^={fK:fkfd(A^k,A^),k,[K],f1=0}.\widehat{\mathcal{F}}=\{f\in\mathbb{R}^{K}:f_{k}-f_{\ell}\leq d(\widehat{A}_{k},\widehat{A}_{\ell}),~{}\forall k,\ell\in[K],~{}f_{1}=0\}. (29)

This set is constructed, for concreteness, relative to the estimator A^\widehat{A} described in Section 4.1; a computational sketch of (28)-(29) is given below. Proposition 1, stated below and proved in Section B.3.1, gives our general approach to inference for W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d) based on W~\widetilde{W}.
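A minimal computational sketch of (28)-(29), assuming the matrix of pairwise distances between the estimated mixture components and the weight estimates are available: the Kantorovich-Rubinstein dual is a linear program over the polytope in (29). The function and variable names below (dual_distance_estimate, D_hat) are illustrative, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def dual_distance_estimate(alpha_i, alpha_j, D_hat):
    """W_tilde in (28): maximize f^T (alpha_i - alpha_j) over the polytope in (29)."""
    K = len(alpha_i)
    rows, rhs = [], []
    for k in range(K):
        for l in range(K):
            if k != l:                                   # f_k - f_l <= D_hat[k, l] = d(Ahat_k, Ahat_l)
                row = np.zeros(K); row[k], row[l] = 1.0, -1.0
                rows.append(row); rhs.append(D_hat[k, l])
    A_eq = np.zeros((1, K)); A_eq[0, 0] = 1.0            # f_1 = 0 removes the additive constant
    res = linprog(-(np.asarray(alpha_i) - np.asarray(alpha_j)),
                  A_ub=np.array(rows), b_ub=np.array(rhs),
                  A_eq=A_eq, b_eq=[0.0], bounds=(None, None), method="highs")
    return -res.fun                                      # the distance estimate W_tilde
```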

Proposition 1.

Assume access to:

  (i) Estimators α~(i),α~(j)K\widetilde{\alpha}^{(i)},\widetilde{\alpha}^{(j)}\in\mathbb{R}^{K} that satisfy

    N((α~(i)α~(j))(α(i)α(j)))𝑑X(ij),as n,N.\sqrt{N}\Bigl{(}(\widetilde{\alpha}^{(i)}-\widetilde{\alpha}^{(j)})-(\alpha^{(i)}-\alpha^{(j)})\Bigr{)}\overset{d}{\to}X^{(ij)},\qquad\text{as }n,N\to\infty. (30)

    for X(ij)𝒩K(0,Q(ij))X^{(ij)}\sim\mathcal{N}_{K}(0,Q^{(ij)}), with some positive semi-definite covariance matrix Q(ij)Q^{(ij)}.

  (ii)

    Estimators A^\widehat{A} of AA such that ϵA\epsilon_{A} defined in (24) satisfies

    limn,NϵAN=0.\lim_{n,N\to\infty}\epsilon_{A}\sqrt{N}=0. (31)

Then, we have the following convergence in distribution as n,Nn,N\to\infty,

N(W~W(𝜶(i),𝜶(j);d))𝑑supfijfX(ij),\sqrt{N}\Bigl{(}\widetilde{W}-W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d)\Bigr{)}\overset{d}{\to}\sup_{f\in\mathcal{F}^{\prime}_{ij}}f^{\top}X^{(ij)}, (32)

where

ij:={fK:f(α(i)α(j))=W(𝜶(i),𝜶(j);d)}\mathcal{F}^{\prime}_{ij}:=\mathcal{F}\cap\left\{f\in\mathbb{R}^{K}:f^{\top}(\alpha^{(i)}-\alpha^{(j)})=W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d)\right\} (33)

with

={fK:fkfd(Ak,A),k,[K],f1=0}.\mathcal{F}=\{f\in\mathbb{R}^{K}:f_{k}-f_{\ell}\leq d(A_{k},A_{\ell}),~{}\forall k,\ell\in[K],~{}f_{1}=0\}. (34)

Proposition 1 shows that in order to obtain the desired N\sqrt{N} asymptotic limit (32) it is sufficient to control, separately, the estimation of the mixture components, in probability, and that of the mixture weight estimators, in distribution. Since estimation of AA is based on all nn documents, each of size NN, all the limits in Proposition 1 are taken over both nn and NN to ensure that the error in estimating AA becomes negligible in the distributional limit. For our running example of A^\widehat{A}, constructed in Section A of [7],

ϵA=O(pKlog(L)nN),\epsilon_{A}=O\left(\sqrt{pK\log(L)\over nN}\right),

for L:=npNL:=n\vee p\vee N, when dd is either the Total Variation distance, or the Wasserstein distance, as shown in [8].

The proof of Proposition 1 shows that, when (31) holds, as soon as (30) is established, the desired (32) follows by an application of the functional δ\delta-method, recognizing (see, Proposition 2 in Section B.3.1) that the function h:Kh:\mathbb{R}^{K}\to\mathbb{R} defined as h(x)=supffxh(x)=\sup_{f\in\mathcal{F}}f^{\top}x is Hadamard-directionally differentiable at u=α(i)α(j)u=\alpha^{(i)}-\alpha^{(j)} with derivative hu:Kh^{\prime}_{u}:\mathbb{R}^{K}\to\mathbb{R} defined as

hu(x)=supf:fu=h(u)fx=supfijfx.h^{\prime}_{u}(x)=\sup_{f\in\mathcal{F}:f^{\top}u=h(u)}f^{\top}x=\sup_{f\in\mathcal{F}^{\prime}_{ij}}f^{\top}x. (35)
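For illustration, consider the simplest case K=2. Then \mathcal{F} reduces to \{(0,f_2): |f_2|\leq d(A_1,A_2)\}, so that h(u)=d(A_1,A_2)|u_2|. Writing u=\alpha^{(i)}-\alpha^{(j)}, the derivative in (35) becomes h'_u(x)=d(A_1,A_2)\,\mathrm{sign}(u_2)\,x_2 when u_2\neq 0, and h'_u(x)=d(A_1,A_2)\,|x_2| when u_2=0. Consequently, the limit in (32) is a centered Gaussian away from the null \alpha^{(i)}=\alpha^{(j)}, and a folded Gaussian at the null.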

The last step of the proof has also been advocated in prior work [56, 26, 28], and closest to ours is [57]. This work estimates the Wasserstein distance between two discrete probability vectors of fixed dimension by the Wasserstein distance between their observed frequencies. In [57], the equivalent of (30) is the basic central limit theorem for the empirical estimates of the cell probabilities of a multinomial distribution. In topic models, under the Classical Regime, which in particular makes the unrealistic assumption, in the topic model context, that AA is known, one could take α~=α^A\widetilde{\alpha}=\widehat{\alpha}_{A}, the MLE of α\alpha. Then, invoking classical results, for instance, [1, Section 16.2], the convergence in (30) holds, with Q(ij)=Γ(i)+Γ(j)Q^{(ij)}=\Gamma^{(i)}+\Gamma^{(j)} given above in (6).

In contrast, constructing estimators of the potentially sparse mixture weights for which (30) holds, under the General Regime, in the context of topic models, requires special care, is a novel contribution to the literature, and is treated in the following section. Their analysis, coupled with Proposition 1, will be the basis of Theorem 5 of Section 5.2 in which we establish the asymptotic distribution of our proposed distance estimator in topic models.

5.1 Asymptotically normal estimators of sparse mixture weights in topic models, under the General Regime

We concentrate on one sample, and drop the superscript ii from all relevant quantities. To avoid any possibility of confusion, in this section we write α\alpha_{*} for the true population value of the mixture weights α\alpha.

As discussed in Section 2.2 of the Introduction, under the Classical Regime, the MLE of α\alpha_{*} is asymptotically normal and efficient, a fact which cannot be expected to hold under the General Regime for an estimator restricted to ΔK\Delta_{K}, especially when α\alpha_{*} is on the boundary of ΔK\Delta_{K}. We exploit the KKT conditions satisfied by α^\widehat{\alpha}, defined in (3), to motivate the need for its correction, leading to a final estimator that will be a “de-biased” version of α^\widehat{\alpha}, in a sense that will be made precise shortly.

Recall that α^\widehat{\alpha} is the maximizer, over ΔK\Delta_{K}, of jJXjlog(A^jα)\sum_{j\in J}X_{j}\log(\widehat{A}_{j\cdot}^{\top}\alpha), with J=supp(X)J={\rm supp}(X), for the observed frequency vector XΔpX\in\Delta_{p}. Let

r^=A^α^,withJ^:=supp(r^).\widehat{r}=\widehat{A}\widehat{\alpha},\qquad\text{with}\quad\widehat{J}:={\rm supp}(\widehat{r}~{}). (36)

We have JJ^J\subseteq\widehat{J} with probability equal to one. To see this, note that if, for some jJj\in J, we had r^j=A^jα^=0\widehat{r}_{j}=\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}=0, then the corresponding log-likelihood term Xjlog(A^jα^)X_{j}\log(\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}) would equal -\infty, so α^\widehat{\alpha} could not be a maximizer.

Observe that the KKT conditions associated with α^\widehat{\alpha}, for dual variables λK\lambda\in\mathbb{R}^{K} satisfying λk0\lambda_{k}\geq 0, λkα^k=0\lambda_{k}\widehat{\alpha}_{k}=0 for all k[K]k\in[K], and arising from the non-negativity constraints on α\alpha in the primal problem, imply that

0K\displaystyle 0_{K} =\displaystyle= jJ^XjA^jα^A^j𝟙K+λN.\displaystyle\sum_{j\in{\widehat{J}}}\frac{X_{j}}{\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}\widehat{A}_{j\cdot}-\mathds{1}_{K}+{\lambda\over N}. (37)

By adding and subtracting terms in (37) (see also the details of the proof of Theorem 9), we see that α^\widehat{\alpha} satisfies the following equality

NV~(α^α)=NΨ(α)NΨ(α^),\sqrt{N}~{}\widetilde{V}(\widehat{\alpha}-\alpha_{*})=\sqrt{N}~{}\Psi(\alpha_{*})-\sqrt{N}~{}\Psi(\widehat{\alpha}), (38)

where

V~:=jJ^XjA^jα^A^jαA^jA^j,Ψ(α):=jJ^XjA^jαA^jαA^j.\widetilde{V}:=\sum_{j\in\widehat{J}}{X_{j}\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}~{}\widehat{A}_{j}^{\top}\alpha_{*}}\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top},\qquad\Psi(\alpha):=\sum_{j\in\widehat{J}}{X_{j}-\widehat{A}_{j\cdot}^{\top}\alpha\over\widehat{A}_{j\cdot}^{\top}\alpha}\widehat{A}_{j\cdot}. (39)

On the basis of (38), standard asymptotic principles dictate that the asymptotic normality of α^\widehat{\alpha} would follow by establishing that V~\widetilde{V} converges in probability to a matrix that will contribute to the asymptotic covariance matrix, the term NΨ(α)\sqrt{N}\Psi(\alpha_{*}) converges to a Gaussian limit, and NΨ(α^)\sqrt{N}\Psi(\widehat{\alpha}) vanishes, in probability. The latter, however, cannot be expected to happen, in general, and it is this term that creates the first difficulty in the analysis. To see this, writing the right hand side in (37) as

jJ^XjA^jα^A^jα^A^jjJ^cA^j+λN=Ψ(α^)jJ^cA^j+λN,\displaystyle\sum_{j\in\widehat{J}}{X_{j}-\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}\widehat{A}_{j\cdot}-\sum_{j\in\widehat{J}^{c}}\widehat{A}_{j\cdot}+{\lambda\over N}=\Psi(\widehat{\alpha})-\sum_{j\in\widehat{J}^{c}}\widehat{A}_{j\cdot}+{\lambda\over N},

we obtain from (37) that

Ψ(α^)=λN+jJ^cA^j.\Psi(\widehat{\alpha})=-{\lambda\over N}+\sum_{j\in\widehat{J}^{c}}\widehat{A}_{j\cdot}. (40)

We note that, under the Classical Regime, J¯:={j:Ajα>0}=[p]\bar{J}:=\{j:A_{j\cdot}^{\top}\alpha_{*}>0\}=[p], which implies that, with probability tending to one, J=[p]J=[p] thus J^=[p]\widehat{J}=[p] and J^c=\widehat{J}^{c}=\emptyset. Consequently, Eq. 40 yields NΨ(α^)=𝒪(λ/N)\sqrt{N}\Psi(\widehat{\alpha})=\mathcal{O}_{\mathbb{P}}(\lambda/\sqrt{N}), which is expected to vanish asymptotically. This is indeed in line with classical analyses, in which the asymptotic bias term, Ψ(α^)\Psi(\widehat{\alpha}), is of order 𝒪(1/N)\mathcal{O}_{\mathbb{P}}(1/N).

However, in the General Regime, we do not expect this to happen as we allow J¯[p]\bar{J}\subset[p]. Since we show in Lemma 3 of Section B.3.4 that, with high probability, J¯=J^\bar{J}=\widehat{J}, we therefore have J^c\widehat{J}^{c}\neq\emptyset. Thus, one cannot ensure that NΨ(α^)\sqrt{N}{\Psi}(\widehat{\alpha}) vanishes asymptotically, because the last term in (40) is non-zero, and we do not have direct control on the dual variables λ\lambda either. It is worth mentioning that the usage of J^\widehat{J} is needed as the naive estimator J=supp(X)J={\rm supp}(X) of J¯\bar{J} is not consistent in general when p>Np>N, whereas J^\widehat{J} is; see, Lemma 3 of Section B.3.4.

The next natural step is to construct a new estimator, by removing the bias of α^\widehat{\alpha}. We let V^\widehat{V} be a matrix that will be defined shortly, and denote by V^+\widehat{V}^{+} its generalized inverse. Define

α~=α^+V^+Ψ(α^),\widetilde{\alpha}=\widehat{\alpha}+\widehat{V}^{+}\Psi(\widehat{\alpha}), (41)

and observe that it satisfies

α~α=(\bmIKV^+V~)(α^α)+V^+Ψ(α),\widetilde{\alpha}-\alpha_{*}=(\bm{I}_{K}-\widehat{V}^{+}\widetilde{V})(\widehat{\alpha}-\alpha_{*})+\widehat{V}^{+}\Psi(\alpha_{*}), (42)

a decomposition that no longer contains the possibly non-vanishing bias term Ψ(α^)\Psi(\widehat{\alpha}) in (38). The (lengthy) proof of Theorem 9 stated in [7] shows that, indeed, after appropriate scaling, the first term in the right hand side of (42) vanishes asymptotically, and the second one has a Gaussian limit, as soon as V^\widehat{V} is appropriately chosen.

As mentioned above, the choice of V^\widehat{V} is crucial for obtaining the desired asymptotic limit, and is given by

V^:=jJ^1A^jα^A^jA^j.\widehat{V}:=\sum_{j\in\widehat{J}}{1\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top}. (43)

This choice is motivated by using Σ:=Σ(i)\Sigma:=\Sigma^{(i)} defined in (8) as a benchmark for the asymptotic covariance matrix of the limit, as we would like the new estimator α~\widetilde{\alpha} not only to be valid in the General Regime, but also to retain, in the Classical Regime, the asymptotic efficiency of the MLE. Indeed, the proof of Theorem 9 shows that the asymptotic covariance matrix of NV^+Ψ(α)\sqrt{N}\widehat{V}^{+}\Psi(\alpha_{*}) is precisely Σ\Sigma. Notably, in view of the decomposition in (42), it would be tempting to use V~\widetilde{V}, given by (39), with α\alpha_{*} replaced by α^\widehat{\alpha}, instead of V^\widehat{V}. However, although this matrix has the desired asymptotic behavior, we find in our simulations that its finite sample performance is sub-par relative to that of V^\widehat{V}.

Our final estimator is therefore α~\widetilde{\alpha} given by (41), with V^\widehat{V} given by (43). We show below that this estimator is asymptotically normal. The construction of α~\widetilde{\alpha} in (41) has the flavor of a Newton–Raphson one-step correction of α^\widehat{\alpha} (see, [40]), relative to the estimating equation Ψ(α)=0\Psi(\alpha)=0. However, classical general results on the asymptotic distribution of one-step corrected estimators, such as Theorem 5.45 of [63], cannot be employed directly, chiefly because, when translated to our context, they are derived for a deterministic AA. In our problem, after determining the appropriate de-biasing quantity and the appropriate V^\widehat{V}, the main remaining difficulty in proving asymptotic normality lies in controlling the (scaled) terms in the right hand side of (42). A key challenge is the delicate interplay between A^\widehat{A} and α^\widehat{\alpha}, which are dependent, as they are both estimated from the same sample. This difficulty is further compounded in the General Regime, when not only pp, but also the entries of AA and α\alpha (and thereby the quantities ξ\xi, αmin\alpha_{\min}, κ¯K\overline{\kappa}_{K} and κ¯τ\underline{\kappa}_{\tau}, defined in Section 4.1.2), can grow with NN and nn. In this case, one needs careful control of how these quantities grow jointly.
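For readers who prefer a computational summary, the following short numpy sketch assembles (36), (39), (41) and (43) for a single document, assuming that A^\widehat{A} and the simplex-constrained MLE α^\widehat{\alpha} of (3) have already been computed; the function name and interface are ours, for illustration only.

import numpy as np

def debiased_weights(A_hat, alpha_hat, X):
    # A_hat: (p, K) estimated mixture components; alpha_hat: (K,) MLE over the simplex;
    # X: (p,) observed word frequencies for the document.
    r_hat = A_hat @ alpha_hat                        # (36)
    J_hat = r_hat > 0                                # estimated support of r
    A_J, X_J, r_J = A_hat[J_hat], X[J_hat], r_hat[J_hat]
    Psi = A_J.T @ ((X_J - r_J) / r_J)                # Psi(alpha_hat) from (39)
    V_hat = (A_J / r_J[:, None]).T @ A_J             # (43)
    return alpha_hat + np.linalg.pinv(V_hat) @ Psi   # the one-step correction (41)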

We state and prove results pertaining to this general situation in Section B.3.4: Theorem 8 is our most general result, proved for any estimator A^\widehat{A} whose estimation errors, defined by the left hand sides of (13) and (14), can be well controlled, in a sense made precise in the statement of Theorem 8. As an immediate consequence, Theorem 9 in Section B.3.4 establishes the asymptotic normality of α^\widehat{\alpha} under the specific control given by the right hand sides of (13) and (14). We recall that, for instance, these upper bounds are valid for the estimator given in Section A of [7].

We state below a version of our results, corresponding to an estimator of AA that satisfies (13) and (14), by making the following simplifying assumptions: the condition numbers κ¯K1,κ¯τ1\overline{\kappa}_{K}^{-1},\underline{\kappa}_{\tau}^{-1}, the signal strength ξ\xi, and |J¯J¯||\overline{J}\setminus\underline{J}| are bounded. The latter means that we assume that the number of positive entries in rr that fall below logp/N\log p/N is bounded. We also assume, in this simplified version of our results, that the magnitude of the entries of α\alpha does not grow with either NN or nn. We stress that we do not make any of these assumptions in Section B.3.4, and use them here for clarity of exposition only.

Recall that Σ:=Σ(i)\Sigma:=\Sigma^{(i)} is defined in (8). Let Σ+\Sigma^{+} be the Moore-Penrose inverse of Σ\Sigma and let Σ+12\Sigma^{+\frac{1}{2}} be its matrix square root.

Theorem 4.

Let A^\widehat{A} be any estimator such that (13) and (14) hold, for KK that does not grow with nn and NN. Assume log2(pn)=o(N)\log^{2}(p\vee n)=o(N) and plog(L)=o(n).p\log(L)=o(n).

Then

limn,NNΣ+12(α~α)𝑑𝒩K(0,[\bmIK1000]).\lim_{n,N\to\infty}\sqrt{N}~{}\Sigma^{+{1\over 2}}\left(\widetilde{\alpha}-\alpha_{*}\right)\overset{d}{\to}\mathcal{N}_{K}\left(0,\begin{bmatrix}\bm{I}_{K-1}&0\\ 0&0\end{bmatrix}\right). (44)

The proof of Theorem 4 is obtained as a direct consequence of the general Theorem 9, stated and proved in Section B.3.4. Theorem 4 holds under a mild requirement on NN, the document length, and under a requirement on nn, the number of documents, that is expected to hold in topic model contexts, where nn is typically (much) larger than the dictionary size pp.

Remark 4.

In line with classical theory, joint asymptotic normality of mixture weights estimators, as in Theorem 4, can only be established when KK does not grow with either NN or nn.

In problems in which KK is expected to grow with the ambient sample sizes, although joint asymptotic results such as Eq. 78 become ill-posed, one can still study the marginal distributions of α~\widetilde{\alpha}, over any subset of constant dimension. In particular, a straightforward modification of our analysis yields the limiting distribution of each component of α~\widetilde{\alpha}: for any k[K]k\in[K],

limn,NN/Σkk(α~kαk)𝑑𝒩(0,1).\lim_{n,N\to\infty}\sqrt{N/\Sigma_{kk}}\left(\widetilde{\alpha}_{k}-\alpha_{*k}\right)\overset{d}{\to}\mathcal{N}(0,1).

The above result can be used to construct confidence intervals for any entry of α\alpha_{*}, including those with zero values.
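As an illustration of how this marginal limit can be used, the sketch below constructs such an interval, assuming that α~\widetilde{\alpha} and a consistent plug-in estimate of Σ\Sigma (for instance of the form (49)) are available; the code is schematic and its names are ours.

import numpy as np
from scipy.stats import norm

def coordinate_ci(alpha_tilde, Sigma_hat, N, k, level=0.95):
    # Two-sided confidence interval for the k-th entry of alpha_*,
    # based on sqrt(N / Sigma_kk) * (alpha_tilde_k - alpha_*k) -> N(0, 1).
    z = norm.ppf(0.5 + level / 2.0)
    half_width = z * np.sqrt(Sigma_hat[k, k] / N)
    return alpha_tilde[k] - half_width, alpha_tilde[k] + half_width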

Remark 5 (Alternative estimation of the mixture weights).

Another natural estimator of the mixture weights is the weighted least squares estimators, defined as

α~LS:=argminαKD^1/2(XA^α)22\widetilde{\alpha}_{LS}:=\operatorname*{argmin}_{\alpha\in\mathbb{R}^{K}}\|\widehat{D}^{-1/2}(X-\widehat{A}\alpha)\|_{2}^{2} (45)

with D^:=diag(A^11,,A^p1).\widehat{D}:=\mathrm{diag}(\|\widehat{A}_{1\cdot}\|_{1},\ldots,\|\widehat{A}_{p\cdot}\|_{1}). The use of the pre-conditioner D^\widehat{D} in (45) is reminiscent of the definition of the normalized Laplacian in graph theory: for each j[p]j\in[p], it moderates the size of the jj-th entry of the estimated mixture components, across the KK mixtures.
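A minimal numpy sketch of (45) is given below, assuming A^\widehat{A} and the observed frequency vector XX are available and that the entries of A^\widehat{A} are nonnegative; the names are illustrative only.

import numpy as np

def wls_weights(A_hat, X):
    # Weighted least squares estimator (45) with pre-conditioner
    # D_hat = diag(||A_hat_{1.}||_1, ..., ||A_hat_{p.}||_1).
    d = A_hat.sum(axis=1)                  # row-wise l1 norms (entries are nonnegative)
    keep = d > 0                           # drop all-zero rows to avoid dividing by zero
    w = 1.0 / np.sqrt(d[keep])             # D_hat^{-1/2} on the retained rows
    alpha_ls, *_ = np.linalg.lstsq(A_hat[keep] * w[:, None], X[keep] * w, rcond=None)
    return alpha_ls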

The estimator α~LS\widetilde{\alpha}_{LS} can also be viewed as an (asymptotically) de-biased version of a restricted least squares estimator, obtained as in (45), but minimizing only over ΔK\Delta_{K} (see Remark 11 in Appendix H). It therefore has the same flavor as our proposed estimator α~\widetilde{\alpha}. In Theorem 14 of Appendix H we further prove that under suitable conditions, as n,Nn,N\to\infty,

NΣLS1/2(α~LSα)𝑑𝒩K(0,\bmIK).\sqrt{N}\Sigma_{LS}^{-1/2}\left(\widetilde{\alpha}_{LS}-\alpha_{*}\right)\overset{d}{\to}\mathcal{N}_{K}(0,\bm{I}_{K}).

for some matrix ΣLS\Sigma_{LS} that does not equal the covariance matrix Σ\Sigma given by (8).

The asymptotic normality of α~LS\widetilde{\alpha}_{LS}, together with Proposition 1, shows that inference for the distance between mixture weights can also be conducted relative to a distance estimate (28) based on α~LS\widetilde{\alpha}_{LS}. Note, however, that the potential sub-optimality of the limiting covariance matrix of this mixture weight estimator also affects the length of the confidence intervals for the Wasserstein distance; see our simulation results in Section F.3. We therefore recommend using the asymptotically de-biased, MLE-type estimator α~\widetilde{\alpha} analyzed in this section, with α~LS\widetilde{\alpha}_{LS} as a second-best option.

5.2 The limiting distribution of the proposed distance estimator

As a consequence of Proposition 1 and Theorem 9, we derive the limiting distribution of our proposed estimator of the distance W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d) in Theorem 5 below.

Let W~\widetilde{W} be defined as in (28), using A^\widehat{A} and the de-biased estimators α~(i),α~(j)\widetilde{\alpha}^{(i)},\widetilde{\alpha}^{(j)} in (41). Let

Zij𝒩K(0,Q(ij))Z_{ij}\sim\mathcal{N}_{K}(0,Q^{(ij)}) (46)

where we assume that

Q(ij)limn,N(Σ(i)+Σ(j))K×KQ^{(ij)}\coloneqq\lim_{n,N\to\infty}(\Sigma^{(i)}+\Sigma^{(j)})\in\mathbb{R}^{K\times K} (47)

exists with Σ(i)\Sigma^{(i)} defined in (8). Here we rely on the fact that KK is independent of NN and nn to define the limit, but allow the model parameters AA, α(i)\alpha^{(i)} and α(j)\alpha^{(j)} to depend on both nn and NN.

Theorem 5.

Grant conditions in Theorem 9 for α(i),α(j)Θα(τ)\alpha^{(i)},\alpha^{(j)}\in\Theta_{\alpha}(\tau). For any dd satisfying (25) with Cd=𝒪(1)C_{d}=\mathcal{O}(1), we have the following convergence in distribution, as n,Nn,N\to\infty,

N(W~W(𝜶(i),𝜶(j);d))𝑑supfijfZij\sqrt{N}\left(\widetilde{W}-W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d)\right)\overset{d}{\to}\sup_{f\in\mathcal{F}^{\prime}_{ij}}f^{\top}Z_{ij} (48)

with ij\mathcal{F}^{\prime}_{ij} defined in (33).

The proof of Theorem 5 immediately follows from Proposition 1 and Theorem 9 together with ϵACdA^A1,\epsilon_{A}\leq C_{d}\|\widehat{A}-A\|_{1,\infty} and (13). The requirement ϵAN0\epsilon_{A}\sqrt{N}\to 0 of Proposition 1 in this context reduces to plog(L)/n0p\log(L)/n\to 0, which holds when: (i) pp is fixed, and nn grows, a situation encountered when the dictionary size is not affected by the growing size of the corpus; (ii) pp is independent of NN but grows with, and slower than, nn, as we expect that not many words are added to an initially fixed dictionary as more documents are added to the corpus; (iii) p=p(N)<np=p(N)<n, which reflects the fact that the dictionary size can depend on the length of the document, but again should grow slower than the number of documents.

Our results in Theorem 5 can be readily generalized to the cases where NiN_{i} is different from NjN_{j}. We refer to Appendix E for the precise statement in this case.

5.3 Fully data driven inference for the Wasserstein distance between mixing measures

To utilize Theorem 5 in practice for inference on W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d), for instance for constructing confidence intervals or conducting hypothesis tests, we provide below a consistent estimator of the quantiles of the limiting distribution in Theorem 5.

Other possible data-driven inference strategies are bootstrap-based. The table below summarizes the pros and cons of these procedures, in contrast with our proposal, procedure (4). The symbol ×\times means that the procedure is not available in a certain scenario, \checkmark means that it is, and \checkmark* means that the procedure had the best performance in our experiments.

Procedure | Uses the form of the limiting distribution | Any α,βΔK\alpha,\beta\in\Delta_{K} | Only α=βΔK\alpha=\beta\in\Delta_{K}
(1) Classical bootstrap | No | ×\times | ×\times
(2) mm-out-of-NN bootstrap | No | \checkmark | \checkmark
(3) Derivative-based bootstrap | Yes | \checkmark | \checkmark
(4) Plug-in estimation (our proposal) | Yes | \checkmark* | \checkmark*
Performance for constructing confidence intervals (> means better): (4) > (2) for any α,βΔK\alpha,\beta\in\Delta_{K}; (3) \approx (4) > (2) when α=βΔK\alpha=\beta\in\Delta_{K}.

5.3.1 Bootstrap

The bootstrap [27] is a powerful tool for estimating a distribution. However, since the Hadamard-directional derivative of W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d) w.r.t. (α(i)α(j)\alpha^{(i)}-\alpha^{(j)}) is non-linear (see Eq. 35), [26] shows that the classical bootstrap is not consistent. The same paper also shows that this can be corrected by a version of the mm-out-of-NN bootstrap, with m/N0m/N\to 0 and mm\to\infty. Unfortunately, the optimal choice of mm is not known, which hampers its practical implementation, as suggested by our simulation results in Section F.7. Moreover, our simulation results in Section F.4 also show that the mm-out-of-NN bootstrap seems to have inferior finite sample performance compared to the other procedures described below.

An alternative to the mm-out-of-NN bootstrap is to use a derivative-based bootstrap [28] for estimating the limiting distribution in Theorem 5, by plugging the bootstrap samples into the Hadamard-directional derivative (HDD) on the right hand side of (48). As proved in [28] and pointed out by [57], this procedure is consistent at the null α(i)=α(j)\alpha^{(i)}=\alpha^{(j)} in which case ij=\mathcal{F}^{\prime}_{ij}=\mathcal{F} (see (34) and (33)) hence the HDD is a known function. However, this procedure is not directly applicable when α(i)α(j)\alpha^{(i)}\neq\alpha^{(j)} since the HDD depends on ij\mathcal{F}^{\prime}_{ij} which needs to be consistently estimated. Nevertheless, we provide below a consistent estimator of ij\mathcal{F}^{\prime}_{ij} which can be readily used in conjunction with the derivative-based bootstrap. For the reader’s convenience, we provide details of both the mm-out-of-NN bootstrap and the derivative-based bootstrap in Section F.6.

5.3.2 A plug-in estimator for the limiting distribution of the distance estimate

In view of Theorem 5, we propose to replace the population level quantities that appear in the limit by their consistent estimates. Then, we estimate the cumulative distribution function of the limit via Monte Carlo simulations.

To be concrete, for a specified integer M>0M>0, let Zij(1),,Zij(M)Z_{ij}^{(1)},\ldots,Z_{ij}^{(M)} be i.i.d. samples from 𝒩K(0,Σ^(i)+Σ^(j))\mathcal{N}_{K}(0,\widehat{\Sigma}^{(i)}+\widehat{\Sigma}^{(j)}) where, for {i,j}\ell\in\{i,j\},

Σ^()=(jJ^()A^jA^jr^j())1α^()α^(),\widehat{\Sigma}^{(\ell)}=\Bigl{(}\sum_{j\in\widehat{J}^{(\ell)}}{\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top}\over\widehat{r}_{j}^{(\ell)}}\Bigr{)}^{-1}-\widehat{\alpha}^{(\ell)}\widehat{\alpha}^{(\ell)\top}, (49)

with r^()=A^α^()\widehat{r}^{(\ell)}=\widehat{A}\widehat{\alpha}^{(\ell)} and J^()=supp(r^())\widehat{J}^{(\ell)}={\rm supp}(\widehat{r}^{(\ell)}). We then propose to estimate the limiting distribution on the right hand side of (48) by the empirical distribution of

supf^δfZij(b),for b=1,,M,\sup_{f\in\widehat{\mathcal{F}}^{\prime}_{\delta}}f^{\top}Z_{ij}^{(b)},\qquad\text{for }~{}b=1,\ldots,M, (50)

where, for some tuning parameter δ0\delta\geq 0,

^δ=^{fK:|f(α^(i)α^(j))W(𝜶^(i),𝜶^(j);d)|δ}\widehat{\mathcal{F}}^{\prime}_{\delta}=\widehat{\mathcal{F}}\cap\left\{f\in\mathbb{R}^{K}:|f^{\top}(\widehat{\alpha}^{(i)}-\widehat{\alpha}^{(j)})-W(\widehat{\boldsymbol{\alpha}}^{(i)},\widehat{\boldsymbol{\alpha}}^{(j)};d)|\leq\delta\right\} (51)

with ^\widehat{\mathcal{F}} defined in (29) and 𝜶^(i),𝜶^(j)\widehat{\boldsymbol{\alpha}}^{(i)},\widehat{\boldsymbol{\alpha}}^{(j)} defined in (4).
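The Monte Carlo step can be carried out with a handful of linear programs, since each draw in (50) is again a maximization of a linear functional over a polytope. The sketch below is illustrative only: it assumes that Σ^(i),Σ^(j)\widehat{\Sigma}^{(i)},\widehat{\Sigma}^{(j)}, the matrix of pairwise distances d(\widehat{A}_k,\widehat{A}_\ell), the plug-in distance between the two estimated mixing measures and the tuning parameter δ\delta have already been computed, and all names are ours.

import numpy as np
from scipy.optimize import linprog

def limit_draws(alpha_i, alpha_j, Sigma_i, Sigma_j, D, W_plug, delta, M=1000, seed=0):
    # Draws (50): sup over F_hat'_delta of f' Z_b, with Z_b ~ N(0, Sigma_i + Sigma_j).
    # alpha_i, alpha_j: the weight estimates entering (51); D[k, l] estimates d(A_k, A_l);
    # W_plug: the plug-in distance appearing in (51).
    K = len(alpha_i)
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(K), Sigma_i + Sigma_j, size=M)
    rows, rhs = [], []
    for k in range(K):                       # constraints defining F_hat in (29)
        for l in range(K):
            if k != l:
                row = np.zeros(K)
                row[k], row[l] = 1.0, -1.0
                rows.append(row)
                rhs.append(D[k, l])
    u = alpha_i - alpha_j                    # extra constraints of (51)
    rows += [u, -u]
    rhs += [W_plug + delta, delta - W_plug]
    A_ub, b_ub = np.vstack(rows), np.array(rhs)
    A_eq = np.zeros((1, K)); A_eq[0, 0] = 1.0    # f_1 = 0
    draws = np.empty(M)
    for b in range(M):
        res = linprog(-Z[b], A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=np.zeros(1),
                      bounds=[(None, None)] * K, method="highs")
        draws[b] = -res.fun
    return draws

Empirical quantiles of the returned draws, for instance np.quantile(draws, [t / 2, 1 - t / 2]), then estimate the quantiles of the limiting distribution in (48), which is how they are used for inference below.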

Let F^N,M\widehat{F}_{N,M} be the empirical cumulative distribution function (c.d.f.) of (50) and write FF for the c.d.f. of the limiting distribution on the right hand side of (48). The following theorem states that F^N,M(t)\widehat{F}_{N,M}(t) converges to F(t)F(t) in probability for all tt\in\mathbb{R}. Its proof is deferred to Section B.3.10.

Theorem 6.

Grant conditions in Theorem 4 for α(i)\alpha^{(i)} and α(j)\alpha^{(j)}. For any dd satisfying (25) with some constant Cd>0C_{d}>0, by choosing δlog(L)/N+plog(L)/(nN)\delta\asymp\sqrt{\log(L)/N}+\sqrt{p\log(L)/(nN)} in (51), we have, for any tt\in\mathbb{R},

|F^N,M(t)F(t)|=o(1),as M and n,N.|\widehat{F}_{N,M}(t)-F(t)|=o_{\mathbb{P}}(1),\quad\text{as }M\to\infty\text{ and }n,N\to\infty.
Remark 6 (Confidence intervals).

Theorem 6, in conjunction with [63, Lemma 21.2], ensures that any quantile of the limiting distribution on the right hand side of (48) can also be consistently estimated by the corresponding empirical quantile of (50), which, by Slutsky’s theorem, can then be used to construct confidence intervals for W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d). This is summarized in the following corollary. For any t(0,1)t\in(0,1), let F^N,M1(t)\widehat{F}_{N,M}^{-1}(t) denote the empirical tt-th quantile of (50).

Corollary 2.

Grant conditions in Theorem 6. For any t(0,1)t\in(0,1), we have

limn,NlimM{W~F^N,M1(1t/2)NW(𝜶(i),𝜶(j);d)W~F^N,M1(t/2)N}=1t.\lim_{n,N\to\infty}\lim_{M\to\infty}\mathbb{P}\left\{\widetilde{W}-{\widehat{F}_{N,M}^{-1}(1-t/2)\over\sqrt{N}}\leq W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d)\leq\widetilde{W}-{\widehat{F}_{N,M}^{-1}(t/2)\over\sqrt{N}}\right\}=1-t.
Remark 7 (Hypothesis testing at the null α(i)=α(j)\alpha^{(i)}=\alpha^{(j)}).

In applications where we are interested in inference for α(i)=α(j)\alpha^{(i)}=\alpha^{(j)}, the above plug-in procedure can be simplified. There is no need for the tuning parameter δ\delta, since one only needs to compute the empirical distribution of

supf^fZij(b),for b=1,,M.\sup_{f\in\widehat{\mathcal{F}}}f^{\top}Z_{ij}^{(b)},\qquad\text{for }~{}b=1,\ldots,M.
Remark 8 (The tuning parameter δ\delta).

The rate of δ\delta in Theorem 6 is stated for the MLE-type estimators of α(i)\alpha^{(i)} and α(j)\alpha^{(j)} as well as the estimator A^\widehat{A} of AA satisfying (13).

Recall ϵα,A\epsilon_{\alpha,A} and ϵA\epsilon_{A} from (23) and (24). For a generic estimator α^(i),α^(j)ΔK\widehat{\alpha}^{(i)},\widehat{\alpha}^{(j)}\in\Delta_{K} and A^\widehat{A}, Lemma 14 of Section B.3.10 requires the choice of δ\delta to satisfy δ6ϵA+2diam(𝒜)ϵα,A\delta\geq 6\epsilon_{A}+2\mathrm{diam}(\mathcal{A})\epsilon_{\alpha,A} and δ0\delta\to 0 as n,Nn,N\to\infty.

To ensure consistent estimation of the limiting c.d.f., we first need to estimate the set ij\mathcal{F}^{\prime}_{ij} given by (33) consistently, for instance, in the Hausdorff distance. Although consistency of the plug-in estimator ^\widehat{\mathcal{F}}, given by (29), of \mathcal{F} can be established with relative ease, proving the consistency of the plug-in estimator of the facet defined by f(α(i)α(j))=W(𝜶(i),𝜶(j);d)f^{\top}(\alpha^{(i)}-\alpha^{(j)})=W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d) requires extra care. We introduce δ\delta in (51) mainly for technical reasons related to this consistency proof. Our simulations reveal that simply setting δ=0\delta=0 yields good performance overall.

Supplement to “Estimation and inference for the Wasserstein distance between mixing measures in topic models” [7]. The supplement contains all the proofs, additional theoretical results, all numerical results, and a review of the error bound of the MLE-type estimators of the mixture weights.

References

  • [1] Agresti, A. (2013). Categorical Data Analysis, third ed. Wiley Series in Probability and Statistics. Wiley-Interscience, Hoboken, NJ.
  • [2] Andrews, D. W. K. (2002). Generalized method of moments estimation when a parameter is on a boundary. J. Bus. Econom. Statist. 20, 530–544.
  • [3] Arora, S., Ge, R., Halpern, Y., Mimno, D. M., Moitra, A., Sontag, D. A., Wu, Y. and Zhu, M. (2013). A practical algorithm for topic modeling with provable guarantees. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), JMLR Workshop and Conference Proceedings 28, 280–288.
  • [4] Arora, S., Ge, R. and Moitra, A. (2012). Learning topic models - going beyond SVD. In 53rd Annual IEEE Symposium on Foundations of Computer Science (FOCS 2012), 1–10. IEEE Computer Society.
  • [5] Ba, K. D., Nguyen, H. L., Nguyen, H. N. and Rubinfeld, R. (2011). Sublinear time algorithms for earth mover’s distance. Theory Comput. Syst. 48, 428–442.
  • [6] Bansal, T., Bhattacharyya, C. and Kannan, R. (2014). A provable SVD-based algorithm for learning topics in dominant admixture corpus. In Advances in Neural Information Processing Systems 27 (NIPS 2014), 1997–2005.
  • [7] Bing, X., Bunea, F. and Niles-Weed, J. (2024). Supplement to “Estimation and inference for the Wasserstein distance between mixing measures in topic models”.
  • [8] Bing, X., Bunea, F., Strimas-Mackey, S. and Wegkamp, M. (2022). Likelihood estimation of sparse topic distributions in topic models and its applications to Wasserstein document distance calculations. Ann. Statist. 50, 3307–3333.
  • [9] Bing, X., Bunea, F. and Wegkamp, M. (2020). A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics. Bernoulli 26, 1765–1796.
  • [10] Bing, X., Bunea, F. and Wegkamp, M. (2020). Optimal estimation of sparse topic models. J. Mach. Learn. Res. 21, Paper No. 177, 45 pp.
  • [11] Birch, M. W. (1964). A new proof of the Pearson-Fisher theorem. Ann. Math. Statist. 35, 817–824.
  • [12] Blei, D. M. (2012). Probabilistic topic models. Commun. ACM 55, 77–84.
  • [13] Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022.
  • [14] Bonnans, J. F. and Shapiro, A. (2000). Perturbation Analysis of Optimization Problems. Springer Series in Operations Research. Springer-Verlag, New York.
  • [15] Bravo González-Blas, C., Minnoye, L., Papasokrati, D., Aibar, S., Hulselmans, G., Christiaens, V., Davie, K., Wouters, J. and Aerts, S. (2019). cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat. Methods 16, 397–400.
  • [16] Cai, T. T. and Low, M. G. (2011). Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional. Ann. Statist. 39, 1012–1041.
  • [17] Chen, J. H. (1995). Optimal rate of convergence for finite mixture models. Ann. Statist. 23, 221–233.
  • [18] Chen, S., Rivaud, P., Park, J. H., Tsou, T., Charles, E., Haliburton, J. R., Pichiorri, F. and Thomson, M. (2020). Dissecting heterogeneous cell populations across drug and disease conditions with PopAlign. Proc. Natl. Acad. Sci. U.S.A. 117, 28784–28794.
  • [19] del Barrio, E., Giné, E. and Matrán, C. (1999). Central limit theorems for the Wasserstein distance between the empirical and the true distributions. Ann. Probab. 27, 1009–1071.
  • [20] del Barrio, E., González Sanz, A. and Loubes, J.-M. (2024). Central limit theorems for semi-discrete Wasserstein distances. Bernoulli 30, 554–580.
  • [21] del Barrio, E., González-Sanz, A. and Loubes, J.-M. (2024). Central limit theorems for general transportation costs. Ann. Inst. Henri Poincaré Probab. Stat. 60, 847–873.
  • [22] del Barrio, E., Gordaliza, P. and Loubes, J.-M. (2019). A central limit theorem for L_p transportation cost on the real line with application to fairness assessment in machine learning. Inf. Inference 8, 817–849.
  • [23] del Barrio, E. and Loubes, J.-M. (2019). Central limit theorems for empirical transportation cost in general dimension. Ann. Probab. 47, 926–951.
  • [24] Ding, W., Ishwar, P. and Saligrama, V. (2015). Most large topic models are approximately separable. In 2015 Information Theory and Applications Workshop (ITA), 199–203.
  • [25] Donoho, D. L. and Stodden, V. (2003). When does non-negative matrix factorization give a correct decomposition into parts? In Advances in Neural Information Processing Systems 16 (NIPS 2003), 1141–1148. MIT Press.
  • [26] Dümbgen, L. (1993). On nondifferentiable functions and the bootstrap. Probab. Theory Related Fields 95, 125–140.
  • [27] Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist. 7, 1–26.
  • [28] Fang, Z. and Santos, A. (2019). Inference on directionally differentiable functions. Rev. Econ. Stud. 86, 377–412.
  • [29] Fu, X., Huang, K., Sidiropoulos, N. D., Shi, Q. and Hong, M. (2019). Anchor-free correlated topic modeling. IEEE Trans. Pattern Anal. Mach. Intell. 41, 1056–1071.
  • [30] Han, Y., Jiao, J. and Weissman, T. (2020). Minimax estimation of divergences between discrete distributions. IEEE J. Sel. Areas Inf. Theory 1, 814–823.
  • [31] Heinrich, P. and Kahn, J. (2018). Strong identifiability and optimal minimax rates for finite mixture estimation. Ann. Statist. 46, 2844–2870.
  • [32] Hiriart-Urruty, J.-B. and Lemaréchal, C. (1993). Convex Analysis and Minimization Algorithms I. Grundlehren der mathematischen Wissenschaften 305. Springer-Verlag, Berlin.
  • [33] Ho, N. and Nguyen, X. (2016). Convergence rates of parameter estimation for some weakly identifiable finite mixtures. Ann. Statist. 44, 2726–2755.
  • [34] Ho, N. and Nguyen, X. (2016). On strong identifiability and convergence rates of parameter estimation in finite mixtures. Electron. J. Stat. 10, 271–307.
  • [35] Ho, N., Nguyen, X. and Ritov, Y. (2020). Robust estimation of mixing measures in finite mixture models. Bernoulli 26, 828–857.
  • [36] Jiao, J., Han, Y. and Weissman, T. (2018). Minimax estimation of the L_1 distance. IEEE Trans. Inf. Theory 64, 6672–6706.
  • [37] Kirk, W. and Shahzad, N. (2014). Fixed Point Theory in Distance Spaces. Springer, Cham.
  • [38] Klopp, O., Panov, M., Sigalla, S. and Tsybakov, A. B. (2023). Assigning topics to documents by successive projections. Ann. Statist. 51, 1989–2014.
  • [39] Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer Series in Statistics. Springer-Verlag, New York.
  • [40] Le Cam, L. and Lo Yang, G. (1990). Locally asymptotically normal families. In Asymptotics in Statistics: Some Basic Concepts, 52–98. Springer, New York, NY.
  • [41] Lepski, O., Nemirovski, A. and Spokoiny, V. (1999). On estimation of the L_r norm of a regression function. Probab. Theory Related Fields 113, 221–253.
  • [42] Leroux, B. G. (1992). Consistent estimation of a mixing distribution. Ann. Statist. 20, 1350–1360.
  • [43] Liang, T. (2019). On the minimax optimality of estimating the Wasserstein metric.
  • [44] Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates. I. Density estimates. Ann. Statist. 12, 351–357.
  • [45] MacEachern, S. N. and Müller, P. (1998). Estimating mixture of Dirichlet process models. J. Comput. Graph. Statist. 7, 223–238.
  • [46] Manole, T., Balakrishnan, S., Niles-Weed, J. and Wasserman, L. (2024). Plugin estimation of smooth optimal transport maps. Ann. Statist., to appear.
  • [47] Manole, T. and Khalili, A. (2021). Estimating the number of components in finite mixture models via the group-sort-fuse procedure. Ann. Statist. 49, 3043–3069.
  • [48] McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley Series in Probability and Statistics. Wiley-Interscience, New York.
  • [49] Nguyen, X. (2013). Convergence of latent mixing measures in finite and infinite mixture models. Ann. Statist. 41, 370–400.
  • [50] Niles-Weed, J. and Rigollet, P. (2022). Estimation of Wasserstein distances in the spiked transport model. Bernoulli 28, 2663–2688.
  • [51] Pearson, K. (1894). Contributions to the mathematical theory of evolution. Philos. Trans. R. Soc. A 185, 71–110.
  • [52] Rao, C. R. (1957). Maximum likelihood estimation for the multinomial distribution. Sankhyā 18, 139–148.
  • [53] Rao, C. R. (1958). Maximum likelihood estimation for the multinomial distribution with infinite number of cells. Sankhyā 20, 211–218.
  • [54] Recht, B., Ré, C., Tropp, J. A. and Bittorf, V. (2012). Factoring nonnegative matrices with linear programs. In Advances in Neural Information Processing Systems 25 (NIPS 2012), 1223–1231.
  • [55] Römisch, W. (2006). Delta method, infinite dimensional. In Encyclopedia of Statistical Sciences. John Wiley & Sons, Ltd.
  • [56] Shapiro, A. (1991). Asymptotic analysis of stochastic programs. Ann. Oper. Res. 30, 169–186.
  • [57] Sommerfeld, M. and Munk, A. (2018). Inference for empirical Wasserstein distances on finite spaces. J. R. Stat. Soc. Ser. B. Stat. Methodol. 80, 219–238.
  • [58] Takahashi, W. (1970). A convexity in metric space and nonexpansive mappings I. Kodai Math. Sem. Rep. 22, 142–149.
  • [59] Tameling, C., Sommerfeld, M. and Munk, A. (2019). Empirical optimal transport on countable metric spaces: distributional limits and statistical applications. Ann. Appl. Probab. 29, 2744–2781.
  • [60] Timan, A. F. (1994). Theory of Approximation of Functions of a Real Variable. Dover Publications, New York.
  • [61] Tracy Ke, Z. and Wang, M. (2024). Using SVD for topic modeling. J. Amer. Statist. Assoc. 119, 434–449.
  • [62] Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York.
  • [63] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics 3. Cambridge University Press, Cambridge.
  • [64] Villani, C. (2009). Optimal Transport. Grundlehren der mathematischen Wissenschaften 338. Springer-Verlag, Berlin.
  • [65] Wang, S., Cai, T. T. and Li, H. (2021). Optimal estimation of Wasserstein distance on a tree with an application to microbiome studies. J. Amer. Statist. Assoc. 116, 1237–1253.
  • [66] Weaver, N. (2018). Lipschitz Algebras, second ed. World Scientific, Hackensack, NJ.
  • [67] Wu, R., Zhang, L. and Cai, T. T. (2023). Sparse topic modeling: computational efficiency, near-optimal algorithms, and statistical inference. J. Amer. Statist. Assoc. 118, 1849–1861.
  • [68] Wu, Y. and Yang, P. (2019). Chebyshev polynomials, moment matching, and optimal estimation of the unseen. Ann. Statist. 47, 857–883.
  • [69] Wu, Y. and Yang, P. (2020). Optimal estimation of Gaussian mixtures via denoised method of moments. Ann. Statist. 48, 1981–2007.
  • [70] Yu, Y., Wang, T. and Samworth, R. J. (2015). A useful variant of the Davis-Kahan theorem for statisticians. Biometrika 102, 315–323.

Appendix A contains the description of an estimator of AA. Appendix B contains most of the proofs. In Appendix C we state the minimax lower bounds for estimating any metric on discrete probability measures that is bi-Lipschitz equivalent to the Total Variation distance. In Appendix D we state minimax lower bounds for estimation of mixing measures in topic models. In Appendix E we derive the limiting distribution of the proposed W-distance estimator for two samples with different sample sizes. Appendix F contains all numerical results. In Appendix G we review the 1\ell_{1}-norm error bound of the MLE-type estimators of the mixture weights in [8]. Finally, Appendix H contains our theoretical guarantees for the W-distance estimator based on the weighted least squares estimator of the mixture weights.

Appendix A An algorithm for estimating the mixture component matrix AA

We recommend the following procedure for estimating the matrix of mixture components AA, typically referred to, in the topic model literature, as the word-topic matrix AA. Our procedure requires that Assumption 3 holds, and it consists of two parts: (a) estimation of the partition of anchor words, and (b) estimation of the word-topic matrix AA. Step (a) uses the procedure proposed in [9], stated in Algorithm 1, while step (b) uses the procedure proposed in [10], summarized in Algorithm 2.

Recall that \bmX=(X(1),,X(n))\bm X=(X^{(1)},\ldots,X^{(n)}) with NiN_{i} denoting the length of document ii. Define

Θ^=1ni=1n[NiNi1X(i)X(i)1Ni1diag(X(i))]\widehat{\Theta}={1\over n}\sum_{i=1}^{n}\left[{N_{i}\over N_{i}-1}X^{(i)}X^{(i)\top}-{1\over N_{i}-1}\textrm{diag}(X^{(i)})\right] (52)

and

R^=DX1Θ^DX1\widehat{R}=D_{X}^{-1}\widehat{\Theta}D_{X}^{-1} (53)

with DX=n1diag(\bmX𝟙n)D_{X}=n^{-1}\mathrm{diag}(\bm X\mathds{1}_{n}).
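For completeness, the following numpy sketch computes (52) and (53) from the observed frequency matrix; it assumes every document has length at least two and that words with zero empirical frequency have been removed, and the function name is ours.

import numpy as np

def cooccurrence_matrices(X, N):
    # X: (p, n) matrix whose i-th column is the word-frequency vector of document i;
    # N: (n,) vector of document lengths.
    p, n = X.shape
    Theta = np.zeros((p, p))
    for i in range(n):
        x = X[:, i]
        Theta += (N[i] / (N[i] - 1.0)) * np.outer(x, x) - np.diag(x) / (N[i] - 1.0)
    Theta /= n                              # Theta_hat in (52)
    d = X.sum(axis=1) / n                   # diagonal of D_X
    R = Theta / np.outer(d, d)              # R_hat = D_X^{-1} Theta_hat D_X^{-1} in (53)
    return Theta, d, R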

A.1 Estimation of the index set of the anchor words, its partition and the number of topics

We write the set of anchor words as I=k[K]IkI=\cup_{k\in[K]}I_{k} and its partition ={I1,,IK}\mathcal{I}=\{I_{1},\ldots,I_{K}\} where

Ik={j[p]:Ajk>0,Aj=0,k}.I_{k}=\{j\in[p]:A_{jk}>0,\ A_{j\ell}=0,\ \forall\ \ell\neq k\}.

Algorithm 1 estimates the index set II, its partition \mathcal{I} and the number of topics KK from the input matrix R^\widehat{R}. The choice C1=1.1C_{1}=1.1 is recommended and is empirically verified to be robust in [9]. A data-driven choice of δj\delta_{j\ell} is specified in [9] as

δ^j=n2\bmXj1\bmX1{η^j+2Θ^jlogMn[n\bmXj1(1ni=1n\bmXjiNi)12+n\bmX1(1ni=1n\bmXiNi)12]}\widehat{\delta}_{j\ell}={n^{2}\over\|\bm X_{j\cdot}\|_{1}\|\bm X_{\ell\cdot}\|_{1}}\!\left\{\widehat{\eta}_{j\ell}+2\widehat{\Theta}_{j\ell}\sqrt{\log M\over n}\!\left[\!\frac{n}{\|\bm X_{j\cdot}\|_{1}}\!\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\bm X_{ji}}{N_{i}}\right)^{{1\over 2}}\!\!\!+\!\frac{n}{\|\bm X_{\ell\cdot}\|_{1}}\!\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\bm X_{\ell i}}{N_{i}}\right)^{{1\over 2}}\right]\right\} (54)

with M=npmaxiNiM=n\vee p\vee\max_{i}N_{i} and

η^j=\displaystyle\widehat{\eta}_{j\ell}= 36(\bmXj12+\bmX12)logMn(1ni=1n\bmXji\bmXiNi)12+\displaystyle~{}3\sqrt{6}\left(\left\|\bm X_{j\cdot}\right\|_{\infty}^{1\over 2}+\left\|\bm X_{\ell\cdot}\right\|_{\infty}^{1\over 2}\right)\sqrt{\log M\over n}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\bm X_{ji}\bm X_{\ell i}}{N_{i}}\right)^{1\over 2}+ (55)
+2logMn(\bmXj+\bmX)1ni=1n1Ni+31(logM)4n(1ni=1n\bmXji+\bmXiNi3)12\displaystyle+{2\log M\over n}\left(\|\bm X_{j\cdot}\|_{\infty}+\|\bm X_{\ell\cdot}\|_{\infty}\right)\frac{1}{n}\sum_{i=1}^{n}{1\over N_{i}}+31\sqrt{(\log M)^{4}\over n}\left({1\over n}\sum_{i=1}^{n}{\bm X_{ji}+\bm X_{\ell i}\over N_{i}^{3}}\right)^{\frac{1}{2}}
Algorithm 1 Estimate the partition of the anchor words \mathcal{I} by ^\widehat{\mathcal{I}}
1:matrix R^p×p\widehat{R}\in\mathbb{R}^{p\times p}, C1C_{1} and Qp×pQ\in\mathbb{R}^{p\times p} such that Q[j,]:=C1δ^jQ[j,\ell]:=C_{1}\widehat{\delta}_{j\ell}
2:procedure FindAnchorWords(R^\widehat{R}, QQ)
3:     initialize ^=\widehat{\mathcal{I}}=\emptyset
4:     for i[p]i\in[p] do
5:         a^i=argmax1jpR^ij\widehat{a}_{i}=\operatorname*{argmax}_{1\leq j\leq p}\widehat{R}_{ij}
6:         set I^(i)={[p]:R^ia^iR^ilQ[i,a^i]+Q[i,]}\widehat{I}^{(i)}=\{\ell\in[p]:\widehat{R}_{i\widehat{a}_{i}}-\widehat{R}_{il}\leq Q[i,\widehat{a}_{i}]+Q[i,\ell]\} and Anchor(i)=True\textsc{Anchor}(i)=\textsc{True}
7:         for jI^(i)j\in\widehat{I}^{(i)} do
8:              a^j=argmax1kpR^jk\widehat{a}_{j}=\operatorname*{argmax}_{1\leq k\leq p}\widehat{R}_{jk}
9:              if |R^ijR^ja^j|>Q[i,j]+Q[j,a^j]\Bigl{|}\widehat{R}_{ij}-\widehat{R}_{j\widehat{a}_{j}}\Bigr{|}>Q[i,j]+Q[j,\widehat{a}_{j}] then
10:                  Anchor(i)=False\textsc{Anchor}(i)=\textsc{False}
11:                  break                        
12:         if Anchor(i)\textsc{Anchor}(i) then
13:              ^=Merge(I^(i)\widehat{\mathcal{I}}=\textsc{Merge}(\widehat{I}^{(i)}, ^\widehat{\mathcal{I}})               
14:     return ^={I^1,I^2,,I^K^}\widehat{\mathcal{I}}=\{\widehat{I}_{1},\widehat{I}_{2},\ldots,\widehat{I}_{\widehat{K}}\}
15:
16:procedure Merge(I^(i)\widehat{I}^{(i)}, ^\widehat{\mathcal{I}})
17:     for G^G\in\widehat{\mathcal{I}} do
18:         if GI^(i)G\cap\widehat{I}^{(i)}\neq\emptyset then
19:              replace GG in ^\widehat{\mathcal{I}} by GI^(i)G\cap\widehat{I}^{(i)}
20:              return ^\widehat{\mathcal{I}}               
21:     \widehat{\mathcal{I}}=\widehat{\mathcal{I}}\cup\{\widehat{I}^{(i)}\}
22:     return ^\widehat{\mathcal{I}}
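For concreteness, a short Python transcription of Algorithm 1 follows (a sketch under the notation above, not the reference implementation of [9]); here R stands for the matrix in (53) and Q[j, l] stores C_1 * delta_hat_{jl}.

import numpy as np

def find_anchor_words(R, Q):
    p = R.shape[0]
    partition = []                                  # current estimate of the partition
    for i in range(p):
        a_i = int(np.argmax(R[i]))
        I_i = {l for l in range(p)
               if R[i, a_i] - R[i, l] <= Q[i, a_i] + Q[i, l]}
        is_anchor = True
        for j in I_i:
            a_j = int(np.argmax(R[j]))
            if abs(R[i, j] - R[j, a_j]) > Q[i, j] + Q[j, a_j]:
                is_anchor = False
                break
        if is_anchor:
            partition = merge(I_i, partition)
    return partition                                # list of sets I_1, ..., I_{K_hat}

def merge(I_i, partition):
    # refine the first group that overlaps I_i; otherwise add I_i as a new group
    for idx, G in enumerate(partition):
        if G & I_i:
            partition[idx] = G & I_i
            return partition
    partition.append(I_i)
    return partition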

A.2 Estimation of the word-topic matrix AA with a given partition of anchor words

Given the estimated partition of anchor words ^={I^1,,I^K^}\widehat{\mathcal{I}}=\{\widehat{I}_{1},\ldots,\widehat{I}_{\widehat{K}}\} and its index set I^=k[K^]I^k\widehat{I}=\cup_{k\in[\widehat{K}]}\widehat{I}_{k}, Algorithm 2 below estimates the matrix AA.

[10] recommends setting λ=0\lambda=0 whenever M^\widehat{M} is invertible, and otherwise choosing λ\lambda large enough that M^+λ\bmIK\widehat{M}+\lambda\bm{I}_{K} is invertible. Specifically, [10] recommends choosing λ\lambda as

λ(t)=0.01tK(Klog(np)[miniI^(DX)ii]n1ni=1n1Ni)1/2.\lambda(t^{*})=0.01\cdot t^{*}\cdot K\left({K\log(n\vee p)\over[\min_{i\in\widehat{I}}(D_{X})_{ii}]n}\cdot\frac{1}{n}\sum_{i=1}^{n}{1\over N_{i}}\right)^{1/2}.\\

where

t=argmin{t{0,1,2,}:M^+λ(t)\bmIK is invertible}.t^{*}=\arg\min\left\{t\in\{0,1,2,\ldots\}:\,\widehat{M}+\lambda(t)\bm{I}_{K}\text{ is invertible}\right\}.
Algorithm 2 Sparse Topic Model solver (STM)
1:frequency data matrix \bmXp×n\bm X\in\mathbb{R}^{p\times n} with document lengths N1,,NnN_{1},\ldots,N_{n}; the partition of anchor words {I^1,,I^K^}\{\widehat{I}_{1},\ldots,\widehat{I}_{\widehat{K}}\} and its index set I^=k[K^]I^k\widehat{I}=\cup_{k\in[\widehat{K}]}\widehat{I}_{k}; the tuning parameter λ0\lambda\geq 0
2:procedure 
3:     compute DX=n1diag(\bmX𝟙n)D_{X}=n^{-1}\mathrm{diag}(\bm X\mathds{1}_{n}), Θ^\widehat{\Theta} from (52) and R^\widehat{R} from (53)
4:     compute B^I^\widehat{B}_{\widehat{I}\cdot} by B^i=𝐞k\widehat{B}_{i\cdot}=\mathbf{e}_{k} for each iI^ki\in\widehat{I}_{k} and k[K^]k\in[\widehat{K}]
5:     compute M^=B^I^+R^I^I^B^I^+\widehat{M}=\widehat{B}_{\widehat{I}\cdot}^{+}\widehat{R}_{\widehat{I}\widehat{I}}\widehat{B}_{\widehat{I}\cdot}^{+\top} and H^=B^I^+R^I^I^c\widehat{H}=\widehat{B}_{\widehat{I}\cdot}^{+}\widehat{R}_{\widehat{I}\widehat{I}^{c}} with B^I^+=(B^I^B^I^)1B^I^\widehat{B}_{\widehat{I}\cdot}^{+}=(\widehat{B}_{\widehat{I}\cdot}^{\top}\widehat{B}_{\widehat{I}\cdot})^{-1}\widehat{B}_{\widehat{I}\cdot}^{\top} and I^c=[p]I^\widehat{I}^{c}=[p]\setminus\widehat{I}
6:     solve B^I^c\widehat{B}_{\widehat{I}^{c}\cdot} from
B^j\displaystyle\widehat{B}_{j\cdot} =0,\displaystyle=0, if (DX)jj7log(np)n(1ni=1n1Ni),\displaystyle\quad\text{if }(D_{X})_{jj}\leq{7\log(n\vee p)\over n}\left({1\over n}\sum_{i=1}^{n}{1\over N_{i}}\right),
B^j\displaystyle\widehat{B}_{j\cdot} =argminβ0,β1=1β(M^+λ\bmIK)β2βh^(j),\displaystyle=\operatorname*{argmin}_{\beta\geq 0,\ \|\beta\|_{1}=1}\beta^{\top}(\widehat{M}+\lambda\bm{I}_{K})\beta-2\beta^{\top}\widehat{h}^{(j)}, otherwise,
for each jI^cj\in\widehat{I}^{c}, with h^(j)\widehat{h}^{(j)} being the corresponding column of H^\widehat{H}.
7:     compute A^\widehat{A} by normalizing DXB^D_{X}\widehat{B} to unit column sums
8:     return A^\widehat{A}
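The quadratic programs in Algorithm 2 are only K-dimensional and can be handed to any off-the-shelf solver. The following sketch of the whole procedure is our own illustration (it uses scipy's SLSQP routine for the simplex-constrained step; R, d_x and N are as in the earlier sketches, and partition is the output of find_anchor_words above). The data-driven lambda(t*) can then be obtained by increasing t until M_hat + lambda(t) I_K becomes numerically invertible.

import numpy as np
from scipy.optimize import minimize

def stm_solver(R, d_x, N, partition, lam=0.0):
    p = R.shape[0]
    K = len(partition)
    n = len(N)
    anchors = sorted(set().union(*partition))
    non_anchors = [j for j in range(p) if j not in set(anchors)]
    B = np.zeros((p, K))
    for k, group in enumerate(partition):
        for j in group:
            B[j, k] = 1.0                           # anchor rows of B are basis vectors
    B_I = B[anchors, :]
    B_plus = np.linalg.solve(B_I.T @ B_I, B_I.T)    # (B_I' B_I)^{-1} B_I'
    M = B_plus @ R[np.ix_(anchors, anchors)] @ B_plus.T
    H = B_plus @ R[np.ix_(anchors, non_anchors)]
    M_reg = M + lam * np.eye(K)
    thresh = 7.0 * np.log(max(n, p)) / n * np.mean(1.0 / np.asarray(N, dtype=float))
    for col, j in enumerate(non_anchors):
        if d_x[j] <= thresh:
            continue                                # B_j. = 0 for rare words
        h = H[:, col]
        # argmin over the simplex of beta' (M + lam I) beta - 2 beta' h
        res = minimize(lambda b, Mr=M_reg, hv=h: b @ Mr @ b - 2.0 * b @ hv,
                       x0=np.full(K, 1.0 / K),
                       jac=lambda b, Mr=M_reg, hv=h: 2.0 * Mr @ b - 2.0 * hv,
                       method="SLSQP",
                       bounds=[(0.0, None)] * K,
                       constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}])
        B[j, :] = res.x
    A = d_x[:, None] * B                            # D_X B
    return A / A.sum(axis=0, keepdims=True)         # normalize to unit column sums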

Appendix B Proofs

To avoid a proliferation of superscripts, we adopt simplified notation for r()=Aα()r^{(\ell)}=A\alpha^{(\ell)} with {i,j}\ell\in\{i,j\} and its related quantities throughout the proofs. In place of r(i)=Aα(i)r^{(i)}=A\alpha^{(i)} and r(j)=Aα(j)r^{(j)}=A\alpha^{(j)}, we consider instead

r=Aα,s=Aβr=A\alpha,\quad s=A\beta

with their corresponding mixing measures

𝜶=k=1KαkδAk,𝜷=k=1KβkδAk.\boldsymbol{\alpha}=\sum_{k=1}^{K}\alpha_{k}\delta_{A_{k}},\quad\boldsymbol{\beta}=\sum_{k=1}^{K}\beta_{k}\delta_{A_{k}}.

Similarly, we write 𝜶^,𝜷^\widehat{\boldsymbol{\alpha}},\widehat{\boldsymbol{\beta}} and 𝜶~,𝜷~\widetilde{\boldsymbol{\alpha}},\widetilde{\boldsymbol{\beta}} for 𝜶^(i),𝜶^(j)\widehat{\boldsymbol{\alpha}}^{(i)},\widehat{\boldsymbol{\alpha}}^{(j)} and 𝜶~(i),𝜶~(j)\widetilde{\boldsymbol{\alpha}}^{(i)},\widetilde{\boldsymbol{\alpha}}^{(j)}, respectively.

B.1 Proofs for Section 3

B.1.1 Proof of Theorem 1

Proof.

We will prove

SWD(r,s)\displaystyle\operatorname{SWD}(r,s) =infα,βΔK:r=Aα,s=AβinfγΓ(𝜶,𝜷),k[K]γkd(Ak,A)\displaystyle=\inf_{\alpha,\beta\in\Delta_{K}:r=A\alpha,s=A\beta}\inf_{\gamma\in\Gamma(\boldsymbol{\alpha},\boldsymbol{\beta})}\sum_{\ell,k\in[K]}\gamma_{k\ell}d(A_{k},A_{\ell})
=infα,βΔK:r=Aα,s=AβW(𝜶,𝜷;d)\displaystyle=\inf_{\alpha,\beta\in\Delta_{K}:r=A\alpha,s=A\beta}W(\boldsymbol{\alpha},\boldsymbol{\beta};d) (56)

under Assumption 2. In particular, Theorem 1 follows immediately under Assumption 1.

To prove (56), let us denote the right side of (56) by D(r,s)D(r,s) and write W(𝜶,𝜷):=W(𝜶,𝜷;d)W(\boldsymbol{\alpha},\boldsymbol{\beta}):=W(\boldsymbol{\alpha},\boldsymbol{\beta};d) for simplicity. We first show that DD satisfies R1 and R2. To show convexity, fix r,r,s,s𝒮r,r^{\prime},s,s^{\prime}\in\mathcal{S}, and let α,α,β,βΔK\alpha,\alpha^{\prime},\beta,\beta^{\prime}\in\Delta_{K} satisfy

r=Aαr=Aαs=Aβs=Aβ.\begin{split}r&=A\alpha\\ r^{\prime}&=A\alpha^{\prime}\\ s&=A\beta\\ s^{\prime}&=A\beta^{\prime}\,.\end{split} (57)

Then for any λ[0,1]\lambda\in[0,1], it is clear that λr+(1λ)r=A(λα+(1λ)α)\lambda r+(1-\lambda)r^{\prime}=A(\lambda\alpha+(1-\lambda)\alpha^{\prime}), and similarly for ss. Therefore

D(λr+(1λ)r,λs+(1λ)s)\displaystyle D(\lambda r+(1-\lambda)r^{\prime},\lambda s+(1-\lambda)s^{\prime}) W(λ𝜶+(1λ)𝜶,λ𝜷+(1λ)𝜷)\displaystyle\leq W(\lambda\boldsymbol{\alpha}+(1-\lambda)\boldsymbol{\alpha}^{\prime},\lambda\boldsymbol{\beta}+(1-\lambda)\boldsymbol{\beta}^{\prime})
λW(𝜶,𝜷)+(1λ)W(𝜶,𝜷),\displaystyle\leq\lambda W(\boldsymbol{\alpha},\boldsymbol{\beta})+(1-\lambda)W(\boldsymbol{\alpha}^{\prime},\boldsymbol{\beta}^{\prime})\,,

where the second inequality uses the convexity of the Wasserstein distance [64, Theorem 4.8].

Taking infima on both sides over α,α,β,βΔK\alpha,\alpha^{\prime},\beta,\beta^{\prime}\in\Delta_{K} satisfying Eq. 57 shows that D(λr+(1λ)r,λs+(1λ)s)λD(r,s)+(1λ)D(r,s)D(\lambda r+(1-\lambda)r^{\prime},\lambda s+(1-\lambda)s^{\prime})\leq\lambda D(r,s)+(1-\lambda)D(r^{\prime},s^{\prime}), which establishes that DD satisfies R1. To show that it satisfies R2, we note that Assumption 2 implies that if Ak=AαA_{k}=A\alpha, then we must have 𝜶=δAk\boldsymbol{\alpha}=\delta_{A_{k}}. Therefore

D(Ak,A)=infα,βΔK:Ak=Aα,A=AβW(𝜶,𝜷)=W(δAk,δA),D(A_{k},A_{\ell})=\inf_{\alpha,\beta\in\Delta_{K}:A_{k}=A\alpha,A_{\ell}=A\beta}W(\boldsymbol{\alpha},\boldsymbol{\beta})=W(\delta_{A_{k}},\delta_{A_{\ell}})\,, (58)

Recalling that the Wasserstein distance is defined as an infimum over couplings between the two marginal measures and using the fact that the only coupling between δAk\delta_{A_{k}} and δA\delta_{A_{\ell}} is δAk×δA\delta_{A_{k}}\times\delta_{A_{\ell}}, we obtain

W(δAk,δA)=d(A,A)d(δAk×δA)(A,A)=d(Ak,A),W(\delta_{A_{k}},\delta_{A_{\ell}})=\int d(A,A^{\prime})\,\mathrm{d}(\delta_{A_{k}}\times\delta_{A_{\ell}})(A,A^{\prime})=d(A_{k},A_{\ell})\,,

showing that DD also satisfies R2. As a consequence, by Definition (9), we obtain in particular that SWD(r,s)D(r,s)\operatorname{SWD}(r,s)\geq D(r,s) for all r,s𝒮r,s\in\mathcal{S}.

To show equality, it therefore suffices to show that SWD(r,s)D(r,s)\operatorname{SWD}(r,s)\leq D(r,s) for all r,s𝒮r,s\in\mathcal{S}. To do so, fix r,s𝒮r,s\in\mathcal{S} and let α,β\alpha,\beta satisfy r=Aαr=A\alpha and s=Aβs=A\beta. Let γΓ(𝜶,𝜷)\gamma\in\Gamma(\boldsymbol{\alpha},\boldsymbol{\beta}) be an arbitrary coupling between 𝜶\boldsymbol{\alpha} and 𝜷\boldsymbol{\beta}, which we identify with an element of ΔK×K\Delta_{K\times K}. The definition of Γ(𝜶,𝜷)\Gamma(\boldsymbol{\alpha},\boldsymbol{\beta}) implies that =1Kγk=αk\sum_{\ell=1}^{K}\gamma_{k\ell}=\alpha_{k} for all k[K]k\in[K], and since r=Aαr=A\alpha, we obtain r=,k[K]γkAkr=\sum_{\ell,k\in[K]}\gamma_{k\ell}A_{k}. Similarly, s=,k[K]γkAs=\sum_{\ell,k\in[K]}\gamma_{k\ell}A_{\ell}. The convexity of SWD\operatorname{SWD} (R1) therefore implies

SWD(r,s)=SWD(,k[K]γkAk,,k[K]γkA),k[K]γkSWD(Ak,A).\operatorname{SWD}(r,s)=\operatorname{SWD}\left(\sum_{\ell,k\in[K]}\gamma_{k\ell}A_{k},\sum_{\ell,k\in[K]}\gamma_{k\ell}A_{\ell}\right)\leq\sum_{\ell,k\in[K]}\gamma_{k\ell}\operatorname{SWD}(A_{k},A_{\ell})\,.

Since SWD\operatorname{SWD} agrees with dd on the original mixture components (R2), we further obtain

SWD(r,s),k[K]γkd(Ak,A).\operatorname{SWD}(r,s)\leq\sum_{\ell,k\in[K]}\gamma_{k\ell}\ d(A_{k},A_{\ell})\,.

Finally, we may take infima over all α,β\alpha,\beta satisfying r=Aαr=A\alpha and s=Aβs=A\beta and γΓ(𝜶,𝜷)\gamma\in\Gamma(\boldsymbol{\alpha},\boldsymbol{\beta}) to obtain

SWD(r,s)\displaystyle\operatorname{SWD}(r,s) infα,βΔK:r=Aα,s=AβinfγΓ(𝜶,𝜷),k[K]γkd(Ak,A)\displaystyle\leq\inf_{\alpha,\beta\in\Delta_{K}:r=A\alpha,s=A\beta}\inf_{\gamma\in\Gamma(\boldsymbol{\alpha},\boldsymbol{\beta})}\sum_{\ell,k\in[K]}\gamma_{k\ell}d(A_{k},A_{\ell})
=infα,βΔK:r=Aα,s=AβW(𝜶,𝜷)\displaystyle=\inf_{\alpha,\beta\in\Delta_{K}:r=A\alpha,s=A\beta}W(\boldsymbol{\alpha},\boldsymbol{\beta})
=D(r,s).\displaystyle=D(r,s)\,.

Therefore SWD(r,s)D(r,s)\operatorname{SWD}(r,s)\leq D(r,s), establishing that SWD(r,s)=D(r,s)\operatorname{SWD}(r,s)=D(r,s), as claimed. ∎
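Under Assumption 1 the weights alpha and beta in (56) are unique, so the characterization reduces to a single finite-dimensional optimal transport problem between the two mixing measures. As a minimal illustration of this computation (our own sketch, assuming the K x K cost matrix D with D[k, l] = d(A_k, A_l) has been precomputed):

import numpy as np
from scipy.optimize import linprog

def mixing_wasserstein(alpha, beta, D):
    # minimize sum_{k,l} gamma_{kl} D[k,l] over couplings gamma of alpha and beta
    K = len(alpha)
    c = np.asarray(D, dtype=float).reshape(-1)      # cost vector, gamma flattened row-major
    A_eq = np.zeros((2 * K, K * K))
    for k in range(K):
        A_eq[k, k * K:(k + 1) * K] = 1.0            # row sums:    sum_l gamma_{kl} = alpha_k
        A_eq[K + k, k::K] = 1.0                     # column sums: sum_l gamma_{lk} = beta_k
    b_eq = np.concatenate([alpha, beta])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, None)] * (K * K), method="highs")
    return res.fun                                  # W(alpha, beta; d)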

B.1.2 Proof of Corollary 1

Proof.

Under Assumption 1, the map ι:αr=Aα\iota:\alpha\mapsto r=A\alpha is a bijection between ΔK\Delta_{K} and 𝒮\mathcal{S}. Since (𝒳,d)(\mathcal{X},d) is a (finite) metric space, (ΔK,W)(\Delta_{K},W) is a complete metric space [64, Theorem 6.18], and Theorem 1 establishes that the map from (ΔK,W)(\Delta_{K},W) to (𝒮,SWD)(\mathcal{S},\operatorname{SWD}) induced by ι\iota is an isomorphism of metric spaces. In particular, (𝒮,SWD)(\mathcal{S},\operatorname{SWD}) is a complete metric space, as desired. ∎

B.1.3 A counterexample to Corollary 1 under Assumption 2

Let A1,,A4A_{1},\dots,A_{4} be probability measures on the set {1,2,3,4}\{1,2,3,4\}, represented as elements of Δ4\Delta_{4} as

A1\displaystyle A_{1} =(12,12,0,0)\displaystyle=(\frac{1}{2},\frac{1}{2},0,0)
A2\displaystyle A_{2} =(0,0,12,12)\displaystyle=(0,0,\frac{1}{2},\frac{1}{2})
A3\displaystyle A_{3} =(12,0,12,0)\displaystyle=(\frac{1}{2},0,\frac{1}{2},0)
A4\displaystyle A_{4} =(0,12,0,12).\displaystyle=(0,\frac{1}{2},0,\frac{1}{2}).

We equip 𝒳={A1,A2,A3,A4}\mathcal{X}=\{A_{1},A_{2},A_{3},A_{4}\} with any metric dd satisfying the following relations:

d(A1,A2),d(A3,A4)\displaystyle d(A_{1},A_{2}),d(A_{3},A_{4}) <1\displaystyle<1
d(A1,A3)\displaystyle d(A_{1},A_{3}) =1.\displaystyle=1\,.

These distributions satisfy Assumption 2 but not Assumption 1, since A1span(A2,A3,A4)A_{1}\in\mathrm{span}(A_{2},A_{3},A_{4}). Let r=A1r=A_{1}, t=A3t=A_{3}, and s=12(A1+A2)=12(A3+A4).s=\frac{1}{2}(A_{1}+A_{2})=\frac{1}{2}(A_{3}+A_{4}). Then SWD(r,t)=d(A1,A3)=1\operatorname{SWD}(r,t)=d(A_{1},A_{3})=1, but SWD(r,s)+SWD(s,t)=12d(A1,A2)+12d(A3,A4)<1\operatorname{SWD}(r,s)+\operatorname{SWD}(s,t)=\frac{1}{2}d(A_{1},A_{2})+\frac{1}{2}d(A_{3},A_{4})<1. Therefore, Assumption 2 alone is not strong enough to guarantee that SWD\operatorname{SWD} as defined in (10) is a metric.

B.2 Proofs for Section 4

B.2.1 Proof of the upper bound in Eq. 22

Proof.

Start with

𝔼|W(𝜶^,𝜷^;d)W(𝜶,𝜷;d)|𝔼W(𝜶^,𝜶;d)+𝔼W(𝜷^,𝜷;d),\mathbb{E}|W(\widehat{\boldsymbol{\alpha}},\widehat{\boldsymbol{\beta}};d)-W(\boldsymbol{\alpha},\boldsymbol{\beta};d)|~{}\leq~{}\mathbb{E}W(\widehat{\boldsymbol{\alpha}},\boldsymbol{\alpha};d)+\mathbb{E}W(\widehat{\boldsymbol{\beta}},\boldsymbol{\beta};d),

which follows by two applications of the triangle inequality. Let 𝜶~=k=1Kα^kδAk\widetilde{\boldsymbol{\alpha}}=\sum_{k=1}^{K}\widehat{\alpha}_{k}\delta_{A_{k}}. The triangle inequality yields

W(𝜶^,𝜶;d)W(𝜶^,𝜶~;d)+W(𝜶~,𝜶;d).W(\widehat{\boldsymbol{\alpha}},\boldsymbol{\alpha};d)\leq W(\widehat{\boldsymbol{\alpha}},\widetilde{\boldsymbol{\alpha}};d)+W(\widetilde{\boldsymbol{\alpha}},\boldsymbol{\alpha};d).

The result follows by noting that

𝔼W(𝜶^,𝜶~;d)𝔼(k=1Kα^kd(A^k,Ak))ϵA,\displaystyle\mathbb{E}W(\widehat{\boldsymbol{\alpha}},\widetilde{\boldsymbol{\alpha}};d)\leq\mathbb{E}(\sum_{k=1}^{K}\widehat{\alpha}_{k}d(\widehat{A}_{k},A_{k}))\leq\epsilon_{A},
𝔼W(𝜶~,𝜶;d)12diam(𝒜)𝔼α^α1\displaystyle\mathbb{E}W(\widetilde{\boldsymbol{\alpha}},\boldsymbol{\alpha};d)\leq{1\over 2}\mathrm{diam}(\mathcal{A})\mathbb{E}\|\widehat{\alpha}-\alpha\|_{1}

and using the same arguments for W(𝜷^,𝜷;d)W(\widehat{\boldsymbol{\beta}},\boldsymbol{\beta};d). ∎

B.2.2 Proof of Theorem 3

Proof.

We first prove a reduction scheme. Fix 1<τK1<\tau\leq K. Denote the parameter space of W(𝜶,𝜷;d)W(\boldsymbol{\alpha},\boldsymbol{\beta};d) as

ΘW={W(𝜶,𝜷;d):α,βΘα(τ),AΘA}.\Theta_{W}=\{W(\boldsymbol{\alpha},\boldsymbol{\beta};d):\alpha,\beta\in\Theta_{\alpha}(\tau),A\in\Theta_{A}\}.

Let 𝔼W\mathbb{E}_{W} denote the expectation with respect to the law corresponding to any WΘWW\in\Theta_{W}. We first notice that for any subset Θ¯WΘW\bar{\Theta}_{W}\subseteq\Theta_{W},

infW^supWΘW𝔼W|W^W|\displaystyle\inf_{\widehat{W}}\sup_{W\in\Theta_{W}}\mathbb{E}_{W}|\widehat{W}-W| infW^supWΘ¯W𝔼W|W^W|\displaystyle\geq\inf_{\widehat{W}}\sup_{W\in\bar{\Theta}_{W}}\mathbb{E}_{W}|\widehat{W}-W|
12infW^Θ¯WsupWΘ¯W𝔼W|W^W|.\displaystyle\geq{1\over 2}\inf_{\widehat{W}\in\bar{\Theta}_{W}}\sup_{W\in\bar{\Theta}_{W}}\mathbb{E}_{W}|\widehat{W}-W|. (59)

To see why the second inequality holds, for any Θ¯WΘW\bar{\Theta}_{W}\subseteq\Theta_{W}, pick any W¯Θ¯W\bar{W}\in\bar{\Theta}_{W}. By the triangle inequality,

supWΘ¯W𝔼W|W¯W|\displaystyle\sup_{W\in\bar{\Theta}_{W}}\mathbb{E}_{W}|\bar{W}-W| supWΘ¯W𝔼W|W^W¯|+supWΘ¯W𝔼W|W^W|.\displaystyle\leq\sup_{W\in\bar{\Theta}_{W}}\mathbb{E}_{W}|\widehat{W}-\bar{W}|+\sup_{W\in\bar{\Theta}_{W}}\mathbb{E}_{W}|\widehat{W}-W|.

Hence

infW¯Θ¯WsupWΘ¯W𝔼W|W¯W|\displaystyle\inf_{\bar{W}\in\bar{\Theta}_{W}}\sup_{W\in\bar{\Theta}_{W}}\mathbb{E}_{W}|\bar{W}-W| supWΘ¯WinfW¯Θ¯W𝔼W|W^W¯|+supWΘ¯W𝔼W|W^W|\displaystyle\leq\sup_{W\in\bar{\Theta}_{W}}\inf_{\bar{W}\in\bar{\Theta}_{W}}\mathbb{E}_{W}|\widehat{W}-\bar{W}|+\sup_{W\in\bar{\Theta}_{W}}\mathbb{E}_{W}|\widehat{W}-W|
2supWΘ¯W𝔼W|W^W|.\displaystyle\leq 2\sup_{W\in\bar{\Theta}_{W}}\mathbb{E}_{W}|\widehat{W}-W|.

Taking the infimum over W^\widehat{W} yields the desired inequality.

We then use (B.2.2) to prove each term in the lower bound separately by choosing different subsets Θ¯W\bar{\Theta}_{W}. For simplicity, we write 𝔼=𝔼W\mathbb{E}=\mathbb{E}_{W}. To prove the first term, let us fix AΘAA\in\Theta_{A}, some set S[K]S\subseteq[K] with |S|=τ|S|=\tau and choose

Θ¯W={W(𝜶,𝜷;d):αS,βSΔτ,αSc=βSc=0}.\bar{\Theta}_{W}=\{W(\boldsymbol{\alpha},\boldsymbol{\beta};d):\alpha_{S},\beta_{S}\in\Delta_{\tau},\alpha_{S^{c}}=\beta_{S^{c}}=0\}.

It then follows that

infW^supWΘW𝔼|W^W|\displaystyle\inf_{\widehat{W}}\sup_{W\in\Theta_{W}}\mathbb{E}|\widehat{W}-W| 12infW^supαS,βSΔτ𝔼|W^W(𝜶S,𝜷S;d)|.\displaystyle\geq{1\over 2}\inf_{\widehat{W}}\sup_{\alpha_{S},\beta_{S}\in\Delta_{\tau}}\mathbb{E}|\widehat{W}-W(\boldsymbol{\alpha}_{S},\boldsymbol{\beta}_{S};d)|.

Here we slightly abuse the notation by writing 𝜶S=kSαkδAk\boldsymbol{\alpha}_{S}=\sum_{k\in S}\alpha_{k}\delta_{A_{k}} and 𝜷S=kSβkδAk\boldsymbol{\beta}_{S}=\sum_{k\in S}\beta_{k}\delta_{A_{k}}. Note that Theorem 10 in Appendix C applies to an observation model where we have direct access to i.i.d. samples from the measures 𝜶S\boldsymbol{\alpha}_{S} and 𝜷S\boldsymbol{\beta}_{S} on known {Ak}kS\{A_{k}\}_{k\in S}. There exists a transition (in the sense of Le Cam [39]) from this model to the observation model where only samples from rr and ss are observed; therefore, a lower bound from Theorem 10 implies a lower bound for the above display. By noting that, for any αSβS\alpha_{S}\neq\beta_{S},

2W(𝜶S,𝜷S;d)αSβS1\displaystyle{2W(\boldsymbol{\alpha}_{S},\boldsymbol{\beta}_{S};d)\over\|\alpha_{S}-\beta_{S}\|_{1}} maxk,kSd(Ak,Ak)Cd2maxk,kSAkAk1Cd,\displaystyle\leq\max_{k,k^{\prime}\in S}d(A_{k},A_{k^{\prime}})\leq{C_{d}\over 2}\max_{k,k^{\prime}\in S}\|A_{k}-A_{k^{\prime}}\|_{1}\leq C_{d},
2W(𝜶S,𝜷S;d)αSβS1\displaystyle{2W(\boldsymbol{\alpha}_{S},\boldsymbol{\beta}_{S};d)\over\|\alpha_{S}-\beta_{S}\|_{1}} minkkSd(Ak,Ak)cd2minkkSAkAk1cdκ¯τ,\displaystyle\geq\min_{k\neq k^{\prime}\in S}d(A_{k},A_{k^{\prime}})\geq{c_{d}\over 2}\min_{k\neq k^{\prime}\in S}\|A_{k}-A_{k^{\prime}}\|_{1}\geq c_{d}~{}\overline{\kappa}_{\tau},

where the last inequality uses the definition in (18). An application of Theorem 10 with K=τK=\tau, c=κ¯τcdc_{*}=\overline{\kappa}_{\tau}c_{d} and C=CdC_{*}=C_{d} then yields the first term in the lower bound.

To prove the second term, fix α=𝐞1\alpha=\mathbf{e}_{1} and β=𝐞2\beta=\mathbf{e}_{2}, where {𝐞1,,𝐞K}\{\mathbf{e}_{1},\ldots,\mathbf{e}_{K}\} is the canonical basis of K\mathbb{R}^{K}. For any k[K]k\in[K], we write Ik[p]I_{k}\subset[p] as the index set of anchor words in the kkth topic under Assumption 3. The set of non-anchor words is defined as Ic=[p]II^{c}=[p]\setminus I with I=kIkI=\cup_{k}I_{k}. We have |Ic|=p|I|=pm|I^{c}|=p-|I|=p-m. Define

Θ¯W={W(𝜶,𝜷;d):AΘ¯A}\bar{\Theta}_{W}=\{W(\boldsymbol{\alpha},\boldsymbol{\beta};d):A\in\bar{\Theta}_{A}\}

where

Θ¯A:={AΘA:|I1|=|I2|,Ai1=Aj2=γ,for all iI1,jI2}\bar{\Theta}_{A}:=\{A\in\Theta_{A}:|I_{1}|=|I_{2}|,A_{i1}=A_{j2}=\gamma,\text{for all }i\in I_{1},j\in I_{2}\}

for some 0<γ1/|I1|0<\gamma\leq 1/|I_{1}| to be specified later. Using (59) we find that

infW^supWΘW𝔼|W^W|\displaystyle\inf_{\widehat{W}}\sup_{W\in\Theta_{W}}\mathbb{E}|\widehat{W}-W| 12infW^Θ¯WsupWΘ¯W𝔼|W^W(𝜶,𝜷;d)|\displaystyle\geq{1\over 2}\inf_{\widehat{W}\in\bar{\Theta}_{W}}\sup_{W\in\bar{\Theta}_{W}}\mathbb{E}|\widehat{W}-W(\boldsymbol{\alpha},\boldsymbol{\beta};d)|
=12infA^Θ¯AsupAΘ¯A𝔼|d(A^1,A^2)d(A1,A2)|.\displaystyle={1\over 2}\inf_{\widehat{A}\in\bar{\Theta}_{A}}\sup_{A\in\bar{\Theta}_{A}}\mathbb{E}\left|d(\widehat{A}_{1},\widehat{A}_{2})-d(A_{1},A_{2})\right|. (60)

By similar arguments as before, it suffices to prove a lower bound for a stronger observational model where one has samples of Multinomialp(M,A1)\text{Multinomial}_{p}(M,A_{1}) and Multinomialp(M,A2)\text{Multinomial}_{p}(M,A_{2}) with MnN/KM\asymp nN/K. Indeed, under the topic model assumptions, let {g1,,gK}\{g_{1},\ldots,g_{K}\} be a partition of [n][n] with ||gk||gk||1||g_{k}|-|g_{k^{\prime}}||\leq 1 for any kk[K]k\neq k^{\prime}\in[K]. If we treat the topic proportions α(1),,α(n)\alpha^{(1)},\ldots,\alpha^{(n)} as known and choose them such that α(i)=𝐞k\alpha^{(i)}=\mathbf{e}_{k} for any igki\in g_{k} and k[K]k\in[K], then for the purpose of estimating d(A1,A2)d(A_{1},A_{2}), this model is equivalent to the observation model where one has samples

Y^(k)Multinomialp(|gk|N,Ak),for k{1,2}.\widehat{Y}^{(k)}\sim\text{Multinomial}_{p}(|g_{k}|N,A_{k}),\quad\text{for }k\in\{1,2\}.

Here N|gk|nN/KN|g_{k}|\asymp nN/K. Then we aim to invoke Theorem 10 to bound (60) from below. To this end, first notice that

cdA1A212d(A1,A2)CdA1A21.c_{d}\|A_{1}-A_{2}\|_{1}\leq 2d(A_{1},A_{2})\leq C_{d}\|A_{1}-A_{2}\|_{1}.

By inspecting the proof of Theorem 10, and by choosing

γ=1|I1|+|Ic|=1|I1|+pm1p\gamma={1\over|I_{1}|+|I^{c}|}={1\over|I_{1}|+p-m}\asymp{1\over p}

under mcpm\leq cp, we can verify that Theorem 10 with c=cdc_{*}=c_{d} and C=CdC_{*}=C_{d} yields

infd^supAΘ¯A𝔼|d^d(A1,A2)|\displaystyle\inf_{\widehat{d}}\sup_{A\in\bar{\Theta}_{A}}\mathbb{E}\left|\widehat{d}-d(A_{1},A_{2})\right| cd2CdpKnNlog2(p),\displaystyle~{}\gtrsim~{}{c_{d}^{2}\over C_{d}}\sqrt{pK\over nN\log^{2}(p)},

completing the proof. Indeed, in the proof of Theorem 10 and under the notations therein, we can specify the following choice of parameters:

ρ=[0𝟙|I2|0𝟙|Ic|]γ,α=[𝟙|I1|00𝟙|Ic|]γ\displaystyle\rho=\begin{bmatrix}0\\ \mathds{1}_{|I_{2}|}\\ 0\\ \mathds{1}_{|I^{c}|}\end{bmatrix}\gamma,\qquad\alpha=\begin{bmatrix}\mathds{1}_{|I_{1}|}\\ 0\\ 0\\ \mathds{1}_{|I^{c}|}\end{bmatrix}\gamma

with the two priors of PP as

μ0\displaystyle\mu_{0} =Law(α+[0X1X2X|Ic|]ϵγ),μ1=Law(α+[0Y1Y2Y|Ic|]ϵγ).\displaystyle=\text{Law}\left(\alpha+\begin{bmatrix}0\\ X_{1}\\ X_{2}\\ \vdots\\ X_{|I^{c}|}\end{bmatrix}{\epsilon\gamma}\right),\qquad\mu_{1}=\text{Law}\left(\alpha+\begin{bmatrix}0\\ Y_{1}\\ Y_{2}\\ \vdots\\ Y_{|I^{c}|}\end{bmatrix}{\epsilon\gamma}\right).

Here XiX_{i} and YiY_{i} for i{1,,|Ic|}i\in\{1,\ldots,|I^{c}|\} are two sets of random variables constructed as in Proposition 3. Following the same arguments in the proof of Theorem 10 with γ1/p\gamma\asymp 1/p and |Ic|p|I^{c}|\asymp p yields the claimed result. ∎

B.3 Proofs for Section 5

B.3.1 Proof of Proposition 1

Proof.

We first recall the Kantorovich-Rubinstein dual formulation

W(𝜶,𝜷;d)=supff(αβ).\displaystyle W(\boldsymbol{\alpha},\boldsymbol{\beta};d)=\sup_{f\in\mathcal{F}}f^{\top}(\alpha-\beta). (61)

with \mathcal{F} defined in (34). By (61) and (28), we have the decomposition of

W~W(𝜶,𝜷;d)=I+II,\widetilde{W}-W(\boldsymbol{\alpha},\boldsymbol{\beta};d)=\mathrm{I}+\mathrm{II},

where

I\displaystyle\mathrm{I} =supf^f(α~β~)supff(α~β~),\displaystyle=\sup_{f\in\widehat{\mathcal{F}}}f^{\top}(\widetilde{\alpha}-\widetilde{\beta})-\sup_{f\in\mathcal{F}}f^{\top}(\widetilde{\alpha}-\widetilde{\beta}),
II\displaystyle\mathrm{II} =supff(α~β~)supff(αβ).\displaystyle=\sup_{f\in\mathcal{F}}f^{\top}(\widetilde{\alpha}-\widetilde{\beta})-\sup_{f\in\mathcal{F}}f^{\top}(\alpha-\beta).

We control these terms separately. In Lemmas 1 & 2 of Section B.3.3, we show that

IdH(,^)α~β~1ϵAα~β~1=(30)𝒪(ϵA),\mathrm{I}\leq d_{H}(\mathcal{F},\widehat{\mathcal{F}})~{}\|\widetilde{\alpha}-\widetilde{\beta}\|_{1}\leq\epsilon_{A}~{}\|\widetilde{\alpha}-\widetilde{\beta}\|_{1}\overset{\eqref{cond_distr}}{=}\mathcal{O}_{\mathbb{P}}(\epsilon_{A}),

where, for two subsets AA and BB of K\mathbb{R}^{K}, the Hausdorff distance is defined as

dH(A,B)=max{supaAinfbBab,supbBinfaAab}.d_{H}(A,B)=\max\left\{\sup_{a\in A}\inf_{b\in B}\|a-b\|_{\infty},~{}\sup_{b\in B}\inf_{a\in A}\|a-b\|_{\infty}\right\}\,.

Therefore for any estimator A^\widehat{A} for which NϵA0\sqrt{N}\epsilon_{A}\to 0, we have NI=o(1)\sqrt{N}\mathrm{I}=o_{\mathbb{P}}(1), and the asymptotic limit of the target equals the asymptotic limit of II\mathrm{II}, scaled by N\sqrt{N}. Under (30), by recognizing (as proved in Proposition 2) that the function h:Kh:\mathbb{R}^{K}\to\mathbb{R} defined as h(x)=supffxh(x)=\sup_{f\in\mathcal{F}}f^{\top}x is Hadamard-directionally differentiable at u=αβu=\alpha-\beta with derivative hu:Kh^{\prime}_{u}:\mathbb{R}^{K}\to\mathbb{R} defined as

hu(x)=supf:fu=h(u)fx.h^{\prime}_{u}(x)=\sup_{f\in\mathcal{F}:f^{\top}u=h(u)}f^{\top}x.

Eq. 32 follows by an application of the functional δ\delta-method in Theorem 7. The proof is complete. ∎
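To make the dual formulation (61) concrete: over the finite set \mathcal{F}, the supremum is a small linear program, maximizing f'(alpha - beta) subject to f_1 = 0 and f_k - f_l <= d(A_k, A_l) for all k, l. A hedged sketch of this computation (our own illustration, with the cost matrix D[k, l] = d(A_k, A_l) precomputed) is given below; the plug-in set \widehat{\mathcal{F}} in (29) corresponds to replacing D[k, l] by d(\widehat{A}_{k},\widehat{A}_{\ell}).

import numpy as np
from scipy.optimize import linprog

def wasserstein_dual(alpha, beta, D):
    # maximize f'(alpha - beta) over f with f_1 = 0 and f_k - f_l <= D[k, l]
    K = len(alpha)
    c = -(np.asarray(alpha, dtype=float) - np.asarray(beta, dtype=float))
    rows, rhs = [], []
    for k in range(K):
        for l in range(K):
            if k != l:
                row = np.zeros(K)
                row[k], row[l] = 1.0, -1.0
                rows.append(row)
                rhs.append(D[k, l])
    A_eq = np.zeros((1, K))
    A_eq[0, 0] = 1.0                                # pin f_1 = 0
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  A_eq=A_eq, b_eq=[0.0],
                  bounds=[(None, None)] * K, method="highs")
    return -res.fun                                 # value of the dual (61)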

The following theorem is a variant of the δ\delta-method that is suitable for Hadamard directionally differentiable functions.

Theorem 7 (Theorem 1, [55]).

Let ff be a function defined on a subset FF of d\mathbb{R}^{d} with values in \mathbb{R}, such that: (1) ff is Hadamard directionally differentiable at uFu\in F with derivative fu:Ff^{\prime}_{u}:F\to\mathbb{R}, and (2) there is a sequence of d\mathbb{R}^{d}-valued random variables XnX_{n} and a sequence of non-negative numbers ρn\rho_{n}\to\infty such that ρn(Xnu)𝑑X\rho_{n}(X_{n}-u)\overset{d}{\to}X for some random variable XX taking values in FF. Then, ρn(f(Xn)f(u))𝑑fu(X)\rho_{n}(f(X_{n})-f(u))\overset{d}{\to}f^{\prime}_{u}(X).

B.3.2 Hadamard directional derivative

To obtain distributional limits, we employ the notion of Hadamard directional differentiability. The differentiability of our functions of interest follows from well-known general results [see, e.g., 14, Section 4.3]; we give a self-contained proof for completeness.

Proposition 2.

Define a function f:df:\mathbb{R}^{d}\to\mathbb{R} by

f(x)=supc𝒞cx,f(x)=\sup_{c\in\mathcal{C}}c^{\top}x\,, (62)

for a convex, compact set 𝒞d\mathcal{C}\subseteq\mathbb{R}^{d}. Then for any udu\in\mathbb{R}^{d} and sequences tn0t_{n}\searrow 0 and hnhdh_{n}\to h\in\mathbb{R}^{d},

limnf(u+tnhn)f(u)tn=supc𝒞uch,\lim_{n\to\infty}\frac{f(u+t_{n}h_{n})-f(u)}{t_{n}}=\sup_{c\in\mathcal{C}_{u}}c^{\top}h\,, (63)

where 𝒞u{c𝒞:cu=f(u)}\mathcal{C}_{u}\coloneqq\{c\in\mathcal{C}:c^{\top}u=f(u)\}. In particular, ff is Hadamard-directionally differentiable.

Proof.

Denote the right side of Eq. 63 by gu(h)g_{u}(h). For any c𝒞uc\in\mathcal{C}_{u}, the definition of ff implies

f(u+tnhn)f(u)tnc(u+tnhn)cutn=chn.\frac{f(u+t_{n}h_{n})-f(u)}{t_{n}}\geq\frac{c^{\top}(u+t_{n}h_{n})-c^{\top}u}{t_{n}}=c^{\top}h_{n}\,. (64)

Taking limits of both sides, we obtain

lim infnf(u+tnhn)f(u)tnch,\liminf_{n\to\infty}\frac{f(u+t_{n}h_{n})-f(u)}{t_{n}}\geq c^{\top}h\,, (65)

and since c𝒞uc\in\mathcal{C}_{u} was arbitrary, we conclude that

lim infnf(u+tnhn)f(u)tngu(h).\liminf_{n\to\infty}\frac{f(u+t_{n}h_{n})-f(u)}{t_{n}}\geq g_{u}(h)\,. (66)

In the other direction, it suffices to show that any cluster point of

f(u+tnhn)f(u)tn\frac{f(u+t_{n}h_{n})-f(u)}{t_{n}}

is at most gu(h)g_{u}(h). For each nn, pick cn𝒞u+tnhnc_{n}\in\mathcal{C}_{u+t_{n}h_{n}}, which exists by compactness of 𝒞\mathcal{C}. By passing to a subsequence, we may assume that cnc¯𝒞c_{n}\to\bar{c}\in\mathcal{C}, and since

c¯u=limncn(u+tnhn)=limnf(u+tnhn)=f(u),\bar{c}^{\top}u=\lim_{n\to\infty}c_{n}^{\top}(u+t_{n}h_{n})=\lim_{n\to\infty}f(u+t_{n}h_{n})=f(u)\,, (67)

we obtain that c¯𝒞u\bar{c}\in\mathcal{C}_{u}. On this subsequence, we therefore have

lim supnf(u+tnhn)f(u)tnlimncn(u+tnhn)cnutn=c¯hgu(h),\limsup_{n\to\infty}\frac{f(u+t_{n}h_{n})-f(u)}{t_{n}}\leq\lim_{n\to\infty}\frac{c_{n}^{\top}(u+t_{n}h_{n})-c_{n}^{\top}u}{t_{n}}=\bar{c}^{\top}h\leq g_{u}(h)\,, (68)

proving the claim. ∎
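As a simple illustration of Proposition 2 (a worked special case added here only for intuition): take d=2 and \mathcal{C}=[-1,1]^{2}, so that f(x)=\sup_{c\in\mathcal{C}}c^{\top}x=|x_{1}|+|x_{2}|. At u=(1,0) the maximizing set is \mathcal{C}_{u}=\{1\}\times[-1,1], and (63) gives f^{\prime}_{u}(h)=h_{1}+|h_{2}|: the derivative is linear in the coordinate where the maximizer is unique and only positively homogeneous in the coordinate where it is not. This is exactly the type of non-linear directional derivative that the δ\delta-method of Theorem 7 accommodates.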

B.3.3 Lemmata used in the proof of Proposition 1

Lemma 1.

For any α,βK\alpha,\beta\in\mathbb{R}^{K} and any subsets A,BKA,B\subseteq\mathbb{R}^{K},

|supfAf(αβ)supfBf(αβ)|dH(A,B)αβ1.\left|\sup_{f\in A}f^{\top}(\alpha-\beta)-\sup_{f\in B}f^{\top}(\alpha-\beta)\right|\leq d_{H}(A,B)\|\alpha-\beta\|_{1}.
Proof.

Since the result trivially holds if either AA or BB is empty, we consider the case that AA and BB are non-empty sets. Pick any α,βK\alpha,\beta\in\mathbb{R}^{K}. Fix ε>0\varepsilon>0, and let fAf^{*}\in A be such that f(αβ)supfAf(αβ)εf^{*\top}(\alpha-\beta)\geq\sup_{f\in A}f^{\top}(\alpha-\beta)-\varepsilon. By definition of dH(A,B)d_{H}(A,B), there exists f^B\widehat{f}\in B such that f^fdH(A,B)+ε\|\widehat{f}-f^{*}\|_{\infty}\leq d_{H}(A,B)+\varepsilon. Then

supfAf(αβ)supfBf(αβ)\displaystyle\sup_{f\in A}f^{\top}(\alpha-\beta)-\sup_{f\in B}f^{\top}(\alpha-\beta) f(αβ)supfBf(αβ)+ε\displaystyle\leq f^{*\top}(\alpha-\beta)-\sup_{f\in B}f^{\top}(\alpha-\beta)+\varepsilon
(ff^)(αβ)+ε\displaystyle\leq(f^{*}-\widehat{f})^{\top}(\alpha-\beta)+\varepsilon
dH(A,B)αβ1+ε(1+αβ1).\displaystyle\leq d_{H}(A,B)\|\alpha-\beta\|_{1}+\varepsilon(1+\|\alpha-\beta\|_{1}).

Since ε>0\varepsilon>0 was arbitrary, we obtain

supfAf(αβ)supfBf(αβ)dH(A,B)αβ1.\sup_{f\in A}f^{\top}(\alpha-\beta)-\sup_{f\in B}f^{\top}(\alpha-\beta)\leq d_{H}(A,B)\|\alpha-\beta\|_{1}.

Analogous arguments also yield

supfBf(αβ)supfAf(αβ)dH(A,B)αβ1\sup_{f\in B}f^{\top}(\alpha-\beta)-\sup_{f\in A}f^{\top}(\alpha-\beta)\leq d_{H}(A,B)\|\alpha-\beta\|_{1}

completing the proof. ∎

Lemma 2.

Let ^\widehat{\mathcal{F}} and \mathcal{F} be defined in (29) and (34), respectively. Then

dH(^,)\displaystyle d_{H}(\widehat{\mathcal{F}},\mathcal{F}) maxk,k[K]|d(Ak,Ak)d(A^k,A^k)|.\displaystyle\leq\max_{k,k^{\prime}\in[K]}\left|d(A_{k},A_{k^{\prime}})-d(\widehat{A}_{k},\widehat{A}_{k^{\prime}})\right|.
Proof.

For two subsets AA and BB of K\mathbb{R}^{K}, we recall the Hausdorff distance

dH(A,B)=max{supaAinfbBab,supbBinfaAab}.d_{H}(A,B)=\max\left\{\sup_{a\in A}\inf_{b\in B}\|a-b\|_{\infty},~{}\sup_{b\in B}\inf_{a\in A}\|a-b\|_{\infty}\right\}\,.

For notational simplicity, we write

Ekk=|d(Ak,Ak)d¯(k,k)|,E_{kk^{\prime}}=\left|d(A_{k},A_{k^{\prime}})-\bar{d}(k,k^{\prime})\right|, (69)

with d¯(k,k)=d(A^k,A^k)\bar{d}(k,k^{\prime})=d(\widehat{A}_{k},\widehat{A}_{k^{\prime}}), for any k,k[K]k,k^{\prime}\in[K]. We first prove

supfinff^^f^fmaxk,k[K]|d(Ak,Ak)d¯(k,k)|=maxk,k[K]Ekk.\sup_{f\in\mathcal{F}}\inf_{\widehat{f}\in\widehat{\mathcal{F}}}\|\widehat{f}-f\|_{\infty}\leq\max_{k,k^{\prime}\in[K]}\left|d(A_{k},A_{k^{\prime}})-\bar{d}(k,k^{\prime})\right|=\max_{k,k^{\prime}\in[K]}E_{kk^{\prime}}. (70)

For any ff\in\mathcal{F}, we know that f1=0f_{1}=0 and fkfd(Ak,A)f_{k}-f_{\ell}\leq d(A_{k},A_{\ell}) for all k,[K]k,\ell\in[K]. Define

f^k=min[K]{d¯(k,)+f+E1},k[K].\widehat{f}_{k}=\min_{\ell\in[K]}\left\{\bar{d}(k,\ell)+f_{\ell}+E_{\ell 1}\right\},\qquad\forall~{}k\in[K]. (71)

It suffices to show f^^\widehat{f}\in\widehat{\mathcal{F}} and

f^fmaxk,k[K]Ekk.\|\widehat{f}-f\|_{\infty}\leq\max_{k,k^{\prime}\in[K]}E_{kk^{\prime}}. (72)

Clearly, for any k,k[K]k,k^{\prime}\in[K], the definition in Eq. 71 yields

f^kf^k\displaystyle\widehat{f}_{k}-\widehat{f}_{k^{\prime}} =min[K]{d¯(k,)+f+E1}min[K]{d¯(k,)+f+E1}\displaystyle=\min_{\ell\in[K]}\left\{\bar{d}(k,\ell)+f_{\ell}+E_{\ell 1}\right\}-\min_{\ell^{\prime}\in[K]}\left\{\bar{d}(k^{\prime},\ell^{\prime})+f_{\ell^{\prime}}+E_{\ell^{\prime}1}\right\}
=min[K]{d¯(k,)+f+E1}{d¯(k,)+f+E1}for some [K]\displaystyle=\min_{\ell\in[K]}\left\{\bar{d}(k,\ell)+f_{\ell}+E_{\ell 1}\right\}-\Bigl{\{}\bar{d}(k^{\prime},\ell^{*})+f_{\ell^{*}}+E_{\ell^{*}1}\Bigr{\}}\quad\quad\text{for some $\ell^{*}\in[K]$}
d¯(k,)d¯(k,)\displaystyle\leq\bar{d}(k,\ell^{*})-\bar{d}(k^{\prime},\ell^{*})
d¯(k,k).\displaystyle\leq\bar{d}(k,k^{\prime}).

Also notice that

f^1\displaystyle\widehat{f}_{1} =min{min[K]{1}{d¯(1,)+f+E1},d¯(1,1)+f1+E11}\displaystyle=\min\left\{\min_{\ell\in[K]\setminus\{1\}}\left\{\bar{d}(1,\ell)+f_{\ell}+E_{\ell 1}\right\},~{}\bar{d}(1,1)+f_{1}+E_{11}\right\}
=min{min[K]{1}{d¯(1,)+f+E1},0},\displaystyle=\min\left\{\min_{\ell\in[K]\setminus\{1\}}\left\{\bar{d}(1,\ell)+f_{\ell}+E_{\ell 1}\right\},~{}0\right\},

by using f1=0f_{1}=0 and E11=0E_{11}=0. Since, for any [K]{1}\ell\in[K]\setminus\{1\}, by the definition of \mathcal{F},

d¯(1,)+f+E1d¯(1,)d(A1,A)+E10,\bar{d}(1,\ell)+f_{\ell}+E_{\ell 1}\geq\bar{d}(1,\ell)-d(A_{1},A_{\ell})+E_{\ell 1}\geq 0,

we conclude f^1=0\widehat{f}_{1}=0, hence f^^\widehat{f}\in\widehat{\mathcal{F}}.

Finally, to show Eq. 72, we have f1=f^1=0f_{1}=\widehat{f}_{1}=0 and for any k[K]{1}k\in[K]\setminus\{1\},

f^kfk\displaystyle\widehat{f}_{k}-f_{k} =min[K]{d¯(k,)+f+E1fk}Ek1\displaystyle=\min_{\ell\in[K]}\left\{\bar{d}(k,\ell)+f_{\ell}+E_{\ell 1}-f_{k}\right\}\leq E_{k1}

and, conversely,

f^kfk\displaystyle\widehat{f}_{k}-f_{k} =min[K]{d¯(k,)+ffk+E1}\displaystyle=\min_{\ell\in[K]}\{\bar{d}(k,\ell)+f_{\ell}-f_{k}+E_{\ell 1}\}
min[K]{d¯(k,)d(Ak,A)+E1}\displaystyle\geq\min_{\ell\in[K]}\{\bar{d}(k,\ell)-d(A_{k},A_{\ell})+E_{\ell 1}\}
min[K]{Ek+E1}.\displaystyle\geq\min_{\ell\in[K]}\{-E_{k\ell}+E_{\ell 1}\}.

The last two displays imply

f^f\displaystyle\|\widehat{f}-f\|_{\infty} maxk[K]max{Ek1,max[K]{EkE1}}\displaystyle\leq\max_{k\in[K]}\max\left\{E_{k1},\max_{\ell\in[K]}\{E_{k\ell}-E_{\ell 1}\}\right\}
=maxk,[K]{EkE1}\displaystyle=\max_{k,\ell\in[K]}\{E_{k\ell}-E_{\ell 1}\}
maxk,Ek,\displaystyle\leq\max_{k,\ell}E_{k\ell},

completing the proof of Eq. 70.

By analogous arguments, we also have

maxf^^minff^fmaxk,k[K]Ekk,\max_{\widehat{f}\in\widehat{\mathcal{F}}}\min_{f\in\mathcal{F}}\|\widehat{f}-f\|_{\infty}\leq\max_{k,k^{\prime}\in[K]}E_{kk^{\prime}},

completing the proof. ∎

B.3.4 Asymptotic normality proofs

Recall the notation in Section 4.1. We first state our most general result, Theorem 8, for any estimator A^\widehat{A} that satisfies (138) – (140).

Theorem 8.

Assume |J¯J¯|=𝒪(1)|\overline{J}\setminus\underline{J}|=\mathcal{O}(1) and

log2(L)κ¯τ4κ¯K2(1ξ)3αmin31N0,as n,N.{\log^{2}(L)\over\underline{\kappa}_{\tau}^{4}\wedge\overline{\kappa}_{K}^{2}}{(1\vee\xi)^{3}\over\alpha_{\min}^{3}}{1\over N}\to 0,\qquad\text{as }n,N\to\infty. (73)

Let A^\widehat{A} be any estimator satisfying (138) – (140) with

2(1ξ)αminϵ,1{2(1\vee\xi)\over\alpha_{\min}}\epsilon_{\infty,\infty}\leq 1 (74)

and

Nκ¯K(1ξ)αminϵ1,0,as n,N.{\sqrt{N}\over\overline{\kappa}_{K}}{(1\vee\xi)\over\alpha_{\min}}\epsilon_{1,\infty}\to 0,\qquad\text{as }n,N\to\infty. (75)

Then as n,Nn,N\to\infty, we have

NΣ+12(α~α)𝑑𝒩K(0,[\bmIK1000]).\sqrt{N}~{}\Sigma^{+{1\over 2}}\left(\widetilde{\alpha}-\alpha\right)\overset{d}{\to}\mathcal{N}_{K}\left(0,\begin{bmatrix}\bm{I}_{K-1}&0\\ 0&0\end{bmatrix}\right).

Theorem 9 below is a particular case of the general Theorem 8 corresponding to estimators A^\widehat{A} satisfying (13) and (14). Its proof is an immediate consequence of the proof below, when one uses (13) and (14) in conjunction with (76) and (77).

Theorem 9.

Let A^\widehat{A} be any estimator such that (13) and (14) hold. For any αΘα(τ)\alpha_{*}\in\Theta_{\alpha}(\tau), assume |J¯J¯|=𝒪(1)|\overline{J}\setminus\underline{J}|=\mathcal{O}(1),

limn,Nlog2(L)κ¯τ4κ¯K2(1ξ)3αmin31N=0\lim_{n,N\to\infty}{\log^{2}(L)\over\underline{\kappa}_{\tau}^{4}\wedge\overline{\kappa}_{K}^{2}}{(1\vee\xi)^{3}\over\alpha_{\min}^{3}}{1\over N}=0 (76)

and

limn,Nlog(L)κ¯K2(1ξ)2αmin2pn=0.\lim_{n,N\to\infty}{\log(L)\over\overline{\kappa}_{K}^{2}}{(1\vee\xi)^{2}\over\alpha_{\min}^{2}}{p\over n}=0. (77)

Then we have

limn,NNΣ+12(α~α)𝑑𝒩K(0,[\bmIK1000]).\lim_{n,N\to\infty}\sqrt{N}~{}\Sigma^{+{1\over 2}}\left(\widetilde{\alpha}-\alpha_{*}\right)\overset{d}{\to}\mathcal{N}_{K}\left(0,\begin{bmatrix}\bm{I}_{K-1}&0\\ 0&0\end{bmatrix}\right). (78)
Remark 9 (Discussion of conditions of Theorem 9 and Theorem 4).

Quantities such as κ¯τ1,κ¯K1\underline{\kappa}_{\tau}^{-1},\overline{\kappa}_{K}^{-1}, ξ\xi and αmin1\alpha_{\min}^{-1} turn out to be important even in the finite sample analysis of α^α1\|\widehat{\alpha}-\alpha_{*}\|_{1}, as discussed in detail in [8]. The first two quantities are bounded from above in many cases; see, for instance, [8, Remarks 1 & 2] for a detailed discussion.

The assumptions of Theorem 9 can be greatly simplified when the quantities κ¯τ1,κ¯K1\underline{\kappa}_{\tau}^{-1},\overline{\kappa}_{K}^{-1}, ξ\xi are bounded. Further, assume that the order of magnitude of the mixture weights does not depend on nn or NN. Then, the smallest non-zero mixture weight will be bounded below by a constant, and αmin1=𝒪(1)\alpha_{\min}^{-1}=\mathcal{O}(1).

Under these simplifying assumptions, condition (76) reduces to log2(pn)=o(N)\log^{2}(p\vee n)=o(N), a mild restriction on NN, while condition (77) requires the sizes of pp and nn to satisfy plog(L)=o(n)p\log(L)=o(n). The latter typically holds in the topic model context, as the number of documents (nn) is usually (much) larger than the dictionary size (pp). The corresponding result is stated as Theorem 4, in Section 5.1 of the main document.

Proof.

We give below the proof of Theorem 8. Recall that XX is the vector of empirical frequencies of r=Aαr=A\alpha with J=supp(X)J={\rm supp}(X) and J^={j[p]:A^jα^>0}\widehat{J}=\{j\in[p]:\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}>0\}. The K.K.T. condition of (3) states that

𝟙K=jJXjA^jα^A^j+λN=jJ^XjA^jα^A^j+λN\mathds{1}_{K}=\sum_{j\in J}{X_{j}\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}\widehat{A}_{j\cdot}+{\lambda\over N}=\sum_{j\in\widehat{J}}{X_{j}\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}\widehat{A}_{j\cdot}+{\lambda\over N} (79)

with λk0\lambda_{k}\geq 0, λkα^k=0\lambda_{k}\widehat{\alpha}_{k}=0 for all k[K]k\in[K]. The last step in (79) uses the fact that JJ^J\subseteq\widehat{J} with probability one. Recall from (39) that

V~=jJ^XjA^jα^A^jαA^jA^j,Ψ(α)=jJ^XjA^jαA^jαA^j.\widetilde{V}=\sum_{j\in\widehat{J}}{X_{j}\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}~{}\widehat{A}_{j}^{\top}\alpha}\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top},\qquad\Psi(\alpha)=\sum_{j\in\widehat{J}}{X_{j}-\widehat{A}_{j\cdot}^{\top}\alpha\over\widehat{A}_{j\cdot}^{\top}\alpha}\widehat{A}_{j\cdot}.

Further recall from (40) that

Ψ(α^)=λN+jJ^cA^j.\Psi(\widehat{\alpha})=-{\lambda\over N}+\sum_{j\in\widehat{J}^{c}}\widehat{A}_{j\cdot}.

Adding and subtracting terms in (79) yields

V~(α^α)=jJ^XjA^jα^A^j(α^α)A^jαA^j\displaystyle\widetilde{V}(\widehat{\alpha}-\alpha)=\sum_{j\in\widehat{J}}{X_{j}\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}{\widehat{A}_{j\cdot}^{\top}(\widehat{\alpha}-\alpha)\over\widehat{A}_{j\cdot}^{\top}\alpha}\widehat{A}_{j\cdot} =Ψ(α)+λN+jJ^A^j𝟙K\displaystyle=\Psi(\alpha)+{\lambda\over N}+\sum_{j\in\widehat{J}}\widehat{A}_{j\cdot}-\mathds{1}_{K}
=Ψ(α)+λNjJ^cA^j\displaystyle=\Psi(\alpha)+{\lambda\over N}-\sum_{j\in\widehat{J}^{c}}\widehat{A}_{j\cdot}
=Ψ(α)Ψ(α^)\displaystyle=\Psi(\alpha)-\Psi(\widehat{\alpha})

where the second line uses the fact that j=1pA^j=𝟙K\sum_{j=1}^{p}\widehat{A}_{j\cdot}=\mathds{1}_{K}. By (41), we obtain

α~α=(\bmIKV^+V~)(α^α)+V^+Ψ(α).\widetilde{\alpha}-\alpha=\left(\bm{I}_{K}-\widehat{V}^{+}\widetilde{V}\right)(\widehat{\alpha}-\alpha)+\widehat{V}^{+}\Psi(\alpha).

By using the equality

Ψ(α)\displaystyle\Psi(\alpha) =jJ^(AjA^j)αA^jαA^j+jJ^XjrjrjAj+jJ^(Xjrj)(A^jA^jαAjAjα)\displaystyle=\sum_{j\in\widehat{J}}{(A_{j\cdot}-\widehat{A}_{j\cdot})^{\top}\alpha\over\widehat{A}_{j\cdot}^{\top}\alpha}\widehat{A}_{j\cdot}+\sum_{j\in\widehat{J}}{X_{j}-r_{j}\over r_{j}}A_{j\cdot}+\sum_{j\in\widehat{J}}(X_{j}-r_{j})\left({\widehat{A}_{j\cdot}\over\widehat{A}_{j\cdot}^{\top}\alpha}-{A_{j\cdot}\over A_{j\cdot}^{\top}\alpha}\right)

we conclude

α~α\displaystyle\widetilde{\alpha}-\alpha =I+II+III+IV\displaystyle=\mathrm{I}+\mathrm{II}+\mathrm{III}+\mathrm{IV}

where

I=V^+jJ^XjrjrjAj\displaystyle\mathrm{I}=\widehat{V}^{+}\sum_{j\in\widehat{J}}{X_{j}-r_{j}\over r_{j}}A_{j\cdot}
II=(\bmIKV^+V~)(α^α)\displaystyle\mathrm{II}=\left(\bm{I}_{K}-\widehat{V}^{+}\widetilde{V}\right)(\widehat{\alpha}-\alpha)
III=V^+jJ^(AjA^j)αA^jαA^j\displaystyle\mathrm{III}=\widehat{V}^{+}\sum_{j\in\widehat{J}}{(A_{j\cdot}-\widehat{A}_{j\cdot})^{\top}\alpha\over\widehat{A}_{j\cdot}^{\top}\alpha}\widehat{A}_{j\cdot}
IV=V^+jJ^(Xjrj)(A^jA^jαAjAjα).\displaystyle\mathrm{IV}=\widehat{V}^{+}\sum_{j\in\widehat{J}}(X_{j}-r_{j})\left({\widehat{A}_{j\cdot}\over\widehat{A}_{j\cdot}^{\top}\alpha}-{A_{j\cdot}\over A_{j\cdot}^{\top}\alpha}\right).

For notational convenience, write

H=AD1/rA,F=\bmIKH1/2ααH1/2H=A^{\top}D_{1/r}A,\qquad F=\bm{I}_{K}-H^{1/2}\alpha\alpha^{\top}H^{1/2} (80)

such that

Σ=H1αα=H1/2FH1/2.\Sigma=H^{-1}-\alpha\alpha^{\top}=H^{-1/2}FH^{-1/2}.

Here D1/rD_{1/r} is a diagonal matrix with its jjth entry equal to 1/rj1/r_{j} if rj>0r_{j}>0 and 0 otherwise for any j[p]j\in[p]. From Lemma 7, condition (73) implies κ(AJ¯,K)κ(AJ¯,K)>0\kappa(A_{\overline{J}},K)\geq\kappa(A_{\underline{J}},K)>0, hence rank(H)=K\mathrm{rank}(H)=K. Meanwhile, the K×KK\times K matrix FF has rank (K1)(K-1), which can be seen from FH1/2α=0FH^{1/2}\alpha=0 together with αHα=jJ¯rj=1\alpha^{\top}H\alpha=\sum_{j\in\overline{J}}r_{j}=1. Furthermore, the eigen-decomposition of F=k=1KσkukukF=\sum_{k=1}^{K}\sigma_{k}u_{k}u_{k}^{\top} satisfies

σ1==σK1=1,σK=0.\sigma_{1}=\cdots=\sigma_{K-1}=1,\quad\sigma_{K}=0. (81)
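The identities (80)-(81) are easy to check numerically; the following small script (our own sanity check, with a randomly generated column-stochastic A and Dirichlet weights alpha) returns one zero eigenvalue and K-1 unit eigenvalues for F.

import numpy as np

rng = np.random.default_rng(0)
p, K = 30, 4
A = rng.random((p, K)); A /= A.sum(axis=0)          # column-stochastic word-topic matrix
alpha = rng.dirichlet(np.ones(K))                   # mixture weights
r = A @ alpha                                       # mixture distribution, r_j > 0 here
H = A.T @ np.diag(1.0 / r) @ A                      # H = A' D_{1/r} A as in (80)
w, U = np.linalg.eigh(H)
H_half = U @ np.diag(np.sqrt(w)) @ U.T              # symmetric square root of H
v = H_half @ alpha                                  # unit vector, since alpha' H alpha = 1
F = np.eye(K) - np.outer(v, v)
print(np.round(np.linalg.eigvalsh(F), 8))           # [0., 1., 1., 1.], matching (81)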

Let 𝒮(F)K\mathcal{S}(F)\subset\mathbb{R}^{K} be the space spanned by u1,,uK1u_{1},\ldots,u_{K-1} and 𝒦(F)\mathcal{K}(F) be its orthogonal complement. It remains to show that,

NuH1/2I𝑑𝒩(0,1),\displaystyle\sqrt{N}u^{\top}H^{1/2}\mathrm{I}\overset{d}{\to}\mathcal{N}(0,1), for any u𝒮(F)𝕊K;\displaystyle\textrm{for any $u\in\mathcal{S}(F)\cap\mathbb{S}^{K}$}; (82)
NuH1/2I=o(1),\displaystyle\sqrt{N}u^{\top}H^{1/2}\mathrm{I}=o_{\mathbb{P}}(1), for any u𝒦(F)𝕊K;\displaystyle\textrm{for any $u\in\mathcal{K}(F)\cap\mathbb{S}^{K}$}; (83)
NuH1/2(II+III+IV)=o(1),\displaystyle\sqrt{N}u^{\top}H^{1/2}(\mathrm{II}+\mathrm{III}+\mathrm{IV})=o_{\mathbb{P}}(1),\qquad for any u𝕊K.\displaystyle\textrm{for any $u\in\mathbb{S}^{K}$}. (84)

These statements are proved in the following three subsections.

B.3.5 Proof of Eq. 82

We work on the event (140) intersecting with

α:={4ρα^α11}\mathcal{E}_{\alpha}:=\left\{4\rho\|\widehat{\alpha}-\alpha\|_{1}\leq 1\right\} (85)

where, for future reference, we define

ρ:=1ξαminmaxjJ¯Ajrj.\rho:={1\vee\xi\over\alpha_{\min}}\geq\max_{j\in\overline{J}}{\|A_{j\cdot}\|_{\infty}\over r_{j}}. (86)

The inequality is shown in display (8) of [8]. From Lemma 3 and Lemma 4 in Section B.3.8, we have J^=J¯\widehat{J}=\overline{J} and V^\widehat{V} is invertible. Also note that limn,N(α)=1\lim_{n,N\to\infty}\mathbb{P}(\mathcal{E}_{\alpha})=1 is ensured by Theorem 13 under conditions (73) – (75). Indeed, we see that (144) and (145) hold under (73) – (75) such that Theorem 13 ensures

{α^α11κ¯τKlog(p)N+1κ¯τ2ϵ1,}=1𝒪(p1).\displaystyle\mathbb{P}\left\{\|\widehat{\alpha}-\alpha\|_{1}\lesssim{1\over\overline{\kappa}_{\tau}}\sqrt{K\log(p)\over N}+{1\over\overline{\kappa}_{\tau}^{2}}\epsilon_{1,\infty}\right\}=1-\mathcal{O}(p^{-1}).

It then follows from (73) – (75) together with (20) that

ρα^α1=𝒪((1ξ)κ¯ταminKlog(L)N+(1ξ)κ¯τ2αminϵ1,)=o(1).\rho\|\widehat{\alpha}-\alpha\|_{1}=\mathcal{O}_{\mathbb{P}}\left({(1\vee\xi)\over\overline{\kappa}_{\tau}\alpha_{\min}}\sqrt{K\log(L)\over N}+{(1\vee\xi)\over\overline{\kappa}_{\tau}^{2}\alpha_{\min}}\epsilon_{1,\infty}\right)=o_{\mathbb{P}}(1).

We now prove (82). Pick any u𝒮(F)𝕊Ku\in\mathcal{S}(F)\cap\mathbb{S}^{K}. Since

uH1/2I=uH1/2H1jJ^XjrjrjAj+uH1/2(V^1H1)jJ^XjrjrjAju^{\top}H^{1/2}\mathrm{I}=u^{\top}H^{1/2}H^{-1}\sum_{j\in\widehat{J}}{X_{j}-r_{j}\over r_{j}}A_{j\cdot}+u^{\top}H^{1/2}\left(\widehat{V}^{-1}-H^{-1}\right)\sum_{j\in\widehat{J}}{X_{j}-r_{j}\over r_{j}}A_{j\cdot} (87)

and J^=J¯\widehat{J}=\overline{J}, it remains to prove

NuH1/2jJ¯XjrjrjAj𝑑𝒩(0,1),\displaystyle\sqrt{N}u^{\top}H^{-1/2}\sum_{j\in\overline{J}}{X_{j}-r_{j}\over r_{j}}A_{j\cdot}\overset{d}{\to}\mathcal{N}(0,1), (88)
NuH1/2(V^1H1)jJ¯XjrjrjAj=o(1).\displaystyle\sqrt{N}u^{\top}H^{1/2}\left(\widehat{V}^{-1}-H^{-1}\right)\sum_{j\in\overline{J}}{X_{j}-r_{j}\over r_{j}}A_{j\cdot}=o_{\mathbb{P}}(1). (89)

For (88), note that the left-hand side of (88) can be written as

1Nt=1NuH1/2AD1/r(Ztr):=1Nt=1NYt.{1\over\sqrt{N}}\sum_{t=1}^{N}u^{\top}H^{-1/2}A^{\top}D_{1/r}(Z_{t}-r):={1\over\sqrt{N}}\sum_{t=1}^{N}Y_{t}.

with ZtZ_{t}, for t[N]t\in[N], being i.i.d. samples of Multinomialp(1;r)\textrm{Multinomial}_{p}(1;r). We will apply the Lyapunov central limit theorem. It is easy to see that 𝔼[Yt]=0\mathbb{E}[Y_{t}]=0 and

𝔼[Yt2]\displaystyle\mathbb{E}[Y_{t}^{2}] =uH1/2AD1/r(Drrr)D1/rAH1/2u\displaystyle=u^{\top}H^{-1/2}A^{\top}D_{1/r}\left(D_{r}-rr^{\top}\right)D_{1/r}\ AH^{-1/2}u
=uH1/2(AD1/rAAD1/rrrD1/rA)H1/2u\displaystyle=u^{\top}H^{-1/2}\left(A^{\top}D_{1/r}A-A^{\top}D_{1/r}rr^{\top}D_{1/r}A\right)H^{-1/2}u
=u(\bmIKH1/2AD1/rrrD1/rAH1/2)u\displaystyle=u^{\top}\left(\bm{I}_{K}-H^{-1/2}A^{\top}D_{1/r}rr^{\top}D_{1/r}AH^{-1/2}\right)u
=u(\bmIKH1/2ααH1/2)u\displaystyle=u^{\top}\left(\bm{I}_{K}-H^{1/2}\alpha\alpha^{\top}H^{1/2}\right)u ()\displaystyle(\star)
=uFu=1.\displaystyle=u^{\top}Fu=1. (90)

Here the ()(\star) step uses the fact H1/2α=H1/2AD1/rrH^{1/2}\alpha=H^{-1/2}A^{\top}D_{1/r}r implied by

Hα=AD1/rAα=AD1/rr.H\alpha=A^{\top}D_{1/r}A\alpha=A^{\top}D_{1/r}r.

It then remains to verify

limN𝔼[|Yt|3]N=0.\lim_{N\to\infty}{\mathbb{E}[|Y_{t}|^{3}]\over\sqrt{N}}=0.

This follows by noting that

𝔼[|Yt|3]\displaystyle\mathbb{E}[|Y_{t}|^{3}] 2uH1/2AD1/r\displaystyle\leq 2\|u^{\top}H^{-1/2}A^{\top}D_{1/r}\|_{\infty} by 𝔼[Yt2]=1,Ztr12\displaystyle\textrm{by }\mathbb{E}[Y_{t}^{2}]=1,\|Z_{t}-r\|_{1}\leq 2
=maxjJ¯|AjH1/2u|rj\displaystyle=\max_{j\in\overline{J}}{|A_{j\cdot}^{\top}H^{-1/2}u|\over r_{j}}
maxjJ¯AjrjH1/2u1\displaystyle\leq\max_{j\in\overline{J}}{\|A_{j\cdot}\|_{\infty}\over r_{j}}\|H^{-1/2}u\|_{1}
ρu2κ(AJ¯,K)\displaystyle\leq\rho{\|u\|_{2}\over\kappa(A_{\overline{J}},K)} by Lemma 7
2(1ξ)κ¯Kαmin\displaystyle\leq{2(1\vee\xi)\over\overline{\kappa}_{K}\alpha_{\min}} by (86) and (19)

and by using (73).

To prove (89), starting with

H1/2(V^1H1)\displaystyle H^{1/2}\left(\widehat{V}^{-1}-H^{-1}\right) =H1/2V^1(HV^)H1,\displaystyle=H^{1/2}\widehat{V}^{-1}\left(H-\widehat{V}\right)H^{-1},

we obtain

uH1/2(V^1H1)jJ¯XjrjrjAj\displaystyle u^{\top}H^{1/2}\left(\widehat{V}^{-1}-H^{-1}\right)\sum_{j\in\overline{J}}{X_{j}-r_{j}\over r_{j}}A_{j\cdot}
uH1/2V^1H1/22H1/2(HV^)H1/2opH1/2jJ¯XjrjrjAj2\displaystyle\leq\left\|u^{\top}H^{1/2}\widehat{V}^{-1}H^{1/2}\right\|_{2}\left\|H^{-1/2}(H-\widehat{V})H^{-1/2}\right\|_{\mathrm{op}}\left\|H^{-1/2}\sum_{j\in\overline{J}}{X_{j}-r_{j}\over r_{j}}A_{j\cdot}\right\|_{2}
H1/2V^1H1/2opH1/2(HV^)H1/2opH1/2jJ¯XjrjrjAj2.\displaystyle\leq\left\|H^{1/2}\widehat{V}^{-1}H^{1/2}\right\|_{\mathrm{op}}\left\|H^{-1/2}(H-\widehat{V})H^{-1/2}\right\|_{\mathrm{op}}\left\|H^{-1/2}\sum_{j\in\overline{J}}{X_{j}-r_{j}\over r_{j}}A_{j\cdot}\right\|_{2}.

By invoking Lemma 4 and Lemma 9, we have

NuH1/2(V^1H1)jJ¯XjrjrjAj\displaystyle\sqrt{N}u^{\top}H^{1/2}\left(\widehat{V}^{-1}-H^{-1}\right)\sum_{j\in\overline{J}}{X_{j}-r_{j}\over r_{j}}A_{j\cdot}
=𝒪((K+ρκ(AJ¯,K)KN)(ρ2κ2(AJ¯,K)ϵ1,+ρα^α1)).\displaystyle=\mathcal{O}_{\mathbb{P}}\left(\left(\sqrt{K}+{\rho\over\kappa(A_{\overline{J}},K)}{K\over\sqrt{N}}\right)\left({\rho^{2}\over\kappa^{2}(A_{\overline{J}},K)}\epsilon_{1,\infty}+\rho\|\widehat{\alpha}-\alpha\|_{1}\right)\right).

Recall that K=𝒪(1)K=\mathcal{O}(1), κ(AJ¯,K)κ¯K\kappa(A_{\overline{J}},K)\geq\overline{\kappa}_{K} and ρα^α1=o(1)\rho\|\widehat{\alpha}-\alpha\|_{1}=o_{\mathbb{P}}(1). Conditions (73) and (75) imply (89).

B.3.6 Proof of Eq. 83

In view of (87) and (89) together with J^=J¯\widehat{J}=\overline{J}, it remains to prove, for any u𝒦(F)𝕊Ku\in\mathcal{K}(F)\cap\mathbb{S}^{K},

NuH1/2jJ¯XjrjrjAj=o(1).\sqrt{N}u^{\top}H^{-1/2}\sum_{j\in\overline{J}}{X_{j}-r_{j}\over r_{j}}A_{j\cdot}=o_{\mathbb{P}}(1).

This follows from Lemma 10, (86) and (73).

B.3.7 Proof of Eq. 84

Pick any u𝕊Ku\in\mathbb{S}^{K}. We bound each term on the left-hand side of (84) from above separately on the event {J^=J¯}\{\widehat{J}=\overline{J}\}.

First note that

uH1/2II\displaystyle u^{\top}H^{1/2}\mathrm{II} =uH1/2(\bmIKV^1V~)(α^α)\displaystyle=u^{\top}H^{1/2}\left(\bm{I}_{K}-\widehat{V}^{-1}\widetilde{V}\right)(\widehat{\alpha}-\alpha)
=uH1/2V^1(V^V~)H1/2H1/2(α^α)\displaystyle=u^{\top}H^{1/2}\widehat{V}^{-1}\left(\widehat{V}-\widetilde{V}\right)H^{-1/2}H^{1/2}(\widehat{\alpha}-\alpha)
H1/2V^1H1/2opH1/2(V~V^)H1/2opH1/2(α^α)2.\displaystyle\leq\left\|H^{1/2}\widehat{V}^{-1}H^{1/2}\right\|_{\mathrm{op}}\left\|H^{-1/2}\left(\widetilde{V}-\widehat{V}\right)H^{-1/2}\right\|_{\mathrm{op}}\left\|H^{1/2}(\widehat{\alpha}-\alpha)\right\|_{2}.

The proof of Theorem 9 in [8] (see the last display in Appendix G.2) shows that

H1/2(α^α)2\displaystyle\left\|H^{1/2}(\widehat{\alpha}-\alpha)\right\|_{2} =𝒪(Klog(K)N+1κ(AJ¯,τ)A^J¯AJ¯1,+1κ(AJ¯,τ)log(p)N)\displaystyle=\mathcal{O}_{\mathbb{P}}\left(\sqrt{K\log(K)\over N}+{1\over\kappa(A_{\overline{J}},\tau)}\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}+{1\over\kappa(A_{\overline{J}},\tau)}{\log(p)\over N}\right)
=𝒪(1N+1κ¯τϵ1,).\displaystyle=\mathcal{O}_{\mathbb{P}}\left({1\over\sqrt{N}}+{1\over\overline{\kappa}_{\tau}}\epsilon_{1,\infty}\right).

under condition (73). To further invoke Lemmas 4 & 6, note that conditions (92) and (94) assumed therein hold under conditions (73) – (75). In particular,

Blog(p)=ρ(1+ξKs)log(p)κ2(AJ¯,K)αmin=𝒪(1ξ2κ¯K2log(p)αmin2)=o(N).B\log(p)={\rho\left(1+\xi\sqrt{K-s}\right)\log(p)\over\kappa^{2}(A_{\overline{J}},K)\alpha_{\min}}=\mathcal{O}\left({1\vee\xi^{2}\over\overline{\kappa}_{K}^{2}}{\log(p)\over\alpha_{\min}^{2}}\right)=o(N).

Thus, invoking Lemmas 4 & 6 together with conditions (73) – (75) and (86) yields

NuH1/2II\displaystyle\sqrt{N}u^{\top}H^{1/2}\mathrm{II} =𝒪(ρ2κ2(AJ¯,K)ϵ1,+ρα^α1+Blog(p)N)=o(1).\displaystyle=\mathcal{O}_{\mathbb{P}}\left({\rho^{2}\over\kappa^{2}(A_{\overline{J}},K)}\epsilon_{1,\infty}+\rho\|\widehat{\alpha}-\alpha\|_{1}+\sqrt{B\log(p)\over N}\right)=o_{\mathbb{P}}(1).

Second,

uH1/2III\displaystyle u^{\top}H^{1/2}\mathrm{III} =uH1/2V^1jJ¯(AjA^j)αA^jαA^j\displaystyle=u^{\top}H^{1/2}\widehat{V}^{-1}\sum_{j\in\overline{J}}{(A_{j\cdot}-\widehat{A}_{j\cdot})^{\top}\alpha\over\widehat{A}_{j\cdot}^{\top}\alpha}\widehat{A}_{j\cdot}
V^1/2V^1/2H1/2u1maxk[K]jJ¯|AjkA^jk|α1A^jαA^j\displaystyle\leq\left\|\widehat{V}^{-1/2}\widehat{V}^{-1/2}H^{1/2}u\right\|_{1}\left\|\max_{k\in[K]}\sum_{j\in\overline{J}}{|A_{jk}-\widehat{A}_{jk}|\|\alpha\|_{1}\over\widehat{A}_{j\cdot}^{\top}\alpha}\widehat{A}_{j\cdot}\right\|_{\infty}
V^1/2V^1/2H1/2u1maxjJ¯A^jA^jαA^J¯AJ¯1,.\displaystyle\leq\left\|\widehat{V}^{-1/2}\widehat{V}^{-1/2}H^{1/2}u\right\|_{1}\max_{j\in\overline{J}}{\|\widehat{A}_{j\cdot}\|_{\infty}\over\widehat{A}_{j\cdot}^{\top}\alpha}\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}.

By Lemma 3 and Lemma 4, we conclude

NuH1/2III\displaystyle\sqrt{N}u^{\top}H^{1/2}\mathrm{III} =𝒪(ρNκ(AJ¯,K)V^1/2H1/2opA^J¯AJ¯1,)\displaystyle=\mathcal{O}_{\mathbb{P}}\left({\rho\sqrt{N}\over\kappa(A_{\overline{J}},K)}\|\widehat{V}^{-1/2}H^{1/2}\|_{\mathrm{op}}\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}\right)
=𝒪((1ξ)Nκ¯Kαminϵ1,)\displaystyle=\mathcal{O}_{\mathbb{P}}\left({(1\vee\xi)\sqrt{N}\over\overline{\kappa}_{K}\alpha_{\min}}\epsilon_{1,\infty}\right) by (86) and (20)
=o(1)\displaystyle=o_{\mathbb{P}}(1) by (75).\displaystyle\textrm{by (\ref{conds_2})}.

Finally, by similar arguments and Lemma 11, we find

NuH1/2IV\displaystyle\sqrt{N}u^{\top}H^{1/2}\mathrm{IV} NV^1H1/2u1jJ¯(Xjrj)(A^jA^jαAjAjα)\displaystyle\leq\sqrt{N}\left\|\widehat{V}^{-1}H^{1/2}u\right\|_{1}\left\|\sum_{j\in\overline{J}}(X_{j}-r_{j})\left({\widehat{A}_{j\cdot}\over\widehat{A}_{j\cdot}^{\top}\alpha}-{A_{j\cdot}\over A_{j\cdot}^{\top}\alpha}\right)\right\|_{\infty}
=𝒪(ρNκ(AJ¯,K)(A^J¯AJ¯1,+jJ¯J¯A^jAjrjlog(p)N))\displaystyle=\mathcal{O}_{\mathbb{P}}\left({\rho\sqrt{N}\over\kappa(A_{\overline{J}},K)}\left(\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}+\sum_{j\in\overline{J}\setminus\underline{J}}{\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{\infty}\over r_{j}}{\log(p)\over N}\right)\right)
=𝒪((1ξ)Nκ¯Kαminϵ1,+(1ξ)κ¯Kαminlog(p)N)\displaystyle=\mathcal{O}_{\mathbb{P}}\left({(1\vee\xi)\sqrt{N}\over\overline{\kappa}_{K}\alpha_{\min}}\epsilon_{1,\infty}+{(1\vee\xi)\over\overline{\kappa}_{K}\alpha_{\min}}{\log(p)\over\sqrt{N}}\right)
=o(1).\displaystyle=o_{\mathbb{P}}(1).

This completes the proof of Theorem 8. ∎

B.3.8 Technical lemmas used in the proof of Theorem 9

Recall that

J¯={j[p]:rj>0},J^={j[p]:A^jα^>0}.\overline{J}=\{j\in[p]:r_{j}>0\},\qquad\widehat{J}=\{j\in[p]:\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}>0\}.

Lemma 3 below provides upper and lower bounds on A^jα\widehat{A}_{j\cdot}^{\top}\alpha and A^jα^\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}.

Lemma 3.

On the event (140), we have

12rjA^jα32rj,jJ¯,{1\over 2}r_{j}\leq\widehat{A}_{j\cdot}^{\top}\alpha\leq{3\over 2}r_{j},\qquad\forall j\in\bar{J},

and

maxjJ¯A^jrj32ρ,maxjJ¯A^jA^jα3ρ.\max_{j\in\overline{J}}{\|\widehat{A}_{j\cdot}\|_{\infty}\over r_{j}}\leq{3\over 2}\rho,\qquad\max_{j\in\overline{J}}{\|\widehat{A}_{j\cdot}\|_{\infty}\over\widehat{A}_{j\cdot}^{\top}\alpha}\leq 3\rho.

On the event (140) intersecting (85), we further have

14rjA^jα^2rj,jJ¯,{1\over 4}r_{j}\leq\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}\leq 2r_{j},\qquad\forall j\in\bar{J},

hence

J¯=J^.\bar{J}=\widehat{J}.
Proof.

For the first result, for any jJ¯j\in\overline{J}, we have

32rjA^jα+A^jAjα1A^jα\displaystyle{3\over 2}r_{j}\geq\widehat{A}_{j\cdot}^{\top}\alpha+\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{\infty}\|\alpha\|_{1}\geq\widehat{A}_{j\cdot}^{\top}\alpha AjαA^jAjα112rj.\displaystyle\geq A_{j\cdot}^{\top}\alpha-\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{\infty}\|\alpha\|_{1}\geq{1\over 2}r_{j}.

Furthermore, by using rj=AjαAjr_{j}=A_{j\cdot}^{\top}\alpha\leq\|A_{j\cdot}\|_{\infty},

A^jrjAjrj+A^jAjrj32Ajrj.{\|\widehat{A}_{j\cdot}\|_{\infty}\over r_{j}}\leq{\|A_{j\cdot}\|_{\infty}\over r_{j}}+{{\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{\infty}\over r_{j}}}\leq{3\over 2}{\|A_{j\cdot}\|_{\infty}\over r_{j}}.

Similarly,

A^jA^jα\displaystyle{\|\widehat{A}_{j\cdot}\|_{\infty}\over\widehat{A}_{j\cdot}^{\top}\alpha} A^jAjα|(A^jAj)α|A^jrjA^jAj3Ajrj.\displaystyle\leq{\|\widehat{A}_{j\cdot}\|_{\infty}\over A_{j\cdot}^{\top}\alpha-|(\widehat{A}_{j\cdot}-A_{j\cdot})^{\top}\alpha|}\leq{\|\widehat{A}_{j\cdot}\|_{\infty}\over r_{j}-\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{\infty}}\leq{3\|A_{j\cdot}\|_{\infty}\over r_{j}}.

Taking the union over jJ¯j\in\overline{J} together with the definition of ρ\rho proves the first claim.

To prove the second result, for any jJ¯j\in\overline{J}, we have

A^jα^\displaystyle\widehat{A}_{j\cdot}^{\top}\widehat{\alpha} Ajα^A^jAjα^1\displaystyle\geq A_{j\cdot}^{\top}\widehat{\alpha}-\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{\infty}\|\widehat{\alpha}\|_{1}
rjAjα^α1A^jAj\displaystyle\geq r_{j}-\|A_{j\cdot}\|_{\infty}\|\widehat{\alpha}-\alpha\|_{1}-\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{\infty}
rj/4.\displaystyle\geq r_{j}/4.

Similarly,

A^jα^\displaystyle\widehat{A}_{j\cdot}^{\top}\widehat{\alpha} Ajα^+A^jAjα^1\displaystyle\leq A_{j\cdot}^{\top}\widehat{\alpha}+\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{\infty}\|\widehat{\alpha}\|_{1}
rj+Ajα^α1+A^jAj\displaystyle\leq r_{j}+\|A_{j\cdot}\|_{\infty}\|\widehat{\alpha}-\alpha\|_{1}+\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{\infty}
2rj.\displaystyle\leq 2~{}r_{j}.

The second claim follows immediately from the first result. ∎

The following lemma studies the properties related with the matrix V^\widehat{V} in (43).

Lemma 4.

Suppose

2A^J¯AJ¯1,κ(AJ¯,K).2\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}\leq\kappa(A_{\overline{J}},K). (91)

On the event (140) intersecting (85), V^\widehat{V} satisfies the following statements.

  1. (a)

    For any uKu\in\mathbb{R}^{K},

    uV^u14κ2(AJ¯,K)u12.u^{\top}\widehat{V}u\geq{1\over 4}\kappa^{2}(A_{\overline{J}},K)\|u\|_{1}^{2}.
  2. (b)
    H1/2(V^H)H1/2op\displaystyle\left\|H^{-1/2}\left(\widehat{V}-H\right)H^{-1/2}\right\|_{\mathrm{op}}\leq 14ρ2κ2(AJ¯,K)A^J¯AJ¯1,+ρα^α1.\displaystyle~{}{14\rho^{2}\over\kappa^{2}(A_{\overline{J}},K)}\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}+\rho\|\widehat{\alpha}-\alpha\|_{1}.
  3. (c)

    If additionally

    ρ2κ2(AJ¯,K)A^J¯AJ¯1,156{\rho^{2}\over\kappa^{2}(A_{\overline{J}},K)}\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}\leq{1\over 56} (92)

    holds, then

    12λK(H1/2V^H1/2)λ1(H1/2V^H1/2)2.{1\over 2}\leq\lambda_{K}(H^{-1/2}\widehat{V}H^{1/2})\leq\lambda_{1}(H^{-1/2}\widehat{V}H^{1/2})\leq 2.
Proof.

We work on the event (140) intersecting (85). From Lemma 3, we know J^=J¯\widehat{J}=\overline{J}.

To prove part (a), using the fact that

jJ^A^jα^=j=1pA^jα^=𝟙Kα^=1,\sum_{j\in\widehat{J}}\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}=\sum_{j=1}^{p}\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}=\mathds{1}_{K}^{\top}\widehat{\alpha}=1,

we have, for any uKu\in\mathbb{R}^{K},

uV^u\displaystyle u^{\top}\widehat{V}u =jJ^1A^jα^(A^ju)2(jJ^A^jα^)\displaystyle=\sum_{j\in\widehat{J}}{1\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}(\widehat{A}_{j\cdot}^{\top}u)^{2}\left(\sum_{j\in\widehat{J}}\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}\right)
(jJ^|A^ju|)2\displaystyle\geq\left(\sum_{j\in\widehat{J}}|\widehat{A}_{j\cdot}^{\top}u|\right)^{2} by the Cauchy–Schwarz inequality
(AJ¯u1(A^J¯AJ¯)u1)2\displaystyle\geq\left(\|A_{\overline{J}}u\|_{1}-\|(\widehat{A}_{\overline{J}}-A_{\overline{J}})u\|_{1}\right)^{2} by J^=J¯\displaystyle\textrm{by }\widehat{J}=\overline{J}
u12(κ(AJ¯,K)A^J¯AJ¯1,)2.\displaystyle\geq\|u\|_{1}^{2}\left(\kappa(A_{\overline{J}},K)-\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}\right)^{2}.

Invoking condition (91) concludes the result.

To prove part (b), by definition,

V^H\displaystyle\widehat{V}-H =jJ¯1A^jα^(A^jA^jAjAj)+jJ¯AjαA^jα^A^jα^AjAjAjα.\displaystyle=\sum_{j\in\overline{J}}{1\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}\left(\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top}-A_{j\cdot}A_{j\cdot}^{\top}\right)+\sum_{j\in\overline{J}}{A_{j\cdot}^{\top}\alpha-\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}{A_{j\cdot}A_{j\cdot}^{\top}\over A_{j\cdot}^{\top}\alpha}.

Therefore, for any u𝕊Ku\in\mathbb{S}^{K},

uH1/2(V^H)H1/2uI+II+III+IVu^{\top}H^{-1/2}(\widehat{V}-H)H^{-1/2}u\leq\mathrm{I}+\mathrm{II}+\mathrm{III}+\mathrm{IV}

where

I\displaystyle\mathrm{I} =jJ¯1A^jα^uH1/2A^j(A^jAj)H1/2u\displaystyle=\sum_{j\in\overline{J}}{1\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}u^{\top}H^{-1/2}\widehat{A}_{j\cdot}(\widehat{A}_{j\cdot}-A_{j\cdot})^{\top}H^{-1/2}u
II\displaystyle\mathrm{II} =jJ¯1A^jα^uH1/2Aj(A^jAj)H1/2u\displaystyle=\sum_{j\in\overline{J}}{1\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}u^{\top}H^{-1/2}A_{j\cdot}(\widehat{A}_{j\cdot}-A_{j\cdot})^{\top}H^{-1/2}u
III\displaystyle\mathrm{III} =jJ¯(AjA^j)α^A^jα^uH1/2AjAjAjαH1/2u\displaystyle=\sum_{j\in\overline{J}}{(A_{j\cdot}-\widehat{A}_{j\cdot})^{\top}\widehat{\alpha}\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}u^{\top}H^{-1/2}{A_{j\cdot}A_{j\cdot}^{\top}\over A_{j\cdot}^{\top}\alpha}H^{-1/2}u
IV\displaystyle\mathrm{IV} =jJ¯Aj(αα^)A^jα^uH1/2AjAjAjαH1/2u.\displaystyle=\sum_{j\in\overline{J}}{A_{j\cdot}^{\top}(\alpha-\widehat{\alpha})\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}u^{\top}H^{-1/2}{A_{j\cdot}A_{j\cdot}^{\top}\over A_{j\cdot}^{\top}\alpha}H^{-1/2}u.

By using Lemmas 3 & 7, we obtain

I\displaystyle\mathrm{I} H1/2u12maxjJ¯A^jA^jα^maxk[K]jJ¯|A^jkAjk|6ρκ2(AJ¯,K)A^J¯AJ¯1,\displaystyle\leq\left\|H^{-1/2}u\right\|_{1}^{2}~{}\max_{j\in\overline{J}}{\|\widehat{A}_{j\cdot}\|_{\infty}\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}\max_{k\in[K]}\sum_{j\in\overline{J}}|\widehat{A}_{jk}-A_{jk}|\leq{6\rho\over\kappa^{2}(A_{\overline{J}},K)}\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}
II\displaystyle\mathrm{II} H1/2u12maxjJ¯AjA^jα^maxk[K]jJ¯|A^jkAjk|4ρκ2(AJ¯,K)A^J¯AJ¯1,\displaystyle\leq\left\|H^{-1/2}u\right\|_{1}^{2}~{}\max_{j\in\overline{J}}{\|A_{j\cdot}\|_{\infty}\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}\max_{k\in[K]}\sum_{j\in\overline{J}}|\widehat{A}_{jk}-A_{jk}|\leq{4\rho\over\kappa^{2}(A_{\overline{J}},K)}\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}
III\displaystyle\mathrm{III} H1/2u12maxjJ¯AjA^jα^AjAjαmaxk[K]jJ¯|A^jkAjk|4ρ2κ2(AJ¯,K)A^J¯AJ¯1,.\displaystyle\leq\left\|H^{-1/2}u\right\|_{1}^{2}~{}\max_{j\in\overline{J}}{\|A_{j\cdot}\|_{\infty}\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}{\|A_{j\cdot}\|_{\infty}\over A_{j\cdot}^{\top}\alpha}\max_{k\in[K]}\sum_{j\in\overline{J}}|\widehat{A}_{jk}-A_{jk}|\leq{4\rho^{2}\over\kappa^{2}(A_{\overline{J}},K)}\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}.

By also noting that

IV\displaystyle\mathrm{IV} α^α1maxjJ¯AjA^jα^jJ¯uH1/2AjAjAjαH1/2u4ρα^α1\displaystyle\leq\|\widehat{\alpha}-\alpha\|_{1}\max_{j\in\overline{J}}{\|A_{j\cdot}\|_{\infty}\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}}\sum_{j\in\overline{J}}u^{\top}H^{-1/2}{A_{j\cdot}A_{j\cdot}^{\top}\over A_{j\cdot}^{\top}\alpha}H^{-1/2}u\leq 4\rho\|\widehat{\alpha}-\alpha\|_{1}

and using the fact that ρ1\rho\geq 1, we have proved part (b).

Finally, part (c) follows from part (b), (92) and Weyl’s inequality. Indeed,

1H1/2(V^H)H1/2op\displaystyle 1-\left\|H^{-1/2}\left(\widehat{V}-H\right)H^{-1/2}\right\|_{\mathrm{op}} λK(H1/2V^H1/2)\displaystyle\leq\lambda_{K}(H^{-1/2}\widehat{V}H^{-1/2})
λ1(H1/2V^H1/2)\displaystyle\leq\lambda_{1}(H^{-1/2}\widehat{V}H^{-1/2})
1+H1/2(V^H)H1/2op.\displaystyle\leq 1+\left\|H^{-1/2}\left(\widehat{V}-H\right)H^{-1/2}\right\|_{\mathrm{op}}.
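As a quick numerical illustration of the argument for part (c), the sketch below generates an arbitrary synthetic positive-definite H and a symmetric perturbation whose normalized operator norm equals 1/4, and checks the eigenvalue sandwich delivered by Weyl's inequality. All matrices here are placeholders and are not the estimators of the paper; this is only meant to make the mechanism of part (c) concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5

# Synthetic positive-definite H (a stand-in for the population matrix).
M = rng.normal(size=(K, K))
H = M @ M.T + K * np.eye(K)

# Symmetric perturbation, rescaled so that the normalized error
# ||H^{-1/2}(V_hat - H)H^{-1/2}||_op equals 1/4 < 1/2, mimicking part (b) under (92).
E = rng.normal(size=(K, K))
E = (E + E.T) / 2.0

w, U = np.linalg.eigh(H)
H_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T
E *= 0.25 / np.linalg.norm(H_inv_sqrt @ E @ H_inv_sqrt, ord=2)
V_hat = H + E

# Weyl's inequality places all eigenvalues of H^{-1/2} V_hat H^{-1/2} in [3/4, 5/4],
# hence in [1/2, 2] as claimed in part (c).
eigs = np.linalg.eigvalsh(H_inv_sqrt @ V_hat @ H_inv_sqrt)
print(eigs.min(), eigs.max())
assert 0.5 <= eigs.min() and eigs.max() <= 2.0
```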

The following lemmas provide operator norm bounds for the normalized estimation errors of HH for various estimators. Let

H~=jJ¯Xjrj2A^jA^j.\widetilde{H}=\sum_{j\in\overline{J}}{X_{j}\over r_{j}^{2}}\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top}. (93)

Recall that

B=ρ(1+ξKs)κ2(AJ¯,K)αminB={\rho\left(1+\xi\sqrt{K-s}\right)\over\kappa^{2}(A_{\overline{J}},K)\alpha_{\min}}
Lemma 5.

Suppose |J¯J¯|C|\overline{J}\setminus\underline{J}|\leq C for some constant C>0C>0. Then, on the event (140), with probability 12p11-2p^{-1}, we have

H1/2(H~H)H1/2op\displaystyle\left\|H^{-1/2}\left(\widetilde{H}-H\right)H^{-1/2}\right\|_{\mathrm{op}}\lesssim~{} ρκ2(AJ¯,K)A^J¯AJ¯1,+Blog(p)N+Blog(p)N.\displaystyle{\rho\over\kappa^{2}(A_{\overline{J}},K)}\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}+\sqrt{B\log(p)\over N}+{B\log(p)\over N}.

Moreover, if there exists a sufficiently small constant C>0C^{\prime}>0 such that

ρκ2(AJ¯,K)A^J¯AJ¯1,C,Blog(p)NC,{\rho\over\kappa^{2}(A_{\overline{J}},K)}\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}\leq C^{\prime},\qquad{B\log(p)\over N}\leq C^{\prime}, (94)

then, on the event (140), with probability 12p11-2p^{-1},

12λK(H1/2H~H1/2)λ1(H1/2H~H1/2)2.{1\over 2}\leq\lambda_{K}(H^{-1/2}\widetilde{H}H^{-1/2})\leq\lambda_{1}(H^{-1/2}\widetilde{H}H^{-1/2})\leq 2.
Proof.

Define

H^=jJ¯Xjrj2AjAj.\widehat{H}=\sum_{j\in\overline{J}}{X_{j}\over r_{j}^{2}}A_{j\cdot}A_{j\cdot}^{\top}.

By

H1/2(H~H)H1/2opH1/2(H~H^)H1/2op+H1/2(H^H)H1/2op\left\|H^{-1/2}\left(\widetilde{H}-H\right)H^{-1/2}\right\|_{\mathrm{op}}\leq\left\|H^{-1/2}\left(\widetilde{H}-\widehat{H}\right)H^{-1/2}\right\|_{\mathrm{op}}+\left\|H^{-1/2}\left(\widehat{H}-H\right)H^{-1/2}\right\|_{\mathrm{op}}

and Lemma 8, it remains to bound from above the first term on the right hand side. To this end, pick any u𝕊Ku\in\mathbb{S}^{K}. Using the definitions of H~\widetilde{H} and H^\widehat{H} gives

uH1/2(H~H^)H1/2u\displaystyle u^{\top}H^{-1/2}\left(\widetilde{H}-\widehat{H}\right)H^{-1/2}u
=jJ¯Xjrj2uH1/2(A^jA^jAjAj)H1/2u\displaystyle=\sum_{j\in\overline{J}}{X_{j}\over r_{j}^{2}}u^{\top}H^{-1/2}\left(\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top}-A_{j\cdot}A_{j\cdot}^{\top}\right)H^{-1/2}u
=jJ¯Xjrj2[uH1/2A^j(A^jAj)H1/2u+uH1/2Aj(A^jAj)H1/2u].\displaystyle=\sum_{j\in\overline{J}}{X_{j}\over r_{j}^{2}}\left[u^{\top}H^{-1/2}\widehat{A}_{j\cdot}(\widehat{A}_{j\cdot}-A_{j\cdot})^{\top}H^{-1/2}u+u^{\top}H^{-1/2}A_{j\cdot}(\widehat{A}_{j\cdot}-A_{j\cdot})^{\top}H^{-1/2}u\right].

The first term is bounded from above by

H1/2u12maxk[K]jJ¯Xjrj|A^jkAjk|A^jrj\displaystyle\|H^{-1/2}u\|_{1}^{2}\max_{k\in[K]}\sum_{j\in\overline{J}}{X_{j}\over r_{j}}|\widehat{A}_{jk}-A_{jk}|{\|\widehat{A}_{j\cdot}\|_{\infty}\over r_{j}}
H1/2u12maxk[K]jJ¯Xjrj|A^jkAjk|maxjJ¯A^jrj\displaystyle\leq\|H^{-1/2}u\|_{1}^{2}\max_{k\in[K]}\sum_{j\in\overline{J}}{X_{j}\over r_{j}}|\widehat{A}_{jk}-A_{jk}|\max_{j\in\overline{J}}{\|\widehat{A}_{j\cdot}\|_{\infty}\over r_{j}}
3ρ2κ2(AJ¯,K)maxk[K]jJ¯Xjrj|A^jkAjk|\displaystyle\leq{3\rho\over 2\kappa^{2}(A_{\overline{J}},K)}\max_{k\in[K]}\sum_{j\in\overline{J}}{X_{j}\over r_{j}}|\widehat{A}_{jk}-A_{jk}|

where we used Lemmas 3 & 7 in the last step. Similarly, we have

jJ¯Xjrj2uH1/2Aj(A^jAj)H1/2u\displaystyle\sum_{j\in\overline{J}}{X_{j}\over r_{j}^{2}}u^{\top}H^{-1/2}A_{j\cdot}(\widehat{A}_{j\cdot}-A_{j\cdot})^{\top}H^{-1/2}u ρκ2(AJ¯,K)maxk[K]jJ¯Xjrj|A^jkAjk|\displaystyle\leq{\rho\over\kappa^{2}(A_{\overline{J}},K)}\max_{k\in[K]}\sum_{j\in\overline{J}}{X_{j}\over r_{j}}|\widehat{A}_{jk}-A_{jk}|

By the arguments in the proof of [8, Theorem 8], on the event \mathcal{E}, we have

maxk[K]jJ¯Xjrj|A^jkAjk|\displaystyle\max_{k\in[K]}\sum_{j\in\overline{J}}{X_{j}\over r_{j}}|\widehat{A}_{jk}-A_{jk}| maxk[K]jJ¯|A^jkAjk|(1+7log(p)3rjN)\displaystyle\leq\max_{k\in[K]}\sum_{j\in\overline{J}}|\widehat{A}_{jk}-A_{jk}|\left(1+{7\log(p)\over 3r_{j}N}\right)
4A^J¯AJ¯1,+143jJ¯J¯A^jAjrjlog(p)N\displaystyle\leq 4\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}+{14\over 3}\sum_{j\in\overline{J}\setminus\underline{J}}{\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{\infty}\over r_{j}}{\log(p)\over N}
4A^J¯AJ¯1,+7C3log(p)N\displaystyle\leq 4\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}+{7C\over 3}{\log(p)\over N} (95)

where in the last step we used |J¯J¯|C|\overline{J}\setminus\underline{J}|\leq C and (140). The first result then follows by combining the last three bounds and Lemma 8 with t=4log(p)t=4\log(p).

Under (94), the second result follows immediately by using Weyl’s inequality. ∎
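To see the flavor of the concentration stated in Lemma 5 (and in Lemma 8 below), the following minimal simulation tracks the normalized operator-norm error of H^\widehat{H} (built from the true AA and the empirical word frequencies, as in the proof above) as NN grows. The topic matrix AA and the weights α\alpha below are small synthetic placeholders, and we take XjX_{j} to denote the empirical frequency of word jj; both choices are assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
p, K = 50, 4

# Synthetic word-topic matrix A (columns on the probability simplex) and mixture weights.
A = rng.dirichlet(np.ones(p), size=K).T      # p x K, each column sums to one
alpha = rng.dirichlet(np.ones(K))
r = A @ alpha                                 # word probabilities; r_j > 0 for all j here

H = (A / r[:, None]).T @ A                    # H = sum_j A_j A_j^T / r_j
w, U = np.linalg.eigh(H)
H_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T

for N in [500, 5000, 50000]:
    X = rng.multinomial(N, r) / N             # empirical word frequencies
    H_hat = (A * (X / r ** 2)[:, None]).T @ A # sum_j (X_j / r_j^2) A_j A_j^T
    err = np.linalg.norm(H_inv_sqrt @ (H_hat - H) @ H_inv_sqrt, ord=2)
    print(N, err)                             # decays roughly like 1/sqrt(N)
```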

Recall that V^\widehat{V} and V~\widetilde{V} are defined in (43) and (39), respectively. The following lemma bounds their difference normalized by H1/2H^{-1/2} in operator norm.

Lemma 6.

Suppose |J¯J¯|C|\overline{J}\setminus\underline{J}|\leq C for some constant C>0C>0. Assume (92) and (94). On the event (140) intersecting with (85), with probability 12p11-2p^{-1}, we have

H1/2(V^V~)H1/2op\displaystyle\left\|H^{-1/2}\left(\widehat{V}-\widetilde{V}\right)H^{-1/2}\right\|_{\mathrm{op}} ρα^α1+ρ2κ2(AJ¯,K)A^J¯AJ¯1,+Blog(p)N.\displaystyle\lesssim~{}\rho\|\widehat{\alpha}-\alpha\|_{1}+{\rho^{2}\over\kappa^{2}(A_{\overline{J}},K)}\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}+\sqrt{B\log(p)\over N}.
Proof.

By

V^V~=V^H+HH~+H~V~,\widehat{V}-\widetilde{V}=\widehat{V}-H+H-\widetilde{H}+\widetilde{H}-\widetilde{V},

we have

H1/2(V^V~)H1/2op\displaystyle\left\|H^{-1/2}\left(\widehat{V}-\widetilde{V}\right)H^{-1/2}\right\|_{\mathrm{op}} H1/2(V^H)H1/2op+H1/2(H~H)H1/2op\displaystyle\leq\left\|H^{-1/2}\left(\widehat{V}-H\right)H^{-1/2}\right\|_{\mathrm{op}}+\left\|H^{-1/2}\left(\widetilde{H}-H\right)H^{-1/2}\right\|_{\mathrm{op}}
+H1/2(V~H~)H1/2op.\displaystyle\quad+\left\|H^{-1/2}\left(\widetilde{V}-\widetilde{H}\right)H^{-1/2}\right\|_{\mathrm{op}}.

In view of Lemmas 4 & 5, it suffices to bound from above H1/2(V~H~)H1/2op\|H^{-1/2}\left(\widetilde{V}-\widetilde{H}\right)H^{-1/2}\|_{\mathrm{op}}. Recalling the definitions (39) and (93), it follows that

H1/2(V~H~)H1/2op\displaystyle\left\|H^{-1/2}\left(\widetilde{V}-\widetilde{H}\right)H^{-1/2}\right\|_{\mathrm{op}}
=H1/2jJ¯(XjA^jα^A^jαXjrj2)A^jA^jH1/2op\displaystyle=\left\|H^{-1/2}\sum_{j\in\overline{J}}\left({X_{j}\over\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}\widehat{A}_{j\cdot}^{\top}\alpha}-{X_{j}\over r_{j}^{2}}\right)\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top}H^{-1/2}\right\|_{\mathrm{op}}
=H1/2jJ¯Xj(A^jα^)(A^jα)rj2(A^jα^A^jαAjαAjα)A^jA^jH1/2op\displaystyle=\left\|H^{-1/2}\sum_{j\in\overline{J}}{X_{j}\over(\widehat{A}_{j\cdot}^{\top}\widehat{\alpha})(\widehat{A}_{j\cdot}^{\top}\alpha)r_{j}^{2}}\left(\widehat{A}_{j\cdot}^{\top}\widehat{\alpha}\widehat{A}_{j\cdot}^{\top}\alpha-A_{j\cdot}^{\top}\alpha A_{j\cdot}^{\top}\alpha\right)\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top}H^{-1/2}\right\|_{\mathrm{op}}
H1/2jJ¯Xj(A^jα^)rj2A^j(α^α)A^jA^jH1/2op\displaystyle\leq\left\|H^{-1/2}\sum_{j\in\overline{J}}{X_{j}\over(\widehat{A}_{j\cdot}^{\top}\widehat{\alpha})r_{j}^{2}}\widehat{A}_{j\cdot}^{\top}(\widehat{\alpha}-\alpha)\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top}H^{-1/2}\right\|_{\mathrm{op}}
+H1/2jJ¯Xj(A^jα^)(A^jα)rj2[(A^jα)2(Ajα)2]A^jA^jH1/2op.\displaystyle+\quad\left\|H^{-1/2}\sum_{j\in\overline{J}}{X_{j}\over(\widehat{A}_{j\cdot}^{\top}\widehat{\alpha})(\widehat{A}_{j\cdot}^{\top}\alpha)r_{j}^{2}}\left[\left(\widehat{A}_{j\cdot}^{\top}\alpha\right)^{2}-\left(A_{j\cdot}^{\top}\alpha\right)^{2}\right]\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top}H^{-1/2}\right\|_{\mathrm{op}}.

For the first term, we find

supu𝕊KuH1/2jJ¯Xj(A^jα^)rj2A^jα^α1A^jA^jH1/2u\displaystyle\sup_{u\in\mathbb{S}^{K}}u^{\top}H^{-1/2}\sum_{j\in\overline{J}}{X_{j}\over(\widehat{A}_{j\cdot}^{\top}\widehat{\alpha})r_{j}^{2}}\|\widehat{A}_{j\cdot}\|_{\infty}\|\widehat{\alpha}-\alpha\|_{1}\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top}H^{-1/2}u
16α^α1maxjJ¯A^jrjsupu𝕊KuH1/2jJ¯Xjrj2A^jA^jH1/2u\displaystyle\leq 16\|\widehat{\alpha}-\alpha\|_{1}\max_{j\in\overline{J}}{\|\widehat{A}_{j\cdot}\|_{\infty}\over r_{j}}\sup_{u\in\mathbb{S}^{K}}u^{\top}H^{-1/2}\sum_{j\in\overline{J}}{X_{j}\over r_{j}^{2}}\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top}H^{-1/2}u
24ρα^α1H1/2H~H1/2op\displaystyle\leq 24\rho\|\widehat{\alpha}-\alpha\|_{1}\left\|H^{-1/2}\widetilde{H}H^{-1/2}\right\|_{\mathrm{op}} by Lemma 3
48ρα^α1\displaystyle\leq 48\rho\|\widehat{\alpha}-\alpha\|_{1} by Lemma 5.\displaystyle\text{by Lemma \ref{lem_Htilde}}.

Regarding the second term, similar arguments bound it from above by

H1/2jJ¯Xj(A^jα^)(A^jα)rj2(A^jα+Ajα)[(A^jAj)α]A^jA^jH1/2op\displaystyle\left\|H^{-1/2}\sum_{j\in\overline{J}}{X_{j}\over(\widehat{A}_{j\cdot}^{\top}\widehat{\alpha})(\widehat{A}_{j\cdot}^{\top}\alpha)r_{j}^{2}}\left(\widehat{A}_{j\cdot}^{\top}\alpha+A_{j\cdot}^{\top}\alpha\right)\left[(\widehat{A}_{j\cdot}-A_{j\cdot})^{\top}\alpha\right]\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top}H^{-1/2}\right\|_{\mathrm{op}}
12maxk[K]supu𝕊KjJ¯Xjrj2|A^jkAjk|rjuH1/2A^jA^jH1/2u\displaystyle\leq 12\max_{k\in[K]}\sup_{u\in\mathbb{S}^{K}}\sum_{j\in\overline{J}}{X_{j}\over r_{j}^{2}}{|\widehat{A}_{jk}-A_{jk}|\over r_{j}}u^{\top}H^{-1/2}\widehat{A}_{j\cdot}\widehat{A}_{j\cdot}^{\top}H^{-1/2}u
12maxjJ¯A^j2rj2maxk[K]supu𝕊KH1/2u12jJ¯Xjrj|A^jkAjk|\displaystyle\leq 12\max_{j\in\overline{J}}{\|\widehat{A}_{j\cdot}\|_{\infty}^{2}\over r_{j}^{2}}\max_{k\in[K]}\sup_{u\in\mathbb{S}^{K}}\|H^{-1/2}u\|_{1}^{2}\sum_{j\in\overline{J}}{X_{j}\over r_{j}}|\widehat{A}_{jk}-A_{jk}|
27ρ2κ2(AJ¯,K)maxk[K]jJ¯Xjrj|A^jkAjk|\displaystyle\leq{27\rho^{2}\over\kappa^{2}(A_{\overline{J}},K)}\max_{k\in[K]}\sum_{j\in\overline{J}}{X_{j}\over r_{j}}|\widehat{A}_{jk}-A_{jk}|

In the last step we used Lemmas 3 & 7. Invoking (B.3.8) and collecting terms yields the desired result. ∎

B.3.9 Auxiliary lemmas used in the proof of Theorem 9

For completeness, we collect several lemmas that are used in the proof of Theorem 9 and established in [8]. Recall that

H=jJ¯1rjAjAj,H^=jJ¯Xjrj2AjAj.H=\sum_{j\in\overline{J}}{1\over r_{j}}A_{j\cdot}A_{j\cdot}^{\top},\qquad\widehat{H}=\sum_{j\in\overline{J}}{X_{j}\over r_{j}^{2}}A_{j\cdot}A_{j\cdot}^{\top}.
Lemma 7.

For any uKu\in\mathbb{R}^{K},

uHuκ2(AJ¯,K)u12.u^{\top}Hu\geq\kappa^{2}(A_{\overline{J}},K)\|u\|_{1}^{2}.
Proof.

This is proved in the proof of [8, Theorem 2]; see display (F.8). ∎

Lemma 8 (Lemma I.4 in [8]).

For any t0t\geq 0, one has

{H1/2(H^H)H1/2op2BtN+Bt3N}12Ket/2,\mathbb{P}\left\{\left\|H^{-1/2}(\widehat{H}-H)H^{-1/2}\right\|_{\rm op}\leq\sqrt{2Bt\over N}+{Bt\over 3N}\right\}\geq 1-2Ke^{-t/2},

with

B=ρ(1+ξKs)κ2(AJ¯,K)αmin1ξκ2(AJ¯,K)αmin2(1+ξKs).B={\rho\left(1+\xi\sqrt{K-s}\right)\over\kappa^{2}(A_{\overline{J}},K)\alpha_{\min}}\leq{1\vee\xi\over\kappa^{2}(A_{\overline{J}},K)\alpha_{\min}^{2}}\left(1+\xi\sqrt{K-s}\right).

Moreover, if

NCBlogKN\geq CB\log K

for some sufficiently large constant C>0C>0, then, with probability 12K11-2K^{-1}, one has

12λK(H1/2H^H1/2)λ1(H1/2H^H1/2)2.{1\over 2}\leq\lambda_{K}(H^{-1/2}\widehat{H}H^{-1/2})\leq\lambda_{1}(H^{-1/2}\widehat{H}H^{-1/2})\leq 2.
Lemma 9 (Lemma I.3 in [8]).

For any t0t\geq 0, with probability 12et/2+Klog51-2e^{-t/2+K\log 5},

jJ¯XjrjrjH1/2Aj22tN+2ρ3κ(AJ¯,K)tN.\left\|\sum_{j\in\overline{J}}{X_{j}-r_{j}\over r_{j}}H^{-1/2}A_{j\cdot}\right\|_{2}\leq 2\sqrt{t\over N}+{2\rho\over 3\kappa(A_{\overline{J}},K)}{t\over N}.
Lemma 10.

For any t0t\geq 0 and any u𝕊K𝒦(F)u\in\mathbb{S}^{K}\cap\mathcal{K}(F), with probability 12et/21-2e^{-t/2},

|ujJ¯XjrjrjH1/2Aj|2ρ3κ(AJ¯,K)tN.\left|u^{\top}\sum_{j\in\overline{J}}{X_{j}-r_{j}\over r_{j}}H^{-1/2}A_{j\cdot}\right|\leq{2\rho\over 3\kappa(A_{\overline{J}},K)}{t\over N}.
Proof.

The proof follows the same arguments as the proof of [8, Lemma I.3], except that we apply the Bernstein inequality with the variance equal to

uH1/2AD1/r(Drrr)D1/rAH1/2u=(B.3.5)uFu=0.u^{\top}H^{-1/2}A^{\top}D_{1/r}\left(D_{r}-rr^{\top}\right)D_{1/r}AH^{-1/2}u\overset{(\ref{eq_F})}{=}u^{\top}Fu=0.

Lemma 11.

On the event that (145) holds, we have

jJ¯(Xjrj)(A^jA^jαAjAjα)ρA^J¯AJ¯1,+ρjJ¯J¯A^jAjrjlog(p)N.\Bigl{\|}\sum_{j\in\overline{J}}(X_{j}-r_{j})\Bigl{(}{\widehat{A}_{j\cdot}\over\widehat{A}_{j\cdot}^{\top}\alpha}-{A_{j\cdot}\over A_{j\cdot}^{\top}\alpha}\Bigr{)}\Bigr{\|}_{\infty}\lesssim\rho\|\widehat{A}_{\overline{J}}-A_{\overline{J}}\|_{1,\infty}+\rho\sum_{j\in\overline{J}\setminus\underline{J}}{\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{\infty}\over r_{j}}{\log(p)\over N}.
Proof.

This is proved in the proof of [8, Theorem 9]. ∎

B.3.10 Proof of Theorem 6

We prove Theorem 6 under the same notation as the main paper, and focus on the pair (i,j)=(1,2)(i,j)=(1,2) without loss of generality. Recall 12\mathcal{F}^{\prime}_{12} from (33) and Σ(i)\Sigma^{(i)} from (8). Let

Y:=Y12=supf12fZ12,Y:=Y_{12}=\sup_{f\in\mathcal{F}^{\prime}_{12}}f^{\top}Z_{12},

where

Z12𝒩K(0,Σ12),withΣ12=Σ(1)+Σ(2).Z_{12}\sim\mathcal{N}_{K}(0,\Sigma_{12}),\qquad\text{with}\quad\Sigma_{12}=\Sigma^{(1)}+\Sigma^{(2)}. (96)

Similarly, with ^δ\widehat{\mathcal{F}}^{\prime}_{\delta} defined in (51) and Σ^(i)\widehat{\Sigma}^{(i)} given by (49), let Y^b\widehat{Y}_{b} for b[M]b\in[M] be i.i.d. realizations of

Y^:=supf^δfZ^12\widehat{Y}:=\sup_{f\in\widehat{\mathcal{F}}^{\prime}_{\delta}}f^{\top}\widehat{Z}_{12}

where

Z^12𝒩K(0,Σ^12),withΣ^12=Σ^(1)+Σ^(2).\widehat{Z}_{12}\sim\mathcal{N}_{K}(0,\widehat{\Sigma}_{12}),\qquad\text{with}\quad\widehat{\Sigma}_{12}=\widehat{\Sigma}^{(1)}+\widehat{\Sigma}^{(2)}. (97)

Finally, recall that F^N,M(t)\widehat{F}_{N,M}(t) is the empirical c.d.f. of {Y^1,,Y^M}\{\widehat{Y}_{1},\ldots,\widehat{Y}_{M}\} at any t(0,1)t\in(0,1), while F(t)F(t) is the c.d.f. of YY.

Proof.

Denote by F^N\widehat{F}_{N} the c.d.f. of Y^\widehat{Y} conditional on the observed data. Since

|F^N,M(t)F(t)||F^N,M(t)F^N(t)|+|F^N(t)F(t)|,|\widehat{F}_{N,M}(t)-F(t)|\leq|\widehat{F}_{N,M}(t)-\widehat{F}_{N}(t)|+|\widehat{F}_{N}(t)-F(t)|,

and the Glivenko-Cantelli theorem ensures that the first term on the right hand side converges to 0, almost surely, as MM\to\infty, it remains to prove F^N(t)F(t)=o(1)\widehat{F}_{N}(t)-F(t)=o_{\mathbb{P}}(1) for any t[0,1]t\in[0,1], or, equivalently, Y^𝑑Y\widehat{Y}\overset{d}{\to}Y.

To this end, notice that

Y^Y\displaystyle\widehat{Y}-Y =supf^δfZ^12supf12fZ^12I+supf12fZ^12supf12fZ12II.\displaystyle=\underbrace{\sup_{f\in\widehat{\mathcal{F}}^{\prime}_{\delta}}f^{\top}\widehat{Z}_{12}-\sup_{f\in\mathcal{F}^{\prime}_{12}}f^{\top}\widehat{Z}_{12}}_{\mathrm{I}}+\underbrace{\sup_{f\in\mathcal{F}^{\prime}_{12}}f^{\top}\widehat{Z}_{12}-\sup_{f\in\mathcal{F}^{\prime}_{12}}f^{\top}Z_{12}}_{\mathrm{II}}.

For the first term, we have

|I|\displaystyle|\mathrm{I}| dH(^δ,12)Z^121\displaystyle\leq d_{H}(\widehat{\mathcal{F}}^{\prime}_{\delta},\mathcal{F}^{\prime}_{12})~{}\|\widehat{Z}_{12}\|_{1} by Lemma 1
dH(^δ,12)KZ^122\displaystyle\leq d_{H}(\widehat{\mathcal{F}}^{\prime}_{\delta},\mathcal{F}^{\prime}_{12})~{}\sqrt{K}\|\widehat{Z}_{12}\|_{2}
=dH(^δ,12)𝒪(Ktr(Σ^12))\displaystyle=d_{H}(\widehat{\mathcal{F}}^{\prime}_{\delta},\mathcal{F}^{\prime}_{12})~{}\mathcal{O}_{\mathbb{P}}\left(\sqrt{K{\rm tr}(\widehat{\Sigma}_{12})}\right)
=dH(^δ,12)𝒪(Ktr(Σ12))\displaystyle=d_{H}(\widehat{\mathcal{F}}^{\prime}_{\delta},\mathcal{F}^{\prime}_{12})~{}\mathcal{O}_{\mathbb{P}}\left(\sqrt{K{\rm tr}(\Sigma_{12})}\right) by part (b) of Lemma 13
=o(1)\displaystyle=o_{\mathbb{P}}(1) by Lemmas 12 & 14 .\displaystyle\text{by Lemmas \ref{lem_Sigma} \& \ref{lem_hausdorff_alternative} }. (98)

Regarding II\mathrm{II}, observe that the function

h(u)=supf12fu,uK,h(u)=\sup_{f\in\mathcal{F}^{\prime}_{12}}f^{\top}u,\quad\forall~{}u\in\mathbb{R}^{K},

is Lipschitz, hence continuous. Indeed, for any u,vKu,v\in\mathbb{R}^{K},

|h(u)h(v)|supf12fuv1supffuv1diam(𝒜)uv1,|h(u)-h(v)|\leq\sup_{f\in\mathcal{F}^{\prime}_{12}}\|f\|_{\infty}\|u-v\|_{1}\leq\sup_{f\in\mathcal{F}}\|f\|_{\infty}\|u-v\|_{1}\leq\mathrm{diam}(\mathcal{A})\|u-v\|_{1}, (99)

and

diam(𝒜)=maxk,kd(Ak,Ak)(25)Cdmaxk,k12AkAk1Cd.\mathrm{diam}(\mathcal{A})=\max_{k,k^{\prime}}d(A_{k},A_{k^{\prime}})\overset{\eqref{lip_d_upper}}{\leq}C_{d}\max_{k,k^{\prime}}{1\over 2}\|A_{k}-A_{k^{\prime}}\|_{1}\leq C_{d}. (100)

By the continuous mapping theorem, it remains to prove Z^12𝑑Z12\widehat{Z}_{12}\overset{d}{\to}Z_{12}. Since Z^12=𝑑Σ^121/2Z\widehat{Z}_{12}\overset{d}{=}\widehat{\Sigma}_{12}^{1/2}Z and Z12=𝑑Σ121/2ZZ_{12}\overset{d}{=}\Sigma_{12}^{1/2}Z for Z𝒩K(0,\bmIK)Z\sim\mathcal{N}_{K}(0,\bm{I}_{K}), and

Σ^121/2ZΣ121/2Z22=𝒪(Σ^121/2Σ121/2F2)=o(1)\|\widehat{\Sigma}_{12}^{1/2}Z-\Sigma_{12}^{1/2}Z\|_{2}^{2}=\mathcal{O}_{\mathbb{P}}\left(\|\widehat{\Sigma}_{12}^{1/2}-\Sigma_{12}^{1/2}\|_{F}^{2}\right)=o_{\mathbb{P}}(1)

from part (c) of Lemma 13, we conclude that Z^12𝑑Z12\widehat{Z}_{12}\overset{d}{\to}Z_{12} hence

supf12fZ^12𝑑supf12fZ12.\sup_{f\in\mathcal{F}^{\prime}_{12}}f^{\top}\widehat{Z}_{12}\overset{d}{\to}\sup_{f\in\mathcal{F}^{\prime}_{12}}f^{\top}Z_{12}.

In conjunction with (B.3.10), the proof is complete. ∎
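The proof above justifies the Monte Carlo approximation of the limit YY that underlies the confidence intervals: one repeatedly draws Z^12𝒩K(0,Σ^12)\widehat{Z}_{12}\sim\mathcal{N}_{K}(0,\widehat{\Sigma}_{12}) and evaluates the supremum over ^δ\widehat{\mathcal{F}}^{\prime}_{\delta}, which is a linear program. The Python sketch below illustrates this recipe; the inputs (the estimated topic matrix, the two weight vectors, the covariance and the slack δ\delta) are synthetic placeholders, the metric dd is taken to be total variation for concreteness, and W^\widehat{W} is computed through Kantorovich-Rubinstein duality over the set of 11-Lipschitz functions with f1=0f_{1}=0.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
p, K, M, delta = 40, 4, 2000, 0.05            # placeholder dimensions, MC size and slack

# Placeholder estimates: word-topic matrix and the two mixture weight vectors.
A_hat = rng.dirichlet(np.ones(p), size=K).T   # p x K
alpha1 = rng.dirichlet(np.ones(K))
alpha2 = rng.dirichlet(np.ones(K))
diff = alpha1 - alpha2

# Metric on the estimated mixture components; total variation for concreteness.
D = 0.5 * np.abs(A_hat[:, :, None] - A_hat[:, None, :]).sum(axis=0)   # K x K

# Encode f_1 = 0 (index 0 here) and |f_k - f_l| <= D_kl as linear constraints.
rows, rhs = [], []
for k in range(K):
    for l in range(k + 1, K):
        e = np.zeros(K); e[k], e[l] = 1.0, -1.0
        rows += [e, -e]; rhs += [D[k, l], D[k, l]]
A_pair, b_pair = np.array(rows), np.array(rhs)
A_eq, b_eq = np.zeros((1, K)), np.zeros(1)
A_eq[0, 0] = 1.0

def sup_over(c, extra_A=None, extra_b=None):
    """Maximize c^T f over the feasible set, with optional extra inequalities."""
    A_ub = A_pair if extra_A is None else np.vstack([A_pair, extra_A])
    b_ub = b_pair if extra_b is None else np.concatenate([b_pair, extra_b])
    res = linprog(-c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(None, None)] * K, method="highs")
    return -res.fun

# Plug-in distance via Kantorovich-Rubinstein duality: W_hat = sup_f f^T (alpha1 - alpha2).
W_hat = sup_over(diff)

# Additional constraint |f^T(alpha1 - alpha2) - W_hat| <= delta, as in (51).
extra_A = np.vstack([diff, -diff])
extra_b = np.array([W_hat + delta, delta - W_hat])

# Placeholder covariance for Z_hat; in practice this is Sigma_hat^(1) + Sigma_hat^(2) from (49).
S = rng.normal(size=(K, K))
Sigma12 = S @ S.T / K

# Monte Carlo draws of Y_hat = sup over the constrained set of f^T Z_hat.
Z = rng.multivariate_normal(np.zeros(K), Sigma12, size=M)
Y_hat = np.array([sup_over(z, extra_A, extra_b) for z in Z])

print(W_hat, np.quantile(Y_hat, 0.95))        # 0.95-quantile used for the confidence interval
```

The empirical quantiles of the simulated draws play the role of F^N,M\widehat{F}_{N,M} above; with real data, the placeholder inputs would be replaced by the estimators of the paper.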

B.3.11 Lemmas used in the proof of Theorem 6

The lemma below provides upper and lower bounds on the eigenvalues of Σ(i)\Sigma^{(i)} defined in (8). Recall κ¯K\overline{\kappa}_{K} and ρ\rho from (19) and (86).

Lemma 12.

Provided that κ¯K>0\overline{\kappa}_{K}>0, we have rank(Σ(i))=K1\mathrm{rank}(\Sigma^{(i)})=K-1 and

1KρλK1(Σ(i))λ1(Σ(i))1κ¯K2.{1\over K\rho}\leq\lambda_{K-1}(\Sigma^{(i)})\leq\lambda_{1}(\Sigma^{(i)})\leq{1\over\overline{\kappa}_{K}^{2}}.
Proof.

Let us drop the superscript. Recall from (80) that

Σ=H1/2FH1/2.\Sigma=H^{-1/2}FH^{-1/2}.

By (81), we have

1λ1(H)λK1(Σ)λ1(Σ)1λK(H).{1\over\lambda_{1}(H)}\leq\lambda_{K-1}(\Sigma)\leq\lambda_{1}(\Sigma)\leq{1\over\lambda_{K}(H)}.

On the one hand, Lemma 7 implies

λK(H)κ2(AJ¯,K)κ¯K2.\lambda_{K}(H)\geq\kappa^{2}(A_{\overline{J}},K)\geq\overline{\kappa}_{K}^{2}.

On the other hand, by recalling that r=Aαr=A\alpha,

λ1(H)maxk[K]j:rj>0AjkAj1rj(86)Kρmaxk[K]j:rj>0AjkKρ.\lambda_{1}(H)\leq\max_{k\in[K]}\sum_{j:r_{j}>0}{A_{jk}\|A_{j\cdot}\|_{1}\over r_{j}}\overset{\eqref{bd_rho}}{\leq}K\rho\max_{k\in[K]}\sum_{j:r_{j}>0}A_{jk}\leq K\rho. (101)

The proof is complete. ∎

Recall Σ^12\widehat{\Sigma}_{12} from (97) and Σ12\Sigma_{12} from (96). The following lemma controls Σ^12Σ12\widehat{\Sigma}_{12}-\Sigma_{12} as well as Σ^121/2Σ121/2\widehat{\Sigma}_{12}^{1/2}-\Sigma_{12}^{1/2}.

Lemma 13.

Under conditions of Theorem 6, we have

  • (a)

    Σ^12Σ12op=o(1)\|\widehat{\Sigma}_{12}-\Sigma_{12}\|_{\mathrm{op}}=o_{\mathbb{P}}(1);

  • (b)

    tr(Σ^12)tr(Σ12)=o(1){\rm tr}(\widehat{\Sigma}_{12})-{\rm tr}(\Sigma_{12})=o_{\mathbb{P}}(1);

  • (c)

    Σ^121/2Σ121/2F2=o(1)\|\widehat{\Sigma}_{12}^{1/2}-\Sigma_{12}^{1/2}\|_{F}^{2}=o_{\mathbb{P}}(1).

Proof.

We first prove the bound in the operator norm. By definition,

Σ^12Σ12opQ^(1)Q(1)op+Q^(2)Q(2)op.\|\widehat{\Sigma}_{12}-\Sigma_{12}\|_{\mathrm{op}}\leq\|\widehat{Q}^{(1)}-Q^{(1)}\|_{\mathrm{op}}+\|\widehat{Q}^{(2)}-Q^{(2)}\|_{\mathrm{op}}.

Let us focus on Q^(1)Q(1)op\|\widehat{Q}^{(1)}-Q^{(1)}\|_{\mathrm{op}} and drop the superscripts. Starting with

Q^Qop\displaystyle\|\widehat{Q}-Q\|_{\mathrm{op}} V^1H1op+α^α^ααop\displaystyle\leq\|\widehat{V}^{-1}-H^{-1}\|_{\mathrm{op}}+\|\widehat{\alpha}\widehat{\alpha}^{\top}-\alpha\alpha^{\top}\|_{\mathrm{op}}

and recalling V^\widehat{V} and HH from (43) and (80), the second term is bounded from above by

α^α^ααop\displaystyle\|\widehat{\alpha}\widehat{\alpha}^{\top}-\alpha\alpha^{\top}\|_{\mathrm{op}} =supu𝕊Ku(α^+α)(α^α)u\displaystyle=\sup_{u\in\mathbb{S}^{K}}u^{\top}(\widehat{\alpha}+\alpha)(\widehat{\alpha}^{-}\alpha)^{\top}u
α^+α2α^α2\displaystyle\leq\|\widehat{\alpha}+\alpha\|_{2}\|\widehat{\alpha}-\alpha\|_{2}
2α^α1\displaystyle\leq 2\|\widehat{\alpha}-\alpha\|_{1}
=o(1).\displaystyle=o_{\mathbb{P}}(1).

Furthermore, by invoking Lemma 4, (19) and (101), we have

V^1H1op\displaystyle\|\widehat{V}^{-1}-H^{-1}\|_{\mathrm{op}} =V^1(HV^)H1op\displaystyle=\|\widehat{V}^{-1}(H-\widehat{V})H^{-1}\|_{\mathrm{op}}
H1/2opH1/2V^1H1/2opH1/2(HV^)H1/2opH1/2op\displaystyle\leq\|H^{-1/2}\|_{\mathrm{op}}\|H^{1/2}\widehat{V}^{-1}H^{1/2}\|_{\mathrm{op}}\|H^{-1/2}(H-\widehat{V})H^{-1/2}\|_{\mathrm{op}}\|H^{-1/2}\|_{\mathrm{op}}
=H1op𝒪(ρ2κ¯K2ϵ1,+ρα^α1)\displaystyle=\|H^{-1}\|_{\mathrm{op}}\mathcal{O}_{\mathbb{P}}\left({\rho^{2}\over\overline{\kappa}_{K}^{2}}\epsilon_{1,\infty}+\rho\|\widehat{\alpha}-\alpha\|_{1}\right)
=o(1).\displaystyle=o_{\mathbb{P}}(1).

Combining the last two displays and using the same arguments for Q^(2)Q(2)op\|\widehat{Q}^{(2)}-Q^{(2)}\|_{\mathrm{op}} finishes the proof of part (a).

The result in part (b) follows immediately since |tr(Σ^12Σ12)|KΣ^12Σ12op|{\rm tr}(\widehat{\Sigma}_{12}-\Sigma_{12})|\leq K\|\widehat{\Sigma}_{12}-\Sigma_{12}\|_{\mathrm{op}}.

To prove part (c), write the singular value decomposition of Σ^12\widehat{\Sigma}_{12} as

Σ^12=U^Λ^U^\widehat{\Sigma}_{12}=\widehat{U}\widehat{\Lambda}\widehat{U}^{\top}

where U^=[u^1,,u^K]\widehat{U}=[\widehat{u}_{1},\ldots,\widehat{u}_{K}] and Λ^=diag(λ^1,,λ^r^,0,,0)\widehat{\Lambda}=\mathrm{diag}(\widehat{\lambda}_{1},\ldots,\widehat{\lambda}_{\widehat{r}},0,\ldots,0) with λ^1λ^r^\widehat{\lambda}_{1}\geq\cdots\geq\widehat{\lambda}_{\widehat{r}} and r^=rank(Σ^12)\widehat{r}=\mathrm{rank}(\widehat{\Sigma}_{12}). Similarly, write

Σ12=UΛU\Sigma_{12}=U\Lambda U^{\top}

with U=[u1,,uK]U=[u_{1},\ldots,u_{K}], Λ=diag(λ1,,λr,0,,0)\Lambda=\mathrm{diag}(\lambda_{1},\ldots,\lambda_{r},0,\ldots,0) and r=rank(Σ12)r=\mathrm{rank}(\Sigma_{12}). We will verify at the end of the proof that

λrc,\lambda_{r}\geq c, (102)

for some constant c>0c>0. Since Weyl’s inequality and part (a) imply that

max1jrr^|λ^jλj|Σ^12Σ12op=o(1),\max_{1\leq j\leq r\vee\widehat{r}}\left|\widehat{\lambda}_{j}-\lambda_{j}\right|\leq\|\widehat{\Sigma}_{12}-\Sigma_{12}\|_{\mathrm{op}}=o_{\mathbb{P}}(1), (103)

by further using (102), we can deduce that

limn,N(r^r)=1.\lim_{n,N\to\infty}\mathbb{P}(\widehat{r}\geq r)=1.

From now on, let us work on the event {r^r}\{\widehat{r}\geq r\}. Notice that (102) and (103) also imply

maxr<jr^λ^j1/2Σ^12Σ12op,\displaystyle\max_{r<j\leq\widehat{r}}\widehat{\lambda}_{j}^{1/2}\leq\|\widehat{\Sigma}_{12}-\Sigma_{12}\|_{\mathrm{op}}, (104)
max1jr|λ^j1/2λj1/2|=max1jr|λ^jλj|λ^j1/2+λj1/2Σ^12Σ12opc.\displaystyle\max_{1\leq j\leq r}\left|\widehat{\lambda}_{j}^{1/2}-\lambda_{j}^{1/2}\right|=\max_{1\leq j\leq r}{|\widehat{\lambda}_{j}-\lambda_{j}|\over\widehat{\lambda}_{j}^{1/2}+\lambda_{j}^{1/2}}\leq{\|\widehat{\Sigma}_{12}-\Sigma_{12}\|_{\mathrm{op}}\over\sqrt{c}}. (105)

Furthermore, by a variant of the Davis-Kahan theorem [70], there exists an orthogonal matrix Qr×rQ\in\mathbb{R}^{r\times r} such that

U^rUrQF23/2rΣ^12Σ12opλr23/2rΣ^12Σ12opc.\|\widehat{U}_{r}-U_{r}Q\|_{F}\leq{2^{3/2}\sqrt{r}\|\widehat{\Sigma}_{12}-\Sigma_{12}\|_{\mathrm{op}}\over\lambda_{r}}\leq{2^{3/2}\sqrt{r}\|\widehat{\Sigma}_{12}-\Sigma_{12}\|_{\mathrm{op}}\over\sqrt{c}}. (106)

Here U^r\widehat{U}_{r} (UrU_{r}) contains the first rr columns of U^\widehat{U} (UU). By writing Λ^r=diag(λ^1,,λ^r)\widehat{\Lambda}_{r}=\mathrm{diag}(\widehat{\lambda}_{1},\ldots,\widehat{\lambda}_{r}) and Λr=diag(λ1,,λr)\Lambda_{r}=\mathrm{diag}(\lambda_{1},\ldots,\lambda_{r}) and using the identity

Σ^121/2Σ121/2\displaystyle\widehat{\Sigma}_{12}^{1/2}-\Sigma_{12}^{1/2} =(U^rUrQ)Λ^r1/2U^r+UrQΛ^r1/2(U^rUrQ)\displaystyle=~{}(\widehat{U}_{r}-U_{r}Q)\widehat{\Lambda}_{r}^{1/2}\widehat{U}_{r}^{\top}+U_{r}Q\widehat{\Lambda}_{r}^{1/2}(\widehat{U}_{r}-U_{r}Q)^{\top}
+UrQ(Λ^r1/2QΛr1/2Q)QUr+r<jr^λ^j1/2u^ju^j,\displaystyle\quad+U_{r}Q(\widehat{\Lambda}_{r}^{1/2}-Q^{\top}\Lambda_{r}^{1/2}Q)Q^{\top}U_{r}^{\top}+\sum_{r<j\leq\widehat{r}}\widehat{\lambda}_{j}^{1/2}\widehat{u}_{j}\widehat{u}_{j}^{\top},

we find

Σ^121/2Σ121/2F\displaystyle\|\widehat{\Sigma}_{12}^{1/2}-\Sigma_{12}^{1/2}\|_{F}
2U^rUrQFλ^11/2+QΛ^r1/2QΛr1/2F+(r^r)λ^r+1\displaystyle\leq 2\|\widehat{U}_{r}-U_{r}Q\|_{F}~{}\widehat{\lambda}_{1}^{1/2}+\|Q\widehat{\Lambda}_{r}^{1/2}Q^{\top}-\Lambda_{r}^{1/2}\|_{F}+\sqrt{(\widehat{r}-r)\widehat{\lambda}_{r+1}}
=𝒪(Σ^12Σ12op2+Σ^12Σ12opλ11/2)+rΛ^r1/2QΛr1/2Qop,\displaystyle=\mathcal{O}\left(\|\widehat{\Sigma}_{12}-\Sigma_{12}\|_{\mathrm{op}}^{2}+\|\widehat{\Sigma}_{12}-\Sigma_{12}\|_{\mathrm{op}}\lambda_{1}^{1/2}\right)+\sqrt{r}~{}\|\widehat{\Lambda}_{r}^{1/2}-Q^{\top}\Lambda_{r}^{1/2}Q\|_{\mathrm{op}},

where we used (104) – (106) in the last step. For the second term, we now prove the following bound by considering two cases:

Λ^r1/2QΛr1/2Qopmax1jr|λ^j1/2λj1/2|(103)Σ^12Σ12opc.\|\widehat{\Lambda}_{r}^{1/2}-Q^{\top}\Lambda_{r}^{1/2}Q\|_{\mathrm{op}}\lesssim\max_{1\leq j\leq r}|\widehat{\lambda}_{j}^{1/2}-\lambda_{j}^{1/2}|\overset{\eqref{singular_values}}{\leq}{\|\widehat{\Sigma}_{12}-\Sigma_{12}\|_{\mathrm{op}}\over\sqrt{c}}.

When Λr\Lambda_{r} has distinct values, that is, λiλi+1c\lambda_{i}-\lambda_{i+1}\geq c^{\prime} for all i{1,,(r1)}i\in\{1,\ldots,(r-1)\} with some constant c>0c^{\prime}>0, then QQ is the identity matrix up to signs as u^i\widehat{u}_{i} consistently estimates uiu_{i}, up to the sign, for all i[r]i\in[r]. When Λr\Lambda_{r} has values of the same order, we consider the case λ1λs\lambda_{1}\asymp\cdots\asymp\lambda_{s} for some s<rs<r and λiλi+1c\lambda_{i}-\lambda_{i+1}\geq c^{\prime} for all sir1s\leq i\leq r-1, and similar arguments can be used to prove the general case. Then since {us+1,,ur}\{u_{s+1},\ldots,u_{r}\} can be consistently estimated, we must have, up to signs,

Q=[Q100\bmIrs]Q=\begin{bmatrix}Q_{1}&0\\ 0&\bm{I}_{r-s}\end{bmatrix}

with Q1Q1=\bmIsQ_{1}^{\top}Q_{1}=\bm{I}_{s}. Then by writing λ¯1/2=s1i=1sλi1/2\bar{\lambda}^{1/2}=s^{-1}\sum_{i=1}^{s}\lambda_{i}^{1/2}, we have

Λ^r1/2QΛr1/2Qopmax{maxs<ir|λ^i1/2λi1/2|,Λ^s1/2Q1Λs1/2Q1op}\displaystyle\|\widehat{\Lambda}_{r}^{1/2}-Q^{\top}\Lambda_{r}^{1/2}Q\|_{\mathrm{op}}\leq\max\left\{\max_{s<i\leq r}|\widehat{\lambda}_{i}^{1/2}-\lambda_{i}^{1/2}|,~{}\|\widehat{\Lambda}_{s}^{1/2}-Q_{1}^{\top}\Lambda_{s}^{1/2}Q_{1}\|_{\mathrm{op}}\right\}

and

Λ^s1/2Q1Λs1/2Q1op\displaystyle\|\widehat{\Lambda}_{s}^{1/2}-Q_{1}^{\top}\Lambda_{s}^{1/2}Q_{1}\|_{\mathrm{op}} Λ^s1/2λ¯1/2\bmIsop+λ¯1/2\bmIsΛs1/2op\displaystyle\leq\|\widehat{\Lambda}_{s}^{1/2}-\bar{\lambda}^{1/2}\bm{I}_{s}\|_{\mathrm{op}}+\|\bar{\lambda}^{1/2}\bm{I}_{s}-\Lambda_{s}^{1/2}\|_{\mathrm{op}}
Λ^s1/2Λs1/2op+2λ¯1/2\bmIsΛs1/2op\displaystyle\leq\|\widehat{\Lambda}_{s}^{1/2}-\Lambda_{s}^{1/2}\|_{\mathrm{op}}+2\|\bar{\lambda}^{1/2}\bm{I}_{s}-\Lambda_{s}^{1/2}\|_{\mathrm{op}}
=max1is|λ^i1/2λi1/2|+o(1).\displaystyle=\max_{1\leq i\leq s}|\widehat{\lambda}_{i}^{1/2}-\lambda_{i}^{1/2}|+o(1).

Collecting terms concludes the claim in part (c).

Finally, we verify (102) to complete the proof. By Lemma 12, we know that rK1r\geq K-1. Furthermore, when r=K1r=K-1, we must have α(1)=α(2)\alpha^{(1)}=\alpha^{(2)} whence λK12λK1(Σ(i))\lambda_{K-1}\geq 2\lambda_{K-1}(\Sigma^{(i)}) is bounded from below, by using Lemma 12 again. If r=Kr=K, then α(1)α(2)\alpha^{(1)}\neq\alpha^{(2)}, hence the null space of Σ(i)\Sigma^{(i)} does not overlap with that of Σ(j)\Sigma^{(j)}. This together with Lemma 12 implies (102). ∎
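The proof of part (c) reduces the Frobenius control of the square-root difference to the operator-norm control of part (a), using the eigenvalue lower bound (102). The short sketch below is a numerical sanity check of this reduction on synthetic covariance matrices whose smallest eigenvalue is bounded away from zero; it is only an illustration, not part of the argument.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a positive semi-definite matrix via eigendecomposition."""
    w, U = np.linalg.eigh(S)
    return U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.T

rng = np.random.default_rng(3)
K = 6

# Synthetic Sigma with smallest eigenvalue bounded away from zero, as in (102).
B = rng.normal(size=(K, K))
Sigma = B @ B.T + np.eye(K)

for eps in [1e-1, 1e-2, 1e-3]:
    E = rng.normal(size=(K, K)); E = (E + E.T) / 2.0
    E *= eps / np.linalg.norm(E, ord=2)       # so that ||Sigma_hat - Sigma||_op = eps
    Sigma_hat = Sigma + E
    frob = np.linalg.norm(psd_sqrt(Sigma_hat) - psd_sqrt(Sigma), "fro")
    print(eps, frob)                          # shrinks together with eps, as part (c) asserts
```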

B.3.12 Consistency of ^δ\widehat{\mathcal{F}}^{\prime}_{\delta} in the Hausdorff distance

The following lemma proves the consistency of ^δ\widehat{\mathcal{F}}^{\prime}_{\delta} in the Hausdorff distance. For some sequences ϵA,ϵα,A>0\epsilon_{A},\epsilon_{\alpha,A}>0, define the event

(ϵA,ϵα,A):={maxk[K]d(A^k,Ak)ϵA}{i=12α^(i)α(i)12ϵα,A}.\mathcal{E}(\epsilon_{A},\epsilon_{\alpha,A}):=\left\{\max_{k\in[K]}d(\widehat{A}_{k},A_{k})\leq\epsilon_{A}\right\}\bigcap\left\{\sum_{i=1}^{2}\|\widehat{\alpha}^{(i)}-\alpha^{(i)}\|_{1}\leq 2\epsilon_{\alpha,A}\right\}.
Lemma 14.

On the event (ϵA,ϵα,A)\mathcal{E}(\epsilon_{A},\epsilon_{\alpha,A}) for some sequences ϵA,ϵα,A0\epsilon_{A},\epsilon_{\alpha,A}\to 0, by choosing

δ6ϵA+2diam(𝒜)ϵα,A,\delta\geq 6\epsilon_{A}+2\mathrm{diam}(\mathcal{A})\epsilon_{\alpha,A},

and δ0\delta\to 0, we have

limδ0dH(^δ,12)=0.\lim_{\delta\to 0}d_{H}(\widehat{\mathcal{F}}^{\prime}_{\delta},\mathcal{F}^{\prime}_{12})=0.

In particular, under conditions of Theorem 6, we have

dH(^δ,12)=o(1),as n,N.d_{H}(\widehat{\mathcal{F}}^{\prime}_{\delta},\mathcal{F}^{\prime}_{12})=o_{\mathbb{P}}(1),\quad\text{as }n,N\to\infty.
Proof.

For simplicity, we drop the subscripts and write

W=W(𝜶(1),𝜶(2);d),W^=W(𝜶^(1),𝜶^(2);d).W=W(\boldsymbol{\alpha}^{(1)},\boldsymbol{\alpha}^{(2)};d),\qquad\widehat{W}=W(\widehat{\boldsymbol{\alpha}}^{(1)},\widehat{\boldsymbol{\alpha}}^{(2)};d).

By the proof of Eq. 22 together with (100), we have

|W^W|2ϵA+diam(𝒜)ϵα,A.|\widehat{W}-W|\leq 2\epsilon_{A}+\mathrm{diam}(\mathcal{A})\epsilon_{\alpha,A}. (107)

Recall that

=G,\displaystyle\mathcal{F}^{\prime}=\mathcal{F}\cap G,\qquad G:={fK:f(α(1)α(2))W=0};\displaystyle G:=\left\{f\in\mathbb{R}^{K}:f^{\top}(\alpha^{(1)}-\alpha^{(2)})-W=0\right\};
^δ=^G^δ,\displaystyle\widehat{\mathcal{F}}^{\prime}_{\delta}=\widehat{\mathcal{F}}\cap\widehat{G}_{\delta}, G^δ:={fK:|f(α^(1)α^(2))W^|δ}.\displaystyle\widehat{G}_{\delta}:=\left\{f\in\mathbb{R}^{K}:\left|f^{\top}(\widehat{\alpha}^{(1)}-\widehat{\alpha}^{(2)})-\widehat{W}\right|\leq\delta\right\}.

By definition of the Hausdorff distance, we need to show that, as δ0\delta\to 0,

supf^δfvsupffv,uniformly over 1:={vK:v11}.\sup_{f\in\widehat{\mathcal{F}}^{\prime}_{\delta}}f^{\top}v\to\sup_{f\in\mathcal{F}^{\prime}}f^{\top}v,\quad\text{uniformly over }\mathcal{B}_{1}:=\{v\in\mathbb{R}^{K}:\|v\|_{1}\leq 1\}.

Since the functions hδ:1h_{\delta}:\mathcal{B}_{1}\to\mathbb{R}, defined as hδ(v)=supf^δfvh_{\delta}(v)=\sup_{f\in\widehat{\mathcal{F}}^{\prime}_{\delta}}f^{\top}v for the sequence of δ\delta, are equicontinuous by the fact that ^\widehat{\mathcal{F}} is compact (see (99)), it suffices to prove

limδ0supf^δfv=supffv,for each v1.\lim_{\delta\to 0}\sup_{f\in\widehat{\mathcal{F}}^{\prime}_{\delta}}f^{\top}v=\sup_{f\in\mathcal{F}^{\prime}}f^{\top}v,\qquad\text{for each }v\in\mathcal{B}_{1}.

To this end, fix any v1v\in\mathcal{B}_{1}. We first bound from above

g(v):=supffvsupf^δfv.g(v):=\sup_{f\in\mathcal{F}^{\prime}}f^{\top}v-\sup_{f\in\widehat{\mathcal{F}}^{\prime}_{\delta}}f^{\top}v.

Pick any ff\in\mathcal{F}^{\prime}. By the construction of f^\widehat{f} as in the proof of Lemma 2, there exists f^^\widehat{f}\in\widehat{\mathcal{F}} such that

f^fmaxk,k[K]|d(Ak,Ak)d(A^k,A^k)|2ϵA.\|\widehat{f}-f\|_{\infty}\leq\max_{k,k^{\prime}\in[K]}\left|d(A_{k},A_{k^{\prime}})-d(\widehat{A}_{k},\widehat{A}_{k^{\prime}})\right|\leq 2\epsilon_{A}.

Moreover, by adding and subtracting terms, we find

|f^(α^(1)α^(2))W^|\displaystyle\left|\widehat{f}^{\top}(\widehat{\alpha}^{(1)}-\widehat{\alpha}^{(2)})-\widehat{W}\right| |(f^f)(α^(1)α^(2))|+|f(α^(1)α(1)α^(2)+α(2))|\displaystyle\leq\left|(\widehat{f}-f)^{\top}(\widehat{\alpha}^{(1)}-\widehat{\alpha}^{(2)})\right|+\left|f^{\top}(\widehat{\alpha}^{(1)}-\alpha^{(1)}-\widehat{\alpha}^{(2)}+\alpha^{(2)})\right|
+|f(α(1)α(2))W^|\displaystyle\qquad+\left|f^{\top}(\alpha^{(1)}-\alpha^{(2)})-\widehat{W}\right|
f^fα^(1)α^(2)1+fα^(1)α(1)α^(2)+α(2)1\displaystyle\leq\|\widehat{f}-f\|_{\infty}\|\widehat{\alpha}^{(1)}-\widehat{\alpha}^{(2)}\|_{1}+\|f\|_{\infty}\|\widehat{\alpha}^{(1)}-\alpha^{(1)}-\widehat{\alpha}^{(2)}+\alpha^{(2)}\|_{1}
+|W^W|\displaystyle\qquad+|\widehat{W}-W|
6ϵA+2diam(𝒜)ϵα,A\displaystyle\leq 6\epsilon_{A}+2\mathrm{diam}(\mathcal{A})\epsilon_{\alpha,A}
δ.\displaystyle\leq\delta. (108)

Therefore, f^G^δ\widehat{f}\in\widehat{G}_{\delta}, hence f^^δ\widehat{f}\in\widehat{\mathcal{F}}^{\prime}_{\delta}. For this choice of f^\widehat{f}, we have

g(v)\displaystyle g(v) supf(fvf^v)supff^fv12ϵA0.\displaystyle\leq\sup_{f\in\mathcal{F}^{\prime}}\left(f^{\top}v-\widehat{f}~{}^{\top}v\right)\leq\sup_{f\in\mathcal{F}^{\prime}}\|\widehat{f}-f\|_{\infty}\|v\|_{1}\leq 2\epsilon_{A}\to 0. (109)

We proceed to bound from above g(v)-g(v). Define

δ={fK:f1=0,|fkf|d(Ak,A)+δ,k,[K]},\displaystyle\mathcal{F}_{\delta}=\{f\in\mathbb{R}^{K}:f_{1}=0,~{}|f_{k}-f_{\ell}|\leq d(A_{k},A_{\ell})+\delta,~{}\forall k,\ell\in[K]\},
Gδ={fK:|f(α(1)α(2))W|2δ}.\displaystyle G_{\delta}=\left\{f\in\mathbb{R}^{K}:\left|f^{\top}(\alpha^{(1)}-\alpha^{(2)})-W\right|\leq 2\delta\right\}.

Clearly, δ\mathcal{F}_{\delta} is compact, and both δ\mathcal{F}_{\delta} and GδG_{\delta} are convex and monotonic in the sense that δδ\mathcal{F}_{\delta}\subseteq\mathcal{F}_{\delta^{\prime}} and GδGδG_{\delta}\subseteq G_{\delta^{\prime}} for any δδ\delta\leq\delta^{\prime}. Furthermore, on the event (ϵA,ϵα,A)\mathcal{E}(\epsilon_{A},\epsilon_{\alpha,A}), we have ^δ\widehat{\mathcal{F}}\subseteq\mathcal{F}_{\delta} and G^δGδ\widehat{G}_{\delta}\subseteq G_{\delta}, implying that

g(v)=supf^G^δfvsupffvsupfδGδfvsupffv.-g(v)=\sup_{f\in\widehat{\mathcal{F}}\cap\widehat{G}_{\delta}}f^{\top}v-\sup_{f\in\mathcal{F}^{\prime}}f^{\top}v\leq\sup_{f\in\mathcal{F}_{\delta}\cap G_{\delta}}f^{\top}v-\sup_{f\in\mathcal{F}^{\prime}}f^{\top}v.

Since δGδ\mathcal{F}_{\delta}\cap G_{\delta} is compact, we write fδf_{\delta} for

fδv=supfδGδfv.f_{\delta}^{\top}v=\sup_{f\in\mathcal{F}_{\delta}\cap G_{\delta}}f^{\top}v.

Observe that fδvf_{\delta}^{\top}v is non-increasing as δ0\delta\to 0 and bounded from below, since =G\mathcal{F}^{\prime}=\mathcal{F}\cap G is contained in δGδ\mathcal{F}_{\delta}\cap G_{\delta} for every δ>0\delta>0; it therefore has a limit. By compactness, any accumulation point f0f_{0} of {fδ}\{f_{\delta}\} lies in every δGδ\mathcal{F}_{\delta}\cap G_{\delta}, hence in G=\mathcal{F}\cap G=\mathcal{F}^{\prime}, so the limit is achieved by f0vf_{0}^{\top}v with f0f_{0}\in\mathcal{F}^{\prime}. We thus conclude

lim supδ0(g(v))lim supδ0fδvsupffv0.\limsup_{\delta\to 0}(-g(v))\leq\limsup_{\delta\to 0}f_{\delta}^{\top}v-\sup_{f\in\mathcal{F}^{\prime}}f^{\top}v\leq 0. (110)

Combining (109) and (110) completes the proof. ∎

Appendix C Minimax lower bounds of estimating any metric on discrete probability measures that is bi-Lipschitz equivalent to the Total Variation Distance

Denote by (𝒳,d)(\mathcal{X},d) an arbitrary finite metric space of cardinality KK with dd satisfying (27) for some 0<cC<0<c_{*}\leq C_{*}<\infty. In the main text, we used the notation DD for this distance. We re-denote it here by dd, as we found that this increased the clarity of exposition.

The following theorem provides minimax lower bounds for estimating d(P,Q)d(P,Q) for any P,Q𝒫(𝒳)P,Q\in\mathcal{P}(\mathcal{X}) based on NN i.i.d. samples from each of PP and QQ.

Theorem 10.

Let (𝒳,d)(\mathcal{X},d) and 0<cC<0<c_{*}\leq C_{*}<\infty be as above. There exist positive universal constants C,CC,C^{\prime} such that if NCKN\geq CK, then

infd^supP,Q𝒫(𝒳)𝔼|d^d(P,Q)|C(c2CKN(logK)2CN),\inf_{\widehat{d}}\sup_{P,Q\in\mathcal{P}(\mathcal{X})}\mathbb{E}|\widehat{d}-d(P,Q)|\geq C^{\prime}\left({c_{*}^{2}\over C_{*}}\sqrt{\frac{K}{N(\log K)^{2}}}\vee\frac{C_{*}}{\sqrt{N}}\right)\,,

where the infimum is taken over all estimators constructed from NN i.i.d. samples from PP and QQ, respectively.

Proof.

In this proof, the symbols CC and cc denote universal positive constants whose value may change from line to line. For notational simplicity, we let

κ𝒳=Cc\kappa_{\mathcal{X}}={C_{*}\over c_{*}}

Without loss of generality, we may assume by rescaling the metric that C=1C_{*}=1. The lower bound 1/N1/\sqrt{N} follows directly from Le Cam’s method [see 62, Section 2.3]. We therefore focus on proving the lower bound κ𝒳2K/(N(logK)2)\kappa_{\mathcal{X}}^{-2}\sqrt{K/(N(\log K)^{2})}, and may assume that K/(logK)2Cκ𝒳4K/(\log K)^{2}\geq C\kappa_{\mathcal{X}}^{4}. In proving the lower bound, we may also assume that QQ is fixed to be the uniform distribution over 𝒳\mathcal{X}, which we denote by ρ\rho, and that the statistician obtains NN i.i.d. observations from the unknown distribution PP alone. We write

RNinfd^supPΔK𝔼P|d^d(P,ρ)|R_{N}\coloneqq\inf_{\widehat{d}}\sup_{P\in\Delta_{K}}\mathbb{E}_{P}|\widehat{d}-d(P,\rho)|

for the corresponding minimax risk.

We employ the method of “fuzzy hypotheses” [62], and, following a standard reduction [see, e.g., 36, 68], we will derive a lower bound on the minimax risk by considering a modified observation model with Poisson observations. Concretely, in the original observation model, the empirical frequency count vector ZNP^Z\coloneqq N\widehat{P} is a sufficient statistic, with distribution

ZMultinomialK(N;P).Z\sim\mathrm{Multinomial}_{K}(N;P)\,. (111)

We define an alternate observation model under which ZZ has independent entries, with ZkPoisson(NPk)Z_{k}\sim\mathrm{Poisson}(NP_{k}), which we abbreviate as ZPoisson(NP)Z\sim\mathrm{Poisson}(NP). We write 𝔼P\mathbb{E}_{P}^{\prime} for expectations with respect to this probability measure. Note that, in contrast to the multinomial model, the Poisson model is well defined for any P+KP\in\mathbb{R}_{+}^{K}.

Write ΔK={P+K:P13/4}\Delta_{K}^{\prime}=\{P\in\mathbb{R}_{+}^{K}:\|P\|_{1}\geq 3/4\}, and define

R~Ninfd^supPΔK𝔼P|d^d(P,ρ)|\tilde{R}_{N}\coloneqq\inf_{\widehat{d}}\sup_{P\in\Delta_{K}^{\prime}}\mathbb{E}^{\prime}_{P}\big{|}\widehat{d}-d(P,\rho)\big{|}

The following lemma shows that RNR_{N} may be controlled by R~2N\tilde{R}_{2N}.

Lemma 15.

For all N1N\geq 1,

R~2NRN+exp(N/12).\tilde{R}_{2N}\leq R_{N}+\exp(-N/12)\,.
Proof.

Fix δ>0\delta>0, and for each n1n\geq 1 denote by d^n\widehat{d}_{n} a near-minimax estimator for the multinomial sampling model:

supPΔK𝔼P|d^nd(P,ρ)|Rn+δ.\sup_{P\in\Delta_{K}}\mathbb{E}_{P}|\widehat{d}_{n}-d(P,\rho)|\leq R_{n}+\delta\,.

Since ZZ is a sufficient statistic, we may assume without loss of generality that d^n\widehat{d}_{n} is a function of the empirical counts ZZ.

We define an estimator for the Poisson sampling model by setting d^(Z)=d^N(Z)\widehat{d}^{\prime}(Z)=\widehat{d}_{N^{\prime}}(Z), where Nk=1KZkN^{\prime}\coloneqq\sum_{k=1}^{K}Z_{k}. If ZPoisson(2NP)Z\sim\mathrm{Poisson}(2NP) for P+KP\in\mathbb{R}_{+}^{K}, then, conditioned on N=nN^{\prime}=n^{\prime}, the random variable ZZ has distribution MultinomialK(n;P/P1)\mathrm{Multinomial}_{K}(n^{\prime};P/\|P\|_{1}). For any PΔKP\in\Delta_{K}^{\prime}, we then have

𝔼P|d^d(P/P1,ρ)|\displaystyle\mathbb{E}_{P}^{\prime}|\widehat{d}^{\prime}-d(P/\|P\|_{1},\rho)| =n0𝔼P[|d^d(P/P1,ρ)|N=n]P[N=n]\displaystyle=\sum_{n^{\prime}\geq 0}\mathbb{E}_{P}^{\prime}\Big{[}|\widehat{d}^{\prime}-d(P/\|P\|_{1},\rho)\Big{|}N^{\prime}=n^{\prime}\Big{]}\mathbb{P}_{P}[N^{\prime}=n^{\prime}]
=n0𝔼P[|d^nd(P/P1,ρ)||N=n]P[N=n]\displaystyle=\sum_{n^{\prime}\geq 0}\mathbb{E}_{P}^{\prime}\Big{[}|\widehat{d}_{n^{\prime}}-d(P/\|P\|_{1},\rho)|\Big{|}N^{\prime}=n^{\prime}\Big{]}\mathbb{P}_{P}[N^{\prime}=n^{\prime}]
=n0𝔼P/P1[|d^nd(P/P1,ρ)|]P[N=n]\displaystyle=\sum_{n^{\prime}\geq 0}\mathbb{E}_{P/\|P\|_{1}}\Big{[}|\widehat{d}_{n^{\prime}}-d(P/\|P\|_{1},\rho)|\Big{]}\mathbb{P}_{P}[N^{\prime}=n^{\prime}]
(n0RnP[N=n])+δ.\displaystyle\leq\Big{(}\sum_{n^{\prime}\geq 0}R_{n^{\prime}}\mathbb{P}_{P}[N^{\prime}=n^{\prime}]\Big{)}+\delta\,.

Since NPoisson(2NP1)N^{\prime}\sim\mathrm{Poisson}(2N\|P\|_{1}), and RnR_{n^{\prime}} is a non-increasing function of nn^{\prime} and satisfies Rn1R_{n^{\prime}}\leq 1, standard tail bounds for the Poisson distribution show that if P13/4\|P\|_{1}\geq 3/4, then

𝔼P|d^d(P/P1,ρ)|P[N<N]+nNRnP[N=n]+δeN/12+RN+δ.\mathbb{E}_{P}^{\prime}|\widehat{d}^{\prime}-d(P/\|P\|_{1},\rho)|\leq\mathbb{P}_{P}[N^{\prime}<N]+\sum_{n^{\prime}\geq N}R_{n^{\prime}}\mathbb{P}_{P}[N^{\prime}=n^{\prime}]+\delta\leq e^{-N/12}+R_{N}+\delta\,.

Since PΔKP\in\Delta_{K}^{\prime} and δ>0\delta>0 were arbitrary, taking the supremum over PP and infimum over all estimators d^N\widehat{d}_{N} yields the claim. ∎
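The conditioning step used in the display above (a Poisson vector conditioned on its total count is a multinomial draw with the normalized intensity) can be checked by a short simulation; the intensity vector below is an arbitrary placeholder with total mass 0.9, which exceeds 3/4.

```python
import numpy as np

rng = np.random.default_rng(4)
K, N, reps = 5, 50, 200000

P = rng.dirichlet(np.ones(K)) * 0.9           # intensity with ||P||_1 = 0.9 >= 3/4
Z = rng.poisson(2 * N * P, size=(reps, K))    # Z ~ Poisson(2NP), independent coordinates

n_prime = 2 * N                               # condition on one fixed value of the total count
cond = Z[Z.sum(axis=1) == n_prime]            # Poisson draws whose total equals n'
multi = rng.multinomial(n_prime, P / P.sum(), size=len(cond))

# Conditional coordinate means agree with the multinomial means n' * P_k / ||P||_1.
print(cond.mean(axis=0))
print(multi.mean(axis=0))
```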

It therefore suffices to prove a lower bound on R~N\tilde{R}_{N}. Fix an ϵ(0,1/4)\epsilon\in(0,1/4), ϰ1\varkappa\geq 1 and a positive integer LL to be specified. We employ the following proposition.

Proposition 3.

There exists a universal positive constant c0c_{0} such that for any ϰ1\varkappa\geq 1 and positive integer LL, there exists a pair of mean-zero random variables XX and YY on [1,1][-1,1] satisfying the following properties:

  • 𝔼X=𝔼Y\mathbb{E}X^{\ell}=\mathbb{E}Y^{\ell} for =1,,2L2\ell=1,\dots,2L-2

  • 𝔼|X|ϰ𝔼|Y|c0L1ϰ1\mathbb{E}|X|\geq\varkappa\mathbb{E}|Y|\geq c_{0}L^{-1}\varkappa^{-1}

Proof.

A proof appears in Section C.1. ∎

We define two priors μ0\mu_{0} and μ1\mu_{1} on +K\mathbb{R}_{+}^{K} by letting

μ1\displaystyle\mu_{1} =Law(1K+ϵKX1,,1K+ϵKXK)\displaystyle=\mathrm{Law}\left(\frac{1}{K}+\frac{\epsilon}{K}X_{1},\dots,\frac{1}{K}+\frac{\epsilon}{K}X_{K}\right)
μ0\displaystyle\mu_{0} =Law(1K+ϵKY1,,1K+ϵKYK),\displaystyle=\mathrm{Law}\left(\frac{1}{K}+\frac{\epsilon}{K}Y_{1},\dots,\frac{1}{K}+\frac{\epsilon}{K}Y_{K}\right)\,,

where X1,,XKX_{1},\dots,X_{K} are i.i.d. copies of XX and Y1,,YKY_{1},\dots,Y_{K} are i.i.d. copies of YY. Since ϵ<1/4\epsilon<1/4 and Xk,Yk1X_{k},Y_{k}\geq-1 almost surely, μ1\mu_{1} and μ0\mu_{0} are supported on ΔK\Delta_{K}^{\prime}.

The following lemma shows that μ0\mu_{0} and μ1\mu_{1} are well separated with respect to the values of the functional Pd(P/P1,ρ)P\mapsto d(P/\|P\|_{1},\rho).

Lemma 16.

Assume that ϰ7κ𝒳\varkappa\geq 7\kappa_{\mathcal{X}}. Then there exists r+r\in\mathbb{R}_{+} such that

μ0(P:d(P/P1,ρ)r)1δ\displaystyle\mu_{0}(P:d(P/\|P\|_{1},\rho)\leq r)\geq 1-\delta
μ1(P:d(P/P1,ρ)r+2s)1δ\displaystyle\mu_{1}(P:d(P/\|P\|_{1},\rho)\geq r+2s)\geq 1-\delta

where s=12ϵc0L1ϰ2s=\frac{1}{2}\epsilon c_{0}L^{-1}\varkappa^{-2} and δ=2eKc02/L2ϰ4\delta=2e^{-Kc_{0}^{2}/L^{2}\varkappa^{4}}

Proof.

By Eq. 27, we have for any two probability measures ν,νΔK\nu,\nu^{\prime}\in\Delta_{K},

κ𝒳1νν12d(ν,ν)νν1.\kappa_{\mathcal{X}}^{-1}\|\nu-\nu^{\prime}\|_{1}\leq 2d(\nu,\nu^{\prime})\leq\|\nu-\nu^{\prime}\|_{1}\,. (112)

Note that

Pρ1μ0(dP)=𝔼k=1K|1K+ϵKYk1K|=ϵ𝔼|Y|,\int\|P-\rho\|_{1}\,\mu_{0}(dP)=\mathbb{E}\sum_{k=1}^{K}\left|\frac{1}{K}+\frac{\epsilon}{K}Y_{k}-\frac{1}{K}\right|=\epsilon\mathbb{E}|Y|\,, (113)

and by Hoeffding’s inequality,

μ0(P:Pρ1ϵ𝔼|Y|+t)eKt2/2ϵ2.\mu_{0}(P:\|P-\rho\|_{1}\geq\epsilon\mathbb{E}|Y|+t)\leq e^{-Kt^{2}/2\epsilon^{2}}\,. (114)

Analogously,

μ1(P:Pρ1ϵ𝔼|X|t)eKt2/2ϵ2.\mu_{1}(P:\|P-\rho\|_{1}\leq\epsilon\mathbb{E}|X|-t)\leq e^{-Kt^{2}/2\epsilon^{2}}\,. (115)

Under either distribution, Hoeffding’s inequality also yields that |P11|t\big{|}\|P\|_{1}-1\big{|}\geq t with probability at most eKt2/2ϵ2e^{-Kt^{2}/2\epsilon^{2}}. Take t=ϵc0L1ϰ2t=\epsilon c_{0}L^{-1}\varkappa^{-2}. Then letting δ=2eKc02/L2ϰ4\delta=2e^{-Kc_{0}^{2}/L^{2}\varkappa^{4}}, we have with μ0\mu_{0} probability at least 1δ1-\delta that

2d(P/P1,ρ)\displaystyle 2d(P/\|P\|_{1},\rho) PP1ρ1\displaystyle\leq\left\|\frac{P}{\|P\|_{1}}-\rho\right\|_{1}
PPP11+Pρ1\displaystyle\leq\left\|P-\frac{P}{\|P\|_{1}}\right\|_{1}+\|P-\rho\|_{1}
=|P11|+Pρ1\displaystyle=\big{|}\|P\|_{1}-1\big{|}+\|P-\rho\|_{1}
ϵ𝔼|Y|+2ϵc0L1ϰ2.\displaystyle\leq\epsilon\mathbb{E}|Y|+2\epsilon c_{0}L^{-1}\varkappa^{-2}\,. (116)

And, analogously, with μ1\mu_{1} probability at least 1δ1-\delta,

2d(P/P1,ρ)\displaystyle 2d(P/\|P\|_{1},\rho) κ𝒳1PP1ρ1\displaystyle\geq\kappa_{\mathcal{X}}^{-1}\left\|\frac{P}{\|P\|_{1}}-\rho\right\|_{1}
κ𝒳1Pρ1PPP11\displaystyle\geq\kappa_{\mathcal{X}}^{-1}\|P-\rho\|_{1}-\left\|P-\frac{P}{\|P\|_{1}}\right\|_{1}
ϵκ𝒳1𝔼|X|2ϵc0L1ϰ2\displaystyle\geq\epsilon\kappa_{\mathcal{X}}^{-1}\mathbb{E}|X|-2\epsilon c_{0}L^{-1}\varkappa^{-2}
ϵϰκ𝒳1𝔼|Y|2ϵc0L1ϰ2,\displaystyle\geq\epsilon\varkappa\kappa_{\mathcal{X}}^{-1}\mathbb{E}|Y|-2\epsilon c_{0}L^{-1}\varkappa^{-2}\,, (117)

where we have used that κ𝒳1\kappa_{\mathcal{X}}\geq 1 and 𝔼|X|ϰ𝔼|Y|\mathbb{E}|X|\geq\varkappa\mathbb{E}|Y| by construction. Therefore, as long as ϰ7κ𝒳\varkappa\geq 7\kappa_{\mathcal{X}}, we may take r=12ϵ𝔼|Y|+ϵc0L1ϰ2r=\frac{1}{2}\epsilon\mathbb{E}|Y|+\epsilon c_{0}L^{-1}\varkappa^{-2}, in which case

r+2s\displaystyle r+2s =12ϵ𝔼|Y|+2ϵc0L1ϰ2\displaystyle=\frac{1}{2}\epsilon\mathbb{E}|Y|+2\epsilon c_{0}L^{-1}\varkappa^{-2}
12ϵϰκ𝒳1𝔼|Y|3ϵ𝔼|Y|+2ϵc0L1ϰ2\displaystyle\leq\frac{1}{2}\epsilon\varkappa\kappa_{\mathcal{X}}^{-1}\mathbb{E}|Y|-3\epsilon\mathbb{E}|Y|+2\epsilon c_{0}L^{-1}\varkappa^{-2}
12ϵϰκ𝒳1𝔼|Y|ϵc0L1ϰ2,\displaystyle\leq\frac{1}{2}\epsilon\varkappa\kappa_{\mathcal{X}}^{-1}\mathbb{E}|Y|-\epsilon c_{0}L^{-1}\varkappa^{-2}\,,

where the last inequality uses that 𝔼|Y|c0L1ϰ2\mathbb{E}|Y|\geq c_{0}L^{-1}\varkappa^{-2}. With this choice of rr and ss, we have by Eq. 116 and Eq. 117 that

μ0(P:d(P/P1,ρ)r)1δ\displaystyle\mu_{0}(P:d(P/\|P\|_{1},\rho)\leq r)\geq 1-\delta
μ1(P:d(P/P1,ρ)r+2s)1δ\displaystyle\mu_{1}(P:d(P/\|P\|_{1},\rho)\geq r+2s)\geq 1-\delta

as claimed. ∎
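The separation mechanism behind the last two displays can be visualized with simple placeholder distributions for XX and YY whose first absolute moments are well separated (the moment-matching requirement is deliberately ignored here, since it only matters for the indistinguishability step in Lemma 17): the 1\ell_{1} distance to the uniform distribution concentrates around ϵ𝔼|X|\epsilon\mathbb{E}|X| under μ1\mu_{1} and around ϵ𝔼|Y|\epsilon\mathbb{E}|Y| under μ0\mu_{0}, exactly as in (113)–(115).

```python
import numpy as np

rng = np.random.default_rng(5)
K, eps, reps = 2000, 0.2, 500

# Placeholder X, Y: mean zero, supported in [-1, 1], with E|X| = 0.75 and E|Y| = 0.15.
X = rng.choice([-1.0, -0.5, 0.5, 1.0], size=(reps, K))
Y = rng.choice([-0.2, -0.1, 0.1, 0.2], size=(reps, K))

rho = np.full(K, 1.0 / K)
P1 = 1.0 / K + (eps / K) * X                  # draws in the spirit of mu_1
P0 = 1.0 / K + (eps / K) * Y                  # draws in the spirit of mu_0

l1_mu1 = np.abs(P1 - rho).sum(axis=1)         # concentrates near eps * E|X| = 0.15
l1_mu0 = np.abs(P0 - rho).sum(axis=1)         # concentrates near eps * E|Y| = 0.03
print(l1_mu1.mean(), l1_mu1.std())
print(l1_mu0.mean(), l1_mu0.std())
```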

Following [62], we then define “posterior” measures

j=Pμj(dP),j=0,1,\mathbb{P}_{j}=\int\mathbb{P}_{P}\mu_{j}(dP),\quad\quad j=0,1\,,

where P=Poisson(NP)\mathbb{P}_{P}=\mathrm{Poisson}(NP) is the distribution of the Poisson observations with parameter PP. We next bound the statistical distance between 0\mathbb{P}_{0} and 1\mathbb{P}_{1}.

Lemma 17.

If ϵ2(L+1)K/(4e2N)\epsilon^{2}\leq(L+1)K/(4e^{2}N), then

dTV(0,1)K2L.\mathrm{d_{TV}}(\mathbb{P}_{0},\mathbb{P}_{1})\leq K2^{-L}\,.
Proof.

By construction of the priors μ0\mu_{0} and μ1\mu_{1}, we may write

j=(j)K,j=0,1,\mathbb{P}_{j}=(\mathbb{Q}_{j})^{\otimes K},\quad j=0,1\,,

where 0\mathbb{Q}_{0} and 1\mathbb{Q}_{1} are the laws of the random variables U0U_{0} and U1U_{1} defined by

U0λ\displaystyle U_{0}\mid\lambda Poisson(λ0),λ0=𝑑NK+NϵKY\displaystyle\sim\mathrm{Poisson}(\lambda_{0})\,,\quad\quad\lambda_{0}\overset{d}{=}\frac{N}{K}+\frac{N\epsilon}{K}Y
U1λ\displaystyle U_{1}\mid\lambda Poisson(λ1),λ1=𝑑NK+NϵKX.\displaystyle\sim\mathrm{Poisson}(\lambda_{1})\,,\quad\quad\lambda_{1}\overset{d}{=}\frac{N}{K}+\frac{N\epsilon}{K}X\,.

By [36, Lemma 32], if L+1(2eNϵ/K)2/(N/K)=4e2ϵ2N/KL+1\geq(2eN\epsilon/K)^{2}/(N/K)=4e^{2}\epsilon^{2}N/K, then

dTV(0,1)2(eNϵKN(L+1)/K)L+12L.\mathrm{d_{TV}}(\mathbb{Q}_{0},\mathbb{Q}_{1})\leq 2\left(\frac{eN\epsilon}{K\sqrt{N(L+1)/K}}\right)^{L+1}\leq 2^{-L}\,.

Therefore, under this same condition,

dTV(0,1)=dTV(0K,1K)KdTV(0,1)K2L.\mathrm{d_{TV}}(\mathbb{P}_{0},\mathbb{P}_{1})=\mathrm{d_{TV}}(\mathbb{Q}_{0}^{\otimes K},\mathbb{Q}_{1}^{\otimes K})\leq K\mathrm{d_{TV}}(\mathbb{Q}_{0},\mathbb{Q}_{1})\leq K2^{-L}\,.
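For completeness, the tensorization inequality used in the last display is the standard hybrid argument: applying the triangle inequality to the chain of product measures that replace one factor at a time, and using that the total variation distance between two products sharing all but one factor equals the total variation distance between the differing factors, gives

\mathrm{d_{TV}}(\mathbb{Q}_{0}^{\otimes K},\mathbb{Q}_{1}^{\otimes K})\leq\sum_{k=1}^{K}\mathrm{d_{TV}}\left(\mathbb{Q}_{0}^{\otimes(k-1)}\otimes\mathbb{Q}_{1}^{\otimes(K-k+1)},\,\mathbb{Q}_{0}^{\otimes k}\otimes\mathbb{Q}_{1}^{\otimes(K-k)}\right)=K\,\mathrm{d_{TV}}(\mathbb{Q}_{0},\mathbb{Q}_{1}).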

Combining the above two lemmas with [62, Theorem 2.15(i)], we obtain that as long as ϰ7κ𝒳\varkappa\geq 7\kappa_{\mathcal{X}} and ϵ2(L+1)K/(4e2N)\epsilon^{2}\leq(L+1)K/(4e^{2}N), we have

infd^supPΔK𝔼P𝟙{|d^d(P/P1,ρ)|12ϵc0L1ϰ2}12(1K2L)2eKc02/L2ϰ4.\inf_{\widehat{d}}\sup_{P\in\Delta_{K}^{\prime}}\mathbb{E}_{P}^{\prime}\mathds{1}\left\{|\widehat{d}-d(P/\|P\|_{1},\rho)|\geq\frac{1}{2}\epsilon c_{0}L^{-1}\varkappa^{-2}\right\}\geq\frac{1}{2}(1-K2^{-L})-2e^{-Kc_{0}^{2}/L^{2}\varkappa^{4}}\,.

Let ϰ=7κ𝒳\varkappa=7\kappa_{\mathcal{X}}, L=log2K+1L=\lceil\log_{2}K\rceil+1, and set ϵ=cK/N\epsilon=c\sqrt{K/N} for a sufficiently small positive constant cc. By assumption, K/(logK)2Cϰ4K/(\log K)^{2}\geq C\varkappa^{4}, and by choosing CC to be sufficiently large we may make 2eKc02/L2ϰ42e^{-Kc_{0}^{2}/L^{2}\varkappa^{4}} arbitrarily small, so that the right side of the above inequality is strictly positive. We therefore obtain

R~NCϰ2KN(logK)2,\tilde{R}_{N}\geq C\varkappa^{-2}\sqrt{\frac{K}{N(\log K)^{2}}}\,,

and applying Lemma 15 yields

RNCϰ2KN(logK)2exp(N/12).R_{N}\geq C\varkappa^{-2}\sqrt{\frac{K}{N(\log K)^{2}}}-\exp(-N/12)\,.

Since we have assumed that the first term is at least N1/2N^{-1/2}, the second term is negligible as long as NN is larger than a universal constant. This proves the claim. ∎
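For concreteness, we record the elementary arithmetic behind these parameter choices (assuming, as we may, that c\leq 1/(2e) and K\geq 2): the condition of Lemma 17 holds since

\epsilon^{2}=\frac{c^{2}K}{N}\leq\frac{(L+1)K}{4e^{2}N},

because 4e^{2}c^{2}\leq 1\leq L+1, while the separation of Lemma 16 is of order

\frac{1}{2}\,\epsilon\,c_{0}L^{-1}\varkappa^{-2}=\frac{c\,c_{0}}{2(\lceil\log_{2}K\rceil+1)}\,\varkappa^{-2}\sqrt{\frac{K}{N}}\gtrsim\varkappa^{-2}\sqrt{\frac{K}{N(\log K)^{2}}}.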

C.1 Proof of Proposition 3

Proof.

First, we note that it suffices to construct a pair of random variables XX^{\prime} and YY^{\prime} on [0,(c1Lϰ)2][0,(c_{1}L\varkappa)^{2}] satisfying 𝔼(X)k=𝔼(Y)k\mathbb{E}(X^{\prime})^{k}=\mathbb{E}(Y^{\prime})^{k} for k=1,,L1k=1,\dots,L-1 and 𝔼Xϰ𝔼Yc2\mathbb{E}\sqrt{X^{\prime}}\geq\varkappa\mathbb{E}\sqrt{Y^{\prime}}\geq c_{2} for some positive constants c1c_{1} and c2c_{2}. Indeed, letting X=(c1Lϰ)1εXX=(c_{1}L\varkappa)^{-1}\varepsilon\sqrt{X^{\prime}} and Y=(c1Lϰ)1εYY=(c_{1}L\varkappa)^{-1}\varepsilon\sqrt{Y^{\prime}}, where ε\varepsilon is a Rademacher random variable independent of XX^{\prime} and YY^{\prime}, yields a pair (X,Y)(X,Y) with the desired properties. We therefore focus on constructing such a pair XX^{\prime} and YY^{\prime}.

By [68, Lemma 7] combined with [60, Section 2.11.1], there exists a universal positive constant c1c_{1} such that we may construct two probability distributions μ+\mu_{+} and μ\mu_{-} on [8ϰ2,(c1ϰL)2][8\varkappa^{2},(c_{1}\varkappa L)^{2}] satisfying

xkdμ+(x)\displaystyle\int x^{k}\,\mathrm{d}\mu_{+}(x) =xkdμ(x)k=1,,L\displaystyle=\int x^{k}\,\mathrm{d}\mu_{-}(x)\quad k=1,\dots,L (118)
x1dμ+(x)\displaystyle\int x^{-1}\,\mathrm{d}\mu_{+}(x) =x1dμ(x)+116ϰ2.\displaystyle=\int x^{-1}\,\mathrm{d}\mu_{-}(x)+\frac{1}{16\varkappa^{2}}\,. (119)

We define a new pair of measures ν+\nu_{+} and ν\nu_{-} by

ν+(dx)\displaystyle\nu_{+}(\mathrm{d}x) =1Z(1x(x1)μ+(dx)+α0δ0(dx))\displaystyle=\frac{1}{Z}\left(\frac{1}{x(x-1)}\mu_{+}(\mathrm{d}x)+\alpha_{0}\delta_{0}(\mathrm{d}x)\right) (120)
ν(dx)\displaystyle\nu_{-}(\mathrm{d}x) =1Z(1x(x1)μ(dx)+α1δ1(dx)),\displaystyle=\frac{1}{Z}\left(\frac{1}{x(x-1)}\mu_{-}(\mathrm{d}x)+\alpha_{1}\delta_{1}(\mathrm{d}x)\right)\,, (121)

where we let

Z\displaystyle Z =1x1dμ+(x)1xdμ(x)\displaystyle=\int\frac{1}{x-1}\,\mathrm{d}\mu_{+}(x)-\int\frac{1}{x}\,\mathrm{d}\mu_{-}(x) (122)
α0\displaystyle\alpha_{0} =1xdμ+(x)1xdμ(x)=116ϰ2\displaystyle=\int\frac{1}{x}\,\mathrm{d}\mu_{+}(x)-\int\frac{1}{x}\,\mathrm{d}\mu_{-}(x)=\frac{1}{16\varkappa^{2}} (123)
α1\displaystyle\alpha_{1} =1x1dμ+(x)1x1dμ(x).\displaystyle=\int\frac{1}{x-1}\,\mathrm{d}\mu_{+}(x)-\int\frac{1}{x-1}\,\mathrm{d}\mu_{-}(x)\,. (124)

Since the support of μ+\mu_{+} and μ\mu_{-} lies in [8ϰ2,(c1ϰL)2][8\varkappa^{2},(c_{1}\varkappa L)^{2}], these quantities are well defined, and Lemma 18 shows that they are all positive. Finally, the definition of ZZ, α0\alpha_{0}, and α1\alpha_{1} guarantees that ν+\nu_{+} and ν\nu_{-} have total mass one. Therefore, ν+\nu_{+} and ν\nu_{-} are probability measures on [0,(c1ϰL)2][0,(c_{1}\varkappa L)^{2}].

We now claim that

xkν+(dx)=xkν(dx)k=1,,L.\int x^{k}\nu_{+}(\mathrm{d}x)=\int x^{k}\nu_{-}(\mathrm{d}x)\quad k=1,\dots,L\,. (125)

By definition of ν+\nu_{+} and ν\nu_{-}, this claim is equivalent to

xx1μ+(dx)=xx1μ(dx)+α1=0,,L1,\int\frac{x^{\ell}}{x-1}\mu_{+}(\mathrm{d}x)=\int\frac{x^{\ell}}{x-1}\mu_{-}(\mathrm{d}x)+\alpha_{1}\quad\ell=0,\dots,L-1\,, (126)

or, by using the definition of α1\alpha_{1},

x1x1μ+(dx)=x1x1μ(dx)=0,,L1.\int\frac{x^{\ell}-1}{x-1}\mu_{+}(\mathrm{d}x)=\int\frac{x^{\ell}-1}{x-1}\mu_{-}(\mathrm{d}x)\quad\ell=0,\dots,L-1\,. (127)

But this equality holds due to the fact that x1x1\frac{x^{\ell}-1}{x-1} is a degree-(1)(\ell-1) polynomial in xx, and the first LL moments of μ+\mu_{+} and μ\mu_{-} agree.

Finally, since x1x\geq 1 on the support of ν\nu_{-}, we have

xdν(x)1.\int\sqrt{x}\,\mathrm{d}\nu_{-}(x)\geq 1\,. (128)

We also have

xdν+(x)=1Z1x(x1)dμ+(x),\int\sqrt{x}\,\mathrm{d}\nu_{+}(x)=\frac{1}{Z}\int\frac{1}{\sqrt{x}(x-1)}\,\mathrm{d}\mu_{+}(x)\,, (129)

and by Lemma 19, this quantity is between 1/(8ϰ)1/(8\varkappa) and 1/ϰ1/\varkappa.

Letting Yν+Y^{\prime}\sim\nu_{+} and XνX^{\prime}\sim\nu_{-}, we have verified that X,Y[0,(c1Lϰ)2]X^{\prime},Y^{\prime}\in[0,(c_{1}L\varkappa)^{2}] almost surely and 𝔼(X)k=𝔼(Y)k\mathbb{E}(X^{\prime})^{k}=\mathbb{E}(Y^{\prime})^{k} for k=1,,Lk=1,\dots,L. Moreover, 𝔼X1ϰ𝔼Y1/8\mathbb{E}\sqrt{X^{\prime}}\geq 1\geq\varkappa\mathbb{E}\sqrt{Y^{\prime}}\geq 1/8, and this establishes the claim. ∎

C.2 Technical lemmata

Lemma 18.

If ϰ1\varkappa\geq 1, then the quantities Z,α0,α1Z,\alpha_{0},\alpha_{1} are all positive, and Z[ϰ2/16,ϰ2/8]Z\in[\varkappa^{-2}/16,\varkappa^{-2}/8].

Proof.

First, α0=116ϰ2\alpha_{0}=\frac{1}{16\varkappa^{2}} by definition, and Zα0Z\geq\alpha_{0} since (x1)1x1(x-1)^{-1}\geq x^{-1} on the support of μ+\mu_{+}. Moreover, for all x8ϰ28x\geq 8\varkappa^{2}\geq 8,

|1x11x|=1x(x1)ϰ2/50.\left|\frac{1}{x-1}-\frac{1}{x}\right|=\frac{1}{x(x-1)}\leq\varkappa^{-2}/50\,. (130)

Therefore α1α0ϰ2/50>0\alpha_{1}\geq\alpha_{0}-\varkappa^{-2}/50>0 and Zα0+ϰ2/50ϰ2/8Z\leq\alpha_{0}+\varkappa^{-2}/50\leq\varkappa^{-2}/8, as claimed. ∎

Lemma 19.

If ϰ1\varkappa\geq 1, then

1Z1x(x1)dμ+(x)[18ϰ,1ϰ]\frac{1}{Z}\int\frac{1}{\sqrt{x}(x-1)}\,\mathrm{d}\mu_{+}(x)\in\left[\frac{1}{8\varkappa},\frac{1}{\varkappa}\right] (131)
Proof.

For x8ϰ28x\geq 8\varkappa^{2}\geq 8,

1Z|1x(x1)1x3/2|=1Zx3/2(x1)16ϰ27(8ϰ2)3/2=172ϰ.\frac{1}{Z}\left|\frac{1}{\sqrt{x}(x-1)}-\frac{1}{x^{3/2}}\right|=\frac{1}{Zx^{3/2}(x-1)}\leq\frac{16\varkappa^{2}}{7(8\varkappa^{2})^{3/2}}=\frac{1}{7\sqrt{2}\varkappa}\,. (132)

Since

1Z1x3/2dμ+(x)1Z(8ϰ2)3/212ϰ,\frac{1}{Z}\int\frac{1}{x^{3/2}}\,\mathrm{d}\mu_{+}(x)\leq\frac{1}{Z}(8\varkappa^{2})^{-3/2}\leq\frac{1}{\sqrt{2}\varkappa}\,, (133)

we obtain that

1Z1x(x1)dμ+(x)172ϰ+12ϰϰ1.\frac{1}{Z}\int\frac{1}{\sqrt{x}(x-1)}\,\mathrm{d}\mu_{+}(x)\leq\frac{1}{7\sqrt{2}\varkappa}+\frac{1}{\sqrt{2}\varkappa}\leq\varkappa^{-1}\,. (134)

The lower bound follows from an application of Jensen’s inequality:

1Z1x3/2dμ+(x)1Z(x1dμ+(x))3/21Z(16ϰ2)3/218ϰ,\frac{1}{Z}\int\frac{1}{x^{3/2}}\,\mathrm{d}\mu_{+}(x)\geq\frac{1}{Z}\left(\int x^{-1}\,\mathrm{d}\mu_{+}(x)\right)^{3/2}\geq\frac{1}{Z}(16\varkappa^{2})^{-3/2}\geq\frac{1}{8\varkappa}\,, (135)

where we have used the fact that

x1dμ+(x)=x1dμ(x)+116ϰ2(16ϰ2)1.\int x^{-1}\,\mathrm{d}\mu_{+}(x)=\int x^{-1}\,\mathrm{d}\mu_{-}(x)+\frac{1}{16\varkappa^{2}}\geq(16\varkappa^{2})^{-1}\,. (136)

Appendix D Lower bounds for estimation of mixing measures in topic models

In this section, we substantiate Remark 2 by exhibiting a lower bound without logarithmic factors for the estimation of a mixing measure 𝜶\boldsymbol{\alpha} in Wasserstein distance. Combined with the upper bounds of [8], this implies that 𝜶^\widehat{\boldsymbol{\alpha}} is nearly minimax optimal for this problem. While a similar lower bound could be deduced directly from Theorem 3, the proof of this result is substantially simpler and slightly sharper. We prove the following.

Theorem 11.

Grant topic model assumptions and assume 1<τcN1<\tau\leq cN and pKc(nN)pK\leq c^{\prime}(nN) for some universal constants c,c>0c,c^{\prime}>0. Then, for any dd satisfying (26) with cd>0c_{d}>0, we have

inf𝜶^supαΘα(τ),AΘA𝔼[W(𝜶^,𝜶;d)]cdκ¯τ(τN+1κ¯τpKnN).\inf_{\widehat{\boldsymbol{\alpha}}}\sup_{\alpha\in\Theta_{\alpha}(\tau),A\in\Theta_{A}}\mathbb{E}[W(\widehat{\boldsymbol{\alpha}},\boldsymbol{\alpha};d)]~{}\gtrsim~{}c_{d}~{}\overline{\kappa}_{\tau}\left(\sqrt{\tau\over N}+{1\over\overline{\kappa}_{\tau}}\sqrt{pK\over nN}\right).

Here the infimum is taken over all estimators.

Our proof of Theorem 11 builds on a reduction scheme that reduces the problem of proving minimax lower bounds for W(𝜶^,𝜶;d)W(\widehat{\boldsymbol{\alpha}},\boldsymbol{\alpha};d) to that of proving lower bounds for the columns of A^A\widehat{A}-A in 1\ell_{1}-norm and for α^α1\|\widehat{\alpha}-\alpha\|_{1}, for which we can invoke (11) as well as the results in [8].

Proof of Theorem 11.

Fix 1<τK1<\tau\leq K. Define the space of 𝜶\boldsymbol{\alpha} as

Θ𝜶:={k=1KαkδAk:αΘα(τ),AΘA}.\Theta_{\boldsymbol{\alpha}}:=\left\{\sum_{k=1}^{K}\alpha_{k}\delta_{A_{k}}:\alpha\in\Theta_{\alpha}(\tau),A\in\Theta_{A}\right\}.

Similar to the proof of Theorem 3, for any subset Θ¯𝜶Θ𝜶\bar{\Theta}_{\boldsymbol{\alpha}}\subseteq\Theta_{\boldsymbol{\alpha}}, we have the following reduction as the first step of the proof,

inf𝜶^supαΔK,AΘA𝔼𝜶[W(𝜶^,𝜶;d)]\displaystyle\inf_{\widehat{\boldsymbol{\alpha}}}\sup_{\alpha\in\Delta_{K},A\in\Theta_{A}}\mathbb{E}_{\boldsymbol{\alpha}}\left[W(\widehat{\boldsymbol{\alpha}},\boldsymbol{\alpha};d)\right] 12inf𝜶^Θ¯𝜶sup𝜶Θ¯𝜶𝔼𝜶[W(𝜶^,𝜶;d)].\displaystyle\geq{1\over 2}\inf_{\widehat{\boldsymbol{\alpha}}\in\bar{\Theta}_{\boldsymbol{\alpha}}}\sup_{\boldsymbol{\alpha}\in\bar{\Theta}_{\boldsymbol{\alpha}}}\mathbb{E}_{\boldsymbol{\alpha}}\left[W(\widehat{\boldsymbol{\alpha}},\boldsymbol{\alpha};d)\right].

We proceed to prove the two terms in the lower bound separately. To prove the first term, let us fix one AΘAA\in\Theta_{A} as well as S={1,2,,τ}S=\{1,2,\ldots,\tau\} and choose

Θ¯𝜶={k=1KαkδAk:supp(α)=S}.\bar{\Theta}_{\boldsymbol{\alpha}}=\left\{\sum_{k=1}^{K}\alpha_{k}\delta_{A_{k}}:{\rm supp}(\alpha)=S\right\}.

It then follows that

inf𝜶^supαΘα(τ),AΘA𝔼[W(𝜶^,𝜶;d)]\displaystyle\inf_{\widehat{\boldsymbol{\alpha}}}\sup_{\alpha\in\Theta_{\alpha}(\tau),A\in\Theta_{A}}\mathbb{E}\left[W(\widehat{\boldsymbol{\alpha}},\boldsymbol{\alpha};d)\right] 12inf𝜶^Θ¯𝜶sup𝜶Θ¯𝜶𝔼[W(𝜶^,𝜶;d)]\displaystyle\geq{1\over 2}\inf_{\widehat{\boldsymbol{\alpha}}\in\bar{\Theta}_{\boldsymbol{\alpha}}}\sup_{\boldsymbol{\alpha}\in\bar{\Theta}_{\boldsymbol{\alpha}}}\mathbb{E}\left[W(\widehat{\boldsymbol{\alpha}},\boldsymbol{\alpha};d)\right]
14minkkSd(Ak,Ak)infα^Θα(τ)supp(α^)=SsupαΘα(τ)supp(α)=S𝔼[α^α1]\displaystyle\geq{1\over 4}\min_{k\neq k^{\prime}\in S}d(A_{k},A_{k^{\prime}})~{}\inf_{\begin{subarray}{c}\widehat{\alpha}\in\Theta_{\alpha}(\tau)\\ {\rm supp}(\widehat{\alpha})=S\end{subarray}}\sup_{\begin{subarray}{c}\alpha\in\Theta_{\alpha}(\tau)\\ {\rm supp}(\alpha)=S\end{subarray}}\mathbb{E}\left[\|\widehat{\alpha}-\alpha\|_{1}\right]
cd2κ¯τinfα^supαΘα(τ)supp(α)=S𝔼[α^α1].\displaystyle\geq{c_{d}\over 2}\overline{\kappa}_{\tau}~{}\inf_{\widehat{\alpha}}\sup_{\begin{subarray}{c}\alpha\in\Theta_{\alpha}(\tau)\\ {\rm supp}(\alpha)=S\end{subarray}}\mathbb{E}\left[\|\widehat{\alpha}-\alpha\|_{1}\right].

where in the last line we used the fact that

minkkSd(Ak,Ak)cdminkkSAkAk12cdκ¯τ\min_{k\neq k^{\prime}\in S}d(A_{k},A_{k^{\prime}})\geq c_{d}\min_{k\neq k^{\prime}\in S}\|A_{k}-A_{k^{\prime}}\|_{1}\geq 2c_{d}~{}\overline{\kappa}_{\tau} (137)

from (26) and the definition in (18). Since the lower bounds in [8, Theorem 7] also hold for the parameter space {αΘα(τ):supp(α)=S}\{\alpha\in\Theta_{\alpha}(\tau):{\rm supp}(\alpha)=S\}, the first term is proved.

To prove the second term, fix α=k=1τ𝐞k/τ\alpha=\sum_{k=1}^{\tau}\mathbf{e}_{k}/\tau and choose

Θ¯𝜶={1τk=1τδAk:AΘA}.\bar{\Theta}_{\boldsymbol{\alpha}}=\left\{{1\over\tau}\sum_{k=1}^{\tau}\delta_{A_{k}}:A\in\Theta_{A}\right\}.

We find that

inf𝜶^supαΔK,AΘA𝔼[W(𝜶^,𝜶;d)]\displaystyle\inf_{\widehat{\boldsymbol{\alpha}}}\sup_{\alpha\in\Delta_{K},A\in\Theta_{A}}\mathbb{E}\left[W(\widehat{\boldsymbol{\alpha}},\boldsymbol{\alpha};d)\right] 12inf𝜶^Θ¯𝜶sup𝜶Θ¯𝜶𝔼[W(𝜶^,𝜶;d)]\displaystyle\geq{1\over 2}\inf_{\widehat{\boldsymbol{\alpha}}\in\bar{\Theta}_{\boldsymbol{\alpha}}}\sup_{\boldsymbol{\alpha}\in\bar{\Theta}_{\boldsymbol{\alpha}}}\mathbb{E}\left[W(\widehat{\boldsymbol{\alpha}},\boldsymbol{\alpha};d)\right]
12infA^ΘAsupAΘA𝔼[1τk=1τd(A^k,Ak)]\displaystyle\geq{1\over 2}\inf_{\widehat{A}\in\Theta_{A}}\sup_{A\in\Theta_{A}}\mathbb{E}\left[{1\over\tau}\sum_{k=1}^{\tau}d(\widehat{A}_{k},A_{k})\right]
cd2infA^supAΘA𝔼[1τk=1τA^kAk1].\displaystyle\geq{c_{d}\over 2}\inf_{\widehat{A}}\sup_{A\in\Theta_{A}}\mathbb{E}\left[{1\over\tau}\sum_{k=1}^{\tau}\|\widehat{A}_{k}-A_{k}\|_{1}\right].

By the proof of [9, Theorem 6] together with m+K(pm)(1c)pKm+K(p-m)\geq(1-c)pK under mcpm\leq cp for c<1c<1, we obtain the second lower bound, hence completing the proof. ∎

Appendix E Limiting distribution of our W-distance estimator for two samples with different sample sizes

We generalize the results in Theorem 5 to the case where X(i)X^{(i)} and X(j)X^{(j)} are the empirical frequencies based on NiN_{i} and NjN_{j} i.i.d. samples of r(i)r^{(i)} and r(j)r^{(j)}, respectively. Write Nmin=min{Ni,Nj}N_{\min}=\min\{N_{i},N_{j}\} and Nmax=max{Ni,Nj}N_{\max}=\max\{N_{i},N_{j}\}. Let Zij𝒩K(0,Q(ij))Z_{ij}^{\prime}\sim\mathcal{N}_{K}(0,Q^{(ij)^{\prime}}) where

Q(ij)=limNminNjNi+NjΣ(i)+NiNi+NjΣ(j).Q^{(ij)^{\prime}}=\lim_{N_{\min}\to\infty}{N_{j}\over N_{i}+N_{j}}\Sigma^{(i)}+{N_{i}\over N_{i}+N_{j}}\Sigma^{(j)}.

Note that we still assume NNN\asymp N_{\ell} for all [n]\ell\in[n].

Theorem 12.

Grant conditions in Theorem 5. For any dd satisfying (25) with Cd=𝒪(1)C_{d}=\mathcal{O}(1), we have the following convergence in distribution, as n,Nminn,N_{\min}\to\infty,

NiNjNi+Nj(W~W(𝜶(i),𝜶(j);d))𝑑supfijfZij\sqrt{N_{i}N_{j}\over N_{i}+N_{j}}\left(\widetilde{W}-W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d)\right)\overset{d}{\to}\sup_{f\in\mathcal{F}^{\prime}_{ij}}f^{\top}Z_{ij}^{\prime}

with ij\mathcal{F}^{\prime}_{ij} defined in (33).

Proof.

From Theorem 9, by recognizing that

NiNjNi+Nj((α~(i)α~(j))(α(i)α(j)))𝑑Zij,as n,Nmin,\sqrt{N_{i}N_{j}\over N_{i}+N_{j}}\left((\widetilde{\alpha}^{(i)}-\widetilde{\alpha}^{(j)})-(\alpha^{(i)}-\alpha^{(j)})\right)\overset{d}{\to}Z_{ij}^{\prime},\quad\textrm{as }n,N_{\min}\to\infty,

the proof follows from the same arguments in the proof of Theorem 5. ∎

Appendix F Simulation results

In this section we provide numerical evidence supporting our theoretical findings. In Section F.4 we evaluate the coverage and length of fully data-driven confidence intervals for the Wasserstein distance between mixture distributions, abbreviated in what follows as the W-distance. The speed of convergence in distribution of our distance estimator W~\widetilde{W}, and its dependence on KK, pp and NN, is studied in Section F.5.

Before presenting the simulation results on the distance estimators, we begin by illustrating that the proposed estimator of the mixture weights, the debiased MLE (41), has the desired behavior. We verify its asymptotic normality in Appendix F.1. For completeness, we also compare its behavior to that of a weighted least squares estimator in Section F.2. The impact of using different weight estimators for inference on the W-distance is discussed in Section F.3.

We use the following data generating mechanism throughout. In the dense setting, the mixture weights are generated uniformly from ΔK\Delta_{K} (equivalently, from the symmetric Dirichlet distribution with parameter equal to 1). In the sparse setting, for a given cardinality τ\tau, we randomly select the support, draw the non-zero entries from Unif(0,1)(0,1), and normalize to unit sum. For the mixture components in AA, we first generate the entries of AA as i.i.d. samples from Unif(0,1)(0,1) and then normalize each column to unit sum. Samples of r=Aαr=A\alpha are generated according to the multinomial sampling scheme in (1). We take the distance dd in W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d) to be the total variation distance, that is, d(Ak,A)=AkA1/2d(A_{k},A_{\ell})=\|A_{k}-A_{\ell}\|_{1}/2 for each k,[K]k,\ell\in[K]. To simplify the presentation, we treat AA as known throughout the simulations.
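For concreteness, the data generating mechanism described above can be summarized by the following short Python sketch; this is a minimal illustration only (assuming numpy), and the function names as well as the particular values of KK, pp, NN and τ\tau are ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_A(p, K):
    # Mixture components: i.i.d. Unif(0,1) entries, each column normalized to unit sum.
    A = rng.uniform(size=(p, K))
    return A / A.sum(axis=0, keepdims=True)

def generate_alpha(K, tau=None):
    # Dense setting: uniform on the simplex (symmetric Dirichlet with parameter 1).
    # Sparse setting: random support of size tau, Unif(0,1) entries, normalized to unit sum.
    if tau is None:
        return rng.dirichlet(np.ones(K))
    alpha = np.zeros(K)
    support = rng.choice(K, size=tau, replace=False)
    alpha[support] = rng.uniform(size=tau)
    return alpha / alpha.sum()

def sample_X(A, alpha, N):
    # Empirical frequencies: counts / N with counts ~ Multinomial_p(N; r), r = A alpha.
    return rng.multinomial(N, A @ alpha) / N

def tv_cost_matrix(A):
    # d(A_k, A_l) = ||A_k - A_l||_1 / 2, the total variation distance between components.
    return 0.5 * np.abs(A[:, :, None] - A[:, None, :]).sum(axis=0)

A = generate_A(p=1000, K=5)
alpha = generate_alpha(K=5, tau=3)   # sparse setting; tau=None gives the dense setting
X = sample_X(A, alpha, N=500)
D = tv_cost_matrix(A)                # K x K cost matrix used in W(., .; d)
```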

F.1 Asymptotic normality of the proposed estimator of the mixture weights

We verify the asymptotic normality of our proposed debiased MLE (MLE_debias) given by (41) in both the sparse setting and the dense setting of the mixture weights, and compare its speed of convergence to a normal limit with that of its counterpart restricted to ΔK\Delta_{K}: the maximum likelihood estimator (MLE) in (3).

Since AA is known, it suffices to consider one mixture distribution r=Aαr=A\alpha, where αΔK\alpha\in\Delta_{K} is generated according to either the sparse setting with τ=|supp(α)|=3\tau=|{\rm supp}(\alpha)|=3 or the dense setting. We fix K=5K=5 and p=1000p=1000 and vary N{50,100,300,500}N\in\{50,100,300,500\}. For each setting, we obtain estimates of α\alpha for each estimator based on 500 repeatedly generated multinomial samples. In the sparse setting, the first panel of Figure 2 depicts the QQ-plots (quantiles of the estimates, after centering and standardizing, versus quantiles of the standard normal distribution) of both estimators for α3=0\alpha_{3}=0. It is clear that MLE_debias converges to a normal limit whereas the MLE does not, corroborating our discussion in Section 5.1. Moreover, in the dense setting, the second panel of Figure 2 shows that, for α30.18\alpha_{3}\approx 0.18, the MLE does converge to a normal limit eventually, but more slowly than MLE_debias. We conclude from both settings that MLE_debias converges to a normal limit faster than the MLE, regardless of the sparsity pattern of the mixture weights.

Refer to caption
Refer to caption
Figure 2: QQ-plots for the MLE and the MLE_debias in the sparse (upper) and dense (lower) settings

In Section F.2, we also compare the weighted least squares estimator (WLS) in Remark 5 with its counterpart restricted to ΔK\Delta_{K}, and find that WLS converges to normal limits faster than its restricted counterpart.

F.2 Asymptotic normality of the weighted least squares estimator of the mixture weights

We verify the speed of convergence to a normal limit of the weighted least squares estimator (WLS) in Remark 5, and compare it with that of its counterpart restricted to ΔK\Delta_{K} (WRLS), given by Eq. 156. We follow the same settings used in Section F.1. As shown in Fig. 3, the WLS converges to its normal limit much faster than the WRLS in both the dense and the sparse settings.

Refer to caption
Refer to caption
Figure 3: QQ-plots for the WRLS and the WLS in the sparse (upper) and dense (lower) settings

F.3 Impact of weight estimation on distance estimation

In this section, we construct confidence intervals for W(𝜶(i),𝜶(j))W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)}), based on two possible distance estimators: (i) W~\widetilde{W}, the plug-in distance estimator, with weights estimated by the debiased MLE (41) and (43) above; (ii) W~LS\widetilde{W}_{LS}, the plug-in distance estimator with weights estimated by the weighted least squares estimator given in Remark 5.

We evaluate the coverage and length of 95% confidence intervals (CIs) based on their respective limiting distributions, given in (48) and (154). We illustrate the performance when α(i)=α(j)\alpha^{(i)}=\alpha^{(j)}.

We set K=5K=5, p=500p=500 and vary N{25,50,100,300,500}N\in\{25,50,100,300,500\}. For each NN, we generate 10 mixtures r()=Aα()r^{(\ell)}=A\alpha^{(\ell)} for {1,,10}\ell\in\{1,\ldots,10\}, with each α()\alpha^{(\ell)} generated according to the dense setting. Then for each r()r^{(\ell)}, we repeatedly draw 200 pairs of X(i)X^{(i)} and X(j)X^{(j)}, sampled independently from Multinomialp(N;r())\text{Multinomial}_{p}(N;r^{(\ell)}), and compute the coverage and averaged length of the 95% CIs for each procedure. For a fair comparison, we take the two limiting distributions in (48) and (154) to be known, and approximate their quantiles by their empirical counterparts based on 10,000 realizations. After further averaging the coverages and averaged lengths of the 95% CIs over {1,,10}\ell\in\{1,\ldots,10\}, we see in Table 1 that although both procedures attain the nominal coverage, the CIs based on the debiased MLE are uniformly narrower than those based on the WLS, in line with our discussion in Remark 5. The same experiment is repeated for the sparse setting with τ=3\tau=3, from which we draw the same conclusion, except that the advantage of using the debiased MLE becomes more visible. Furthermore, the CIs based on the debiased MLE get slightly narrower in the sparse setting whereas those based on the WLS remain essentially unchanged.

Table 1: The averaged coverage and length of 95% CIs.

                          Coverage                        Averaged length of 95% CIs
                 N=25   N=50   N=100  N=300  N=500  |  N=25   N=50   N=100  N=300  N=500
Dense   W~       0.950  0.948  0.961  0.942  0.950  |  0.390  0.276  0.195  0.113  0.087
        W~_LS    0.933  0.948  0.955  0.945  0.948  |  0.403  0.285  0.202  0.116  0.090
Sparse  W~       0.961  0.952  0.964  0.944  0.952  |  0.372  0.263  0.186  0.107  0.083
        W~_LS    0.945  0.955  0.966  0.953  0.959  |  0.402  0.285  0.201  0.116  0.090

F.4 Confidence intervals of the W-distance

In this section, we evaluate the procedures introduced in Section 5.3 for constructing confidence intervals for W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d), both when α(i)=α(j)\alpha^{(i)}=\alpha^{(j)} and when α(i)α(j)\alpha^{(i)}\neq\alpha^{(j)}. Specifically, we evaluate both the coverage and the length of 95% confidence intervals (CIs) of the true W-distance based on: the mm-out-of-NN bootstrap (mm-NN-BS), the derivative-based bootstrap (Deriv-BS) and the plug-in procedure (Plug-in), detailed in Section 5.3.2. For the mm-out-of-NN bootstrap, we follow [57] by setting m=Nγm=N^{\gamma} for some γ(0,1)\gamma\in(0,1); guided by our simulation in Section F.7, we choose γ=0.3\gamma=0.3. For the two bootstrap-based procedures, we set the number of bootstrap repetitions to 1000. For a fair comparison, we also set M=1000M=1000 for the plug-in procedure (cf. Eq. 50).
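All three procedures share the same final step: given draws that approximate the limiting law of N(W~W)\sqrt{N}(\widetilde{W}-W), a confidence interval is obtained by inverting the empirical quantiles of those draws. The following Python sketch shows this generic quantile inversion; it is an illustration only, the helper name ci_from_limit_draws is ours, and the exact constructions used for each procedure are those of Section 5.3.2.

```python
import numpy as np

def ci_from_limit_draws(W_tilde, limit_draws, N, level=0.95):
    # limit_draws: samples believed to approximate the law of sqrt(N) * (W_tilde - W).
    # Generic two-sided quantile inversion of that approximation.
    lo_q, hi_q = np.quantile(limit_draws, [(1 - level) / 2, (1 + level) / 2])
    return W_tilde - hi_q / np.sqrt(N), W_tilde - lo_q / np.sqrt(N)
```

The three procedures then differ only in how limit_draws is produced: m(W~bW~)\sqrt{m}(\widetilde{W}_{b}-\widetilde{W}) for mm-NN-BS, the derivative-based bootstrap draws of Section F.6 for Deriv-BS, and MM draws from the estimated limiting law for Plug-in.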

Throughout this section, we set K=5K=5, p=500p=500 and vary N{100,500,1000,3000}N\in\{100,500,1000,3000\}. The simulations of Section F.1 above showed that our de-biased estimator of the mixture weights should be preferred even when the MLE is available (the dense setting). The construction of the data-driven CIs for the distance depends only on the mixture weight estimators and their asymptotic normality, and not on the nature of the true mixture weights. In the experiments presented below, the mixture weights are generated, without loss of generality, in the dense setting. We observed the same behavior in the sparse setting.

F.4.1 Confidence intervals of the W-distance for α(i)=α(j)\alpha^{(i)}=\alpha^{(j)}

We generate the mixture r=Aαr=A\alpha, and for each choice of NN, we generate 200200 pairs of X(i)X^{(i)} and X(j)X^{(j)}, sampled independently from Multinomialp(N;r)\text{Multinomial}_{p}(N;r). We record in Table 2 the coverage and averaged length of the 95% CIs of the W-distance over these 200 repetitions for each procedure. We can see that Plug-in and Deriv-BS have comparable performance, with the former being more computationally expensive. On the other hand, mm-NN-BS only achieves the nominal coverage when NN is very large. As NN increases, we observe narrower CIs for all methods.

Table 2: The averaged coverage and length of 95% CIs. The numbers in parentheses are the standard errors (×10^{-2}).

                     Coverage                     Averaged length of 95% CIs
  N         100    500    1000   3000  |  100           500           1000          3000
m-N-BS      0.865  0.905  0.935  0.945 |  0.153 (0.53)  0.079 (0.23)  0.059 (0.18)  0.036 (0.10)
Deriv-BS    0.94   0.93   0.945  0.95  |  0.194 (0.77)  0.090 (0.31)  0.064 (0.21)  0.037 (0.13)
Plug-in     0.96   0.93   0.94   0.945 |  0.195 (0.68)  0.089 (0.30)  0.063 (0.21)  0.037 (0.11)

F.4.2 Confidence intervals of the W-distance for α(i)α(j)\alpha^{(i)}\neq\alpha^{(j)}

In this section we further compare the three procedures of the previous section when α(i)α(j)\alpha^{(i)}\neq\alpha^{(j)}. Holding the mixture components AA fixed, we generate 1010 pairs of (α(i),α(j))(\alpha^{(i)},\alpha^{(j)}) and their corresponding (r(i),r(j))(r^{(i)},r^{(j)}). For each pair, as in the previous section, we generate 200 pairs of (X(i),X(j))(X^{(i)},X^{(j)}) based on (r(i),r(j))(r^{(i)},r^{(j)}) for each NN, and record the coverage and averaged length of the 95% CIs for each procedure. To save computation, we reduce both MM and the number of bootstrap repetitions to 500500. After further averaging the coverages and averaged lengths over the 10 pairs of (r(i),r(j))(r^{(i)},r^{(j)}), Table 3 reports the final coverage and averaged length of the 95% CIs for each procedure. As we can see in Table 3, Plug-in attains the nominal coverage in all settings. Deriv-BS performs similarly to Plug-in for N500N\geq 500 but tends to under-cover for N=100N=100. On the other hand, mm-NN-BS is outperformed by both Plug-in and Deriv-BS.

Table 3: The averaged coverage and length of 95% CIs. The numbers in parentheses are the standard errors (×10^{-2}).

                     Coverage                     Averaged length of 95% CIs
  N         100    500    1000   3000  |  100           500           1000          3000
m-N-BS      0.700  0.666  0.674  0.705 |  0.156 (0.67)  0.082 (0.33)  0.062 (0.25)  0.039 (0.16)
Deriv-BS    0.912  0.943  0.952  0.953 |  0.299 (2.71)  0.140 (0.95)  0.100 (0.60)  0.058 (0.31)
Plug-in     0.946  0.950  0.952  0.951 |  0.304 (2.43)  0.140 (0.90)  0.100 (0.57)  0.058 (0.30)

F.5 Speed of convergence in distribution of the proposed distance estimator

We focus on the null case, α(i)=α(j)\alpha^{(i)}=\alpha^{(j)}, to show how the speed of convergence of our estimator W~\widetilde{W} in Theorem 5 depends on NN, pp and KK. Specifically, we consider (i) K=10K=10, p=300p=300 and N{10,100,1000}N\in\{10,100,1000\}, (ii) N=100N=100, p=300p=300 and K{5,10,20}K\in\{5,10,20\} and (iii) N=100N=100, K=10K=10 and p{50,100,300,500}p\in\{50,100,300,500\}. For each combination of K,pK,p and NN, we generate 10 mixture distributions r()=Aα()r^{(\ell)}=A\alpha^{(\ell)} for {1,,10}\ell\in\{1,\ldots,10\}, with each α()\alpha^{(\ell)} generated according to the dense setting. Then for each r()r^{(\ell)}, we repeatedly draw 10,000 pairs of X(i)X^{(i)} and X(j)X^{(j)}, sampled independently from Multinomialp(N;r())\text{Multinomial}_{p}(N;r^{(\ell)}), and compute our estimator W~\widetilde{W}. To evaluate the speed of convergence of W~\widetilde{W}, we compute the Kolmogorov-Smirnov (KS) distance, as well as the p-value of the two-sample KS test, between these 10,00010,000 estimates of NW~\sqrt{N}~{}\widetilde{W} and its limiting distribution. Because the latter is not available in closed form, we mimic the theoretical c.d.f. by a c.d.f. based on 10,00010,000 draws from it. Finally, both the averaged KS distances and p-values, over {1,2,,10}\ell\in\{1,2,\ldots,10\}, are shown in Table 4. We can see that the speed of convergence of our distance estimator gets faster as NN increases or KK decreases, but seems unaffected by the ambient dimension pp (we note here that the sampling variations across the different settings of varying pp are larger than those of varying NN and KK). In addition, our distance estimator already seems to converge in distribution to the established limit as soon as NKN\geq K.

Table 4: The averaged KS distances and p-values of the two-sample KS test between our estimates and samples of the limiting distribution.

                                K=10, p=300             N=100, p=300           N=100, K=10
                           N=10   N=100  N=1000 |  K=5    K=10   K=20  |  p=50   p=100  p=300  p=500
KS distance (×10^{-2})     1.02   0.94   0.87   |  0.77   0.92   0.97  |  0.95   0.87   1.01   0.92
P-value of KS test         0.34   0.39   0.50   |  0.58   0.46   0.36  |  0.37   0.47   0.29   0.48
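The evaluation just described boils down to a two-sample Kolmogorov-Smirnov comparison; a minimal sketch (assuming scipy) is given below, where sqrtN_W_estimates and limit_draws are placeholders for the 10,000 Monte Carlo values of NW~\sqrt{N}~{}\widetilde{W} and the 10,000 draws that stand in for the limiting distribution.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_against_limit(sqrtN_W_estimates, limit_draws):
    # Two-sample KS distance and p-value between the Monte Carlo estimates
    # of sqrt(N) * W_tilde and draws from (an approximation of) the limiting law.
    stat, pvalue = ks_2samp(sqrtN_W_estimates, limit_draws)
    return stat, pvalue
```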

F.6 Details of mm-out-of-NN bootstrap and the derivative-based bootstrap

In this section we give details of the mm-out-of-NN bootstrap and the derivative-based bootstrap mentioned in Section 5.3.1. For simplicity, we focus on Ni=Nj=NN_{i}=N_{j}=N. Let BB be the total number of bootstrap repetitions and A^\widehat{A} be a given estimator of AA.

For a given integer m<Nm<N and any b[B]b\in[B], let Xb(i)X_{*b}^{(i)} and Xb(j)X_{*b}^{(j)} be the bootstrap samples obtained as

mXb()Multinomialp(m;X()),{i,j}.mX_{*b}^{(\ell)}\sim\text{Multinomial}_{p}(m;X^{(\ell)}),\quad\ell\in\{i,j\}.

Let W~b\widetilde{W}_{b} be the estimator defined in (28), computed by using A^\widehat{A} as well as α~b(i)\widetilde{\alpha}_{b}^{(i)} and α~b(j)\widetilde{\alpha}_{b}^{(j)}. The latter are obtained from (41) based on Xb(i)X_{*b}^{(i)} and Xb(j)X_{*b}^{(j)}. The mm-out-of-NN bootstrap estimates the distribution of N(W~W(𝜶(i),𝜶(j);d))\sqrt{N}(\widetilde{W}-W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d)) by the empirical distribution of m(W~bW~)\sqrt{m}(\widetilde{W}_{b}-\widetilde{W}) with b[B]b\in[B].
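A minimal Python sketch of the mm-out-of-NN bootstrap just described is given below; the helpers estimate_weights (standing in for the debiased MLE (41)) and W_hat_fn (standing in for the plug-in W-distance (28) computed from two estimated weight vectors) are placeholders, not part of the paper's code.

```python
import numpy as np

def m_out_of_N_bootstrap(X_i, X_j, N, m, B, estimate_weights, W_hat_fn, rng):
    # X_i, X_j: observed frequency vectors (multinomial counts divided by N).
    W_tilde = W_hat_fn(estimate_weights(X_i), estimate_weights(X_j))
    draws = np.empty(B)
    for b in range(B):
        Xb_i = rng.multinomial(m, X_i) / m   # m * X_{*b} ~ Multinomial_p(m; X)
        Xb_j = rng.multinomial(m, X_j) / m
        W_b = W_hat_fn(estimate_weights(Xb_i), estimate_weights(Xb_j))
        draws[b] = np.sqrt(m) * (W_b - W_tilde)
    # draws is used as an approximation of the law of sqrt(N) * (W_tilde - W).
    return W_tilde, draws
```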

The derivative-based bootstrap method, on the other hand, aims to estimate the limiting distribution in Theorem 5 directly. Specifically, let ^δ\widehat{\mathcal{F}}^{\prime}_{\delta} be some pre-defined estimator of ij\mathcal{F}_{ij}^{\prime}. The derivative-based bootstrap method estimates the distribution of

supfijfZij\sup_{f\in\mathcal{F}^{\prime}_{ij}}f^{\top}Z_{ij}

by the empirical distribution of

supf^δfBij(b),with b[B].\sup_{f\in\widehat{\mathcal{F}}^{\prime}_{\delta}}f^{\top}B_{ij}^{(b)},\quad\text{with }b\in[B].

Here

Bij(b)=N(α~b(i)α~b(j)α~(i)+α~(j))B_{ij}^{(b)}=\sqrt{N}\Bigl{(}\widetilde{\alpha}_{b}^{(i)}-\widetilde{\alpha}_{b}^{(j)}-\widetilde{\alpha}^{(i)}+\widetilde{\alpha}^{(j)}\Bigr{)}

uses the estimators α~b(i)\widetilde{\alpha}_{b}^{(i)} and α~b(j)\widetilde{\alpha}_{b}^{(j)} obtained from (41) based on the bootstrap samples

NXb()Multinomialp(N;X()),{i,j}.NX_{*b}^{(\ell)}\sim\text{Multinomial}_{p}(N;X^{(\ell)}),\quad\ell\in\{i,j\}.
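In the same spirit, the derivative-based bootstrap draws can be sketched as follows; here F_hat is a placeholder for a finite collection of vectors approximating the estimated set ^δ\widehat{\mathcal{F}}^{\prime}_{\delta}, and estimate_weights is the same placeholder as above.

```python
import numpy as np

def derivative_bootstrap_draws(X_i, X_j, N, B, estimate_weights, F_hat, rng):
    # F_hat: array of shape (L, K) whose rows discretize the estimated set of
    # dual-optimal directions; each draw is the sup over those rows of f^T B_ij^(b).
    alpha_i, alpha_j = estimate_weights(X_i), estimate_weights(X_j)
    draws = np.empty(B)
    for b in range(B):
        Xb_i = rng.multinomial(N, X_i) / N   # N * X_{*b} ~ Multinomial_p(N; X)
        Xb_j = rng.multinomial(N, X_j) / N
        B_ij = np.sqrt(N) * (estimate_weights(Xb_i) - estimate_weights(Xb_j)
                             - alpha_i + alpha_j)
        draws[b] = np.max(F_hat @ B_ij)
    return draws
```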

F.7 Selection of mm for mm-out-of-NN bootstrap

In our simulation, we follow [57] by setting m=Nγm=N^{\gamma} for some γ(0,1)\gamma\in(0,1). To evaluate its performance, consider the null case, r(i)=r(j)r^{(i)}=r^{(j)}, and choose K=10K=10, p=500p=500, N{100,500,1000,3000}N\in\{100,500,1000,3000\} and γ{0.1,0.3,0.5,0.7}\gamma\in\{0.1,0.3,0.5,0.7\}. The number of bootstrap repetitions is set to 500, and each setting is repeated 200 times. The KS distance is used to evaluate the closeness between the bootstrap samples and the limiting distribution. Again, we approximate the theoretical c.d.f. of the limiting distribution based on 20,000 draws from it. Figure 4 shows the KS distances for the various choices of NN and γ\gamma. As we can see, the mm-out-of-NN bootstrap improves as NN increases. Furthermore, the results suggest choosing γ=0.3\gamma=0.3, which gives the best performance across all choices of NN.

Refer to caption
Figure 4: The averaged KS distances of mm-out-of-NN bootstrap with m=Nγm=N^{\gamma}

Appendix G Review of the 1\ell_{1}-norm error bound of the MLE-type estimators of the mixture weights in [8]

In this section, we state in detail the theoretical guarantees from [8] on the MLE-type estimator α^(i)\widehat{\alpha}^{(i)} in (3) of α(i)\alpha^{(i)}. Since it suffices to focus on one i[n]i\in[n], we drop the superscripts in this section for simplicity. We assume minj[p]Aj>0\min_{j\in[p]}\|A_{j\cdot}\|_{\infty}>0. Otherwise, a pre-screening procedure could be used to reduce the dimension pp such that this condition is met.

Recall from (3) that α^\widehat{\alpha} depends on a given estimator A^\widehat{A} of AA. Let A^\widehat{A} be a generic estimator of AA satisfying

A^kΔp,for k[K];\displaystyle\widehat{A}_{k}\in\Delta_{p},\quad\text{for }\ k\in[K]; (138)
maxk[K]A^kAk1ϵ1,;\displaystyle\max_{k\in[K]}\|\widehat{A}_{k}-A_{k}\|_{1}\leq\epsilon_{1,\infty}; (139)
maxj[p]A^jAjAjϵ,;\displaystyle\max_{j\in[p]}{\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{\infty}\over\|A_{j\cdot}\|_{\infty}}\leq\epsilon_{\infty,\infty}; (140)
maxj[p]A^jAj1Aj1ϵ,1.\displaystyle\max_{j\in[p]}{\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{1}\over\|A_{j\cdot}\|_{1}}\leq\epsilon_{\infty,1}. (141)

for some deterministic sequences ϵ1,\epsilon_{1,\infty}, ϵ,\epsilon_{\infty,\infty} and ϵ,1\epsilon_{\infty,1}. (Displays (139) – (141) only need to hold up to a label permutation among the columns of AA.)

To state the results in [8], we need some additional notation. For any r=Aαr=A\alpha with αΘα(τ)\alpha\in\Theta_{\alpha}(\tau), define

J¯={j[p]:rj>5log(p)N}.\underline{J}=\left\{j\in[p]:r_{j}>{5\log(p)\over N}\right\}.

The set J¯\underline{J}, in conjunction with J¯\overline{J} in (16), is chosen such that

{J¯supp(X)J¯}12p1,\mathbb{P}\left\{\underline{J}\subseteq{\rm supp}(X)\subseteq\overline{J}\right\}\geq 1-2p^{-1},

see, for instance, [8, Lemma I.1]. Here XX is the empirical estimator of rr. Further define

αmin:=minksupp(α)αk,ξmaxjJ¯maxksupp(α)Ajkksupp(α)Ajk.\alpha_{\min}:=\min_{k\in{\rm supp}(\alpha)}\alpha_{k},\qquad\xi\coloneqq\max_{j\in\overline{J}}{\max_{k\notin{\rm supp}(\alpha)}A_{jk}\over\sum_{k\in{\rm supp}(\alpha)}A_{jk}}. (142)

Recall the definitions (18) and (19). Write

κ¯τ=minαΘα(τ)κ(AJ¯,α0).\displaystyle\underline{\kappa}_{\tau}=\min_{\alpha\in\Theta_{\alpha}(\tau)}\kappa(A_{\underline{J}},\|\alpha\|_{0}). (143)

We have κ¯τκ¯τ\underline{\kappa}_{\tau}\leq\overline{\kappa}_{\tau}. Finally, let

M1\displaystyle M_{1} :=log(p)κ¯τ4(1ξ)3αmin3,M2:=log(p)κ¯K2(1ξ)αmin(1+ξKα0)αmin.\displaystyle:={\log(p)\over\underline{\kappa}_{\tau}^{4}}{(1\vee\xi)^{3}\over\alpha_{\min}^{3}},\qquad M_{2}:={\log(p)\over\overline{\kappa}_{K}^{2}}{(1\vee\xi)\over\alpha_{\min}}{(1+\xi\sqrt{K-\|\alpha\|_{0}})\over\alpha_{\min}}.

The following results are proved in [8, Theorems 9 & 10].

Theorem 13.

Let αΘα(τ)\alpha\in\Theta_{\alpha}(\tau). Assume there exist sufficiently large constants C,C>0C,C^{\prime}>0 such that

NCmax{M1,M2}N\geq C\max\{M_{1},M_{2}\} (144)

and |J¯J¯|C|\overline{J}\setminus\underline{J}|\leq C^{\prime}. Let A^\widehat{A} be any estimator such that (138) – (140) hold with

2(1ξ)αminϵ,1,2(1ξ)αminϵ1,κ¯τ2.{2(1\vee\xi)\over\alpha_{\min}}\epsilon_{\infty,\infty}\leq 1,\qquad{2(1\vee\xi)\over\alpha_{\min}}\epsilon_{1,\infty}\leq\underline{\kappa}^{2}_{\tau}. (145)

Then we have, with probability 18p11-8p^{-1},

α^α11κ¯τKlog(p)N+1κ¯τ2ϵ1,.\|\widehat{\alpha}-\alpha\|_{1}~{}\lesssim~{}{1\over\overline{\kappa}_{\tau}}\sqrt{K\log(p)\over N}+{1\over\overline{\kappa}_{\tau}^{2}}\epsilon_{1,\infty}.

Furthermore, if additionally

ξlog(p)αminN(1+1κ¯τξταmin)\displaystyle\sqrt{\xi\log(p)\over\alpha_{\min}N}\left(1+{1\over\overline{\kappa}_{\tau}}\sqrt{\xi\tau\over\alpha_{\min}}\right) +log(p)N+(1+ξαminκ¯τ2)ϵ1,cminksupp(α)jJ¯cAjk\displaystyle+{\log(p)\over N}+\left(1+{\xi\over\alpha_{\min}\overline{\kappa}_{\tau}^{2}}\right)\epsilon_{1,\infty}\leq c\min_{k\notin{\rm supp}(\alpha)}\sum_{j\in\overline{J}^{c}}A_{jk}

holds for some sufficiently small constant c>0c>0, then with probability at least 1𝒪(p1)1-\mathcal{O}(p^{-1})

α^α11κ¯ττlog(p)N+1κ¯τ2ϵ1,.\|\widehat{\alpha}-\alpha\|_{1}\lesssim{1\over\overline{\kappa}_{\tau}}\sqrt{\tau\log(p)\over N}+{1\over\overline{\kappa}_{\tau}^{2}}\epsilon_{1,\infty}.

Specializing to the estimator A^\widehat{A} in [9], Theorem 13 can be invoked for ϵ1,\epsilon_{1,\infty} and ϵ,\epsilon_{\infty,\infty} given by (13) and (14). For future reference, when KK is fixed, it is easy to see from (14) that the same estimator A^\widehat{A} also satisfies (141) for

ϵ,1=plog(L)nN\epsilon_{\infty,1}=\sqrt{p\log(L)\over nN} (146)

with probability tending to one.

Appendix H Results on the W-distance estimator based on the weighted least squares estimator of the mixture weights

In this section we study the W-distance estimator, W~LS\widetilde{W}_{LS} in (28), of W(𝜶(i),𝜶(j);d)W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d), by using the weighted least squares estimators

α~LS()=A^+X(),for{i,j}\widetilde{\alpha}_{LS}^{(\ell)}=\widehat{A}^{+}X^{(\ell)},\qquad\text{for}\quad\ell\in\{i,j\} (147)

where

A^+:=(A^D^1A^)1A^D^1andD^:=diag(A^11,,A^p1),\widehat{A}^{+}:=(\widehat{A}^{\top}\widehat{D}^{-1}\widehat{A})^{-1}\widehat{A}^{\top}\widehat{D}^{-1}\quad\ \text{and}\quad\widehat{D}:=\mathrm{diag}(\|\widehat{A}_{1\cdot}\|_{1},\ldots,\|\widehat{A}_{p\cdot}\|_{1}), (148)

with A^\widehat{A} being a generic estimator of AA. Due to the weighting matrix D^\widehat{D}, A^D^1A^\widehat{A}^{\top}\widehat{D}^{-1}\widehat{A} is a doubly stochastic matrix, and its largest eigenvalue is 11 (see also Lemma 20, below).
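For reference, the estimator defined by (147)–(148) has a simple closed form; a minimal numpy sketch is given below, assuming (as in the topic model setting) that the entries of A^\widehat{A} are nonnegative, so that the 1\ell_{1} row norms equal the row sums.

```python
import numpy as np

def wls_weights(A_hat, X):
    # Weighted least squares estimator (147)-(148):
    #   alpha_LS = (A_hat^T D_hat^{-1} A_hat)^{-1} A_hat^T D_hat^{-1} X,
    # with D_hat the diagonal matrix of the l1 norms of the rows of A_hat.
    d = A_hat.sum(axis=1)                    # row sums = ||A_hat_{j.}||_1 for nonnegative A_hat
    M_hat = A_hat.T @ (A_hat / d[:, None])   # A_hat^T D_hat^{-1} A_hat (doubly stochastic)
    return np.linalg.solve(M_hat, A_hat.T @ (X / d))
```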

Fix arbitrary i[n]i\in[n]. Let

VLS(i)=AD1(diag(r(i))r(i)r(i))D1AV_{LS}^{(i)}=A^{\top}D^{-1}\left(\mathrm{diag}(r^{(i)})-r^{(i)}r^{(i)\top}\right)D^{-1}A (149)

and

ΣLS(i)=M1VLS(i)M1=A+diag(r(i))A+α(i)α(i)\Sigma_{LS}^{(i)}=M^{-1}V_{LS}^{(i)}M^{-1}=A^{+}\mathrm{diag}(r^{(i)})A^{+\top}-\alpha^{(i)}\alpha^{(i)\top} (150)

with

M=AD1A.M=A^{\top}D^{-1}A.

The following theorem establishes the asymptotic normality of α~LS(i)\widetilde{\alpha}^{(i)}_{LS}.

Theorem 14.

Under Assumption 1 and the multinomial sampling condition (1), let A^\widehat{A} be any estimator such that (138), (139) and (141) hold. Assume λK1(M)=𝒪(N)\lambda_{K}^{-1}(M)=\mathcal{O}(\sqrt{N}), λK1(VLS(i))=o(N)\lambda_{K}^{-1}(V_{LS}^{(i)})=o(N) and

(ϵ1,+ϵ,1)NλK(VLS(i))=o(1),as n,N.(\epsilon_{1,\infty}+\epsilon_{\infty,1})\sqrt{N\over\lambda_{K}(V_{LS}^{(i)})}=o(1),\quad\textrm{as $n,N\to\infty$.} (151)

Then, we have the following convergence in distribution as n,Nn,N\to\infty,

N(ΣLS(i))1/2(α~LS(i)α(i))𝑑𝒩K(0,\bmIK).\sqrt{N}\bigl{(}\Sigma_{LS}^{(i)}\bigr{)}^{-1/2}\left(\widetilde{\alpha}_{LS}^{(i)}-\alpha^{(i)}\right)\overset{d}{\to}\mathcal{N}_{K}(0,\bm{I}_{K}).
Proof.

The proof can be found in Section H.1. ∎

Remark 10.

We begin by observing that if pp is finite, independent of NN, and AA is known, the conclusion of the theorem follows trivially from the fact that

N(X(i)r(i))𝑑𝒩p(0,diag(r(i))r(i)r(i))\sqrt{N}(X^{(i)}-r^{(i)})\overset{d}{\to}\mathcal{N}_{p}(0,\mathrm{diag}(r^{(i)})-r^{(i)}r^{(i)\top})

and α~(i)=A+X(i)\widetilde{\alpha}^{(i)}=A^{+}X^{(i)}, under no conditions beyond the multinomial sampling scheme assumption. Our theorem generalizes this to the practical situation when AA is estimated, from a data set that is potentially dependent on that used to construct α~(i)\widetilde{\alpha}^{(i)}, and when p=p(N)p=p(N).

Condition λK1(VLS(i))=o(N)\lambda_{K}^{-1}(V_{LS}^{(i)})=o(N) is a mild regularity condition that is needed even if AA is known, since both p=p(N)p=p(N) and the parameters in AA and α(i)\alpha^{(i)} are allowed to grow with NN.

If VLS(i)V_{LS}^{(i)} is rank deficient, so is the asymptotic covariance matrix ΣLS(i)\Sigma_{LS}^{(i)}. This happens, for instance, when the mixture components AkA_{k} have disjoint supports and the weight vector α(i)\alpha^{(i)} is sparse. Nevertheless, a straightforward modification of our analysis leads to the same conclusion, except that the inverse of ΣLS(i)\Sigma_{LS}^{(i)} gets replaced by a generalized inverse and the limiting covariance matrix becomes \bmIs\bm{I}_{s}, with s=rank(ΣLS(i))s=\mathrm{rank}(\Sigma_{LS}^{(i)}), padded with zeroes elsewhere.

Condition (151), on the other hand, ensures that the error of estimating the mixture components becomes negligible. For the choice of A^\widehat{A} in [9], in view of (13) and (146), (151) requires

plog(L)nλK(VLS(i))=o(1),as n,N\sqrt{p\log(L)\over n\lambda_{K}(V_{LS}^{(i)})}=o(1),\quad\text{as }n,N\to\infty

with L=npNL=n\vee p\vee N.

The asymptotic normality in Theorem 14, in conjunction with Proposition 1, readily yields the limiting distribution of W~LS\widetilde{W}_{LS} in (28) based on (147), as stated in the following theorem. For an arbitrary pair (i,j)(i,j), let

Gij𝒩K(0,QLS(ij))G_{ij}\sim\mathcal{N}_{K}(0,Q_{LS}^{(ij)}) (152)

where we assume that the following limit exists

QLS(ij)limNΣLS(i)+ΣLS(j)K×K.Q_{LS}^{(ij)}\coloneqq\lim_{N\to\infty}~{}\Sigma_{LS}^{(i)}+\Sigma_{LS}^{(j)}~{}\in\mathbb{R}^{K\times K}.
Theorem 15.

Under Assumption 1 and the multinomial sampling assumption (1), let A^\widehat{A} be any estimator such that (24), (138), (139) and (141) hold. Assume λK1(M)=𝒪(N)\lambda_{K}^{-1}(M)=\mathcal{O}(\sqrt{N}), ϵAN=o(1)\epsilon_{A}\sqrt{N}=o(1), λK1(VLS(i))λK1(VLS(j))=o(N)\lambda_{K}^{-1}(V_{LS}^{(i)})\vee\lambda_{K}^{-1}(V_{LS}^{(j)})=o(N) and

(ϵ1,+ϵ,1)NλK(VLS(i))λK(VLS(j))=o(1),as n,N(\epsilon_{1,\infty}+\epsilon_{\infty,1})\sqrt{N\over\lambda_{K}(V_{LS}^{(i)})\wedge\lambda_{K}(V_{LS}^{(j)})}=o(1),\quad\textrm{as $n,N\to\infty$. } (153)

Then, as n,Nn,N\to\infty, we have the following convergence in distribution,

N(W~LSW(𝜶(i),𝜶(j);d))𝑑supfijfGij\sqrt{N}\left(\widetilde{W}_{LS}-W(\boldsymbol{\alpha}^{(i)},\boldsymbol{\alpha}^{(j)};d)\right)\overset{d}{\to}\sup_{f\in\mathcal{F}^{\prime}_{ij}}f^{\top}G_{ij} (154)

where ij\mathcal{F}^{\prime}_{ij} is defined in (33).

Proof.

Since Theorem 14 ensures that

N[(α~LS(i)α~LS(j))(α(i)α(j))]𝑑Gij,as n,N,\sqrt{N}\left[(\widetilde{\alpha}_{LS}^{(i)}-\widetilde{\alpha}_{LS}^{(j)})-(\alpha^{(i)}-\alpha^{(j)})\right]\overset{d}{\to}G_{ij},\quad\textrm{as }n,N\to\infty,

the result follows by invoking Proposition 1. ∎

Consequently, the following corollary provides one concrete instance for which conclusions in Theorem 15 hold for W~LS\widetilde{W}_{LS} based on A^\widehat{A} in [9].

Corollary 3.

Grant the topic model assumptions and Assumption 3, as well as the conditions listed in [8, Appendix K.1]. For any r(i),r(j)𝒮r^{(i)},r^{(j)}\in\mathcal{S} with i,j[n]i,j\in[n], suppose plog(L)=o(n)p\log(L)=o(n) and

max{λK1(M),λK1(VLS(i)),λK1(VLS(j))}=𝒪(1).\max\left\{\lambda_{K}^{-1}(M),\lambda_{K}^{-1}(V_{LS}^{(i)}),\lambda_{K}^{-1}(V_{LS}^{(j)})\right\}=\mathcal{O}(1). (155)

Then, for any dd satisfying (25) with Cd=𝒪(1)C_{d}=\mathcal{O}(1), Eq. 154 holds as n,Nn,N\to\infty.

Remark 11 (The weighted restricted least squares estimator of the mixture weights).

We argue here that the WLS estimator α~LS()\widetilde{\alpha}^{(\ell)}_{LS} in (147) can be viewed as the de-biased version of the following restricted weighted least squares estimator

α^LS():=argminαΔK12D^1/2(X()A^α)22.\widehat{\alpha}_{LS}^{(\ell)}:=\operatorname*{argmin}_{\alpha\in\Delta_{K}}{1\over 2}\|\widehat{D}^{-1/2}(X^{(\ell)}-\widehat{A}\alpha)\|_{2}^{2}. (156)

Suppose that A^=A\widehat{A}=A and let us drop the superscripts for simplicity. The K.K.T. conditions of (156) imply

N(α^LSα)=NA+(Xr)+NM1(λμ𝟙K).\sqrt{N}\left(\widehat{\alpha}_{LS}-\alpha\right)=\sqrt{N}A^{+}(X-r)+\sqrt{N}M^{-1}\left(\lambda-\mu\mathds{1}_{K}\right). (157)

Here λ+K\lambda\in\mathbb{R}_{+}^{K} and μ\mu\in\mathbb{R} are the Lagrange multipliers corresponding to the restricted optimization in (156). We see that the first term on the right-hand side of Eq. 157 is asymptotically normal, while the second one does not always converge to a normal limit. Thus, it is natural to consider a modified estimator

α^LSM1(λμ𝟙K)\widehat{\alpha}_{LS}-M^{-1}\left(\lambda-\mu\mathds{1}_{K}\right)

by removing the second term in (157). Interestingly, this new estimator is nothing but α~LS\widetilde{\alpha}_{LS} in (147). This suggests that N(α~LSα)\sqrt{N}(\widetilde{\alpha}_{LS}-\alpha) converges to a normal limit faster than N(α^LSα)\sqrt{N}(\widehat{\alpha}_{LS}-\alpha), as confirmed in Figure 3 of Section F.2.
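The contrast discussed in this remark can be reproduced numerically. The sketch below computes the restricted estimator (156) with a generic simplex-constrained quadratic solver (our use of scipy's SLSQP is for illustration only and is not the paper's implementation); it can be compared with the closed-form wls_weights sketched earlier in this appendix.

```python
import numpy as np
from scipy.optimize import minimize

def wrls_weights(A_hat, X):
    # Restricted weighted least squares (156): minimize over the simplex Delta_K
    #   0.5 * || D_hat^{-1/2} (X - A_hat alpha) ||_2^2 .
    d = A_hat.sum(axis=1)                    # row sums of A_hat
    W = A_hat / np.sqrt(d)[:, None]          # D_hat^{-1/2} A_hat
    y = X / np.sqrt(d)                       # D_hat^{-1/2} X
    K = A_hat.shape[1]
    objective = lambda a: 0.5 * np.sum((y - W @ a) ** 2)
    simplex = ({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},)
    res = minimize(objective, np.full(K, 1.0 / K), method='SLSQP',
                   bounds=[(0.0, None)] * K, constraints=simplex)
    return res.x
```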

H.1 Proof of Theorem 14

Proof.

For simplicity, let us drop the superscripts. By definition, we have

α~LSα\displaystyle\widetilde{\alpha}_{LS}-\alpha =A+(Xr)+(A^+A+)(Xr)+(A^+A+)r\displaystyle=A^{+}(X-r)+(\widehat{A}^{+}-A^{+})(X-r)+(\widehat{A}^{+}-A^{+})~{}r

where we let

A+=M1AD1,A^+=M^1A^D^1,A^{+}=M^{-1}A^{\top}D^{-1},\qquad\widehat{A}^{+}=\widehat{M}^{-1}\widehat{A}^{\top}\widehat{D}^{-1}, (158)

with M=AD1AM=A^{\top}D^{-1}A and M^=A^D^1A^\widehat{M}=\widehat{A}^{\top}\widehat{D}^{-1}\widehat{A}. Recall that

ΣLS=M1VLSM1.\Sigma_{LS}=M^{-1}V_{LS}M^{-1}. (159)

It suffices to show that, for any uK{0}u\in\mathbb{R}^{K}\setminus\{0\},

NuA+(Xr)uΣLSu𝑑𝒩(0,1),\displaystyle\sqrt{N}{u^{\top}A^{+}(X-r)\over\sqrt{u^{\top}\Sigma_{LS}u}}\overset{d}{\to}\mathcal{N}(0,1), (160)
Nu(A^+A+)(Xr)uΣLSu=o(1),\displaystyle\sqrt{N}{u^{\top}(\widehat{A}^{+}-A^{+})(X-r)\over\sqrt{u^{\top}\Sigma_{LS}u}}=o_{\mathbb{P}}(1), (161)
Nu(A^+A+)ruΣLSu=o(1).\displaystyle\sqrt{N}{u^{\top}(\widehat{A}^{+}-A^{+})~{}r\over\sqrt{u^{\top}\Sigma_{LS}u}}=o_{\mathbb{P}}(1). (162)

Define the 2\ell_{2}\to\ell_{\infty} condition number of VLSV_{LS} as

κ(VLS)=infu0uVLSuu2.\kappa(V_{LS})=\inf_{u\neq 0}{u^{\top}V_{LS}u\over\|u\|_{\infty}^{2}}. (163)

Notice that

κ(VLS)KλK(VLS)κ(VLS).{\kappa(V_{LS})\over K}\leq\lambda_{K}(V_{LS})\leq\kappa(V_{LS}). (164)

Proof of Eq. 160: Note that the left-hand side of (160) can be written as

1Nt=1NuA+(Ztr)uΣLSu1Nt=1NYt.{1\over\sqrt{N}}\sum_{t=1}^{N}{u^{\top}A^{+}(Z_{t}-r)\over\sqrt{u^{\top}\Sigma_{LS}u}}\coloneqq{1\over\sqrt{N}}\sum_{t=1}^{N}Y_{t}.

with ZtZ_{t}, for t[N]t\in[N], being i.i.d. Multinomialp(1;r)\textrm{Multinomial}_{p}(1;r). It is easy to see that 𝔼[Yt]=0\mathbb{E}[Y_{t}]=0 and 𝔼[Yt2]=1\mathbb{E}[Y_{t}^{2}]=1. To apply the Lyapunov central limit theorem, it suffices to verify

limN𝔼[|Yt|3]N=0.\lim_{N\to\infty}{\mathbb{E}[|Y_{t}|^{3}]\over\sqrt{N}}=0.

This follows by noting that

𝔼[|Yt|3]\displaystyle\mathbb{E}[|Y_{t}|^{3}] 2uA+uΣLSu\displaystyle\leq 2{\|u^{\top}A^{+}\|_{\infty}\over\sqrt{u^{\top}\Sigma_{LS}u}} by 𝔼[Yt2]=1,Ztr12\displaystyle\textrm{by }\mathbb{E}[Y_{t}^{2}]=1,\|Z_{t}-r\|_{1}\leq 2
2uAD1uVLSu\displaystyle\leq{2\|u^{\top}A^{\top}D^{-1}\|_{\infty}\over\sqrt{u^{\top}V_{LS}u}} by Eq. 159
2κ(VLS)\displaystyle\leq{2\over\sqrt{\kappa(V_{LS})}} by Eq. 163

and by using λK1(VLS)=o(N)\lambda_{K}^{-1}(V_{LS})=o(N) and Eq. 164.

Proof of Eq. 161: Fix any uK{0}u\in\mathbb{R}^{K}\setminus\{0\}. Note that

u(A^+A+)(Xr)uΣLSu\displaystyle{u^{\top}(\widehat{A}^{+}-A^{+})(X-r)\over\sqrt{u^{\top}\Sigma_{LS}u}} =uM(A^+A+)(Xr)uVLSu\displaystyle={u^{\top}M(\widehat{A}^{+}-A^{+})(X-r)\over\sqrt{u^{\top}V_{LS}u}}
uM(A^+A+)(Xr)1uκ(VLS)\displaystyle\leq{\|u\|_{\infty}\|M(\widehat{A}^{+}-A^{+})(X-r)\|_{1}\over\|u\|_{\infty}\sqrt{\kappa(V_{LS})}} by Eq. 163.

Further notice that M(A^+A+)(Xr)1\|M(\widehat{A}^{+}-A^{+})(X-r)\|_{1} is smaller than

MM^1,(A^+A+)(Xr)1+M^(A^+A+)(Xr)1.\|M-\widehat{M}\|_{1,\infty}\|(\widehat{A}^{+}-A^{+})(X-r)\|_{1}+\|\widehat{M}(\widehat{A}^{+}-A^{+})(X-r)\|_{1}.

For the first term, since

A^+A+\displaystyle\widehat{A}^{+}-A^{+} =M^1A^(D^1D1)+M^1(A^A)D1\displaystyle=\widehat{M}^{-1}\widehat{A}^{\top}(\widehat{D}^{-1}-D^{-1})+\widehat{M}^{-1}(\widehat{A}-A)^{\top}D^{-1}
+(M^1M1)AD1\displaystyle\quad+(\widehat{M}^{-1}-M^{-1})A^{\top}D^{-1}
=M^1A^D^1(DD^)D1+M^1(A^A)D1\displaystyle=\widehat{M}^{-1}\widehat{A}^{\top}\widehat{D}^{-1}(D-\widehat{D})D^{-1}+\widehat{M}^{-1}(\widehat{A}-A)^{\top}D^{-1}
+M^1(MM^)M1AD1,\displaystyle\quad+\widehat{M}^{-1}(M-\widehat{M})M^{-1}A^{\top}D^{-1}, (165)

we obtain

(A^+A+)(Xr)1\displaystyle\|(\widehat{A}^{+}-A^{+})(X-r)\|_{1}
M^11,A^D^11,(D^D)D11,Xr1\displaystyle\leq\|\widehat{M}^{-1}\|_{1,\infty}\|\widehat{A}^{\top}\widehat{D}^{-1}\|_{1,\infty}\|(\widehat{D}-D)D^{-1}\|_{1,\infty}\|X-r\|_{1}
+M^11,(A^A)D11,Xr1\displaystyle\quad+\|\widehat{M}^{-1}\|_{1,\infty}\|(\widehat{A}-A)^{\top}D^{-1}\|_{1,\infty}\|X-r\|_{1}
+M^1(MM^)1,A+(Xr)1\displaystyle\quad+\|\widehat{M}^{-1}(M-\widehat{M})\|_{1,\infty}\|A^{+}(X-r)\|_{1}
2M^11,ϵ,1Xr1+M^1(MM^)1,A+(Xr)1\displaystyle\leq 2\|\widehat{M}^{-1}\|_{1,\infty}\epsilon_{\infty,1}\|X-r\|_{1}+\|\widehat{M}^{-1}(M-\widehat{M})\|_{1,\infty}\|A^{+}(X-r)\|_{1}
4M11,ϵ,1Xr1+2M1,1(2ϵ,1+ϵ1,)A+(Xr)1\displaystyle\leq 4\|M^{-1}\|_{1,\infty}\epsilon_{\infty,1}\|X-r\|_{1}+2\|M^{-1}\|_{\infty,1}(2\epsilon_{\infty,1}+\epsilon_{1,\infty})\|A^{+}(X-r)\|_{1} (166)

where in the last line we invoke Lemma 20, after observing that condition (171) holds under (151). Display (166), in conjunction with (174), yields

MM^1,(A^+A+)(Xr)1\displaystyle\|M-\widehat{M}\|_{1,\infty}\|(\widehat{A}^{+}-A^{+})(X-r)\|_{1}
2M11,(2ϵ,1+ϵ1,)[2ϵ,1Xr1+(2ϵ,1+ϵ1,)A+(Xr)1].\displaystyle\leq 2\|M^{-1}\|_{1,\infty}(2\epsilon_{\infty,1}+\epsilon_{1,\infty})\Bigl{[}2\epsilon_{\infty,1}\|X-r\|_{1}+(2\epsilon_{\infty,1}+\epsilon_{1,\infty})\|A^{+}(X-r)\|_{1}\Bigr{]}. (167)

On the other hand, from the decomposition (165) and the arguments following it, it is straightforward to show

M^(A^+A+)(Xr)1\displaystyle\|\widehat{M}(\widehat{A}^{+}-A^{+})(X-r)\|_{1} A^D^11,(D^D)D11,Xr1\displaystyle\leq\|\widehat{A}^{\top}\widehat{D}^{-1}\|_{1,\infty}\|(\widehat{D}-D)D^{-1}\|_{1,\infty}\|X-r\|_{1}
+(A^A)D11,Xr1\displaystyle\quad+\|(\widehat{A}-A)^{\top}D^{-1}\|_{1,\infty}\|X-r\|_{1}
+MM^1,A+(Xr)1\displaystyle\quad+\|M-\widehat{M}\|_{1,\infty}\|A^{+}(X-r)\|_{1}
2ϵ,1Xr1+(2ϵ,1+ϵ1,)A+(Xr)1.\displaystyle\leq 2\epsilon_{\infty,1}\|X-r\|_{1}+(2\epsilon_{\infty,1}+\epsilon_{1,\infty})\|A^{+}(X-r)\|_{1}. (168)

By combining (167) and (168), and by using Eq. 151 together with the fact that

M1,1KM1op=KλK(M)\|M^{-1}\|_{\infty,1}\leq\sqrt{K}\|M^{-1}\|_{\mathrm{op}}={\sqrt{K}\over\lambda_{K}(M)} (169)

to collect terms, we obtain

M(A^+A+)(Xr)1ϵ,1Xr1+(2ϵ,1+ϵ1,)A+(Xr)1.\displaystyle\|M(\widehat{A}^{+}-A^{+})(X-r)\|_{1}\lesssim\epsilon_{\infty,1}\|X-r\|_{1}+(2\epsilon_{\infty,1}+\epsilon_{1,\infty})\|A^{+}(X-r)\|_{1}.

Finally, since Xr12\|X-r\|_{1}\leq 2 and Lemma 21 ensures that, for any t0t\geq 0,

A+(Xr)1\displaystyle\|A^{+}(X-r)\|_{1} maxj[p](A+)j2(2Klog(2K/t)N+4Klog(2K/t)N),\displaystyle\leq\max_{j\in[p]}\|(A^{+})_{j}\|_{2}\left(\sqrt{2K\log(2K/t)~{}\over N}+{4K\log(2K/t)\over N}\right), (170)

with probability at least 1t1-t, by also using

(A+)j2M1Aj2Aj1λK1(M),\|(A^{+})_{j}\|_{2}\leq{\|M^{-1}A_{j\cdot}\|_{2}\over\|A_{j\cdot}\|_{1}}\leq\lambda_{K}^{-1}(M),

we conclude

M(A^+A+)(Xr)1=𝒪(ϵ,1+(ϵ,1+ϵ1,)λK1(M)Klog(K)N).\|M(\widehat{A}^{+}-A^{+})(X-r)\|_{1}=\mathcal{O}_{\mathbb{P}}\left(\epsilon_{\infty,1}+(\epsilon_{\infty,1}+\epsilon_{1,\infty})\lambda_{K}^{-1}(M)\sqrt{K\log(K)\over N}\right).

This further implies Eq. 161 under λK1(M)=𝒪(N)\lambda_{K}^{-1}(M)=\mathcal{O}(\sqrt{N}), Eq. 151 and Eq. 164.

Proof of Eq. 162: By similar arguments, we have

Nu(A^+A+)ruVLSuNM(A^+A+)r1κ(VLS).\sqrt{N}{u^{\top}(\widehat{A}^{+}-A^{+})r\over\sqrt{u^{\top}V_{LS}u}}\leq\sqrt{N}{\|M(\widehat{A}^{+}-A^{+})r\|_{1}\over\sqrt{\kappa(V_{LS})}}.

Since, for r=Aαr=A\alpha,

M(A^+A+)r1\displaystyle\|M(\widehat{A}^{+}-A^{+})r\|_{1} =MA^+(AA^)α1\displaystyle=\|M\widehat{A}^{+}(A-\widehat{A})\alpha\|_{1}
MA^+1,(A^A)α1\displaystyle\leq\|M\widehat{A}^{+}\|_{1,\infty}\|(\widehat{A}-A)\alpha\|_{1}
(A^D^11,+(M^M)M^1A^D^11,)A^A1,\displaystyle\leq\left(\|\widehat{A}^{\top}\widehat{D}^{-1}\|_{1,\infty}+\|(\widehat{M}-M)\widehat{M}^{-1}\widehat{A}^{\top}\widehat{D}^{-1}\|_{1,\infty}\right)\|\widehat{A}-A\|_{1,\infty}
(1+M^M1,M^11,)ϵ1,\displaystyle\leq(1+\|\widehat{M}-M\|_{1,\infty}\|\widehat{M}^{-1}\|_{1,\infty})\epsilon_{1,\infty}
(1+2M11,(2ϵ,1+ϵ1,))ϵ1,\displaystyle\leq\Bigl{(}1+2\|M^{-1}\|_{1,\infty}(2\epsilon_{\infty,1}+\epsilon_{1,\infty})\Bigr{)}\epsilon_{1,\infty}

by using Lemma 20 together with (174) in the last step, invoking Eq. 151, (169) and λK1(M)=𝒪(N)\lambda_{K}^{-1}(M)=\mathcal{O}(\sqrt{N}) concludes

M(A^+A+)r1=𝒪(ϵ1,)\|M(\widehat{A}^{+}-A^{+})r\|_{1}=\mathcal{O}(\epsilon_{1,\infty})

implying Eq. 162. The proof is complete. ∎

H.2 Lemmas used in the proof of Theorem 14

The following lemma controls various quantities related with M^\widehat{M}.

Lemma 20.

Under Assumption 1, assume

M1,1(ϵ1,+2ϵ,1)1/2.\|M^{-1}\|_{\infty,1}\left(\epsilon_{1,\infty}+2\epsilon_{\infty,1}\right)\leq 1/2. (171)

Then on the event that Eq. 138 and Eq. 141 hold, one has

λK(M^)λK(M)/2,M^1,12M1,1\lambda_{K}(\widehat{M})\geq\lambda_{K}(M)/2,\qquad\|\widehat{M}^{-1}\|_{\infty,1}\leq 2\|M^{-1}\|_{\infty,1}

and

max{(M^M)M^11,,(M^M)M^1,1}2M1,1(2ϵ,1+ϵ1,).\max\left\{\|(\widehat{M}-M)\widehat{M}^{-1}\|_{1,\infty},~{}\|(\widehat{M}-M)\widehat{M}^{-1}\|_{\infty,1}\right\}\leq 2\|M^{-1}\|_{\infty,1}(2\epsilon_{\infty,1}+\epsilon_{1,\infty}).

Moreover, λK(M^)λK(M)/2\lambda_{K}(\widehat{M})\geq\lambda_{K}(M)/2 alone is guaranteed by Assumption 1 and 4ϵ,1+2ϵ1,λK(M)4\epsilon_{\infty,1}+2\epsilon_{1,\infty}\leq\lambda_{K}(M).

Proof.

Without loss of generality, we take the label permutation in Eq. 138 and Eq. 141 to be the identity. First, by using MopM,1\|M\|_{\mathrm{op}}\leq\|M\|_{\infty,1} for any symmetric matrix MM, we find

M1,1M1op=λK1(M)1.\|M^{-1}\|_{\infty,1}\geq\|M^{-1}\|_{\mathrm{op}}=\lambda_{K}^{-1}(M)\geq 1. (172)

Then condition (171) guarantees, for any j[p]j\in[p],

A^j1Aj1(1A^jAj1Aj1)12Aj1.\|\widehat{A}_{j\cdot}\|_{1}\geq\|A_{j\cdot}\|_{1}\left(1-{\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{1}\over\|A_{j\cdot}\|_{1}}\right)\geq{1\over 2}\|A_{j\cdot}\|_{1}. (173)

To prove the first result, by Weyl’s inequality, we have

λK(M^)λK(M)M^Mop.\lambda_{K}(\widehat{M})\geq\lambda_{K}(M)-\|\widehat{M}-M\|_{\mathrm{op}}.

Using MopM,1\|M\|_{\mathrm{op}}\leq\|M\|_{\infty,1} again yields

M^Mop\displaystyle\|\widehat{M}-M\|_{\mathrm{op}} M^M,1\displaystyle\leq\|\widehat{M}-M\|_{\infty,1}
A^D^1(A^A),1+A^(D^1D1)A,1\displaystyle\leq\|\widehat{A}^{\top}\widehat{D}^{-1}(\widehat{A}-A)\|_{\infty,1}+\|\widehat{A}^{\top}(\widehat{D}^{-1}-D^{-1})A\|_{\infty,1}
+(A^A)D1A,1\displaystyle\quad+\|(\widehat{A}-A)^{\top}D^{-1}A\|_{\infty,1}
(i)D^1(A^A),1+D^1(D^D)D1A,1\displaystyle\overset{(i)}{\leq}\|\widehat{D}^{-1}(\widehat{A}-A)\|_{\infty,1}+\|\widehat{D}^{-1}(\widehat{D}-D)D^{-1}A\|_{\infty,1}
+(A^A),1D1A,1\displaystyle\quad+\|(\widehat{A}-A)^{\top}\|_{\infty,1}\|D^{-1}A\|_{\infty,1}
(ii)2maxj[p]A^jAj1A^j1+maxk[K]A^kAk1,\displaystyle\overset{(ii)}{\leq}2\max_{j\in[p]}{\|\widehat{A}_{j\cdot}-A_{j\cdot}\|_{1}\over\|\widehat{A}_{j\cdot}\|_{1}}+\max_{k\in[K]}\|\widehat{A}_{k}-A_{k}\|_{1}, (174)

where the step (i)(i) uses A^,1=1\|\widehat{A}^{\top}\|_{\infty,1}=1 and the step (ii)(ii) is due to D1A,1=maxjAj1/Aj1=1\|D^{-1}A\|_{\infty,1}=\max_{j}\|A_{j\cdot}\|_{1}/\|A_{j\cdot}\|_{1}=1. Invoke Eq. 141 to obtain

M^Mop2ϵ,1+ϵ1,(171)12M1,1.\displaystyle\|\widehat{M}-M\|_{\mathrm{op}}\leq 2\epsilon_{\infty,1}+\epsilon_{1,\infty}\overset{(\ref{cond_A_cr})}{\leq}{1\over 2\|M^{-1}\|_{\infty,1}}.

Together with (172), we conclude λK(M^)λK(M)/2\lambda_{K}(\widehat{M})\geq\lambda_{K}(M)/2. Moreover, we readily see that λK(M^)λK(M)/2\lambda_{K}(\widehat{M})\geq\lambda_{K}(M)/2 only requires λK(M)/22ϵ,1+ϵ1,\lambda_{K}(M)/2\geq 2\epsilon_{\infty,1}+\epsilon_{1,\infty}.

To prove the second result, start with the decomposition

M^1M1\displaystyle\widehat{M}^{-1}-M^{-1} =M1(MM^)M^1\displaystyle=M^{-1}(M-\widehat{M})\widehat{M}^{-1}
=M1AD1(AA^)+M1A(D1D^1)A^M^1\displaystyle=M^{-1}A^{\top}D^{-1}(A-\widehat{A})+M^{-1}A^{\top}(D^{-1}-\widehat{D}^{-1})\widehat{A}\widehat{M}^{-1}
+M1(AA^)D^1A^M^1.\displaystyle\quad+M^{-1}(A-\widehat{A})^{\top}\widehat{D}^{-1}\widehat{A}\widehat{M}^{-1}. (175)

It then follows from dual-norm inequalities, A,1=1\|A^{\top}\|_{\infty,1}=1 and D^1A^,1=1\|\widehat{D}^{-1}\widehat{A}\|_{\infty,1}=1 that

M^1,1\displaystyle\|\widehat{M}^{-1}\|_{\infty,1} M1,1+M^1M1,1\displaystyle\leq\|M^{-1}\|_{\infty,1}+\|\widehat{M}^{-1}-M^{-1}\|_{\infty,1}
M1,1+M1,1D1(AA^),1\displaystyle\leq\|M^{-1}\|_{\infty,1}+\|M^{-1}\|_{\infty,1}\|D^{-1}(A-\widehat{A})\|_{\infty,1}
+M1,1(D1D^1)A^,1M^1,1\displaystyle\quad+\|M^{-1}\|_{\infty,1}\|(D^{-1}-\widehat{D}^{-1})\widehat{A}\|_{\infty,1}\|\widehat{M}^{-1}\|_{\infty,1}
+M1,1(AA^),1M^1,1\displaystyle\quad+\|M^{-1}\|_{\infty,1}\|(A-\widehat{A})^{\top}\|_{\infty,1}\|\widehat{M}^{-1}\|_{\infty,1}
M1,1+M1,1D1(AA^),1\displaystyle\leq\|M^{-1}\|_{\infty,1}+\|M^{-1}\|_{\infty,1}\|D^{-1}(A-\widehat{A})\|_{\infty,1}
+M1,1D1(D^D),1M^1,1\displaystyle\quad+\|M^{-1}\|_{\infty,1}\|D^{-1}(\widehat{D}-D)\|_{\infty,1}\|\widehat{M}^{-1}\|_{\infty,1}
+M1,1(AA^),1M^1,1\displaystyle\quad+\|M^{-1}\|_{\infty,1}\|(A-\widehat{A})^{\top}\|_{\infty,1}\|\widehat{M}^{-1}\|_{\infty,1}
M1,1+M1,1ϵ,1\displaystyle\leq\|M^{-1}\|_{\infty,1}+\|M^{-1}\|_{\infty,1}\epsilon_{\infty,1}
+M1,1(ϵ,1+ϵ1,)M^1,1.\displaystyle\quad+\|M^{-1}\|_{\infty,1}(\epsilon_{\infty,1}+\epsilon_{1,\infty})\|\widehat{M}^{-1}\|_{\infty,1}.

Since, similar to Eq. 172, we also have M^1,11\|\widehat{M}^{-1}\|_{\infty,1}\geq 1, we can invoke condition Eq. 171 to conclude

M^1,12M1,1,\|\widehat{M}^{-1}\|_{\infty,1}\leq 2\|M^{-1}\|_{\infty,1},

completing the proof of the second result.

Finally, (174), together with the second result, immediately gives

(M^M)M^1,1\displaystyle\|(\widehat{M}-M)\widehat{M}^{-1}\|_{\infty,1} =M^M,1M^1,12M1,1(2ϵ,1+ϵ1,).\displaystyle=\|\widehat{M}-M\|_{\infty,1}\|\widehat{M}^{-1}\|_{\infty,1}\leq 2\|M^{-1}\|_{\infty,1}(2\epsilon_{\infty,1}+\epsilon_{1,\infty}).

The same bound also holds for (M^M)M^11,\|(\widehat{M}-M)\widehat{M}^{-1}\|_{1,\infty} by the symmetry of M^M\widehat{M}-M and M^1\widehat{M}^{-1}. ∎

Lemma 21.

For any fixed matrix BK×pB\in\mathbb{R}^{K\times p}, with probability at least 1t1-t for any t(0,1)t\in(0,1), we have

B(Xr)1maxj[p]Bj2(2Klog(2K/t)N+4Klog(2K/t)N).\|B(X-r)\|_{1}~{}\leq~{}\max_{j\in[p]}\|B_{j}\|_{2}\left(\sqrt{2K\log(2K/t)~{}\over N}+{4K\log(2K/t)\over N}\right).
Proof.

Start with B(Xr)1=k[K]|Bk(Xr)|\|B(X-r)\|_{1}=\sum_{k\in[K]}|B_{k\cdot}^{\top}(X-r)| and pick any k[K]k\in[K]. Note that

NXMultinomialp(N;r).NX\sim\textrm{Multinomial}_{p}(N;r).

By writing

Xr=1Nt=1N(Ztr)X-r={1\over N}\sum_{t=1}^{N}\left(Z_{t}-r\right)

with ZtMultinomialp(1;r)Z_{t}\sim\textrm{Multinomial}_{p}(1;r), we have

Bk(Xr)=1Nt=1NBk(Ztr).B_{k\cdot}^{\top}(X-r)={1\over N}\sum_{t=1}^{N}B_{k\cdot}^{\top}\left(Z_{t}-r\right).

To apply the Bernstein inequality, recall that rΔpr\in\Delta_{p}. We find that Bk(Ztr)B_{k\cdot}^{\top}\left(Z_{t}-r\right) has zero mean, |Bk(Ztr)|2Bk|B_{k\cdot}^{\top}\left(Z_{t}-r\right)|\leq 2\|B_{k\cdot}\|_{\infty} for all t[N]t\in[N] and

1Nt=1NVar(Bk(Ztr))Bkdiag(r)Bk.{1\over N}\sum_{t=1}^{N}\textrm{Var}\left(B_{k\cdot}^{\top}\left(Z_{t}-r\right)\right)\leq B_{k\cdot}^{\top}\textrm{diag}(r)B_{k\cdot}.

An application of the Bernstein inequality gives, for any t0t\geq 0,

{Bk(Xr)tBkdiag(r)BkN+2tBkN}12et/2.\mathbb{P}\left\{B_{k\cdot}^{\top}(X-r)\leq\sqrt{t~{}B_{k\cdot}^{\top}\textrm{diag}(r)B_{k\cdot}\over N}+{2t\|B_{k\cdot}\|_{\infty}\over N}\right\}\geq 1-2e^{-t/2}.

Choosing t=2log(2K/ϵ)t=2\log(2K/\epsilon) for any ϵ(0,1)\epsilon\in(0,1) and taking a union bound over k[K]k\in[K] yields

B(Xr)12log(2K/ϵ)Nk[K]Bkdiag(r)Bk+4log(2K/ϵ)Nk[K]Bk\|B(X-r)\|_{1}\leq\sqrt{2\log(2K/\epsilon)~{}\over N}\sum_{k\in[K]}\sqrt{B_{k\cdot}^{\top}\textrm{diag}(r)B_{k\cdot}}+{4\log(2K/\epsilon)\over N}\sum_{k\in[K]}\|B_{k\cdot}\|_{\infty}

with probability exceeding 1ϵ1-\epsilon. The result thus follows by noting that

k[K]Bkdiag(r)BkKk[K]j[p]rjBkj2Kmaxj[p]Bj2\sum_{k\in[K]}\sqrt{B_{k\cdot}^{\top}\textrm{diag}(r)B_{k\cdot}}\leq\sqrt{K}\sqrt{\sum_{k\in[K]}\sum_{j\in[p]}r_{j}B_{kj}^{2}}\leq\sqrt{K}\max_{j\in[p]}\|B_{j}\|_{2}

and k[K]BkKmaxj[p]Bj2\sum_{k\in[K]}\|B_{k\cdot}\|_{\infty}\leq K\max_{j\in[p]}\|B_{j}\|_{2}. ∎