
Minimax-optimal estimation for sparse multi-reference alignment with collision-free signals

Subhro Ghosh (Department of Mathematics, National University of Singapore; [email protected])    Soumendu Sundar Mukherjee (Statistics and Mathematics Unit, Indian Statistical Institute, Kolkata; [email protected])    Jing Bin Pan (Department of Mathematics, National University of Singapore; [email protected]; corresponding author). Authors are listed in the alphabetical order of their surnames.
Abstract

The Multi-Reference Alignment (MRA) problem aims at the recovery of an unknown signal from repeated observations under the latent action of a group of cyclic isometries, in the presence of additive noise of high intensity σ\sigma. It is a more tractable version of the celebrated cryo-EM model. In the crucial high noise regime, it is known that its sample complexity scales as σ6\sigma^{6}. Recent investigations have shown that for the practically significant setting of sparse signals, the sample complexity of the maximum likelihood estimator asymptotically scales with the noise level as σ4\sigma^{4}. In this work, we investigate minimax optimality for signal estimation under the MRA model for so-called collision-free signals. In particular, this signal class covers the setting of generic signals of dilute sparsity (wherein the support size s=O(L1/3)s=O(L^{1/3}), where LL is the ambient dimension).

We demonstrate that the minimax optimal rate of estimation for the sparse MRA problem in this setting is σ2/n\sigma^{2}/\sqrt{n}, where nn is the sample size. In particular, this considerably generalizes the sample complexity asymptotics for the restricted MLE in this setting, establishing it as a statistically optimal estimator. Finally, we establish a concentration inequality for the deviations of the restricted MLE from the ground truth.

1 Introduction

1.1 Introduction

Multi-reference alignment (MRA) is the problem of estimating an unknown signal from noisy samples that are subject to latent rotational transformations. In recent years, the MRA model has found practical applications in many scientific fields, such as structural biology StructuralBiology1 StructuralBiology2 StructuralBiology3 StructuralBiology4 StructuralBiology5 StructuralBiology6 , image recognition ImageRecognition1 ImageRecognition2 ImageRecognition3 ImageRecognition4 , robotics Robotics and signal processing SignalProcessing1 SignalProcessing2 , and has garnered significant research interest as a result. Most notably, the MRA model serves as a simplified model for the three-dimensional reconstruction problem in cryo-electron microscopy (cryo-EM) CryoEM1 Generic CryoEM2 .

In this paper, we study the estimation problem in the MRA model MRA1 Generic MRA2 for collision-free signals. Identify the vector space L\mathbb{R}^{L} with the space of functions /L\mathbb{Z}/L\mathbb{Z}\to\mathbb{R}. Let

:={R(): 1L}\displaystyle\mathcal{R}:=\{R^{(\ell)}\ :\ 1\leq\ell\leq L\}

denote the cyclic group of coordinate shifts, where R():LLR^{(\ell)}:\mathbb{R}^{L}\to\mathbb{R}^{L} is the linear operator given by (R()θ)(i)=θ(i+)(R^{(\ell)}\theta)(i)=\theta(i+\ell). In the general MRA estimation problem, the objective is to recover an unknown parameter θL\theta\in\mathbb{R}^{L}, known in the literature as a signal, from nn independent noisy observations X1,X2,,XnX_{1},X_{2},\cdots,X_{n} given by

Xi=Giθ+σξi,i{1,,n},\displaystyle X_{i}=G_{i}\theta+\sigma\xi_{i},\ \ \ \ \ \ \ i\in\{1,\cdots,n\}, (1)

where the GiG_{i}’s are drawn from \mathcal{R} independently and uniformly at random and each ξi𝒩(𝟎,𝑰L)\xi_{i}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I}_{L}) is i.i.d standard Gaussian noise that is independent of the GiG_{i}’s. We remark that while the canonical distribution on \mathcal{R} is uniform, other distributions have also been considered OtherDistribution .
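For concreteness, the observation model (1) is easy to simulate. The following Python sketch is our own and purely illustrative (the helper name sample_mra is not from any reference implementation): it draws a uniform cyclic shift for each sample and adds Gaussian noise.

import numpy as np

def sample_mra(theta, sigma, n, seed=None):
    # Draw X_i = G_i theta + sigma * xi_i as in (1): a uniformly random cyclic
    # shift of theta, plus i.i.d. N(0, I_L) noise.
    rng = np.random.default_rng(seed)
    L = theta.shape[0]
    shifts = rng.integers(0, L, size=n)                  # G_i uniform on the cyclic group
    X = np.stack([np.roll(theta, -s) for s in shifts])   # (R^(l) theta)(i) = theta(i + l)
    return X + sigma * rng.standard_normal((n, L))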

In this model, any vector θ\theta is statistically indistinguishable from its cyclic shifts R(1)θ,,R(L1)θR^{(1)}\theta,\cdots,R^{(L-1)}\theta. Consequently, the parameter θ\theta is only identifiable up to the action of \mathcal{R}. Hence we study the estimation problem under the rotation-invariant distance

ρ(θ,ϕ):=minGθGϕ.\displaystyle\rho(\theta,\phi):=\min_{G\in\mathcal{R}}\left\lVert\theta-G\phi\right\rVert.

Here, \left\lVert\cdot\right\rVert refers to the usual Euclidean norm.
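Continuing the sketch above, the rotation-invariant distance can be evaluated by brute force over all L shifts (again our own illustrative helper, not from the paper):

import numpy as np

def rho(theta, phi):
    # Rotation-invariant distance: minimize ||theta - G phi|| over all L shifts G.
    return min(np.linalg.norm(theta - np.roll(phi, l)) for l in range(theta.shape[0]))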

Both the dimension LL and the noise level σ\sigma are assumed to be known. In many practical applications such as cryo-EM, the noise level is very high, but it can be lowered over time by technological improvements Improvement . As such, understanding the relationship between the difficulty of the estimation problem and the quantity θ/σ\left\lVert\theta\right\rVert/\sigma, known as the signal-to-noise ratio, is of fundamental importance. With that in mind, we focus our work on understanding the asymptotic sample complexity in the MRA model as σ+\sigma\to+\infty.

The MRA problem, particularly in the low-noise regime, has been mostly attacked using the synchronization approach MRA1 . However, it was recently recognised as a Gaussian mixture model BRW , which enabled the use of various methods such as the method of moments MethodOfMoments Generic and expectation-maximization ExpectationMaximization . For a more detailed discussion on the likelihood landscape of such models, we refer the reader to LikelihoodLandscape1 LikelihoodLandscape2 LikelihoodLandscape3 LikelihoodLandscape4 .

1.2 Estimation rates for MRA and sparsity

In the two seminal papers by Bandeira, Rigollet and Weed BRW Generic in 2017, it was shown that the general MRA model in the high noise regime has a sample complexity of O(σL)O(\sigma^{L}) in the worst case, and O(σ6)O(\sigma^{6}) in the case of signals having full Fourier support. This has motivated investigations into variations of the MRA model where a better sample complexity may be achieved. One such variation is the MRA estimation problem for certain classes of sparse signals, where the restricted maximum likelihood estimator (abbrv. MLE) has been shown to exhibit an improved sample complexity of O(σ4)O(\sigma^{4}) Sparse . We refer the reader to Sparse1 and Sparse2 for an investigation into algorithmic aspects of sparse MRA and the role of sparsity in the related problem of cryo-EM.

While it is of interest to understand the sample complexity asymptotics for the restricted MLE, this only sheds light on the performance of one particular type of estimator. In order to have a complete understanding of the problem, one needs to understand minimax optimal rates of estimation for the model. In other words, the central question is: what is the best achievable worst-case error, over a sufficiently rich class of signals, among all candidate estimators? Minimax optimality in statistical problems has a long and rich history; for a detailed account we refer the interested reader to Tsybakov . In the context of the MRA problem, the minimax optimal rate of estimation for the general MRA problem is known to be σ3/n\sigma^{3}/\sqrt{n} (c.f. BRW ; Zhou-Zhou2022 ). However, minimax optimality for the sparse MRA problem is not understood, and in view of the different sample complexity asymptotics, it is not covered by the minimax theory on general signal spaces. This is the broad direction that we explore in this work.

1.3 The space of collision-free signals

In this work, we will focus our attention on a particular class of sparse signals called collision-free signals. For a vector θL\theta\in\mathbb{R}^{L}, let 𝒟(θ)\mathcal{D}(\theta) denote the multiset of differences

𝒟(θ):={ij(mod L):i,jsupp(θ) with ij}.\displaystyle\mathcal{D}(\theta):=\big{\{}i-j\ (\text{mod }L)\ :\ i,j\in\text{supp}(\theta)\text{ with }i\neq j\big{\}}.

We say that θ\theta is collision-free if each element in 𝒟(θ)\mathcal{D}(\theta) appears with multiplicity exactly 11.
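Collision-freeness is a purely combinatorial condition on the support and is straightforward to test. The snippet below (our own illustrative helper; is_collision_free is a hypothetical name) builds the multiset of differences and checks the multiplicities.

from collections import Counter

def is_collision_free(support, L):
    # Multiset D of nonzero differences i - j (mod L) over the support; the
    # signal is collision-free iff every difference appears exactly once.
    diffs = Counter((i - j) % L for i in support for j in support if i != j)
    return all(mult == 1 for mult in diffs.values())

assert is_collision_free({0, 1, 3}, 7)        # all six differences distinct
assert not is_collision_free({0, 1, 2}, 7)    # the difference 1 arises twice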

The notion of collision-freeness is motivated by the beltway problem in combinatorial optimization. The beltway problem consists of recovering a set 𝒮\mathcal{S} of integers from their pairwise differences 𝒟𝒮\mathcal{D}_{\mathcal{S}}, up to the trivial symmetry of translating all the numbers in the set by the same amount. In 1939, Piccard Piccard conjectured that if two sets of integers 𝒮1\mathcal{S}_{1} and 𝒮2\mathcal{S}_{2} have the same set of pairwise differences 𝒟\mathcal{D}, and the pairwise differences are known to be unique (i.e. 𝒮1\mathcal{S}_{1} and 𝒮2\mathcal{S}_{2} are collision-free), then the sets 𝒮1\mathcal{S}_{1} and 𝒮2\mathcal{S}_{2} must be translates of each other.

A counterexample to Piccard’s conjecture was found by Bloom Bloom in 1975 for |𝒮|=6|\mathcal{S}|=6. In 2007, a complete resolution was brought about by Bekir and Golomb Golomb , who showed that Piccard’s conjecture holds for |𝒮|7|\mathcal{S}|\geq 7. In our current context, this means that the support of a collision-free signal θ\theta can be recovered from 𝒟(θ)\mathcal{D}(\theta) up to translation as long as |Supp(θ)|7|\text{Supp}(\theta)|\geq 7. This fact will play a crucial role in allowing us to obtain improved bounds on the sampling complexity.

With the above discussion in mind, we will assume that the unknown parameter θ\theta satisfies:

  (i) θ\theta is collision-free;

  (ii) σθ\sigma\geq\left\lVert\theta\right\rVert;

  (iii) There exist positive constants m,M,ϵ>0m,M,\epsilon\in\mathbb{R}_{>0} such that m|θ(i)|Mm\leq|\theta(i)|\leq M for all iSupp(θ)i\in\text{Supp}(\theta) and s:=|Supp(θ)|max{7,(2+ϵ)M2/m2}s:=|\text{Supp}(\theta)|\geq\max\{7,(2+\epsilon)M^{2}/m^{2}\}.

Let 𝒯\mathcal{T} denote the set of signals satisfying the above three conditions. In our analysis, we will only keep track of the dependence on σ\sigma and ss. The other quantities L,m,ML,m,M and ϵ\epsilon are fixed throughout the paper.

Collision-free signals arise as a natural concept in the setting of sparsity, especially in the context of sparse MRA. It is straightforward to envisage that the sparser a signal, the more likely it is for its support to be a collision-free subset of /L\mathbb{Z}/L\mathbb{Z}. Indeed, a randomly chosen subset of /L\mathbb{Z}/L\mathbb{Z} (with expected size ss) is collision-free with high probability as long as s=O(L1/3)s=O(L^{1/3}) (cf. Sparse , Appendix D); a quick empirical check along these lines appears after this paragraph. The latter corresponds to the regime of so-called dilute sparsity, per the terminology used in the investigations in Sparse , and has a natural place in the study of sparse MRA.
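The following sketch (ours; the sizes in the closing comment are arbitrary illustrative choices, not taken from Sparse ) estimates the probability that a uniformly random support is collision-free, reusing is_collision_free from the earlier sketch.

import numpy as np

def collision_free_fraction(L, s, trials=2000, seed=0):
    # Empirical probability that a uniformly random size-s support in Z/LZ
    # is collision-free.
    rng = np.random.default_rng(seed)
    hits = sum(
        is_collision_free(set(rng.choice(L, size=s, replace=False).tolist()), L)
        for _ in range(trials)
    )
    return hits / trials

# For s on the order of L^{1/3} (say L = 10**4 and s = 21), this fraction should
# be close to 1, in line with (Sparse, Appendix D).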

1.4 Main Results

For a signal θ\theta, let PθP_{\theta} denote the distribution of a random variable XX satisfying

X=Gθ+σξ,\displaystyle X=G\theta+\sigma\xi,

where GG is drawn from \mathcal{R} uniformly and ξ𝒩(𝟎,𝑰L)\xi\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I}_{L}). Let fθ:Lf_{\theta}:\mathbb{R}^{L}\to\mathbb{R} denote the density function of PθP_{\theta}. Explicitly,

fθ(x)\displaystyle f_{\theta}(x) =1σL(2π)L/2𝔼G[exp(12σ2xGθ2)].\displaystyle=\frac{1}{\sigma^{L}(2\pi)^{L/2}}\mathbb{E}_{G}\bigg{[}\exp\bigg{(}-\frac{1}{2\sigma^{2}}\left\lVert x-G\theta\right\rVert^{2}\bigg{)}\bigg{]}.

Given nn i.i.d samples X1,,XnX_{1},\cdots,X_{n} as in (1), we analyse the performance of the restricted maximum likelihood estimator (MLE) θ~n\tilde{\theta}_{n} given by

θ~n:=arg maxϕ𝒯i=1nlog𝔼G[exp(12σ2XiGϕ2)].\displaystyle\tilde{\theta}_{n}:=\underset{\phi\in\mathcal{T}}{\text{arg max}}\sum_{i=1}^{n}\log\mathbb{E}_{G}\bigg{[}\exp\Big{(}-\frac{1}{2\sigma^{2}}\left\lVert X_{i}-G\phi\right\rVert^{2}\Big{)}\bigg{]}.
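For illustration, the objective on the right-hand side can be evaluated as follows (a minimal Python sketch of our own, assuming scipy is available; the restriction to 𝒯 — e.g. a search over candidate sparse supports together with the entrywise constraints — would be enforced on top of this and is not shown).

import numpy as np
from scipy.special import logsumexp

def mra_log_likelihood(phi, X, sigma):
    # sum_i log E_G exp(-||X_i - G phi||^2 / (2 sigma^2)), with E_G a uniform
    # average over the L cyclic shifts of phi.
    L = phi.shape[0]
    rotations = np.stack([np.roll(phi, l) for l in range(L)])            # all G phi
    sq_dists = ((X[:, None, :] - rotations[None, :, :]) ** 2).sum(-1)    # shape (n, L)
    return (logsumexp(-sq_dists / (2 * sigma**2), axis=1) - np.log(L)).sum()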

Our main results are the following two theorems, which characterise the rate of estimation in the collision-free MRA model.

Theorem 1.4.1.

We have that

infθ^nsupθ𝒯𝔼[ρ(θ^n,θ)]σ2n,\inf_{\hat{\theta}_{n}}\sup_{\theta\in\mathcal{T}}\mathbb{E}\big{[}\rho(\hat{\theta}_{n},\theta)\big{]}\asymp\frac{\sigma^{2}}{\sqrt{n}}, (2)

where the infimum is taken over all estimators θ^n\hat{\theta}_{n} based on nn samples X1,,XnX_{1},\cdots,X_{n}.

Theorem 1.4.2.

Fix a constant δ(0,2LM)\delta\in(0,2\sqrt{L}M). There exist constants Cs,C~s>0C_{s},\tilde{C}_{s}\in\mathbb{R}_{>0} depending on ss such that for all nCsδ4σ12n\geq C_{s}\delta^{-4}\sigma^{12} and for all θ𝒯\theta\in\mathcal{T}, the restricted MLE θ~n\tilde{\theta}_{n} satisfies

P(ρ(θ~n,θ)δ)Csσ5sδ2sexp(C~snδ4σ12).\displaystyle P(\rho(\tilde{\theta}_{n},\theta)\geq\delta)\leq C_{s}\sigma^{5s}\delta^{-2s}\exp\bigg{(}-\frac{\tilde{C}_{s}n\delta^{4}}{\sigma^{12}}\bigg{)}.

Theorem 1.4.1 is a direct consequence of the following two results, which we state below as they may be of independent interest.

The first result is a uniform upper bound on the estimation error of the restricted MLE, corresponding to a sample complexity of O(σ4)O(\sigma^{4}); this is a marked improvement over the pointwise upper bound in (Sparse, , Theorem 4).

Theorem 1.4.3.

There exist positive constants CC and CsC_{s}, where CsC_{s} depends on ss, such that the restricted MLE θ~n\tilde{\theta}_{n} satisfies

𝔼[ρ(θ~n,θ)]Cσ2n+Cσ3lognsn+Csσ11n\mathbb{E}\big{[}\rho(\tilde{\theta}_{n},\theta)\big{]}\leq C\cdot\frac{\sigma^{2}}{\sqrt{n}}+C\cdot\frac{\sigma^{3}\log n}{sn}+C_{s}\cdot\frac{\sigma^{11}}{n} (3)

uniformly over all choices θ𝒯\theta\in\mathcal{T} for all σ\sigma sufficiently large.

The second result is a lower bound on the minimax rate of estimation in the sparse MRA model. In particular, it shows that the restricted MLE achieves the optimal sampling complexity of O(σ4)O(\sigma^{4}).

Theorem 1.4.4.

Fix a sparsity s1s\in\mathbb{Z}_{\geq 1}. Let 𝒯s\mathcal{T}_{s} denote the set of vectors θ𝒯\theta\in\mathcal{T} satisfying |Supp(θ)|=s|\text{Supp}(\theta)|=s. For any σmaxθ𝒯sθ\sigma\geq\max_{\theta\in\mathcal{T}_{s}}\left\lVert\theta\right\rVert, there exists a universal constant MM such that

infθ~nsupθ𝒯s𝔼θ[ρ(θ~n,θ)]Mmin{σ2n, 1},\displaystyle\inf_{\tilde{\theta}_{n}}\sup_{\theta\in\mathcal{T}_{s}}\mathbb{E}_{\theta}\big{[}\rho(\tilde{\theta}_{n},\theta)\big{]}\geq M\min\bigg{\{}\frac{\sigma^{2}}{\sqrt{n}},\ 1\bigg{\}},

where the infimum is taken over all estimators θ~n\tilde{\theta}_{n} on nn samples from PθP_{\theta}.

In Section 2, we will introduce the key ingredients that will be used to derive our main results. Sections 3, 4 and 5 are dedicated to the proofs of Theorems 1.4.3, 1.4.4 and 1.4.2, respectively.

2 Preliminaries

In this section, we introduce and prove some properties of the Kullback-Leibler divergence and the moment tensors. Both are key tools that play a central role in our analysis.

2.1 Kullback-Leibler Divergence and Moment Tensors

In the MRA model, the Kullback-Leibler (KL) divergence admits an explicit formula:

DKL(PθPϕ)=12σ2(ϕ2θ2)+𝔼ξ[log𝔼G[exp(1σ2(θ+σξ)TGθ)]𝔼G[exp(1σ2(θ+σξ)TGϕ)]].\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})=\frac{1}{2\sigma^{2}}(\left\lVert\phi\right\rVert^{2}-\left\lVert\theta\right\rVert^{2})+\mathbb{E}_{\xi}\bigg{[}\log\frac{\mathbb{E}_{G}[\exp(\frac{1}{\sigma^{2}}(\theta+\sigma\xi)^{T}G\theta)]}{\mathbb{E}_{G}[\exp(\frac{1}{\sigma^{2}}(\theta+\sigma\xi)^{T}G\phi)]}\bigg{]}.
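Since the expectation over the noise has no closed form, the formula above lends itself to Monte Carlo evaluation. The sketch below (our own illustration, assuming scipy is available; kl_mra is a hypothetical name) estimates the KL divergence this way; the normalizing log L terms inside the two logsumexp calls cancel in the ratio.

import numpy as np
from scipy.special import logsumexp

def kl_mra(theta, phi, sigma, n_mc=100_000, seed=0):
    # Monte Carlo estimate of D_KL(P_theta || P_phi) via the displayed formula:
    # the expectation over xi ~ N(0, I_L) is replaced by an empirical average.
    rng = np.random.default_rng(seed)
    L = theta.shape[0]
    G_theta = np.stack([np.roll(theta, l) for l in range(L)])   # all rotations of theta
    G_phi = np.stack([np.roll(phi, l) for l in range(L)])
    y = theta + sigma * rng.standard_normal((n_mc, L))          # samples of theta + sigma*xi
    log_num = logsumexp(y @ G_theta.T / sigma**2, axis=1)
    log_den = logsumexp(y @ G_phi.T / sigma**2, axis=1)
    return (phi @ phi - theta @ theta) / (2 * sigma**2) + (log_num - log_den).mean()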

An important piece of machinery to analyse the KL divergence is the family of moment tensors. For any positive integer mm, the mth moment tensor of a vector θL\theta\in\mathbb{R}^{L} is the quantity 𝔼[(Gθ)m](L)m.\mathbb{E}[(G\theta)^{\otimes m}]\in(\mathbb{R}^{L})^{\otimes m}. If ϕL\phi\in\mathbb{R}^{L} is another vector, the mth moment difference tensor between θ\theta and ϕ\phi is defined to be

Δm(θ,ϕ):=𝔼[(Gθ)m(Gϕ)m](L)m.\displaystyle\Delta_{m}(\theta,\phi):=\mathbb{E}[(G\theta)^{\otimes m}-(G\phi)^{\otimes m}]\in(\mathbb{R}^{L})^{\otimes m}.
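Since the group is cyclic of order L, the moment tensors are plain averages over the L rotations and can be formed by brute force for small L and m; the helpers below (our own names, purely illustrative) do exactly this.

import functools
import numpy as np

def moment_tensor(theta, m):
    # E[(G theta)^{tensor m}]: average the m-fold outer product over all L shifts.
    L = theta.shape[0]
    T = np.zeros((L,) * m)
    for l in range(L):
        v = np.roll(theta, l)
        T += functools.reduce(np.multiply.outer, [v] * m)
    return T / L

def moment_diff(theta, phi, m):
    # The m-th moment difference tensor Delta_m(theta, phi).
    return moment_tensor(theta, m) - moment_tensor(phi, m)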

The relationship between the KL divergence and the moment tensors is given by the following results.

Theorem 2.1.1.

(BRW, , Lemma 8) Let θ,ϕL\theta,\phi\in\mathbb{R}^{L}. Let ϑ=θ𝔼[Gθ]\upvartheta=\theta-\mathbb{E}[G\theta] and φ=ϕ𝔼[Gϕ]\varphi=\phi-\mathbb{E}[G\phi]. Then

DKL(PθPϕ)=DKL(PϑPφ)+12σ2Δ1(θ,ϕ)2.\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})=D_{\text{KL}}(P_{\upvartheta}\;\|\;P_{\varphi})+\frac{1}{2\sigma^{2}}\left\lVert\Delta_{1}(\theta,\phi)\right\rVert^{2}.
Theorem 2.1.2.

(BRW, , Theorem 9) Let θ,ϕL\theta,\phi\in\mathbb{R}^{L} satisfy 3ρ(θ,ϕ)θσ3\rho(\theta,\phi)\leq\left\lVert\theta\right\rVert\leq\sigma and 𝔼[Gθ]=𝔼[Gϕ]=0\mathbb{E}[G\theta]=\mathbb{E}[G\phi]=0. For any positive integer kk, there exist universal constants C¯\underline{C} and C¯\overline{C} such that

C¯m=1Δm(θ,ϕ)2(3σ)2mm!DKL(PθPϕ)2m=1k1Δm(θ,ϕ)2σ2mm!+C¯θ2k2ρ(θ,ϕ)2σ2k.\displaystyle\underline{C}\sum_{m=1}^{\infty}\frac{\left\lVert\Delta_{m}(\theta,\phi)\right\rVert^{2}}{(\sqrt{3}\sigma)^{2m}m!}\leq D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\leq 2\sum_{m=1}^{k-1}\frac{\left\lVert\Delta_{m}(\theta,\phi)\right\rVert^{2}}{\sigma^{2m}m!}+\overline{C}\frac{\left\lVert\theta\right\rVert^{2k-2}\rho(\theta,\phi)^{2}}{\sigma^{2k}}.

A few remarks are in order. Firstly, the assumption that 𝔼[Gθ]=𝔼[Gϕ]=0\mathbb{E}[G\theta]=\mathbb{E}[G\phi]=0 means that the first moment difference tensor Δ1(θ,ϕ)\Delta_{1}(\theta,\phi) vanishes. Thus the summations in the above theorem effectively start from m=2m=2. Secondly, the result still holds if the hypothesis 3ρ(θ,ϕ)θ3\rho(\theta,\phi)\leq\left\lVert\theta\right\rVert is replaced by ρ(θ,ϕ)K0θ\rho(\theta,\phi)\leq K_{0}\left\lVert\theta\right\rVert for any fixed positive constant K0>0K_{0}\in\mathbb{R}_{>0}. In our current setting, we let K0=3K_{0}=3. A full proof of this statement can be found in Appendix C.

2.2 Collision-Free Signals

As a consequence of the resolution of Piccard’s conjecture, collision-free signals are very well-behaved locally. This notion can be formalised in terms of the following lower bounds on the KL divergence and moment tensors.

Lemma 2.2.1.

(Sparse, , Lemma 5) For any pair of collision free signals θ,ϕ𝒯\theta,\phi\in\mathcal{T}, we have

Δ2(θ,ϕ)2ϵ2+ϵ1Lsρ(θ,ϕ).\displaystyle\left\lVert\Delta_{2}(\theta,\phi)\right\rVert\geq\sqrt{\frac{2\epsilon}{2+\epsilon}}\cdot\frac{1}{\sqrt{L}}\cdot\sqrt{s}\cdot\rho(\theta,\phi).
Proposition 2.2.2.

(Sparse, , Proposition 17) There exists a positive constant ϵ0>0\epsilon_{0}\in\mathbb{R}_{>0} such that for all collision-free signals θ,ϕ𝒯\theta,\phi\in\mathcal{T} satisfying ρ(θ,ϕ)<ϵ0\rho(\theta,\phi)<\epsilon_{0} and for all σ\sigma sufficiently large,

DKL(PθPϕ)Cσ4Δ2(θ,ϕ)2.\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq C\sigma^{-4}\cdot\left\lVert\Delta_{2}(\theta,\phi)\right\rVert^{2}.

Combining the above two results, we obtain the following local lower bound for the KL divergence of collision free signals.

Proposition 2.2.3.

There exist positive constants ϵ0,C>0\epsilon_{0},C\in\mathbb{R}_{>0} such that for any pair of collision free signals θ,ϕL\theta,\phi\in\mathbb{R}^{L} with θ𝒯\theta\in\mathcal{T} and ρ(θ,ϕ)<ϵ0\rho(\theta,\phi)<\epsilon_{0} and for any σ\sigma sufficiently large,

DKL(PθPϕ)Cσ4sρ(θ,ϕ)2\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq C\sigma^{-4}s\cdot\rho(\theta,\phi)^{2}

The threshold ϵ0\epsilon_{0} also satisfies the following property: for any θ,ϕL\theta,\phi\in\mathbb{R}^{L} satisfying θ𝒯\theta\in\mathcal{T} and θϕ<ϵ0\lVert\theta-\phi\rVert<\epsilon_{0}, we have Supp(θ)=Supp(ϕ)\text{Supp}(\theta)=\text{Supp}(\phi).

2.3 Global Lower Bound for the KL Divergence

In this section, we will complement the sharp local bound for the KL divergence in Proposition 2.2.3 with the following global bound.

Proposition 2.3.1.

Let θ,ϕL\theta,\phi\in\mathbb{R}^{L} be two collision-free signals, with θ𝒯\theta\in\mathcal{T}. There exists a constant CsC_{s}, depending on ss, such that for all σ\sigma sufficiently large,

D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq C_{s}\sigma^{-6}\rho(\theta,\phi)^{2}.

As it turns out, the third moment tensor of a collision-free signal has a very simple form. Thus our strategy for proving Proposition 2.3.1 will be to show that the third moment difference tensor is nonvanishing for two collision-free signals lying in distinct orbits, and then invoke Theorem 2.1.2.

Lemma 2.3.2.

Let θ,ϕL\theta,\phi\in\mathbb{R}^{L} be two collision free signals lying in distinct orbits. Then Δ3(θ,ϕ)0\Delta_{3}(\theta,\phi)\neq 0.

Proof.

We will show that the orbit of any collision free signal θ\theta can be recovered purely from its third moment tensor. The collision free property of θ\theta implies that for any 1i,jL1\leq i,j\leq L with iji\neq j, we have that

𝔼[(Gθ)3]i,j,j\displaystyle\mathbb{E}[(G\theta)^{\otimes 3}]_{i,j,j} =1LgLθ(i+g)θ(j+g)θ(j+g)\displaystyle=\frac{1}{L}\sum_{g\in\mathbb{Z}_{L}}\theta(i+g)\theta(j+g)\theta(j+g)
={0 if ij𝒟(θ)1Lθ(i)θ(j)θ(j) otherwise, where i,jsupp(θ) with ij=ij.\displaystyle=\begin{cases}0&\ \text{ if }i-j\not\in\mathcal{D}(\theta)\\ \dfrac{1}{L}\theta(i^{\prime})\theta(j^{\prime})\theta(j^{\prime})&\ \text{ otherwise, where }i^{\prime},j^{\prime}\in\text{supp}(\theta)\text{ with }i-j=i^{\prime}-j^{\prime}.\end{cases}

In particular, the set 𝒟(θ)\mathcal{D}(\theta), and hence (by the Bekir-Golomb resolution of Piccard’s conjecture, using s7s\geq 7) the support of θ\theta, can be recovered from its third moment tensor up to \mathcal{R}-orbit. Next, for each iSupp(θ)i\in\text{Supp}(\theta), to recover θ(i)\theta(i), let j\in\text{Supp}(\theta) with j\neq i, so that i-j\in\mathcal{D}(\theta). We have that

\mathbb{E}[(G\theta)^{\otimes 3}]_{i,i,j}=\frac{1}{L}\theta(i)\theta(i)\theta(j)\ \ \ \ \text{ and }\ \ \ \ \mathbb{E}[(G\theta)^{\otimes 3}]_{i,j,j}=\frac{1}{L}\theta(i)\theta(j)\theta(j).

Thus

θ(i)3=L𝔼[(Gθ)3]i,i,j2𝔼[(Gθ)3]i,j,j\displaystyle\theta(i)^{3}=\dfrac{L\cdot\mathbb{E}[(G\theta)^{\otimes 3}]_{i,i,j}^{2}}{\mathbb{E}[(G\theta)^{\otimes 3}]_{i,j,j}}

can be obtained from the third moment tensor. The conclusion follows. ∎
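The entry-recovery step of this proof is simple to carry out numerically. The sketch below (our own illustration; recover_entries is a hypothetical helper) takes the third moment tensor together with an anchored representative of the support — which, by the discussion above, the beltway step recovers from 𝒟(θ) when s ≥ 7 — and reads off the entries, reusing moment_tensor from the earlier sketch.

import numpy as np

def recover_entries(T, support):
    # Given T = E[(G theta)^{tensor 3}] and an anchored representative of
    # Supp(theta), recover theta on that support via
    # theta(i)^3 = L * T[i,i,j]^2 / T[i,j,j].
    L = T.shape[0]
    support = list(support)
    theta_hat = np.zeros(L)
    for a in support:
        b = next(x for x in support if x != a)     # any other support index works
        theta_hat[a] = np.cbrt(L * T[a, a, b] ** 2 / T[a, b, b])
    return theta_hat

# Sanity check on the collision-free support {0, 1, 3, 7} in Z/15Z (the entry
# step itself needs no size restriction, unlike the support-recovery step):
theta = np.zeros(15)
theta[[0, 1, 3, 7]] = [1.0, -2.0, 1.5, 0.5]
assert np.allclose(recover_entries(moment_tensor(theta, 3), [0, 1, 3, 7]), theta)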

Lemma 2.3.3.

Let θ,ϕL\theta,\phi\in\mathbb{R}^{L}. We denote θ¯=1Li=1Lθ(i)\overline{\theta}=\frac{1}{L}\sum_{i=1}^{L}\theta(i) and ϕ¯=1Li=1Lϕ(i)\overline{\phi}=\frac{1}{L}\sum_{i=1}^{L}\phi(i). Let ϑ=θ𝔼[Gθ]\upvartheta=\theta-\mathbb{E}[G\theta] and φ=ϕ𝔼[Gϕ]\varphi=\phi-\mathbb{E}[G\phi]. Then

Δ3(θ,ϕ)=Δ3(ϑ,φ)+3Sym(𝟙(θ¯𝔼[(Gϑ)2]ϕ¯𝔼[(Gφ)2]))+(θ¯3ϕ¯3)𝟙3,\displaystyle\Delta_{3}(\theta,\phi)=\Delta_{3}(\upvartheta,\varphi)+3\text{Sym}\Big{(}\mathbbm{1}\otimes\big{(}\overline{\theta}\cdot\mathbb{E}[(G\upvartheta)^{\otimes 2}]-\overline{\phi}\cdot\mathbb{E}[(G\varphi)^{\otimes 2}]\big{)}\Big{)}+(\overline{\theta}^{3}-\overline{\phi}^{3})\cdot\mathbbm{1}^{\otimes 3},

where Sym is the symmetrization operator which acts on order-33 tensors by averaging over all permutations of the indices:

Sym(T)i,j,k:=16σS3Tσ(i),σ(j),σ(k).\displaystyle\text{Sym}(T)_{i,j,k}:=\frac{1}{6}\sum_{\sigma\in S_{3}}T_{\sigma(i),\sigma(j),\sigma(k)}.
Proof.

Let μθ=𝔼[Gθ]\mu_{\theta}=\mathbb{E}[G\theta] and μϕ=𝔼[Gϕ]\mu_{\phi}=\mathbb{E}[G\phi]. Observe that

𝔼[(Gθ)3]\displaystyle\mathbb{E}[(G\theta)^{\otimes 3}] =𝔼[(Gϑ+Gμθ)3]\displaystyle=\mathbb{E}[(G\upvartheta+G\mu_{\theta})^{\otimes 3}]
=𝔼[(Gϑ+μθ)3]\displaystyle=\mathbb{E}[(G\upvartheta+\mu_{\theta})^{\otimes 3}]
=𝔼[(Gϑ)3]+𝔼[GϑGϑμθ]+𝔼[GϑμθGϑ]+𝔼[μθGϑGϑ]\displaystyle=\mathbb{E}[(G\upvartheta)^{\otimes 3}]+\mathbb{E}[G\upvartheta\otimes G\upvartheta\otimes\mu_{\theta}]+\mathbb{E}[G\upvartheta\otimes\mu_{\theta}\otimes G\upvartheta]+\mathbb{E}[\mu_{\theta}\otimes G\upvartheta\otimes G\upvartheta]
+𝔼[Gϑμθμθ]+𝔼[μθGϑμθ]+𝔼[μθμθGϑ]+μθμθμθ\displaystyle+\mathbb{E}[G\upvartheta\otimes\mu_{\theta}\otimes\mu_{\theta}]+\mathbb{E}[\mu_{\theta}\otimes G\upvartheta\otimes\mu_{\theta}]+\mathbb{E}[\mu_{\theta}\otimes\mu_{\theta}\otimes G\upvartheta]+\mu_{\theta}\otimes\mu_{\theta}\otimes\mu_{\theta}
=𝔼[(Gϑ)3]+𝔼[GϑGϑμθ]+𝔼[GϑμθGϑ]+𝔼[μθGϑGϑ]\displaystyle=\mathbb{E}[(G\upvartheta)^{\otimes 3}]+\mathbb{E}[G\upvartheta\otimes G\upvartheta\otimes\mu_{\theta}]+\mathbb{E}[G\upvartheta\otimes\mu_{\theta}\otimes G\upvartheta]+\mathbb{E}[\mu_{\theta}\otimes G\upvartheta\otimes G\upvartheta]
+𝔼[Gϑ]μθμθ+μθ𝔼[Gϑ]μθ+μθμθ𝔼[Gϑ]+μθ3\displaystyle+\mathbb{E}[G\upvartheta]\otimes\mu_{\theta}\otimes\mu_{\theta}+\mu_{\theta}\otimes\mathbb{E}[G\upvartheta]\otimes\mu_{\theta}+\mu_{\theta}\otimes\mu_{\theta}\otimes\mathbb{E}[G\upvartheta]+\mu_{\theta}^{\otimes 3}
=𝔼[(Gϑ)3]+3Sym(μθ𝔼[(Gϑ)2])+μθ3.\displaystyle=\mathbb{E}[(G\upvartheta)^{\otimes 3}]+3\text{Sym}\big{(}\mu_{\theta}\otimes\mathbb{E}[(G\upvartheta)^{\otimes 2}]\big{)}+\mu_{\theta}^{\otimes 3}.

Here, the three terms involving 𝔼[Gϑ]\mathbb{E}[G\upvartheta] vanish since ϑ\upvartheta is centered, so that 𝔼[Gϑ]=0\mathbb{E}[G\upvartheta]=0. Applying the same expansion to ϕ\phi, noting that μθ=θ¯𝟙\mu_{\theta}=\overline{\theta}\cdot\mathbbm{1} and μϕ=ϕ¯𝟙\mu_{\phi}=\overline{\phi}\cdot\mathbbm{1}, and using the linearity of the symmetrization operator, we obtain

Δ3(θ,ϕ)=Δ3(ϑ,φ)+3Sym(𝟙(θ¯𝔼[(Gϑ)2]ϕ¯𝔼[(Gφ)2]))+(θ¯3ϕ¯3)𝟙3,\displaystyle\Delta_{3}(\theta,\phi)=\Delta_{3}(\upvartheta,\varphi)+3\text{Sym}\Big{(}\mathbbm{1}\otimes\big{(}\overline{\theta}\cdot\mathbb{E}[(G\upvartheta)^{\otimes 2}]-\overline{\phi}\cdot\mathbb{E}[(G\varphi)^{\otimes 2}]\big{)}\Big{)}+(\overline{\theta}^{3}-\overline{\phi}^{3})\cdot\mathbbm{1}^{\otimes 3},

as desired. ∎
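As a sanity check, the identity of Lemma 2.3.3 is easy to verify numerically on random signals. The sketch below is our own (sym3 is our name for the symmetrization operator) and reuses moment_tensor and moment_diff from the earlier sketch.

import itertools
import numpy as np

def sym3(T):
    # Average an order-3 tensor over all 6 permutations of its indices.
    return sum(np.transpose(T, p) for p in itertools.permutations(range(3))) / 6.0

rng = np.random.default_rng(0)
L = 8
theta, phi = rng.standard_normal(L), rng.standard_normal(L)
vt, vp = theta - theta.mean(), phi - phi.mean()      # the centered signals
one = np.ones(L)
ones3 = np.multiply.outer(np.multiply.outer(one, one), one)
lhs = moment_diff(theta, phi, 3)
rhs = (moment_diff(vt, vp, 3)
       + 3 * sym3(np.multiply.outer(one, theta.mean() * moment_tensor(vt, 2)
                                         - phi.mean() * moment_tensor(vp, 2)))
       + (theta.mean() ** 3 - phi.mean() ** 3) * ones3)
assert np.allclose(lhs, rhs)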

We are now ready to prove Proposition 2.3.1.

Proof of Proposition 2.3.1.

Since both the KL divergence and the orbit distance are invariant under the action of the group \mathcal{R}, we assume without loss of generality that ρ(θ,ϕ)=θϕ\rho(\theta,\phi)=\left\lVert\theta-\phi\right\rVert. By Proposition 2.2.3 , it suffices to prove the above statement for the case of ρ(θ,ϕ)ϵ0.\rho(\theta,\phi)\geq\epsilon_{0}. We will divide our argument into two cases.

Case 1. ϵ0ρ(θ,ϕ)3θ\epsilon_{0}\leq\rho(\theta,\phi)\leq 3\left\lVert\theta\right\rVert.

Observe that ρ(θ,ϕ)\rho(\theta,\phi) is bounded above as θ\left\lVert\theta\right\rVert is (recall that we require |θ(i)|M|\theta(i)|\leq M for all 1iL1\leq i\leq L). Thus it suffices to show that

DKL(PθPϕ)σ6ϵs\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq\sigma^{-6}\epsilon_{s}

for some small constant ϵs\epsilon_{s} depending only on ss. In what follows, we denote ϑ=θ𝔼[Gθ]\upvartheta=\theta-\mathbb{E}[G\theta] and φ=ϕ𝔼[Gϕ]\varphi=\phi-\mathbb{E}[G\phi]. Firstly, by Lemma 2.3.2, we have that Δ3(θ,ϕ)0\Delta_{3}(\theta,\phi)\neq 0. For each θ𝒯\theta\in\mathcal{T}, define

B_{\theta,\epsilon_{0}}:=\big\{\phi\in\mathbb{R}^{L}\ \text{collision-free}\ :\ \epsilon_{0}\leq\rho(\theta,\phi)\leq 3\left\lVert\theta\right\rVert\big\}.

Then set

δ0:=infθ𝒯infϕBθ,ϵ0Δ3(θ,ϕ).\displaystyle\delta_{0}:=\inf_{\theta\in\mathcal{T}}\inf_{\phi\in B_{\theta,\epsilon_{0}}}\left\lVert\Delta_{3}(\theta,\phi)\right\rVert.

Note that δ0\delta_{0} is a strictly positive constant since both Bθ,ϵ0B_{\theta,\epsilon_{0}} and 𝒯\mathcal{T} are compact sets. Furthermore, δ0\delta_{0} is independent of both θ\theta and ϕ\phi (but may depend on ss). Let δ>0\delta\in\mathbb{R}_{>0} be a small constant, whose value may decrease from line to line, satisfying

3Sym(𝟙(θ¯𝔼[(Gϑ)2]ϕ¯𝔼[(Gφ)2]))+(θ¯3ϕ¯3)𝟙3δ02\left\lVert 3\cdot\text{Sym}\Big{(}\mathbbm{1}\otimes\big{(}\overline{\theta}\cdot\mathbb{E}[(G\upvartheta)^{\otimes 2}]-\overline{\phi}\cdot\mathbb{E}[(G\varphi)^{\otimes 2}]\big{)}\Big{)}+(\overline{\theta}^{3}-\overline{\phi}^{3})\cdot\mathbbm{1}^{\otimes 3}\right\rVert\leq\frac{\delta_{0}}{2} (4)

for all θ,ϕL\theta,\phi\in\mathbb{R}^{L} such that |θ¯ϕ¯|δ|\overline{\theta}-\overline{\phi}|\leq\delta and Δ2(ϑ,φ)δ\left\lVert\Delta_{2}(\upvartheta,\varphi)\right\rVert\leq\delta. By Theorem 2.1.1,

DKL(PθPϕ)=12σ2Δ1(θ,ϕ)2+DKL(PϑPφ).\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})=\frac{1}{2\sigma^{2}}\left\lVert\Delta_{1}(\theta,\phi)\right\rVert^{2}+D_{\text{KL}}(P_{\upvartheta}\;\|\;P_{\varphi}).

Thus if |θ¯ϕ¯|>δ|\overline{\theta}-\overline{\phi}|>\delta, we have that

DKL(PθPϕ)12σ2Δ1(θ,ϕ)2δ2L2σ2δ2σ6.\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq\frac{1}{2\sigma^{2}}\left\lVert\Delta_{1}(\theta,\phi)\right\rVert^{2}\geq\frac{\delta^{2}L}{2\sigma^{2}}\geq\delta^{2}\sigma^{-6}.

Now suppose that |θ¯ϕ¯|δ|\overline{\theta}-\overline{\phi}|\leq\delta. If Δ2(ϑ,φ)>δ\left\lVert\Delta_{2}(\upvartheta,\varphi)\right\rVert>\delta, by (BRW, , Theorem 9), there exists a universal positive constant C>0C\in\mathbb{R}_{>0} such that

DKL(PθPϕ)DKL(PϑPφ)CΔ2(ϑ,φ)2σ4Cδ2σ4δ2σ6.\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq D_{\text{KL}}(P_{\vartheta}\;\|\;P_{\varphi})\geq C\frac{\left\lVert\Delta_{2}(\upvartheta,\varphi)\right\rVert^{2}}{\sigma^{4}}\geq\frac{C\delta^{2}}{\sigma^{4}}\geq\delta^{2}\sigma^{-6}.

Finally, suppose that we have both |θ¯ϕ¯|δ|\overline{\theta}-\overline{\phi}|\leq\delta and Δ2(ϑ,φ)δ\left\lVert\Delta_{2}(\upvartheta,\varphi)\right\rVert\leq\delta. Using (4) and Lemma 2.3.3, we obtain

Δ3(ϑ,φ)Δ3(θ,ϕ)δ02δ02.\displaystyle\left\lVert\Delta_{3}(\upvartheta,\varphi)\right\rVert\geq\left\lVert\Delta_{3}(\theta,\phi)\right\rVert-\frac{\delta_{0}}{2}\geq\frac{\delta_{0}}{2}.

Again by Theorem 2.1.2, we obtain

DKL(PθPϕ)DKL(PϑPφ)CΔ3(ϑ,φ)2σ6Cδ024σ6.\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq D_{\text{KL}}(P_{\vartheta}\;\|\;P_{\varphi})\geq C\frac{\left\lVert\Delta_{3}(\upvartheta,\varphi)\right\rVert^{2}}{\sigma^{6}}\geq\frac{C\delta^{2}_{0}}{4\sigma^{6}}.

The desired conclusion follows by choosing ϵs=min{Cδ024,δ2}.\epsilon_{s}=\min\{\frac{C\delta_{0}^{2}}{4},\delta^{2}\}.

Case 2. ρ(θ,ϕ)>3θ.\rho(\theta,\phi)>3\left\lVert\theta\right\rVert.

By Lemma A.1 , we in fact have a stronger bound DKL(PθPϕ)Cσ4ρ(θ,ϕ)2D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq C\sigma^{-4}\rho(\theta,\phi)^{2}. ∎

3 Uniform Rates of Convergence for the Maximum Likelihood Estimator

The goal of this section is to prove Theorem 1.4.3. Throughout this section, we fix a collision-free signal θ𝒯\theta\in\mathcal{T}, which plays the role of the unknown parameter that we are trying to estimate. Let s=|Supp(θ)|s=|\text{Supp}(\theta)| and define the subspace

Lθ:={ϕL:ϕ(i)=0 for all iSupp(θ)}.\displaystyle L_{\theta}:=\Big{\{}\phi\in\mathbb{R}^{L}\ :\ \phi(i)=0\text{ for all }i\not\in\text{Supp}(\theta)\Big{\}}.

In this section, all derivatives will be taken with respect to the subspace LθL_{\theta}. In other words, for a (twice continuously differentiable) function g:Lg:\mathbb{R}^{L}\to\mathbb{R} and a point pLp\in\mathbb{R}^{L}, we define the gradient and the Hessian to be

g(p)i\displaystyle\nabla g(p)_{i} :={gxi(p)if iSupp(θ),0otherwise;\displaystyle:=\begin{cases}\dfrac{\partial g}{\partial x_{i}}(p)&\text{if }i\in\text{Supp}(\theta),\\ 0&\text{otherwise};\end{cases}
(H_{g}(p))_{ij}:=\begin{cases}\dfrac{\partial^{2}g}{\partial x_{i}\partial x_{j}}(p)&\text{if }i,j\in\text{Supp}(\theta),\\ 0&\text{otherwise}.\end{cases}

Let D:L+D:\mathbb{R}^{L}\to\mathbb{R}_{+} denote the map

D(ϕ):=DKL(PθPϕ)=𝔼[logfθ(X)fϕ(X)].\displaystyle D(\phi):=D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})=\mathbb{E}\bigg{[}\log\frac{f_{\theta}(X)}{f_{\phi}(X)}\bigg{]}.

Let H=HDH=H_{D} denote the Hessian of the above map at the point θ\theta (on the subspace LθL_{\theta}). Since HH is positive semidefinite, it induces a seminorm H:L+\lVert\cdot\rVert_{H}:\mathbb{R}^{L}\to\mathbb{R}_{+} via

ϕH:=ϕTHϕ.\displaystyle\left\lVert\phi\right\rVert_{H}:=\sqrt{\phi^{T}H\phi}.

For i.i.d samples X1,,XnX_{1},\cdots,X_{n} drawn according to PθP_{\theta}, the restricted MLE θ~n\tilde{\theta}_{n} is a minimizer of

Dn(ϕ):=1ni=1nlogfθfϕ(Xi)\displaystyle D_{n}(\phi):=\frac{1}{n}\sum_{i=1}^{n}\log\frac{f_{\theta}}{f_{\phi}}(X_{i})

on the set 𝒯.\mathcal{T}.

Our proof builds upon the approach in (BRW, , Theorem 4), and will invoke the following results.

Lemma 3.0.1.

(BRW, , Lemma B.6) Let ϵ>0\epsilon\in\mathbb{R}_{>0} and let ϵ={ϕL:ρ(ϕ,θ)ϵ}\mathcal{B}_{\epsilon}=\{\phi\in\mathbb{R}^{L}\ :\ \rho(\phi,\theta)\leq\epsilon\}. Let HDnH_{D_{n}} denote the Hessian of DnD_{n}. There exists a constant CC such that

𝔼[supϕϵHD(ϕ)HDn(ϕ)op2]Clognnσ4.\displaystyle\mathbb{E}\bigg{[}\sup_{\phi\in\mathcal{B}_{\epsilon}}\left\lVert H_{D}(\phi)-H_{D_{n}}(\phi)\right\rVert^{2}_{\text{op}}\bigg{]}\leq C\frac{\log n}{n\sigma^{4}}.
Lemma 3.0.2.

(BRW, , Lemma B.7) Let \mathcal{L} be any vector subspace of L\mathbb{R}^{L}. Suppose that there exists a positive integer kk and positive constants cc and CC such that for all θ,ϕ\theta,\phi\in\mathcal{L} satisfying c1θcc^{-1}\leq\left\lVert\theta\right\rVert\leq c and θσ\left\lVert\theta\right\rVert\leq\sigma, we have that

DKL(PθPϕ)Cσ2kρ(θ,ϕ)2.\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq C\sigma^{-2k}\rho(\theta,\phi)^{2}.

Then the restricted MLE θ~n\tilde{\theta}_{n} satisfies

𝔼[ρ(θ~n,θ)2]C~σ4k2n\displaystyle\mathbb{E}[\rho(\tilde{\theta}_{n},\theta)^{2}]\leq\tilde{C}\frac{\sigma^{4k-2}}{n}

for some positive constant C~\tilde{C}.

Lemma 3.0.3.

(BRW, , Lemma B.15) There exists a positive constant CC depending only on the dimension LL such that for any θ,ϕL\theta,\phi\in\mathbb{R}^{L} with θ1\left\lVert\theta\right\rVert\leq 1,

\bigg{|}D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})-\frac{1}{2}\left\lVert\theta-\phi\right\rVert^{2}_{H}\bigg{|}\leq C\frac{\left\lVert\phi-\theta\right\rVert^{3}}{\sigma^{3}}.
Proof of Theorem 1.4.3.

The key idea behind the proof is that for n+n\to+\infty, the dominant term in (3) is the first term, which is controlled by the local nature of the signals. Since collision free signals are very well-behaved locally (Proposition 2.2.3), this allows us to obtain sharper uniform bounds on the rate of convergence of the restricted MLE.

We assume without loss of generality that ρ(θ~n,θ)=θ~nθ.\rho(\tilde{\theta}_{n},\theta)=\lVert\tilde{\theta}_{n}-\theta\rVert. Furthermore, since θ\left\lVert\theta\right\rVert is bounded, we also assume θ=1\left\lVert\theta\right\rVert=1 and σ1\sigma\geq 1 by replacing θ\theta and σ\sigma by θ/θ\theta/\left\lVert\theta\right\rVert and σ/θ\sigma/\left\lVert\theta\right\rVert respectively.

Let δ\delta be a fixed positive constant and let AA be a positive constant whose value may change from line to line. Define the event :={ρ(θ~n,θ)ϵ}\mathcal{E}:=\{\rho(\tilde{\theta}_{n},\theta)\leq\epsilon\}, where ϵ\epsilon is another positive constant whose exact value will be determined later. We decompose

𝔼θ[ρ(θ~n,θ)]=𝔼θ[ρ(θ~n,θ)𝟙]+𝔼θ[ρ(θ~n,θ)𝟙c]\mathbb{E}_{\theta}\big{[}\rho(\tilde{\theta}_{n},\theta)\big{]}=\mathbb{E}_{\theta}\big{[}\rho(\tilde{\theta}_{n},\theta)\mathbbm{1}_{\mathcal{E}}\big{]}+\mathbb{E}_{\theta}\big{[}\rho(\tilde{\theta}_{n},\theta)\mathbbm{1}_{\mathcal{E}^{c}}\big{]} (5)

and bound each term separately. Note that the first term entails the local behaviour of the restricted MLE while the second term entails the global behaviour of the restricted MLE.

Local bound: For the first term, let B1B_{1} and B2B_{2} denote the constants as in Lemma 3.0.3 and Proposition 2.2.3 respectively. We first choose δ\delta sufficiently small such that

δB1B212.\displaystyle\frac{\delta B_{1}}{B_{2}}\leq\frac{1}{2}.

Let σ>0\sigma\in\mathbb{R}_{>0} be sufficiently large such that sδσ1<ϵ0s\delta\sigma^{-1}<\epsilon_{0} and choose ϵ=sδσ1\epsilon=s\delta\sigma^{-1}. By Lemma 3.0.3 and Proposition 2.2.3, we have that

|D(θ~n)12θ~nθH2|\displaystyle\Big{|}D(\tilde{\theta}_{n})-\frac{1}{2}\lVert\tilde{\theta}_{n}-\theta\rVert^{2}_{H}\Big{|} B1θ~nθ3σ3B1sδθ~nθ2σ412D(θ~n).\displaystyle\leq B_{1}\frac{\lVert\tilde{\theta}_{n}-\theta\rVert^{3}}{\sigma^{3}}\leq\frac{B_{1}s\delta\lVert\tilde{\theta}_{n}-\theta\rVert^{2}}{\sigma^{4}}\leq\frac{1}{2}D(\tilde{\theta}_{n}).

Hence

13θ~nθH2D(θ~n)θ~nθH2.\displaystyle\frac{1}{3}\lVert\tilde{\theta}_{n}-\theta\rVert_{H}^{2}\leq D(\tilde{\theta}_{n})\leq\lVert\tilde{\theta}_{n}-\theta\rVert^{2}_{H}.

Together with Proposition 2.2.3, we get

θ~nθH2Aσ4sθ~nθ2.\displaystyle\lVert\tilde{\theta}_{n}-\theta\rVert^{2}_{H}\geq A\sigma^{-4}s\lVert\tilde{\theta}_{n}-\theta\rVert^{2}. (6)

The function DnD_{n} satisfies Dn(θ)=0D_{n}(\theta)=0 and attains its minimum value on 𝒯\mathcal{T} at θ~n\tilde{\theta}_{n}. As such, we must have Dn(θ~n)0.D_{n}(\tilde{\theta}_{n})\leq 0. Furthermore, since θ~nθ<ϵ0\lVert\tilde{\theta}_{n}-\theta\rVert<\epsilon_{0}, we also have that Supp(θ~n)=Supp(θ)\text{Supp}(\tilde{\theta}_{n})=\text{Supp}(\theta). Expanding DDnD-D_{n} as a Taylor series about θ\theta, and using D(θ)=Dn(θ)=0D(\theta)=D_{n}(\theta)=0 together with D(θ)=0\nabla D(\theta)=0, we obtain

\frac{1}{3}\lVert\tilde{\theta}_{n}-\theta\rVert^{2}_{H}\leq D(\tilde{\theta}_{n})\leq D(\tilde{\theta}_{n})-D_{n}(\tilde{\theta}_{n})=-\nabla D_{n}(\theta)^{T}(\tilde{\theta}_{n}-\theta)+\frac{1}{2}(\tilde{\theta}_{n}-\theta)^{T}\big{(}H_{D}(\eta)-H_{D_{n}}(\eta)\big{)}(\tilde{\theta}_{n}-\theta),

where \eta\in\mathbb{R}^{L} is a vector on the line segment between θ\theta and θ~n\tilde{\theta}_{n}. We employ the dual norm H\left\lVert\cdot\right\rVert_{H}^{*} to bound the first term and the operator norm op\left\lVert\cdot\right\rVert_{\text{op}} to bound the second term. This gives

13θ~nθH2Dn(θ)Hθ~nθH+12θ~nθ2supϕϵHD(ϕ)HDn(ϕ)op\displaystyle\frac{1}{3}\lVert\tilde{\theta}_{n}-\theta\rVert^{2}_{H}\leq\left\lVert\nabla D_{n}(\theta)\right\rVert^{*}_{H}\lVert\tilde{\theta}_{n}-\theta\rVert_{H}+\dfrac{1}{2}\lVert\tilde{\theta}_{n}-\theta\rVert^{2}\sup_{\phi\in\mathcal{B}_{\epsilon}}\left\lVert H_{D}(\phi)-H_{D_{n}}(\phi)\right\rVert_{\text{op}}

where \mathcal{B}_{\epsilon}:=\big\{\phi\in\mathbb{R}^{L}\ :\ \rho(\phi,\theta)\leq\epsilon\big\}. Using (6) and dividing both sides by θ~nθH\lVert\tilde{\theta}_{n}-\theta\rVert_{H},

σ2sθ~nθ\displaystyle\sigma^{-2}\sqrt{s}\lVert\tilde{\theta}_{n}-\theta\rVert A(Dn(θ)H+σ2sθ~nθsupϕϵHD(ϕ)HDn(ϕ)op).\displaystyle\leq A\Big{(}\lVert\nabla D_{n}(\theta)\rVert^{*}_{H}+\frac{\sigma^{2}}{\sqrt{s}}\lVert\tilde{\theta}_{n}-\theta\rVert\sup_{\phi\in\mathcal{B}_{\epsilon}}\left\lVert H_{D}(\phi)-H_{D_{n}}(\phi)\right\rVert_{\text{op}}\Big{)}.

Multiplying both sides by σ2/s\sigma^{2}/\sqrt{s} and applying Young’s inequality to the second term on the right-hand side yields

θ~nθ\displaystyle\lVert\tilde{\theta}_{n}-\theta\rVert A(σ2sDn(θ)H+σ7ssupϕϵHD(ϕ)HDn(ϕ)op2+σsθ~nθ2).\displaystyle\leq A\bigg{(}\frac{\sigma^{2}}{\sqrt{s}}\left\lVert\nabla D_{n}(\theta)\right\rVert^{*}_{H}+\frac{\sigma^{7}}{s}\sup_{\phi\in\mathcal{B}_{\epsilon}}\left\lVert H_{D}(\phi)-H_{D_{n}}(\phi)\right\rVert_{\text{op}}^{2}+\frac{\sigma}{s}\lVert\tilde{\theta}_{n}-\theta\rVert^{2}\bigg{)}.

Taking expectation on both sides,

𝔼θ[ρ(θ~n,θ)𝟙]Aσ2s𝔼θ[Dn(θ)H]+Aσ7s𝔼θ[supϕϵHD(ϕ)HDn(ϕ)op2]+Aσs𝔼θ[ρ(θ~n,θ)2].\displaystyle\begin{split}\mathbb{E}_{\theta}\big{[}\rho(\tilde{\theta}_{n},\theta)\mathbbm{1}_{\mathcal{E}}\big{]}&\leq\frac{A\sigma^{2}}{\sqrt{s}}\cdot\mathbb{E}_{\theta}\big{[}\left\lVert\nabla D_{n}(\theta)\right\rVert^{*}_{H}\big{]}+\frac{A\sigma^{7}}{s}\cdot\mathbb{E}_{\theta}\bigg{[}\sup_{\phi\in\mathcal{B}_{\epsilon}}\left\lVert H_{D}(\phi)-H_{D_{n}}(\phi)\right\rVert_{\text{op}}^{2}\bigg{]}\\ &+\frac{A\sigma}{s}\cdot\mathbb{E}_{\theta}\big{[}\rho(\tilde{\theta}_{n},\theta)^{2}\big{]}.\end{split} (7)

This concludes the local analysis of the restricted MLE.

Global bound: We first address the second term in (5). Since ρ(θ~n,θ)sδσ1\rho(\tilde{\theta}_{n},\theta)\geq s\delta\sigma^{-1} on c\mathcal{E}^{c}, we obtain

𝔼θ[ρ(θ~n,θ)𝟙c]Aσs𝔼θ[ρ(θ~n,θ)2]\displaystyle\mathbb{E}_{\theta}\big{[}\rho(\tilde{\theta}_{n},\theta)\mathbbm{1}_{\mathcal{E}^{c}}\big{]}\leq\frac{A\sigma}{s}\cdot\mathbb{E}_{\theta}\big{[}\rho(\tilde{\theta}_{n},\theta)^{2}\big{]}

which can be combined with the third term in (7). This gives

𝔼θ[ρ(θ~n,θ)]Aσ2s𝔼θ[Dn(θ)H]+Aσ7s𝔼θ[supϕϵHD(ϕ)HDn(ϕ)op2]+Aσs𝔼θ[ρ(θ~n,θ)2].\displaystyle\begin{split}\mathbb{E}_{\theta}\big{[}\rho(\tilde{\theta}_{n},\theta)\big{]}&\leq\frac{A\sigma^{2}}{\sqrt{s}}\cdot\mathbb{E}_{\theta}\big{[}\left\lVert\nabla D_{n}(\theta)\right\rVert^{*}_{H}\big{]}+\frac{A\sigma^{7}}{s}\cdot\mathbb{E}_{\theta}\bigg{[}\sup_{\phi\in\mathcal{B}_{\epsilon}}\left\lVert H_{D}(\phi)-H_{D_{n}}(\phi)\right\rVert_{\text{op}}^{2}\bigg{]}\\ &+\frac{A\sigma}{s}\cdot\mathbb{E}_{\theta}\big{[}\rho(\tilde{\theta}_{n},\theta)^{2}\big{]}.\end{split} (8)

We will establish an upper bound for each of the three terms in the above equation separately.

For the first term, Bartlett’s identities state that for each 1in1\leq i\leq n,

𝔼θ[logfϕ(Xi)|ϕ=θ]\displaystyle\mathbb{E}_{\theta}\big{[}\nabla\log f_{\phi}(X_{i})|_{\phi=\theta}\big{]} =0\displaystyle=0
𝔼θ[(logfϕ(Xi)|ϕ=θ)(logfϕ(Xi)|ϕ=θ)T]\displaystyle\mathbb{E}_{\theta}\big{[}(\nabla\log f_{\phi}(X_{i})|_{\phi=\theta})(\nabla\log f_{\phi}(X_{i})|_{\phi=\theta})^{T}\big{]} =HD(θ)=H.\displaystyle=H_{D}(\theta)=H.

Since Dn(θ)=1ni=1nlogfϕ(Xi)|ϕ=θ\nabla D_{n}(\theta)=\displaystyle-\frac{1}{n}\sum_{i=1}^{n}\nabla\log f_{\phi}(X_{i})\big{|}_{\phi=\theta} and the XiX_{i}’s are independent, we get

𝔼θ[Dn(θ)Dn(θ)T]=1nH.\displaystyle\mathbb{E}_{\theta}\big{[}\nabla D_{n}(\theta)\nabla D_{n}(\theta)^{T}\big{]}=\frac{1}{n}H.

In particular, the nullspace of HH is contained in the nullspace of Dn(θ)Dn(θ)T\nabla D_{n}(\theta)\nabla D_{n}(\theta)^{T} almost surely since any vector vv lying in the nullspace of HH satisfies

0𝔼θ[vTDn(θ)Dn(θ)Tv]=1nvTHv=0.\displaystyle 0\leq\mathbb{E}_{\theta}\big{[}v^{T}\nabla D_{n}(\theta)\nabla D_{n}(\theta)^{T}v\big{]}=\frac{1}{n}v^{T}Hv=0.

Since the row space of a symmetric matrix is the orthogonal complement of its nullspace, this means that the row space of Dn(θ)Dn(θ)T\nabla D_{n}(\theta)\nabla D_{n}(\theta)^{T} (and hence the row space of Dn(θ))\nabla D_{n}(\theta)) is contained in the row space of HH almost surely. As a result,

𝔼θ[Dn(θ)H]\displaystyle\mathbb{E}_{\theta}\big{[}\left\lVert\nabla D_{n}(\theta)\right\rVert^{*}_{H}\big{]} =𝔼θ[Dn(θ)THDn(θ)]\displaystyle=\mathbb{E}_{\theta}\Big{[}\sqrt{\nabla D_{n}(\theta)^{T}H^{\dagger}\nabla D_{n}(\theta)}\Big{]}
𝔼θ[Tr(Dn(θ)THDn(θ))]\displaystyle\leq\sqrt{\mathbb{E}_{\theta}\big{[}\text{Tr}\big{(}\nabla D_{n}(\theta)^{T}H^{\dagger}\nabla D_{n}(\theta)\big{)}\big{]}}
=1nTr(HH)sn,\displaystyle=\sqrt{\frac{1}{n}\text{Tr}(H^{\dagger}H)}\leq\sqrt{\frac{s}{n}}, (9)

where HH^{\dagger} denotes the Moore-Penrose inverse of HH. For the second term, we invoke Lemma 3.0.1 to get

𝔼θ[supϕϵHD(ϕ)HDn(ϕ)op2]Alognnσ4.\mathbb{E}_{\theta}\bigg{[}\sup_{\phi\in\mathcal{B}_{\epsilon}}\left\lVert H_{D}(\phi)-H_{D_{n}}(\phi)\right\rVert_{\text{op}}^{2}\bigg{]}\leq A\frac{\log n}{n\sigma^{4}}. (10)

Finally, for the third term, we invoke Lemma 3.0.2 with the weaker global bound in Proposition 2.3.1 (i.e. k=3k=3). We obtain

𝔼θ[ρ(θ~n,θ)2]Asσ10n\mathbb{E}_{\theta}\big{[}\rho(\tilde{\theta}_{n},\theta)^{2}\big{]}\leq A_{s}\frac{\sigma^{10}}{n} (11)

for some positive constant AsA_{s} depending on ss. Putting (9),  (10) and (11) back into (8), we have

𝔼[ρ(θ~n,θ)]Aσ2n+Aσ3lognsn+Asσ11sn.\displaystyle\mathbb{E}\big{[}\rho(\tilde{\theta}_{n},\theta)\big{]}\leq A\frac{\sigma^{2}}{\sqrt{n}}+A\frac{\sigma^{3}\log n}{sn}+A_{s}\frac{\sigma^{11}}{sn}.

The conclusion follows. ∎

4 Lower Bound for Collision Free Signals

In this section, we prove Theorem 1.4.4.

Proof.

For convenience of notation, we assume that ss is even. The proof for the case in which ss is odd is conceptually similar but notationally more complicated. By Le Cam’s two-point argument, for any ϕ,ψ𝒯s\phi,\psi\in\mathcal{T}_{s}, we have that

infθ^nsupθ𝒯s𝔼θ[ρ(θ^n,θ)]ρ(ϕ,ψ)(22nDKL(PϕPψ)8)\displaystyle\inf_{\hat{\theta}_{n}}\sup_{\theta\in\mathcal{T}_{s}}\mathbb{E}_{\theta}\big{[}\rho(\hat{\theta}_{n},\theta)\big{]}\geq\rho(\phi,\psi)\cdot\left(\frac{2-\sqrt{2nD_{\text{KL}}(P_{\phi}\;\|\;P_{\psi})}}{8}\right)

where the infimum is taken over all estimators θ^n\hat{\theta}_{n} on nn samples. Thus it suffices to construct two signals θn,ϕ𝒯s\theta_{n},\phi\in\mathcal{T}_{s} such that

ρ(ϕ,θn)Mmin{σ2n, 1}\displaystyle\rho(\phi,\theta_{n})\geq M\min\bigg{\{}\frac{\sigma^{2}}{\sqrt{n}},\ 1\bigg{\}}

for some universal constant M>0M\in\mathbb{R}_{>0} but DKL(PϕnPθnn)=nDKL(PϕPθn)1/2D_{\text{KL}}(P_{\phi}^{n}\;\|\;P^{n}_{\theta_{n}})=nD_{\text{KL}}(P_{\phi}\;\|\;P_{\theta_{n}})\leq 1/2. Let {i1,i2,,is}{1,,L}\{i_{1},i_{2},\cdots,i_{s}\}\subseteq\{1,\cdots,L\} be such that each element in the multiset

{ijik: 1j,ks with jk}\displaystyle\big{\{}i_{j}-i_{k}\ :\ 1\leq j,k\leq s\text{ with }j\neq k\big{\}}

appears with multiplicity exactly 11. Define

ϕj\displaystyle\phi_{j} :={(1)k1s if j=ik for 1ks0 otherwise\displaystyle:=\begin{cases}(-1)^{k}\dfrac{1}{\sqrt{s}}&\text{ if }j=i_{k}\text{ for }1\leq k\leq s\\ 0&\text{ otherwise}\end{cases}
(θn)j\displaystyle(\theta_{n})_{j} :={(1)k1s if j=ik for 1ks21sδ if j=ik for k=s11s+δ if j=ik for k=s0 otherwise\displaystyle:=\begin{cases}(-1)^{k}\dfrac{1}{\sqrt{s}}&\text{ if }j=i_{k}\text{ for }1\leq k\leq s-2\\[8.53581pt] -\dfrac{1}{\sqrt{s}}-\delta&\text{ if }j=i_{k}\text{ for }k=s-1\\[8.53581pt] \dfrac{1}{\sqrt{s}}+\delta&\text{ if }j=i_{k}\text{ for }k=s\\[8.53581pt] 0&\text{ otherwise}\end{cases}

where δ=cmin{σ2n,1}\displaystyle\delta=c\min\bigg{\{}\frac{\sigma^{2}}{\sqrt{n}},1\bigg{\}} for some universal constant cc to be specified.
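A minimal sketch of this two-point construction (our own illustration; lecam_pair is a hypothetical name) may help in parsing the definitions above.

import numpy as np

def lecam_pair(support, L, delta):
    # The pair (phi, theta_n) above: support = [i_1 < ... < i_s] is a
    # collision-free support with s even; phi alternates +-1/sqrt(s), and
    # theta_n perturbs the last two support entries by -delta and +delta.
    s = len(support)
    phi = np.zeros(L)
    for k, idx in enumerate(support, start=1):
        phi[idx] = (-1) ** k / np.sqrt(s)
    theta_n = phi.copy()
    theta_n[support[-2]] -= delta    # entry k = s - 1 becomes -1/sqrt(s) - delta
    theta_n[support[-1]] += delta    # entry k = s     becomes +1/sqrt(s) + delta
    return phi, theta_n

# Both signals have zero mean (so E[G phi] = E[G theta_n] = 0), ||phi|| = 1, and
# the two signals differ by delta in exactly two coordinates.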

Note that 𝔼[Gϕ]=𝔼[Gθn]=0\mathbb{E}[G\phi]=\mathbb{E}[G\theta_{n}]=0 and ρ(ϕ,θn)=ϕθn=\rho(\phi,\theta_{n})=\lVert\phi-\theta_{n}\rVert=\sqrt{2}\,\delta\leq 1/3 for c<1/6c<1/6 (for such small δ\delta, the minimum over shifts is attained at the trivial shift, since the support of ϕ\phi overlaps the support of any nontrivial rotation of θn\theta_{n} in at most one point, by collision-freeness). Since both ϕ\phi and θn\theta_{n} are mean zero signals, with ϕ=1\left\lVert\phi\right\rVert=1, we can invoke Theorem 2.1.2 (with k=2k=2) to obtain

D_{\text{KL}}(P_{\phi}\;\|\;P_{\theta_{n}})\leq\overline{C}\frac{\rho(\phi,\theta_{n})^{2}}{\sigma^{4}}\leq\frac{1}{2n}

by choosing c<12C¯c<\dfrac{1}{2\overline{C}}. ∎

5 Concentration of the Maximum Likelihood Estimator

In this section, we fix a collision-free signal θ𝒯\theta\in\mathcal{T} that plays the role of the true signal which we wish to estimate. The goal of this section is to prove Theorem 1.4.2.

Recall that for i.i.d samples X1,,XnX_{1},\cdots,X_{n} drawn according to PθP_{\theta}, the restricted MLE θ~n\tilde{\theta}_{n} is the minimizer of the negative log-likelihood

R_{n}(\phi):=-\frac{1}{n}\sum_{i=1}^{n}\log f_{\phi}(X_{i})=\frac{L}{2}\log 2\pi\sigma^{2}+\frac{1}{n}\sum_{i=1}^{n}\bigg{(}\frac{\left\lVert G_{i}\theta+\sigma\xi_{i}\right\rVert^{2}+\left\lVert\phi\right\rVert^{2}}{2\sigma^{2}}-\log\mathbb{E}_{G}\bigg{[}\exp\bigg{(}\frac{(G_{i}\theta+\sigma\xi_{i})^{T}G\phi}{\sigma^{2}}\bigg{)}\bigg{]}\bigg{)}

on 𝒯\mathcal{T}. Since Gθ+σξG\theta+\sigma\xi is equal in distribution to G(θ+σξ)G(\theta+\sigma\xi), we obtain

R_{n}(\phi)\stackrel{d}{=}\frac{L}{2}\log 2\pi\sigma^{2}+\frac{1}{n}\sum_{i=1}^{n}\bigg{(}\frac{\left\lVert\theta+\sigma\xi_{i}\right\rVert^{2}+\left\lVert\phi\right\rVert^{2}}{2\sigma^{2}}-\log\mathbb{E}_{G}\bigg{[}\exp\bigg{(}\frac{(\theta+\sigma\xi_{i})^{T}G\phi}{\sigma^{2}}\bigg{)}\bigg{]}\bigg{)}. (12)

In particular, Rn(ϕ)R_{n}(\phi) does not depend on the randomness of the group action. Thus we let R(ϕ)=𝔼[Rn(ϕ)]R(\phi)=\mathbb{E}[R_{n}(\phi)] denote its mean with respect to the Gaussian noise ξ1,,ξn\xi_{1},\cdots,\xi_{n}.

Lemma 5.0.1.

Let K,δ>0K,\delta\in\mathbb{R}_{>0} be fixed constants. There exists a δ\delta-net NδN_{\delta} (with respect to the Euclidean norm on L\mathbb{R}^{L}) for the subset

S:={ϕL:|supp(ϕ)|s and ϕK}\displaystyle S:=\Big{\{}\phi\in\mathbb{R}^{L}\ :\ |\text{supp}(\phi)|\leq s\text{ and }\left\lVert\phi\right\rVert\leq K\Big{\}}

satisfying |Nδ|Ls(3Kδ)s|N_{\delta}|\leq L^{s}\bigg{(}\dfrac{3K}{\delta}\bigg{)}^{s}.

Proof.

For each tuple of indices (i1,,is)(i_{1},\cdots,i_{s}) with 1i1<i2<<isL1\leq i_{1}<i_{2}<\cdots<i_{s}\leq L, define the set

Si1,,is:={ϕL:supp(ϕ){i1,,is} and ϕK}.\displaystyle S_{i_{1},\cdots,i_{s}}:=\Big{\{}\phi\in\mathbb{R}^{L}\ :\ \text{supp}(\phi)\subseteq\{i_{1},\cdots,i_{s}\}\text{ and }\left\lVert\phi\right\rVert\leq K\Big{\}}.

By viewing Si1,,isS_{i_{1},\cdots,i_{s}} as a subset of the ss-dimensional linear subspace

Li1,,is={ϕL:supp(ϕ){i1,,is}},\displaystyle L_{i_{1},\cdots,i_{s}}=\Big{\{}\phi\in\mathbb{R}^{L}\ :\ \text{supp}(\phi)\subseteq\{i_{1},\cdots,i_{s}\}\Big{\}},

we see that there exists a δ\delta-net Nδ(i1,,is)N^{(i_{1},\cdots,i_{s})}_{\delta} for Si1,,isS_{i_{1},\cdots,i_{s}} satisfying |Nδ(i1,,is)|(3K/δ)s.|N^{(i_{1},\cdots,i_{s})}_{\delta}|\leq(3K/\delta)^{s}. Define NδN_{\delta} to be the union of the Nδ(i1,,is)N_{\delta}^{(i_{1},\cdots,i_{s})}’s. Since there are \binom{L}{s} possible choices for i1,,isi_{1},\cdots,i_{s}, we have that

|Nδ|(Ls)(3Kδ)sLs(3Kδ)s.\displaystyle|N_{\delta}|\leq\begin{pmatrix}L\\ s\end{pmatrix}\bigg{(}\frac{3K}{\delta}\bigg{)}^{s}\leq L^{s}\bigg{(}\frac{3K}{\delta}\bigg{)}^{s}.
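A brute-force rendering of this construction is given below (our own illustrative sketch; the enumeration is exponential in s and is in no way meant to be efficient).

import itertools
import numpy as np

def sparse_delta_net(L, s, K, delta):
    # Build a delta-net for {phi : |supp(phi)| <= s, ||phi|| <= K} as in the
    # proof: a grid-based net on each s-dimensional coordinate subspace, then
    # take the union over all (L choose s) supports.
    step = delta / np.sqrt(s)        # a step-grid per coordinate is a delta-net in l2
    grid = np.arange(-K, K + step / 2, step)
    net = []
    for supp in itertools.combinations(range(L), s):
        for vals in itertools.product(grid, repeat=s):
            v = np.zeros(L)
            v[list(supp)] = vals
            if np.linalg.norm(v) <= K + delta:   # small margin keeps nearby grid points
                net.append(v)
    return net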

Lemma 5.0.2.

There exists a universal constant c>0c\in\mathbb{R}_{>0} and a constant Cs>0C_{s}\in\mathbb{R}_{>0} depending on ss such that for any K,t>0K,t\in\mathbb{R}_{>0},

P(supϕ𝒯:ϕK|\displaystyle P\bigg{(}\sup_{\phi\in\mathcal{T}:\left\lVert\phi\right\rVert\leq K}| Rn(ϕ)R(ϕ)|>4t)\displaystyle R_{n}(\phi)-R(\phi)|>4t\bigg{)}
CsσsKs(ts+ts/2)exp(cnσ2t2K2)+Csexp(cnmin{t,t2}).\displaystyle\leq C_{s}\sigma^{-s}K^{s}(t^{-s}+t^{-s/2})\exp\bigg{(}-\frac{cn\sigma^{2}t^{2}}{K^{2}}\bigg{)}+C_{s}\exp\big{(}-cn\min\{t,t^{2}\}\big{)}.
Proof.

In what follows, let cc and CsC_{s} be constants whose value may change from line to line. Using (12), for any \phi\in\mathbb{R}^{L}, we may write

Rn(ϕ)=III(ϕ)+const(ϕ)\displaystyle R_{n}(\phi)=\textup{I}-\textup{II}(\phi)+\text{const}(\phi)

where

I =1ni=1n12σ2θ+σξi2\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{2\sigma^{2}}\lVert\theta+\sigma\xi_{i}\rVert^{2}
II(ϕ)\displaystyle\textup{II}(\phi) =1ni=1nf(ξi,ϕ):=1ni=1nlog𝔼G[exp((θ+σξi)TGϕσ2)]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}f(\xi_{i},\phi):=\frac{1}{n}\sum_{i=1}^{n}\log\mathbb{E}_{G}\bigg{[}\exp\bigg{(}\frac{(\theta+\sigma\xi_{i})^{T}G\phi}{\sigma^{2}}\bigg{)}\bigg{]}

and const(ϕ)\text{const}(\phi) is a term not depending on the randomness of the ξi\xi_{i}’s. We analyse the concentration of I and II(ϕ)\textup{II}(\phi) separately. Define the event ={|I𝔼[I]|<2t}\mathcal{E}=\{|\textup{I}-\mathbb{E}[\textup{I}]|<2t\}. First observe that for each 1in1\leq i\leq n,

12σ2θ+σξi2=12σ2(θ2+2σθTξi+σ2ξi2).\displaystyle\frac{1}{2\sigma^{2}}\lVert\theta+\sigma\xi_{i}\rVert^{2}=\frac{1}{2\sigma^{2}}\big{(}\left\lVert\theta\right\rVert^{2}+2\sigma\theta^{T}\xi_{i}+\sigma^{2}\lVert\xi_{i}\rVert^{2}\big{)}.

The first term is deterministic while the distributions of the second and third terms are given by

1σni=1nθTξi𝒩(0,1σ2nθ2) and i=1nξi2χnL2.\displaystyle\frac{1}{\sigma n}\sum_{i=1}^{n}\theta^{T}\xi_{i}\sim\mathcal{N}\bigg{(}0,\frac{1}{\sigma^{2}n}\left\lVert\theta\right\rVert^{2}\bigg{)}\ \ \ \ \ \ \ \text{ and }\ \ \ \ \ \ \ \sum_{i=1}^{n}\lVert\xi_{i}\rVert^{2}\sim\chi_{nL}^{2}.

By Gaussian tail bounds, we have that

P(|1σni=1nθTξi|t)2exp(nσ2t22θ2).\displaystyle P\Bigg{(}\bigg{|}\frac{1}{\sigma n}\sum_{i=1}^{n}\theta^{T}\xi_{i}\bigg{|}\geq t\Bigg{)}\leq 2\exp\bigg{(}-\frac{n\sigma^{2}t^{2}}{2\lVert\theta\rVert^{2}}\bigg{)}.

For the chi-square random variable, we will invoke the following well-known chi-square tail bound.

Theorem 5.0.3.

(Chi-square, , Lemma 1). Let Xχk2X\sim\chi^{2}_{k} be a chi-squared distributed random variable with kk degrees of freedom. Then for any xx\in\mathbb{R},

P(Xk2kx+2x)\displaystyle P(X-k\geq 2\sqrt{kx}+2x) exp(x)\displaystyle\leq\exp(-x)
P(kX2kx)\displaystyle P(k-X\geq 2\sqrt{kx}) exp(x).\displaystyle\leq\exp(-x).
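This bound is easy to sanity-check by simulation; the snippet below (our own, not part of the argument) compares the empirical upper tail against exp(-x).

import numpy as np

def laurent_massart_check(k=50, x=5.0, trials=400_000, seed=0):
    # Compare the empirical tail P(X - k >= 2 sqrt(k x) + 2 x) for X ~ chi^2_k
    # against the bound exp(-x) from the theorem above.
    rng = np.random.default_rng(seed)
    X = rng.chisquare(k, size=trials)
    empirical = np.mean(X - k >= 2 * np.sqrt(k * x) + 2 * x)
    return empirical, np.exp(-x)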

Observe that

P(12ni=1nξi2L2t)\displaystyle P\bigg{(}\frac{1}{2n}\sum_{i=1}^{n}\lVert\xi_{i}\rVert^{2}-\frac{L}{2}\geq t\bigg{)} =P(i=1nξi2Ln2nt).\displaystyle=P\bigg{(}\sum_{i=1}^{n}\lVert\xi_{i}\rVert^{2}-Ln\geq 2nt\bigg{)}.

Let x>0x\in\mathbb{R}_{>0} be such that nLx+x=nt.\sqrt{nLx}+x=nt. If xnt2x\geq\dfrac{nt}{2}, then n2Lt+nt2ntn\sqrt{2Lt}+nt\leq 2nt and so

P(i=1nξi2Ln2nt)P(i=1nξi2Lnn2Lt+nt)exp(nt2).\displaystyle P\bigg{(}\sum_{i=1}^{n}\lVert\xi_{i}\rVert^{2}-Ln\geq 2nt\bigg{)}\leq P\bigg{(}\sum_{i=1}^{n}\lVert\xi_{i}\rVert^{2}-Ln\geq n\sqrt{2Lt}+nt\bigg{)}\leq\exp\bigg{(}-\frac{nt}{2}\bigg{)}.

On the other hand, if nLxnt2\sqrt{nLx}\geq\dfrac{nt}{2}, then xnt24Lx\geq\dfrac{nt^{2}}{4L} and so nt+nt22L2ntnt+\dfrac{nt^{2}}{2L}\leq 2nt. Thus

P(i=1nξi2Ln2nt)P(i=1nξi2Lnnt+nt22L)exp(nt24L).\displaystyle P\bigg{(}\sum_{i=1}^{n}\lVert\xi_{i}\rVert^{2}-Ln\geq 2nt\bigg{)}\leq P\bigg{(}\sum_{i=1}^{n}\lVert\xi_{i}\rVert^{2}-Ln\geq nt+\frac{nt^{2}}{2L}\bigg{)}\leq\exp\bigg{(}-\frac{nt^{2}}{4L}\bigg{)}.

Next, we have that

P(L212ni=1nξi2t)\displaystyle P\bigg{(}\frac{L}{2}-\frac{1}{2n}\sum_{i=1}^{n}\lVert\xi_{i}\rVert^{2}\geq t\bigg{)} =P(Lni=1nξi22nt)exp(nt2L).\displaystyle=P\bigg{(}Ln-\sum_{i=1}^{n}\lVert\xi_{i}\rVert^{2}\geq 2nt\bigg{)}\leq\exp\bigg{(}-\frac{nt^{2}}{L}\bigg{)}.

To summarise, we have the following bound:

P(|12ni=1nξi2L2|t)2exp(cnmin{t,t2})\displaystyle P\Bigg{(}\bigg{|}\frac{1}{2n}\sum_{i=1}^{n}\left\lVert\xi_{i}\right\rVert^{2}-\frac{L}{2}\bigg{|}\geq t\Bigg{)}\leq 2\exp\big{(}-cn\min\{t,t^{2}\}\big{)}

This implies that

P(c)\displaystyle P(\mathcal{E}^{c}) 2exp(nσ2t22θ2)+2exp(cnmin{t,t2})\displaystyle\leq 2\exp\bigg{(}-\frac{n\sigma^{2}t^{2}}{2\left\lVert\theta\right\rVert^{2}}\bigg{)}+2\exp\big{(}-cn\min\{t,t^{2}\}\big{)}
Csexp(cnmin{t,t2})\displaystyle\leq C_{s}\exp\big{(}-cn\min\{t,t^{2}\}\big{)} (13)

where we have used the fact that θσ\left\lVert\theta\right\rVert\leq\sigma for the last inequality.

On the event \mathcal{E}, we have |Rn(ϕ)R(ϕ)|2t+|II(ϕ)𝔼[II(ϕ)]||R_{n}(\phi)-R(\phi)|\leq 2t+|\textup{II}(\phi)-\mathbb{E}[\textup{II}(\phi)]| as well as

1ni=1nθ+σξi2𝔼[θ+σξi2]+4tσ2=θ2+(L+4t)σ2(2L+4t)σ2.\frac{1}{n}\sum_{i=1}^{n}\lVert\theta+\sigma\xi_{i}\rVert^{2}\leq\mathbb{E}\big{[}\lVert\theta+\sigma\xi_{i}\rVert^{2}]+4t\sigma^{2}=\left\lVert\theta\right\rVert^{2}+(L+4t)\sigma^{2}\leq(2L+4t)\sigma^{2}. (14)

We now control the concentration of II(ϕ)\textup{II}(\phi). Differentiating f(ξ,ϕ)f(\xi,\phi) with respect to ξ\xi, we obtain

ξf(ξ,ϕ)=𝔼G[exp((θ+σξ)TGϕσ2)1σGϕ]𝔼G[exp((θ+σξ)TGϕσ2)]𝔼G[exp((θ+σξ)TGϕσ2)1σϕ]𝔼G[exp((θ+σξ)TGϕσ2)]=ϕσ.\displaystyle\left\lVert\nabla_{\xi}f(\xi,\phi)\right\rVert=\left\lVert\frac{\mathbb{E}_{G}\big{[}\exp(\frac{(\theta+\sigma\xi)^{T}G\phi}{\sigma^{2}})\cdot\frac{1}{\sigma}G\phi\big{]}}{\mathbb{E}_{G}\big{[}\exp(\frac{(\theta+\sigma\xi)^{T}G\phi}{\sigma^{2}})\big{]}}\right\rVert\leq\frac{\mathbb{E}_{G}\big{[}\exp(\frac{(\theta+\sigma\xi)^{T}G\phi}{\sigma^{2}})\cdot\frac{1}{\sigma}\left\lVert\phi\right\rVert\big{]}}{\mathbb{E}_{G}\big{[}\exp(\frac{(\theta+\sigma\xi)^{T}G\phi}{\sigma^{2}})\big{]}}=\frac{\left\lVert\phi\right\rVert}{\sigma}.

Thus f(ξ,ϕ)f(\xi,\phi) is ϕσ\frac{\left\lVert\phi\right\rVert}{\sigma}-Lipschitz in ξ\xi. We will use the following Gaussian concentration result.

Theorem 5.0.4.

(Concentration, , Theorem 5.5) Let X𝒩(0,In)X\sim\mathcal{N}(0,I_{n}) be a standard Gaussian vector. Let L>0L\in\mathbb{R}_{>0} and let f:nf:\mathbb{R}^{n}\to\mathbb{R} be an LL-lipschitz function. Then for all λ\lambda\in\mathbb{R},

\log\mathbb{E}\big[e^{\lambda(f(X)-\mathbb{E}[f(X)])}\big]\leq\frac{\lambda^{2}}{2}L^{2}.

It follows that for a fixed \phi\in\mathbb{R}^{L}, the random variable f(ξ,ϕ)𝔼[f(ξ,ϕ)]f(\xi,\phi)-\mathbb{E}[f(\xi,\phi)] is ϕ2σ2\frac{\left\lVert\phi\right\rVert^{2}}{\sigma^{2}}-subgaussian. Hoeffding’s inequality then yields

P(|II(ϕ)𝔼[II(ϕ)]|>t)\displaystyle P\big{(}|\textup{II}(\phi)-\mathbb{E}[\textup{II}(\phi)]|>t\big{)} =P(1n|i=1nf(ξi,ϕ)𝔼[f(ξi,ϕ)]|t)\displaystyle=P\bigg{(}\frac{1}{n}\bigg{|}\sum_{i=1}^{n}f(\xi_{i},\phi)-\mathbb{E}[f(\xi_{i},\phi)]\bigg{|}\geq t\bigg{)}
=P(|i=1nf(ξi,ϕ)𝔼[f(ξi,ϕ)]|nt)\displaystyle=P\bigg{(}\bigg{|}\sum_{i=1}^{n}f(\xi_{i},\phi)-\mathbb{E}[f(\xi_{i},\phi)]\bigg{|}\geq nt\bigg{)}
2exp(nσ2t22ϕ2).\displaystyle\leq 2\exp\bigg{(}-\frac{n\sigma^{2}t^{2}}{2\left\lVert\phi\right\rVert^{2}}\bigg{)}.

Now set δ=tσ4L+t\delta=\dfrac{t\sigma}{4\sqrt{L+t}} and let NδN_{\delta} be a δ\delta-net of \{\phi\in\mathcal{T}\ :\ \left\lVert\phi\right\rVert\leq K\} having cardinality

|Nδ|Ls(12KL+ttσ)s.\displaystyle|N_{\delta}|\leq L^{s}\bigg{(}\frac{12K\sqrt{L+t}}{t\sigma}\bigg{)}^{s}.

Next, the norm of the gradient of 𝔼ξ[f(ξ,ϕ)]\mathbb{E}_{\xi}[f(\xi,\phi)] with respect to ϕ\phi satisfies

ϕ𝔼ξ[f(ξ,ϕ)]\displaystyle\left\lVert\nabla_{\phi}\mathbb{E}_{\xi}[f(\xi,\phi)]\right\rVert =𝔼ξ[𝔼G[exp((GT(θ+σξ))Tϕσ2)GT(θ+σξ)σ2]𝔼G[exp((GT(θ+σξ))Tϕσ2)]]\displaystyle=\left\lVert\mathbb{E}_{\xi}\Bigg{[}\frac{\mathbb{E}_{G}\big{[}\exp(\frac{(G^{T}(\theta+\sigma\xi))^{T}\phi}{\sigma^{2}})\cdot\frac{G^{T}(\theta+\sigma\xi)}{\sigma^{2}}\big{]}}{\mathbb{E}_{G}\big{[}\exp(\frac{(G^{T}(\theta+\sigma\xi))^{T}\phi}{\sigma^{2}})\big{]}}\Bigg{]}\right\rVert
𝔼ξ[𝔼G[exp((GT(θ+σξ))Tϕσ2)θ+σξσ2]𝔼G[exp((GT(θ+σξ))Tϕσ2)]]=1σ2𝔼ξ[θ+σξ].\displaystyle\leq\mathbb{E}_{\xi}\Bigg{[}\frac{\mathbb{E}_{G}\big{[}\exp(\frac{(G^{T}(\theta+\sigma\xi))^{T}\phi}{\sigma^{2}})\cdot\lVert\frac{\theta+\sigma\xi}{\sigma^{2}}\rVert\big{]}}{\mathbb{E}_{G}\big{[}\exp(\frac{(G^{T}(\theta+\sigma\xi))^{T}\phi}{\sigma^{2}})\big{]}}\Bigg{]}=\frac{1}{\sigma^{2}}\mathbb{E}_{\xi}[\left\lVert\theta+\sigma\xi\right\rVert].

This can be upper bounded by

1σ2𝔼ξ[θ+σξ]θ+σ𝔼[ξ]σ2σ+σ𝔼[ξ2]σ2=2Lσ.\displaystyle\frac{1}{\sigma^{2}}\mathbb{E}_{\xi}[\left\lVert\theta+\sigma\xi\right\rVert]\leq\frac{\left\lVert\theta\right\rVert+\sigma\mathbb{E}[\left\lVert\xi\right\rVert]}{\sigma^{2}}\leq\frac{\sigma+\sigma\sqrt{\mathbb{E}[\lVert\xi\rVert^{2}]}}{\sigma^{2}}=\frac{2\sqrt{L}}{\sigma}.

On the other hand, on the event \mathcal{E}, for fixed \xi_{1},\cdots,\xi_{n}\in\mathbb{R}^{L}, the norm of the gradient of \textup{II}(\phi) satisfies

ϕII(ϕ)\displaystyle\left\lVert\nabla_{\phi}\textup{II}(\phi)\right\rVert =1ni=1n[𝔼G[exp((GT(θ+σξi))Tϕσ2)GT(θ+σξi)σ2]𝔼G[exp((GT(θ+σξi))Tϕσ2)]]\displaystyle=\left\lVert\frac{1}{n}\sum_{i=1}^{n}\Bigg{[}\frac{\mathbb{E}_{G}\big{[}\exp(\frac{(G^{T}(\theta+\sigma\xi_{i}))^{T}\phi}{\sigma^{2}})\cdot\frac{G^{T}(\theta+\sigma\xi_{i})}{\sigma^{2}}\big{]}}{\mathbb{E}_{G}\big{[}\exp(\frac{(G^{T}(\theta+\sigma\xi_{i}))^{T}\phi}{\sigma^{2}})\big{]}}\Bigg{]}\right\rVert
1ni=1n[𝔼G[exp((GT(θ+σξi))Tϕσ2)θ+σξiσ2]𝔼G[exp((GT(θ+σξi))Tϕσ2)]]=1σ21ni=1nθ+σξi.\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\Bigg{[}\frac{\mathbb{E}_{G}\big{[}\exp(\frac{(G^{T}(\theta+\sigma\xi_{i}))^{T}\phi}{\sigma^{2}})\cdot\lVert\frac{\theta+\sigma\xi_{i}}{\sigma^{2}}\rVert\big{]}}{\mathbb{E}_{G}\big{[}\exp(\frac{(G^{T}(\theta+\sigma\xi_{i}))^{T}\phi}{\sigma^{2}})\big{]}}\Bigg{]}=\frac{1}{\sigma^{2}}\cdot\frac{1}{n}\sum_{i=1}^{n}\left\lVert\theta+\sigma\xi_{i}\right\rVert.

This can be upper bounded by

1σ21ni=1nθ+σξi1σ21ni=1nθ+σξi22L+4tσ\displaystyle\frac{1}{\sigma^{2}}\cdot\frac{1}{n}\sum_{i=1}^{n}\lVert\theta+\sigma\xi_{i}\rVert\leq\frac{1}{\sigma^{2}}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\lVert\theta+\sigma\xi_{i}\rVert^{2}}\leq\frac{\sqrt{2L+4t}}{\sigma}

where the last inequality follows from (14). This means that on the event \mathcal{E}, the function \textup{II}(\phi)-\mathbb{E}[\textup{II}(\phi)] (for fixed \xi_{1},\cdots,\xi_{n}) is Lipschitz in \phi, with Lipschitz constant at most

2Lσ+2L+4tσ4L+tσ=tδ.\displaystyle\frac{2\sqrt{L}}{\sigma}+\frac{\sqrt{2L+4t}}{\sigma}\leq\frac{4\sqrt{L+t}}{\sigma}=\frac{t}{\delta}.

It then follows that

P\bigg{(}\sup_{\phi\in\mathcal{T}:\left\lVert\phi\right\rVert\leq K}|R_{n}(\phi)-R(\phi)|>4t\text{ and }\mathcal{E}\bigg{)} \leq P\bigg{(}\sup_{\phi\in\mathcal{T}:\left\lVert\phi\right\rVert\leq K}|\textup{II}(\phi)-\mathbb{E}[\textup{II}(\phi)]|>2t\text{ and }\mathcal{E}\bigg{)}
P(supϕNδ|II(ϕ)𝔼[II(ϕ)]|>t)\displaystyle\leq P\bigg{(}\sup_{\phi\in N_{\delta}}|\textup{II}(\phi)-\mathbb{E}[\textup{II}(\phi)]|>t\bigg{)}
2|Nδ|exp(nσ2t22K2)\displaystyle\leq 2|N_{\delta}|\exp\bigg{(}-\frac{n\sigma^{2}t^{2}}{2K^{2}}\bigg{)}
Cs(KL+ttσ)sexp(cnσ2t2K2)\displaystyle\leq C_{s}\bigg{(}\frac{K\sqrt{L+t}}{t\sigma}\bigg{)}^{s}\exp\bigg{(}-\frac{cn\sigma^{2}t^{2}}{K^{2}}\bigg{)}
CsσsKs(1+tt)sexp(cnσ2t2K2)\displaystyle\leq C_{s}\sigma^{-s}K^{s}\bigg{(}\frac{\sqrt{1+t}}{t}\bigg{)}^{s}\exp\bigg{(}-\frac{cn\sigma^{2}t^{2}}{K^{2}}\bigg{)}
CsσsKs(ts+ts/2)exp(cnσ2t2K2).\displaystyle\leq C_{s}\sigma^{-s}K^{s}(t^{-s}+t^{-s/2})\exp\bigg{(}-\frac{cn\sigma^{2}t^{2}}{K^{2}}\bigg{)}.

Combining this with (13) gives the desired result. ∎

Proof of Theorem 1.4.2.

In what follows, let C_{s} and \tilde{C}_{s} be constants depending on s whose values may change from line to line. We employ a slicing argument. For the given value of \delta and each integer k\in\mathbb{Z}_{\geq 1}, define

Γk={ϕL:kδρ(θ,ϕ)<(k+1)δ}.\displaystyle\Gamma_{k}=\Big{\{}\phi\in\mathbb{R}^{L}:k\delta\leq\rho(\theta,\phi)<(k+1)\delta\Big{\}}.

Then for all ϕΓk\phi\in\Gamma_{k}, we have that

DKL(PθPϕ)Csσ6ρ(θ,ϕ)2Csσ6k2δ2\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq C_{s}\sigma^{-6}\rho(\theta,\phi)^{2}\geq C_{s}\sigma^{-6}k^{2}\delta^{2}

by Proposition 2.3.1. Let tk=Csσ6k2δ2t_{k}=C_{s}\sigma^{-6}k^{2}\delta^{2}. By shrinking the constant CsC_{s}, we have that

DKL(PθPϕ)10tk.\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq 10t_{k}.

Note that \max\{2\sqrt{L}M,k\delta\}\leq 2\sqrt{L}Mk since \delta\leq 2\sqrt{L}M. Invoking Lemma 5.0.2 with t=t_{k} and K=\max\{2\sqrt{L}M,k\delta\}, we have that

P(supϕ𝒯:ϕmax{2LM,kδ}|Rn(ϕ)R(ϕ)|>4tk)\displaystyle\ \ \ \ P\bigg{(}\sup_{\phi\in\mathcal{T}:\left\lVert\phi\right\rVert\leq\max\{2\sqrt{L}M,k\delta\}}|R_{n}(\phi)-R(\phi)|>4t_{k}\bigg{)}
Cs(σ5sksδ2s+σ2sδs)exp(C~snk2δ4σ10)+Csexp(C~snmin{k2δ2σ6,k4δ4σ12})\displaystyle\leq C_{s}(\sigma^{5s}k^{-s}\delta^{-2s}+\sigma^{2s}\delta^{-s})\exp\bigg{(}-\frac{\tilde{C}_{s}nk^{2}\delta^{4}}{\sigma^{10}}\bigg{)}+C_{s}\exp\bigg{(}-\tilde{C}_{s}n\min\bigg{\{}\frac{k^{2}\delta^{2}}{\sigma^{6}},\frac{k^{4}\delta^{4}}{\sigma^{12}}\bigg{\}}\bigg{)}
Csσ5sδ2sexp(C~snk2δ4σ10)+Csexp(C~snk2δ4σ12).\displaystyle\leq C_{s}\sigma^{5s}\delta^{-2s}\exp\bigg{(}-\frac{\tilde{C}_{s}nk^{2}\delta^{4}}{\sigma^{10}}\bigg{)}+C_{s}\exp\bigg{(}-\frac{\tilde{C}_{s}nk^{2}\delta^{4}}{\sigma^{12}}\bigg{)}.

Note that D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})=R(\phi)-R(\theta)\geq 10t_{k}. Thus on the event where both |R_{n}(\theta)-R(\theta)|\leq 4t_{k} and |R_{n}(\phi)-R(\phi)|\leq 4t_{k} hold, we have R_{n}(\phi)-R_{n}(\theta)\geq 2t_{k}\geq 0, and so \phi cannot be the restricted MLE (recall that the restricted MLE \tilde{\theta}_{n} minimises R_{n} over \mathcal{T}). Thus

P(θ~nΓk)\displaystyle P(\tilde{\theta}_{n}\in\Gamma_{k}) Csσ5sδ2sexp(C~snk2δ4σ10)+Csexp(C~snk2δ4σ12)\displaystyle\leq C_{s}\sigma^{5s}\delta^{-2s}\exp\bigg{(}-\frac{\tilde{C}_{s}nk^{2}\delta^{4}}{\sigma^{10}}\bigg{)}+C_{s}\exp\bigg{(}-\frac{\tilde{C}_{s}nk^{2}\delta^{4}}{\sigma^{12}}\bigg{)}
Csσ5sδ2sexp(C~snk2δ4σ12)+Csexp(C~snk2δ4σ12)\displaystyle\leq C_{s}\sigma^{5s}\delta^{-2s}\exp\bigg{(}-\frac{\tilde{C}_{s}nk^{2}\delta^{4}}{\sigma^{12}}\bigg{)}+C_{s}\exp\bigg{(}-\frac{\tilde{C}_{s}nk^{2}\delta^{4}}{\sigma^{12}}\bigg{)}
Csσ5sδ2sexp(C~snk2δ4σ12)\displaystyle\leq C_{s}\sigma^{5s}\delta^{-2s}\exp\bigg{(}-\frac{\tilde{C}_{s}nk^{2}\delta^{4}}{\sigma^{12}}\bigg{)}

where the last inequality uses the fact that σ5sδ2s\sigma^{5s}\delta^{-2s} is bounded below by a constant. Summing over k1k\in\mathbb{Z}_{\geq 1} gives

P(ρ(θ~n,θ)δ)\displaystyle P(\rho(\tilde{\theta}_{n},\theta)\geq\delta) k=1P(θ~nΓk)\displaystyle\leq\sum_{k=1}^{\infty}P(\tilde{\theta}_{n}\in\Gamma_{k})
Csσ5sδ2sk=1exp(C~snk2δ4σ12).\displaystyle\leq C_{s}\sigma^{5s}\delta^{-2s}\sum_{k=1}^{\infty}\exp\bigg{(}-\frac{\tilde{C}_{s}nk^{2}\delta^{4}}{\sigma^{12}}\bigg{)}.

Now for n\geq\tilde{C}_{s}^{-1}\delta^{-4}\sigma^{12}, the infinite sum is bounded by the geometric series

exp(C~snδ4σ12)k=0exp(k),\displaystyle\exp\bigg{(}-\frac{\tilde{C}_{s}n\delta^{4}}{\sigma^{12}}\bigg{)}\sum_{k=0}^{\infty}\exp(-k),

which is in turn bounded by a constant multiple of its first term. Thus

P(ρ(θ~n,θ)δ)Csσ5sδ2sexp(C~snδ4σ12)\displaystyle P(\rho(\tilde{\theta}_{n},\theta)\geq\delta)\leq C_{s}\sigma^{5s}\delta^{-2s}\exp\bigg{(}-\frac{\tilde{C}_{s}n\delta^{4}}{\sigma^{12}}\bigg{)}

as desired. ∎

6 Acknowledgements

SG was supported in part by the MOE grants R-146-000-250-133, R-146-000-312-114 and MOE-T2EP20121-0013.

SSM was partially supported by an INSPIRE research grant (DST/INSPIRE/04/2018/002193) from the Department of Science and Technology, Government of India and a Start-Up Grant from Indian Statistical Institute, Kolkata.

References

  • [1] Emmanuel Abbe, Tamir Bendory, William Leeb, João M Pereira, Nir Sharon, and Amit Singer. Multireference alignment is easier with an aperiodic translation distribution. IEEE Transactions on Information Theory, 65(6):3565–3584, 2018.
  • [2] Afonso S Bandeira, Moses Charikar, Amit Singer, and Andy Zhu. Multireference alignment using semidefinite programming. In Proceedings of the 5th conference on Innovations in theoretical computer science, pages 459–470, 2014.
  • [3] Afonso S Bandeira, Jonathan Niles-Weed, and Philippe Rigollet. Optimal rates of estimation for multi-reference alignment. Mathematical Statistics and Learning, 2(1):25–75, 2020.
  • [4] Alberto Bartesaghi, Alan Merk, Soojay Banerjee, Doreen Matthies, Xiongwu Wu, Jacqueline LS Milne, and Sriram Subramaniam. 2.2 Å resolution cryo-EM structure of β-galactosidase in complex with a cell-permeant inhibitor. Science, 348(6239):1147–1151, 2015.
  • [5] Ahmad Bekir and Solomon Golomb. There are no further counterexamples to S. Piccard’s theorem. IEEE Transactions on Information Theory, 53:2864–2867, 2007.
  • [6] Tamir Bendory, Alberto Bartesaghi, and Amit Singer. Single-particle cryo-electron microscopy: Mathematical theory, computational challenges, and opportunities. IEEE Signal Processing Magazine, 37(2):58–76, 2020.
  • [7] Tamir Bendory, Nicolas Boumal, Chao Ma, Zhizhen Zhao, and Amit Singer. Bispectrum inversion with application to multireference alignment. IEEE Transactions on Signal Processing, 66(4):1037–1050, 2017.
  • [8] Tamir Bendory, Yuehaw Khoo, Joe Kileel, Oscar Mickelin, and Amit Singer. Autocorrelation analysis for cryo-em with sparsity constraints: Improved sample complexity and projection-based algorithms. Proceedings of the National Academy of Sciences, 120(18):e2216507120, 2023.
  • [9] Tamir Bendory, Oscar Mickelin, and Amit Singer. Sparse multi-reference alignment: Sample complexity and computational hardness. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8977–8981. IEEE, 2022.
  • [10] Gary S Bloom. A counterexample to a theorem of S. Piccard. Journal of Combinatorial Theory, Series A, 22(3):378–379, 1977.
  • [11] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.
  • [12] Nicolas Boumal, Tamir Bendory, Roy R Lederman, and Amit Singer. Heterogeneous multireference alignment: A single pass approach. In 2018 52nd Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE, 2018.
  • [13] Lisa Gottesfeld Brown. A survey of image registration techniques. ACM computing surveys (CSUR), 24(4):325–376, 1992.
  • [14] Victor-Emmanuel Brunel. Learning rates for gaussian mixtures under group action. In Conference on Learning Theory, pages 471–491. PMLR, 2019.
  • [15] Robert Diamond. On the multiple simultaneous superposition of molecular structures by rigid body transformations. Protein Science, 1(10):1279–1287, 1992.
  • [16] Zehao Dou, Zhou Fan, and Harrison Zhou. Rates of estimation for high-dimensional multi-reference alignment. arXiv preprint arXiv:2205.01847, 2022.
  • [17] Ian L Dryden and Kanti V Mardia. Statistical shape analysis: with applications in R, volume 995. John Wiley & Sons, 2016.
  • [18] Zhou Fan, Roy R Lederman, Yi Sun, Tianhao Wang, and Sheng Xu. Maximum likelihood for high-noise group orbit estimation and single-particle cryo-em. arXiv preprint arXiv:2107.01305, 2021.
  • [19] Zhou Fan, Yi Sun, Tianhao Wang, and Yihong Wu. Likelihood landscape and maximum likelihood estimation for the discrete orbit recovery model. Communications on Pure and Applied Mathematics, 76(6):1208–1302, 2023.
  • [20] Hassan Foroosh, Josiane B Zerubia, and Marc Berthod. Extension of phase correlation to subpixel registration. IEEE transactions on image processing, 11(3):188–200, 2002.
  • [21] Subhroshekhar Ghosh and Philippe Rigollet. Sparse multi-reference alignment: Phase retrieval, uniform uncertainty principles and the beltway problem. Foundations of Computational Mathematics, pages 1–48, 2022.
  • [22] Roberto Gil-Pita, Manuel Rosa-Zurera, P Jarabo-Amores, and Francisco López-Ferreras. Using multilayer perceptrons to align high range resolution radar signals. In Artificial Neural Networks: Formal Models and Their Applications–ICANN 2005: 15th International Conference, Warsaw, Poland, September 11-15, 2005. Proceedings, Part II 15, pages 911–916. Springer, 2005.
  • [23] Anya Katsevich and Afonso S Bandeira. Likelihood maximization and moment matching in low snr gaussian mixture models. Communications on Pure and Applied Mathematics, 76(4):788–842, 2023.
  • [24] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of statistics, pages 1302–1338, 2000.
  • [25] Ryan O’Donnell. Analysis of boolean functions. Cambridge University Press, 2014.
  • [26] Wooram Park and Gregory S Chirikjian. An assembly automation approach to alignment of noncircular projections in electron microscopy. IEEE Transactions on Automation Science and Engineering, 11(3):668–679, 2014.
  • [27] Wooram Park, Charles R Midgett, Dean R Madden, and Gregory S Chirikjian. A stochastic kinematic model of class averaging in single-particle electron microscopy. The International journal of robotics research, 30(6):730–754, 2011.
  • [28] Amelia Perry, Jonathan Weed, Afonso S Bandeira, Philippe Rigollet, and Amit Singer. The sample complexity of multireference alignment. SIAM Journal on Mathematics of Data Science, 1(3):497–517, 2019.
  • [29] Sophie Piccard. Sur les ensembles de distances des ensembles de points d’un espace euclidien. 1939.
  • [30] Ya’Acov Ritov. Estimating a signal with noisy nuisance parameters. Biometrika, 76(1):31–37, 1989.
  • [31] Dirk Robinson, Sina Farsiu, and Peyman Milanfar. Optimal registration of aliased images using variable projection with applications to super-resolution. The Computer Journal, 52(1):31–42, 2009.
  • [32] David M Rosen, Luca Carlone, Afonso S Bandeira, and John J Leonard. A certifiably correct algorithm for synchronization over the special euclidean group. In Algorithmic Foundations of Robotics XII: Proceedings of the Twelfth Workshop on the Algorithmic Foundations of Robotics, pages 64–79. Springer, 2020.
  • [33] Brian M Sadler and Georgios B Giannakis. Shift-and rotation-invariant object reconstruction using the bispectrum. JOSA A, 9(1):57–69, 1992.
  • [34] Sjors HW Scheres, Mikel Valle, Rafael Nunez, Carlos OS Sorzano, Roberto Marabini, Gabor T Herman, and Jose-Maria Carazo. Maximum-likelihood multi-reference refinement for electron microscopy images. Journal of molecular biology, 348(1):139–149, 2005.
  • [35] Devika Sirohi, Zhenguo Chen, Lei Sun, Thomas Klose, Theodore C Pierson, Michael G Rossmann, and Richard J Kuhn. The 3.8 Å resolution cryo-EM structure of Zika virus. Science, 352(6284):467–470, 2016.
  • [36] Douglas L Theobald and Phillip A Steindel. Optimal simultaneous superpositioning of multiple structures with missing data. Bioinformatics, 28(15):1972–1979, 2012.
  • [37] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer series in statistics. Springer, 2009.
  • [38] J Portegies Zwart, René van der Heiden, Sjoerd Gelsema, and Frans Groen. Fast translation invariant classification of hrr range profiles in a zero phase representation. IEE Proceedings-Radar, Sonar and Navigation, 150(6):411–418, 2003.

A Appendix A

Lemma A.1.

Let θ,ϕL\theta,\phi\in\mathbb{R}^{L} be two vectors and suppose that ρ(θ,ϕ)3θ\rho(\theta,\phi)\geq 3\left\lVert\theta\right\rVert. Then there exists a constant CLC_{L}, depending on LL, such that DKL(PθPϕ)CLσ4ρ(θ,ϕ)2.D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq C_{L}\sigma^{-4}\rho(\theta,\phi)^{2}.

Proof.

By replacing θ,ϕ\theta,\phi and σ\sigma with θ/θ,ϕ/θ\theta/\left\lVert\theta\right\rVert,\phi/\left\lVert\theta\right\rVert and σ/θ\sigma/\left\lVert\theta\right\rVert respectively, we may assume that θ=1\left\lVert\theta\right\rVert=1 and σ1\sigma\geq 1. We further assume without loss of generality that ρ(θ,ϕ)=θϕ\rho(\theta,\phi)=\left\lVert\theta-\phi\right\rVert. Now observe that

|θϕϕ|θ13θϕ,\displaystyle\big{|}\left\lVert\theta-\phi\right\rVert-\left\lVert\phi\right\rVert\big{|}\leq\left\lVert\theta\right\rVert\leq\frac{1}{3}\left\lVert\theta-\phi\right\rVert,

where the first inequality follows from two applications of the triangle inequality. Thus \frac{2}{3}\left\lVert\theta-\phi\right\rVert\leq\left\lVert\phi\right\rVert\leq\frac{4}{3}\left\lVert\theta-\phi\right\rVert, and in particular \left\lVert\phi\right\rVert\geq 2. It therefore suffices to show that there exists a positive constant C_{L} such that D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq C_{L}\sigma^{-4}\left\lVert\phi\right\rVert^{2} whenever \left\lVert\phi\right\rVert\geq 2. We will directly use the Donsker-Varadhan formula

DKL(PθPϕ)=supf{𝔼XPθ[f(X)]log(𝔼XPϕ[ef(X)])}\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})=\sup_{f}\bigg{\{}\underset{X\sim P_{\theta}}{\mathbb{E}}\big{[}f(X)\big{]}-\log\Big{(}\underset{X\sim P_{\phi}}{\mathbb{E}}\big{[}e^{f(X)}\big{]}\Big{)}\bigg{\}}

with f(x):=λx2f(x):=-\lambda\left\lVert x\right\rVert^{2}, where λ0\lambda\geq 0 is a constant to be specified. For the first term, we have

𝔼XPθ[λX2]\displaystyle\underset{X\sim P_{\theta}}{\mathbb{E}}\Big{[}-\lambda\left\lVert X\right\rVert^{2}\Big{]} =𝔼XPθ[λGθ+σξ2]\displaystyle=\underset{X\sim P_{\theta}}{\mathbb{E}}\Big{[}-\lambda\left\lVert G\theta+\sigma\xi\right\rVert^{2}\Big{]}
=λ𝔼XPθ[Gθ2+2σ(Gθ)Tξ+σ2ξ2]\displaystyle=-\lambda\underset{X\sim P_{\theta}}{\mathbb{E}}\Big{[}\left\lVert G\theta\right\rVert^{2}+2\sigma(G\theta)^{T}\xi+\sigma^{2}\left\lVert\xi\right\rVert^{2}\Big{]}
=λ(1+σ2L).\displaystyle=-\lambda(1+\sigma^{2}L).

For the second term, note that for X=Gϕ+σξX=G\phi+\sigma\xi, conditioned on GG, the random variable 1σX2\left\lVert\frac{1}{\sigma}X\right\rVert^{2} has a non-central χ2\chi^{2}-distribution with LL degrees of freedom. We obtain

log𝔼XPϕ[eλσ21σX2]\displaystyle-\log\underset{X\sim P_{\phi}}{\mathbb{E}}\Big{[}e^{-\lambda\sigma^{2}\left\lVert\frac{1}{\sigma}X\right\rVert^{2}}\Big{]} =log𝔼G[exp(λGϕ21+2λσ2)(1+2λσ2)L/2]\displaystyle=-\log\mathbb{E}_{G}\bigg{[}\exp\Big{(}\frac{-\lambda\left\lVert G\phi\right\rVert^{2}}{1+2\lambda\sigma^{2}}\Big{)}(1+2\lambda\sigma^{2})^{-L/2}\bigg{]}
=λ1+2λσ2ϕ2+L2log(1+2λσ2).\displaystyle=\frac{\lambda}{1+2\lambda\sigma^{2}}\left\lVert\phi\right\rVert^{2}+\frac{L}{2}\log(1+2\lambda\sigma^{2}).
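This step rests on the moment generating function of the noncentral \chi^{2}-distribution, which is easy to verify numerically. The sketch below is purely illustrative, with arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
L, sigma, lam = 8, 1.5, 0.07                  # arbitrary illustrative values
phi = rng.standard_normal(L)

# X = G*phi + sigma*xi; conditioned on G, ||X/sigma||^2 is noncentral chi^2
# with L degrees of freedom and noncentrality ||phi||^2 / sigma^2.  Since G
# is an isometry, we may take G to be the identity.
xi = rng.standard_normal((200_000, L))
X = phi + sigma * xi
mc = np.mean(np.exp(-lam * np.sum(X**2, axis=1)))   # Monte Carlo E[exp(-lam ||X||^2)]

s = lam * sigma**2
closed = np.exp(-lam * phi @ phi / (1 + 2 * s)) * (1 + 2 * s) ** (-L / 2)
print(mc, closed)   # the two values agree to Monte Carlo accuracy
```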

We get

DKL(PθPϕ)supλ0{λ1+2λσ2ϕ2+L2log(1+2λσ2)λ(1+σ2L)}.\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq\sup_{\lambda\in\mathbb{R}_{\geq 0}}\Bigg{\{}\frac{\lambda}{1+2\lambda\sigma^{2}}\left\lVert\phi\right\rVert^{2}+\frac{L}{2}\log(1+2\lambda\sigma^{2})-\lambda(1+\sigma^{2}L)\Bigg{\}}.

Choose λ=14σ4L.\lambda=\dfrac{1}{4\sigma^{4}L}. Then 2λσ21\displaystyle 2\lambda\sigma^{2}\leq 1 so the quantity on the right-hand side is at least

λ2ϕ2+L2(2λσ24λ2σ4)λ(1+σ2L)\displaystyle\frac{\lambda}{2}\left\lVert\phi\right\rVert^{2}+\frac{L}{2}(2\lambda\sigma^{2}-4\lambda^{2}\sigma^{4})-\lambda(1+\sigma^{2}L)

where we have used the fact that log(1+x)xx2\log(1+x)\geq x-x^{2} for x0x\in\mathbb{R}_{\geq 0}. Since ϕ2\left\lVert\phi\right\rVert\geq 2, we have that

DKL(PθPϕ)\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi}) (λ4ϕ2+λ)+(Lλσ22Lλ2σ4)(λ+Lλσ2)\displaystyle\geq\Big{(}\frac{\lambda}{4}\left\lVert\phi\right\rVert^{2}+\lambda\Big{)}+(L\lambda\sigma^{2}-2L\lambda^{2}\sigma^{4})-(\lambda+L\lambda\sigma^{2})
=116σ4Lϕ218σ4L132σ4Lϕ2\displaystyle=\frac{1}{16\sigma^{4}L}\left\lVert\phi\right\rVert^{2}-\frac{1}{8\sigma^{4}L}\geq\frac{1}{32\sigma^{4}L}\left\lVert\phi\right\rVert^{2}

as desired. ∎

B Appendix B

Proposition B.1.

Let \theta,\phi\in\mathbb{R}^{L} be two collision-free signals such that \Delta_{2}(\theta,\phi)=0. If s\geq 7, then \phi lies either in the orbit of \theta or in the orbit of -\theta.

Proof.

We have that for any 1i,jL1\leq i,j\leq L with iji\neq j,

\mathbb{E}[(G\theta)^{\otimes 2}]_{i,j} =\frac{1}{L}\sum_{g\in\mathbb{Z}_{L}}\theta(i+g)\theta(j+g)
=\begin{cases}0&\ \text{ if }i-j\text{ is not a difference of two elements of }\text{supp}(\theta)\\ \dfrac{1}{L}\theta(i^{\prime})\theta(j^{\prime})&\ \text{ otherwise, where }i^{\prime},j^{\prime}\in\text{supp}(\theta)\text{ is the unique pair (by collision-freeness) with }i-j=i^{\prime}-j^{\prime}.\end{cases}

Thus, by the resolution of the beltway problem, the support of \theta is determined by the second moment tensor up to a cyclic shift. As \Delta_{2}(\theta,\phi)=0, we must have \text{supp}(\theta)=\text{supp}(R^{(\ell)}\phi) for some 1\leq\ell\leq L. Without loss of generality assume \ell=L, so that \text{supp}(\theta)=\text{supp}(\phi). Let \{k_{1},k_{2},\cdots,k_{s}\} denote their common support. Then the condition

θ(ki)θ(ki)=ϕ(ki)ϕ(ki) for all 1i,is\displaystyle\theta(k_{i})\theta(k_{i^{\prime}})=\phi(k_{i})\phi(k_{i^{\prime}})\ \ \ \ \ \ \text{ for all }1\leq i,i^{\prime}\leq s

is equivalent to the condition that the pairwise products of the ratios

θ(k1)ϕ(k1),θ(k2)ϕ(k2),,θ(ks)ϕ(ks)\displaystyle\frac{\theta(k_{1})}{\phi(k_{1})},\ \frac{\theta(k_{2})}{\phi(k_{2})},\cdots,\ \frac{\theta(k_{s})}{\phi(k_{s})}

must all evaluate to 11. This implies that these ratios must either all be positive or all be negative. After reordering if necessary, we assume without loss of generality that

|θ(k1)ϕ(k1)||θ(k2)ϕ(k2)||θ(ks)ϕ(ks)|.\displaystyle\bigg{|}\frac{\theta(k_{1})}{\phi(k_{1})}\bigg{|}\geq\bigg{|}\frac{\theta(k_{2})}{\phi(k_{2})}\bigg{|}\geq\cdots\geq\bigg{|}\frac{\theta(k_{s})}{\phi(k_{s})}\bigg{|}.

If the ratios are all positive, then since

θ(k1)ϕ(k1)θ(k2)ϕ(k2)=1,\displaystyle\frac{\theta(k_{1})}{\phi(k_{1})}\cdot\frac{\theta(k_{2})}{\phi(k_{2})}=1,

we must have \theta(k_{2})/\phi(k_{2})\leq 1. If this inequality were strict, then

θ(k2)ϕ(k2)θ(k3)ϕ(k3)<1,\displaystyle\frac{\theta(k_{2})}{\phi(k_{2})}\cdot\frac{\theta(k_{3})}{\phi(k_{3})}<1,

a contradiction. Hence

\frac{\theta(k_{2})}{\phi(k_{2})}\cdot\frac{\theta(k_{i})}{\phi(k_{i})}=1\text{ for all }i\neq 2 \implies\frac{\theta(k_{i})}{\phi(k_{i})}=1\ \text{ for all }1\leq i\leq s
θ=ϕ.\displaystyle\implies\theta=\phi.

In the latter case, a similar argument yields that \theta(k_{i})/\phi(k_{i})=-1 for all 1\leq i\leq s and so \theta=-\phi. ∎
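The role of the second moment tensor in this argument can be made concrete with a small numerical sketch. The example below is illustrative only (it uses s=3 for readability, whereas the proposition assumes s\geq 7): the nonzero off-diagonal entries of \mathbb{E}[(G\theta)^{\otimes 2}] sit exactly at the index pairs whose difference lies in the difference set of \text{supp}(\theta).

```python
import numpy as np

L = 13
support = [0, 1, 4]                     # a collision-free set mod 13 (illustration)
theta = np.zeros(L)
theta[support] = [1.0, 2.0, -1.5]

# Second moment tensor E[(G theta)^{(x)2}] under a uniform cyclic shift G.
M = np.zeros((L, L))
for g in range(L):
    v = np.roll(theta, g)
    M += np.outer(v, v) / L

# Differences i - j (mod L) carrying a nonzero off-diagonal entry of M ...
observed = {(i - j) % L for i in range(L) for j in range(L)
            if i != j and abs(M[i, j]) > 1e-12}
# ... coincide with the difference set of the support.
expected = {(a - b) % L for a in support for b in support if a != b}
print(observed == expected)             # True
```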

C Appendix C

We give a full proof of the modified Theorem 2.1.2. We first prove a modified version of Lemma B.12 of the BRW paper [3]. Fix a constant K_{0}\geq 1.

Lemma C.1.

For any m1m\in\mathbb{Z}_{\geq 1} and θ,ϕL\theta,\phi\in\mathbb{R}^{L} satisfying θ=1\left\lVert\theta\right\rVert=1 and ρ(θ,ϕ)K0\rho(\theta,\phi)\leq K_{0}, we have

Δm(θ,ϕ)21218mK02mρ(θ,ϕ)2.\displaystyle\left\lVert\Delta_{m}(\theta,\phi)\right\rVert^{2}\leq 12\cdot 18^{m}K_{0}^{2m}\rho(\theta,\phi)^{2}.
Proof.

Assume without loss of generality that ρ(θ,ϕ)=θϕ=:ϵ.\rho(\theta,\phi)=\left\lVert\theta-\phi\right\rVert=:\epsilon. By Jensen’s inequality,

𝔼[(Gθ)m(Gϕ)m]2\displaystyle\left\lVert\mathbb{E}[(G\theta)^{\otimes m}-(G\phi)^{\otimes m}]\right\rVert^{2} 𝔼[(Gθ)m(Gϕ)m2]=θmϕm2.\displaystyle\leq\mathbb{E}\big{[}\lVert(G\theta)^{\otimes m}-(G\phi)^{\otimes m}\rVert^{2}\big{]}=\lVert\theta^{\otimes m}-\phi^{\otimes m}\rVert^{2}.

Expanding the norm yields

θmϕm2\displaystyle\lVert\theta^{\otimes m}-\phi^{\otimes m}\rVert^{2} =θ2m2θ,ϕm+ϕ2m\displaystyle=\left\lVert\theta\right\rVert^{2m}-2\langle\theta,\phi\rangle^{m}+\left\lVert\phi\right\rVert^{2m}
=12(1+γ)m+(1+2γ+ϵ2)m\displaystyle=1-2(1+\gamma)^{m}+(1+2\gamma+\epsilon^{2})^{m}

where \gamma:=\langle\theta,\phi-\theta\rangle satisfies |\gamma|\leq\epsilon by the Cauchy-Schwarz inequality.

By the binomial theorem, for all xx such that |x|3K02|x|\leq 3K_{0}^{2}, there exists an rmr_{m} such that

(1+x)m=k=0m(mk)xk=1+mx+rm\displaystyle(1+x)^{m}=\sum_{k=0}^{m}\begin{pmatrix}m\\ k\end{pmatrix}x^{k}=1+mx+r_{m}

with |rm|18mK02mx2|r_{m}|\leq 18^{m}K_{0}^{2m}x^{2}. By assumption, we have |2γ+ϵ2|3K02|2\gamma+\epsilon^{2}|\leq 3K_{0}^{2} and so

θmϕm2\displaystyle\lVert\theta^{\otimes m}-\phi^{\otimes m}\rVert^{2} 122mγ+218mK02mϵ2+1+2mγ+mϵ2+18mK02m9ϵ2\displaystyle\leq 1-2-2m\gamma+2\cdot 18^{m}K_{0}^{2m}\epsilon^{2}+1+2m\gamma+m\epsilon^{2}+18^{m}K_{0}^{2m}\cdot 9\epsilon^{2}
=218mK02mϵ2+mϵ2+18mK02m9ϵ2\displaystyle=2\cdot 18^{m}K_{0}^{2m}\epsilon^{2}+m\epsilon^{2}+18^{m}K_{0}^{2m}\cdot 9\epsilon^{2}
1218mK02mϵ2.\displaystyle\leq 12\cdot 18^{m}K_{0}^{2m}\epsilon^{2}.

We now prove the upper bound of the modified Theorem 2.1.2.

Proof.

If θ=0,\theta=0, then since ρ(θ,ϕ)K0θ\rho(\theta,\phi)\leq K_{0}\left\lVert\theta\right\rVert, we must have ϕ=0\phi=0 as well so the statement trivially holds. Otherwise, first note that each term in (2.1.2) remains unchanged if the quantities θ\theta, ϕ\phi and σ\sigma are replaced by θ/θ,ϕ/θ\theta/\left\lVert\theta\right\rVert,\ \phi/\left\lVert\theta\right\rVert and σ/θ\sigma/\left\lVert\theta\right\rVert respectively. The same is true when ϕ\phi is replaced by another vector G0ϕG_{0}\phi in the same \mathcal{R}-orbit. As such, we henceforth assume θ=1\left\lVert\theta\right\rVert=1, σ1\sigma\geq 1 and ρ(θ,ϕ)=θϕ.\rho(\theta,\phi)=\left\lVert\theta-\phi\right\rVert.

Instead of establishing an upper bound on the KL divergence DKL(PθPϕ)D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi}) directly, we instead work with the χ2\chi^{2}-divergence

χ2(Pθ,Pϕ):=d(fθ(x)fϕ(x))2fϕ(x)𝑑x.\displaystyle\chi^{2}(P_{\theta},P_{\phi}):=\int_{\mathbb{R}^{d}}\frac{(f_{\theta}(x)-f_{\phi}(x))^{2}}{f_{\phi}(x)}\ dx.

and then pass to the KL divergence via the upper bound DKL(PθPϕ)χ2(Pθ,Pϕ)D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\leq\chi^{2}(P_{\theta},P_{\phi}). Since 𝔼[Gϕ]=0,\mathbb{E}[G\phi]=0, Jensen’s inequality implies that

fϕ(x)\displaystyle f_{\phi}(x) 1σd(2π)d/2ex2+ϕ22σ2e1σ2𝔼[xTGϕ]=1σd(2π)d/2ex2+ϕ22σ2.\displaystyle\geq\frac{1}{\sigma^{d}(2\pi)^{d/2}}e^{-\frac{\left\lVert x\right\rVert^{2}+\left\lVert\phi\right\rVert^{2}}{2\sigma^{2}}}e^{\frac{1}{\sigma^{2}}\mathbb{E}[x^{T}G\phi]}=\frac{1}{\sigma^{d}(2\pi)^{d/2}}e^{-\frac{\left\lVert x\right\rVert^{2}+\left\lVert\phi\right\rVert^{2}}{2\sigma^{2}}}.

Hence

(fθ(x)fϕ(x))2fϕ(x)1σd(2π)d/2ex2+ϕ22σ2(eθ22σ2𝔼[e1σ2xTGθ]eϕ22σ2𝔼[e1σ2xTGϕ])2.\displaystyle\frac{(f_{\theta}(x)-f_{\phi}(x))^{2}}{f_{\phi}(x)}\leq\frac{1}{\sigma^{d}(2\pi)^{d/2}}e^{\frac{-\left\lVert x\right\rVert^{2}+\left\lVert\phi\right\rVert^{2}}{2\sigma^{2}}}\Big{(}e^{-\frac{\left\lVert\theta\right\rVert^{2}}{2\sigma^{2}}}\mathbb{E}\big{[}e^{\frac{1}{\sigma^{2}}x^{T}G\theta}\big{]}-e^{-\frac{\left\lVert\phi\right\rVert^{2}}{2\sigma^{2}}}\mathbb{E}\big{[}e^{\frac{1}{\sigma^{2}}x^{T}G\phi}\big{]}\Big{)}^{2}.

To obtain our desired bound on the \chi^{2}-divergence, we will integrate both sides with respect to x. Expanding out the square on the right-hand side yields three terms, which we will evaluate separately. The first term is

eϕ22θ22σ2d1σd(2π)d/2ex22σ2𝔼[e1σ2xT(G+G)θ]𝑑x\displaystyle\ \ \ \ e^{\frac{\left\lVert\phi\right\rVert^{2}-2\left\lVert\theta\right\rVert^{2}}{2\sigma^{2}}}\int_{\mathbb{R}^{d}}\frac{1}{\sigma^{d}(2\pi)^{d/2}}e^{-\frac{\left\lVert x\right\rVert^{2}}{2\sigma^{2}}}\mathbb{E}\big{[}e^{\frac{1}{\sigma^{2}}x^{T}(G+G^{\prime})\theta}\big{]}\ dx

where GG^{\prime} denotes an independent and identically distributed copy of GG. To simplify the expression, we seek to rewrite it as the integral of the density of a Gaussian by completing the square. We obtain

eϕ22θ22σ2d1σd(2π)d/2ex22σ2𝔼[e1σ2xT(G+G)θ]𝑑x\displaystyle\ \ \ \ e^{\frac{\left\lVert\phi\right\rVert^{2}-2\left\lVert\theta\right\rVert^{2}}{2\sigma^{2}}}\int_{\mathbb{R}^{d}}\frac{1}{\sigma^{d}(2\pi)^{d/2}}e^{-\frac{\left\lVert x\right\rVert^{2}}{2\sigma^{2}}}\mathbb{E}\big{[}e^{\frac{1}{\sigma^{2}}x^{T}(G+G^{\prime})\theta}\big{]}\ dx
=eϕ22θ22σ2𝔼[d1σd(2π)d/2e12σ2(x(G+G)θ)T(x(G+G)θ)𝑑xe12σ2((G+G)θ)T((G+G)θ)]\displaystyle=e^{\frac{\left\lVert\phi\right\rVert^{2}-2\left\lVert\theta\right\rVert^{2}}{2\sigma^{2}}}\mathbb{E}\Bigg{[}\int_{\mathbb{R}^{d}}\frac{1}{\sigma^{d}(2\pi)^{d/2}}e^{-\frac{1}{2\sigma^{2}}(x-(G+G^{\prime})\theta)^{T}(x-(G+G^{\prime})\theta)}dx\cdot e^{\frac{1}{2\sigma^{2}}((G+G^{\prime})\theta)^{T}((G+G^{\prime})\theta)}\Bigg{]}
=eϕ22θ22σ2𝔼[e12σ2((G+G)θ)T((G+G)θ)]\displaystyle=e^{\frac{\left\lVert\phi\right\rVert^{2}-2\left\lVert\theta\right\rVert^{2}}{2\sigma^{2}}}\mathbb{E}\Big{[}e^{\frac{1}{2\sigma^{2}}((G+G^{\prime})\theta)^{T}((G+G^{\prime})\theta)}\Big{]}
=eϕ22σ2𝔼[eθTGθσ2]\displaystyle=e^{\frac{\left\lVert\phi\right\rVert^{2}}{2\sigma^{2}}}\mathbb{E}\big{[}e^{\frac{\theta^{T}G\theta}{\sigma^{2}}}\big{]}

Via similar computations, the second and third terms evaluate to

2eϕ22σ2𝔼[eθTGϕσ2] and eϕ22σ2𝔼[eϕTGϕσ2]\displaystyle-2e^{\frac{\left\lVert\phi\right\rVert^{2}}{2\sigma^{2}}}\mathbb{E}\big{[}e^{\frac{\theta^{T}G\phi}{\sigma^{2}}}\big{]}\ \ \ \ \ \text{ and }\ \ \ \ \ e^{\frac{\left\lVert\phi\right\rVert^{2}}{2\sigma^{2}}}\mathbb{E}\big{[}e^{\frac{\phi^{T}G\phi}{\sigma^{2}}}\big{]}

respectively. As θϕK0\left\lVert\theta-\phi\right\rVert\leq K_{0} and θ=1\left\lVert\theta\right\rVert=1, we have that ϕ2(K0+1)24K02\left\lVert\phi\right\rVert^{2}\leq(K_{0}+1)^{2}\leq 4K_{0}^{2} and so

eϕ22σ2e2K02σ2.\displaystyle e^{\frac{\left\lVert\phi\right\rVert^{2}}{2\sigma^{2}}}\leq e^{\frac{2K_{0}^{2}}{\sigma^{2}}}.

A power series expansion yields

χ2(Pθ,Pϕ)\displaystyle\chi^{2}(P_{\theta},P_{\phi}) e2K02σ2𝔼[eθTGθσ22eθTGϕσ2+eϕTGϕσ2]\displaystyle\leq e^{\frac{2K_{0}^{2}}{\sigma^{2}}}\mathbb{E}\Big{[}e^{\frac{\theta^{T}G\theta}{\sigma^{2}}}-2e^{\frac{\theta^{T}G\phi}{\sigma^{2}}}+e^{\frac{\phi^{T}G\phi}{\sigma^{2}}}\Big{]}
=e2K02σ2m=11σ2mm!𝔼[(θTGθ)m2(θTGϕ)m+(ϕTGϕ)m]\displaystyle=e^{\frac{2K_{0}^{2}}{\sigma^{2}}}\sum_{m=1}^{\infty}\frac{1}{\sigma^{2m}m!}\mathbb{E}\big{[}(\theta^{T}G\theta)^{m}-2(\theta^{T}G\phi)^{m}+(\phi^{T}G\phi)^{m}\big{]}
=e2K02σ2m=11σ2mm!Δm(θ,ϕ)2.\displaystyle=e^{\frac{2K_{0}^{2}}{\sigma^{2}}}\sum_{m=1}^{\infty}\frac{1}{\sigma^{2m}m!}\left\lVert\Delta_{m}(\theta,\phi)\right\rVert^{2}.

With Lemma C.1, and using the fact that \sigma\geq 1, we conclude the proof with

e2K02σ2m=kΔm(θ,ϕ)2σ2mm!\displaystyle e^{\frac{2K_{0}^{2}}{\sigma^{2}}}\sum_{m=k}^{\infty}\frac{\left\lVert\Delta_{m}(\theta,\phi)\right\rVert^{2}}{\sigma^{2m}m!} 12e2K02σ2m=k18mK02mρ(θ,ϕ)2σ2mm!\displaystyle\leq 12e^{\frac{2K_{0}^{2}}{\sigma^{2}}}\sum_{m=k}^{\infty}\frac{18^{m}K_{0}^{2m}\rho(\theta,\phi)^{2}}{\sigma^{2m}m!}
12e2K02σ2ρ(θ,ϕ)2σ2km=k18mK02mm!\displaystyle\leq 12e^{\frac{2K_{0}^{2}}{\sigma^{2}}}\cdot\frac{\rho(\theta,\phi)^{2}}{\sigma^{2k}}\sum_{m=k}^{\infty}\frac{18^{m}K_{0}^{2m}}{m!}
12e2K02σ2+18K02ρ(θ,ϕ)2σ2k.\displaystyle\leq 12e^{\frac{2K_{0}^{2}}{\sigma^{2}}+18K_{0}^{2}}\cdot\frac{\rho(\theta,\phi)^{2}}{\sigma^{2k}}.

The remainder of the section is dedicated to proving the lower bound of the modified Theorem 2.1.2. To that end, we first recall the key ingredients used in the original proof of Theorem 2.1.2. At each step, the modifications that are needed to suit our current context are highlighted.

Definition C.2.

The (probabilist’s) Hermite polynomials are a family of polynomials {hk}k=0\{h_{k}\}_{k=0}^{\infty} defined by

hk(x):=(1)kex22kxkex22,k0.\displaystyle h_{k}(x):=(-1)^{k}e^{\frac{x^{2}}{2}}\dfrac{\partial^{k}}{\partial x^{k}}e^{-\frac{x^{2}}{2}},\ \ \ \ \ \ k\in\mathbb{Z}_{\geq 0}.
Fact C.3.

The Hermite polynomials satisfy the following basic properties:

  (i) The polynomial h_{k}(x) has degree k;

  (ii) The family \{h_{k}\}_{k=0}^{\infty} is an orthogonal basis for L^{2}(\mathbb{R},\gamma), where \gamma denotes the standard Gaussian measure on \mathbb{R};

  (iii) We have \left\lVert h_{k}\right\rVert^{2}_{L^{2}(\mathbb{R},\gamma)}=k!;

  (iv) For any \mu\in\mathbb{R}, we have \underset{Y\sim\mathcal{N}(\mu,1)}{\mathbb{E}}\big{[}h_{k}(Y)\big{]}=\mu^{k}.
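Property (iv) admits a one-line Monte Carlo check; the sketch below is illustrative and relies on numpy's hermite_e module, which implements precisely the probabilist's Hermite polynomials used here.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

rng = np.random.default_rng(2)
mu, k, n = 0.7, 4, 1_000_000

coeffs = np.zeros(k + 1); coeffs[k] = 1.0    # coefficient vector selecting h_k
Y = rng.normal(mu, 1.0, size=n)

empirical = He.hermeval(Y, coeffs).mean()    # Monte Carlo estimate of E[h_k(Y)]
print(empirical, mu**k)                      # both close to 0.7^4 = 0.2401
```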

Definition C.4.

For each multi-index α=(α1,,αd)0d\alpha=(\alpha_{1},\cdots,\alpha_{d})\in\mathbb{N}_{0}^{d}, define the multivariate polynomial hαh_{\alpha} by

hα(x1,,xd):=i=1dhαi(xi).\displaystyle h_{\alpha}(x_{1},\cdots,x_{d}):=\prod_{i=1}^{d}h_{\alpha_{i}}(x_{i}).

The family {hα:α0d}\big{\{}h_{\alpha}:\alpha\in\mathbb{N}_{0}^{d}\big{\}} is called the multivariate Hermite polynomials.

The multivariate Hermite polynomials form an orthogonal basis for the product space L2(d,γd)L^{2}(\mathbb{R}^{d},\gamma^{\otimes d}). By properties (ii), (iii) and (iv) of Fact C.3, the family of rescaled Hermite polynomials {Hα:α0d}\{H_{\alpha}\ :\ \alpha\in\mathbb{N}^{d}_{0}\} defined by

Hα(x1,,xd):=σ|α|hα(σ1x1,,σ1xd)H_{\alpha}(x_{1},\cdots,x_{d}):=\sigma^{|\alpha|}h_{\alpha}(\sigma^{-1}x_{1},\cdots,\sigma^{-1}x_{d}) (15)

satisfy the following identities

𝔼Z𝒩(𝝁,σ2𝑰d)[Hα(Z)]\displaystyle\displaystyle\underset{Z\sim\mathcal{N}(\boldsymbol{\mu},\sigma^{2}\boldsymbol{I}_{d})}{\mathbb{E}}\big{[}H_{\alpha}(Z)\big{]} =i=1dμiαi\displaystyle=\prod_{i=1}^{d}\mu_{i}^{\alpha_{i}} (16)
𝔼Z𝒩(𝟎,σ2𝑰d)[Hα(Z)Hβ(Z)]\displaystyle\underset{Z\sim\mathcal{N}(\boldsymbol{0},\sigma^{2}\boldsymbol{I}_{d})}{\mathbb{E}}\big{[}H_{\alpha}(Z)H_{\beta}(Z)\big{]} ={σ2|α|α! if α=β0 otherwise.\displaystyle=\begin{cases}\sigma^{2|\alpha|}\alpha!&\ \text{ if }\alpha=\beta\\ 0&\ \text{ otherwise}.\end{cases} (17)

Next, for each positive integer mm, we define the function Hm:d(d)mH_{m}:\mathbb{R}^{d}\to(\mathbb{R}^{d})^{\otimes m} in the following way. Given xdx\in\mathbb{R}^{d}, set Hm(x)H_{m}(x) to be the order-mm symmetric tensor whose (i1,,im)(i_{1},\cdots,i_{m})th-entry is given by Hα(i1im)(x)H_{\alpha^{(i_{1}\cdots i_{m})}}(x), where α(i1im){0,,m}d\alpha^{(i_{1}\cdots i_{m})}\in\{0,\cdots,m\}^{d} is the multi-index associated to (i1,,im)(i_{1},\cdots,i_{m}):

α(i1im):=|{j[m]:ij=}|, 1d.\displaystyle\alpha^{(i_{1}\cdots i_{m})}_{\ell}:=\big{|}\big{\{}{j\in[m]\ :\ i_{j}=\ell}\big{\}}\big{|},\ \ \ \ \ \ \ 1\leq\ell\leq d.

Note that |α(i1im)|=m|\alpha^{(i_{1}\cdots i_{m})}|=m for each mm-tuple (i1,,im)[d]m(i_{1},\cdots,i_{m})\in[d]^{m}.

The motivation behind the above definition will gradually become apparent once we write the quantities Δm(θ,ϕ)2\left\lVert\Delta_{m}(\theta,\phi)\right\rVert^{2} in terms of the family (Hm)m=1(H_{m})_{m=1}^{\infty}, where θ\theta and ϕ\phi are arbitrary vectors in d\mathbb{R}^{d}. For a positive integer kk, consider the degree kk polynomial

Tk(x):=m=1kΔm(θ,ϕ),Hm(x)(3σ)2mm!.\displaystyle T_{k}(x):=\sum_{m=1}^{k}\frac{\langle\Delta_{m}(\theta,\phi),H_{m}(x)\rangle}{(\sqrt{3}\sigma)^{2m}m!}.

If X\sim P_{\theta}, then (16) implies that

𝔼[Tk(X)]=𝔼G[m=1kΔm(θ,ϕ),𝔼ξ[Hm(X)](3σ)2mm!]=m=1kΔm(θ,ϕ),𝔼[(Gθ)m](3σ)2mm!.\displaystyle\mathbb{E}[T_{k}(X)]=\mathbb{E}_{G}\Bigg{[}\sum_{m=1}^{k}\frac{\big{\langle}\Delta_{m}(\theta,\phi),\mathbb{E}_{\xi}[H_{m}(X)]\big{\rangle}}{(\sqrt{3}\sigma)^{2m}m!}\Bigg{]}=\sum_{m=1}^{k}\frac{\big{\langle}\Delta_{m}(\theta,\phi),\mathbb{E}[(G\theta)^{\otimes m}]\big{\rangle}}{(\sqrt{3}\sigma)^{2m}m!}.

Hence if YPϕY\sim P_{\phi}, we get

𝔼[Tk(X)]𝔼[Tk(Y)]=m=1kΔm(θ,ϕ)2(3σ)2mm!=:δ.\displaystyle\mathbb{E}[T_{k}(X)]-\mathbb{E}[T_{k}(Y)]=\sum_{m=1}^{k}\frac{\left\lVert\Delta_{m}(\theta,\phi)\right\rVert^{2}}{(\sqrt{3}\sigma)^{2m}m!}=:\delta.

To proceed, we will use the following lemma to relate the KL divergence between P_{\theta} and P_{\phi} to the quantity \delta. (Remark: this lemma remains unchanged from Lemma B.11 of the BRW paper [3].)

Lemma C.5.

Let P1P_{1} and P2P_{2} be any two probability distributions on a measure space (Ω,)(\Omega,\mathcal{F}). Suppose that there exists a measurable function F:ΩF:\Omega\to\mathbb{R} such that (𝔼P1[F(X)]𝔼P2[F(X)])2=μ2\big{(}\mathbb{E}_{P_{1}}[F(X)]-\mathbb{E}_{P_{2}}[F(X)]\big{)}^{2}=\mu^{2} and max{varP1(F(X)),varP2(F(X))}σ2\max\big{\{}\text{var}_{P_{1}}(F(X)),\text{var}_{P_{2}}(F(X))\big{\}}\leq\sigma^{2}. Then

DKL(P1P2)μ24σ2+μ2.D_{\text{KL}}(P_{1}\;\|\;P_{2})\geq\frac{\mu^{2}}{4\sigma^{2}+\mu^{2}}. (18)
Proof.

By replacing FF by F+λF+\lambda for a suitably chosen constant λ\lambda, we may assume that 𝔼P1[F(X)]=μ/2\mathbb{E}_{P_{1}}[F(X)]=\mu/2 and 𝔼P2[F(X)]=μ/2\mathbb{E}_{P_{2}}[F(X)]=-\mu/2. Let Q1Q_{1} and Q2Q_{2} denote the corresponding probability distributions of F(X)F(X) when XX is distributed according to P1P_{1} and P2P_{2} respectively. By the data processing inequality, it suffices to prove the claimed bound for DKL(Q1Q2).D_{\text{KL}}(Q_{1}\;\|\;Q_{2}). We further assume that Q1Q_{1} is absolutely continuous with respect to Q2Q_{2} (otherwise the bound is trivial).

As the quantities involved in (18) arise from taking the expectation of random variables, our approach to establish the inequality will be to pass to the convex function f:[0,+)f:[0,+\infty)\to\mathbb{R} defined by

f(x):=xlogx(x1)22(x+1)\displaystyle f(x):=x\log x-\frac{(x-1)^{2}}{2(x+1)}

and then apply Jensen’s inequality. This yields

𝔼Q2[f(dQ1dQ2)]f(𝔼Q2[dQ1dQ2])=f(1)=0.\displaystyle\mathbb{E}_{Q_{2}}\Bigg{[}f\left(\frac{dQ_{1}}{dQ_{2}}\right)\Bigg{]}\geq f\Bigg{(}\mathbb{E}_{Q_{2}}\bigg{[}\frac{dQ_{1}}{dQ_{2}}\bigg{]}\Bigg{)}=f(1)=0.

Let μ\mu be a dominating measure (i.e. Q1μQ_{1}\ll\mu and Q2μQ_{2}\ll\mu) and let q1q_{1} and q2q_{2} denote the densities of Q1Q_{1} and Q2Q_{2} with respect to μ\mu. The previous calculation implies that

DKL(Q1Q2)=𝔼Q2[dQ1dQ2logdQ1dQ2]12(q1(x)q2(x))2(q1(x)+q2(x))𝑑μ(x).\displaystyle D_{\text{KL}}(Q_{1}\;\|\;Q_{2})=\mathbb{E}_{Q_{2}}\bigg{[}\frac{dQ_{1}}{dQ_{2}}\log\frac{dQ_{1}}{dQ_{2}}\bigg{]}\geq\frac{1}{2}\int_{\mathbb{R}}\frac{(q_{1}(x)-q_{2}(x))^{2}}{(q_{1}(x)+q_{2}(x))}\ d\mu(x).

By the Cauchy-Schwarz inequality,

μ2\displaystyle\mu^{2} =(x(q1(x)q2(x))𝑑μ(x))2\displaystyle=\Bigg{(}\int_{\mathbb{R}}x\big{(}q_{1}(x)-q_{2}(x)\big{)}\ d\mu(x)\Bigg{)}^{2}
x2(q1(x)+q2(x))𝑑μ(x)(q1(x)q2(x))2q1(x)+q2(x)𝑑μ(x)\displaystyle\leq\int_{\mathbb{R}}x^{2}\big{(}q_{1}(x)+q_{2}(x)\big{)}\ d\mu(x)\int_{\mathbb{R}}\frac{(q_{1}(x)-q_{2}(x))^{2}}{q_{1}(x)+q_{2}(x)}\ d\mu(x)
=(2σ2+μ2/2)(q1(x)q2(x))2q1(x)+q2(x)𝑑μ(x).\displaystyle=(2\sigma^{2}+\mu^{2}/2)\int_{\mathbb{R}}\frac{(q_{1}(x)-q_{2}(x))^{2}}{q_{1}(x)+q_{2}(x)}\ d\mu(x).

Hence

D_{\text{KL}}(Q_{1}\;\|\;Q_{2})\geq\frac{\mu^{2}}{4\sigma^{2}+\mu^{2}}

as desired. ∎
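As a quick illustration of (18) (not needed for the proof), take F(X)=X with Q_{1}=\mathcal{N}(\mu/2,\sigma^{2}) and Q_{2}=\mathcal{N}(-\mu/2,\sigma^{2}); the exact KL divergence \mu^{2}/(2\sigma^{2}) indeed dominates the bound \mu^{2}/(4\sigma^{2}+\mu^{2}):

```python
# KL between N(mu/2, s2) and N(-mu/2, s2) equals mu^2 / (2*s2); the lemma's
# bound is mu^2 / (4*s2 + mu^2), which is always smaller.
for mu in [0.1, 1.0, 3.0]:
    for s2 in [0.5, 1.0, 4.0]:
        kl = mu**2 / (2 * s2)
        bound = mu**2 / (4 * s2 + mu**2)
        assert kl >= bound
        print(f"mu={mu}, sigma^2={s2}: KL={kl:.4f} >= bound={bound:.4f}")
```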

With the above lemma, our strategy for establishing lower bounds for the KL divergence will be to lower bound the quantity \delta and upper bound the variances of both T_{k}(X) and T_{k}(Y). To control \delta, we will apply Lemma C.1. On the other hand, to control the variances, we will use the Hermite decomposition of T_{k} as a gateway to bring in heavy machinery from harmonic analysis. The following lemma remains unchanged from Lemma B.13 of the BRW paper [3].

Lemma C.6.

Fix a positive integer k. Let \zeta\in\mathbb{R}^{d} and suppose that Y\sim P_{\zeta}. Then for any symmetric tensors S_{1},\cdots,S_{k}, where S_{m}\in(\mathbb{R}^{d})^{\otimes m}, we have

\text{var}\left(\sum_{m=1}^{k}\frac{\langle S_{m},H_{m}(Y)\rangle}{(\sqrt{3}\sigma)^{2m}m!}\right)\leq e^{\frac{\left\lVert\zeta\right\rVert^{2}}{2\sigma^{2}}}\sum_{m=1}^{k}\frac{\left\lVert S_{m}\right\rVert^{2}}{(\sqrt{3}\sigma)^{2m}m!}.
Proof.

Let F(x)=\sum_{m=1}^{k}\frac{\langle S_{m},H_{m}(x)\rangle}{(\sqrt{3}\sigma)^{2m}m!}. To upper bound the variance, it suffices to upper bound the second moment \mathbb{E}[F(Y)^{2}]. Before we are able to bring in results from the theory of Gaussian spaces, we first need to replace P_{\zeta} with the centered multivariate normal distribution Z\sim P_{0}=\mathcal{N}(\boldsymbol{0},\sigma^{2}\boldsymbol{I}_{d}). To that end, we apply the Cauchy-Schwarz inequality to obtain

𝔼[F(Y)2]\displaystyle\mathbb{E}[F(Y)^{2}] =dfζ(x)F(x)2𝑑x\displaystyle=\int_{\mathbb{R}^{d}}f_{\zeta}(x)F(x)^{2}\ dx
(df0(x)F(x)4𝑑x)1/2(dfζ(x)2f0(x)𝑑x)1/2\displaystyle\leq\left(\int_{\mathbb{R}^{d}}f_{0}(x)F(x)^{4}\ dx\right)^{1/2}\left(\int_{\mathbb{R}^{d}}\frac{f_{\zeta}(x)^{2}}{f_{0}(x)}\ dx\right)^{1/2}
=𝔼[F(Z)4]1/2(dfζ(x)2f0(x)𝑑x)1/2.\displaystyle=\mathbb{E}\big{[}F(Z)^{4}\big{]}^{1/2}\left(\int_{\mathbb{R}^{d}}\frac{f_{\zeta}(x)^{2}}{f_{0}(x)}\ dx\right)^{1/2}. (19)

We first address the second term. This is done by proceeding in a similar fashion as in the proof of the upper bound of the modified Theorem 2.1.2. Observe that

dfζ(x)2f0(x)𝑑x\displaystyle\int_{\mathbb{R}^{d}}\frac{f_{\zeta}(x)^{2}}{f_{0}(x)}\ dx =1σd(2π)d/2d𝔼[exp(12σ2(x22xTGζ+ζ2))]2exp(12σ2x2)𝑑x.\displaystyle=\frac{1}{\sigma^{d}(2\pi)^{d/2}}\int_{\mathbb{R}^{d}}\frac{\mathbb{E}\big{[}\exp(-\frac{1}{2\sigma^{2}}(\left\lVert x\right\rVert^{2}-2x^{T}G\zeta+\left\lVert\zeta\right\rVert^{2}))\big{]}^{2}}{\exp(-\frac{1}{2\sigma^{2}}\left\lVert x\right\rVert^{2})}\ dx.

By applying Jensen’s inequality and then completing the square afterwards, we obtain

\ \ \ \ \int_{\mathbb{R}^{d}}\frac{f_{\zeta}(x)^{2}}{f_{0}(x)}\ dx
\leq\mathbb{E}\Bigg{[}\frac{1}{\sigma^{d}(2\pi)^{d/2}}\int_{\mathbb{R}^{d}}\frac{\exp\big{(}-\frac{1}{\sigma^{2}}(\left\lVert x\right\rVert^{2}-2x^{T}G\zeta+\left\lVert\zeta\right\rVert^{2})\big{)}}{\exp(-\frac{1}{2\sigma^{2}}\left\lVert x\right\rVert^{2})}\ dx\Bigg{]}
=\mathbb{E}\Bigg{[}\frac{1}{\sigma^{d}(2\pi)^{d/2}}\int_{\mathbb{R}^{d}}\exp\left(-\frac{1}{2\sigma^{2}}\left\lVert x\right\rVert^{2}+\frac{2}{\sigma^{2}}x^{T}G\zeta-\frac{2}{\sigma^{2}}\left\lVert\zeta\right\rVert^{2}\right)\ dx\cdot\exp\bigg{(}\frac{1}{\sigma^{2}}\left\lVert\zeta\right\rVert^{2}\bigg{)}\Bigg{]}
=\exp\bigg{(}\frac{1}{\sigma^{2}}\left\lVert\zeta\right\rVert^{2}\bigg{)}. (20)

We now come to the crux of the matter, which is to establish an upper bound on the first term (𝔼[F(Z)4])1/2(\mathbb{E}[F(Z)^{4}])^{1/2}. To accomplish that goal, we bring in some standard results about the Ornstein-Uhlenbeck semigroup, which is a family of operators Uρ:L2(d,γd)L2(d,γd)U_{\rho}:L^{2}(\mathbb{R}^{d},\gamma^{\otimes d})\to L^{2}(\mathbb{R}^{d},\gamma^{\otimes d}) defined by

Uρ(f)(z):=𝔼Z𝒩(𝟎,σ2𝑰d)[f(ρz+1ρ2Z)],ρ[1,1].\displaystyle U_{\rho}(f)(z):=\underset{Z^{\prime}\sim\mathcal{N}(\boldsymbol{0},\sigma^{2}\boldsymbol{I}_{d})}{\mathbb{E}}\Big{[}f\big{(}\rho z+\sqrt{1-\rho^{2}}\cdot Z^{\prime}\big{)}\Big{]},\ \ \ \ \ \ \rho\in[-1,1].

Here, we highlight that our definition of UρU_{\rho} differs from the standard definition in the literature in the sense that expectation is taken with respect to a multivariate normal distribution with covariance matrix σ2𝑰d\sigma^{2}\boldsymbol{I}_{d} as opposed to 𝑰d\boldsymbol{I}_{d} to compensate for the scaling in (15).(\ref{AppendixC5}).

The set {Hα:α0d}\{H_{\alpha}\ :\ \alpha\in\mathbb{N}^{d}_{0}\} is an eigenbasis for the family (Uρ)(U_{\rho}), with Uρ(Hα)=ρ|α|HαU_{\rho}(H_{\alpha})=\rho^{|\alpha|}H_{\alpha} [25, Proposition 11.37]. By viewing Sm,Hm(x)\langle S_{m},H_{m}(x)\rangle as a polynomial in xx, we get

Uρ(Sm,Hm(x))\displaystyle U_{\rho}\big{(}\langle S_{m},H_{m}(x)\rangle\big{)} =Uρ(1i1,,imd(Sm)i1im(Hm)i1im)\displaystyle=U_{\rho}\left(\sum_{1\leq i_{1},\cdots,i_{m}\leq d}(S_{m})_{i_{1}\cdots i_{m}}(H_{m})_{i_{1}\cdots i_{m}}\right)
=1i1,,imd(Sm)i1imUρ(Hα(i1im))\displaystyle=\sum_{1\leq i_{1},\cdots,i_{m}\leq d}(S_{m})_{i_{1}\cdots i_{m}}U_{\rho}\big{(}H_{\alpha^{(i_{1}\cdots i_{m})}}\big{)}
=ρm1i1,,imd(Sm)i1imHα(i1im)=ρmSm,Hm(x),\displaystyle=\rho^{m}\sum_{1\leq i_{1},\cdots,i_{m}\leq d}(S_{m})_{i_{1}\cdots i_{m}}H_{\alpha^{(i_{1}\cdots i_{m})}}=\rho^{m}\langle S_{m},H_{m}(x)\rangle,

where we have used the fact that |α(i1im)|=m|\alpha^{(i_{1}\cdots i_{m})}|=m. Thus if we define the degree kk polynomial

F~(x):=m=1kSm,Hm(x)(3)mσ2mm!,\displaystyle\widetilde{F}(x):=\sum_{m=1}^{k}\frac{\langle S_{m},H_{m}(x)\rangle}{(\sqrt{3})^{m}\sigma^{2m}m!},

then U1/3(F~)=F.U_{1/\sqrt{3}}(\widetilde{F})=F. Next, we will invoke the Gaussian hypercontractivity theorem:

Theorem C.7.

[25, Theorem 11.23] Let 1pq1\leq p\leq q\leq\infty and let fLp(n,γ)f\in L^{p}(\mathbb{R}^{n},\gamma), where γ\gamma is the standard Gaussian measure on n\mathbb{R}^{n}. Then Uρfqfp\left\lVert U_{\rho}f\right\rVert_{q}\leq\left\lVert f\right\rVert_{p} for 0ρp1q10\leq\rho\leq\sqrt{\frac{p-1}{q-1}}.

In our current context, the Gaussian hypercontractivity theorem implies that

𝔼[F(Z)4]1/2𝔼[F~(Z)2].\displaystyle\mathbb{E}\big{[}F(Z)^{4}\big{]}^{1/2}\leq\mathbb{E}\big{[}\widetilde{F}(Z)^{2}\big{]}.
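The eigenrelation U_{\rho}(H_{\alpha})=\rho^{|\alpha|}H_{\alpha} driving this step can itself be checked numerically in one dimension. The sketch below is illustrative only, taking \sigma=1 so that H_{k}=h_{k}, and approximates the expectation over Z^{\prime} by Monte Carlo:

```python
import numpy as np
from numpy.polynomial import hermite_e as He

rng = np.random.default_rng(3)
rho, k, n = 1 / np.sqrt(3), 3, 2_000_000
coeffs = np.zeros(k + 1); coeffs[k] = 1.0    # coefficient vector selecting h_k

for z in [-1.0, 0.5, 2.0]:
    Zp = rng.standard_normal(n)
    # U_rho h_k (z) = E[h_k(rho*z + sqrt(1-rho^2) * Z')] should equal rho^k h_k(z).
    u = He.hermeval(rho * z + np.sqrt(1 - rho**2) * Zp, coeffs).mean()
    print(u, rho**k * He.hermeval(z, coeffs))
```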

It remains to compute 𝔼[F~(Z)2]\mathbb{E}[\widetilde{F}(Z)^{2}]. Due to the orthogonality relations in (17), when we expand the square, most of the terms will vanish. Since SmS_{m} and Hm(x)H_{m}(x) are both symmetric tensors, for any tuple (i1,,im)[d]m(i_{1},\cdots,i_{m})\in[d]^{m}, the quantities (Sm)i1im(S_{m})_{i_{1}\cdots i_{m}} and (Hm(x))i1im(H_{m}(x))_{i_{1}\cdots i_{m}} depend only on the multi-set {i1,,im}.\{i_{1},\cdots,i_{m}\}. Thus for each α0d\alpha\in\mathbb{N}^{d}_{0} such that |α|=m|\alpha|=m, if we define

Sα:=(Sm)i1im\displaystyle S_{\alpha}:=(S_{m})_{i_{1}\cdots i_{m}}

where (i1,,im)(i_{1},\cdots,i_{m}) is any mm-tuple satisfying α(i1im)=α\alpha^{(i_{1}\cdots i_{m})}=\alpha, we have that

Sm,Hm(x)=|α|=mm!α!SαHα(x).\displaystyle\langle S_{m},H_{m}(x)\rangle=\sum_{\begin{subarray}{c}|\alpha|=m\end{subarray}}\frac{m!}{\alpha!}S_{\alpha}H_{\alpha}(x).

Applying the orthogonality relations in (17), we obtain

𝔼[F(Z)4]1/2𝔼[F~(Z)2]\displaystyle\mathbb{E}\big{[}F(Z)^{4}\big{]}^{1/2}\leq\mathbb{E}\big{[}\widetilde{F}(Z)^{2}\big{]} =m=1k𝔼[(1(3)mσ2mm!|α|=mm!α!SαHα(Z))2]\displaystyle=\sum_{m=1}^{k}\mathbb{E}\Bigg{[}\bigg{(}\frac{1}{(\sqrt{3})^{m}\sigma^{2m}m!}\sum_{\begin{subarray}{c}|\alpha|=m\end{subarray}}\frac{m!}{\alpha!}S_{\alpha}H_{\alpha}(Z)\bigg{)}^{2}\Bigg{]}
=m=1k13mσ4m|α|=mSα2α!2𝔼[Hα(Z)2]\displaystyle=\sum_{m=1}^{k}\frac{1}{3^{m}\sigma^{4m}}\sum_{\begin{subarray}{c}|\alpha|=m\end{subarray}}\frac{S_{\alpha}^{2}}{\alpha!^{2}}\cdot\mathbb{E}\big{[}H_{\alpha}(Z)^{2}\big{]}
=m=1k1(3σ)2mm!|α|=mm!α!Sα2\displaystyle=\sum_{m=1}^{k}\frac{1}{(\sqrt{3}\sigma)^{2m}m!}\sum_{\begin{subarray}{c}|\alpha|=m\end{subarray}}\frac{m!}{\alpha!}S_{\alpha}^{2}
=m=1kSm2(3σ)2mm!.\displaystyle=\sum_{m=1}^{k}\frac{\left\lVert S_{m}\right\rVert^{2}}{(\sqrt{3}\sigma)^{2m}m!}. (21)

Plugging in (20) and (21) back into (19) gives the desired conclusion. ∎

We now conclude our discussion in this section by putting together everything that was introduced. Note that while the constants in our proof differ from the original proof given in [3, Theorem 9], the spirit of the proof remains unchanged.

Proof.

As in the proof of the upper bound, we assume ρ(θ,ϕ)=θϕ,θ=1\rho(\theta,\phi)=\left\lVert\theta-\phi\right\rVert,\ \left\lVert\theta\right\rVert=1 and σ1\sigma\geq 1. By Lemma C.1, we have

δ=m=1kΔm(θ,ϕ)2(3σ)2mm!12ρ(θ,ϕ)2m=018mK02m3mm!12K02e6K02.\displaystyle\delta=\sum_{m=1}^{k}\frac{\left\lVert\Delta_{m}(\theta,\phi)\right\rVert^{2}}{(\sqrt{3}\sigma)^{2m}m!}\leq 12\rho(\theta,\phi)^{2}\sum_{m=0}^{\infty}\frac{18^{m}\cdot K_{0}^{2m}}{3^{m}m!}\leq 12K_{0}^{2}e^{6K_{0}^{2}}.

By Lemma C.6, since \left\lVert\theta\right\rVert^{2}=1 and \left\lVert\phi\right\rVert^{2}\leq(K_{0}+1)^{2}\leq 4K_{0}^{2}, the variances of T_{k}(X) and T_{k}(Y) are bounded above by e^{\frac{2K_{0}^{2}}{\sigma^{2}}}\delta. Now since 4e^{\frac{2K_{0}^{2}}{\sigma^{2}}}\leq 12K_{0}^{2}e^{6K_{0}^{2}} and \delta\leq 12K_{0}^{2}e^{6K_{0}^{2}}, applying Lemma C.5 gives

DKL(PθPϕ)δ24e2K02σ2δ+δ2δ224K02e6K02δ=124K02e6K02m=1kΔm(θ,ϕ)2(3σ)2mm!.\displaystyle D_{\text{KL}}(P_{\theta}\;\|\;P_{\phi})\geq\frac{\delta^{2}}{4e^{\frac{2K_{0}^{2}}{\sigma^{2}}}\delta+\delta^{2}}\geq\frac{\delta^{2}}{24K_{0}^{2}e^{6K_{0}^{2}}\delta}=\frac{1}{24}K_{0}^{-2}e^{-6K_{0}^{2}}\sum_{m=1}^{k}\frac{\left\lVert\Delta_{m}(\theta,\phi)\right\rVert^{2}}{(\sqrt{3}\sigma)^{2m}m!}.

Finally, as the summands are nonnegative, letting kk\to\infty gives the desired result. ∎