
The Bethe and Sinkhorn Permanents of Low Rank Matrices
and Implications for Profile Maximum Likelihood

Nima Anari
Stanford University
[email protected]
   Moses Charikar
Stanford University
[email protected]
Moses Charikar was supported by a Simons Investigator Award, a Google Faculty Research Award and an Amazon Research Award.
   Kirankumar Shiragur
Stanford University
[email protected]
Kirankumar Shiragur was supported by a Stanford Data Science Scholarship.
   Aaron Sidford
Stanford University
[email protected]
Aaron Sidford was supported by NSF CAREER Award CCF-1844855.
Abstract

In this paper we consider the problem of computing the likelihood of the profile of a discrete distribution, i.e., the probability of observing the multiset of element frequencies, and computing a profile maximum likelihood (PML) distribution, i.e., a distribution with the maximum profile likelihood. For each problem we provide polynomial time algorithms that, given $n$ i.i.d. samples from a discrete distribution, achieve an approximation factor of $\exp\left(-O(\sqrt{n}\log n)\right)$, improving upon the previous best-known bound achievable in polynomial time of $\exp(-O(n^{2/3}\log n))$ (Charikar, Shiragur and Sidford, 2019). Through the work of Acharya, Das, Orlitsky and Suresh (2016), this implies a polynomial time universal estimator for symmetric properties of discrete distributions over a broader range of error parameters.

We achieve these results by providing new bounds on the quality of approximation of the Bethe and Sinkhorn permanents (Vontobel, 2012 and 2014). We show that each of these is an $\exp(O(k\log(N/k)))$ approximation to the permanent of $N\times N$ matrices with non-negative rank at most $k$, improving upon the previously known bound of $\exp(O(N))$. To obtain our results on PML, we exploit the fact that the PML objective is proportional to the permanent of a certain Vandermonde matrix with $\sqrt{n}$ distinct columns, i.e. with non-negative rank at most $\sqrt{n}$. As a by-product of our work we establish a surprising connection between the convex relaxation in prior work (CSS19) and the well-studied Bethe and Sinkhorn approximations.

1 Introduction

Symmetric property estimation of distributions (throughout this paper, we use the word distribution to refer to discrete distributions) is an important and well studied problem in statistics and theoretical computer science. Given access to $n$ i.i.d. samples from a hidden discrete distribution $\textbf{p}$, the goal is to estimate $\textbf{f}(\textbf{p})$ for a symmetric property $\textbf{f}(\cdot)$. Formally, a property is symmetric if it is invariant to permuting the labels, i.e. it is a function of the multiset of probabilities and does not depend on the symbol labels. There are many well-studied such properties, including support size and coverage, entropy, distance to uniformity, Renyi entropy, and sorted $\ell_{1}$ distance. Understanding the computational and sample complexity of estimating these symmetric properties has led to an extensive line of interesting research over the past decade.

Symmetric property estimation has applications in many different fields. For instance, entropy estimation has found applications in neuroscience [RWdRvSB99], physics [VBB+12] and elsewhere [PW96, PGM+01]. Support size and coverage estimation were initially used to estimate ecological diversity [Cha84, CL92, BF93, CCG+12] and were subsequently applied in many other settings [ET76, TE87, Für05, KLR99, PBG+01, DS13, RCS+09, GTPB07, HHRB01]. For applications of other symmetric properties we refer the reader to [HJWW17, HJM17, AOST14, RVZ17, ZVV+16, WY16b, RRSS07, WY15, OSW16, VV11b, WY16a, JVHW15, JHW16, VV11a].

Early work on symmetric property estimation developed estimators tailored to the particular property of interest. Consequently, a fundamental and important open question was to come up with a universal estimator, i.e. a single estimator that could be used for all symmetric properties. A natural approach for constructing universal estimators is the plug-in approach: given samples, first compute a distribution independent of the property, and then output the value of the property on the computed distribution as the estimate.

Our approach is based on the observation (see [ADOS16]) that a sufficient statistic for estimating a symmetric property from a sequence of samples is the profile, i.e. the multiset of frequencies of symbols in the sequence; e.g. the profile of the sequence $abbc$ is $\{2,1,1\}$. We provide an efficient universal estimator based on the plug-in approach applied to the profile maximum likelihood (PML) distribution introduced by Orlitsky et al. [OSS+04]: given a sequence of $n$ samples, PML is the distribution that maximizes the likelihood of the observed profile. The problem of computing the PML distribution has been studied in several papers since, applying heuristic approaches such as the Bethe/Sinkhorn approximation [Von12, Von14], the EM algorithm [OSS+04], dynamic programming [PJW17] and algebraic methods [ADM+10].

A recent paper of Acharya et al. [ADOS16] showed that a plug-in estimator using the optimal PML distribution is universal in estimating various symmetric properties of distributions. In fact it suffices to compute a $\beta$-approximate PML distribution (i.e. a distribution that approximates the PML objective to within a factor of $\beta$) for $\beta>\exp(-n^{1-\delta})$ for constant $\delta>0$. Previous work of the authors in [CSS19] gave the first efficient algorithm to compute a $\beta$-approximate PML for some non-trivial $\beta$. In particular, [CSS19] gave a nearly linear running time algorithm to compute an $\exp(-O(n^{2/3}\log n))$-approximate PML distribution. In this work, we give an efficient algorithm to compute an $\exp(-O(\sqrt{n}\log n))$-approximate PML distribution.

The parameter $\beta$ in $\beta$-approximate PML affects the error parameter regime in which the estimator is sample complexity optimal. Smaller values of $\beta$ yield a universal estimator that is sample optimal over a broader parameter regime. For instance, [CSS19] show that an $\exp(-O(n^{2/3}\log n))$-approximate PML (throughout this paper, $\widetilde{O}(\cdot)$ hides $\mathrm{poly}\log n$ terms) is sample complexity optimal for estimating certain symmetric properties to accuracy $\epsilon$ for $\epsilon>n^{-0.16666}$. On the other hand, [ADOS16] showed that computing an $\exp(-O(\sqrt{n}\log n))$-approximate PML is sample complexity optimal for $\epsilon>n^{-0.249}$. However, note that under the current analysis techniques [ADOS16] it is unclear how to exploit the computation of an exact PML any better than that of an $\exp(-O(\sqrt{n}\log n))$-approximate PML: both are sample complexity optimal over the same error parameter regime.

In our work, we use the Bethe approximation of the permanent, or the Bethe permanent for short, a previously proposed heuristic to compute an approximate PML distribution. It is based on the Bethe free energy approximation originating in statistical physics and is very closely connected to the belief propagation algorithm [YFW05, Von13]. The idea of using the Bethe permanent for computing an approximate PML distribution comes from the fact that the likelihood of a profile with respect to a distribution can be written as the permanent of a non-negative Vandermonde matrix (which we call the profile probability matrix). For an $N\times N$ non-negative matrix, [GS14] showed that the ratio between the permanent and the Bethe permanent is upper bounded by $1.9022^{N}\leq 2^{N}$, which was later improved to $\sqrt{2}^{N}$ [AR18] (note that previous results on the Bethe permanent do not immediately imply non-trivial results for PML; for consistency with the literature, we use approximation factors $<1$ for PML). A natural question is whether the approximation ratio of the Bethe permanent depends on some structural parameter of the matrix finer than its dimension. In this work, we show that the approximation ratio between the permanent and the Bethe permanent is upper bounded by an exponential in the non-negative rank of the matrix (up to a logarithmic factor). We also give an explicit construction of a matrix showing that our result for this structural parameter is asymptotically tight. As the non-negative rank of any $N\times N$ non-negative matrix is at most $N$, our analysis implies an upper bound of $c^{N}$ for some constant $c>0$ on the approximation ratio. Therefore our work (asymptotically) generalizes previous results for general non-negative matrices.

To obtain our efficient algorithm, we prove a slightly stronger statement than the bound on the Bethe permanent of a matrix with non-negative rank at most $k$. We show that a scaling of a simpler approximation of the permanent, known as the Sinkhorn permanent (also called capacity in the literature), approximates the permanent up to a factor exponential in the non-negative rank of the matrix (up to log factors). This implies our bound for the Bethe permanent and shows that scaled Sinkhorn is a compelling alternative to Sinkhorn, with a tighter worst-case multiplicative approximation to the permanent.

An immediate application of our work on the Bethe and the scaled Sinkhorn permanents is to approximating PML. Given $n$ samples, the number of distinct columns in the profile probability matrix is always upper bounded by $\sqrt{n}$, i.e. its non-negative rank is at most $\sqrt{n}$. Therefore our analysis of the scaled Sinkhorn permanent immediately implies an $\exp\left(-O(\sqrt{n}\log n)\right)$ approximation to the PML objective with respect to a fixed distribution. This result, combined with probability discretization, yields a convex program whose optimal solution is a fractional representation of an approximate PML distribution. We round this fractional solution to output a valid distribution that is an $\exp\left(-O(\sqrt{n}\log n)\right)$-approximate PML distribution. Surprisingly, the resulting convex program is exactly the same as the one in [CSS19], where a completely different (combinatorial) technique was used to arrive at it. Our work here provides a better analysis of the convex program in [CSS19], together with a more delicate and sophisticated rounding algorithm.

Organization of the paper:

In Section 2 we present preliminaries. In Section 3, we provide the main results of the paper. In Section 4, we analyze the scaled Sinkhorn permanent of structured matrices. In Section 4.1, we prove an upper bound on the approximation ratio of the scaled Sinkhorn permanent to the permanent as a function of the number of distinct columns. In Section 4.2, we generalize this result to low non-negative rank matrices. In Section 5, we prove the lower bound for the Bethe and scaled Sinkhorn approximations of the permanent. In Section 6, we combine the result for the scaled Sinkhorn permanent with the idea of probability discretization to obtain a convex program that returns a fractional representation of an approximate PML distribution. In the same section, we provide the rounding algorithm that returns a valid approximate PML distribution.

1.1 Overview of Techniques

In [CSS19], the authors presented a convex relaxation for the PML objective, obtained from a combinatorial view of the PML problem. In a sequence of steps, they discretized the set of probabilities and the frequencies, grouped the terms in the objective, and developed a relaxation for the sum of terms in the largest group, giving an $\exp(-O(n^{2/3}\log n))$ approximation. In this paper, we exploit the fact that the likelihood of a profile with respect to a distribution is the permanent of a certain non-negative Vandermonde matrix (referred to here as the profile probability matrix with respect to a distribution) and that the PML objective is an optimization problem over such permanents. We work with the same convex relaxation we derived earlier, but relate it to the well known Bethe and scaled Sinkhorn approximations for the permanent. In fact, Vontobel [Von12, Von14] proposed the Bethe and Sinkhorn permanents as a heuristic approximation of the PML objective, but bounding the quality of the solution was an open problem [Von11]. We show that both the Bethe and scaled Sinkhorn permanents are within a factor $\exp\left(-O(\sqrt{n}\log n)\right)$ of the PML objective. En route, we show that the approximation ratios of the Bethe and scaled Sinkhorn permanents for any non-negative matrix $\textbf{A}$ are upper bounded by an exponential in the non-negative rank of $\textbf{A}$. This strengthens the well known $\exp\left(O(N)\right)$ upper bound on the approximation ratio of both the Bethe and scaled Sinkhorn permanents of an $N\times N$ matrix.

In [CSS19], the fact that the convex problem we obtained was a relaxation of the PML objective followed directly from the combinatorial derivation of the relaxation. By contrast, our analysis here exploits the non-trivial fact that the Bethe and scaled Sinkhorn approximations are lower bounds for the permanent of a non-negative matrix. The Bethe and scaled Sinkhorn permanents of the profile probability matrix $\textbf{A}$ with respect to a fixed distribution are optimum solutions to maximization problems over doubly stochastic matrices $\textbf{Q}$, where the objective functions have entropy-like terms involving the entries of $\textbf{A}$ and $\textbf{Q}$. In order to obtain an upper bound on the Bethe and scaled Sinkhorn approximations as a function of the non-negative rank, we show the existence of a doubly stochastic matrix $\textbf{Q}$ as a witness such that the Bethe and scaled Sinkhorn objectives w.r.t. $\textbf{Q}$ upper bound the permanent of $\textbf{A}$ within the desired factor.

We first work with a simpler setting of matrices $\textbf{A}$ with at most $k$ distinct columns. (In the final preparation of this paper for posting, an anonymous reviewer showed that a simpler proof for the distinct column case can be derived using Corollary 3.4.5 of Barvinok's book [Bar17]; the proof of that corollary in turn uses the famous Bregman-Minc inequality. We thank the anonymous reviewer for this and include the derivation in Appendix A. In contrast, our proof is self-contained and we believe it provides further insight into the structure of the Sinkhorn/Bethe approximations. See Section 3.1 for further details.) Here we consider a modified matrix $\hat{\textbf{A}}$ that contains the $k$ distinct columns of $\textbf{A}$. We define a distribution $\mu$ on permutations of the domain where the probability of a permutation $\sigma$ is proportional to its contribution to the permanent of $\textbf{A}$. There is a many-to-one mapping from such permutations to 0-1 $N\times k$ matrices with row sums 1 and column sums $\phi_{j}$, the number of times the $j$th column of $\hat{\textbf{A}}$ appears in $\textbf{A}$. We next define an $N\times k$ real-valued, non-negative matrix $\textbf{P}$ with row sums 1 and column sums $\phi_{j}$, in terms of the marginals of the distribution $\mu$. We also define a different distribution $\nu$ on 0-1 $N\times k$ row-stochastic matrices by independent sampling from $\textbf{P}$. Finally, we use the fact that the KL-divergence between $\mu$ and $\nu$ is non-negative to get the required upper bound on the scaled Sinkhorn approximation with a doubly stochastic witness $\textbf{Q}$ (obtained from $\textbf{P}$). This proof technique is inspired by the recent work of Anari and Rezaei [AR18] that gives a tight $\sqrt{2}^{N}$ bound on the approximation ratio of the Bethe approximation for the permanent of an $N\times N$ non-negative matrix.

Though this bound on the quality of the Bethe and scaled Sinkhorn approximations for non-negative matrices with $k$ distinct columns suffices for our PML applications, interestingly we show that it can be extended to non-negative matrices with bounded non-negative rank. In order to obtain an upper bound on the Bethe and scaled Sinkhorn approximations as a function of the non-negative rank of $\textbf{A}$, recall that we need to show the existence of a suitable doubly stochastic witness $\textbf{Q}$ which certifies the required guarantee. We express the permanent of $\textbf{A}$ as a sum of $O(\exp(k\log(N/k)))$ terms of the form $\mathrm{perm}(U)\,\mathrm{perm}(V)$ where the matrices $U$ and $V$ have at most $k$ distinct columns. We focus on the largest of these terms, and construct a doubly stochastic witness $\textbf{Q}$ for the matrix $\textbf{A}$ from the witnesses for the matrices $U$ and $V$ in this largest term. This doubly stochastic witness $\textbf{Q}$ certifies the required guarantee, and we get an upper bound on the scaled Sinkhorn approximation as a function of the non-negative rank. This result for the scaled Sinkhorn approximation further implies an upper bound for the Bethe approximation.

Even with this improved bound on the quality of the Bethe and scaled Sinkhorn approximations as applied to the PML objective, challenges remain in obtaining an improved approximate PML distribution. In particular, we do not know of an efficient algorithm to maximize the Bethe or the scaled Sinkhorn permanent of the profile probability matrix over a family of distributions, as would be needed to compute the Bethe or the scaled Sinkhorn approximation to the optimum of the PML objective. Prior work by Vontobel suggests an alternating maximization approach, but this is only guaranteed to produce a local optimum. To address this, we revisit the efficiently computable convex relaxation from [CSS19] and show that it is suitably close to the scaled Sinkhorn approximation. This is quite surprising, as the prior derivation of this relaxation in [CSS19] was purely combinatorial and had nothing to do with the scaled Sinkhorn approximation.

The final challenge towards obtaining our PML results is to round the fractional solution produced so that the approximation guarantee is preserved. The rounding procedure from [CSS19] does not immediately suffice, but we present a more sophisticated and delicate rounding procedure that does give us the required approximation guarantee. The rounding algorithm proceeds in three steps: in the first step we apply a procedure analogous to [CSS19] to handle large probability values, and in the later steps we provide a new procedure for the smaller probability values; in each step, we ensure that the objective function does not drop significantly. The input to the rounding procedure is a matrix whose rows correspond to discretized probability values and whose columns correspond to distinct frequencies. We create rows corresponding to new probability values in the course of the rounding algorithm, maintain column sums, and eventually ensure that all row sums are integral while the objective function has not dropped significantly.

2 Preliminaries

Let $[a,b]$ and $[a,b]_{\mathbb{R}}$ denote the interval of integers and reals $\geq a$ and $\leq b$ respectively. Let $\mathcal{D}$ be the domain of elements and $N\stackrel{\mathrm{def}}{=}|\mathcal{D}|$ be its size. Let $\textbf{A}\in\mathbb{R}^{\mathcal{D}\times\mathcal{D}}$ be a non-negative matrix, whose $(x,y)$'th entry is denoted by $\textbf{A}_{x,y}$. We further use $\textbf{A}_{x:}$ and $\textbf{A}_{:y}$ to denote the row and column corresponding to $x$ and $y$ respectively. The non-negative rank of a non-negative matrix $\textbf{A}\in\mathbb{R}^{\mathcal{D}\times\mathcal{D}}$ is the smallest number $k$ such that there exist non-negative vectors $\textbf{v}_{j},\textbf{u}_{j}\in\mathbb{R}^{\mathcal{D}}$ for $j\in[1,k]$ with $\textbf{A}=\sum_{j\in[1,k]}\textbf{v}_{j}\textbf{u}_{j}^{\top}$. Let $S_{\mathcal{D}}$ be the set of all permutations of the domain $\mathcal{D}$, and denote a permutation $\sigma$ by $\sigma=\{(x,\sigma(x))\text{ for all }x\in\mathcal{D}\}$. The permanent of a matrix $\textbf{A}$, denoted $\mathrm{perm}(\textbf{A})$, is defined as follows,

$$\mathrm{perm}(\textbf{A})\stackrel{\mathrm{def}}{=}\sum_{\sigma\in S_{\mathcal{D}}}\prod_{e\in\sigma}\textbf{A}_{e}.$$
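
For intuition, here is a minimal brute-force sketch of this definition in Python (exponential time, suitable only for tiny sanity checks; the function name is ours):

```python
import itertools
from math import prod

def permanent(A):
    """Brute-force permanent: sum over all permutations sigma of
    prod_x A[x][sigma(x)].  Exponential time; for tiny matrices only."""
    n = len(A)
    return sum(prod(A[x][s[x]] for x in range(n))
               for s in itertools.permutations(range(n)))

# The permanent of the 3x3 all-ones matrix is 3! = 6.
print(permanent([[1.0] * 3 for _ in range(3)]))  # 6.0
```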

Let $\mathbf{K}_{rc}\subseteq\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}$ be the set of all non-negative matrices that are doubly stochastic. For any matrix $\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}$ and $\textbf{Q}\in\mathbf{K}_{rc}$, we define the following set of functions:

$$\mathrm{U}(\textbf{A},\textbf{Q})\stackrel{\mathrm{def}}{=}\sum_{(x,y)\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\log\left(\frac{\textbf{A}_{x,y}}{\textbf{Q}_{x,y}}\right)\quad\text{and}\quad\mathrm{V}(\textbf{Q})=\sum_{(x,y)\in\mathcal{D}\times\mathcal{D}}(1-\textbf{Q}_{x,y})\log\left(1-\textbf{Q}_{x,y}\right). \tag{1}$$

Further,

$$\mathrm{F}(\textbf{A},\textbf{Q})\stackrel{\mathrm{def}}{=}\mathrm{U}(\textbf{A},\textbf{Q})+\mathrm{V}(\textbf{Q}).$$

Using these definitions, we define the Bethe permanent of a matrix.

Definition 2.1.

For a matrix $\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}$, the Bethe permanent of $\textbf{A}$ is defined as follows,

$$\mathrm{bethe}(\textbf{A})\stackrel{\mathrm{def}}{=}\max_{\textbf{Q}\in\mathbf{K}_{rc}}\exp\left(\mathrm{F}(\textbf{A},\textbf{Q})\right).$$
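
As a sketch of how this objective behaves (with the usual convention $0\log 0=0$; function names are ours): since the Bethe permanent is a maximum over $\mathbf{K}_{rc}$, evaluating $\exp(\mathrm{F}(\textbf{A},\textbf{Q}))$ at any doubly stochastic $\textbf{Q}$ yields a lower bound on $\mathrm{bethe}(\textbf{A})$, and hence, by Lemma 2.2 below, on $\mathrm{perm}(\textbf{A})$.

```python
import math

def U(A, Q):
    """U(A, Q) = sum_{x,y} Q_{x,y} log(A_{x,y} / Q_{x,y}); 0 log 0 = 0.
    Assumes Q_{x,y} > 0 only where A_{x,y} > 0."""
    return sum(q * math.log(a / q)
               for Arow, Qrow in zip(A, Q)
               for a, q in zip(Arow, Qrow) if q > 0)

def V(Q):
    """V(Q) = sum_{x,y} (1 - Q_{x,y}) log(1 - Q_{x,y}); the q = 1 terms vanish."""
    return sum((1 - q) * math.log(1 - q) for row in Q for q in row if q < 1)

def F(A, Q):
    return U(A, Q) + V(Q)

# Any doubly stochastic Q gives exp(F(A, Q)) <= bethe(A) <= perm(A).
# For the 3x3 all-ones A and uniform Q this prints 64/27 ~ 2.37 <= 6.
A = [[1.0] * 3 for _ in range(3)]
Q = [[1 / 3] * 3 for _ in range(3)]
print(math.exp(F(A, Q)))
```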

A well known and important result about the Bethe permanent is that it lower bounds the permanent of a non-negative matrix; we state this result next.

Lemma 2.2 ([Gur11] based on [Sch98]).

For any non-negative $\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}$, the following holds,

$$\mathrm{bethe}(\textbf{A})\leq\mathrm{perm}(\textbf{A})~.$$

We next define the Sinkhorn permanent of a matrix, and then state the relationship between the Bethe and the Sinkhorn permanents.

Definition 2.3.

For a matrix $\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}$, the Sinkhorn permanent of $\textbf{A}$ is defined as follows,

$$\mathrm{sinkhorn}(\textbf{A})\stackrel{\mathrm{def}}{=}\max_{\textbf{Q}\in\mathbf{K}_{rc}}\exp\left(\mathrm{U}(\textbf{A},\textbf{Q})\right).$$

To establish the relationship between the Bethe and the Sinkhorn permanent we need the following lemma from [GS14].

Lemma 2.4 (Proposition 3.1 in [GS14]).

For any distribution $\textbf{p}\in\mathbb{R}_{\geq 0}^{\mathcal{D}}$, the following holds,

$$\sum_{x\in\mathcal{D}}(1-\textbf{p}_{x})\log(1-\textbf{p}_{x})\geq-1.$$

For any matrix $\textbf{Q}\in\mathbf{K}_{rc}$, each row of $\textbf{Q}$ is a distribution; therefore the following holds,

$$\mathrm{V}(\textbf{Q})\geq-N.$$

As a corollary of the above inequality we have,

Corollary 2.5.

For any non-negative matrix $\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}$, the following inequality holds,

$$\exp(-N)\,\mathrm{sinkhorn}(\textbf{A})\leq\mathrm{bethe}(\textbf{A}).$$

Later we will see that it is more convenient to work with $\exp(-N)\,\mathrm{sinkhorn}(\textbf{A})$ than with $\mathrm{sinkhorn}(\textbf{A})$ itself; we call this expression the scaled Sinkhorn permanent and formally define it next.

Definition 2.6.

For a matrix $\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}$, the scaled Sinkhorn permanent of $\textbf{A}$ is defined as follows,

$$\mathrm{scaledsinkhorn}(\textbf{A})\stackrel{\mathrm{def}}{=}\max_{\textbf{Q}\in\mathbf{K}_{rc}}\exp\left(\mathrm{U}(\textbf{A},\textbf{Q})-N\right).$$

The above expression can be equivalently stated as,

$$\mathrm{scaledsinkhorn}(\textbf{A})=\exp(-N)\,\mathrm{sinkhorn}(\textbf{A}).$$
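
The maximizing $\textbf{Q}$ above has a classical characterization not stated in this paper: for a strictly positive $\textbf{A}$, a short Lagrangian computation shows the optimum of $\mathrm{U}(\textbf{A},\textbf{Q})$ is attained at the doubly stochastic scaling $D_{1}\textbf{A}D_{2}$ of $\textbf{A}$, which Sinkhorn's alternating row/column normalization converges to (hence the name). A minimal sketch under that assumption (function names are ours):

```python
import numpy as np

def sinkhorn_scale(A, iters=2000):
    """Alternately normalize rows and columns of a positive matrix A.

    Converges to the doubly stochastic matrix Q = D1 A D2, which
    maximizes U(A, Q) over doubly stochastic Q (standard fact)."""
    Q = np.array(A, dtype=float)
    for _ in range(iters):
        Q /= Q.sum(axis=1, keepdims=True)  # make rows sum to 1
        Q /= Q.sum(axis=0, keepdims=True)  # make columns sum to 1
    return Q

def scaled_sinkhorn(A):
    A = np.array(A, dtype=float)
    N = A.shape[0]
    Q = sinkhorn_scale(A)
    U = np.sum(Q * np.log(A / Q))
    return np.exp(U - N)

# On the 3x3 all-ones matrix: perm = 6, sinkhorn = 3^3 = 27, and the
# scaled Sinkhorn value is 27 * exp(-3) ~ 1.34 <= 6, as Corollary 2.7
# below requires.
print(scaled_sinkhorn([[1.0] * 3 for _ in range(3)]))
```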

Combining Lemma 2.2 and Corollary 2.5 we get the following result.

Corollary 2.7.

For any matrix $\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}$, the following inequality holds,

$$\mathrm{scaledsinkhorn}(\textbf{A})\leq\mathrm{bethe}(\textbf{A}),$$

which further implies,

$$\mathrm{scaledsinkhorn}(\textbf{A})\leq\mathrm{perm}(\textbf{A}).$$

Beyond these approximations to the permanent of a matrix, we next state two important results that will be helpful throughout the paper. The first is Stirling's approximation for the factorial function and the second is the non-negativity of the KL divergence between two distributions.

Lemma 2.8 (Stirling’s approximation).

For all $n\in\mathbb{Z}_{+}$, the following holds:

$$\exp(n\log n-n)\leq n!\leq O(\sqrt{n})\exp(n\log n-n).$$
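
A quick numeric check of this bound; the constant hidden in the $O(\sqrt{n})$ factor is $\sqrt{2\pi}$, by the sharper form of Stirling's formula $n!\approx\sqrt{2\pi n}\,(n/e)^{n}$:

```python
import math

# The ratio n! / exp(n log n - n) grows like sqrt(2 pi n), so it sits
# between 1 and O(sqrt(n)) as the lemma states.
for n in (5, 10, 50):
    ratio = math.factorial(n) / math.exp(n * math.log(n) - n)
    print(n, ratio, math.sqrt(2 * math.pi * n))
```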

Let $\mu$ and $\nu$ be distributions defined on some domain $\Omega$. The KL divergence $\mathrm{KL}\left(\mu\|\nu\right)$ between the distributions $\mu$ and $\nu$ is defined as follows,

$$\mathrm{KL}\left(\mu\|\nu\right)\stackrel{\mathrm{def}}{=}\sum_{\textbf{X}\in\Omega}\mu(\textbf{X})\log\frac{\mu(\textbf{X})}{\nu(\textbf{X})}=\mathbb{E}_{\textbf{X}\sim\mu}\left[\log\mu(\textbf{X})\right]-\mathbb{E}_{\textbf{X}\sim\mu}\left[\log\nu(\textbf{X})\right].$$
Lemma 2.9 (Non-negativity of KL divergence).

For any distributions $\mu$ and $\nu$ defined on a domain $\Omega$, the KL divergence between $\mu$ and $\nu$ satisfies,

$$\mathrm{KL}\left(\mu\|\nu\right)\geq 0.$$
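
A small numeric illustration of Lemma 2.9 (the function name is ours; the convention $0\log 0=0$ is handled by skipping zero-probability terms):

```python
import math

def kl(mu, nu):
    """KL(mu || nu) for two distributions given as lists of probabilities."""
    return sum(p * math.log(p / q) for p, q in zip(mu, nu) if p > 0)

# Always non-negative, and zero iff mu == nu.
print(kl([0.5, 0.3, 0.2], [0.2, 0.3, 0.5]))  # > 0
print(kl([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))  # 0.0
```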

In the remainder of this section we provide formal definitions related to PML.

2.1 Profile maximum likelihood

Let $\Delta^{\mathcal{D}}\subset[0,1]_{\mathbb{R}}^{\mathcal{D}}$ be the set of all discrete distributions supported on domain $\mathcal{D}$. From here on we use the word distribution to refer to discrete distributions. Throughout this paper we assume that we receive a sequence of $n$ independent samples from an underlying distribution $\textbf{p}\in\Delta^{\mathcal{D}}$. Let $\mathcal{D}^{n}$ be the set of all length-$n$ sequences and $y^{n}\in\mathcal{D}^{n}$ be one such sequence, with $y^{n}_{i}$ denoting its $i$th element. The probability of observing sequence $y^{n}$ is:

$$\mathbb{P}(\textbf{p},y^{n})\stackrel{\mathrm{def}}{=}\prod_{x\in\mathcal{D}}\textbf{p}_{x}^{\textbf{f}(y^{n},x)}$$

where $\textbf{f}(y^{n},x)=|\{i\in[n]~|~y^{n}_{i}=x\}|$ is the frequency/multiplicity of symbol $x$ in sequence $y^{n}$ and $\textbf{p}_{x}$ is the probability of domain element $x\in\mathcal{D}$.

For any given sequence, one can define its profile (a histogram of the histogram, or fingerprint), which is a sufficient statistic for symmetric property estimation.

Definition 2.10 (Profile).

For any sequence $y^{n}\in\mathcal{D}^{n}$, let $\textbf{M}=\{\textbf{f}(y^{n},x)\}_{x\in\mathcal{D}}\backslash\{0\}$ be the set of all its non-zero distinct frequencies and $\textbf{m}_{1},\textbf{m}_{2},\dots,\textbf{m}_{|\textbf{M}|}$ be the elements of the set $\textbf{M}$. The profile of a sequence $y^{n}\in\mathcal{D}^{n}$, denoted $\phi=\Phi(y^{n})\in\mathbb{Z}_{+}^{|\textbf{M}|}$, is

$$\phi\stackrel{\mathrm{def}}{=}(\phi_{j})_{j\in[1,|\textbf{M}|]}\text{ , where }\phi_{j}=\phi_{j}(y^{n})\stackrel{\mathrm{def}}{=}|\{x\in\mathcal{D}~|~\textbf{f}(y^{n},x)=\textbf{m}_{j}\}|$$

is the number of domain elements with frequency $\textbf{m}_{j}$ in $y^{n}$ (the profile does not contain information about the number of unseen domain elements). We call $n$ the length of profile $\phi$; as a function of the profile $\phi$, $n=\sum_{j\in[1,|\textbf{M}|]}\textbf{m}_{j}\cdot\phi_{j}$. Let $\Phi^{n}$ denote the set of all profiles of length $n$. We use $k=|\textbf{M}|$ to denote the number of distinct frequencies in the profile $\phi$ (note that the number of distinct frequencies in a length-$n$ sequence is always upper bounded by $\sqrt{n}$). For convenience we use $\overrightarrow{\mathrm{m}}\in\mathbb{Z}_{+}^{k}$ to denote the vector of observed frequencies, so $\overrightarrow{\mathrm{m}}_{j}=\textbf{m}_{j}$ for all $j\in[1,k]$.
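
A short sketch of this definition (the function name is ours). It returns the representation $(\textbf{m},\phi)$; the multiset view from the introduction, e.g. the profile $\{2,1,1\}$ of $abbc$, corresponds to $\textbf{m}=[1,2]$, $\phi=[2,1]$:

```python
from collections import Counter

def profile(seq):
    """Return (distinct non-zero frequencies m, profile phi) of a sequence.

    phi[j] counts the domain elements whose frequency in seq equals m[j].
    """
    freq = Counter(seq)                  # f(y^n, x) for each seen symbol x
    m = sorted(set(freq.values()))       # distinct non-zero frequencies
    phi_counts = Counter(freq.values())  # how often each frequency occurs
    phi = [phi_counts[mj] for mj in m]
    return m, phi

# Two symbols seen once, one symbol seen twice.
print(profile("abbc"))  # ([1, 2], [2, 1])
```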

For any distribution $\textbf{p}\in\Delta^{\mathcal{D}}$, the probability of a profile $\phi\in\Phi^{n}$ is defined as:

$$\mathbb{P}(\textbf{p},\phi)\stackrel{\mathrm{def}}{=}\sum_{\{y^{n}\in\mathcal{D}^{n}~|~\Phi(y^{n})=\phi\}}\mathbb{P}(\textbf{p},y^{n}) \tag{2}$$

Let $x^{n}$ be a sequence such that $\Phi(x^{n})=\phi$. We define a profile probability matrix $\textbf{A}^{\textbf{p},\phi}$ with respect to sequence $x^{n}$ (and therefore profile $\phi$) and distribution $\textbf{p}$ as follows,

$$\textbf{A}^{\textbf{p},\phi}_{z,y}\stackrel{\mathrm{def}}{=}\textbf{p}_{z}^{\textbf{f}_{y}}\text{ for all }z,y\in\mathcal{D}, \tag{3}$$

where $\textbf{f}_{y}\stackrel{\mathrm{def}}{=}\textbf{f}(x^{n},y)$ is the frequency of domain element $y\in\mathcal{D}$ in sequence $x^{n}$ (recall $\Phi(x^{n})=\phi$). We are interested in the permanent of the matrix $\textbf{A}^{\textbf{p},\phi}$; note that $\mathrm{perm}(\textbf{A}^{\textbf{p},\phi})$ is invariant under the choice of sequence $x^{n}$ satisfying $\Phi(x^{n})=\phi$. Therefore we index the matrix $\textbf{A}^{\textbf{p},\phi}$ with the profile $\phi$ rather than the sequence $x^{n}$ itself. The number of distinct columns in $\textbf{A}^{\textbf{p},\phi}$ is equal to the number of distinct observed frequencies plus one (for the unseen), i.e. $k+1$.

The probability of a profile $\phi\in\Phi^{n}$ with respect to distribution $\textbf{p}$ (from Equation 20 in [OSZ03], Equation 15 in [PJW17]), in terms of the permanent of the matrix $\textbf{A}^{\textbf{p},\phi}$, is given below:

$$\mathbb{P}(\textbf{p},\phi)=C_{\phi}\cdot\left(\prod_{j\in[0,k]}\frac{1}{\phi_{j}!}\right)\cdot\mathrm{perm}(\textbf{A}^{\textbf{p},\phi}) \tag{4}$$

where $C_{\phi}\stackrel{\mathrm{def}}{=}\frac{n!}{\prod_{j\in[1,k]}(\textbf{m}_{j}!)^{\phi_{j}}}$ and $\phi_{0}$ is the number of unseen domain elements (given a distribution $\textbf{p}$, we know its domain $\mathcal{D}$ and therefore the value of $\phi_{0}$).
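
A brute-force sketch of Equation (4) (function names are ours): it builds the profile probability matrix of Equation (3) for a small domain and evaluates $\mathbb{P}(\textbf{p},\phi)$. For $\textbf{p}=(0.5,0.3,0.2)$ and the profile of the sequence $ab$, summing $\mathbb{P}(\textbf{p},y^{2})$ over the six length-2 sequences with two distinct symbols also gives $0.62$, matching the formula.

```python
import itertools
from collections import Counter
from math import factorial, prod

def permanent(A):
    """Brute-force permanent, for tiny matrices only."""
    n = len(A)
    return sum(prod(A[x][s[x]] for x in range(n))
               for s in itertools.permutations(range(n)))

def profile_probability(p, xn, domain):
    """P(p, phi) via Equation (4), for the profile phi of sequence xn.

    p: dict mapping each element of `domain` to its probability."""
    n = len(xn)
    f = Counter(xn)                                   # frequency of each symbol
    A = [[p[z] ** f[y] for y in domain] for z in domain]   # A^{p,phi}, Eq. (3)
    phi = Counter(f[y] for y in domain)               # includes phi_0 (unseen)
    m = [mj for mj in phi if mj > 0]                  # distinct non-zero freqs
    C = factorial(n) / prod(factorial(mj) ** phi[mj] for mj in m)
    return C * prod(1 / factorial(phi[mj]) for mj in phi) * permanent(A)

p = {"a": 0.5, "b": 0.3, "c": 0.2}
print(profile_probability(p, "ab", ["a", "b", "c"]))  # 0.62
```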

The distribution that maximizes the probability of a profile $\phi$ is the profile maximum likelihood distribution, which we formally define next.

Definition 2.11 (Profile maximum likelihood).

For any profile $\phi\in\Phi^{n}$, a profile maximum likelihood (PML) distribution $\textbf{p}_{\mathrm{pml},\phi}\in\Delta^{\mathcal{D}}$ is:

$$\textbf{p}_{\mathrm{pml},\phi}\in\operatorname*{arg\,max}_{\textbf{p}\in\Delta^{\mathcal{D}}}\mathbb{P}(\textbf{p},\phi)$$

and $\mathbb{P}(\textbf{p}_{\mathrm{pml},\phi},\phi)$ is the maximum PML objective value.

The central goal of this paper is to give efficient algorithms for computing approximate PML distributions, defined as follows.

Definition 2.12 (Approximate PML).

For any profile $\phi\in\Phi^{n}$, a distribution $\textbf{p}^{\beta}_{\mathrm{pml},\phi}\in\Delta^{\mathcal{D}}$ is a $\beta$-approximate PML distribution if

$$\mathbb{P}(\textbf{p}^{\beta}_{\mathrm{pml},\phi},\phi)\geq\beta\cdot\mathbb{P}(\textbf{p}_{\mathrm{pml},\phi},\phi)~.$$

3 Results

Here we state the main results of this paper. In our first class of main results, we improve the analysis of the scaled Sinkhorn permanent for structured non-negative matrices. We first show that the scaled Sinkhorn permanent approximates the permanent of a non-negative matrix $\textbf{A}$, where the approximation factor (up to log factors) depends exponentially on the non-negative rank of the matrix $\textbf{A}$. We formally state this result next.

Theorem 3.1 (Scaled Sinkhorn permanent approximation for low non-negative rank matrices).

For any matrix $\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}$ with non-negative rank at most $k$, the following inequality holds,

$$\mathrm{scaledsinkhorn}(\textbf{A})\leq\mathrm{perm}(\textbf{A})\leq\exp\left(O\left(k\log\frac{N}{k}\right)\right)\mathrm{scaledsinkhorn}(\textbf{A}). \tag{5}$$

Further, using $\mathrm{scaledsinkhorn}(\textbf{A})\leq\mathrm{bethe}(\textbf{A})$ (see Corollary 2.7) and $\mathrm{bethe}(\textbf{A})\leq\mathrm{perm}(\textbf{A})$ (see Lemma 2.2), we immediately get the same result for the Bethe permanent.

Corollary 3.2 (Bethe permanent approximation for low non-negative rank matrices).

For any matrix $\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}$ with non-negative rank at most $k$, the following inequality holds,

$$\mathrm{bethe}(\textbf{A})\leq\mathrm{perm}(\textbf{A})\leq\exp\left(O\left(k\log\frac{N}{k}\right)\right)\mathrm{bethe}(\textbf{A}). \tag{6}$$

Interestingly, in the worst case, Sinkhorn is an $e^{N}$ approximation to the permanent of $\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}$, even when $\textbf{A}$ has at most 1 distinct column (e.g. consider the all-ones matrix). Consequently, for matrices with non-negative rank at most $k$, whenever $k=o(N/\log N)$, scaled Sinkhorn is a compelling alternative to Sinkhorn, with a tighter worst-case multiplicative approximation to the permanent.
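
A numeric illustration of this gap on the all-ones matrix (a single-distinct-column, non-negative rank 1 instance; the closed forms below follow from $\mathrm{perm}=N!$ and the uniform $\textbf{Q}$ being optimal for the all-ones matrix):

```python
import math

# For the N x N all-ones matrix:
#   perm = N!,  sinkhorn = N^N,  scaled Sinkhorn = N^N e^{-N}.
# Stirling gives N! ~ sqrt(2 pi N) N^N e^{-N}: Sinkhorn is off by ~e^N,
# while scaled Sinkhorn is tight up to an O(sqrt(N)) factor, consistent
# with Theorem 3.1 for k = 1.
for N in (5, 10, 20):
    perm = math.factorial(N)
    sinkhorn = N ** N
    scaled = N ** N * math.exp(-N)
    print(N, perm / sinkhorn, perm / scaled)
```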

Our results improve the analysis of the Bethe permanent for such structured matrices. Previously, the best known analysis of the Bethe permanent showed a $\sqrt{2}^{N}$-approximation factor to the permanent [AR18]. The analysis in [AR18] is tight for general non-negative matrices, and the authors showed that this bound cannot be improved without leveraging further structure. Our next result is of a similar flavor: we provide an asymptotically tight example for Theorem 3.1 and Corollary 3.2.

Theorem 3.3 (Lower bound for the Bethe and the scaled Sinkhorn permanents approximation).

There exists a matrix $\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}$ with non-negative rank $k$ that satisfies

$$\mathrm{perm}(\textbf{A})\geq\exp\left(\Omega\left(k\log\frac{N}{k}\right)\right)\mathrm{bethe}(\textbf{A}), \tag{7}$$

which further implies,

$$\mathrm{perm}(\textbf{A})\geq\exp\left(\Omega\left(k\log\frac{N}{k}\right)\right)\mathrm{scaledsinkhorn}(\textbf{A}). \tag{8}$$

An immediate application of the results stated above is to PML. Recall that for any fixed distribution $\textbf{p}$ and profile $\phi$, $\mathbb{P}(\textbf{p},\phi)$ is proportional to the permanent of the non-negative matrix $\textbf{A}^{\textbf{p},\phi}$ (see Section 2 for the definition of $\textbf{A}^{\textbf{p},\phi}$). Note that the number of distinct columns in the profile probability matrix $\textbf{A}^{\textbf{p},\phi}$ is upper bounded by the number of distinct frequencies plus one, which in turn is always less than $\sqrt{n}+1$ (where $n$ is the length of the profile). Therefore the non-negative rank of the profile probability matrix $\textbf{A}^{\textbf{p},\phi}$ is always upper bounded by $\sqrt{n}+1$. Since $\mathrm{scaledsinkhorn}(\textbf{A})$ can be computed in polynomial time [CSS19] ($\mathrm{scaledsinkhorn}(\textbf{A})$ corresponds to a convex optimization problem, and a minor modification of the approach in [CSS19] to solve a related, but slightly different, optimization problem yields a polynomial time algorithm to compute $\mathrm{scaledsinkhorn}(\textbf{A})$ to high accuracy), Theorem 3.1 implies an efficient algorithm to approximate the value $\mathbb{P}(\textbf{p},\phi)$ for a fixed distribution $\textbf{p}$ up to a multiplicative $\exp(O(\sqrt{n}\log n))$ factor; this is also the best known approximation factor achieved by a deterministic algorithm.

Analyzing the relationship between the Bethe permanent and the permanent of the profile probability matrix was posed as an interesting research direction in [Von11] (see Section VII there). Moreover, one of the primary interests in the area of algorithmic statistics/machine learning is to efficiently compute the PML distribution. Exploiting the structure of the doubly stochastic matrix $\textbf{Q}$ that maximizes the scaled Sinkhorn objective for $\textbf{A}^{\textbf{p},\phi}$, combined with the probability discretization idea from [CSS19], we provide an efficient algorithm to compute an approximate PML distribution. We use Lemma 4.1 to argue the approximation guarantee of our approximate PML distribution, and we summarize this result below.

Theorem 3.4 ($\exp\left(\sqrt{n}\log n\right)$-approximate PML).

For any given profile $\phi\in\Phi^{n}$, Algorithm 4 computes an $\exp\left(-O(\sqrt{n}\log n)\right)$-approximate PML distribution in $\widetilde{O}(n^{1.5})$ time.

Prior to our work, the best known result [CSS19] gave an efficient algorithm to compute an $\exp(-O(n^{2/3}\log n))$-approximate PML distribution.

One important application of approximate PML is in symmetric property estimation. In [ADOS16], the authors showed that a plug-in estimator based on an approximate PML distribution is sample complexity optimal for estimating entropy, support, support coverage and distance to uniformity. Combining their result with our Theorem 3.4, we get an efficient version of Theorem 2 in [ADOS16], which we summarize next.

Theorem 3.5 (Efficient universal estimator using approximate PML).

Let $n$ be the optimal sample complexity of estimating entropy, support, support coverage and distance to uniformity. If $\epsilon\geq\frac{c}{n^{0.2499}}$ for some constant $c>0$, then there exists a PML based universal plug-in estimator that runs in time $\widetilde{O}(n^{1.5})$ and is sample complexity optimal for estimating entropy, support, support coverage and distance to uniformity to accuracy $\epsilon$.

Note that the dependency on $\epsilon$ in the above theorem and the approximation factor in Theorem 3.4 are strictly better than those of [CSS19], another efficient PML based approach for universal symmetric property estimation; [CSS19] works when the error parameter satisfies $\epsilon\geq\frac{1}{n^{0.166}}$.

Recent work [HO19] further establishes the broad optimality of approximate PML. [HO19] shows optimality of approximate PML distribution based estimators for other symmetric properties, such as sorted distribution estimation (under $\ell_{1}$ distance), $\alpha$-Renyi entropy for non-integer $\alpha>3/4$, and a broad class of additive properties that are Lipschitz. [HO19] also provides a PML-based tester to test whether an unknown distribution is $\geq\epsilon$ far in $\ell_{1}$ distance from a given distribution, achieving the optimal sample complexity up to logarithmic factors. Our result further implies an efficient version of all these results.

3.1 Related work

We divide the related work into two broad categories: permanent approximations and profile maximum likelihood.

Permanent approximations:

The first set of related work concerns computing the permanent of matrices. [Val79] showed that computing the permanent of a matrix is #P-Hard, even when its entries are in $\{0,1\}$. This led to the study of computing approximations to the permanent. An additive approximation to the permanent of an arbitrary $\textbf{A}$ was given by [Gur05]. On the other hand, multiplicative approximation to the permanent (or even determining its sign) is hard for general $\textbf{A}$ [AA11, GS18]. These hardness results led to the study of multiplicative approximations to the permanent for special classes of matrices, one such class being the set of non-negative matrices. In this direction, [JSV04] gave the first efficient randomized algorithm to approximate the permanent within $(1+\epsilon)$ multiplicative accuracy. There has also been a rich literature on deterministic approximations to the permanent of non-negative matrices. [LSW98] gave the first deterministic algorithm for the permanent of $N\times N$ non-negative matrices with approximation ratio $\leq e^{N}$. [Gur11], using an inequality from [Sch98], showed that the Bethe permanent lower bounds the permanent of non-negative matrices. We refer the reader to [Von13, GS14] for the polynomial computability of the Bethe permanent and to [AR18] for a more thorough survey of the Bethe permanent and other related work.

As discussed in the footnote in the introduction, an anonymous reviewer showed us an alternative and simpler proof of the upper bound on the scaled Sinkhorn approximation to the permanent of matrices with at most $k$ distinct columns (Lemma 4.1). This alternative proof is deferred to Appendix A and is derived using Corollary 3.4.5 in Barvinok's book [Bar17]. That result, in turn, is proved using the Bregman-Minc inequality, conjectured by Minc, cf. [Spe82], and later proved by Bregman [Bre73]. The Bregman-Minc inequality is well known and many different proofs of it [Sch78, Rad97, AS04] exist. In comparison to this alternative proof for matrices with $k$ distinct columns (Lemma 4.1), our proof is self-contained and intuitive. We believe it could help provide further insights into the Sinkhorn/Bethe approximations.

Profile maximum likelihood:

The second set of related work concerns profile maximum likelihood and its applications. As discussed in the introduction, PML was introduced by [OSS+04]. Many heuristic approaches, such as the EM algorithm [OSS+04], algebraic approaches [ADM+10] and a dynamic programming approach [PJW17], have been proposed to compute approximations to PML. Further, [Von12, Von14] used the Bethe permanent as a heuristic to compute the PML distribution. None of these approaches provides theoretical guarantees on the quality of the approximate PML distribution, and it was an open question to efficiently compute a non-trivial approximate PML distribution. [CSS19] gave the first efficient algorithm to compute an $\exp(-n^{2/3}\log n)$-approximate PML distribution, where $n$ is the number of samples.

The connection between PML and universal estimators was first studied in [ADOS16], which showed that an approximate PML distribution can be used as a universal estimator for estimating symmetric properties, namely entropy, distance to uniformity, support size and coverage. See [HO19] for the broad applicability of approximate PML in property testing and in estimating other symmetric properties such as sorted $\ell_{1}$ distance, Renyi entropy, and a broad class of additive properties. [CSS19], combined with [ADOS16], gave the first efficient PML based universal estimator for symmetric property estimation. There have been several other approaches to designing universal estimators for symmetric properties. Valiant and Valiant [VV11b] adopted and rigorously analyzed a linear programming based approach for universal estimators proposed by [ET76] and showed that it is sample complexity optimal in the constant error regime for estimating certain symmetric properties (namely, entropy, support size, support coverage, and distance to uniformity). Recent work of Han, Jiao and Weissman [HJW18] applied a local moment matching based approach to designing efficient universal symmetric property estimators for a single distribution. [HJW18] achieves the optimal sample complexity in broader error regimes for estimating the power sum function, support and entropy.

Estimating symmetric properties of a distribution is a rich field, and extensive work has been dedicated to determining the optimal sample complexity of estimating each of these properties. Optimal sample complexities for estimating many symmetric properties were resolved in the past few years: support [VV11b, WY15], support coverage [OSW16, ZVV+16], entropy [VV11b, WY16a], distance to uniformity [VV11a, JHW16], sorted $\ell_{1}$ distance [VV11a, HJW18], Renyi entropy [AOST14, AOST17], KL divergence [BZLV16, HJW16] and many others.

Comparison to [CSS19]: [CSS19] provides the first efficient algorithm for computing a $\beta$-approximate PML distribution for $\beta>\exp(-n^{1-\delta})$ for some constant $\delta>0$, where $n$ is the number of samples. Formally, [CSS19] computes an $\exp(-n^{2/3}\log n)$-approximate PML distribution. Let $\ell$ and $k$ be the number of distinct probability values and frequencies respectively; [CSS19] provides a convex program which, using combinatorial techniques, they show approximates the PML objective up to an $\exp(-\widetilde{O}(\ell\times k))$ multiplicative factor. This convex program outputs a fractional solution, and [CSS19] provides a rounding algorithm that outputs a valid integral solution (corresponding to a valid distribution), incurring a further $\exp(-\widetilde{O}(\ell\times k))$ multiplicative loss. Using the discretization results, up to an $\exp(-n^{2/3}\log n)$ multiplicative loss one can assume $\ell,k\leq n^{1/3}$, and therefore [CSS19] obtains an $\exp(-n^{2/3}\log n)$-approximate PML distribution.

In our current work, using our results for the scaled Sinkhorn permanent, we show that the same convex program in [CSS19] approximates the PML objective up to an $\exp(-\widetilde{O}(\ell+k))$ multiplicative factor. We also provide a better rounding algorithm that outputs a valid distribution and incurs an $\exp(-\widetilde{O}(\ell+k))$ multiplicative loss. Using the discretization results, up to an $\exp(-\sqrt{n}\log n)$ multiplicative loss one can assume $\ell,k\leq\sqrt{n}$, and therefore our work provides an $\exp(-\sqrt{n}\log n)$-approximate PML distribution.

4 The Sinkhorn permanent for structured matrices.

In this section, we provide the proof of our first main theorem (Theorem 3.1). We show that the scaled Sinkhorn permanent of a non-negative matrix approximates its permanent, where the approximation factor is exponential in the non-negative rank of the matrix (up to log factors). Our proof is divided into two parts. First, in Section 4.1, we work with the simpler setting of matrices $\textbf{A}$ with at most $k$ distinct columns and prove the following lemma.

Lemma 4.1 (Scaled Sinkhorn permanent approximation).

For any matrix $\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}$ with at most $k$ distinct columns, the following holds,

$$\mathrm{scaledsinkhorn}(\textbf{A})\leq\mathrm{perm}(\textbf{A})\leq\exp\left(O\left(k\log\frac{N}{k}\right)\right)\mathrm{scaledsinkhorn}(\textbf{A}). \tag{9}$$

Then, using the above result, in Section 4.2 we prove our main theorem (Theorem 3.1) for low non-negative rank matrices.

4.1 The Sinkhorn permanent for the distinct column case.

We start this section by defining notation that captures the structure of column repetition in a matrix. For the remainder of this section we fix a matrix $\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}$. We let $k$ denote the number of distinct columns of $\textbf{A}$ and use $\textbf{c}_{1},\textbf{c}_{2},\dots,\textbf{c}_{k}$ to denote these distinct columns. Further, we let $\hat{\textbf{A}}=[\textbf{c}_{1}~|~\textbf{c}_{2}~|\dots|~\textbf{c}_{k}]$ denote the $\mathcal{D}\times k$ matrix formed by these distinct columns. We use $\textbf{A}_{:y}$ to denote the $y$'th column of matrix $\textbf{A}$ and let $\phi_{j}\stackrel{\mathrm{def}}{=}|\{y\in\mathcal{D}~|~\textbf{A}_{:y}=\textbf{c}_{j}\}|$ denote the number of columns equal to $\textbf{c}_{j}$. It is immediate that,

$$\sum_{j\in[1,k]}\phi_{j}=N, \tag{10}$$

where $N=|\mathcal{D}|$ is the size of the domain. For any matrix $\textbf{P}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times k}$ define,

$$\textbf{f}(\textbf{A},\textbf{P})\stackrel{\mathrm{def}}{=}\sum_{x\in\mathcal{D}}\sum_{j\in[1,k]}\textbf{P}_{x,j}\log\frac{\hat{\textbf{A}}_{x,j}}{\textbf{P}_{x,j}}+\sum_{j\in[1,k]}\phi_{j}\log\phi_{j}-\sum_{j\in[1,k]}\phi_{j}. \tag{11}$$

In the first half of this section, we show the existence of a matrix $\textbf{P}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times k}$ (see Lemma 4.4) such that $\sum_{j\in[1,k]}\textbf{P}_{x,j}=1$ for all $x\in\mathcal{D}$, $\sum_{x\in\mathcal{D}}\textbf{P}_{x,j}=\phi_{j}$ for all $j\in[1,k]$, and further (see Lemma 4.5),

$$\log\mathrm{perm}(\textbf{A})\leq O\left(k\log\frac{N}{k}\right)+\textbf{f}(\textbf{A},\textbf{P}). \tag{12}$$

In the second half (see Lemma 4.6), we show that for any matrix $\textbf{P}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times k}$ that satisfies $\sum_{j\in[1,k]}\textbf{P}_{x,j}=1$ for all $x\in\mathcal{D}$ and $\sum_{x\in\mathcal{D}}\textbf{P}_{x,j}=\phi_{j}$ for all $j\in[1,k]$, there exists a matrix $\textbf{Q}\in\mathbf{K}_{rc}$ (recall $\mathbf{K}_{rc}$ is the set of all $\mathcal{D}\times\mathcal{D}$ doubly stochastic matrices) that satisfies,

$$\textbf{f}(\textbf{A},\textbf{P})=\mathrm{U}(\textbf{A},\textbf{Q})-N. \tag{13}$$

From Corollary 2.7 we already know that $\mathrm{scaledsinkhorn}(\textbf{A})\leq\mathrm{perm}(\textbf{A})$. Using the definition of $\mathrm{scaledsinkhorn}(\textbf{A})$ and combining with Equations 12 and 13, we get,

$$\mathrm{scaledsinkhorn}(\textbf{A})\leq\mathrm{perm}(\textbf{A})\leq\exp\left(O\left(k\log\frac{N}{k}\right)\right)\mathrm{scaledsinkhorn}(\textbf{A}).$$

In the remainder, we provide proofs of all the above mentioned inequalities, for which we need the following definitions. Let $\textbf{K}_{r}\subseteq\{0,1\}^{\mathcal{D}\times k}$ be the subset of all $\mathcal{D}\times k$ matrices that are row stochastic, meaning there is exactly one 1 in each row. Let $\textbf{K}_{\textbf{A}}\subseteq\textbf{K}_{r}$ be the set of matrices $\textbf{X}\in\textbf{K}_{r}$ satisfying $\sum_{x\in\mathcal{D}}\textbf{X}_{x,j}=\phi_{j}$ for all $j\in[1,k]$.

Definition 4.2.

Let $\textbf{h}_{\textbf{A}}:S_{\mathcal{D}}\rightarrow\textbf{K}_{\textbf{A}}$ be the function that takes a permutation $\sigma\in S_{\mathcal{D}}$ as input and returns a matrix $\textbf{X}\in\textbf{K}_{\textbf{A}}$ in the following way,

$$\textbf{X}_{x,j}=\begin{cases}1&\text{ if }\textbf{A}_{:\sigma(x)}=\textbf{c}_{j}\\ 0&\text{ otherwise}\end{cases}\quad\quad\text{ for all }x\in\mathcal{D}. \tag{14}$$

Remark: Note that, as desired, $\textbf{h}_{\textbf{A}}(\sigma)\in\textbf{K}_{\textbf{A}}$ for all $\sigma\in S_{\mathcal{D}}$, for the following reason. For any $\sigma\in S_{\mathcal{D}}$, let $\textbf{X}\stackrel{\mathrm{def}}{=}\textbf{h}_{\textbf{A}}(\sigma)$. Since the $\textbf{c}_{j}$ for $j\in[1,k]$ are distinct, we have $\sum_{j\in[1,k]}\textbf{X}_{x,j}=1$. Further, for any $j\in[1,k]$, $\sum_{x\in\mathcal{D}}\textbf{X}_{x,j}=\sum_{\{x\in\mathcal{D}~|~\textbf{A}_{:\sigma(x)}=\textbf{c}_{j}\}}1=\sum_{\{x\in\mathcal{D}~|~\textbf{A}_{:x}=\textbf{c}_{j}\}}1=\phi_{j}$.

We next define the probability of a permutation $\sigma\in S_{\mathcal{D}}$ with respect to the matrix $\textbf{A}$ as follows,

$$\mathrm{Pr}\left(\sigma\right)\stackrel{\mathrm{def}}{=}\frac{\prod_{e\in\sigma}\textbf{A}_{e}}{\mathrm{perm}(\textbf{A})} \tag{15}$$

Further, we define a marginal distribution $\mu$ on $\textbf{K}_{r}$; later we will establish that this is indeed a probability distribution, that is, the probabilities add up to 1.

$$\mu(\textbf{X})\stackrel{\mathrm{def}}{=}\begin{cases}0&\text{ if }\textbf{X}\in\textbf{K}_{r}\backslash\textbf{K}_{\textbf{A}}\\ \sum_{\{\sigma\in S_{\mathcal{D}}~|~\textbf{h}_{\textbf{A}}(\sigma)=\textbf{X}\}}\mathrm{Pr}\left(\sigma\right)&\text{ if }\textbf{X}\in\textbf{K}_{\textbf{A}}.\end{cases} \tag{16}$$

For $\textbf{X}\in\textbf{K}_{\textbf{A}}$, we next provide another, equivalent expression for $\mu(\textbf{X})$.

$$\begin{split}\mu(\textbf{X})&=\sum_{\{\sigma\in S_{\mathcal{D}}~|~\textbf{h}_{\textbf{A}}(\sigma)=\textbf{X}\}}\mathrm{Pr}\left(\sigma\right)=\sum_{\{\sigma\in S_{\mathcal{D}}~|~\textbf{h}_{\textbf{A}}(\sigma)=\textbf{X}\}}\frac{\prod_{e\in\sigma}\textbf{A}_{e}}{\mathrm{perm}(\textbf{A})},\\ &=\frac{1}{\mathrm{perm}(\textbf{A})}\sum_{\{\sigma\in S_{\mathcal{D}}~|~\textbf{h}_{\textbf{A}}(\sigma)=\textbf{X}\}}\prod_{x\in\mathcal{D}}\prod_{j\in[1,k]}\hat{\textbf{A}}_{x,j}^{\textbf{X}_{x,j}},\\ &=\left(\prod_{j\in[1,k]}\phi_{j}!\right)\left(\prod_{x\in\mathcal{D}}\prod_{j\in[1,k]}\hat{\textbf{A}}_{x,j}^{\textbf{X}_{x,j}}\right)\left(\frac{1}{\mathrm{perm}(\textbf{A})}\right).\end{split} \tag{17}$$

In the first and second equalities, we used the definitions of $\mu(\textbf{X})$ and $\mathrm{Pr}\left(\sigma\right)$ (see Equation 15). For any $\sigma\in S_{\mathcal{D}}$, let $\textbf{X}=\textbf{h}_{\textbf{A}}(\sigma)$. For any $x\in\mathcal{D}$, let $j^{\prime}$ be such that $\textbf{A}_{:\sigma(x)}=\textbf{c}_{j^{\prime}}$; then $\textbf{A}_{x,\sigma(x)}=\hat{\textbf{A}}_{x,j^{\prime}}$, which is further equal to $\prod_{j\in[1,k]}\hat{\textbf{A}}_{x,j}^{\textbf{X}_{x,j}}$ because $\textbf{X}_{x,j}$ is equal to 1 if $j=j^{\prime}$ and 0 otherwise. Therefore the third equality holds. For the final equality, observe that for any $\sigma\in S_{\mathcal{D}}$ with $\textbf{X}=\textbf{h}_{\textbf{A}}(\sigma)$, for each $j\in[1,k]$, any permutation within the subset of elements $\{x\in\mathcal{D}~|~\textbf{A}_{:\sigma(x)}=\textbf{c}_{j}\}$ results in a permutation $\sigma^{\prime}$ that satisfies $\textbf{h}_{\textbf{A}}(\sigma^{\prime})=\textbf{X}$. These permutations can be carried out independently for each $j\in[1,k]$, which corresponds to $\prod_{j\in[1,k]}\phi_{j}!$ permutations, and all of them have the same value of $\prod_{x\in\mathcal{D}}\prod_{j\in[1,k]}\hat{\textbf{A}}_{x,j}^{\textbf{X}_{x,j}}$.

Using the derivation above, the definition of $\mu$ can also be written as follows:

$$\mu(\textbf{X})=\begin{cases}0&\text{ if }\textbf{X}\in\textbf{K}_{r}\backslash\textbf{K}_{\textbf{A}}\\ \left(\prod_{j\in[1,k]}\phi_{j}!\right)\left(\prod_{x\in\mathcal{D}}\prod_{j\in[1,k]}\hat{\textbf{A}}_{x,j}^{\textbf{X}_{x,j}}\right)\left(\frac{1}{\mathrm{perm}(\textbf{A})}\right)&\text{ if }\textbf{X}\in\textbf{K}_{\textbf{A}}.\end{cases} \tag{18}$$

Note that for $\textbf{X}\in\textbf{K}_{\textbf{A}}$, the expression for $\mu(\textbf{X})$ can be equivalently written as follows:

$$\mu(\textbf{X})=\left(\prod_{j\in[1,k]}\phi_{j}!\right)\left(\prod_{\{(x,j)\in\mathcal{D}\times[1,k]~|~\textbf{X}_{x,j}=1\}}\hat{\textbf{A}}_{x,j}\right)\left(\frac{1}{\mathrm{perm}(\textbf{A})}\right). \tag{19}$$

We next show that the $\mu$ defined above is a valid distribution:

$$\sum_{\textbf{X}\in\textbf{K}_{r}}\mu(\textbf{X})=\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\mu(\textbf{X})=\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\sum_{\{\sigma\in S_{\mathcal{D}}~|~\textbf{h}_{\textbf{A}}(\sigma)=\textbf{X}\}}\Pr(\sigma)=\sum_{\sigma\in S_{\mathcal{D}}}\Pr(\sigma)=1.$$

Remark: The domain of the distribution $\mu$ is $\textbf{K}_{r}$, but its support is a subset of $\textbf{K}_{\textbf{A}}$.

Definition 4.3.

For the distribution $\mu$, we define a non-negative matrix $\textbf{P}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times k}$ with respect to $\mu$ as follows:

$$\textbf{P}_{x,j}\stackrel{\mathrm{def}}{=}\Pr_{\textbf{X}\sim\mu}(\textbf{X}_{x,j}=1)=\sum_{\{\textbf{X}\in\textbf{K}_{\textbf{A}}~|~\textbf{X}_{x,j}=1\}}\mu(\textbf{X}). \tag{20}$$
Lemma 4.4.

The matrix P defined in Equation 20 satisfies the following conditions:

j[1,k]Px,j=1 for all x𝒟 and x𝒟Px,j=ϕj for all j[1,k].\sum_{j\in[1,k]}\textbf{P}_{x,j}=1\text{ for all }x\in\mathcal{D}\quad\text{ and }\quad\sum_{x\in\mathcal{D}}\textbf{P}_{x,j}=\phi_{j}\text{ for all }j\in[1,k]~{}. (21)
Proof.

We first evaluate the row sum. For each x𝒟x\in\mathcal{D},

j[1,k]Px,j=j[1,k]{XKA|Xx,j=1}μ(X)=XKAμ(X)=1.\displaystyle\sum_{j\in[1,k]}\textbf{P}_{x,j}=\sum_{j\in[1,k]}\sum_{\{\textbf{X}\in\textbf{K}_{\textbf{A}}~{}|~{}\textbf{X}_{x,j}=1\}}\mu(\textbf{X})=\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\mu(\textbf{X})=1~{}.

In the second equality we used that X∈K_A, meaning for each x∈𝒟, ∑_{j∈[1,k]} X_{x,j}=1. Next we evaluate the column sum: for each j∈[1,k],

x𝒟Px,j\displaystyle\sum_{x\in\mathcal{D}}\textbf{P}_{x,j} =x𝒟{XKA|Xx,j=1}μ(X)=XKA{x𝒟|Xx,j=1}μ(X)\displaystyle=\sum_{x\in\mathcal{D}}\sum_{\{\textbf{X}\in\textbf{K}_{\textbf{A}}~{}|~{}\textbf{X}_{x,j}=1\}}\mu(\textbf{X})=\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\sum_{\{x\in\mathcal{D}~{}|~{}\textbf{X}_{x,j}=1\}}\mu(\textbf{X})
=XKAμ(X){x𝒟|Xx,j=1}1=XKAμ(X)ϕj=ϕj\displaystyle=\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\mu(\textbf{X})\sum_{\{x\in\mathcal{D}~{}|~{}\textbf{X}_{x,j}=1\}}1=\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\mu(\textbf{X})\phi_{j}=\phi_{j}

In the first equality we used the definition of P_{x,j}. In the second equality we interchanged the summations. In the fourth equality we used that every X∈K_A has exactly ϕ_j entries equal to 1 in column j, that is, |{x∈𝒟 | X_{x,j}=1}|=ϕ_j. In the final equality we used ∑_{X∈K_A} μ(X)=1. ∎
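The identities in Lemma 4.4 are easy to check numerically on a toy instance. The following sketch (illustrative only; it assumes numpy is available, and the matrix A and its column classes are hypothetical toy data) enumerates all permutations to compute the marginals P_{x,j} of μ by brute force and verifies Equation 21.

# Brute-force sanity check of Lemma 4.4 (illustrative toy data; assumes numpy).
import itertools
import numpy as np

A = np.array([[0.2, 0.2, 0.7],
              [0.5, 0.5, 0.1],
              [0.3, 0.3, 0.9]])            # columns 0 and 1 are identical, so k = 2
classes = [0, 0, 1]                         # class index j of each column
k, N = 2, A.shape[0]
phi = [classes.count(j) for j in range(k)]  # phi = [2, 1]

perm_A = sum(np.prod([A[x, s[x]] for x in range(N)])
             for s in itertools.permutations(range(N)))

P = np.zeros((N, k))
for s in itertools.permutations(range(N)):
    w = np.prod([A[x, s[x]] for x in range(N)]) / perm_A   # Pr(sigma)
    for x in range(N):
        P[x, classes[s[x]]] += w                           # marginal of X_{x,j}

assert np.allclose(P.sum(axis=1), 1.0)      # row sums equal 1
assert np.allclose(P.sum(axis=0), phi)      # column sums equal phi_j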

The matrix P defined in Equation 20 is important because we can upper bound the permanent of matrix A in terms of entries of this matrix. We formalize this result in our next lemma.

Lemma 4.5.

For matrix A0𝒟×𝒟\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}, if P is the matrix defined in Equation 20, then

logperm(A)O(klogNk)+f(A,P)\log\mathrm{perm}(\textbf{A})\leq O\left(k\log\frac{N}{k}\right)+\textbf{f}(\textbf{A},\textbf{P})
Proof.

We first calculate the expectation of log(μ(X))\log(\mu(\textbf{X})) and express it in terms of matrix P.

𝔼Xμ[logμ(X)]=XKrμ(X)logμ(X)=XKAμ(X)logμ(X),=XKAμ(X)log((j[1,k]ϕj!)({(x,j)𝒟×[1,k]|Xx,j=1}A^x,j)(1perm(A))),=log(j[1,k]ϕj!)logperm(A)+XKAμ(X)log({(x,j)𝒟×[1,k]|Xx,j=1}A^x,j).\begin{split}\mathbb{E}_{\textbf{X}\sim\mu}\left[\log\mu(\textbf{X})\right]&=\sum_{\textbf{X}\in\textbf{K}_{r}}\mu(\textbf{X})\log\mu(\textbf{X})=\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\mu(\textbf{X})\log\mu(\textbf{X})~{},\\ &=\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\mu(\textbf{X})\log\left(\left(\prod_{j\in[1,k]}\phi_{j}!\right)\left(\prod_{\{(x,j)\in\mathcal{D}\times[1,k]~{}|~{}\textbf{X}_{x,j}=1\}}{\hat{\textbf{A}}}_{x,j}\right)\left(\frac{1}{\mathrm{perm}(\textbf{A})}\right)\right)~{},\\ &=\log\left(\prod_{j\in[1,k]}\phi_{j}!\right)-\log\mathrm{perm}(\textbf{A})+\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\mu(\textbf{X})\log\left(\prod_{\{(x,j)\in\mathcal{D}\times[1,k]~{}|~{}\textbf{X}_{x,j}=1\}}{\hat{\textbf{A}}}_{x,j}\right)~{}.\end{split} (22)

The second equality holds because the support of the distribution μ is a subset of K_A. In the third equality we used Equation 19. We now simplify the last term in the final expression of the above derivation.

XKAμ(X)log({(x,j)𝒟×[1,k]|Xx,j=1}A^x,j)=XKAμ(X){(x,j)𝒟×[1,k]|Xx,j=1}logA^x,j,=(x,j)𝒟×[1,k]logA^x,j{XKA|Xx,j=1}μ(X),=(x,j)𝒟×[1,k]Px,jlogA^x,j.\begin{split}\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\mu(\textbf{X})\log\left(\prod_{\{(x,j)\in\mathcal{D}\times[1,k]~{}|~{}\textbf{X}_{x,j}=1\}}{\hat{\textbf{A}}}_{x,j}\right)&=\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\mu(\textbf{X})\sum_{\{(x,j)\in\mathcal{D}\times[1,k]~{}|~{}\textbf{X}_{x,j}=1\}}\log{\hat{\textbf{A}}}_{x,j}~{},\\ &=\sum_{(x,j)\in\mathcal{D}\times[1,k]}\log{\hat{\textbf{A}}}_{x,j}\sum_{\{\textbf{X}\in\textbf{K}_{\textbf{A}}~{}|~{}\textbf{X}_{x,j}=1\}}\mu(\textbf{X})~{},\\ &=\sum_{(x,j)\in\mathcal{D}\times[1,k]}\textbf{P}_{x,j}\log{\hat{\textbf{A}}}_{x,j}~{}.\\ \end{split} (23)

Combining Equation 22 and Equation 23 together we get,

𝔼Xμ[logμ(X)]=log(j[1,k]ϕj!)logperm(A)+(x,j)𝒟×[1,k]Px,jlogA^x,j.\mathbb{E}_{\textbf{X}\sim\mu}\left[\log\mu(\textbf{X})\right]=\log\left(\prod_{j\in[1,k]}\phi_{j}!\right)-\log\mathrm{perm}(\textbf{A})+\sum_{(x,j)\in\mathcal{D}\times[1,k]}\textbf{P}_{x,j}\log{\hat{\textbf{A}}}_{x,j}~{}. (24)

We next define a different distribution ν\nu on Kr\textbf{K}_{r} using the following sampling procedure: For each x𝒟x\in\mathcal{D}, pick a column j[1,k]j\in[1,k] independently with probability Px,j\textbf{P}_{x,j}. Note that this is a valid sampling procedure because for each x𝒟x\in\mathcal{D}, j[1,k]Px,j=1\sum_{j\in[1,k]}\textbf{P}_{x,j}=1. The description of distribution ν\nu is as follows: for each XKr\textbf{X}\in\textbf{K}_{r},

ν(X)=def{(x,j)𝒟×[1,k]|Xx,j=1}Px,j\nu(\textbf{X})\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\prod_{\{(x,j)\in\mathcal{D}\times[1,k]~{}|~{}\textbf{X}_{x,j}=1\}}\textbf{P}_{x,j} (25)

Remark: Note that XKrν(X)=x𝒟(j[1,k]Px,j)=1\sum_{X\in\textbf{K}_{r}}\nu(\textbf{X})=\prod_{x\in\mathcal{D}}(\sum_{j\in[1,k]}\textbf{P}_{x,j})=1 and ν\nu is a valid distribution.
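The sampling procedure defining ν is straightforward to implement; a minimal sketch with hypothetical toy numbers (assuming numpy):

# One sample from nu: each x independently picks a column class j with probability P_{x,j}.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.8, 0.2],                   # toy marginal matrix; each row sums to 1
              [0.6, 0.4],
              [0.6, 0.4]])
X = np.zeros_like(P)
for x in range(P.shape[0]):
    j = rng.choice(P.shape[1], p=P[x])
    X[x, j] = 1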

We next calculate the expectation of log(ν(X))\log(\nu(\textbf{X})) with respect to distribution μ\mu and express it in terms of matrix P. Note that XKrμ(X)logν(X)=XKAμ(X)logν(X)\sum_{\textbf{X}\in\textbf{K}_{r}}\mu(\textbf{X})\log\nu(\textbf{X})=\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\mu(\textbf{X})\log\nu(\textbf{X}) because μ(X)=0\mu(\textbf{X})=0 for all XKr\KA\textbf{X}\in\textbf{K}_{r}\backslash\textbf{K}_{\textbf{A}} and we get,

𝔼Xμ[logν(X)]\displaystyle\mathbb{E}_{\textbf{X}\sim\mu}\left[\log\nu(\textbf{X})\right] =XKAμ(X)logν(X)=XKAμ(X)log({(x,j)𝒟×[1,k]|Xx,j=1}Px,j)\displaystyle=\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\mu(\textbf{X})\log\nu(\textbf{X})=\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\mu(\textbf{X})\log\left(\prod_{\{(x,j)\in\mathcal{D}\times[1,k]~{}|~{}\textbf{X}_{x,j}=1\}}\textbf{P}_{x,j}\right)
=XKAμ(X){(x,j)𝒟×[1,k]|Xx,j=1}logPx,j=(x,j)𝒟×[1,k]logPx,j{XKA|Xx,j=1}μ(X)\displaystyle=\sum_{\textbf{X}\in\textbf{K}_{\textbf{A}}}\mu(\textbf{X})\sum_{\{(x,j)\in\mathcal{D}\times[1,k]|\textbf{X}_{x,j}=1\}}\log\textbf{P}_{x,j}=\sum_{(x,j)\in\mathcal{D}\times[1,k]}\log\textbf{P}_{x,j}\sum_{\{\textbf{X}\in\textbf{K}_{\textbf{A}}|\textbf{X}_{x,j}=1\}}\mu(\textbf{X})
=(x,j)𝒟×[1,k]Px,jlogPx,j\displaystyle=\sum_{(x,j)\in\mathcal{D}\times[1,k]}\textbf{P}_{x,j}\log\textbf{P}_{x,j}

We now calculate the KL divergence KL(μν)\mathrm{KL}\left(\mu\|\nu\right) between distributions μ\mu and ν\nu.

KL(μν)\displaystyle\mathrm{KL}\left(\mu\|\nu\right) =𝔼Xμ[logμ(X)]𝔼Xμ[logν(X)]\displaystyle=\mathbb{E}_{\textbf{X}\sim\mu}\left[\log\mu(\textbf{X})\right]-\mathbb{E}_{\textbf{X}\sim\mu}\left[\log\nu(\textbf{X})\right]
=log(j[1,k]ϕj!)logperm(A)+(x,j)𝒟×[1,k]Px,jlogA^x,j(x,j)𝒟×[1,k]Px,jlogPx,j\displaystyle=\log\left(\prod_{j\in[1,k]}\phi_{j}!\right)-\log\mathrm{perm}(\textbf{A})+\sum_{(x,j)\in\mathcal{D}\times[1,k]}\textbf{P}_{x,j}\log{\hat{\textbf{A}}}_{x,j}-\sum_{(x,j)\in\mathcal{D}\times[1,k]}\textbf{P}_{x,j}\log\textbf{P}_{x,j}

Using Lemma 2.9 we have KL(μ‖ν)≥0, which further implies,

logperm(A)log(j[1,k]ϕj!)+(x,j)𝒟×[1,k]Px,jlogA^x,jPx,jj[1,k]O(logϕj)+j[1,k]ϕjlogϕjj[1,k]ϕj+(x,j)𝒟×[1,k]Px,jlogA^x,jPx,jO(klogNk)+j[1,k]ϕjlogϕjj[1,k]ϕj+(x,j)𝒟×[1,k]Px,jlogA^x,jPx,j\begin{split}\log\mathrm{perm}(\textbf{A})&\leq\log\left(\prod_{j\in[1,k]}\phi_{j}!\right)+\sum_{(x,j)\in\mathcal{D}\times[1,k]}\textbf{P}_{x,j}\log\frac{{\hat{\textbf{A}}}_{x,j}}{\textbf{P}_{x,j}}\\ &\leq\sum_{j\in[1,k]}O(\log\phi_{j})+\sum_{j\in[1,k]}\phi_{j}\log\phi_{j}-\sum_{j\in[1,k]}\phi_{j}+\sum_{(x,j)\in\mathcal{D}\times[1,k]}\textbf{P}_{x,j}\log\frac{{\hat{\textbf{A}}}_{x,j}}{\textbf{P}_{x,j}}\\ &\leq O(k\log\frac{N}{k})+\sum_{j\in[1,k]}\phi_{j}\log\phi_{j}-\sum_{j\in[1,k]}\phi_{j}+\sum_{(x,j)\in\mathcal{D}\times[1,k]}\textbf{P}_{x,j}\log\frac{{\hat{\textbf{A}}}_{x,j}}{\textbf{P}_{x,j}}\end{split} (26)

In the second inequality we used Lemma 2.8 on each ϕ_j!, and in the third inequality we used ∑_{j∈[1,k]} ϕ_j=N together with the bound ∑_{j∈[1,k]} log ϕ_j ≤ k log(N/k), which follows from the concavity of the logarithm (Jensen's inequality). Further using the definition of f(A,P) (see Equation 11), we conclude the proof. ∎

We have provided an upper bound on the permanent of matrix A; all that remains is to relate this upper bound to the scaled Sinkhorn permanent of A. Our next lemma serves this purpose.

Lemma 4.6.

For any matrix P0𝒟×[1,k]\textbf{P}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times[1,k]} that satisfies,

j[1,k]Px,j=1 for all x𝒟 and x𝒟Px,j=ϕj for all j[1,k].\sum_{j\in[1,k]}\textbf{P}_{x,j}=1\text{ for all }x\in\mathcal{D}\quad\text{ and }\quad\sum_{x\in\mathcal{D}}\textbf{P}_{x,j}=\phi_{j}\text{ for all }j\in[1,k]~{}. (27)

there exists a doubly stochastic matrix Q0𝒟×𝒟\textbf{Q}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}} such that,

f(A,P)=U(A,Q)N.\textbf{f}(\textbf{A},\textbf{P})=\mathrm{U}(\textbf{A},\textbf{Q})-N~{}. (28)
Proof.

Define matrix Q𝒟×𝒟\textbf{Q}\in\mathbb{R}^{\mathcal{D}\times\mathcal{D}} as follows,

Qx,y=defPx,jϕj\textbf{Q}_{x,y}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\frac{\textbf{P}_{x,j}}{\phi_{j}}

where in the definition above jj is such that A:y=cj\textbf{A}_{:y}=\textbf{c}_{j}. Now we verify the row and column sums of matrix Q. For each x𝒟x\in\mathcal{D},

y𝒟Qx,y=j[1,k]{y𝒟|A:y=cj}Px,jϕj=j[1,k]Px,jϕj{y𝒟|A:y=cj}1=j[1,k]Px,jϕjϕj=j[1,k]Px,j=1\begin{split}\sum_{y\in\mathcal{D}}\textbf{Q}_{x,y}&=\sum_{j\in[1,k]}\sum_{\{y\in\mathcal{D}~{}|~{}\textbf{A}_{:y}=\textbf{c}_{j}\}}\frac{\textbf{P}_{x,j}}{\phi_{j}}=\sum_{j\in[1,k]}\frac{\textbf{P}_{x,j}}{\phi_{j}}\sum_{\{y\in\mathcal{D}~{}|~{}\textbf{A}_{:y}=\textbf{c}_{j}\}}1\\ &=\sum_{j\in[1,k]}\frac{\textbf{P}_{x,j}}{\phi_{j}}\cdot\phi_{j}=\sum_{j\in[1,k]}\textbf{P}_{x,j}=1\\ \end{split} (29)

We next evaluate the column sums. For each y∈𝒟, let j be such that A_{:y}=c_j (note that j is a function of y; for convenience our notation does not capture this dependence), then

x𝒟Qx,y=x𝒟Px,jϕj=1ϕjx𝒟Px,j=1ϕjϕj=1.\sum_{x\in\mathcal{D}}\textbf{Q}_{x,y}=\sum_{x\in\mathcal{D}}\frac{\textbf{P}_{x,j}}{\phi_{j}}=\frac{1}{\phi_{j}}\sum_{x\in\mathcal{D}}\textbf{P}_{x,j}=\frac{1}{\phi_{j}}\phi_{j}=1~{}. (30)

Therefore the matrix Q is doubly stochastic and we next relate U(A,Q)\mathrm{U}(\textbf{A},\textbf{Q}) with f(A,P)\textbf{f}(\textbf{A},\textbf{P}). Recall the definition of U(A,Q)\mathrm{U}(\textbf{A},\textbf{Q}) (Equation 1),

U(A,Q)=(x,y)𝒟×𝒟Qx,ylog(Ax,yQx,y).\mathrm{U}(\textbf{A},\textbf{Q})=\sum_{(x,y)\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\log(\frac{{\textbf{A}}_{x,y}}{\textbf{Q}_{x,y}})~{}. (31)

We analyze the above term and express it in terms of entries of matrices P and A^{\hat{\textbf{A}}}.

(x,y)𝒟×𝒟Qx,ylog(Ax,yQx,y)=x𝒟j[1,k][{y𝒟|A:y=cj}Qx,ylog(Ax,yQx,y)]=x𝒟j[1,k][{y𝒟|A:y=cj}Px,jϕjlog(A^x,jϕjPx,j)]=x𝒟j[1,k][ϕjPx,jϕjlog(A^x,jϕjPx,j)]=x𝒟j[1,k][Px,jlog(A^x,jϕjPx,j)]\begin{split}\sum_{(x,y)\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\log(\frac{{\textbf{A}}_{x,y}}{\textbf{Q}_{x,y}})&=\sum_{x\in\mathcal{D}}\sum_{j\in[1,k]}\left[\sum_{\{y\in\mathcal{D}~{}|~{}\textbf{A}_{:y}=\textbf{c}_{j}\}}\textbf{Q}_{x,y}\log(\frac{{\textbf{A}}_{x,y}}{\textbf{Q}_{x,y}})\right]\\ &=\sum_{x\in\mathcal{D}}\sum_{j\in[1,k]}\left[\sum_{\{y\in\mathcal{D}~{}|~{}\textbf{A}_{:y}=\textbf{c}_{j}\}}\frac{\textbf{P}_{x,j}}{\phi_{j}}\log(\frac{{\hat{\textbf{A}}}_{x,j}\phi_{j}}{\textbf{P}_{x,j}})\right]\\ &=\sum_{x\in\mathcal{D}}\sum_{j\in[1,k]}\left[\phi_{j}\cdot\frac{\textbf{P}_{x,j}}{\phi_{j}}\log(\frac{{\hat{\textbf{A}}}_{x,j}\phi_{j}}{\textbf{P}_{x,j}})\right]=\sum_{x\in\mathcal{D}}\sum_{j\in[1,k]}\left[\textbf{P}_{x,j}\log(\frac{{\hat{\textbf{A}}}_{x,j}\phi_{j}}{\textbf{P}_{x,j}})\right]\end{split} (32)

The first equality follows because the vectors c_j for j∈[1,k] are distinct. The second equality follows because, for each x∈𝒟 and any y∈𝒟 with A_{:y}=c_j, we have A_{x,y}=Â_{x,j} and Q_{x,y}=P_{x,j}/ϕ_j. The third equality follows because ∑_{{y∈𝒟 | A_{:y}=c_j}} 1 = |{y∈𝒟 | A_{:y}=c_j}| = ϕ_j.

We further simplify the final term in the above derivation.

x𝒟j[1,k][Px,jlog(A^x,jϕjPx,j)]=x𝒟j[1,k][Px,jlog(A^x,jPx,j)]+x𝒟j[1,k]Px,jlogϕj=x𝒟j[1,k][Px,jlog(A^x,jPx,j)]+j[1,k]logϕjx𝒟Px,j=x𝒟j[1,k][Px,jlog(A^x,jPx,j)]+j[1,k]ϕjlogϕj.\begin{split}\sum_{x\in\mathcal{D}}\sum_{j\in[1,k]}\left[\textbf{P}_{x,j}\log(\frac{{\hat{\textbf{A}}}_{x,j}\phi_{j}}{\textbf{P}_{x,j}})\right]&=\sum_{x\in\mathcal{D}}\sum_{j\in[1,k]}\left[\textbf{P}_{x,j}\log(\frac{{\hat{\textbf{A}}}_{x,j}}{\textbf{P}_{x,j}})\right]+\sum_{x\in\mathcal{D}}\sum_{j\in[1,k]}\textbf{P}_{x,j}\log\phi_{j}\\ &=\sum_{x\in\mathcal{D}}\sum_{j\in[1,k]}\left[\textbf{P}_{x,j}\log(\frac{{\hat{\textbf{A}}}_{x,j}}{\textbf{P}_{x,j}})\right]+\sum_{j\in[1,k]}\log\phi_{j}\sum_{x\in\mathcal{D}}\textbf{P}_{x,j}\\ &=\sum_{x\in\mathcal{D}}\sum_{j\in[1,k]}\left[\textbf{P}_{x,j}\log(\frac{{\hat{\textbf{A}}}_{x,j}}{\textbf{P}_{x,j}})\right]+\sum_{j\in[1,k]}\phi_{j}\log\phi_{j}~{}.\end{split} (33)

Combining Equation 32, Equation 33 and further substituting back in Equation 31 we get,

\begin{split}\mathrm{U}(\textbf{A},\textbf{Q})&=\sum_{x\in\mathcal{D}}\sum_{j\in[1,k]}\left[\textbf{P}_{x,j}\log\left(\frac{\hat{\textbf{A}}_{x,j}}{\textbf{P}_{x,j}}\right)\right]+\sum_{j\in[1,k]}\phi_{j}\log\phi_{j}\\&=\textbf{f}(\textbf{A},\textbf{P})+N~{}.\end{split} (34)

In the final expression we used the definition of f(A,P) and combined it with N=∑_{j∈[1,k]} ϕ_j. ∎
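The construction in this proof is also easy to verify numerically. In the sketch below (illustrative only, assuming numpy; the toy A, P and column classes are hypothetical data satisfying the hypotheses of Lemma 4.6), Q spreads P_{x,j} uniformly over the ϕ_j identical columns of class j, and we check both double stochasticity and Equation 34.

# Numerical illustration of Lemma 4.6 (toy data; assumes numpy).
import numpy as np

A = np.array([[0.2, 0.2, 0.7],              # columns 0, 1 identical: classes [0, 0, 1]
              [0.5, 0.5, 0.1],
              [0.3, 0.3, 0.9]])
classes = [0, 0, 1]
phi = np.array([2.0, 1.0])
A_hat = A[:, [0, 2]]                        # one representative column per class

P = np.array([[0.8, 0.2],                   # rows sum to 1, columns sum to phi
              [0.6, 0.4],
              [0.6, 0.4]])

# Construction from the proof: Q_{x,y} = P_{x,j}/phi_j where column y has class j.
Q = np.stack([P[:, classes[y]] / phi[classes[y]] for y in range(3)], axis=1)
assert np.allclose(Q.sum(axis=0), 1.0) and np.allclose(Q.sum(axis=1), 1.0)

# Equation 34: U(A, Q) equals sum P log(A_hat/P) + sum_j phi_j log phi_j.
U = np.sum(Q * np.log(A / Q))
assert np.isclose(U, np.sum(P * np.log(A_hat / P)) + np.sum(phi * np.log(phi)))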

We are now ready to prove the main lemma of this section, which is restated for convenience. See 4.1

Proof.

Consider the matrix P defined in Equation 20. By Lemma 4.4, the matrix P satisfies the conditions of Lemma 4.6; therefore, there exists a doubly stochastic matrix Q∈𝐊_rc such that f(A,P)=U(A,Q)−N. Combining this with Lemma 4.5 we get log perm(A) ≤ O(k log(N/k)) + U(A,Q) − N, which further implies perm(A) ≤ exp(O(k log(N/k))) scaledsinkhorn(A). The lower bound on perm(A) follows from Lemma 2.7, and we conclude the proof. ∎

We next state another interesting property of the matrix P defined in Equation 20. This property will be useful for the purposes of PML (Section 6).

Theorem 4.7.

For matrix A∈ℝ_{≥0}^{𝒟×𝒟}, the matrix P defined in Equation 20 satisfies the following: if x,y∈𝒟 are such that A_{x,:}=A_{y,:}, then P_{x,j}=P_{y,j} for all j∈[1,k].

Proof.

For any j∈[1,k], recall the definitions of P_{x,j} and P_{y,j}:

Px,j={XKA|Xx,j=1}(j[1,k]ϕj!)((z,j)𝒟×[1,k]A^z,jXz,j)(1perm(A)),=(j[1,k]ϕj!)(1perm(A)){XKA|Xx,j=1}(z,j)𝒟×[1,k]A^z,jXz,j.\begin{split}\textbf{P}_{x,j}&=\sum_{\{\textbf{X}\in\textbf{K}_{\textbf{A}}~{}|~{}\textbf{X}_{x,j}=1\}}\left(\prod_{j^{\prime}\in[1,k]}\phi_{j^{\prime}}!\right)\left(\prod_{(z,j^{\prime})\in\mathcal{D}\times[1,k]}\hat{\textbf{A}}_{z,j^{\prime}}^{\textbf{X}_{z,j^{\prime}}}\right)\left(\frac{1}{\mathrm{perm}(\textbf{A})}\right),\\ &=\left(\prod_{j^{\prime}\in[1,k]}\phi_{j^{\prime}}!\right)\left(\frac{1}{\mathrm{perm}(\textbf{A})}\right)\sum_{\{\textbf{X}\in\textbf{K}_{\textbf{A}}~{}|~{}\textbf{X}_{x,j}=1\}}\prod_{(z,j^{\prime})\in\mathcal{D}\times[1,k]}\hat{\textbf{A}}_{z,j^{\prime}}^{\textbf{X}_{z,j^{\prime}}}~{}.\end{split} (35)
Py,j=(j[1,k]ϕj!)(1perm(A)){XKA|Xy,j=1}(z,j)𝒟×[1,k]A^z,jXz,j.\begin{split}\textbf{P}_{y,j}=\left(\prod_{j^{\prime}\in[1,k]}\phi_{j^{\prime}}!\right)\left(\frac{1}{\mathrm{perm}(\textbf{A})}\right)\sum_{\{\textbf{X}^{\prime}\in\textbf{K}_{\textbf{A}}~{}|~{}\textbf{X}_{y,j}^{\prime}=1\}}\prod_{(z,j^{\prime})\in\mathcal{D}\times[1,k]}\hat{\textbf{A}}_{z,j^{\prime}}^{\textbf{X}_{z,j^{\prime}}^{\prime}}~{}.\end{split} (36)

For any Y{XKA|Xx,j=1}\textbf{Y}\in\{\textbf{X}\in\textbf{K}_{\textbf{A}}~{}|~{}\textbf{X}_{x,j}=1\} we next construct a unique Y{XKA|Xy,j=1}\textbf{Y}^{\prime}\in\{\textbf{X}^{\prime}\in\textbf{K}_{\textbf{A}}~{}|~{}\textbf{X}_{y,j}^{\prime}=1\} (and vice versa) such that,

(z,j)𝒟×[1,k]A^z,jYz,j=(z,j)𝒟×[1,k]A^z,jYz,j\prod_{(z,j^{\prime})\in\mathcal{D}\times[1,k]}\hat{\textbf{A}}_{z,j^{\prime}}^{\textbf{Y}_{z,j^{\prime}}}=\prod_{(z,j^{\prime})\in\mathcal{D}\times[1,k]}\hat{\textbf{A}}_{z,j^{\prime}}^{\textbf{Y}_{z,j^{\prime}}^{\prime}}

Each Y∈K_A corresponds to a bipartite graph whose left vertices are the elements of 𝒟 and whose right vertices are the elements of [1,k], in which every left vertex x∈𝒟 has degree 1 and every right vertex j∈[1,k] has degree ϕ_j.

Consider Y∈{X∈K_A | X_{x,j}=1}; we divide the analysis into the following two cases.

  1. 1.

If Y_{y,j}=1, then both vertices x,y∈𝒟 are connected to j∈[1,k] in our bipartite graph representation, and we define Y′ = Y.

  2. 2.

If Y_{y,j}=0, then vertex x is connected to j and y is connected to some other vertex j′≠j. In this case we swap the edges: we remove the edges (x,j),(y,j′) and add (x,j′),(y,j) to construct Y′, formally defined next,

\textbf{Y}^{\prime}_{z,j^{\prime\prime}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\begin{cases}1&\text{ if }z=y\text{ and }j^{\prime\prime}=j,\\ 0&\text{ if }z=y\text{ and }j^{\prime\prime}=j^{\prime},\\ 1&\text{ if }z=x\text{ and }j^{\prime\prime}=j^{\prime},\\ 0&\text{ if }z=x\text{ and }j^{\prime\prime}=j,\\ \textbf{Y}_{z,j^{\prime\prime}}&\text{ otherwise}~{}.\end{cases} (37)

In both cases, clearly Y′∈{X′∈K_A | X′_{y,j}=1}. Further, A_{x,:}=A_{y,:} implies Â_{x,j′}=Â_{y,j′} for all j′∈[1,k], and the following equality holds,

(z,j)𝒟×[1,k]A^z,jYz,j=(z,j)𝒟×[1,k]A^z,jYz,j\prod_{(z,j^{\prime})\in\mathcal{D}\times[1,k]}\hat{\textbf{A}}_{z,j^{\prime}}^{\textbf{Y}_{z,j^{\prime}}}=\prod_{(z,j^{\prime})\in\mathcal{D}\times[1,k]}\hat{\textbf{A}}_{z,j^{\prime}}^{\textbf{Y}_{z,j^{\prime}}^{\prime}}

The same analysis also holds when we start with Y′∈{X′∈K_A | X′_{y,j}=1} and construct Y∈{X∈K_A | X_{x,j}=1}. We therefore have a one-to-one correspondence between the elements Y and Y′ of the sets {X∈K_A | X_{x,j}=1} and {X′∈K_A | X′_{y,j}=1} respectively, satisfying,

(z,j)𝒟×[1,k]A^z,jYz,j=(z,j)𝒟×[1,k]A^z,jYz,j.\prod_{(z,j^{\prime})\in\mathcal{D}\times[1,k]}\hat{\textbf{A}}_{z,j^{\prime}}^{\textbf{Y}_{z,j^{\prime}}}=\prod_{(z,j^{\prime})\in\mathcal{D}\times[1,k]}\hat{\textbf{A}}_{z,j^{\prime}}^{\textbf{Y}_{z,j^{\prime}}^{\prime}}~{}.

Therefore, Px,j=Py,j\textbf{P}_{x,j}=\textbf{P}_{y,j} and we conclude the proof. ∎

4.2 Generalization to low non-negative rank matrices

Here we prove our main result for the scaled Sinkhorn permanent of low non-negative rank matrices (Theorem 3.1). To prove this result, we use the approximation guarantee of the scaled Sinkhorn permanent for non-negative matrices with k distinct columns. The following lemma relates the permanent of a matrix A of non-negative rank k to permanents of matrices with at most k distinct columns and will be crucial for our analysis.

Lemma 4.8 ([Bar96]).

Let A∈ℝ_{≥0}^{𝒟×𝒟} be any matrix of non-negative rank k. If A = ∑_{j∈[k]} v_j u_j^⊤ for v_j,u_j∈ℝ_{≥0}^{𝒟}, then

perm(A)={α+k|j[k]αj=N}1j[k]αj!perm(Vα)perm(Uα),\mathrm{perm}(\textbf{A})=\sum_{\{\alpha\subseteq\mathbb{Z}_{+}^{k}|\sum_{j\in[k]}\alpha_{j}=N\}}\frac{1}{\prod_{j\in[k]}\alpha_{j}!}\mathrm{perm}(\textbf{V}^{\alpha})\mathrm{perm}(\textbf{U}^{\alpha}),

where Vα=def[v1v1α1|v2v2α2||vkvkαk]\textbf{V}^{\alpha}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}[\underbrace{\textbf{v}_{1}\dots\textbf{v}_{1}}_{\alpha_{1}}~{}|~{}\underbrace{\textbf{v}_{2}\dots\textbf{v}_{2}}_{\alpha_{2}}~{}|~{}\dots~{}|\underbrace{\textbf{v}_{k}\dots\textbf{v}_{k}}_{\alpha_{k}}], Uα=def[u1u1α1|u2u2α2||ukukαk]\textbf{U}^{\alpha}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}[\underbrace{\textbf{u}_{1}\dots\textbf{u}_{1}}_{\alpha_{1}}~{}|~{}\underbrace{\textbf{u}_{2}\dots\textbf{u}_{2}}_{\alpha_{2}}~{}|~{}\dots~{}|\underbrace{\textbf{u}_{k}\dots\textbf{u}_{k}}_{\alpha_{k}}].

As the number of terms in the above summation is small, the maximizing term is a good approximation to the permanent of A.

Corollary 4.9.

Given a non-negative matrix A0𝒟×𝒟\textbf{A}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}}, let kk denote the non-negative rank of the matrix. If A=j[k]vjuj\textbf{A}=\sum_{j\in[k]}\textbf{v}_{j}\textbf{u}_{j}^{\top} for vj,uj0𝒟\textbf{v}_{j},\textbf{u}_{j}\in\mathbb{R}_{\geq 0}^{\mathcal{D}} is any non-negative matrix factorization of A, then

perm(A)exp(O(klogNk))max{α+k|j[k]αj=N}1j[k]αj!perm(Vα)perm(Uα).\mathrm{perm}(\textbf{A})\leq\exp\left(O(k\log\frac{N}{k})\right)\max_{\{\alpha\subseteq\mathbb{Z}_{+}^{k}|\sum_{j\in[k]}\alpha_{j}=N\}}\frac{1}{\prod_{j\in[k]}\alpha_{j}!}\mathrm{perm}(\textbf{V}^{\alpha})\mathrm{perm}(\textbf{U}^{\alpha})~{}. (38)
Proof.

Since perm(A) equals the sum in Lemma 4.8, it is upper bounded by the number of summands times the maximum summand. The number of feasible α's in the set {α∈ℤ_+^k | ∑_{j∈[k]} α_j=N} is at most \binom{N+k-1}{k-1} ∈ exp(O(k log(N/k))), and we conclude the proof. ∎
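For completeness, a standard estimate behind this count (stated here under the mild additional assumption 2 ≤ k ≤ N/2, which is not spelled out in the original text): using \binom{n}{m} ≤ (en/m)^m,

\binom{N+k-1}{k-1}\;\leq\;\left(\frac{e(N+k-1)}{k-1}\right)^{k-1}\;\leq\;\exp\left(O\left(k\log\frac{N}{k}\right)\right)~{}.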

Lemma 4.10.

Let Q,Q′′0𝒟×𝒟\textbf{{Q}}^{\prime},\textbf{{Q}}^{\prime\prime}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times\mathcal{D}} be any doubly stochastic matrices. Then Q=defQQ′′\textbf{Q}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\textbf{{Q}}^{\prime}\textbf{{Q}}^{\prime\prime} is a doubly stochastic matrix.

Proof.

We first consider the row sums,

Q1=QQ′′1=Q1=1.\textbf{Q}\textbf{1}=\textbf{{Q}}^{\prime}\textbf{{Q}}^{\prime\prime}\textbf{1}=\textbf{{Q}}^{\prime}\textbf{1}=\textbf{1}~{}.

Therefore the matrix Q is row stochastic. In the above derivation, the second and third equalities follow because Q′′\textbf{{Q}}^{\prime\prime} and Q\textbf{{Q}}^{\prime} are row stochastic matrices respectively. We now consider the column sums,

Q1=Q′′Q1=Q′′1=1.\textbf{Q}^{\top}\textbf{1}=\textbf{{Q}}^{\prime\prime\top}\textbf{{Q}}^{\prime\top}\textbf{1}=\textbf{{Q}}^{\prime\prime\top}\textbf{1}=\textbf{1}~{}.

The above derivation follows because Q\textbf{{Q}}^{\prime} and Q′′\textbf{{Q}}^{\prime\prime} are column stochastic and therefore the matrix Q is column stochastic. As the matrix Q is both row and column stochastic we conclude the proof. ∎
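Lemma 4.10 can also be checked numerically; in the sketch below (assuming numpy), the Sinkhorn-style generator is just an illustrative way to produce (approximately) doubly stochastic test matrices.

# Product of two doubly stochastic matrices is doubly stochastic (Lemma 4.10).
import numpy as np

rng = np.random.default_rng(0)

def random_doubly_stochastic(n, iters=500):
    # Alternately normalize rows and columns of a positive matrix (Sinkhorn iteration).
    M = rng.random((n, n)) + 0.1
    for _ in range(iters):
        M /= M.sum(axis=1, keepdims=True)
        M /= M.sum(axis=0, keepdims=True)
    return M

Q1, Q2 = random_doubly_stochastic(5), random_doubly_stochastic(5)
Q = Q1 @ Q2
assert np.allclose(Q.sum(axis=0), 1.0) and np.allclose(Q.sum(axis=1), 1.0)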

We are now ready to prove the main result of this section, which we restate for convenience. See 3.1

Proof.

Let α be the maximizer in Equation 38; then

perm(A)exp(O(klogNk))1j[k]αj!perm(Vα)perm(Uα).\mathrm{perm}(\textbf{A})\leq\exp\left(O(k\log\frac{N}{k})\right)\frac{1}{\prod_{j\in[k]}\alpha_{j}!}\mathrm{perm}(\textbf{V}^{\alpha})\mathrm{perm}(\textbf{U}^{\alpha})~{}. (39)

Recall that to prove the theorem we need to construct a doubly stochastic witness Q that satisfies:

logperm(A)O(klogNk)+U(A,Q)N.\log\mathrm{perm}(\textbf{A})\leq{O(k\log\frac{N}{k})}+\mathrm{U}(\textbf{A},\textbf{Q})-N~{}.

We construct such a witness Q from the doubly stochastic witnesses for matrices Vα\textbf{V}^{\alpha} and Uα\textbf{U}^{\alpha}. For all j[k]j\in[k] define Sj=def{y𝒟|V:yα=vj}S_{j}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{y\in\mathcal{D}~{}|~{}\textbf{V}^{\alpha}_{:y}=\textbf{v}_{j}\}, equivalently Sj={y𝒟|U:yα=uj}S_{j}=\{y\in\mathcal{D}~{}|~{}\textbf{U}^{\alpha}_{:y}=\textbf{u}_{j}\} and note that αj=|Sj|\alpha_{j}=|S_{j}|. Let Q\textbf{{Q}}^{\prime} and Q′′\textbf{{Q}}^{\prime\prime} be the doubly stochastic matrices that maximize the scaled Sinkhorn permanent for matrices Vα\textbf{V}^{\alpha} and Uα\textbf{U}^{\alpha} respectively. Therefore by Lemma 4.1 the following inequalities hold,

logperm(Vα)O(klogNk)+U(Vα,Q)N,\log\mathrm{perm}(\textbf{V}^{\alpha})\leq O(k\log\frac{N}{k})+\mathrm{U}(\textbf{V}^{\alpha},\textbf{{Q}}^{\prime})-N~{}, (40)
logperm(Uα)O(klogNk)+U(Uα,Q′′)N,\log\mathrm{perm}(\textbf{U}^{\alpha})\leq O(k\log\frac{N}{k})+\mathrm{U}(\textbf{U}^{\alpha},\textbf{{Q}}^{\prime\prime})-N~{}, (41)

where recall U(V^α,Q′)=∑_{x,y∈𝒟×𝒟} Q′_{x,y} log(V^α_{x,y}/Q′_{x,y}) and U(U^α,Q″)=∑_{x,y∈𝒟×𝒟} Q″_{x,y} log(U^α_{x,y}/Q″_{x,y}). Without loss of generality, by symmetry (with respect to the columns within each S_j) and concavity of the scaled Sinkhorn objective, we can assume that the maximizing matrices Q′ and Q″ satisfy the following (averaging a maximizer over the columns within each S_j preserves double stochasticity and, by concavity, cannot decrease the objective): for all x∈𝒟 and j∈[k],

Qx,y=Qx,y and Qx,y′′=Qx,y′′ for all y,ySj and x𝒟.\textbf{{Q}}^{\prime}_{x,y}=\textbf{{Q}}^{\prime}_{x,y^{\prime}}\text{ and }\textbf{{Q}}^{\prime\prime}_{x,y}=\textbf{{Q}}^{\prime\prime}_{x,y^{\prime}}\text{ for all }y,y^{\prime}\in S_{j}\text{ and }x\in\mathcal{D}~{}. (42)

Note that the doubly stochastic matrix that we constructed for the proof of Lemma 4.1 also satisfies the above collection of equalities. Now combining Equations 39, 40 and 41 we get,

logperm(A)O(klogNk)logj[k]αj!+U(Vα,Q)N+U(Uα,Q′′)N,O(klogNk)j[k](αjlogαjαj)+U(Vα,Q)N+U(Uα,Q′′)N,O(klogNk)j[k]αjlogαj+U(Vα,Q)+U(Uα,Q′′)N.\begin{split}\log\mathrm{perm}(\textbf{A})&\leq{O(k\log\frac{N}{k})}-\log\prod_{j\in[k]}\alpha_{j}!+\mathrm{U}(\textbf{V}^{\alpha},\textbf{{Q}}^{\prime})-N+\mathrm{U}(\textbf{U}^{\alpha},\textbf{{Q}}^{\prime\prime})-N~{},\\ &\leq{O(k\log\frac{N}{k})}-\sum_{j\in[k]}\left(\alpha_{j}\log\alpha_{j}-\alpha_{j}\right)+\mathrm{U}(\textbf{V}^{\alpha},\textbf{{Q}}^{\prime})-N+\mathrm{U}(\textbf{U}^{\alpha},\textbf{{Q}}^{\prime\prime})-N~{},\\ &\leq{O(k\log\frac{N}{k})}-\sum_{j\in[k]}\alpha_{j}\log\alpha_{j}+\mathrm{U}(\textbf{V}^{\alpha},\textbf{{Q}}^{\prime})+\mathrm{U}(\textbf{U}^{\alpha},\textbf{{Q}}^{\prime\prime})-N~{}.\end{split} (43)

In the second inequality we used Stirling's approximation (Lemma 2.8); the error term due to this approximation is upper bounded by O(k log(N/k)). In the third inequality we used ∑_{j∈[k]} α_j=N.

Let Q=QQ′′\textbf{Q}=\textbf{{Q}}^{\prime}\textbf{{Q}}^{\prime\prime\top}, then by Lemma 4.10 the matrix Q is doubly stochastic. In the remainder of the proof we show that,

j[k]αjlogαj+U(Vα,Q)+U(Uα,Q′′)U(A,Q),-\sum_{j\in[k]}\alpha_{j}\log\alpha_{j}+\mathrm{U}(\textbf{V}^{\alpha},\textbf{{Q}}^{\prime})+\mathrm{U}(\textbf{U}^{\alpha},\textbf{{Q}}^{\prime\prime})\leq\mathrm{U}(\textbf{A},\textbf{Q})~{}, (44)

where recall U(A,Q)=∑_{x,y∈𝒟×𝒟} Q_{x,y} log(A_{x,y}/Q_{x,y}). As the matrix Q is doubly stochastic, the above inequality combined with Equation 43 concludes the proof. Therefore, in the remainder we focus our attention on proving Equation 44, and we start by simplifying the above expression. Define,

βx,yj=def1Qx,yzSjQx,zQy,z′′ for all x𝒟,y𝒟 and for all j[k].\beta^{j}_{x,y}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\frac{1}{\textbf{Q}_{x,y}}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}\textbf{{Q}}^{\prime\prime}_{y,z}\quad\text{ for all }x\in\mathcal{D},y\in\mathcal{D}\text{ and for all }j\in[k]~{}. (45)

For all x𝒟x\in\mathcal{D} and y𝒟y\in\mathcal{D} the variables defined above satisfy the following,

j[k]βx,yj=1Qx,yj[k]zSjQx,zQy,z′′=1Qx,yz𝒟Qx,zQy,z′′=1Qx,yQx,y=1,\sum_{j\in[k]}\beta^{j}_{x,y}=\frac{1}{\textbf{Q}_{x,y}}\sum_{j\in[k]}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}\textbf{{Q}}^{\prime\prime}_{y,z}=\frac{1}{\textbf{Q}_{x,y}}\sum_{z\in\mathcal{D}}\textbf{{Q}}^{\prime}_{x,z}\textbf{{Q}}^{\prime\prime}_{y,z}=\frac{1}{\textbf{Q}_{x,y}}\textbf{Q}_{x,y}=1~{}, (46)

where in the third equality we used the definition Q=Q′Q″^⊤. We next simplify and lower bound the term U(A,Q) in terms of these newly defined variables.

logAx,y=logj[k]vj(x)uj(y)logj[k](vj(x)uj(y)βx,yj)βx,yj=j[k]βx,yjlog(vj(x)uj(y)βx,yj),\displaystyle\log\textbf{A}_{x,y}=\log\sum_{j\in[k]}\textbf{v}_{j}(x)\textbf{u}_{j}(y)\geq\log\prod_{j\in[k]}\left(\frac{\textbf{v}_{j}(x)\textbf{u}_{j}(y)}{\beta^{j}_{x,y}}\right)^{\beta^{j}_{x,y}}=\sum_{j\in[k]}{\beta^{j}_{x,y}}\log\left(\frac{\textbf{v}_{j}(x)\textbf{u}_{j}(y)}{\beta^{j}_{x,y}}\right)~{}, (47)

where in the first equality we used A=∑_{j∈[k]} v_j u_j^⊤, and in the inequality we used the weighted AM–GM inequality: for weights β_j≥0 with ∑_{j∈[k]} β_j=1, we have ∑_{j∈[k]} a_j = ∑_{j∈[k]} β_j (a_j/β_j) ≥ ∏_{j∈[k]} (a_j/β_j)^{β_j}. Now consider the term Q_{x,y} log A_{x,y} and substitute the above lower bound,

Qx,ylogAx,y\displaystyle\textbf{Q}_{x,y}\log\textbf{A}_{x,y} Qx,yj[k]βx,yj(logvj(x)+loguj(y))Qx,yj[k]βx,yjlogβx,yj.\displaystyle\geq\textbf{Q}_{x,y}\sum_{j\in[k]}{\beta^{j}_{x,y}}(\log\textbf{v}_{j}(x)+\log\textbf{u}_{j}(y))-\textbf{Q}_{x,y}\sum_{j\in[k]}{\beta^{j}_{x,y}}\log{\beta^{j}_{x,y}}~{}. (48)

Summing over all the (x,y)(x,y) pairs we get,

\begin{split}\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\log\textbf{A}_{x,y}&\geq\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\log\textbf{v}_{j}(x)\Big(\sum_{y\in\mathcal{D}}\textbf{Q}_{x,y}\beta^{j}_{x,y}\Big)+\sum_{y\in\mathcal{D}}\sum_{j\in[k]}\log\textbf{u}_{j}(y)\Big(\sum_{x\in\mathcal{D}}\textbf{Q}_{x,y}\beta^{j}_{x,y}\Big)\\&\quad-\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\sum_{j\in[k]}\beta^{j}_{x,y}\log\beta^{j}_{x,y}~{}.\end{split} (49)

In the above expression the following terms simplify,

y𝒟Qx,yβx,yj=y𝒟Qx,y1Qx,yzSjQx,zQy,z′′=zSjQx,zy𝒟Qy,z′′=zSjQx,z.\sum_{y\in\mathcal{D}}\textbf{Q}_{x,y}\beta^{j}_{x,y}=\sum_{y\in\mathcal{D}}\textbf{Q}_{x,y}\frac{1}{\textbf{Q}_{x,y}}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}\textbf{{Q}}^{\prime\prime}_{y,z}=\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}\sum_{y\in\mathcal{D}}\textbf{{Q}}^{\prime\prime}_{y,z}=\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}~{}. (50)

Similarly,

x𝒟Qx,yβx,yj=zSjQy,z′′.\sum_{x\in\mathcal{D}}\textbf{Q}_{x,y}\beta^{j}_{x,y}=\sum_{z\in S_{j}}\textbf{{Q}}^{\prime\prime}_{y,z}~{}. (51)

Also note that,

x,y𝒟×𝒟Qx,yj[k]βx,yjlogβx,yj=x,y𝒟×𝒟Qx,yj[k]βx,yjlogβx,yjQx,yQx,y,=x,y𝒟×𝒟Qx,yj[k]βx,yjlog(βx,yjQx,y)x,y𝒟×𝒟Qx,yj[k]βx,yjlogQx,y,=x,y𝒟×𝒟j[k]βx,yjQx,ylog(βx,yjQx,y)x,y𝒟×𝒟Qx,ylogQx,y,=x,y𝒟×𝒟j[k](zSjQx,zQy,z′′)log(zSjQx,zQy,z′′)x,y𝒟×𝒟Qx,ylogQx,y.\begin{split}\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\sum_{j\in[k]}{\beta^{j}_{x,y}}\log{\beta^{j}_{x,y}}&=\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\sum_{j\in[k]}{\beta^{j}_{x,y}}\log\frac{\beta^{j}_{x,y}\textbf{Q}_{x,y}}{\textbf{Q}_{x,y}},\\ &=\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\sum_{j\in[k]}{\beta^{j}_{x,y}}\log(\beta^{j}_{x,y}\textbf{Q}_{x,y})-\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\sum_{j\in[k]}{\beta^{j}_{x,y}}\log{\textbf{Q}_{x,y}},\\ &=\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}{\beta^{j}_{x,y}}\textbf{Q}_{x,y}\log(\beta^{j}_{x,y}\textbf{Q}_{x,y})-\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\log{\textbf{Q}_{x,y}}~{},\\ &=\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}\big{(}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}\textbf{{Q}}^{\prime\prime}_{y,z}\big{)}\log\big{(}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}\textbf{{Q}}^{\prime\prime}_{y,z}\big{)}-\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\log{\textbf{Q}_{x,y}}~{}.\\ \end{split} (52)

In the third and fourth equalities we used Equation 46 and Equation 45 respectively. Substituting Equations 50, 51 and 52 in Equation 49 we get,

\begin{split}\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\log\textbf{A}_{x,y}&\geq\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\log\textbf{v}_{j}(x)\Big(\sum_{z\in S_{j}}\textbf{Q}^{\prime}_{x,z}\Big)+\sum_{y\in\mathcal{D}}\sum_{j\in[k]}\log\textbf{u}_{j}(y)\Big(\sum_{z\in S_{j}}\textbf{Q}^{\prime\prime}_{y,z}\Big)\\&\quad-\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}\Big(\sum_{z\in S_{j}}\textbf{Q}^{\prime}_{x,z}\textbf{Q}^{\prime\prime}_{y,z}\Big)\log\Big(\sum_{z\in S_{j}}\textbf{Q}^{\prime}_{x,z}\textbf{Q}^{\prime\prime}_{y,z}\Big)+\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\log\textbf{Q}_{x,y}~{}.\end{split} (53)

By rearranging terms the above expression can be equivalently written as,

U(A,Q)=x,y𝒟×𝒟Qx,ylogAx,yQx,yx𝒟j[k]logvj(x)(zSjQx,z)+y𝒟j[k]loguj(y)(zSjQy,z′′)x,y𝒟×𝒟j[k](zSjQx,zQy,z′′)log(zSjQx,zQy,z′′).\begin{split}\mathrm{U}(\textbf{A},\textbf{Q})=\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\log\frac{\textbf{A}_{x,y}}{\textbf{Q}_{x,y}}&\geq\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\log\textbf{v}_{j}(x)\big{(}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}\big{)}+\sum_{y\in\mathcal{D}}\sum_{j\in[k]}\log\textbf{u}_{j}(y)\big{(}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime\prime}_{y,z}\big{)}\\ &\quad-\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}\big{(}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}\textbf{{Q}}^{\prime\prime}_{y,z}\big{)}\log\big{(}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}\textbf{{Q}}^{\prime\prime}_{y,z}\big{)}~{}.\end{split} (54)

In the above expression we have a lower bound for the term U(A,Q)\mathrm{U}(\textbf{A},\textbf{Q}) and we relate it to terms U(Vα,Q)\mathrm{U}(\textbf{V}^{\alpha},\textbf{{Q}}^{\prime}) and U(Uα,Q′′)\mathrm{U}(\textbf{U}^{\alpha},\textbf{{Q}}^{\prime\prime}). Consider the following term,

x,y𝒟×𝒟Qx,ylogVx,yα=x𝒟j[k]ySjQx,ylogVx,yα=x𝒟j[k]ySjQx,ylogvj(x),=x𝒟j[k]logvj(x)(ySjQx,y)=x𝒟j[k]logvj(x)(zSjQx,z),\begin{split}\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{{Q}}^{\prime}_{x,y}\log\textbf{V}^{\alpha}_{x,y}&=\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\sum_{y\in S_{j}}\textbf{{Q}}^{\prime}_{x,y}\log\textbf{V}^{\alpha}_{x,y}=\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\sum_{y\in S_{j}}\textbf{{Q}}^{\prime}_{x,y}\log\textbf{v}_{j}(x)~{},\\ &=\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\log\textbf{v}_{j}(x)\big{(}\sum_{y\in S_{j}}\textbf{{Q}}^{\prime}_{x,y}\big{)}=\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\log\textbf{v}_{j}(x)\big{(}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}\big{)}~{},\\ \end{split} (55)

In the final equality we renamed the summation variable; the remaining equalities are straightforward. Carrying out a similar derivation we also get,

x,y𝒟×𝒟Qx,y′′logUx,yα=x𝒟j[k]loguj(x)(ySjQx,y′′)=y𝒟j[k]loguj(y)(zSjQy,z′′).\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{{Q}}^{\prime\prime}_{x,y}\log\textbf{U}^{\alpha}_{x,y}=\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\log\textbf{u}_{j}(x)\big{(}\sum_{y\in S_{j}}\textbf{{Q}}^{\prime\prime}_{x,y}\big{)}=\sum_{y\in\mathcal{D}}\sum_{j\in[k]}\log\textbf{u}_{j}(y)\big{(}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime\prime}_{y,z}\big{)}~{}. (56)

As before in the final equality we renamed variables. Substituting Equations 55 and 56 in Equation 54 we get,

\begin{split}\mathrm{U}(\textbf{A},\textbf{Q})&\geq\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}^{\prime}_{x,y}\log\textbf{V}^{\alpha}_{x,y}+\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}^{\prime\prime}_{x,y}\log\textbf{U}^{\alpha}_{x,y}-\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}\Big(\sum_{z\in S_{j}}\textbf{Q}^{\prime}_{x,z}\textbf{Q}^{\prime\prime}_{y,z}\Big)\log\Big(\sum_{z\in S_{j}}\textbf{Q}^{\prime}_{x,z}\textbf{Q}^{\prime\prime}_{y,z}\Big)\\&=\mathrm{U}(\textbf{V}^{\alpha},\textbf{Q}^{\prime})+\mathrm{U}(\textbf{U}^{\alpha},\textbf{Q}^{\prime\prime})+\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}^{\prime}_{x,y}\log\textbf{Q}^{\prime}_{x,y}+\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{Q}^{\prime\prime}_{x,y}\log\textbf{Q}^{\prime\prime}_{x,y}\\&\quad-\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}\Big(\sum_{z\in S_{j}}\textbf{Q}^{\prime}_{x,z}\textbf{Q}^{\prime\prime}_{y,z}\Big)\log\Big(\sum_{z\in S_{j}}\textbf{Q}^{\prime}_{x,z}\textbf{Q}^{\prime\prime}_{y,z}\Big)~{}.\end{split} (57)

Therefore to prove Equation 44, all that remains is to show that,

\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\big(\textbf{Q}^{\prime}_{x,y}\log\textbf{Q}^{\prime}_{x,y}+\textbf{Q}^{\prime\prime}_{x,y}\log\textbf{Q}^{\prime\prime}_{x,y}\big)-\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}\Big(\sum_{z\in S_{j}}\textbf{Q}^{\prime}_{x,z}\textbf{Q}^{\prime\prime}_{y,z}\Big)\log\Big(\sum_{z\in S_{j}}\textbf{Q}^{\prime}_{x,z}\textbf{Q}^{\prime\prime}_{y,z}\Big)\geq-\sum_{j\in[k]}\alpha_{j}\log\alpha_{j}~{}. (58)

To prove the above inequality we use the symmetry of the solutions Q′ and Q″. Recall from Equation 42 that for all j∈[k], Q′_{x,y}=Q′_{x,y′} and Q″_{x,y}=Q″_{x,y′} for all y,y′∈S_j and x∈𝒟. Define R′_{x,j}=Q′_{x,y} and R″_{x,j}=Q″_{x,y} for any y∈S_j. We next substitute these definitions and simplify the terms on the left hand side of Equation 58,

x,y𝒟×𝒟Qx,ylogQx,y=x𝒟j[k]ySjQx,ylogQx,y=x𝒟j[k]ySjRx,jlogRx,j,=x𝒟j[k]αjRx,jlogRx,j.\begin{split}\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{{Q}}^{\prime}_{x,y}\log\textbf{{Q}}^{\prime}_{x,y}&=\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\sum_{y\in S_{j}}\textbf{{Q}}^{\prime}_{x,y}\log\textbf{{Q}}^{\prime}_{x,y}=\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\sum_{y\in S_{j}}\textbf{R}^{\prime}_{x,j}\log\textbf{R}^{\prime}_{x,j},\\ &=\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\alpha_{j}\textbf{R}^{\prime}_{x,j}\log\textbf{R}^{\prime}_{x,j}~{}.\end{split} (59)

In the final equality we used |S_j|=α_j; the remaining equalities are straightforward. A similar argument also gives,

x,y𝒟×𝒟Qx,y′′logQx,y′′=x𝒟j[k]αjRx,j′′logRx,j′′=y𝒟j[k]αjRy,j′′logRy,j′′.\begin{split}\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{{Q}}^{\prime\prime}_{x,y}\log\textbf{{Q}}^{\prime\prime}_{x,y}&=\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\alpha_{j}\textbf{R}^{\prime\prime}_{x,j}\log\textbf{R}^{\prime\prime}_{x,j}=\sum_{y\in\mathcal{D}}\sum_{j\in[k]}\alpha_{j}\textbf{R}^{\prime\prime}_{y,j}\log\textbf{R}^{\prime\prime}_{y,j}~{}.\end{split} (60)

Note in the final equality we renamed variables. Finally,

x,y𝒟×𝒟j[k](zSjQx,zQy,z′′)log(zSjQx,zQy,z′′)=x,y𝒟×𝒟j[k]αjRx,jRy,j′′logαjRx,jRy,j′′,=x,y𝒟×𝒟j[k]αjRx,jRy,j′′(logαj+logRx,j+logRy,j′′),\begin{split}\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}\big{(}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}\textbf{{Q}}^{\prime\prime}_{y,z}\big{)}&\log\big{(}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}\textbf{{Q}}^{\prime\prime}_{y,z}\big{)}=\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}\alpha_{j}\textbf{R}^{\prime}_{x,j}\textbf{R}^{\prime\prime}_{y,j}\log\alpha_{j}\textbf{R}^{\prime}_{x,j}\textbf{R}^{\prime\prime}_{y,j}~{},\\ &=\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}\alpha_{j}\textbf{R}^{\prime}_{x,j}\textbf{R}^{\prime\prime}_{y,j}\big{(}\log\alpha_{j}+\log\textbf{R}^{\prime}_{x,j}+\log\textbf{R}^{\prime\prime}_{y,j}\big{)}~{},\\ \end{split} (61)

Each of the terms in the parentheses further simplifies as follows,

x,y𝒟×𝒟j[k]αjRx,jRy,j′′logαj\displaystyle\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}\alpha_{j}\textbf{R}^{\prime}_{x,j}\textbf{R}^{\prime\prime}_{y,j}\log\alpha_{j} =j[k]αjlogαjx,y𝒟×𝒟Rx,jRy,j′′=j[k]αjlogαjx𝒟Rx,jy𝒟Ry,j′′,\displaystyle=\sum_{j\in[k]}\alpha_{j}\log\alpha_{j}\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\textbf{R}^{\prime}_{x,j}\textbf{R}^{\prime\prime}_{y,j}=\sum_{j\in[k]}\alpha_{j}\log\alpha_{j}\sum_{x\in\mathcal{D}}\textbf{R}^{\prime}_{x,j}\sum_{y\in\mathcal{D}}\textbf{R}^{\prime\prime}_{y,j},
=j[k]αjlogαj.\displaystyle=\sum_{j\in[k]}\alpha_{j}\log\alpha_{j}~{}.
x,y𝒟×𝒟j[k]αjRx,jRy,j′′logRx,j=x𝒟j[k]αjRx,jlogRx,jy𝒟Ry,j′′=x𝒟j[k]αjRx,jlogRx,j.\displaystyle\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}\alpha_{j}\textbf{R}^{\prime}_{x,j}\textbf{R}^{\prime\prime}_{y,j}\log\textbf{R}^{\prime}_{x,j}=\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\alpha_{j}\textbf{R}^{\prime}_{x,j}\log\textbf{R}^{\prime}_{x,j}\sum_{y\in\mathcal{D}}\textbf{R}^{\prime\prime}_{y,j}=\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\alpha_{j}\textbf{R}^{\prime}_{x,j}\log\textbf{R}^{\prime}_{x,j}~{}.

Similarly,

x,y𝒟×𝒟j[k]αjRx,jRy,j′′logRy,j′′=y𝒟j[k]αjRy,j′′logRy,j′′.\displaystyle\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}\alpha_{j}\textbf{R}^{\prime}_{x,j}\textbf{R}^{\prime\prime}_{y,j}\log\textbf{R}^{\prime\prime}_{y,j}=\sum_{y\in\mathcal{D}}\sum_{j\in[k]}\alpha_{j}\textbf{R}^{\prime\prime}_{y,j}\log\textbf{R}^{\prime\prime}_{y,j}~{}.

Substituting back all the above three expressions in Equation 61 we get,

x,y𝒟×𝒟j[k](zSjQx,zQy,z′′)log(zSjQx,zQy,z′′)=x𝒟j[k]αjRx,jlogRx,j+y𝒟j[k]αjRy,j′′logRy,j′′+j[k]αjlogαj.\begin{split}\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}\big{(}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}\textbf{{Q}}^{\prime\prime}_{y,z}\big{)}\log\big{(}\sum_{z\in S_{j}}\textbf{{Q}}^{\prime}_{x,z}\textbf{{Q}}^{\prime\prime}_{y,z}\big{)}&=\sum_{x\in\mathcal{D}}\sum_{j\in[k]}\alpha_{j}\textbf{R}^{\prime}_{x,j}\log\textbf{R}^{\prime}_{x,j}+\sum_{y\in\mathcal{D}}\sum_{j\in[k]}\alpha_{j}\textbf{R}^{\prime\prime}_{y,j}\log\textbf{R}^{\prime\prime}_{y,j}\\ &\quad+\sum_{j\in[k]}\alpha_{j}\log\alpha_{j}~{}.\end{split} (62)

Substituting Equations 59, 60 and 62 into the left hand side of Equation 58 we get,

\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\big(\textbf{Q}^{\prime}_{x,y}\log\textbf{Q}^{\prime}_{x,y}+\textbf{Q}^{\prime\prime}_{x,y}\log\textbf{Q}^{\prime\prime}_{x,y}\big)-\sum_{x,y\in\mathcal{D}\times\mathcal{D}}\sum_{j\in[k]}\Big(\sum_{z\in S_{j}}\textbf{Q}^{\prime}_{x,z}\textbf{Q}^{\prime\prime}_{y,z}\Big)\log\Big(\sum_{z\in S_{j}}\textbf{Q}^{\prime}_{x,z}\textbf{Q}^{\prime\prime}_{y,z}\Big)=-\sum_{j\in[k]}\alpha_{j}\log\alpha_{j}~{}.

The above derivation proves Equation 58 (in fact with equality); substituting it in Equation 57 we get,

U(A,Q)U(Vα,Q)+U(Uα,Q′′)j[k]αjlogαj.\begin{split}\mathrm{U}(\textbf{A},\textbf{Q})\geq\mathrm{U}(\textbf{V}^{\alpha},\textbf{{Q}}^{\prime})+\mathrm{U}(\textbf{U}^{\alpha},\textbf{{Q}}^{\prime\prime})-\sum_{j\in[k]}\alpha_{j}\log\alpha_{j}~{}.\end{split} (63)

The above expression combined with Equation 43 gives the following upper bound on the log of permanent,

logperm(A)O(klogNk)+U(A,Q)N.\displaystyle\log\mathrm{perm}(\textbf{A})\leq{O(k\log\frac{N}{k})}+\mathrm{U}(\textbf{A},\textbf{Q})-N~{}. (64)

The above expression combined with the definition of the scaled Sinkhorn permanent concludes the proof. ∎

5 Lower bound for Bethe and scaled Sinkhorn permanent approximations

Here we provide the proof of Theorem 3.3, which is restated below for convenience. See 3.3

Proof.

Assume N is divisible by k. Let 1 and 0 be the (N/k)×(N/k) all-ones and all-zeros matrices respectively. Note that bethe(1) ≤ (N/k) log(N/k) − N/k + 1; this holds because (k/N)·1 is the maximizer of the optimization problem max_Q F(1,Q) over all doubly stochastic matrices Q. On the other hand, log perm(1) = log((N/k)!) ≥ (N/k) log(N/k) − N/k + Ω(log(N/k)), where the last inequality uses Stirling's approximation. Now consider the following matrix,

A=def[100010001]\textbf{A}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\begin{bmatrix}\textbf{1}&\textbf{0}&\dots\textbf{0}\\ \textbf{0}&\textbf{1}&\dots\textbf{0}\\ \vdots&\dots&\ddots\\ \textbf{0}&\textbf{0}&\dots\textbf{1}\\ \end{bmatrix}

In the above definition, A is an N×N block-diagonal matrix with k diagonal blocks, each of size (N/k)×(N/k). For the matrix A we have log perm(A) = k·log perm(1) ≥ k((N/k) log(N/k) − N/k + Ω(log(N/k))) and bethe(A) = k·bethe(1) ≤ k((N/k) log(N/k) − N/k + 1). Therefore log perm(A) − bethe(A) ≥ Ω(k log(N/k)).

The proof for the case when N is not divisible by k is similar. Here A is the N×N block-diagonal matrix whose first k blocks are copies of the ⌊N/k⌋×⌊N/k⌋ all-ones matrix and whose final block is the r×r all-ones matrix, where r = N − k⌊N/k⌋. For this definition of A we have log perm(A) = k·log(⌊N/k⌋!) + log(r!) ≥ k(⌊N/k⌋ log⌊N/k⌋ − ⌊N/k⌋ + Ω(log(N/k))) + r log r − r + Ω(log r), and bethe(A) ≤ k(⌊N/k⌋ log⌊N/k⌋ − ⌊N/k⌋ + 1) + r log r − r + 1. Therefore log perm(A) − bethe(A) ≥ Ω(k log(N/k)). The first condition of the theorem follows by taking exponentials on both sides of the previous inequality.

The second inequality in the theorem follows by using bethe(A)scaledsinkhorn(A)\mathrm{bethe}(\textbf{A})\geq\mathrm{scaledsinkhorn}(\textbf{A}) (See 2.7). As the matrix A constructed here is of non-negative rank kk, we conclude the proof. ∎
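The block-diagonal permanent computation used in this proof can be checked by brute force on a small instance; a sketch (assuming numpy; the values of N and k below are illustrative):

# perm of the N x N matrix with k all-ones diagonal blocks equals ((N/k)!)^k.
import itertools, math
import numpy as np

def brute_perm(M):
    n = M.shape[0]
    return sum(np.prod([M[i, s[i]] for i in range(n)])
               for s in itertools.permutations(range(n)))

N, k = 6, 3
A = np.kron(np.eye(k), np.ones((N // k, N // k)))
assert brute_perm(A) == math.factorial(N // k) ** k    # 2!^3 = 8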

6 Improved approximation to profile maximum likelihood

In this section, we provide an efficient algorithm to compute an exp(O(nlogn))\exp\left(-O(\sqrt{n}\log n)\right)-approximate PML distribution. We first introduce the setup and some new notation. For convenience, we also recall some definitions from Section 2.

We are given access to n independent samples from a hidden distribution p∈Δ^𝒟 supported on domain 𝒟. Let x^n be this length-n sequence and ϕ=Φ(x^n) be its corresponding profile. Let f(x^n,y) be the frequency of domain element y∈𝒟 in the sequence x^n. Let k be the number of distinct non-zero frequencies, which we denote by {m_1,…,m_j,…,m_k}. Note that k is always upper bounded by O(√n): the distinct non-zero frequencies are distinct positive integers summing to at most n, so k(k+1)/2 ≤ n. For j∈[1,k], we define ϕ_j = |{y∈𝒟 | f(x^n,y)=m_j}|. Let p_pml be the PML distribution with respect to the profile ϕ, formally defined as follows,

\textbf{p}_{\mathrm{pml}}\in\operatorname*{arg\,max}_{\textbf{p}\in\Delta^{\mathcal{D}}}\mathbb{P}(\textbf{p},\phi)~{}.

Recall the definition of the profile probability matrix A^{p,ϕ} with respect to profile ϕ and distribution p,

Ax,yp,ϕ=defpxfy for all x,y𝒟,\textbf{A}^{\textbf{p},\phi}_{x,y}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\textbf{p}_{x}^{\textbf{f}_{y}}\text{ for all }x,y\in\mathcal{D}, (65)

where f_y = f(x^n,y) is the frequency of domain element y∈𝒟 in the observed sequence x^n, and recall Φ(x^n)=ϕ. Note that the number of distinct columns of A^{p,ϕ} equals the number of distinct observed frequencies plus one (for the unseen elements, whose frequency is zero); therefore it is k+1.

From Equation 4, the probability of profile ϕ\phi with respect to distribution p is,

(p,ϕ)=Cϕ(j[0,k]1ϕj!)perm(Ap,ϕ), where Cϕ=n!j[1,k](mj!)ϕj.\mathbb{P}(\textbf{p},\phi)=C_{\phi}\cdot\left(\prod_{j\in[0,k]}\frac{1}{\phi_{j}!}\right)\cdot\mathrm{perm}(\textbf{A}^{\textbf{p},\phi})~{},\enspace\text{ where }\enspace C_{\phi}=\frac{n!}{\prod_{j\in[1,k]}(\textbf{m}_{j}!)^{\phi_{j}}}~{}. (66)

Here ϕ_0 denotes the number of unseen domain elements; note that it is not part of the profile. However, given a distribution p we know its domain 𝒟 and can therefore determine the unseen domain elements. Also note that C_ϕ does not involve the term ϕ_0; it depends only on the profile ϕ and not on the underlying distribution p.
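As a quick sanity check of Equation 66 (a worked toy example, not from the original text): let 𝒟={a,b} and x³=(a,a,b), so that k=2, m_1=2, m_2=1, ϕ_1=ϕ_2=1 and ϕ_0=0. Then C_ϕ = 3!/(2!·1!) = 3, the matrix A^{p,ϕ} has columns (p_a², p_b²)^⊤ and (p_a, p_b)^⊤, so perm(A^{p,ϕ}) = p_a² p_b + p_a p_b², and Equation 66 gives ℙ(p,ϕ) = 3(p_a² p_b + p_a p_b²). This matches the direct computation: exactly six length-3 sequences have this profile, three in which a appears twice and three in which b appears twice.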

We now provide the motivation behind the techniques used in this section. Recall that the goal of this section is to compute an approximate PML distribution and we wish to do this using the results from the previous section. A first attempt would be to use the scaled Sinkhorn (or the Bethe) permanent as a proxy for the term perm(Ap,ϕ)\mathrm{perm}(\textbf{A}^{\textbf{p},\phi}) in Equation 66 and solve the following optimization problem:

maxpΔ𝒟Cϕ(j[0,k]1ϕj!)scaledsinkhorn(Ap,ϕ).\max_{\textbf{p}\in\Delta^{\mathcal{D}}}C_{\phi}\cdot\left(\prod_{j\in[0,k]}\frac{1}{\phi_{j}!}\right)\cdot\mathrm{scaledsinkhorn}(\textbf{A}^{\textbf{p},\phi})~{}.

The above optimization problem is indeed a good proxy for the PML objective; recall that it is equivalent to the following:

maxpΔ𝒟Cϕ(j[0,k]1ϕj!)maxQZrcexp(U(Ap,ϕ,Q)).\max_{\textbf{p}\in\Delta^{\mathcal{D}}}C_{\phi}\cdot\left(\prod_{j\in[0,k]}\frac{1}{\phi_{j}!}\right)\cdot\max_{\textbf{Q}\in\textbf{Z}_{rc}}\exp\left(\mathrm{U}(\textbf{A}^{\textbf{p},\phi},\textbf{Q})\right)~{}.

Taking log and ignoring the constants we get the following equivalent optimization problem,

maxpΔ𝒟maxQZrc(log1ϕ0!+U(Ap,ϕ,Q))\max_{\textbf{p}\in\Delta^{\mathcal{D}}}\max_{\textbf{Q}\in\textbf{Z}_{rc}}\left(\log\frac{1}{\phi_{0}!}+\mathrm{U}(\textbf{A}^{\textbf{p},\phi},\textbf{Q})\right)

Interestingly, the function U(A^{p,ϕ},Q) is concave in p for fixed Q and concave in Q for fixed p (see [Von14]). Unfortunately, however, U(A^{p,ϕ},Q) is in general not jointly concave in p and Q [Von14], and we do not know how to solve the above optimization problem directly. Vontobel [Von14] proposed an alternating maximization algorithm for this problem and studied its implementation and convergence to a stationary point; see [Von14] for the empirical performance of this approach. Using the Bethe permanent as a proxy in the above optimization problem runs into similar issues; see [Von12, Von14] for further details.
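For concreteness, below is a minimal sketch of one natural instantiation of such an alternating scheme (assuming numpy; it is not claimed to match the implementation in [Von14]). For fixed p, the maximizer of U(A^{p,ϕ},Q) over doubly stochastic Q is the Sinkhorn scaling of A^{p,ϕ}; for fixed Q, the maximizer over p∈Δ^𝒟 has the closed form p_x ∝ ∑_y Q_{x,y} f_y.

# Alternating maximization heuristic for the scaled Sinkhorn proxy (a sketch).
import numpy as np

def sinkhorn(A, iters=500):
    # For positive A this converges to the doubly stochastic scaling of A,
    # which maximizes U(A, Q) over doubly stochastic Q.
    Q = A.copy()
    for _ in range(iters):
        Q /= Q.sum(axis=1, keepdims=True)
        Q /= Q.sum(axis=0, keepdims=True)
    return Q

def alternating_max(f, iters=50, eps=1e-12):
    # f: length-|D| array of frequencies f_y (zero for unseen elements).
    D = len(f)
    p = np.full(D, 1.0 / D)
    for _ in range(iters):
        A = p[:, None] ** f[None, :]       # A^{p,phi}_{x,y} = p_x^{f_y}
        Q = sinkhorn(A)                    # best Q for this p
        p = np.maximum(Q @ f, eps)         # best p for this Q (clipped away from 0)
        p /= p.sum()
    return p, Q

p, Q = alternating_max(np.array([3.0, 2.0, 2.0, 1.0, 0.0, 0.0]))  # n = 8, two unseen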

To address this issue we use the idea of probability discretization from [CSS19], i.e., we assume the distribution takes all its probability values from some fixed predefined set. We use this idea differently than [CSS19] and further exploit the structure of the optimal solution Q to write a convex optimization problem that approximates the PML objective. The solution of this convex optimization problem is a fractional representation of a distribution that we later round to obtain an approximate PML distribution with the desired guarantees. Surprisingly, the final convex optimization problem we write is exactly the same as the one in [CSS19], and our work gives a better analysis of the same convex program by a completely different approach.

The rest of this section is organized as follows. In Section 6.1 we study probability discretization; in the same section we also apply the results from Section 4 to approximate the permanent of the profile probability matrix A^{p,ϕ}, and at the end of the section we provide the convex optimization problem that can be solved efficiently and returns a fractional representation of the approximate PML distribution. In Section 6.2 we provide the rounding algorithm that returns our final approximate PML distribution. Up to this point all our results are independent of the choice of the probability discretization set. In Section 6.3 we choose an appropriate probability discretization set, combine the analysis from the previous sections, and state and analyze our final algorithm, which returns an exp(−O(√n log n))-approximate PML distribution. Our rounding algorithm is technical, and for continuity of reading we defer the proofs of the results in Section 6.2 to Section 6.4.

6.1 Probability discretization

Here we study the idea of probability discretization, which is also used in [CSS19]. We combine it with other ideas from Section 4 to provide a convex program that approximates the PML objective.

Let R⊆[0,1] be some discretization of the probability space; in this section we consider distributions that take all their probability values in the set R. All results in this section hold for any finite set R, and we specify the exact choice of R in Section 6.3.

The discretization introduces a technicality: probability values may no longer sum to one. To deal with this, we adapt the definitions of pseudo-distribution and discrete pseudo-distribution from [CSS19].

Definition 6.1 (Pseudo-distribution).

q[0,1]𝒟\textbf{q}\in[0,1]^{\mathcal{D}}_{\mathbb{R}} is a pseudo-distribution if q11\|\textbf{q}\|_{1}\leq 1 and a discrete pseudo-distribution with respect to R if all its entries are in R as well. We use Δpseudo𝒟\Delta_{pseudo}^{\mathcal{D}} and ΔR𝒟\Delta_{\textbf{R}}^{\mathcal{D}} to denote the set of all such pseudo-distributions and discrete pseudo-distributions with respect to R respectively.

We extend and use the following definition for (v,yn)\mathbb{P}(\textbf{v},y^{n}) for any vector v0𝒟\textbf{v}\in\mathbb{R}_{\geq 0}^{\mathcal{D}} and therefore for pseudo-distributions as well,

(v,yn)=defx𝒟vxf(yn,x).\mathbb{P}(\textbf{v},y^{n})\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\prod_{x\in\mathcal{D}}\textbf{v}_{x}^{\textbf{f}(y^{n},x)}~{}.

Further, for any probability terms defined involving p, we define those terms for any vector v0𝒟\textbf{v}\in\mathbb{R}_{\geq 0}^{\mathcal{D}} just by replacing px\textbf{p}_{x} by vx\textbf{v}_{x} everywhere. For convenience we refer to (q,ϕ)\mathbb{P}(\textbf{q},\phi) for any pseudo-distribution q as the “probability” of profile ϕ\phi with respect to q.

For a scalar cc and set S, define cS\lfloor c\rfloor_{\textbf{S}} and cS\lceil c\rceil_{\textbf{S}} as follows:

cS=defsupsS:scs and cS=definfsS:scs\lfloor c\rfloor_{\textbf{S}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\sup_{s\in\textbf{S}:s\leq c}s\quad\text{ and }\quad\lceil c\rceil_{\textbf{S}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\inf_{s\in\textbf{S}:s\geq c}s
Definition 6.2 (Discrete pseudo-distribution).

For any distribution pΔ𝒟\textbf{p}\in\Delta^{\mathcal{D}}, its discrete pseudo-distribution q=disc(p)ΔR𝒟\textbf{q}=\mathrm{disc}(\textbf{p})\in\Delta_{\textbf{R}}^{\mathcal{D}} with respect to R is defined as:

qx=defpxRx𝒟\textbf{q}_{x}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\lfloor\textbf{p}_{x}\rfloor_{\textbf{R}}\quad\forall x\in\mathcal{D}
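To make these rounding maps concrete, here is a minimal numerical sketch (our own illustration, not from the paper) of \lfloor\cdot\rfloor_{\textbf{R}} and \mathrm{disc}(\cdot) for a small grid R; the helper names are ours.

```python
import numpy as np

def floor_in_set(c, R):
    """sup{s in R : s <= c}; we return 0.0 when no grid point lies below c."""
    candidates = [s for s in R if s <= c]
    return max(candidates) if candidates else 0.0

def disc(p, R):
    """Definition 6.2: round every probability of p down to the grid R."""
    return np.array([floor_in_set(px, R) for px in p])

R = [1.0, 0.5, 0.25, 0.125, 0.0625]
p = [0.6, 0.3, 0.1]
q = disc(p, R)                # array([0.5, 0.25, 0.0625])
assert q.sum() <= 1.0         # q is a pseudo-distribution, not a distribution
```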

We now introduce additional definitions and notation that help us lower and upper bound the permanent of the profile probability matrix via a convex optimization problem.

  • Let =def|R|\ell\stackrel{{\scriptstyle\mathrm{def}}}{{=}}|\textbf{R}| be the cardinality of set R and ri\textbf{r}_{i} be the ii’th element of set R.

  • For any discrete pseudo-distribution q with respect to R, that is qΔR𝒟\textbf{q}\in\Delta_{\textbf{R}}^{\mathcal{D}}, we let iq=def|{y𝒟|qy=ri}|\ell^{\textbf{q}}_{i}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}|\{y\in\mathcal{D}~{}|~{}\textbf{q}_{y}=\textbf{r}_{i}\}|, be the number of domain elements with probability ri\textbf{r}_{i}.

  • Let ZRq,ϕ0×(k+1)\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}\subseteq\mathbb{R}_{\geq 0}^{\ell\times(k+1)} be the set of non-negative matrices such that, for any SZRq,ϕ\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi} the following holds:

    j[0,k]Si,j=iq for all i[1,]andi[1,]Si,j=ϕj for all j[0,k],\displaystyle\sum_{j\in[0,k]}\textbf{S}_{i,j}=\ell^{\textbf{q}}_{i}\text{ for all }i\in[1,\ell]\quad\text{and}\quad\sum_{i\in[1,\ell]}\textbf{S}_{i,j}=\phi_{j}\text{ for all }j\in[0,k]~{}, (67)

    where \phi_{0} is the number of unseen domain elements (\phi_{0} is not part of the profile and is not given to us; later in this section we remove this dependency on \phi_{0}), and we define \textbf{m}_{0}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}0 to denote the corresponding frequency element.

  • For any S0×(k+1)\textbf{S}\in\mathbb{R}_{\geq 0}^{\ell\times(k+1)} define,

    h(S)=i[1,]j[0,k][Si,jlog(rimjSi,j)]+i[1,](j[0,k]Si,j)log(j[0,k]Si,j)+j[0,k]ϕjlogϕjj[0,k]ϕj.\textbf{h}(\textbf{S})=\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\left[\textbf{S}_{i,j}\log(\frac{\textbf{r}_{i}^{\textbf{m}_{j}}}{\textbf{S}_{i,j}})\right]+\sum_{i\in[1,\ell]}\left(\sum_{j\in[0,k]}\textbf{S}_{i,j}\right)\log\left(\sum_{j\in[0,k]}\textbf{S}_{i,j}\right)+\sum_{j\in[0,k]}\phi_{j}\log\phi_{j}-\sum_{j\in[0,k]}\phi_{j}~{}. (68)
  • Throughout this section, for convenience unless stated otherwise we abuse notation and use A to denote the matrix Aq,ϕ\textbf{A}^{\textbf{q},\phi}. The underlying pseudo-distribution q and profile ϕ\phi with respect to matrix A will be clear from the context.
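As a sanity check on this notation, the sketch below (ours; with the convention 0\log 0=0) evaluates h(S) from Equation 68 numerically, given arrays r, m and \phi.

```python
import numpy as np

def xlogx(t):
    """Entrywise t*log(t) with the convention 0*log(0) = 0."""
    t = np.asarray(t, float)
    return np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)

def h(S, r, m, phi):
    """Equation 68: S is l x (k+1); r holds the grid values, m the distinct
    frequencies (with m[0] = 0) and phi the profile counts (phi[0] = unseen)."""
    S, r, m, phi = (np.asarray(v, float) for v in (S, r, m, phi))
    log_w = np.outer(np.log(r), m)                 # log(r_i^{m_j})
    term1 = np.sum(S * log_w) - np.sum(xlogx(S))   # sum_ij S_ij log(r_i^{m_j}/S_ij)
    term2 = np.sum(xlogx(S.sum(axis=1)))           # row-sum term
    return term1 + term2 + np.sum(xlogx(phi)) - phi.sum()
```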

The first half of this section is dedicated to bounding \mathrm{perm}(\textbf{A}) in terms of the function h(S). For any fixed discrete pseudo-distribution q and profile \phi, we will show that,

maxSZRq,ϕh(S)logperm(Aq,ϕ)O(klogNk)+maxSZRq,ϕh(S).\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\textbf{h}(\textbf{S})\leq\log\mathrm{perm}(\textbf{A}^{\textbf{q},\phi})\leq O(k\log\frac{N}{k})+\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\textbf{h}(\textbf{S})~{}.

In the second half, we use the above inequality and maximize over all discrete pseudo-distributions to find an approximate PML distribution; the result is summarized later. We show the lower bound first, and later in Theorem 6.4 we prove the upper bound.

Theorem 6.3.

For any discrete pseudo-distribution q with respect to R and profile \phi, let A be the matrix defined (with respect to q and \phi) in Equation 65. Then the following holds,

logperm(A)maxSZRq,ϕh(S).\log\mathrm{perm}(\textbf{A})\geq\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\textbf{h}(\textbf{S})~{}. (69)
Proof.

For any matrix SZRq,ϕ\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}, define matrix Q𝒟×𝒟\textbf{Q}\in\mathbb{R}^{\mathcal{D}\times\mathcal{D}} as follows,

Qx,y=defSi,jiqϕj\textbf{Q}_{x,y}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\frac{\textbf{S}_{i,j}}{\ell^{\textbf{q}}_{i}\phi_{j}}

where in the definition above ii and jj are such that qx=ri\textbf{q}_{x}=\textbf{r}_{i} and fy=mj\textbf{f}_{y}=\textbf{m}_{j}. We now establish that matrix Q is doubly stochastic. For each x𝒟x\in\mathcal{D}, let ii be such that qx=ri\textbf{q}_{x}=\textbf{r}_{i}, then

\begin{split}\sum_{y\in\mathcal{D}}\textbf{Q}_{x,y}&=\sum_{j\in[0,k]}\sum_{\{y\in\mathcal{D}~{}|~{}\textbf{f}_{y}=\textbf{m}_{j}\}}\frac{\textbf{S}_{i,j}}{\ell^{\textbf{q}}_{i}\phi_{j}}=\sum_{j\in[0,k]}\frac{\textbf{S}_{i,j}}{\ell^{\textbf{q}}_{i}\phi_{j}}\sum_{\{y\in\mathcal{D}~{}|~{}\textbf{f}_{y}=\textbf{m}_{j}\}}1\\ &=\sum_{j\in[0,k]}\frac{\textbf{S}_{i,j}}{\ell^{\textbf{q}}_{i}\phi_{j}}\cdot\phi_{j}=\frac{1}{\ell^{\textbf{q}}_{i}}\sum_{j\in[0,k]}\textbf{S}_{i,j}=1~{}.\\ \end{split} (70)

For each y𝒟y\in\mathcal{D}, let jj be such that fy=mj\textbf{f}_{y}=\textbf{m}_{j}, then

\begin{split}\sum_{x\in\mathcal{D}}\textbf{Q}_{x,y}&=\sum_{i\in[1,\ell]}\sum_{\{x\in\mathcal{D}~{}|~{}\textbf{q}_{x}=\textbf{r}_{i}\}}\frac{\textbf{S}_{i,j}}{\ell^{\textbf{q}}_{i}\phi_{j}}=\sum_{i\in[1,\ell]}\frac{\textbf{S}_{i,j}}{\ell^{\textbf{q}}_{i}\phi_{j}}\sum_{\{x\in\mathcal{D}~{}|~{}\textbf{q}_{x}=\textbf{r}_{i}\}}1\\ &=\sum_{i\in[1,\ell]}\frac{\textbf{S}_{i,j}}{\ell^{\textbf{q}}_{i}\phi_{j}}\cdot\ell^{\textbf{q}}_{i}=\frac{1}{\phi_{j}}\sum_{i\in[1,\ell]}\textbf{S}_{i,j}=1~{}.\\ \end{split} (71)

In the last equality of each of Equations 70 and 71 we used the constraints in Equation 67. Since the matrix Q is doubly stochastic, by the definition of the scaled Sinkhorn permanent (see 2.6) and 2.7 we have \log\mathrm{perm}(\textbf{A})\geq\mathrm{U}(\textbf{A},\textbf{Q})-N. To conclude the proof we show that \mathrm{U}(\textbf{A},\textbf{Q})-N=\textbf{h}(\textbf{S}).

U(A,Q)=(x,y)𝒟×𝒟Qx,ylog(Ax,yQx,y)=i[1,]j[0,k]iqϕjSi,jiqϕjlog(rimjiqϕjSi,j)=i[1,]j[0,k]Si,jlog(rimjiqϕjSi,j).\begin{split}\mathrm{U}(\textbf{A},\textbf{Q})&=\sum_{(x,y)\in\mathcal{D}\times\mathcal{D}}\textbf{Q}_{x,y}\log(\frac{{\textbf{A}}_{x,y}}{\textbf{Q}_{x,y}})=\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\ell^{\textbf{q}}_{i}\phi_{j}\cdot\frac{\textbf{S}_{i,j}}{\ell^{\textbf{q}}_{i}\phi_{j}}\log(\frac{\textbf{r}_{i}^{\textbf{m}_{j}}\ell^{\textbf{q}}_{i}\phi_{j}}{\textbf{S}_{i,j}})\\ &=\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\textbf{S}_{i,j}\log(\frac{\textbf{r}_{i}^{\textbf{m}_{j}}\ell^{\textbf{q}}_{i}\phi_{j}}{\textbf{S}_{i,j}})~{}.\end{split} (72)

We consider the final expression above and simplify it. First note that,

i[1,]j[0,k]Si,jlogiq=i[1,]logiqj[0,k]Si,j=i[1,]iqlogiq.\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\textbf{S}_{i,j}\log\ell^{\textbf{q}}_{i}=\sum_{i\in[1,\ell]}\log\ell^{\textbf{q}}_{i}\sum_{j\in[0,k]}\textbf{S}_{i,j}=\sum_{i\in[1,\ell]}\ell^{\textbf{q}}_{i}\log\ell^{\textbf{q}}_{i}~{}.

Similarly,

i[1,]j[0,k]Si,jlogϕj=j[0,k]logϕji[1,]Si,j=j[0,k]ϕjlogϕj.\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\textbf{S}_{i,j}\log\phi_{j}=\sum_{j\in[0,k]}\log\phi_{j}\sum_{i\in[1,\ell]}\textbf{S}_{i,j}=\sum_{j\in[0,k]}\phi_{j}\log\phi_{j}~{}.

Using the above two expressions, the final expression of Equation 72 can be equivalently written as,

i[1,]j[0,k]Si,jlog(rimjiqϕjSi,j)=i[1,]j[0,k][Si,jlog(rimjSi,j)]+i[1,]iqlogiq+j[0,k]ϕjlogϕj.\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\textbf{S}_{i,j}\log(\frac{\textbf{r}_{i}^{\textbf{m}_{j}}\ell^{\textbf{q}}_{i}\phi_{j}}{\textbf{S}_{i,j}})=\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\left[\textbf{S}_{i,j}\log(\frac{\textbf{r}_{i}^{\textbf{m}_{j}}}{\textbf{S}_{i,j}})\right]+\sum_{i\in[1,\ell]}\ell^{\textbf{q}}_{i}\log\ell^{\textbf{q}}_{i}+\sum_{j\in[0,k]}\phi_{j}\log\phi_{j}~{}. (73)

Combining Equation 72, Equation 73 and substituting N=j[0,k]ϕjN=\sum_{j\in[0,k]}\phi_{j}, we get:

U(A,Q)N=i[1,]j[0,k]Si,jlog(rimjSi,j)+i[1,]iqlogiq+j[0,k]ϕjlogϕjj[0,k]ϕj=h(S).\mathrm{U}(\textbf{A},\textbf{Q})-N=\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\textbf{S}_{i,j}\log(\frac{\textbf{r}_{i}^{\textbf{m}_{j}}}{\textbf{S}_{i,j}})+\sum_{i\in[1,\ell]}\ell^{\textbf{q}}_{i}\log\ell^{\textbf{q}}_{i}+\sum_{j\in[0,k]}\phi_{j}\log\phi_{j}-\sum_{j\in[0,k]}\phi_{j}=\textbf{h}(\textbf{S})~{}.

In the above equality we used \sum_{j\in[0,k]}\textbf{S}_{i,j}=\ell^{\textbf{q}}_{i} for all i\in[1,\ell], which holds for any \textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}. Combining the above equality with \log\mathrm{perm}(\textbf{A})\geq\mathrm{U}(\textbf{A},\textbf{Q})-N we get,

logperm(A)h(S).\log\mathrm{perm}(\textbf{A})\geq\textbf{h}(\textbf{S})~{}.

The above inequality holds for any SZRq,ϕ\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi} (and therefore holds for the maximizer as well) and we conclude the proof. ∎
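The construction in this proof is easy to verify numerically. The following sketch (toy numbers and names are ours) expands a small \textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi} into the corresponding matrix Q and checks that it is doubly stochastic.

```python
import numpy as np

def expand_to_Q(S, lq, phi):
    """Build the N x N matrix Q of the proof: row block i repeats lq[i] times,
    column block j repeats phi[j] times, each entry S[i,j]/(lq[i]*phi[j])."""
    S = np.asarray(S, float)
    return np.block([[np.full((lq[i], phi[j]), S[i, j] / (lq[i] * phi[j]))
                      for j in range(S.shape[1])] for i in range(S.shape[0])])

# Toy instance: l = 2 probability levels, k + 1 = 2 frequency groups, N = 3.
lq, phi = np.array([2, 1]), np.array([1, 2])   # row / column sums of S (Eq. 67)
S = np.array([[1.0, 1.0], [0.0, 1.0]])         # S lies in Z_R^{q,phi}
Q = expand_to_Q(S, lq, phi)
assert np.allclose(Q.sum(axis=1), 1.0) and np.allclose(Q.sum(axis=0), 1.0)
```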

We next give an upper bound on the log of the permanent of A in terms of h(S).

Theorem 6.4.

For any discrete pseudo-distribution q with respect to R and profile ϕ\phi, let A be the matrix defined (with respect to q and ϕ\phi) in Equation 65. Then,

logperm(A)O(klogNk)+maxSZRq,ϕh(S).\log\mathrm{perm}(\textbf{A})\leq O(k\log\frac{N}{k})+\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\textbf{h}(\textbf{S})~{}.
Proof.

Here we construct a particular matrix \textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi} such that \log\mathrm{perm}(\textbf{A})\leq O(k\log\frac{N}{k})+\textbf{h}(\textbf{S}), which immediately implies the theorem. Recall that by Lemmas 4.5 and 4.4, there exists a matrix \textbf{P}\in\mathbb{R}_{\geq 0}^{\mathcal{D}\times(k+1)} such that \sum_{j\in[0,k]}\textbf{P}_{x,j}=1 for all x\in\mathcal{D} and \sum_{x\in\mathcal{D}}\textbf{P}_{x,j}=\phi_{j} for all j\in[0,k], and that satisfies \log\mathrm{perm}(\textbf{A})\leq O(k\log\frac{N}{k})+\textbf{f}(\textbf{A},\textbf{P}) (the inequality holds because the matrix A has k+1 distinct columns and O((k+1)\log\frac{N}{k+1}) is asymptotically the same as O(k\log\frac{N}{k})). Further, using the definition of \textbf{f}(\textbf{A},\textbf{P}) we get,

logperm(A)O(klogNk)+j[0,k]ϕjlogϕjj[0,k]ϕj+(x,j)𝒟×[0,k]Px,jlogA^x,jPx,j,\log\mathrm{perm}(\textbf{A})\leq O(k\log\frac{N}{k})+\sum_{j\in[0,k]}\phi_{j}\log\phi_{j}-\sum_{j\in[0,k]}\phi_{j}+\sum_{(x,j)\in\mathcal{D}\times[0,k]}\textbf{P}_{x,j}\log\frac{{\hat{\textbf{A}}}_{x,j}}{\textbf{P}_{x,j}}~{}, (74)

where for the matrix A defined (with respect to q and ϕ\phi) in Equation 65, we have,

A^x,j=qxmj.{\hat{\textbf{A}}}_{x,j}=\textbf{q}_{x}^{\textbf{m}_{j}}~{}.

We now define a matrix S and verify that it lies in the set \textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}.

Si,j=def{x𝒟|qx=ri}Px,j\textbf{S}_{i,j}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\sum_{\{x\in\mathcal{D}~{}|~{}\textbf{q}_{x}=\textbf{r}_{i}\}}\textbf{P}_{x,j}

By Theorem 4.7, for any fixed j\in[0,k], all x\in\mathcal{D} with \textbf{q}_{x}=\textbf{r}_{i} share the same value \textbf{P}_{x,j}, and we use \textbf{P}_{i,j} to denote this common value. With this notation, we have:

Si,j=iqPi,j\textbf{S}_{i,j}=\ell^{\textbf{q}}_{i}\textbf{P}_{i,j} (75)

Further note that for any i[1,]i\in[1,\ell], if x𝒟x\in\mathcal{D} is any element such that qx=ri\textbf{q}_{x}=\textbf{r}_{i}, then

j[0,k]Pi,j=j[0,k]Px,j=1\sum_{j\in[0,k]}\textbf{P}_{i,j}=\sum_{j\in[0,k]}\textbf{P}_{x,j}=1

We wish to show that SZRq,ϕ\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}. We first analyze the row sum constraint. For each i[1,]i\in[1,\ell],

j[0,k]Si,j=j[0,k]iqPi,j=iq\sum_{j\in[0,k]}\textbf{S}_{i,j}=\sum_{j\in[0,k]}\ell^{\textbf{q}}_{i}\textbf{P}_{i,j}=\ell^{\textbf{q}}_{i}

We now analyze the column constraint. For each j[0,k]j\in[0,k],

i[1,]Si,j=i[1,]{x𝒟|qx=ri}Px,j=x𝒟Px,j=ϕj\sum_{i\in[1,\ell]}\textbf{S}_{i,j}=\sum_{i\in[1,\ell]}\sum_{\{x\in\mathcal{D}~{}|~{}\textbf{q}_{x}=\textbf{r}_{i}\}}\textbf{P}_{x,j}=\sum_{x\in\mathcal{D}}\textbf{P}_{x,j}=\phi_{j}

In the remainder of the proof we show that the matrix S defined earlier satisfies logperm(A)O(klogNk)+h(S)\log\mathrm{perm}(\textbf{A})\leq O(k\log\frac{N}{k})+\textbf{h}(\textbf{S}). We start by simplifying the term (x,j)𝒟×[0,k]Px,jlogA^x,jPx,j\sum_{(x,j)\in\mathcal{D}\times[0,k]}\textbf{P}_{x,j}\log\frac{{\hat{\textbf{A}}}_{x,j}}{\textbf{P}_{x,j}} in Equation 74,

(x,j)𝒟×[0,k]Px,jlogA^x,jPx,j=j[0,k]i[1,]{x𝒟|qx=ri}Px,jlogA^x,jPx,j=j[0,k]i[1,]{x𝒟|qx=ri}Pi,jlogrimjPi,j=j[0,k]i[1,]iqPi,jlogrimjPi,j=i[1,]j[0,k]Si,jlogrimjiqSi,j=i[1,]j[0,k]Si,jlogrimjSi,j+i[1,]iqlogiq\begin{split}\sum_{(x,j)\in\mathcal{D}\times[0,k]}\textbf{P}_{x,j}\log\frac{{\hat{\textbf{A}}}_{x,j}}{\textbf{P}_{x,j}}&=\sum_{j\in[0,k]}\sum_{i\in[1,\ell]}\sum_{\{x\in\mathcal{D}~{}|~{}\textbf{q}_{x}=\textbf{r}_{i}\}}\textbf{P}_{x,j}\log\frac{{\hat{\textbf{A}}}_{x,j}}{\textbf{P}_{x,j}}=\sum_{j\in[0,k]}\sum_{i\in[1,\ell]}\sum_{\{x\in\mathcal{D}~{}|~{}\textbf{q}_{x}=\textbf{r}_{i}\}}\textbf{P}_{i,j}\log\frac{\textbf{r}_{i}^{\textbf{m}_{j}}}{\textbf{P}_{i,j}}\\ &=\sum_{j\in[0,k]}\sum_{i\in[1,\ell]}\ell^{\textbf{q}}_{i}\textbf{P}_{i,j}\log\frac{\textbf{r}_{i}^{\textbf{m}_{j}}}{\textbf{P}_{i,j}}=\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\textbf{S}_{i,j}\log\frac{\textbf{r}_{i}^{\textbf{m}_{j}}\ell^{\textbf{q}}_{i}}{\textbf{S}_{i,j}}\\ &=\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\textbf{S}_{i,j}\log\frac{\textbf{r}_{i}^{\textbf{m}_{j}}}{\textbf{S}_{i,j}}+\sum_{i\in[1,\ell]}\ell^{\textbf{q}}_{i}\log\ell^{\textbf{q}}_{i}\end{split} (76)

In the second equality, we used A^x,j=rimj{\hat{\textbf{A}}}_{x,j}=\textbf{r}_{i}^{\textbf{m}_{j}} and further by the definition of Pi,j\textbf{P}_{i,j} we have Px,j=Pi,j\textbf{P}_{x,j}=\textbf{P}_{i,j} for all x𝒟x\in\mathcal{D} that satisfy qx=ri\textbf{q}_{x}=\textbf{r}_{i}. In the third equality, we used {x𝒟|qx=ri}1=iq\sum_{\{x\in\mathcal{D}~{}|~{}\textbf{q}_{x}=\textbf{r}_{i}\}}1=\ell^{\textbf{q}}_{i}. In the fourth equality we used Equation 75. In the final equality, we used i[1,]j[0,k]Si,jlogrimjiqSi,j=i[1,]j[0,k]Si,jlogrimjSi,j+i[1,]j[0,k]Si,jlogiq\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\textbf{S}_{i,j}\log\frac{\textbf{r}_{i}^{\textbf{m}_{j}}\ell^{\textbf{q}}_{i}}{\textbf{S}_{i,j}}=\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\textbf{S}_{i,j}\log\frac{\textbf{r}_{i}^{\textbf{m}_{j}}}{\textbf{S}_{i,j}}+\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\textbf{S}_{i,j}\log\ell^{\textbf{q}}_{i} and the final term further simplifies to the following, i[1,]j[0,k]Si,jlogiq=i[1,]logiqj[0,k]Si,j=i[1,]iqlogiq\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\textbf{S}_{i,j}\log\ell^{\textbf{q}}_{i}=\sum_{i\in[1,\ell]}\log\ell^{\textbf{q}}_{i}\sum_{j\in[0,k]}\textbf{S}_{i,j}=\sum_{i\in[1,\ell]}\ell^{\textbf{q}}_{i}\log\ell^{\textbf{q}}_{i}.

We conclude the proof by combining equations 74 and 76 and using j[0,k]Si,j=iq\sum_{j\in[0,k]}\textbf{S}_{i,j}=\ell^{\textbf{q}}_{i} for any SZRq,ϕ\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}. ∎

Note that using Theorems 6.3 and 6.4, for the matrix A defined (with respect to q and \phi) in Equation 65, we have shown the following,

maxSZRq,ϕh(S)logperm(A)O(klogNk)+maxSZRq,ϕh(S).\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\textbf{h}(\textbf{S})\leq\log\mathrm{perm}(\textbf{A})\leq O(k\log\frac{N}{k})+\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\textbf{h}(\textbf{S})~{}. (77)

Our final goal in this section is to maximize \mathbb{P}(\textbf{q},\phi)\propto\frac{1}{\phi_{0}!}\mathrm{perm}(\textbf{A}) over discrete pseudo-distributions q, but let us take a step back and first focus on writing an upper bound. Consider the term \max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\textbf{h}(\textbf{S}) in the expression above: it depends on the discrete pseudo-distribution q in two places. The first is the constraint set \textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi} and the second is the function h(S) itself (through the \phi_{0} term in its expression). We address the first issue by defining the following new set, which encodes the constraint set \textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi} for all discrete pseudo-distributions simultaneously.

Definition 6.5.

Let ZRϕ0×(k+1)\textbf{Z}^{\phi}_{\textbf{R}}\subset\mathbb{R}_{\geq 0}^{\ell\times(k+1)} be the set of non-negative matrices, such that any SZRϕ\textbf{S}\in\textbf{Z}^{\phi}_{\textbf{R}} satisfies,

\sum_{i\in[1,\ell]}\textbf{S}_{i,j}=\phi_{j}\text{ for all }j\in[1,k],\quad\sum_{j\in[0,k]}\textbf{S}_{i,j}\in\mathbb{Z}_{+}\text{ for all }i\in[1,\ell]\quad\text{ and }\quad\sum_{i\in[1,\ell]}\textbf{r}_{i}\sum_{j\in[0,k]}\textbf{S}_{i,j}\leq 1~{}. (78)

Note that in the definition of \textbf{Z}^{\phi}_{\textbf{R}} we removed the constraint involving \phi_{0}; recall that \phi_{0} denotes the number of unseen domain elements. Not having a constraint with respect to \phi_{0} lets us encode discrete pseudo-distributions (with respect to R) of different domain sizes. Further, every \textbf{S}\in\textbf{Z}^{\phi}_{\textbf{R}} has a discrete pseudo-distribution associated with it, which we define next.

Definition 6.6.

For any \textbf{S}\in\textbf{Z}^{\phi}_{\textbf{R}}, the discrete pseudo-distribution \textbf{q}_{\textbf{S}} associated with S is defined as follows: for each i\in[1,\ell], assign probability \textbf{r}_{i} to \sum_{j\in[0,k]}\textbf{S}_{i,j} arbitrarily chosen domain elements.

Note that in the definition above \textbf{q}_{\textbf{S}} is a valid pseudo-distribution because of the third condition in Equation 78. Further, for any discrete pseudo-distribution q and \textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}, the distribution \textbf{q}_{\textbf{S}} associated with S is a permutation of the distribution q. Since the probability of a profile is invariant under permutations of the distribution, we treat all these distributions as the same and do not distinguish between them.

We now handle the second issue, namely removing the dependence on the discrete pseudo-distribution q from the function h(S). To this end, we define a new function g(S) that, when maximized over the sets \textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi} and \textbf{Z}^{\phi}_{\textbf{R}}, approximates the values \mathbb{P}(\textbf{q},\phi) and \max_{\textbf{q}\in\Delta_{\textbf{R}}^{\mathcal{D}}}\mathbb{P}(\textbf{q},\phi) respectively (see the next theorem for the formal statement). For any \textbf{S}\in\mathbb{R}_{\geq 0}^{\ell\times(k+1)}, the function g(S) is defined as follows,

g(S)=defexp(i[1,]j[0,k][Si,jlog(rimjSi,j)]+i[1,](j[0,k]Si,j)log(j[0,k]Si,j)).\textbf{g}(\textbf{S})\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\exp\left(\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\left[\textbf{S}_{i,j}\log(\frac{\textbf{r}_{i}^{\textbf{m}_{j}}}{\textbf{S}_{i,j}})\right]+\sum_{i\in[1,\ell]}\left(\sum_{j\in[0,k]}\textbf{S}_{i,j}\right)\log\left(\sum_{j\in[0,k]}\textbf{S}_{i,j}\right)\right)~{}. (79)

Note that we switch gears here and define g(S) as an exponential: g(S) approximates the value \mathbb{P}(\textbf{q},\phi) itself rather than its logarithm, which helps with proof readability. The following theorem summarizes this result.

Theorem 6.7.

Let R be a probability discretization set. Given a profile \phi and a discrete pseudo-distribution q with respect to R, the following inequality holds,

exp(O(klog(N+n)))CϕmaxSZRq,ϕg(S)(q,ϕ)exp(O(klogNk))CϕmaxSZRq,ϕg(S)\exp\left(-O(k\log(N+n))\right)\cdot C_{\phi}\cdot\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\textbf{g}(\textbf{S})\leq\mathbb{P}(\textbf{q},\phi)\leq\exp\left(O\left(k\log\frac{N}{k}\right)\right)\cdot C_{\phi}\cdot\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\textbf{g}(\textbf{S}) (80)

Further,

exp(O(klog(N+n)))CϕmaxSZRϕg(S)maxqΔR𝒟(q,ϕ)exp(O(klogNk))CϕmaxSZRϕg(S)\exp\left(-O(k\log(N+n))\right)\cdot C_{\phi}\cdot\max_{\textbf{S}\in\textbf{Z}^{\phi}_{\textbf{R}}}\textbf{g}(\textbf{S})\leq\max_{\textbf{q}\in\Delta_{\textbf{R}}^{\mathcal{D}}}\mathbb{P}(\textbf{q},\phi)\leq\exp\left(O\left(k\log\frac{N}{k}\right)\right)\cdot C_{\phi}\cdot\max_{\textbf{S}\in\textbf{Z}^{\phi}_{\textbf{R}}}\textbf{g}(\textbf{S}) (81)
Proof.

For any discrete pseudo-distribution q with respect to R and profile ϕ\phi, let A be the matrix defined (with respect to q and ϕ\phi) in Equation 65. Then, by Equation 77 we have,

maxSZRq,ϕh(S)logperm(A)O(klogNk)+maxSZRq,ϕh(S).\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\textbf{h}(\textbf{S})\leq\log\mathrm{perm}(\textbf{A})\leq O(k\log\frac{N}{k})+\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\textbf{h}(\textbf{S})~{}.

Further by Equation 4 we have,

(q,ϕ)=Cϕ(j[0,k]1ϕj!)perm(Aq,ϕ).\mathbb{P}(\textbf{q},\phi)=C_{\phi}\cdot\left(\prod_{j\in[0,k]}\frac{1}{\phi_{j}!}\right)\cdot\mathrm{perm}(\textbf{A}^{\textbf{q},\phi})~{}.

Combining the above two equations we have,

Cϕ(j[0,k]1ϕj!)maxSZRq,ϕexp(h(S))(q,ϕ)exp(O(klogNk))Cϕ(j[0,k]1ϕj!)maxSZRq,ϕexp(h(S))C_{\phi}\cdot\left(\prod_{j\in[0,k]}\frac{1}{\phi_{j}!}\right)\cdot\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\exp\left(\textbf{h}(\textbf{S})\right)\leq\mathbb{P}(\textbf{q},\phi)\leq\exp\left(O\left(k\log\frac{N}{k}\right)\right)\cdot C_{\phi}\cdot\left(\prod_{j\in[0,k]}\frac{1}{\phi_{j}!}\right)\cdot\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\exp\left(\textbf{h}(\textbf{S})\right) (82)

We now simplify the term (j[0,k]1ϕj!)exp(h(S))\left(\prod_{j\in[0,k]}\frac{1}{\phi_{j}!}\right)\cdot\exp\left(\textbf{h}(\textbf{S})\right) in the above expression. First note that for any SZRq,ϕ\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi},

exp(h(S))=g(S)exp(j[0,k]ϕjlogϕjj[0,k]ϕj).\exp\left(\textbf{h}(\textbf{S})\right)=\textbf{g}(\textbf{S})\cdot\exp\left(\sum_{j\in[0,k]}\phi_{j}\log\phi_{j}-\sum_{j\in[0,k]}\phi_{j}\right)~{}.

Therefore,

(j[0,k]1ϕj!)exp(h(S))=(j[0,k]1ϕj!)g(S)exp(j[0,k]ϕjlogϕjj[0,k]ϕj).\begin{split}\left(\prod_{j\in[0,k]}\frac{1}{\phi_{j}!}\right)\cdot\exp\left(\textbf{h}(\textbf{S})\right)&=\left(\prod_{j\in[0,k]}\frac{1}{\phi_{j}!}\right)\cdot\textbf{g}(\textbf{S})\cdot\exp\left(\sum_{j\in[0,k]}\phi_{j}\log\phi_{j}-\sum_{j\in[0,k]}\phi_{j}\right)~{}.\end{split} (83)

By Lemma 2.8 (Stirling’s approximation) we have,

exp(O(klog(N+n)))(j[0,k]1ϕj!)exp(j[0,k]ϕjlogϕjj[0,k]ϕj)1.\exp\left(-O\left(k\log(N+n)\right)\right)\leq\left(\prod_{j\in[0,k]}\frac{1}{\phi_{j}!}\right)\cdot\exp\left(\sum_{j\in[0,k]}\phi_{j}\log\phi_{j}-\sum_{j\in[0,k]}\phi_{j}\right)\leq 1~{}. (84)

The first inequality follows because for each j\in[0,k], we have \frac{1}{\phi_{j}!}\exp\left(\phi_{j}\log\phi_{j}-\phi_{j}\right)\geq\Omega(\frac{1}{\sqrt{\phi_{j}+1}}), which, using \phi_{j}\leq N+n, is further lower bounded by \Omega(\frac{1}{\sqrt{N+n}})\geq\exp\left(-O(\log(N+n))\right). Equation 84 then follows by taking the product over all j\in[0,k]. Now combining Equation 84 and Equation 83 we have,

exp(O(klog(N+n)))g(S)(j[0,k]1ϕj!)exp(h(S))g(S).\exp\left(-O(k\log(N+n))\right)\cdot\textbf{g}(\textbf{S})\leq\left(\prod_{j\in[0,k]}\frac{1}{\phi_{j}!}\right)\cdot\exp\left(\textbf{h}(\textbf{S})\right)\leq\textbf{g}(\textbf{S})~{}. (85)

The first statement of the theorem follows by combining the above Equation 85 with Equation 82; that is, we have,

exp(O(klog(N+n)))CϕmaxSZRq,ϕg(S)(q,ϕ)exp(O(klogNk))CϕmaxSZRq,ϕg(S).\exp\left(-O(k\log(N+n))\right)\cdot C_{\phi}\cdot\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\textbf{g}(\textbf{S})\leq\mathbb{P}(\textbf{q},\phi)\leq\exp\left(O\left(k\log\frac{N}{k}\right)\right)\cdot C_{\phi}\cdot\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\textbf{g}(\textbf{S})~{}. (86)

Given a profile ϕ\phi, for any discrete pseudo-distribution qΔR𝒟\textbf{q}\in\Delta_{\textbf{R}}^{\mathcal{D}} we have ZRq,ϕZRϕ\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}\subseteq\textbf{Z}^{\phi}_{\textbf{R}} and further combining it with above inequality we get,

maxqΔR𝒟(q,ϕ)exp(O(klogNk))CϕmaxSZRϕg(S).\max_{\textbf{q}\in\Delta_{\textbf{R}}^{\mathcal{D}}}\mathbb{P}(\textbf{q},\phi)\leq\exp\left(O\left(k\log\frac{N}{k}\right)\right)\cdot C_{\phi}\cdot\max_{\textbf{S}\in\textbf{Z}^{\phi}_{\textbf{R}}}\textbf{g}(\textbf{S})~{}.

Note that for any \textbf{S}\in\textbf{Z}^{\phi}_{\textbf{R}}, we also have \textbf{S}\in\textbf{Z}^{\textbf{q}_{\textbf{S}},\phi}_{\textbf{R}}, where \textbf{q}_{\textbf{S}} is the discrete pseudo-distribution associated with S (see 6.6). Therefore,

exp(O(klog(N+n)))CϕmaxSZRϕg(S)exp(O(klog(N+n)))CϕmaxqΔR𝒟maxSZRq,ϕg(S)maxqΔR𝒟(q,ϕ).\exp\left(-O(k\log(N+n))\right)\cdot C_{\phi}\cdot\max_{\textbf{S}\in\textbf{Z}^{\phi}_{\textbf{R}}}\textbf{g}(\textbf{S})\leq\exp\left(-O(k\log(N+n))\right)\cdot C_{\phi}\cdot\max_{\textbf{q}\in\Delta_{\textbf{R}}^{\mathcal{D}}}\max_{\textbf{S}\in\textbf{Z}_{\textbf{R}}^{{\textbf{q}},\phi}}\textbf{g}(\textbf{S})\leq\max_{\textbf{q}\in\Delta_{\textbf{R}}^{\mathcal{D}}}\mathbb{P}(\textbf{q},\phi)~{}.

For the last inequality in the above derivation we used Equation 86. Now combining the previous two inequalities we conclude the proof. ∎

The previous theorem provides an upper bound on the probability of a profile with respect to any discrete pseudo-distribution. However, this upper bound is not efficiently computable because the set \textbf{Z}^{\phi}_{\textbf{R}} is not convex (due to the integrality constraints). We relax these integrality constraints and define the following new set.

Definition 6.8.

Let ZRϕ,frac0×(k+1)\textbf{Z}^{\phi,frac}_{\textbf{R}}\subseteq\mathbb{R}_{\geq 0}^{\ell\times(k+1)} be the set of non-negative matrices, such that any SZRϕ,frac\textbf{S}\in\textbf{Z}^{\phi,frac}_{\textbf{R}} satisfies,

\sum_{i\in[1,\ell]}\textbf{S}_{i,j}=\phi_{j}\text{ for all }j\in[1,k]\quad\text{ and }\quad\sum_{i\in[1,\ell]}\textbf{r}_{i}\sum_{j\in[0,k]}\textbf{S}_{i,j}\leq 1~{}. (87)
Lemma 6.9.

Let R be a probability discretization set. Given a profile ϕ\phi, the following holds,

maxqΔR𝒟(q,ϕ)exp(O(klogNk))CϕmaxSZRϕ,fracg(S)\max_{\textbf{q}\in\Delta_{\textbf{R}}^{\mathcal{D}}}\mathbb{P}(\textbf{q},\phi)\leq\exp\left(O\left(k\log\frac{N}{k}\right)\right)\cdot C_{\phi}\cdot\max_{\textbf{S}\in\textbf{Z}^{\phi,frac}_{\textbf{R}}}\textbf{g}(\textbf{S}) (88)
Proof.

By Theorem 6.7 we already have,

maxqΔR𝒟(q,ϕ)exp(O(klogNk))CϕmaxSZRϕg(S).\max_{\textbf{q}\in\Delta_{\textbf{R}}^{\mathcal{D}}}\mathbb{P}(\textbf{q},\phi)\leq\exp\left(O\left(k\log\frac{N}{k}\right)\right)\cdot C_{\phi}\cdot\max_{\textbf{S}\in\textbf{Z}^{\phi}_{\textbf{R}}}\textbf{g}(\textbf{S})~{}.

The lemma holds because ZRϕZRϕ,frac\textbf{Z}^{\phi}_{\textbf{R}}\subseteq\textbf{Z}^{\phi,frac}_{\textbf{R}}. ∎

Note that in the above lemma the upper bound depends only on the profile (C_{\phi} has no dependency on \phi_{0}), and we have removed all dependencies on distributions (and on \phi_{0}). Next we show that this upper bound can be computed efficiently, using the result that the function g(S) is log concave in S.

Lemma 6.10 (Lemma 4.16 in [CSS19]).

Function g(S)\textbf{g}(\textbf{S}) is log concave in S.

Theorem 6.11 (Theorem 4.17 in [CSS19]).

Given a profile \phi\in\Phi^{n}, the optimization problem \max_{\textbf{S}\in\textbf{Z}^{\phi,frac}_{\textbf{R}}}\log\textbf{g}(\textbf{S}) can be solved in time \widetilde{O}(k^{2}\ell) (here \widetilde{O} hides the logarithmic dependence on n, the sample size).
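The \widetilde{O}(k^{2}\ell) algorithm above is specialized; purely as an illustration, the concave objective \log\textbf{g}(\textbf{S}) can also be handed to a generic solver. The sketch below is ours (the small eps guards the logarithms, and no claim is made about matching the stated running time); it maximizes \log\textbf{g} over \textbf{Z}^{\phi,frac}_{\textbf{R}} with scipy.

```python
import numpy as np
from scipy.optimize import minimize

def solve_frac(r, m, phi, eps=1e-12):
    """Maximize log g(S) (Equation 79) over Z^{phi,frac}_R (Definition 6.8)."""
    r, m, phi = (np.asarray(v, float) for v in (r, m, phi))
    l, k1 = len(r), len(m)
    log_w = np.outer(np.log(r), m)                    # log(r_i^{m_j})

    def neg_log_g(s):
        S = s.reshape(l, k1)
        rows = S.sum(axis=1)
        val = (np.sum(S * log_w) - np.sum(S * np.log(S + eps))
               + np.sum(rows * np.log(rows + eps)))
        return -val

    cons = [{"type": "eq",                            # column sums phi_j, j = 1..k
             "fun": lambda s, j=j: s.reshape(l, k1)[:, j].sum() - phi[j]}
            for j in range(1, k1)]
    cons.append({"type": "ineq",                      # sum_i r_i (row sum)_i <= 1
                 "fun": lambda s: 1.0 - r @ s.reshape(l, k1).sum(axis=1)})

    s0 = np.full(l * k1, 1.0 / (l * k1))              # strictly positive start
    res = minimize(neg_log_g, s0, method="SLSQP",
                   bounds=[(0.0, None)] * (l * k1), constraints=cons)
    return res.x.reshape(l, k1)
```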

6.2 Rounding Algorithm

In the previous section we provided an efficiently computable upper bound on the probability of the profile \phi with respect to any discrete pseudo-distribution \textbf{q}\in\Delta_{\textbf{R}}^{\mathcal{D}}. Computing this upper bound yields a solution \textbf{S}\in\textbf{Z}^{\phi,frac}_{\textbf{R}}, which we need to round in order to construct a discrete pseudo-distribution that approximately attains the bound. In this section we provide a rounding algorithm that takes as input \textbf{S}\in\textbf{Z}^{\phi,frac}_{\textbf{R}} and returns a solution \textbf{S}^{\mathrm{ext}}\in\textbf{Z}^{\phi}_{\textbf{R}^{\mathrm{ext}}}, where \textbf{R}^{\mathrm{ext}} is an extended probability discretization set. From \textbf{S}^{\mathrm{ext}} we construct a discrete pseudo-distribution \textbf{q}_{\textbf{S}^{\mathrm{ext}}} with respect to \textbf{R}^{\mathrm{ext}} such that \mathbb{P}(\textbf{q}_{\textbf{S}^{\mathrm{ext}}},\phi) approximates the upper bound and is therefore an approximate PML distribution. Our rounding algorithm is technical, and we next provide an overview to aid understanding.

Overview of the rounding algorithm:

The goal of the rounding algorithm is to take the fractional solution \textbf{S}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\mathrm{arg}\max_{\textbf{S}^{\prime}\in\textbf{Z}^{\phi,frac}_{\textbf{R}}}\log\textbf{g}(\textbf{S}^{\prime}) as input and round each row sum to an integral value while exactly preserving the column sums and approximately preserving the value g(S). Our rounding algorithm proceeds in three steps:

Step 1:

Consider the fractional solution \textbf{S}\in\mathbb{R}_{\geq 0}^{\ell\times(k+1)} and recall that the rows are indexed by the elements of the set R (which represent probability values). We first round the rows corresponding to the higher probability values by simply taking the floor (rounding down to the nearest integer) of each entry. This ensures integrality of the row sums corresponding to higher probability values but violates the column sum constraints. To satisfy the column sum constraints and the distributional constraint (the last condition in Equation 78) simultaneously, we create rows corresponding to new probability values using Algorithm 2. However, to ensure that all these new rows also have integral row sums, we modify the (old) rows corresponding to lower probability values accordingly. Let \textbf{S}^{(1)} be the solution returned by the first step of the rounding algorithm. Algorithm 2 ensures that \textbf{g}(\textbf{S}^{(1)}) is not much smaller than \textbf{g}(\textbf{S}). In \textbf{S}^{(1)}, all the new rows and the (old) rows corresponding to higher probability values have integral row sums; we round the remaining rows, corresponding to smaller probability values, next.

Step 2:

In this step, we round all the rows corresponding to the smaller probability values. For each such row, we scale all of its entries by a common factor so that the row sum is rounded down to the nearest integer. As in step 1, we use Algorithm 2 to create rows corresponding to new probability values that maintain the column sum constraints and the distributional constraint; all these new rows again correspond to small probability values. Unlike in the previous step, the new rows created in step 2 may not have integral row sums, but they have a useful diagonal structure. Let \textbf{S}^{(2)} be the intermediate solution created in step 2. Algorithm 2 ensures that \textbf{g}(\textbf{S}^{(2)}) is not much smaller than \textbf{g}(\textbf{S}^{(1)}) (and hence \textbf{g}(\textbf{S})). Note that all the row sums in \textbf{S}^{(2)} are integral except those of the new rows created in step 2, which all have small probability values and diagonal structure.

Step 3:

In this final step, we use Algorithm 1 to round the new rows created in step 2. Algorithm 1 exploits the low probability values and the diagonal structure of these rows. The diagonal structure ensures that each such row has just one non-zero entry, and we modify the solution \textbf{S}^{(2)} (from the previous step) as follows. We transfer mass from a row with non-integral row sum and lower probability value to the row with the immediately higher probability value until the (lower probability value) row sum is integral. This process might violate the distributional constraint, and we rescale the probability values accordingly to satisfy it. Let \textbf{S}^{\mathrm{ext}} be the solution returned by step 3. We ensure that all column sums are preserved, all row sums are integral, and \textbf{g}(\textbf{S}^{\mathrm{ext}}) is not much smaller than \textbf{g}(\textbf{S}^{(2)}) (and hence not much smaller than \textbf{g}(\textbf{S})).

In the remainder of this section we state all three algorithms and the results corresponding to them. For continuity of reading, we defer the proofs of these results to Section 6.4. For convenience, we first state Algorithm 1, which rounds the rows corresponding to the low probability values in step 3 of our main rounding algorithm (Algorithm 3). We follow this algorithm with a lemma that summarizes the guarantees it provides. Later we state Algorithm 2, which creates rows corresponding to new probability values to preserve the column sums and the distributional constraint; this algorithm is invoked as a subroutine in both steps 1 and 2 of Algorithm 3. Finally, we state our main rounding algorithm, which consists of three different steps, and we then state results analyzing each of these steps separately. The final result (Theorem 6.16) is the main theorem of this subsection and summarizes the guarantees promised by our rounding algorithm.

Algorithm 1 Structured Rounding Algorithm
1:procedure StructuredRounding(x,w,ax,w,a)
2:     Input: x(0,1)[0,k]x\in(0,1)_{\mathbb{R}}^{[0,k]}, w[0,k]w\in\mathbb{R}^{[0,k]} and a=j[0,k]xj+a=\sum_{j\in[0,k]}x_{j}\in\mathbb{Z}_{+}.
3:     Output: z[0,k]×[0,k]z\in\mathbb{R}^{[0,k]\times[0,k]} and sas\in\mathbb{R}^{a}.
4:     Initialize z=0[0,k]×[0,k]z=\textbf{0}^{[0,k]\times[0,k]}.
5:     For each i[1,a]i\in[1,a], let sis_{i} denote the smallest index such that jsixj>i1\sum_{j\leq s_{i}}x_{j}>i-1 and let sa+1=ks_{a+1}=k.
6:     for i[1,a]i\in[1,a] do
zsi,j={xj if si<j<si+1,jsixj(i1) if j=si,1sij<si+1zsi,j if j=si+1.z_{s_{i},j}=\begin{cases}x_{j}&\text{ if }s_{i}<j<s_{i+1}~{},\\ \sum_{j^{\prime}\leq s_{i}}x_{j^{\prime}}-(i-1)&\text{ if }j=s_{i}~{},\\ 1-\sum_{s_{i}\leq j^{\prime}<s_{i+1}}z_{s_{i},j^{\prime}}&\text{ if }j=s_{i+1}~{}.\\ \end{cases} (89)
7:     end for
8:     Return zz and ss.
9:end procedure
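A direct transcription of Algorithm 1 into code may help in parsing Equation 89; the sketch below is our own, and the weights w enter only through the guarantees of the next lemma (whose proof assumes w is sorted in non-increasing order).

```python
import numpy as np

def structured_rounding(x, w):
    """Algorithm 1: x in (0,1)^{k+1} with integral sum a, w the weights
    (unused in building z; they matter only for Lemma 6.12's guarantees).
    Returns z (Equation 89) and the indices s_1, ..., s_a."""
    x = np.asarray(x, float)
    k1, a = len(x), int(round(x.sum()))
    prefix = np.cumsum(x)
    s = [int(np.argmax(prefix > i - 1)) for i in range(1, a + 1)]
    s.append(k1 - 1)                                  # s_{a+1} = k
    z = np.zeros((k1, k1))
    for i in range(1, a + 1):
        si, sj = s[i - 1], s[i]
        z[si, si] = prefix[si] - (i - 1)              # second case of Eq. (89)
        z[si, si + 1:sj] = x[si + 1:sj]               # first case
        z[si, sj] = 1.0 - z[si, si:sj].sum()          # third case: close row to 1
    return z, s[:-1]

# Toy check of condition 1 of Lemma 6.12:
x = np.array([0.6, 0.5, 0.9])                         # sums to a = 2
z, s = structured_rounding(x, w=np.array([3.0, 2.0, 1.0]))
assert np.allclose(z.sum(axis=0), x)                  # column sums recover x
assert set(np.round(z.sum(axis=1), 12)) <= {0.0, 1.0} # row sums lie in {0, 1}
```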

The next lemma summarizes the quality of the solution produced by Algorithm 1.

Lemma 6.12.

Given reals x_{j}\in(0,1) for all j\in[0,k] such that \sum_{j\in[0,k]}x_{j}\in\mathbb{Z}_{+}, weights w_{j} for all j\in[0,k], and exponents m_{j}\in\mathbb{Z}_{+} for all j\in[0,k] (here m_{0} need not equal zero), Algorithm 1 efficiently computes a matrix z\in[0,1]_{\mathbb{R}}^{[0,k]\times[0,k]} such that the following conditions hold,

  1. 1.

    j[0,k]zi,j{0,1} for all i[0,k]\sum_{j\in[0,k]}z_{i,j}\in\{0,1\}\text{ for all }~{}i\in[0,k] and i[0,k]zi,j=xj for all j[0,k]\sum_{i\in[0,k]}z_{i,j}=x_{j}~{}\text{ for all }~{}j\in[0,k].

  2. 2.

    i[0,k](j[0,k]zi,j)wij[0,k]xjwj+maxj[0,k]wj\sum_{i\in[0,k]}\left(\sum_{j\in[0,k]}z_{i,j}\right)w_{i}\leq\sum_{j\in[0,k]}x_{j}w_{j}+\max_{j\in[0,k]}w_{j}.

  3. 3.

    j[0,k]wjmjxji[0,k]j[0,k]wimjzi,j\prod_{j\in[0,k]}w_{j}^{m_{j}x_{j}}\leq\prod_{i\in[0,k]}\prod_{j\in[0,k]}w_{i}^{m_{j}z_{i,j}}.

We next describe Algorithm 2. The algorithm takes input (\textbf{B},\textbf{C},\textbf{R},\phi) and creates a new probability discretization set \textbf{R}^{\prime} (lines 6-10). The solution \textbf{B}^{\prime} output by the algorithm belongs to \textbf{Z}^{\phi,frac}_{\textbf{R}^{\prime}}, has the same column sums as B, and its value \textbf{g}(\textbf{B}^{\prime}) is not much smaller than \textbf{g}(\textbf{B}) (see Lemma 6.13).

Algorithm 2 Create New Probability Values
1:procedure CreateNewProbabilityValues\mathrm{CreateNewProbabilityValues}(B,C,R,ϕ\textbf{B},\textbf{C},\textbf{R},\phi)
2:     Input: Probability discretization set R (|R|=t|\textbf{R}|=t), profile ϕ\phi (let kk be the number of distinct frequencies) and BZRϕ,frac[1,t]×[0,k]\textbf{B}\in\textbf{Z}^{\phi,frac}_{\textbf{R}}\subseteq\mathbb{R}^{[1,t]\times[0,k]} and C[1,t]×[0,k]\textbf{C}\in\mathbb{R}^{[1,t]\times[0,k]} such that Ci,jBi,j\textbf{C}_{i,j}\leq\textbf{B}_{i,j} for all i[1,t]i\in[1,t] and j[0,k]j\in[0,k]. Let ri\textbf{r}_{i} be the ii’th element of R.
3:     Output: Probability discretization set R\textbf{R}^{\prime} and B[1,t+(k+1)]×[0,k]\textbf{B}^{\prime}\in\mathbb{R}^{[1,t+(k+1)]\times[0,k]}.
4:     Initialize B=0[1,t+(k+1)]×[0,k]\textbf{B}^{\prime}=\textbf{0}^{[1,t+(k+1)]\times[0,k]}.
5:     Bij=Cij for all i[1,t],j[0,k].\textbf{B}^{\prime}_{ij}=\textbf{C}_{ij}\text{ for all }i\in[1,t],j\in[0,k]~{}.
6:     for j[0,k]j\in[0,k] do
7:         Create a new row with probability value rt+1+j=i[1,t](BijCij)rii[1,t](BijCij)\textbf{r}_{t+1+j}=\frac{\sum_{i\in[1,t]}(\textbf{B}_{ij}-\textbf{C}_{ij})\textbf{r}_{i}}{\sum_{i\in[1,t]}(\textbf{B}_{ij}-\textbf{C}_{ij})}.
8:         Assign Bt+1+j,j=i[1,t](BijCij)\textbf{B}^{\prime}_{t+1+j,j}=\sum_{i\in[1,t]}(\textbf{B}_{ij}-\textbf{C}_{ij}).
9:     end for
10:     Define R=defR{rt+1+j}j[0,k]\textbf{R}^{\prime}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\textbf{R}\cup\{\textbf{r}_{t+1+j}\}_{j\in[0,k]}.
11:     Return: R\textbf{R}^{\prime} and B\textbf{B}^{\prime}.
12:end procedure
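The per-column repair of lines 5-10 is short enough to transcribe directly; the following sketch is ours (the argument phi is kept only to mirror the pseudocode's signature).

```python
import numpy as np

def create_new_probability_values(B, C, r, phi=None):
    """Algorithm 2: keep C and, for each column j, park the residual mass
    B[:, j] - C[:, j] in one new row at its mass-weighted probability value."""
    B, C, r = np.asarray(B, float), np.asarray(C, float), np.asarray(r, float)
    t, k1 = B.shape
    Bp = np.zeros((t + k1, k1))
    Bp[:t] = C                                    # line 5
    r_new = list(r)
    for j in range(k1):
        resid = B[:, j] - C[:, j]                 # entrywise >= 0 by assumption
        mass = resid.sum()
        r_new.append(float(resid @ r) / mass if mass > 0 else 0.0)  # line 7
        Bp[t + j, j] = mass                       # line 8: diagonal structure
    return np.array(r_new), Bp                    # column sums of Bp equal those of B
```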

The next lemma summarizes the quality of the solution produced by Algorithm 2.

Lemma 6.13.

The solution (R,B)(\textbf{R}^{\prime},\textbf{B}^{\prime}) returned by Algorithm 2 satisfies the following conditions:

  1. 1.

    j[0,k]Bi,j=j[0,k]Ci,j\sum_{j\in[0,k]}\textbf{B}^{\prime}_{i,j}=\sum_{j\in[0,k]}\textbf{C}_{i,j} for all i[1,t]i\in[1,t].

  2. 2.

    For any i[t+1,t+(k+1)]i\in[t+1,t+(k+1)] let j[0,k]j\in[0,k] be such that i=t+1+ji=t+1+j then Bt+1+j,j=0\textbf{B}^{\prime}_{t+1+j,j^{\prime}}=0 for all j[0,k]j^{\prime}\in[0,k] and jjj^{\prime}\neq j. (Diagonal Structure)

  3. 3.

    For any i[t+1,t+(k+1)]i\in[t+1,t+(k+1)] let j[0,k]j\in[0,k] be such that i=t+1+ji=t+1+j, then j[0,k]Bi,j=Bt+1+j,j=ϕji[1,t]Ci,j\sum_{j^{\prime}\in[0,k]}\textbf{B}^{\prime}_{i,j^{\prime}}=\textbf{B}^{\prime}_{t+1+j,j}=\phi_{j}-\sum_{i^{\prime}\in[1,t]}\textbf{C}_{i^{\prime},j}.

  4. 4.

    BZRϕ,frac\textbf{B}^{\prime}\in\textbf{Z}^{\phi,frac}_{\textbf{R}^{\prime}} and i[1,t+(k+1)]j[0,k]Bi,j=i[1,t]j[0,k]Bi,j\sum_{i\in[1,t+(k+1)]}\sum_{j\in[0,k]}\textbf{B}^{\prime}_{i,j}=\sum_{i\in[1,t]}\sum_{j\in[0,k]}\textbf{B}_{i,j}.

  5. 5.

    Let αi=defj[0,k]Bi,jj[0,k]Ci,j\alpha_{i}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\sum_{j\in[0,k]}\textbf{B}_{i,j}-\sum_{j\in[0,k]}\textbf{C}_{i,j} for all i[1,t]i\in[1,t] and Δ=defmax(i[1,t](B1)i,t×k)\Delta\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\max(\sum_{i\in[1,t]}(\textbf{B}\overrightarrow{\mathrm{1}})_{i},t\times k), then g(B)exp(O(i[1,t]αilogΔ))g(B).\textbf{g}(\textbf{B}^{\prime})\geq\exp\left(-O\left(\sum_{i\in[1,t]}\alpha_{i}\log\Delta\right)\right)\textbf{g}(\textbf{B})~{}.

  6. 6.

    For each j[0,k]j\in[0,k], the new row corresponds to the probability value, rt+1+j=i[1,t](BijCij)rii[1,t](BijCij)\textbf{r}_{t+1+j}=\frac{\sum_{i\in[1,t]}(\textbf{B}_{ij}-\textbf{C}_{ij})\textbf{r}_{i}}{\sum_{i\in[1,t]}(\textbf{B}_{ij}-\textbf{C}_{ij})}.

In the remainder of this section, we state and analyze our rounding algorithm. The algorithm works in three steps, and we show that the solutions produced in the intermediate and final steps all have the desired approximation guarantee. We divide the analysis into three parts: Lemma 6.14, Lemma 6.15 and Theorem 6.16 analyze the guarantees of the intermediate solutions \textbf{S}^{(1)}, \textbf{S}^{(2)} and the final solution \textbf{S}^{\mathrm{ext}} respectively.

Algorithm 3 Rounding Algorithm
1:procedure Rounding(S)
2:     Input: Probability discretization set R, profile ϕΦn\phi\in\Phi^{n} and SZRϕ,frac[1,]×[0,k]\textbf{S}\in\textbf{Z}^{\phi,frac}_{\textbf{R}}\subseteq\mathbb{R}^{[1,\ell]\times[0,k]}.
3:     Output: Probability discretization set Rext\textbf{R}^{\mathrm{ext}} and Sext\textbf{S}^{\mathrm{ext}}.
4:     Step 1:
5:     Initialize A=0[1,]×[0,k]\textbf{A}=\textbf{0}^{[1,\ell]\times[0,k]}. Let ri\textbf{r}_{i} be the ii’th element of R.
6:     Define H=def{i[1,]|ri>γ}\textbf{H}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{i\in[1,\ell]~{}|~{}\textbf{r}_{i}>\gamma\} and L=def{i[1,]|riγ}\textbf{L}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{i\in[1,\ell]~{}|~{}\textbf{r}_{i}\leq\gamma\}.
7:     Aij=Sij for all iH,j[0,k].\textbf{A}_{ij}=\lfloor\textbf{S}_{ij}\rfloor\text{ for all }i\in\textbf{H},j\in[0,k]~{}.
8:     Aij=Si,jiLSi,jiLSi,j for all iL,j[0,k].\textbf{A}_{ij}=\textbf{S}_{i,j}\frac{\lfloor\sum_{i\in\textbf{L}}\textbf{S}_{i,j}\rfloor}{\sum_{i\in\textbf{L}}\textbf{S}_{i,j}}\text{ for all }i\in\textbf{L},j\in[0,k]~{}.
9:     (\textbf{R}^{(1)},\textbf{S}^{(1)})=\mathrm{CreateNewProbabilityValues}(\textbf{S},\textbf{A},\textbf{R},\phi).
10:     Step 2:
11:     Note |R(1)|=+(k+1)|\textbf{R}^{(1)}|=\ell+(k+1) and S(1)[1,+(k+1)]×[0,k]\textbf{S}^{(1)}\subseteq\mathbb{R}^{[1,\ell+(k+1)]\times[0,k]}. Let ri(1)\textbf{r}^{(1)}_{i} be the ii’th element of R(1)\textbf{R}^{(1)}.
12:     Let H(1)=def{i[1,+(k+1)]|ri(1)>γ}\textbf{H}^{(1)}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{i\in[1,\ell+(k+1)]~{}|~{}\textbf{r}^{(1)}_{i}>\gamma\} and L(1)=def{i[1,+(k+1)]|ri(1)γ}\textbf{L}^{(1)}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{i\in[1,\ell+(k+1)]~{}|~{}\textbf{r}^{(1)}_{i}\leq\gamma\}.
13:     Define A(1)=0[1,+(k+1)]×[0,k]\textbf{A}^{(1)}=\textbf{0}^{[1,\ell+(k+1)]\times[0,k]}.
14:     Aij(1)=Sij(1) for all iH(1),j[0,k].\textbf{A}^{(1)}_{ij}=\textbf{S}^{(1)}_{ij}\quad\text{ for all }i\in\textbf{H}^{(1)},j\in[0,k]~{}.
15:     Aij(1)=Sij(1)(S(1)1)i(S(1)1)i for all iL(1),j[0,k].\textbf{A}^{(1)}_{ij}=\textbf{S}^{(1)}_{ij}\frac{\lfloor(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}\rfloor}{(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}}\quad\text{ for all }i\in\textbf{L}^{(1)},j\in[0,k]~{}.
16:     (\textbf{R}^{(2)},\textbf{S}^{(2)})=\mathrm{CreateNewProbabilityValues}(\textbf{S}^{(1)},\textbf{A}^{(1)},\textbf{R}^{(1)},\phi).
17:     Step 3:
18:     Note |R(2)|=+2(k+1)|\textbf{R}^{(2)}|=\ell+2(k+1) and S(2)[1,+2(k+1)]×[0,k]\textbf{S}^{(2)}\subseteq\mathbb{R}^{[1,\ell+2(k+1)]\times[0,k]}. Let ri(2)\textbf{r}^{(2)}_{i} be the ii’th element of R(2)\textbf{R}^{(2)}.
19:     Let w,x\in\mathbb{R}^{[0,k]}, where w_{j}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\textbf{r}^{(2)}_{\ell+(k+1)+1+j} and x_{j}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}-\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor for all j\in[0,k]. Define a\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\sum_{j\in[0,k]}x_{j}.
20:     Let (z,s)=defStructuredRounding(x,w,a)(z,s)\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\mathrm{StructuredRounding}(x,w,a).
21:     Initialize Sext=0[1,+2(k+1)]×[0,k]\textbf{S}^{\mathrm{ext}}=0^{[1,\ell+2(k+1)]\times[0,k]}.
22:     Assign Si,jext=Si,j(2)\textbf{S}^{\mathrm{ext}}_{i,j}=\textbf{S}^{(2)}_{i,j} for all i[1,+(k+1)]i\in[1,\ell+(k+1)] and j[0,k]j\in[0,k].
23:     Assign S+(k+1)+1+j,jext=S+(k+1)+1+j,j(2)+zj,j\textbf{S}^{\mathrm{ext}}_{\ell+(k+1)+1+j,j^{\prime}}=\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j^{\prime}}\rfloor+z_{j,j^{\prime}} for all j,j[0,k]j,j^{\prime}\in[0,k].
24:     Define Rext=def{ri(2)1+γ|for all i[1,+2(k+1)]}\textbf{R}^{\mathrm{ext}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{\frac{\textbf{r}^{(2)}_{i}}{1+\gamma}~{}|~{}\text{for all }i\in[1,\ell+2(k+1)]\}.
25:     return Rext\textbf{R}^{\mathrm{ext}} and Sext\textbf{S}^{\mathrm{ext}}.
26:end procedure
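To illustrate how the pieces compose, here is our sketch of Step 1 alone (lines 5-9), reusing create_new_probability_values from the sketch above; steps 2 and 3 follow the same pattern, with Algorithm 1 closing out the low-probability rows.

```python
import numpy as np

def rounding_step1(S, r, gamma, phi=None):
    """Step 1 of Algorithm 3: floor the rows with r_i > gamma, scale each
    low-probability column block so its column sum is floored, then repair
    the column sums and probability values with Algorithm 2."""
    S, r = np.asarray(S, float), np.asarray(r, float)
    high = r > gamma                                   # the sets H and L (line 6)
    A = np.zeros_like(S)
    A[high] = np.floor(S[high])                        # line 7
    col = S[~high].sum(axis=0)
    safe = np.where(col > 0, col, 1.0)                 # avoid 0/0 on empty columns
    A[~high] = S[~high] * (np.floor(col) / safe)       # line 8
    return create_new_probability_values(S, A, r, phi) # line 9
```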

The next lemma summarizes the quality of the intermediate solution (S(1),R(1))(\textbf{S}^{(1)},\textbf{R}^{(1)}) produced by Step 1 of Algorithm 3.

Lemma 6.14.

The solution (S(1),R(1))(\textbf{S}^{(1)},\textbf{R}^{(1)}) returned by the step 1 of Algorithm 3 satisfies the following:

  1. 1.

    (S(1)1)i+(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}\in\mathbb{Z}_{+} for all iH(1)i\in\textbf{H}^{(1)}.

  2. 2.

    S(1)ZR(1)ϕ,frac\textbf{S}^{(1)}\in\textbf{Z}^{\phi,frac}_{\textbf{R}^{(1)}} and i[1,+(k+1)]j[0,k]Si,j(1)=i[1,]j[0,k]Si,j\sum_{i\in[1,\ell+(k+1)]}\sum_{j\in[0,k]}\textbf{S}^{(1)}_{i,j}=\sum_{i\in[1,\ell]}\sum_{j\in[0,k]}\textbf{S}_{i,j}.

  3. 3.

    g(S(1))exp(O((1γ+k)logΔ))g(S)\textbf{g}(\textbf{S}^{(1)})\geq\exp\left(-O\left(\left(\frac{1}{\gamma}+k\right)\log\Delta\right)\right)\textbf{g}(\textbf{S}), where Δ=max(i[1,](S1)i,×k)\Delta=\max(\sum_{i\in[1,\ell]}(\textbf{S}\overrightarrow{\mathrm{1}})_{i},\ell\times k).

Using Lemma 6.14 we now provide the guarantees for the solution S(2)\textbf{S}^{(2)} returned by the step 2 of Algorithm 3.

Lemma 6.15.

The solution (S(2),R(2))(\textbf{S}^{(2)},\textbf{R}^{(2)}) returned by the step 2 of Algorithm 3 satisfies the following,

  1. 1.

    (S(2)1)i+(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}\in\mathbb{Z}_{+} for all i[1,+(k+1)]i\in[1,\ell+(k+1)].

  2. 2.

    S+(k+1)+1+j,j(2)=0\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j^{\prime}}=0 for all j,j[0,k]j,j^{\prime}\in[0,k] and jjj\neq j^{\prime} (Diagonal Structure).

  3. 3.

    S(2)ZR(2)ϕ,frac\textbf{S}^{(2)}\in\textbf{Z}^{\phi,frac}_{\textbf{R}^{(2)}} and i[1,+2(k+1)]j[0,k]Si,j(2)=i[1,+(k+1)]j[0,k]Si,j(1)\sum_{i\in[1,\ell+2(k+1)]}\sum_{j\in[0,k]}\textbf{S}^{(2)}_{i,j}=\sum_{i\in[1,\ell+(k+1)]}\sum_{j\in[0,k]}\textbf{S}^{(1)}_{i,j}.

  4. 4.

    i[+(k+1)+1,+2(k+1)](S(2)1)i+\sum_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}\in\mathbb{Z}_{+}.

  5. 5.

    For any j[0,k]j\in[0,k], r+(k+1)+1+j(2)γ\textbf{r}^{(2)}_{\ell+(k+1)+1+j}\leq\gamma.

  6. 6.

    g(S(2))exp(O((1γ++k)logΔ))g(S)\textbf{g}(\textbf{S}^{(2)})\geq\exp\left(-O\left(\left(\frac{1}{\gamma}+\ell+k\right)\log\Delta\right)\right)\textbf{g}(\textbf{S}).

Using Lemma 6.15 we now provide the guarantees for the final solution Sext\textbf{S}^{\mathrm{ext}} returned by Algorithm 3.

Theorem 6.16.

The final solution returned (Sext,Rext)(\textbf{S}^{\mathrm{ext}},\textbf{R}^{\mathrm{ext}}) by Algorithm 3 satisfies the following,

  1. 1.

    SextZRextϕ\textbf{S}^{\mathrm{ext}}\in\textbf{Z}^{\phi}_{\textbf{R}^{\mathrm{ext}}}.

  2. 2.

    g(Sext)exp(O((1γ++k+γn)logΔ))g(S)\textbf{g}(\textbf{S}^{\mathrm{ext}})\geq\exp\left(-O\left(\left(\frac{1}{\gamma}+\ell+k+\gamma n\right)\log\Delta\right)\right)\textbf{g}(\textbf{S}).

6.3 Combining everything together

Here we combine the analysis from the previous two sections to provide an efficient algorithm that computes an \exp\left(-O(\sqrt{n}\log n)\right)-approximate PML distribution. The main contribution of this section is to define a probability discretization set R that guarantees the existence of a discrete pseudo-distribution q with respect to R that is an \exp\left(-O(\sqrt{n}\log n)\right)-approximate PML pseudo-distribution. We then combine this probability discretization set R with the results of the previous two sections to finally output an \exp\left(-O(\sqrt{n}\log n)\right)-approximate PML distribution. In this direction, we first define a set R with the desired guarantees; such a set was already constructed in [CSS19], and we formally state the results from [CSS19] that help us define it.

Lemma 6.17 (Lemma 4.1 in [CSS19]).

For any profile ϕΦn\phi\in\Phi^{n}, there exists a distribution q′′Δ𝒟\textbf{q}^{\prime\prime}\in\Delta^{\mathcal{D}} such that q′′\textbf{q}^{\prime\prime} is an exp(6)\exp\left(-6\right)-approximate PML distribution and minx𝒟:qx′′0qx′′12n2\min_{x\in\mathcal{D}:\textbf{q}^{\prime\prime}_{x}\neq 0}\textbf{q}^{\prime\prime}_{x}\geq\frac{1}{2n^{2}}.

The above lemma allows us to define a region containing all the probability values of our approximate PML, and we use an idea similar to [CSS19] to define this region.

Let \textbf{R}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{(1+\epsilon)^{1-i}\}_{i\in[\ell]} be a discretization of the probability space, where \ell=O(\frac{\log n}{\epsilon}) is the smallest integer such that \frac{1}{4n^{2}}\leq(1+\epsilon)^{1-\ell}\leq\frac{1}{2n^{2}}, for some \epsilon\in(0,1). Fix an arbitrary order on the elements of R; we use \textbf{r}_{i} to denote the i'th element of this set. We next state a result from [CSS19] that captures the effect of this discretization.

Lemma 6.18 (Lemma 4.4 in [CSS19]).

For any profile ϕΦn\phi\in\Phi^{n} and distribution pΔ𝒟\textbf{p}\in\Delta^{\mathcal{D}}, its discrete pseudo-distribution q=disc(p)ΔR𝒟\textbf{q}=\mathrm{disc}(\textbf{p})\in\Delta_{\textbf{R}}^{\mathcal{D}} satisfies:

(p,ϕ)(q,ϕ)exp(ϵn)(p,ϕ).\mathbb{P}(\textbf{p},\phi)\geq\mathbb{P}(\textbf{q},\phi)\geq\exp\left(-\epsilon n\right)\mathbb{P}(\textbf{p},\phi)~{}.
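For concreteness, the geometric grid R (with the parameters used in the final proof below) can be generated as in this sketch of ours:

```python
import numpy as np

def build_R(n):
    """R = {(1+eps)^{1-i}} with eps = log(n)/sqrt(n), truncated near 1/(4n^2)."""
    eps, R, v = np.log(n) / np.sqrt(n), [], 1.0
    while v > 1.0 / (4 * n ** 2):
        R.append(v)
        v /= 1.0 + eps
    return np.array(R)

R = build_R(10_000)
print(len(R))     # l = O(sqrt(n)) grid points; min(R) >= 1/(4 n^2) by construction
```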

We are now ready to state our final algorithm. Following this algorithm, we prove that it returns an approximate PML distribution.

Algorithm 4 Algorithm for approximate PML
1:procedure Approximate PML(ϕ,R\phi,\textbf{R})
2:     Input: Profile ϕΦn\phi\in\Phi^{n} and probability discretization set R.
3:     Output: A distribution papprox\textbf{p}_{\mathrm{approx}}.
4:     Solve S=argmaxAZRϕ,fraclogg(A)\textbf{S}=\operatorname*{arg\,max}_{\textbf{A}\in\textbf{Z}^{\phi,frac}_{\textbf{R}}}\log\textbf{g}(\textbf{A}).
5:     Use Algorithm 3 to round the fractional solution S to integral solution SextZRextϕ\textbf{S}^{\mathrm{ext}}\in\textbf{Z}^{\phi}_{\textbf{R}^{\mathrm{ext}}}.
6:     Construct discrete pseudo-distribution qSext\textbf{q}_{\textbf{S}^{\mathrm{ext}}} with respect to Sext\textbf{S}^{\mathrm{ext}} (See 6.6).
7:     return papprox=defqSextqSext1\textbf{p}_{\mathrm{approx}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\frac{\textbf{q}_{\textbf{S}^{\mathrm{ext}}}}{\|\textbf{q}_{\textbf{S}^{\mathrm{ext}}}\|_{1}}.
8:end procedure
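Putting the earlier sketches together, an end-to-end driver would look roughly as follows. Since solve_frac and rounding_step1 are the illustrative routines above (and only step 1 of the rounding is sketched), this is a skeleton of Algorithm 4, not a faithful implementation.

```python
import numpy as np

def approximate_pml(phi, m, n):
    """Skeleton of Algorithm 4 assembled from the earlier sketches."""
    r = build_R(n)                                   # probability discretization set
    S = solve_frac(r, m, phi)                        # line 4: maximize log g
    r_ext, S_ext = rounding_step1(S, r, gamma=1.0 / np.sqrt(n))  # line 5 (step 1 only)
    counts = np.rint(S_ext.sum(axis=1)).astype(int)  # integral after the full rounding
    q = np.repeat(r_ext, counts)                     # line 6: q_{S^ext} (Definition 6.6)
    return q / q.sum()                               # line 7: normalize to a distribution
```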

See 3.4

Proof.

Choose \epsilon=\frac{\log n}{\sqrt{n}} and let the probability discretization set be \textbf{R}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\{(1+\epsilon)^{1-i}\}_{i\in[\ell]}, where \ell\stackrel{{\scriptstyle\mathrm{def}}}{{=}}|\textbf{R}| is the smallest integer such that \frac{1}{4n^{2}}\leq(1+\epsilon)^{1-\ell}\leq\frac{1}{2n^{2}}; therefore \ell\in O(\sqrt{n}). Let \textbf{r}_{i} be the i'th element of the set R; note that \textbf{r}_{i}\geq\frac{1}{4n^{2}}.

Given profile ϕ\phi, let ppml\textbf{p}_{\mathrm{pml}} be the PML distribution. Define qpml=defppmlR\textbf{q}_{\mathrm{pml}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\lfloor\textbf{p}_{\mathrm{pml}}\rfloor_{\textbf{R}} and by Lemma 6.18 (and choice of ϵ\epsilon) we have,

(qpml,ϕ)exp(O(nlogn))(ppml,ϕ).\mathbb{P}(\textbf{q}_{\mathrm{pml}},\phi)\geq\exp\left(-O(\sqrt{n}\log n)\right)\mathbb{P}(\textbf{p}_{\mathrm{pml}},\phi)~{}. (90)

Let S=defargmaxAZRϕ,fracg(A)\textbf{S}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\operatorname*{arg\,max}_{\textbf{A}\in\textbf{Z}^{\phi,frac}_{\textbf{R}}}\textbf{g}(\textbf{A}), then by Lemma 6.9 we have,

\max_{\textbf{q}\in\Delta_{\textbf{R}}^{\mathcal{D}}}\mathbb{P}(\textbf{q},\phi)\leq\exp\left(O\left(k\log\frac{N}{k}\right)\right)\cdot C_{\phi}\cdot\textbf{g}(\textbf{S})~{}. (91)

Note qpmlΔR𝒟\textbf{q}_{\mathrm{pml}}\in\Delta_{\textbf{R}}^{\mathcal{D}}, therefore (qpml,ϕ)maxqΔR𝒟(q,ϕ)\mathbb{P}(\textbf{q}_{\mathrm{pml}},\phi)\leq\max_{\textbf{q}\in\Delta_{\textbf{R}}^{\mathcal{D}}}\mathbb{P}(\textbf{q},\phi) and further combined with equations 90 and 91 we have,

(ppml,ϕ)exp(O(klogNk+nlogn))Cϕg(S).\mathbb{P}(\textbf{p}_{\mathrm{pml}},\phi)\leq\exp\left(O\left(k\log\frac{N}{k}+\sqrt{n}\log n\right)\right)\cdot C_{\phi}\cdot\textbf{g}(\textbf{S})~{}. (92)

Let Sext\textbf{S}^{\mathrm{ext}} and Rext\textbf{R}^{\mathrm{ext}} be the solution returned by Algorithm 3, then by the second condition of 6.16 we have,

g(Sext)exp(O((1γ++k+γn)logΔ))g(S)\textbf{g}(\textbf{S}^{\mathrm{ext}})\geq\exp\left(-O\left(\left(\frac{1}{\gamma}+\ell+k+\gamma n\right)\log\Delta\right)\right)\textbf{g}(\textbf{S}) (93)

Combining equations 92 and 93 we have,

(ppml,ϕ)exp(O(klogNk+nlogn+(1γ++k+γn)logΔ))Cϕg(Sext).\mathbb{P}(\textbf{p}_{\mathrm{pml}},\phi)\leq\exp\left(O\left(k\log\frac{N}{k}+\sqrt{n}\log n+\left(\frac{1}{\gamma}+\ell+k+\gamma n\right)\log\Delta\right)\right)\cdot C_{\phi}\cdot\textbf{g}(\textbf{S}^{\mathrm{ext}})~{}. (94)

We now simplify the above expression by providing bounds and values for the parameters k,\ell,\gamma,N and \Delta. We choose \gamma=\frac{1}{\sqrt{n}} and recall \ell\in O(\sqrt{n}). Given n samples, the number of distinct frequencies is upper bounded by \sqrt{n} and therefore k\leq\sqrt{n}. By Lemma 6.17, up to a constant multiplicative loss we can assume that the minimum non-zero probability value of our approximate PML distribution is at least \frac{1}{4n^{2}}, and therefore the support size N\leq 4n^{2}. Recall that by the third condition of Lemma 6.14, we have \Delta=\max(\sum_{i\in[1,\ell]}(\textbf{S}\overrightarrow{\mathrm{1}})_{i},\ell\times k). The condition \textbf{S}\in\textbf{Z}^{\phi,frac}_{\textbf{R}} implies \sum_{i\in[1,\ell]}\textbf{r}_{i}(\textbf{S}\overrightarrow{\mathrm{1}})_{i}\leq 1, and further using \textbf{r}_{i}\geq\frac{1}{4n^{2}} for all i\in[1,\ell] we have \sum_{i\in[1,\ell]}(\textbf{S}\overrightarrow{\mathrm{1}})_{i}\leq 4n^{2}. Therefore \Delta\leq\max(4n^{2},\ell\times k)\in O(n^{2}).

Substituting these values in Equation 94 we get,

(ppml,ϕ)exp(O(nlogn))Cϕg(Sext).\mathbb{P}(\textbf{p}_{\mathrm{pml}},\phi)\leq\exp\left(O\left(\sqrt{n}\log n\right)\right)\cdot C_{\phi}\cdot\textbf{g}(\textbf{S}^{\mathrm{ext}})~{}. (95)

By the first condition of 6.16 we have SextZRextϕ\textbf{S}^{\mathrm{ext}}\in\textbf{Z}^{\phi}_{\textbf{R}^{\mathrm{ext}}}. Let qSext\textbf{q}_{\textbf{S}^{\mathrm{ext}}} be the discrete pseudo-distribution with respect to Sext\textbf{S}^{\mathrm{ext}}, then the condition SextZRextϕ\textbf{S}^{\mathrm{ext}}\in\textbf{Z}^{\phi}_{\textbf{R}^{\mathrm{ext}}} further implies SextZRextqSext,ϕ\textbf{S}^{\mathrm{ext}}\in\textbf{Z}^{\textbf{q}_{\textbf{S}^{\mathrm{ext}}},\phi}_{\textbf{R}^{\mathrm{ext}}} and combined with Theorem 6.7 we have,

exp(O(klog(N+n)))Cϕg(Sext)(qSext,ϕ)\exp\left(-O(k\log(N+n))\right)\cdot C_{\phi}\cdot\textbf{g}(\textbf{S}^{\mathrm{ext}})\leq\mathbb{P}(\textbf{q}_{\textbf{S}^{\mathrm{ext}}},\phi) (96)

Combining equations 95, 96, knk\leq\sqrt{n} and N4n2N\leq 4n^{2} we have,

(qSext,ϕ)exp(O(nlogn))(ppml,ϕ).\mathbb{P}(\textbf{q}_{\textbf{S}^{\mathrm{ext}}},\phi)\geq\exp\left(-O\left(\sqrt{n}\log n\right)\right)\mathbb{P}(\textbf{p}_{\mathrm{pml}},\phi)~{}. (97)

Define papprox=defqSextqSext1\textbf{p}_{\mathrm{approx}}\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\frac{\textbf{q}_{\textbf{S}^{\mathrm{ext}}}}{\|\textbf{q}_{\textbf{S}^{\mathrm{ext}}}\|_{1}}, then papprox\textbf{p}_{\mathrm{approx}} is a distribution, (papprox,ϕ)(qSext,ϕ)\mathbb{P}(\textbf{p}_{\mathrm{approx}},\phi)\geq\mathbb{P}(\textbf{q}_{\textbf{S}^{\mathrm{ext}}},\phi) (because qSext\textbf{q}_{\textbf{S}^{\mathrm{ext}}} is a pseudo-distribution and qSext11\|\textbf{q}_{\textbf{S}^{\mathrm{ext}}}\|_{1}\leq 1) and combined with Equation 97 we get,

(papprox,ϕ)exp(O(nlogn))(ppml,ϕ).\mathbb{P}(\textbf{p}_{\mathrm{approx}},\phi)\geq\exp\left(-O\left(\sqrt{n}\log n\right)\right)\mathbb{P}(\textbf{p}_{\mathrm{pml}},\phi)~{}. (98)

Therefore papprox\textbf{p}_{\mathrm{approx}} is an exp(O(nlogn))\exp\left(-O\left(\sqrt{n}\log n\right)\right)-approximate PML distribution.

In the remainder of the proof we bound the running time of our final algorithm for approximate PML. Step 4 of the algorithm, that is, the convex program \operatorname*{arg\,max}_{\textbf{A}\in\textbf{Z}^{\phi,frac}_{\textbf{R}}}\log\textbf{g}(\textbf{A}), can be solved in \widetilde{O}(k^{2}\ell) time (see Theorem 6.11). Algorithm 2 (\mathrm{CreateNewProbabilityValues}) and Algorithm 1 (\mathrm{StructuredRounding}) can be implemented in \widetilde{O}(k\ell) and \widetilde{O}(k^{2}) time respectively; therefore Algorithm 3 (the rounding algorithm) can be implemented in \widetilde{O}(k\ell+k^{2}) time. Combining everything, our final algorithm (Algorithm 4) can be implemented in \widetilde{O}(k^{2}\ell) time. Further using k,\ell\in O(\sqrt{n}), we conclude the proof. ∎

6.4 Missing Proofs from Section 6.2

Here we provide the proofs of all the lemmas and theorems stated in Section 6.2.

Proof of Lemma 6.12.

Without loss of generality assume $w_{0}\geq w_{1}\geq w_{2}\geq\dots\geq w_{k}$. Let $a\stackrel{\mathrm{def}}{=}\sum_{j\in[0,k]}x_{j}$; we invoke Algorithm 1 with input $(x,w,a)$. Let $s\in\mathbb{Z}_{+}^{a}$ and $z\in\mathbb{R}^{[0,k]\times[0,k]}$ be the output of Algorithm 1. We now prove the three conditions of the lemma.

Condition 1: By construction of Algorithm 1, for any s{si}i[1,a]s\in\{s_{i}\}_{i\in[1,a]} we have j[0,k]zs,j=1\sum_{j\in[0,k]}z_{s,j}=1 (Line 6) and for any other s[0,k]\{si}i[1,a]s\in[0,k]\backslash\{s_{i}\}_{i\in[1,a]} we have j[0,k]zs,j=0\sum_{j\in[0,k]}z_{s,j}=0. Therefore the first part of condition 1 holds.

For any j[0,k]j\in[0,k], one of the following two cases holds,

  1.

    If $j\in\{s_{i}\}_{i\in[1,a]}$: let $i\in[1,a]$ be such that $s_{i}=j$. By Line 6 (third case) of the algorithm we have,

    zsi1,j=1(jsi1xj(i2)+si1<j<sixj)=(i1)j<sixj.z_{s_{i-1},j}=1-\left(\sum_{j^{\prime}\leq s_{i-1}}x_{j^{\prime}}-(i-2)+\sum_{s_{i-1}<j^{\prime}<s_{i}}x_{j^{\prime}}\right)=(i-1)-\sum_{j^{\prime}<s_{i}}x_{j^{\prime}}~{}. (99)

    We now analyze the term i[0,k]zi,j\sum_{i^{\prime}\in[0,k]}z_{i^{\prime},j},

    i[0,k]zi,j=zsi,j+zsi1,j=jsixj(i1)+(i1)j<sixj=xsi=xj.\sum_{i^{\prime}\in[0,k]}z_{i^{\prime},j}=z_{s_{i},j}+z_{s_{i-1},j}=\sum_{j^{\prime}\leq s_{i}}x_{j^{\prime}}-(i-1)+(i-1)-\sum_{j^{\prime}<s_{i}}x_{j^{\prime}}=x_{s_{i}}=x_{j}~{}.

    The first equality follows because for $i^{\prime}\in[0,k]\backslash\{s_{i},s_{i-1}\}$ we have $z_{i^{\prime},j}=0$, which follows from the second and third cases in Line 6 of the algorithm. In the second equality we substituted the values of $z_{s_{i},s_{i}}$ and $z_{s_{i-1},s_{i}}$ using the second case (Line 6) and Equation 99 respectively.

  2.

    Else $j\in[0,k]\backslash\{s_{i}\}_{i\in[1,a]}$: let $i\in[1,a]$ be such that $s_{i}<j<s_{i+1}$. Then by the first case in Line 6 of the algorithm we have,

    i[0,k]zi,j=zsi,j=xj.\sum_{i^{\prime}\in[0,k]}z_{i^{\prime},j}=z_{s_{i},j}=x_{j}~{}.

Condition 2: Consider i[0,k](j[0,k]zi,j)wi\sum_{i\in[0,k]}\left(\sum_{j\in[0,k]}z_{i,j}\right)w_{i},

i[0,k](j[0,k]zi,j)wi=i[1,a](sijsi+1zsi,j)wsii[1,a]sijsi+1zsi,j(wj+wsiwsi+1)i[1,a]sijsi+1zsi,jwj+i[1,a]sijsi+1zsi,j(wsiwsi+1)=i[1,a]j[0,k]zsi,jwj+i[1,a]sijsi+1zsi,j(wsiwsi+1).\begin{split}\sum_{i\in[0,k]}\left(\sum_{j\in[0,k]}z_{i,j}\right)w_{i}&=\sum_{i\in[1,a]}\left(\sum_{s_{i}\leq j\leq s_{i+1}}z_{s_{i},j}\right)w_{s_{i}}\leq\sum_{i\in[1,a]}\sum_{s_{i}\leq j\leq s_{i+1}}z_{s_{i},j}(w_{j}+w_{s_{i}}-w_{s_{i+1}})\\ &\leq\sum_{i\in[1,a]}\sum_{s_{i}\leq j\leq s_{i+1}}z_{s_{i},j}w_{j}+\sum_{i\in[1,a]}\sum_{s_{i}\leq j\leq s_{i+1}}z_{s_{i},j}(w_{s_{i}}-w_{s_{i+1}})\\ &=\sum_{i\in[1,a]}\sum_{j\in[0,k]}z_{s_{i},j}w_{j}+\sum_{i\in[1,a]}\sum_{s_{i}\leq j\leq s_{i+1}}z_{s_{i},j}(w_{s_{i}}-w_{s_{i+1}})~{}.\\ \end{split} (100)

The first equality follows because the rest of the entries are zero. In the second inequality we used $j\leq s_{i+1}$, and therefore $w_{j}\geq w_{s_{i+1}}$ by the sortedness assumption at the beginning of the proof. In the remainder we simplify both terms. Consider the first term in the final expression above,

i[1,a]j[0,k]zsi,jwj=j[0,k]wji[1,a]zsi,j=j[0,k]wjxj.\sum_{i\in[1,a]}\sum_{j\in[0,k]}z_{s_{i},j}w_{j}=\sum_{j\in[0,k]}w_{j}\sum_{i\in[1,a]}z_{s_{i},j}=\sum_{j\in[0,k]}w_{j}x_{j}~{}. (101)

In the first equality we interchanged the summations. In the second equality we used i[1,a]zsi,j=i[0,k]zi,j\sum_{i\in[1,a]}z_{s_{i},j}=\sum_{i^{\prime}\in[0,k]}z_{i^{\prime},j} and further invoked condition 1 of the lemma. Now consider the second term in the final expression of Equation 100,

\begin{split}\sum_{i\in[1,a]}\sum_{s_{i}\leq j\leq s_{i+1}}z_{s_{i},j}(w_{s_{i}}-w_{s_{i+1}})&=\sum_{i\in[1,a]}(w_{s_{i}}-w_{s_{i+1}})\sum_{s_{i}\leq j\leq s_{i+1}}z_{s_{i},j}=\sum_{i\in[1,a]}(w_{s_{i}}-w_{s_{i+1}})\\ &=(w_{s_{1}}-w_{s_{a+1}})\leq\max_{j\in[0,k]}w_{j}~{}.\end{split} (102)

The second equality follows by Line 6 of the algorithm, and the third equality is a telescoping sum. Condition 2 follows by combining equations 100, 101 and 102.

Condition 3: First we show that zi,j>0z_{i,j}>0 implies iji\leq j. Consider j[0,k]j\in[0,k],

  1.

    If $j\in\{s_{i}\}_{i\in[1,a]}$: let $i\in[1,a]$ be such that $s_{i}=j$. Then by the second and third cases in Line 6 of the algorithm, $z_{i^{\prime},j}>0$ implies $i^{\prime}\in\{s_{i},s_{i-1}\}$. Further, using $s_{i-1}<s_{i}$ and $s_{i}=j$ we have $i^{\prime}\leq j$.

  2.

    Else $j\in[0,k]\backslash\{s_{i}\}_{i\in[1,a]}$: let $i\in[1,a]$ be such that $s_{i}<j<s_{i+1}$. Then by the first case in Line 6 of the algorithm, $z_{i^{\prime},j}>0$ implies $i^{\prime}=s_{i}$. Further, using $s_{i}<j$ we have $i^{\prime}<j$.

Using the above implication we have,

j[0,k]wjmjxj=j[0,k]wjmji[0,k]zi,j=i[0,k]j[0,k]wjmjzi,ji[0,k]j[0,k]wimjzi,j\begin{split}\prod_{j\in[0,k]}w_{j}^{m_{j}x_{j}}&=\prod_{j\in[0,k]}w_{j}^{m_{j}\sum_{i\in[0,k]}z_{i,j}}=\prod_{i\in[0,k]}\prod_{j\in[0,k]}w_{j}^{m_{j}z_{i,j}}\leq\prod_{i\in[0,k]}\prod_{j\in[0,k]}w_{i}^{m_{j}z_{i,j}}\end{split} (103)

In the first equality we used $x_{j}=\sum_{i\in[0,k]}z_{i,j}$ for all $j\in[0,k]$ (Condition 1). In the final inequality we used the fact that $z_{i,j}>0$ implies $i\leq j$, combined with the sortedness assumption at the beginning of the proof, namely that $w_{i}\geq w_{j}$ for all $i,j\in[0,k]$ with $i\leq j$. ∎
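To make the structure of such a rounding concrete, the following is a minimal greedy sketch in Python (assuming numpy; all names are illustrative, and this is a stand-in satisfying the guarantees of Lemma 6.12, not the literal code of Algorithm 1). It fills rows with one unit of mass each, scanning columns left to right and opening a new row at the column where the previous row overflows, so a row index never exceeds the column indices it touches; conditions 1 and 3 hold by construction, and condition 2 follows by the same telescoping argument as above.

```python
import numpy as np

def greedy_unit_rows(x, tol=1e-9):
    """Given fractional x with integer total a, build z >= 0 such that
    every used row sums to 1, column j sums to x[j], and z[i, j] > 0
    only when i <= j (the guarantees of Lemma 6.12)."""
    m = len(x)
    z = np.zeros((m, m))
    row, room = 0, 1.0               # currently open row and its remaining capacity
    for j in range(m):
        mass = x[j]
        while mass > tol:
            if room <= tol:          # previous row is full: open a row at column j
                row, room = j, 1.0
            take = min(mass, room)   # put as much of x[j] as fits into the open row
            z[row, j] += take
            mass -= take
            room -= take
    return z

# Example: x sums to 2, so exactly two rows carry mass.
z = greedy_unit_rows([0.6, 0.7, 0.4, 0.3])
assert np.allclose(z.sum(axis=1)[z.sum(axis=1) > 0], 1.0)   # unit rows
assert np.allclose(z.sum(axis=0), [0.6, 0.7, 0.4, 0.3])     # column sums preserved
```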

Proof of Lemma 6.13.

Define $\phi_{0}\stackrel{\mathrm{def}}{=}\sum_{i\in[1,t]}\textbf{B}_{i,0}$. In the following we prove each condition in turn.

Condition 1: For each $i\in[1,t]$ we have $\textbf{B}^{\prime}_{i,j}=\textbf{C}_{i,j}$ for all $j\in[0,k]$ (Line 6 of the algorithm), and the first condition holds.

Condition 2: Note that $\textbf{B}^{\prime}$ is initialized to the zero matrix (Line 4). Further, for any $i\in[t+1,t+(k+1)]$ let $j\in[0,k]$ be such that $i=t+1+j$; the algorithm then updates only the entry $\textbf{B}^{\prime}_{t+1+j,j}$ in the $i$'th row and keeps the rest of the entries unchanged. Therefore the second condition holds.

Condition 3: For each $i\in[t+1,t+(k+1)]$ let $j\in[0,k]$ be such that $i=t+1+j$; then $\sum_{j^{\prime}\in[0,k]}\textbf{B}^{\prime}_{i,j^{\prime}}=\textbf{B}^{\prime}_{t+1+j,j}=\sum_{i^{\prime}\in[1,t]}(\textbf{B}_{i^{\prime},j}-\textbf{C}_{i^{\prime},j})=\phi_{j}-\sum_{i^{\prime}\in[1,t]}\textbf{C}_{i^{\prime},j}$. The first equality holds because of Condition 2, the second follows from Line 8 of the algorithm, and the last holds because $\sum_{i\in[1,t]}\textbf{B}_{i,j}=\phi_{j}$ for all $j\in[0,k]$ (by $\textbf{B}\in\textbf{Z}^{\phi,frac}_{\textbf{R}}$ for $j\in[1,k]$, and by the definition of $\phi_{0}$ for $j=0$).

Condition 4: Here we provide the proof for BZRϕ,frac\textbf{B}^{\prime}\in\textbf{Z}^{\phi,frac}_{\textbf{R}^{\prime}}. For any j[0,k]j\in[0,k], we first show that i[1,t+(k+1)]Bi,j=ϕj\sum_{i\in[1,t+(k+1)]}\textbf{B}^{\prime}_{i,j}=\phi_{j}.

i[1,t+(k+1)]Bi,j\displaystyle\sum_{i\in[1,t+(k+1)]}\textbf{B}^{\prime}_{i,j} =i[1,t]Bi,j+i[t+1,t+(k+1)]Bi,j=i[1,t]Ci,j+Bt+1+j,j\displaystyle=\sum_{i\in[1,t]}\textbf{B}^{\prime}_{i,j}+\sum_{i\in[t+1,t+(k+1)]}\textbf{B}^{\prime}_{i,j}=\sum_{i\in[1,t]}\textbf{C}_{i,j}+\textbf{B}^{\prime}_{t+1+j,j}
=i[1,t]Ci,j+ϕji[1,t]Ci,j=ϕj\displaystyle=\sum_{i\in[1,t]}\textbf{C}_{i,j}+\phi_{j}-\sum_{i\in[1,t]}\textbf{C}_{i,j}=\phi_{j}

The second equality follows because $\textbf{B}^{\prime}_{i,j}=\textbf{C}_{i,j}$ for all $i\in[1,t]$ and $j\in[0,k]$ (Line 6) and $\sum_{i\in[t+1,t+(k+1)]}\textbf{B}^{\prime}_{i,j}=\textbf{B}^{\prime}_{t+1+j,j}$ (Condition 2). The third equality follows from Condition 3.

We next show that i[1,t+(k+1)]ri(j[0,k]Bi,j)1\sum_{i\in[1,t+(k+1)]}\textbf{r}_{i}\left(\sum_{j\in[0,k]}\textbf{B}^{\prime}_{i,j}\right)\leq 1.

i[1,t+(k+1)]ri(j[0,k]Bi,j)=i[1,t]ri(j[0,k]Bi,j)+j[0,k]rt+1+jBt+1+j,j=i[1,t]ri(j[0,k]Ci,j)+j[0,k]i[1,t](BijCij)rii[1,t](BijCij)(i[1,t](Bi,jCi,j))=i[1,t]ri(j[0,k]Ci,j)+j[0,k]i[1,t](BijCij)ri=i[1,t]ri(j[0,k]Bi,j)1\begin{split}\sum_{i\in[1,t+(k+1)]}\textbf{r}_{i}\left(\sum_{j\in[0,k]}\textbf{B}^{\prime}_{i,j}\right)&=\sum_{i\in[1,t]}\textbf{r}_{i}\left(\sum_{j\in[0,k]}\textbf{B}^{\prime}_{i,j}\right)+\sum_{j\in[0,k]}\textbf{r}_{t+1+j}\textbf{B}^{\prime}_{t+1+j,j}\\ &=\sum_{i\in[1,t]}\textbf{r}_{i}\left(\sum_{j\in[0,k]}\textbf{C}_{i,j}\right)+\sum_{j\in[0,k]}\frac{\sum_{i\in[1,t]}(\textbf{B}_{ij}-\textbf{C}_{ij})\textbf{r}_{i}}{\sum_{i\in[1,t]}(\textbf{B}_{ij}-\textbf{C}_{ij})}\left(\sum_{i\in[1,t]}(\textbf{B}_{i,j}-\textbf{C}_{i,j})\right)\\ &=\sum_{i\in[1,t]}\textbf{r}_{i}\left(\sum_{j\in[0,k]}\textbf{C}_{i,j}\right)+\sum_{j\in[0,k]}\sum_{i\in[1,t]}(\textbf{B}_{ij}-\textbf{C}_{ij})\textbf{r}_{i}\\ &=\sum_{i\in[1,t]}\textbf{r}_{i}\left(\sum_{j\in[0,k]}\textbf{B}_{i,j}\right)\leq 1\\ \end{split} (104)

In the first equality we split the summation into two parts and used Condition 3 for the second part. In the second equality we used Lines 7 and 8 of the algorithm. The third and fourth equalities simplify the expression. The final inequality uses $\textbf{B}\in\textbf{Z}^{\phi,frac}_{\textbf{R}}$.

Combining all the conditions together we have BZRϕ,frac\textbf{B}^{\prime}\in\textbf{Z}^{\phi,frac}_{\textbf{R}^{\prime}}. In the remainder we show that i[1,t+(k+1)]j[0,k]Bi,j=i[1,t]j[0,k]Bi,j\sum_{i\in[1,t+(k+1)]}\sum_{j\in[0,k]}\textbf{B}^{\prime}_{i,j}=\sum_{i\in[1,t]}\sum_{j\in[0,k]}\textbf{B}_{i,j}.

Recall that we already showed $\sum_{i\in[1,t+(k+1)]}\textbf{B}^{\prime}_{i,j}=\phi_{j}$ for all $j\in[0,k]$. Moreover, $\phi_{0}=\sum_{i\in[1,t]}\textbf{B}_{i,0}$ by definition, and $\textbf{B}\in\textbf{Z}^{\phi,frac}_{\textbf{R}}$ implies $\phi_{j}=\sum_{i\in[1,t]}\textbf{B}_{i,j}$ for all $j\in[1,k]$. Therefore we have,

i[1,t+(k+1)]j[0,k]Bi,j=i[1,t]j[0,k]Bi,j\sum_{i\in[1,t+(k+1)]}\sum_{j\in[0,k]}\textbf{B}^{\prime}_{i,j}=\sum_{i\in[1,t]}\sum_{j\in[0,k]}\textbf{B}_{i,j}

Condition 5: We first provide the explicit expressions for g(B)\textbf{g}(\textbf{B}^{\prime}) and g(B)\textbf{g}(\textbf{B}) below:

g(B)=(i[1,t]ri(Bm)iexp((B1)ilog(B1)i)j[0,k]exp(BijlogBij))(j[0,k]rt+1+jmjBt+1+j,j1)\textbf{g}(\textbf{B}^{\prime})=\left(\prod_{i\in[1,t]}\textbf{r}_{i}^{(\textbf{B}^{\prime}\overrightarrow{\mathrm{m}})_{i}}\frac{\exp\left((\textbf{B}^{\prime}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{B}^{\prime}\overrightarrow{\mathrm{1}})_{i}\right)}{\prod_{j\in[0,k]}\exp\left(\textbf{B}^{\prime}_{ij}\log\textbf{B}^{\prime}_{ij}\right)}\right)\left(\prod_{j\in[0,k]}\textbf{r}_{t+1+j}^{\overrightarrow{\mathrm{m}}_{j}\textbf{B}^{\prime}_{t+1+j,j}}\cdot 1\right)
g(B)=i[1,t](ri(Bm)iexp((B1)ilog(B1)i)j[0,k]exp(BijlogBij))\textbf{g}(\textbf{B})=\prod_{i\in[1,t]}\left(\textbf{r}_{i}^{(\textbf{B}\overrightarrow{\mathrm{m}})_{i}}\frac{\exp\left((\textbf{B}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{B}\overrightarrow{\mathrm{1}})_{i}\right)}{\prod_{j\in[0,k]}\exp\left(\textbf{B}_{ij}\log\textbf{B}_{ij}\right)}\right)

Note that in the expression for $\textbf{g}(\textbf{B}^{\prime})$ we used Condition 2. In the above two expressions for $\textbf{g}(\textbf{B}^{\prime})$ and $\textbf{g}(\textbf{B})$, we refer to the factor involving the $\textbf{r}_{i}$'s as the probability term and to the rest as the counting term. We start the analysis of Condition 5 by bounding the probability term:

i[1,t]ri(Bm)i=(i[1,t]ri(Bm)i)(i[1,t]rij[0,k]mj(BijBij))=(i[1,t]ri(Bm)i)(j[0,k]i[1,t]rimj(BijBij))=(i[1,t]ri(Bm)i)(j[0,k](i[1,t]ri(BijBij))mj)=(i[1,t]ri(Bm)i)(j[0,k](i[1,t]ri(BijCij))mj)(i[1,t]ri(Bm)i)(j[0,k](i[1,t]ri(BijCij)i[1,t](BijCij))mji[1,t](BijCij))(i[1,t]ri(Bm)i)(j[0,k]rt+1+jmjBt+1+j,j)\begin{split}\prod_{i\in[1,t]}&\textbf{r}_{i}^{(\textbf{B}\overrightarrow{\mathrm{m}})_{i}}=\left(\prod_{i\in[1,t]}\textbf{r}_{i}^{(\textbf{B}^{\prime}\overrightarrow{\mathrm{m}})_{i}}\right)\left(\prod_{i\in[1,t]}\textbf{r}_{i}^{\sum_{j\in[0,k]}\overrightarrow{\mathrm{m}}_{j}(\textbf{B}_{ij}-\textbf{B}^{\prime}_{ij})}\right)=\left(\prod_{i\in[1,t]}\textbf{r}_{i}^{(\textbf{B}^{\prime}\overrightarrow{\mathrm{m}})_{i}}\right)\left(\prod_{j\in[0,k]}\prod_{i\in[1,t]}\textbf{r}_{i}^{\overrightarrow{\mathrm{m}}_{j}(\textbf{B}_{ij}-\textbf{B}^{\prime}_{ij})}\right)\\ &=\left(\prod_{i\in[1,t]}\textbf{r}_{i}^{(\textbf{B}^{\prime}\overrightarrow{\mathrm{m}})_{i}}\right)\left(\prod_{j\in[0,k]}\left(\prod_{i\in[1,t]}\textbf{r}_{i}^{(\textbf{B}_{ij}-\textbf{B}^{\prime}_{ij})}\right)^{\overrightarrow{\mathrm{m}}_{j}}\right)=\left(\prod_{i\in[1,t]}\textbf{r}_{i}^{(\textbf{B}^{\prime}\overrightarrow{\mathrm{m}})_{i}}\right)\left(\prod_{j\in[0,k]}\left(\prod_{i\in[1,t]}\textbf{r}_{i}^{(\textbf{B}_{ij}-\textbf{C}_{ij})}\right)^{\overrightarrow{\mathrm{m}}_{j}}\right)\\ &\leq\left(\prod_{i\in[1,t]}\textbf{r}_{i}^{(\textbf{B}^{\prime}\overrightarrow{\mathrm{m}})_{i}}\right)\left(\prod_{j\in[0,k]}\left(\frac{\sum_{i\in[1,t]}\textbf{r}_{i}(\textbf{B}_{ij}-\textbf{C}_{ij})}{\sum_{i\in[1,t]}(\textbf{B}_{ij}-\textbf{C}_{ij})}\right)^{\overrightarrow{\mathrm{m}}_{j}\sum_{i\in[1,t]}(\textbf{B}_{ij}-\textbf{C}_{ij})}\right)\\ &\leq\left(\prod_{i\in[1,t]}\textbf{r}_{i}^{(\textbf{B}^{\prime}\overrightarrow{\mathrm{m}})_{i}}\right)\left(\prod_{j\in[0,k]}\textbf{r}_{t+1+j}^{\overrightarrow{\mathrm{m}}_{j}\textbf{B}^{\prime}_{t+1+j,j}}\right)\\ \end{split} (105)

The first three equalities simplify the expression, and the fourth equality follows because $\textbf{B}^{\prime}_{i,j}=\textbf{C}_{i,j}$ for all $i\in[1,t]$ and $j\in[0,k]$. The fifth inequality follows from the weighted AM–GM inequality (spelled out below), and the final step is an equality by Lines 7 and 8 of the algorithm. The final expression above is the probability term associated with $\textbf{B}^{\prime}$, so this derivation shows that our rounding procedure only increases the probability term; it remains to bound the counting term.
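For concreteness, the weighted AM–GM step used above is the following consequence of the concavity of $\log$: for non-negative $\beta_{i}$ with $\sum_{i\in[1,t]}\beta_{i}>0$,

\prod_{i\in[1,t]}\textbf{r}_{i}^{\beta_{i}}\leq\left(\frac{\sum_{i\in[1,t]}\beta_{i}\textbf{r}_{i}}{\sum_{i\in[1,t]}\beta_{i}}\right)^{\sum_{i\in[1,t]}\beta_{i}},

applied with $\beta_{i}=\textbf{B}_{ij}-\textbf{C}_{ij}$ for each fixed $j$; by Lines 7 and 8 of the algorithm the resulting base and exponent are exactly $\textbf{r}_{t+1+j}$ and $\textbf{B}^{\prime}_{t+1+j,j}$.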

g(B)g(B)i[1,t]exp((B1)ilog(B1)i(B1)ilog(B1)i)j[0,k]exp(BijlogBijBijlogBij)=i[1,t]exp((C1)ilog(C1)i(B1)ilog(B1)i)j[0,k]exp(CijlogCijBijlogBij).\begin{split}\frac{\textbf{g}(\textbf{B}^{\prime})}{\textbf{g}(\textbf{B})}&\geq\prod_{i\in[1,t]}\frac{\exp\left((\textbf{B}^{\prime}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{B}^{\prime}\overrightarrow{\mathrm{1}})_{i}-(\textbf{B}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{B}\overrightarrow{\mathrm{1}})_{i}\right)}{\prod_{j\in[0,k]}\exp\left(\textbf{B}^{\prime}_{ij}\log\textbf{B}^{\prime}_{ij}-\textbf{B}_{ij}\log\textbf{B}_{ij}\right)}\\ &=\prod_{i\in[1,t]}\frac{\exp\left((\textbf{C}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{C}\overrightarrow{\mathrm{1}})_{i}-(\textbf{B}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{B}\overrightarrow{\mathrm{1}})_{i}\right)}{\prod_{j\in[0,k]}\exp\left(\textbf{C}_{ij}\log\textbf{C}_{ij}-\textbf{B}_{ij}\log\textbf{B}_{ij}\right)}~{}.\end{split} (106)

Consider the numerator in the above expression. For each $i\in[1,t]$ let $s_{i}\stackrel{\mathrm{def}}{=}(\textbf{C}\overrightarrow{\mathrm{1}})_{i}$ and $\alpha_{i}\stackrel{\mathrm{def}}{=}(\textbf{B}\overrightarrow{\mathrm{1}})_{i}-(\textbf{C}\overrightarrow{\mathrm{1}})_{i}$; then

i[1,t]exp((C1)ilog(C1)i(B1)ilog(B1)i)=i[1,t]exp(silogsi(si+αi)log(si+αi))=i[1,t]exp(silogsisi+αiαilog(si+αi))i[1,t]exp(siαisiαilog(si+αi))exp(O(log(i[1,t]si)i[1,t]αi))exp(O(i[1,t]αilogΔ)).\begin{split}\prod_{i\in[1,t]}\exp\left((\textbf{C}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{C}\overrightarrow{\mathrm{1}})_{i}-(\textbf{B}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{B}\overrightarrow{\mathrm{1}})_{i}\right)&=\prod_{i\in[1,t]}\exp\left(s_{i}\log s_{i}-(s_{i}+\alpha_{i})\log(s_{i}+\alpha_{i})\right)\\ &=\prod_{i\in[1,t]}\exp\left(s_{i}\log\frac{s_{i}}{s_{i}+\alpha_{i}}-\alpha_{i}\log(s_{i}+\alpha_{i})\right)\\ &\geq\prod_{i\in[1,t]}\exp\left(s_{i}\frac{-\alpha_{i}}{s_{i}}-\alpha_{i}\log(s_{i}+\alpha_{i})\right)\\ &\geq\exp\left(-O\left(\log(\sum_{i\in[1,t]}s_{i})\sum_{i\in[1,t]}\alpha_{i}\right)\right)\\ &\geq\exp\left(-O\left(\sum_{i\in[1,t]}\alpha_{i}\log\Delta\right)\right)~{}.\end{split} (107)

In the third inequality we used $\log(1+x)\geq\frac{x}{1+x}$ for all $x>-1$. The final inequality follows because $\sum_{i\in[1,t]}s_{i}\leq\sum_{i\in[1,t]}(\textbf{B}\overrightarrow{\mathrm{1}})_{i}\leq\Delta$. Now consider the denominator in the above expression. Let $\alpha_{i,j}=\textbf{B}_{i,j}-\textbf{C}_{i,j}$ for all $i\in[1,t]$ and $j\in[0,k]$; then

i[1,t]j[0,k]exp(CijlogCijBijlogBij)=i[1,t]j[0,k]exp(CijlogCij(Cij+αi,j)log(Cij+αi,j))=i[1,t]j[0,k]exp(CijlogCijCij+αi,jαi,jlog(Cij+αi,j))i[1,t]j[0,k]exp(αi,jlog(Cij+αi,j))i[1,t]j[0,k]exp(αi,jlogαi,j)exp(O(log(t×k)i[1,t]αi))exp(O(i[1,t]αilogΔ)).\begin{split}\prod_{i\in[1,t]}\prod_{j\in[0,k]}\exp\left(\textbf{C}_{ij}\log\textbf{C}_{ij}-\textbf{B}_{ij}\log\textbf{B}_{ij}\right)&=\prod_{i\in[1,t]}\prod_{j\in[0,k]}\exp\left(\textbf{C}_{ij}\log\textbf{C}_{ij}-(\textbf{C}_{ij}+\alpha_{i,j})\log(\textbf{C}_{ij}+\alpha_{i,j})\right)\\ &=\prod_{i\in[1,t]}\prod_{j\in[0,k]}\exp\left(\textbf{C}_{ij}\log\frac{\textbf{C}_{ij}}{\textbf{C}_{ij}+\alpha_{i,j}}-\alpha_{i,j}\log(\textbf{C}_{ij}+\alpha_{i,j})\right)\\ &\leq\prod_{i\in[1,t]}\prod_{j\in[0,k]}\exp\left(-\alpha_{i,j}\log(\textbf{C}_{ij}+\alpha_{i,j})\right)\\ &\leq\prod_{i\in[1,t]}\prod_{j\in[0,k]}\exp\left(-\alpha_{i,j}\log\alpha_{i,j}\right)\leq\exp\left(O\big{(}\log(t\times k)\sum_{i\in[1,t]}\alpha_{i}\big{)}\right)\\ &\leq\exp\left(O\left(\sum_{i\in[1,t]}\alpha_{i}\log\Delta\right)\right)~{}.\end{split} (108)

In the third inequality we used αi,j0\alpha_{i,j}\geq 0 and therefore CijlogCijCij+αi,j0\textbf{C}_{ij}\log\frac{\textbf{C}_{ij}}{\textbf{C}_{ij}+\alpha_{i,j}}\leq 0. In the fourth inequality we used log(Cij+αi,j)logαi,j\log(\textbf{C}_{ij}+\alpha_{i,j})\geq\log\alpha_{i,j}. In the fifth inequality we used j[0,k]αi,j=αi\sum_{j\in[0,k]}\alpha_{i,j}=\alpha_{i} for all i[1,t]i\in[1,t] and further i[1,t]j[0,k]αi,jlogαi,j=i[1,t]αi(j[0,k]αi,jαilogαi,jαilogαi)log(k+1)i[1,t]αii[1,t]αilogαi\sum_{i\in[1,t]}\sum_{j\in[0,k]}-\alpha_{i,j}\log\alpha_{i,j}=\sum_{i\in[1,t]}\alpha_{i}\left(\sum_{j\in[0,k]}-\frac{\alpha_{i,j}}{\alpha_{i}}\log\frac{\alpha_{i,j}}{\alpha_{i}}-\log\alpha_{i}\right)\leq\log(k+1)\sum_{i\in[1,t]}\alpha_{i}-\sum_{i\in[1,t]}\alpha_{i}\log\alpha_{i}. Now consider the term i[1,t]αilogαi-\sum_{i\in[1,t]}\alpha_{i}\log\alpha_{i} and note that i[1,t]αilogαi=(i[1,t]αi)(i[1,t]αii[1,t]αilogαii[1,t]αilogi[1,t]αi)(1+logt)i[1,t]αi-\sum_{i\in[1,t]}\alpha_{i}\log\alpha_{i}=(\sum_{i\in[1,t]}\alpha_{i})\left(-\sum_{i\in[1,t]}\frac{\alpha_{i}}{\sum_{i\in[1,t]}\alpha_{i}}\log\frac{\alpha_{i}}{\sum_{i\in[1,t]}\alpha_{i}}-\log\sum_{i\in[1,t]}\alpha_{i}\right)\leq(1+\log t)\sum_{i\in[1,t]}\alpha_{i}. The fifth inequality in Equation 108 follows by combining the previous two derivations together. The final inequality follows because t×kΔt\times k\leq\Delta.

Condition 6: This condition follows immediately from Line 7 of the algorithm. ∎
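As an illustration of the operation analyzed above, the following is a minimal sketch in Python (assuming numpy; the function name and variable names are illustrative) of the row extension from Lines 7 and 8 as quoted in this proof: the first $t$ rows are replaced by $\textbf{C}$, and the residual column mass of $\textbf{B}-\textbf{C}$ is collected into $k+1$ new diagonal rows, each assigned the weighted average of the probability values it absorbed.

```python
import numpy as np

def extend_rows(B, C, r):
    """Sketch of the extension in Lemma 6.13: keep C on the first t rows and
    move the residual mass B - C of each column into one new diagonal row
    (Line 8), priced at the weighted-average probability value (Line 7).
    0-indexed: row t + j below corresponds to the paper's row t + 1 + j."""
    t, cols = B.shape                     # t rows, k + 1 columns
    Bp = np.zeros((t + cols, cols))
    Bp[:t] = C                            # Condition 1: B'[i] = C[i] on the first t rows
    r_new = np.zeros(cols)
    for j in range(cols):
        resid = B[:, j] - C[:, j]         # leftover mass in column j
        total = resid.sum()
        Bp[t + j, j] = total              # Line 8: a single diagonal entry per column
        if total > 0:
            r_new[j] = resid @ r / total  # Line 7: weighted average of the r_i
    return Bp, np.concatenate([r, r_new])
```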

Proof of Lemma 6.14.

In the following we provide the proof for the claims in the lemma.

Condition 1: Note $\textbf{H}^{(1)}\subseteq\textbf{H}\cup[\ell+1,\ell+(k+1)]$, where $[\ell+1,\ell+(k+1)]$ are the indices of the new rows created by the procedure $\mathrm{CreateNewProbabilityValues}$ (Algorithm 2). Consider any $i\in\textbf{H}^{(1)}$; then one of the following two cases holds:

  1.

    If iHi\in\textbf{H}, then by the first condition of Lemma 6.13 we have (S(1)1)i=(A1)i=j[0,k]Ai,j=j[0,k]Si,j+(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}=(\textbf{A}\overrightarrow{\mathrm{1}})_{i}=\sum_{j\in[0,k]}\textbf{A}_{i,j}=\sum_{j\in[0,k]}\lfloor\textbf{S}_{i,j}\rfloor\in\mathbb{Z}_{+}.

  2.

    Else $i\in[\ell+1,\ell+(k+1)]$, and in this case we have $\sum_{i^{\prime}\in[1,\ell]}\textbf{A}_{i^{\prime},j}=\sum_{i^{\prime}\in\textbf{H}}\textbf{A}_{i^{\prime},j}+\sum_{i^{\prime}\in\textbf{L}}\textbf{A}_{i^{\prime},j}=\sum_{i^{\prime}\in\textbf{H}}\lfloor\textbf{S}_{i^{\prime},j}\rfloor+\lfloor\sum_{i^{\prime}\in\textbf{L}}\textbf{S}_{i^{\prime},j}\rfloor\in\mathbb{Z}_{+}$, where the second equality follows from Lines 7 and 8 of the algorithm. Combining this derivation with the third condition of Lemma 6.13, we get $(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}=\phi_{j}-\sum_{i^{\prime}\in[1,\ell]}\textbf{A}_{i^{\prime},j}\in\mathbb{Z}_{+}$.

In both cases $(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}\in\mathbb{Z}_{+}$, and condition 1 follows.

Condition 2: This condition follows immediately from the fourth condition of Lemma 6.13.

Condition 3: Let αi=j[0,k]Si,jj[0,k]Ai,j\alpha_{i}=\sum_{j\in[0,k]}\textbf{S}_{i,j}-\sum_{j\in[0,k]}\textbf{A}_{i,j} for all i[1,]i\in[1,\ell]. First we upper bound the term iHαi\sum_{i\in\textbf{H}}\alpha_{i}. Consider iHαiiHj[0,k]Si,j1γ\sum_{i\in\textbf{H}}\alpha_{i}\leq\sum_{i\in\textbf{H}}\sum_{j\in[0,k]}\textbf{S}_{i,j}\leq\frac{1}{\gamma}. The last inequality follows because of the constraint i[1,]rij[0,k]Si,j1\sum_{i\in[1,\ell]}\textbf{r}_{i}\sum_{j\in[0,k]}\textbf{S}_{i,j}\leq 1 (SZRϕ,frac\textbf{S}\in\textbf{Z}^{\phi,frac}_{\textbf{R}}) and ri>γ\textbf{r}_{i}>\gamma for all iHi\in\textbf{H}.

We now upper bound the term iLαi\sum_{i\in\textbf{L}}\alpha_{i}. Consider iLαi=iL(j[0,k]Si,jj[0,k]Ai,j)=j[0,k](iLSi,jiLAi,j)\sum_{i\in\textbf{L}}\alpha_{i}=\sum_{i\in\textbf{L}}\left(\sum_{j\in[0,k]}\textbf{S}_{i,j}-\sum_{j\in[0,k]}\textbf{A}_{i,j}\right)=\sum_{j\in[0,k]}\left(\sum_{i\in\textbf{L}}\textbf{S}_{i,j}-\sum_{i\in\textbf{L}}\textbf{A}_{i,j}\right). Further iLAi,j=iLSi,j\sum_{i\in\textbf{L}}\textbf{A}_{i,j}=\lfloor\sum_{i\in\textbf{L}}\textbf{S}_{i,j}\rfloor for all j[0,k]j\in[0,k] (Line 8 of the algorithm) and we get iLαik+1\sum_{i\in\textbf{L}}\alpha_{i}\leq k+1.

Therefore $\sum_{i\in[1,\ell]}\alpha_{i}=\sum_{i\in\textbf{H}}\alpha_{i}+\sum_{i\in\textbf{L}}\alpha_{i}\leq\frac{1}{\gamma}+k+1$, and combined with the fifth condition of Lemma 6.13 we have,

\textbf{g}(\textbf{S}^{(1)})\geq\exp\left(-O\left(\left(\frac{1}{\gamma}+k\right)\log\Delta\right)\right)\textbf{g}(\textbf{S})~{}. ∎

Proof of Lemma 6.15.

In the following we prove each condition of the lemma.

Condition 1: For all $i\in[1,\ell+(k+1)]$, one of the following two cases holds:

  1.

    If $i\in\textbf{H}^{(1)}$, then by the first condition of Lemma 6.13 we have $(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}=(\textbf{A}^{(1)}\overrightarrow{\mathrm{1}})_{i}=(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}\in\mathbb{Z}_{+}$. The last containment follows from the first condition of Lemma 6.14.

  2.

    Else $i\in\textbf{L}^{(1)}$; then again by the first condition of Lemma 6.13 we have $(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}=(\textbf{A}^{(1)}\overrightarrow{\mathrm{1}})_{i}=\lfloor(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}\rfloor\in\mathbb{Z}_{+}$. The last equality follows from Line 15 of the algorithm.

For all i[1,+(k+1)]i\in[1,\ell+(k+1)], we have (S(2)1)i+(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}\in\mathbb{Z}_{+} and therefore condition 1 holds.

Condition 2: This condition follows immediately from the second condition of Lemma 6.13.

Condition 3: This condition follows immediately from the fourth condition of Lemma 6.13.

Condition 4: Consider the term i[+(k+1)+1,+2(k+1)](S(2)1)i\sum_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i},

i[+(k+1)+1,+2(k+1)](S(2)1)i=i[1,+2(k+1)](S(2)1)ii[1,+(k+1)](S(2)1)i=j[0,k]ϕji[1,+(k+1)](A(1)1)i=j[0,k]ϕj(iH(1)(A(1)1)i+iL(1)(A(1)1)i)=j[0,k]ϕj(iH(1)(S(1)1)i+iL(1)(S(1)1)i)+\begin{split}\sum_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}&=\sum_{i\in[1,\ell+2(k+1)]}(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}-\sum_{i\in[1,\ell+(k+1)]}(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}\\ &=\sum_{j\in[0,k]}\phi_{j}-\sum_{i\in[1,\ell+(k+1)]}(\textbf{A}^{(1)}\overrightarrow{\mathrm{1}})_{i}\\ &=\sum_{j\in[0,k]}\phi_{j}-\left(\sum_{i\in\textbf{H}^{(1)}}(\textbf{A}^{(1)}\overrightarrow{\mathrm{1}})_{i}+\sum_{i\in\textbf{L}^{(1)}}(\textbf{A}^{(1)}\overrightarrow{\mathrm{1}})_{i}\right)\\ &=\sum_{j\in[0,k]}\phi_{j}-\left(\sum_{i\in\textbf{H}^{(1)}}(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}+\sum_{i\in\textbf{L}^{(1)}}\lfloor(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}\rfloor\right)\in\mathbb{Z}_{+}\end{split} (109)

In the first equality we added and subtracted the term $\sum_{i\in[1,\ell+(k+1)]}(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}$. The first term in the second equality follows because $\sum_{i\in[1,\ell+2(k+1)]}(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}=\sum_{j\in[0,k]}\sum_{i\in[1,\ell+2(k+1)]}\textbf{S}^{(2)}_{i,j}=\sum_{j\in[0,k]}\phi_{j}$, where the last equality uses $\textbf{S}^{(2)}\in\textbf{Z}^{\phi,frac}_{\textbf{R}^{(2)}}$ (Condition 3); the second term follows by the first condition of Lemma 6.13. In the third equality we split the summation over $\textbf{H}^{(1)}$ and $\textbf{L}^{(1)}$. In the fourth equality we used Line 14 of the algorithm, and further, for any $i\in\textbf{L}^{(1)}$, Line 15 implies $(\textbf{A}^{(1)}\overrightarrow{\mathrm{1}})_{i}=\sum_{j\in[0,k]}\textbf{S}^{(1)}_{ij}\frac{\lfloor(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}\rfloor}{(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}}=\lfloor(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}\rfloor$. Finally, by the first condition of Lemma 6.14 we have $(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}\in\mathbb{Z}_{+}$ for all $i\in\textbf{H}^{(1)}$, and $\phi_{j}\in\mathbb{Z}_{+}$ for all $j\in[0,k]$. Therefore $\sum_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}\in\mathbb{Z}_{+}$ and condition 4 holds.

Condition 5: For any j[0,k]j\in[0,k] we have,

r+(k+1)+1+j(2)=i[1,+(k+1)](Sij(1)Aij(1))ri(1)i[1,+(k+1)](Sij(1)Aij(1))=iL(1)(Sij(1)Aij(1))ri(1)iL(1)(Sij(1)Aij(1))γiL(1)(Sij(1)Aij(1))iL(1)(Sij(1)Aij(1))γ.\begin{split}\textbf{r}^{(2)}_{\ell+(k+1)+1+j}&=\frac{\sum_{i\in[1,\ell+(k+1)]}(\textbf{S}^{(1)}_{ij}-\textbf{A}^{(1)}_{ij})\textbf{r}^{(1)}_{i}}{\sum_{i\in[1,\ell+(k+1)]}(\textbf{S}^{(1)}_{ij}-\textbf{A}^{(1)}_{ij})}=\frac{\sum_{i\in\textbf{L}^{(1)}}(\textbf{S}^{(1)}_{ij}-\textbf{A}^{(1)}_{ij})\textbf{r}^{(1)}_{i}}{\sum_{i\in\textbf{L}^{(1)}}(\textbf{S}^{(1)}_{ij}-\textbf{A}^{(1)}_{ij})}\\ &\leq\gamma\frac{\sum_{i\in\textbf{L}^{(1)}}(\textbf{S}^{(1)}_{ij}-\textbf{A}^{(1)}_{ij})}{\sum_{i\in\textbf{L}^{(1)}}(\textbf{S}^{(1)}_{ij}-\textbf{A}^{(1)}_{ij})}\leq\gamma.\end{split} (110)

The first equality follows from the sixth condition of Lemma 6.13. The second equality follows because Si,j(1)=Ai,j(1)\textbf{S}^{(1)}_{i,j}=\textbf{A}^{(1)}_{i,j} for all iH(1)i\in\textbf{H}^{(1)} and j[0,k]j\in[0,k] (Line 14). The third inequality follows because Si,j(1)Ai,j(1)\textbf{S}^{(1)}_{i,j}\geq\textbf{A}^{(1)}_{i,j} for all iL(1)i\in\textbf{L}^{(1)} and j[0,k]j\in[0,k] (Line 15) and further ri(1)γ\textbf{r}^{(1)}_{i}\leq\gamma for all iL(1)i\in\textbf{L}^{(1)} (Line 12).

Condition 6: For any $i\in[1,\ell+(k+1)]$, let $\alpha_{i}=\sum_{j\in[0,k]}\textbf{S}^{(1)}_{i,j}-\sum_{j\in[0,k]}\textbf{A}^{(1)}_{i,j}$. Note $\alpha_{i}=0$ for all $i\in\textbf{H}^{(1)}$ (Line 14) and $\alpha_{i}=(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}-\lfloor(\textbf{S}^{(1)}\overrightarrow{\mathrm{1}})_{i}\rfloor\leq 1$ for all $i\in\textbf{L}^{(1)}$ (Line 15). Therefore $\sum_{i\in[1,\ell+(k+1)]}\alpha_{i}\leq|\textbf{L}^{(1)}|\leq\ell+(k+1)$, and combined with the fifth condition of Lemma 6.13 we have $\textbf{g}(\textbf{S}^{(2)})\geq\exp\left(-O\left((\ell+k)\log\Delta\right)\right)\textbf{g}(\textbf{S}^{(1)})$. By the third condition of Lemma 6.14 we have $\textbf{g}(\textbf{S}^{(1)})\geq\exp\left(-O\left(\left(\frac{1}{\gamma}+k\right)\log\Delta\right)\right)\textbf{g}(\textbf{S})$. Combining the previous two inequalities we get $\textbf{g}(\textbf{S}^{(2)})\geq\exp\left(-O\left((\ell+k+\frac{1}{\gamma})\log\Delta\right)\right)\textbf{g}(\textbf{S})$, and condition 6 holds. ∎
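The row operation invoked repeatedly above (Line 15, as quoted in the proof of condition 4) is simple enough to state as code. The following is a minimal sketch in Python (assuming numpy; names are illustrative): each low-probability row is rescaled proportionally so that its sum becomes the floor of its current sum.

```python
import numpy as np

def floor_scale_rows(S, low_rows):
    """Proportionally rescale every row in low_rows so that its row sum
    becomes the floor of its current sum (the operation of Line 15):
    A[i, j] = S[i, j] * floor(sum_j S[i, j]) / sum_j S[i, j]."""
    A = S.copy()
    for i in low_rows:
        s = S[i].sum()
        if s > 0:
            A[i] = S[i] * (np.floor(s) / s)
    return A
```

Each such row loses exactly the fractional part of its sum, which is the bound $\alpha_{i}\leq 1$ used in condition 6.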

Proof of Theorem 6.16.

In the following we provide the proof of the two conditions of the theorem.

Condition 1: Here we provide the proof for the condition SextZRextϕ\textbf{S}^{\mathrm{ext}}\in\textbf{Z}^{\phi}_{\textbf{R}^{\mathrm{ext}}}.

  1.

    For all $i\in[1,\ell+2(k+1)]$, consider $(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}$. If $i\in[1,\ell+(k+1)]$, then $(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}=(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}\in\mathbb{Z}_{+}$; the equality follows by Line 22 of the algorithm and the containment by the first condition of Lemma 6.15. Else $i\in[\ell+(k+1)+1,\ell+2(k+1)]$; let $j$ be such that $i=\ell+(k+1)+1+j$, then $(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}=\sum_{j^{\prime}\in[0,k]}\textbf{S}^{\mathrm{ext}}_{\ell+(k+1)+1+j,j^{\prime}}=\sum_{j^{\prime}\in[0,k]}\left(\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j^{\prime}}\rfloor+z_{j,j^{\prime}}\right)=\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+\sum_{j^{\prime}\in[0,k]}z_{j,j^{\prime}}\in\mathbb{Z}_{+}$. The second equality follows by Line 23 of the algorithm and the third from the second condition of Lemma 6.15. Finally, by the first condition of Lemma 6.12 we have $\sum_{j^{\prime}\in[0,k]}z_{j,j^{\prime}}\in\{0,1\}$ for all $j\in[0,k]$, and therefore $(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}\in\mathbb{Z}_{+}$ for any $i\in[\ell+(k+1)+1,\ell+2(k+1)]$.

    Combining the two cases, $(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}\in\mathbb{Z}_{+}$ for all $i\in[1,\ell+2(k+1)]$, as required.

  2.

    For all j[0,k]j\in[0,k],

    i[1,+2(k+1)]Si,jext=i[1,+(k+1)]Si,jext+i[+(k+1)+1,+2(k+1)]Si,jext=i[1,+(k+1)]Si,j(2)+j[0,k](S+(k+1)+1+j,j(2)+zj,j).\begin{split}\sum_{i\in[1,\ell+2(k+1)]}\textbf{S}^{\mathrm{ext}}_{i,j}&=\sum_{i\in[1,\ell+(k+1)]}\textbf{S}^{\mathrm{ext}}_{i,j}+\sum_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}\textbf{S}^{\mathrm{ext}}_{i,j}\\ &=\sum_{i\in[1,\ell+(k+1)]}\textbf{S}^{(2)}_{i,j}+\sum_{j^{\prime}\in[0,k]}\left(\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j^{\prime},j}\rfloor+z_{j^{\prime},j}\right)~{}.\end{split} (111)

    The second equality follows because $\textbf{S}^{\mathrm{ext}}_{i,j}=\textbf{S}^{(2)}_{i,j}$ for all $i\in[1,\ell+(k+1)]$ (Line 22), and $\textbf{S}^{\mathrm{ext}}_{i,j}=\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j^{\prime},j}\rfloor+z_{j^{\prime},j}$ for all $i\in[\ell+(k+1)+1,\ell+2(k+1)]$, where $i=\ell+(k+1)+1+j^{\prime}$ (Line 23). We next simplify the second term in the above expression.

    j[0,k](S+(k+1)+1+j,j(2)+zj,j)=S+(k+1)+1+j,j(2)+j[0,k]zj,j=S+(k+1)+1+j,j(2)+xj=S+(k+1)+1+j,j(2)=i[+(k+1)+1,+2(k+1)]Si,j(2).\begin{split}\sum_{j^{\prime}\in[0,k]}\left(\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j^{\prime},j}\rfloor+z_{j^{\prime},j}\right)&=\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+\sum_{j^{\prime}\in[0,k]}z_{j^{\prime},j}=\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+x_{j}\\ &=\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}=\sum_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}\textbf{S}^{(2)}_{i,j}~{}.\end{split} (112)

    In the first and final equality we used the second condition of Lemma 6.15 (Diagonal Structure). In the second equality we used the first condition of Lemma 6.12. In the third equality we used the definition of xjx_{j} (Line 19). Combining equations 111 and 112 we get,

    i[1,+2(k+1)]Si,jext=i[1,+2(k+1)]Si,j(2)=ϕj\sum_{i\in[1,\ell+2(k+1)]}\textbf{S}^{\mathrm{ext}}_{i,j}=\sum_{i\in[1,\ell+2(k+1)]}\textbf{S}^{(2)}_{i,j}=\phi_{j}

    In the last equality we used $\textbf{S}^{(2)}\in\textbf{Z}^{\phi,frac}_{\textbf{R}^{(2)}}$.

  3.

    Let $\textbf{r}^{\mathrm{ext}}_{i}$, for $i\in[1,\ell+2(k+1)]$, denote the $i$'th element of $\textbf{R}^{\mathrm{ext}}$. For the term $\sum_{i\in[1,\ell+2(k+1)]}\textbf{r}^{\mathrm{ext}}_{i}(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}$ we have,

    \begin{split}\sum_{i\in[1,\ell+2(k+1)]}&\textbf{r}^{\mathrm{ext}}_{i}(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}=\sum_{i\in[1,\ell+2(k+1)]}\frac{\textbf{r}^{(2)}_{i}}{1+\gamma}(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}\\ &=\frac{1}{1+\gamma}\sum_{i\in[1,\ell+(k+1)]}\textbf{r}^{(2)}_{i}(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}+\frac{1}{1+\gamma}\sum_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}\textbf{r}^{(2)}_{i}(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}.\end{split} (113)

    The first equality follows from Line 24 of the algorithm. In the second equality we split the summation into two parts and, for the first part, used $\textbf{S}^{\mathrm{ext}}_{i,j}=\textbf{S}^{(2)}_{i,j}$ for all $i\in[1,\ell+(k+1)]$ and $j\in[0,k]$ (Line 22). We now simplify the second part of the above expression.

    i[+(k+1)+1,+2(k+1)]ri(2)(Sext1)i=j[0,k]r+(k+1)+1+j(2)j[0,k](S+(k+1)+1+j,j(2)+zj,j)=j[0,k]wj(S+(k+1)+1+j,j(2)xj)+j[0,k]wjj[0,k]zj,jj[0,k]wj(S+(k+1)+1+j,j(2)xj)+j[0,k]wjxj+maxj[0,k]wj=i[+(k+1)+1,+2(k+1)]ri(2)(S(2)1)i+γ.\begin{split}\sum_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}\textbf{r}^{(2)}_{i}(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}&=\sum_{j\in[0,k]}\textbf{r}^{(2)}_{\ell+(k+1)+1+j}\sum_{j^{\prime}\in[0,k]}\left(\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j^{\prime}}\rfloor+z_{j,j^{\prime}}\right)\\ &=\sum_{j\in[0,k]}w_{j}\left(\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}-x_{j}\right)+\sum_{j\in[0,k]}w_{j}\sum_{j^{\prime}\in[0,k]}z_{j,j^{\prime}}\\ &\leq\sum_{j\in[0,k]}w_{j}\left(\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}-x_{j}\right)+\sum_{j\in[0,k]}w_{j}x_{j}+\max_{j\in[0,k]}w_{j}\\ &=\sum_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}\textbf{r}^{(2)}_{i}(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}+\gamma~{}.\end{split} (114)

    In the first equality we expanded the $(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}$ term and used $\textbf{S}^{\mathrm{ext}}_{\ell+(k+1)+1+j,j^{\prime}}=\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j^{\prime}}\rfloor+z_{j,j^{\prime}}$ for all $j,j^{\prime}\in[0,k]$ (Line 23). In the second equality we used the second condition of Lemma 6.15 (Diagonal Structure) combined with the definitions of $w_{j}$ and $x_{j}$ from Line 19 of the algorithm. The third inequality follows from the second condition of Lemma 6.12. In the final step we used $\max_{j\in[0,k]}w_{j}\leq\gamma$, which follows from the definition of $w_{j}$ and the fifth condition of Lemma 6.15, combined with $\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}=(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{\ell+(k+1)+1+j}$, which follows from the second condition of Lemma 6.15.

    Combining equations 113 and 114 we have,

    i[1,+2(k+1)]riext(Sext1)i11+γ(i[1,+2(k+1)]ri(2)(S(2)1)i+γ)1.\sum_{i\in[1,\ell+2(k+1)]}\textbf{r}^{\mathrm{ext}}_{i}(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}\leq\frac{1}{1+\gamma}\left(\sum_{i\in[1,\ell+2(k+1)]}\textbf{r}^{(2)}_{i}(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}+\gamma\right)\leq 1~{}.

    In the final inequality we used S(2)ZR(2)ϕ,frac\textbf{S}^{(2)}\in\textbf{Z}^{\phi,frac}_{\textbf{R}^{(2)}} and therefore i[1,+2(k+1)]ri(2)(S(2)1)i1\sum_{i\in[1,\ell+2(k+1)]}\textbf{r}^{(2)}_{i}(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}\leq 1.

Condition 1 of the theorem holds by combining the three cases above.

Condition 2: Recall the definition of g(Sext)\textbf{g}(\textbf{S}^{\mathrm{ext}}),

g(Sext)=i[1,+2(k+1)](riext(Sextm)iexp((Sext1)ilog(Sext1)i)j[0,k]exp(SijextlogSijext))\displaystyle\textbf{g}(\textbf{S}^{\mathrm{ext}})=\prod_{i\in[1,\ell+2(k+1)]}\left({\textbf{r}^{\mathrm{ext}}_{i}}^{(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{m}})_{i}}\frac{\exp\left((\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}\right)}{\prod_{j\in[0,k]}\exp\left(\textbf{S}^{\mathrm{ext}}_{ij}\log\textbf{S}^{\mathrm{ext}}_{ij}\right)}\right)

In the above expression consider the probability term,

i[1,+2(k+1)]riext(Sextm)i=i[1,+2(k+1)](ri(2)1+γ)(Sextm)iexp(O(γn))(i[1,+(k+1)]ri(2)(Sextm)i)(i[+(k+1)+1,+2(k+1)]ri(2)(Sextm)i)=exp(O(γn))(i[1,+(k+1)]ri(2)(S(2)m)i)(i[+(k+1)+1,+2(k+1)]ri(2)(Sextm)i).\begin{split}\prod_{i\in[1,\ell+2(k+1)]}&{\textbf{r}^{\mathrm{ext}}_{i}}^{(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{m}})_{i}}=\prod_{i\in[1,\ell+2(k+1)]}\left(\frac{\textbf{r}^{(2)}_{i}}{1+\gamma}\right)^{(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{m}})_{i}}\\ &\geq\exp\left(-O(\gamma n)\right)\left(\prod_{i\in[1,\ell+(k+1)]}{\textbf{r}^{(2)}_{i}}^{(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{m}})_{i}}\right)\left(\prod_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}{\textbf{r}^{(2)}_{i}}^{(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{m}})_{i}}\right)\\ &=\exp\left(-O(\gamma n)\right)\left(\prod_{i\in[1,\ell+(k+1)]}{\textbf{r}^{(2)}_{i}}^{(\textbf{S}^{(2)}\overrightarrow{\mathrm{m}})_{i}}\right)\left(\prod_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}{\textbf{r}^{(2)}_{i}}^{(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{m}})_{i}}\right)~{}.\end{split} (115)

In the first equality we used Line 24 of the algorithm. In the second inequality we used $\sum_{i\in[1,\ell+2(k+1)]}(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{m}})_{i}=n$, which implies $\left(1+\gamma\right)^{-\sum_{i\in[1,\ell+2(k+1)]}(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{m}})_{i}}\geq\exp\left(-O(\gamma n)\right)$. In the third equality we used $\textbf{S}^{\mathrm{ext}}_{i,j}=\textbf{S}^{(2)}_{i,j}$ for all $i\in[1,\ell+(k+1)]$ and $j\in[0,k]$ (Line 22). We now analyze the second product term in the final expression above,

i[+(k+1)+1,+2(k+1)]ri(2)(Sextm)i=j[0,k]r+(k+1)+1+j(2)j[0,k]S+(k+1)+1+j,jextmj=j[0,k]r+(k+1)+1+j(2)j[0,k](S+(k+1)+1+j,j(2)+zj,j)mj=(j[0,k]r+(k+1)+1+j(2)S+(k+1)+1+j,j(2))(j[0,k]r+(k+1)+1+j(2)j[0,k]zj,jmj).\begin{split}\prod_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}&{\textbf{r}^{(2)}_{i}}^{(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{m}})_{i}}=\prod_{j\in[0,k]}{\textbf{r}^{(2)}_{\ell+(k+1)+1+j}}^{\sum_{j^{\prime}\in[0,k]}\textbf{S}^{\mathrm{ext}}_{\ell+(k+1)+1+j,j^{\prime}}\overrightarrow{\mathrm{m}}_{j^{\prime}}}\\ &=\prod_{j\in[0,k]}{\textbf{r}^{(2)}_{\ell+(k+1)+1+j}}^{\sum_{j^{\prime}\in[0,k]}\left(\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j^{\prime}}\rfloor+z_{j,j^{\prime}}\right)\overrightarrow{\mathrm{m}}_{j^{\prime}}}\\ &=\left(\prod_{j\in[0,k]}{\textbf{r}^{(2)}_{\ell+(k+1)+1+j}}^{\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor}\right)\left(\prod_{j\in[0,k]}{\textbf{r}^{(2)}_{\ell+(k+1)+1+j}}^{\sum_{j^{\prime}\in[0,k]}z_{j,j^{\prime}}\overrightarrow{\mathrm{m}}_{j^{\prime}}}\right).\end{split} (116)

The second equality follows from line 23 of the algorithm. The third equality follows from the second condition of Lemma 6.15 (Diagonal Structure).

Now consider the second product term in the above expression.

j[0,k]r+(k+1)+1+j(2)j[0,k]zj,jmj=j[0,k]wjj[0,k]zj,jmjj[0,k]wjxjmj.\begin{split}\prod_{j\in[0,k]}{\textbf{r}^{(2)}_{\ell+(k+1)+1+j}}^{\sum_{j^{\prime}\in[0,k]}z_{j,j^{\prime}}\overrightarrow{\mathrm{m}}_{j^{\prime}}}&=\prod_{j\in[0,k]}w_{j}^{\sum_{j^{\prime}\in[0,k]}z_{j,j^{\prime}}\overrightarrow{\mathrm{m}}_{j^{\prime}}}\geq\prod_{j\in[0,k]}w_{j}^{x_{j}\overrightarrow{\mathrm{m}}_{j}}~{}.\end{split} (117)

In the first equality we used the definition of wjw_{j} (Line 19). The second inequality follows from the third condition of Lemma 6.12.

Combining equations 116 and 117, and further using $x_{j}=\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}-\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor$ for all $j\in[0,k]$ (Line 19), we have,

i[+(k+1)+1,+2(k+1)]ri(2)(Sextm)ij[0,k]r+(k+1)+1+j(2)S+(k+1)+1+j,j(2)mj=i[+(k+1)+1,+2(k+1)]ri(2)(S(2)m)i.\prod_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}{\textbf{r}^{(2)}_{i}}^{(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{m}})_{i}}\geq\prod_{j\in[0,k]}{\textbf{r}^{(2)}_{\ell+(k+1)+1+j}}^{\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\overrightarrow{\mathrm{m}}_{j}}=\prod_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}{\textbf{r}^{(2)}_{i}}^{(\textbf{S}^{(2)}\overrightarrow{\mathrm{m}})_{i}}~{}. (118)

In the final equality we used the second condition of Lemma 6.15 (Diagonal Structure).

Combining equations 115 and 118 we have,

i[1,+2(k+1)]riext(Sextm)iexp(O(γn))i[1,+2(k+1)]ri(2)(S(2)m)i\prod_{i\in[1,\ell+2(k+1)]}{\textbf{r}^{\mathrm{ext}}_{i}}^{(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{m}})_{i}}\geq\exp\left(-O(\gamma n)\right)\prod_{i\in[1,\ell+2(k+1)]}{\textbf{r}^{(2)}_{i}}^{(\textbf{S}^{(2)}\overrightarrow{\mathrm{m}})_{i}}

Using the above expression we have,

g(Sext)g(S(2))exp(O(γn))i[1,+2(k+1)](exp((Sext1)ilog(Sext1)i(S(2)1)ilog(S(2)1)i)j[0,k]exp(Si,jextlogSi,jextSi,j(2)logSi,j(2)))=exp(O(γn))i[+(k+1)+1,+2(k+1)](exp((Sext1)ilog(Sext1)i(S(2)1)ilog(S(2)1)i)j[0,k]exp(Si,jextlogSi,jextSi,j(2)logSi,j(2)))=exp(O(γn))i[+(k+1)+1,+2(k+1)]exp((Sext1)ilog(Sext1)ij[0,k]Si,jextlogSi,jext).\begin{split}\frac{\textbf{g}(\textbf{S}^{\mathrm{ext}})}{\textbf{g}(\textbf{S}^{(2)})}&\geq\exp\left(-O(\gamma n)\right)\prod_{i\in[1,\ell+2(k+1)]}\left(\frac{\exp\left((\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}-(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}\right)}{\prod_{j^{\prime}\in[0,k]}\exp\left(\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}\log\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}-\textbf{S}^{(2)}_{i,j^{\prime}}\log\textbf{S}^{(2)}_{i,j^{\prime}}\right)}\right)\\ &=\exp\left(-O(\gamma n)\right)\prod_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}\left(\frac{\exp\left((\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}-(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{S}^{(2)}\overrightarrow{\mathrm{1}})_{i}\right)}{\prod_{j^{\prime}\in[0,k]}\exp\left(\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}\log\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}-\textbf{S}^{(2)}_{i,j^{\prime}}\log\textbf{S}^{(2)}_{i,j^{\prime}}\right)}\right)\\ &=\exp\left(-O(\gamma n)\right)\prod_{i\in[\ell+(k+1)+1,\ell+2(k+1)]}\exp\left((\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}-\sum_{j^{\prime}\in[0,k]}\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}\log\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}\right)~{}.\end{split} (119)

In the second equality we used $\textbf{S}^{\mathrm{ext}}_{i,j}=\textbf{S}^{(2)}_{i,j}$ for all $i\in[1,\ell+(k+1)]$ and $j\in[0,k]$ (Line 22). The third equality follows by the second condition of Lemma 6.15 (Diagonal Structure): each row $i\in[\ell+(k+1)+1,\ell+2(k+1)]$ of $\textbf{S}^{(2)}$ has a single non-zero entry, so its $\textbf{S}^{(2)}$ terms cancel. In the remainder of the proof we lower bound the term in the final expression.

For each $i\in[\ell+(k+1)+1,\ell+2(k+1)]$ let $j\in[0,k]$ be such that $i=\ell+(k+1)+1+j$; then $(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}=\sum_{j^{\prime}\in[0,k]}(\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j^{\prime}}\rfloor+z_{j,j^{\prime}})=\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+\sum_{j^{\prime}\in[0,k]}z_{j,j^{\prime}}$. The first equality follows from Line 23 of the algorithm, and the second by the second condition of Lemma 6.15 (Diagonal Structure). By the first condition of Lemma 6.12, one of the following two cases holds:

  1.

    If j[0,k]zj,j=0\sum_{j^{\prime}\in[0,k]}z_{j,j^{\prime}}=0, then zj,j=0z_{j,j^{\prime}}=0 for all j[0,k]j^{\prime}\in[0,k]. Using second condition of Lemma 6.15 (Diagonal Structure), we have S+(k+1)+1+j,jext=S+(k+1)+1+j,j(2)+zj,j=0\textbf{S}^{\mathrm{ext}}_{\ell+(k+1)+1+j,j^{\prime}}=\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j^{\prime}}\rfloor+z_{j,j^{\prime}}=0 for all j[0,k]j^{\prime}\in[0,k] and jjj^{\prime}\neq j. Further note, (Sext1)i=S+(k+1)+1+j,j(2)+j[0,k]zj,j=S+(k+1)+1+j,jext(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}=\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+\sum_{j^{\prime}\in[0,k]}z_{j,j^{\prime}}=\textbf{S}^{\mathrm{ext}}_{\ell+(k+1)+1+j,j}. Combining previous two equalities we have, (Sext1)ilog(Sext1)ij[0,k]Si,jextlogSi,jext=0(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}-\sum_{j^{\prime}\in[0,k]}\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}\log\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}=0. Therefore,

    exp((Sext1)ilog(Sext1)ij[0,k]Si,jextlogSi,jext)1.\exp\left((\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}-\sum_{j^{\prime}\in[0,k]}\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}\log\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}\right)\geq 1~{}. (120)
  2.

    If j[0,k]zj,j=1\sum_{j^{\prime}\in[0,k]}z_{j,j^{\prime}}=1, then zj,j[0,1]z_{j,j^{\prime}}\in[0,1]_{\mathbb{R}} for all j[0,k]j^{\prime}\in[0,k]. Using second condition of Lemma 6.15 (Diagonal Structure), we have Si,jext=S+(k+1)+1+j,jext=S+(k+1)+1+j,j(2)+zj,j=zj,j\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}=\textbf{S}^{\mathrm{ext}}_{\ell+(k+1)+1+j,j^{\prime}}=\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j^{\prime}}\rfloor+z_{j,j^{\prime}}=z_{j,j^{\prime}} for all j[0,k]j^{\prime}\in[0,k] and jjj^{\prime}\neq j. Therefore, j[0,k]Si,jextlogSi,jext=(S+(k+1)+1+j,j(2)+zj,j)log(S+(k+1)+1+j,j(2)+zj,j)+jjzj,jlogzj,j(S+(k+1)+1+j,j(2)+zj,j)log(S+(k+1)+1+j,j(2)+zj,j)\sum_{j^{\prime}\in[0,k]}\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}\log\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}=(\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+z_{j,j})\log(\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+z_{j,j})+\sum_{j^{\prime}\neq j}z_{j,j^{\prime}}\log z_{j,j^{\prime}}\leq(\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+z_{j,j})\log(\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+z_{j,j}). The final inequality follows because zj,j[0,1]z_{j,j^{\prime}}\in[0,1]_{\mathbb{R}} and zj,jlogzj,j0z_{j,j^{\prime}}\log z_{j,j^{\prime}}\leq 0 for all j[0,k]j^{\prime}\in[0,k].

    Further note that $(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}=\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+\sum_{j^{\prime}\in[0,k]}z_{j,j^{\prime}}=\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+1$. Combining the previous two inequalities we have $(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}-\sum_{j^{\prime}\in[0,k]}\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}\log\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}\geq(\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+1)\log(\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+1)-(\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+z_{j,j})\log(\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor+z_{j,j})\geq 0$. The last inequality holds for the following reason: if $\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor=0$, it follows because $z_{j,j}\in[0,1]_{\mathbb{R}}$ and $z_{j,j}\log z_{j,j}\leq 0$; else $\lfloor\textbf{S}^{(2)}_{\ell+(k+1)+1+j,j}\rfloor\geq 1$, and we use the fact that $x\log x$ is monotonically increasing for $x\geq 1$.

    Therefore

    exp((Sext1)ilog(Sext1)ij[0,k]Si,jextlogSi,jext)1.\exp\left((\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}-\sum_{j^{\prime}\in[0,k]}\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}\log\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}\right)\geq 1~{}. (121)

Combining equations 120 and 121, for all i[+(k+1)+1,+2(k+1)]i\in[\ell+(k+1)+1,\ell+2(k+1)] we have,

exp((Sext1)ilog(Sext1)ij[0,k]Si,jextlogSi,jext)1.\exp\left((\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}\log(\textbf{S}^{\mathrm{ext}}\overrightarrow{\mathrm{1}})_{i}-\sum_{j^{\prime}\in[0,k]}\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}\log\textbf{S}^{\mathrm{ext}}_{i,j^{\prime}}\right)\geq 1~{}.

Substituting previous inequality in Equation 119 we get,

g(Sext)g(S(2))exp(O(γn)).\frac{\textbf{g}(\textbf{S}^{\mathrm{ext}})}{\textbf{g}(\textbf{S}^{(2)})}\geq\exp\left(-O(\gamma n)\right)~{}.

Further, condition 2 of the theorem follows by combining the above inequality with the sixth condition of Lemma 6.15. ∎
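As a closing remark (a consistency check combining only bounds already stated), the above inequality together with the sixth condition of Lemma 6.15 yields

\textbf{g}(\textbf{S}^{\mathrm{ext}})\geq\exp\left(-O\left(\gamma n+\left(\ell+k+\frac{1}{\gamma}\right)\log\Delta\right)\right)\textbf{g}(\textbf{S})~{},

and with the parameter choices $\gamma=\frac{1}{\sqrt{n}}$, $k,\ell\in O(\sqrt{n})$ and $\Delta\in O(n^{2})$ from the main proof, this loss is $\exp\left(-O(\sqrt{n}\log n)\right)$.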

7 Acknowledgments

We thank Jayadev Acharya and Yanjun Han for helpful conversations. We thank the anonymous reviewer for pointing out the alternative proof of the quality of scaled Sinkhorn and Bethe approximations on approximating the permanent of matrices with a bounded number of distinct columns (see Section 3.1 and Appendix A).

References

  • [AA11] Scott Aaronson and Alex Arkhipov. The computational complexity of linear optics. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing, STOC ’11, pages 333–342, New York, NY, USA, 2011. ACM.
  • [ADM+10] J. Acharya, H. Das, H. Mohimani, A. Orlitsky, and S. Pan. Exact calculation of pattern probabilities. In 2010 IEEE International Symposium on Information Theory, pages 1498–1502, June 2010.
  • [ADOS16] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for optimal distribution property estimation. CoRR, abs/1611.02960, 2016.
  • [AOST14] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. The complexity of estimating rényi entropy. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, 2014.
  • [AOST17] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. Estimating renyi entropy of discrete distributions. IEEE Trans. Inf. Theor., 63(1):38–56, January 2017.
  • [AR18] Nima Anari and Alireza Rezaei. A tight analysis of bethe approximation for permanent. CoRR, abs/1811.02933, 2018.
  • [AS04] Noga Alon and Joel H Spencer. The probabilistic method. John Wiley & Sons, 2004.
  • [Bar96] Alexander I Barvinok. Two algorithmic results for the traveling salesman problem. Mathematics of Operations Research, 21(1):65–84, 1996.
  • [Bar17] Alexander Barvinok. Combinatorics and Complexity of Partition Functions. Springer Publishing Company, Incorporated, 1st edition, 2017.
  • [BF93] John Bunge and Michael Fitzpatrick. Estimating the number of species: a review. Journal of the American Statistical Association, 88(421):364–373, 1993.
  • [Bre73] L. M. Bregman. Certain properties of nonnegative matrices and their permanents. Soviet Math. Dokl., 14:945–949, 1973.
  • [BZLV16] Y. Bu, S. Zou, Y. Liang, and V. V. Veeravalli. Estimation of kl divergence between large-alphabet distributions. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 1118–1122, July 2016.
  • [CCG+12] Robert K Colwell, Anne Chao, Nicholas J Gotelli, Shang-Yi Lin, Chang Xuan Mao, Robin L Chazdon, and John T Longino. Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of plant ecology, 5(1):3–21, 2012.
  • [Cha84] Anne Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11:265–270, 1984.
  • [CL92] Anne Chao and Shen-Ming Lee. Estimating the number of classes via sample coverage. Journal of the American statistical Association, 87(417):210–217, 1992.
  • [CSS19] Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 780–791, New York, NY, USA, 2019. ACM.
  • [DS13] Timothy Daley and Andrew D Smith. Predicting the molecular complexity of sequencing libraries. Nature methods, 10(4):325, 2013.
  • [ET76] Bradley Efron and Ronald Thisted. Estimating the number of unseen species: How many words did shakespeare know? Biometrika, 63(3):435–447, 1976.
  • [Für05] Johannes Fürnkranz. Web mining. In Data mining and knowledge discovery handbook, pages 899–920. Springer, 2005.
  • [GS14] Leonid Gurvits and Alex Samorodnitsky. Bounds on the permanent and some applications. arXiv e-prints, page arXiv:1408.0976, Aug 2014.
  • [GS18] Daniel Grier and Luke Schaeffer. New hardness results for the permanent using linear optics. In Proceedings of the 33rd Computational Complexity Conference, CCC ’18, pages 19:1–19:29, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
  • [GTPB07] Zhan Gao, Chi-hong Tseng, Zhiheng Pei, and Martin J Blaser. Molecular analysis of human forearm superficial skin bacterial biota. Proceedings of the National Academy of Sciences, 104(8):2927–2932, 2007.
  • [Gur05] Leonid Gurvits. On the complexity of mixed discriminants and related problems. In Proceedings of the 30th International Conference on Mathematical Foundations of Computer Science, MFCS’05, pages 447–458, Berlin, Heidelberg, 2005. Springer-Verlag.
  • [Gur11] Leonid Gurvits. Unleashing the power of Schrijver’s permanental inequality with the help of the Bethe Approximation. arXiv e-prints, page arXiv:1106.2844, Jun 2011.
  • [HHRB01] Jennifer B Hughes, Jessica J Hellmann, Taylor H Ricketts, and Brendan JM Bohannan. Counting the uncountable: statistical approaches to estimating microbial diversity. Appl. Environ. Microbiol., 67(10):4399–4406, 2001.
  • [HJM17] Yanjun Han, Jiantao Jiao, and Rajarshi Mukherjee. On Estimation of $L_{r}$-Norms in Gaussian White Noise Models. arXiv e-prints, page arXiv:1710.03863, Oct 2017.
  • [HJW16] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of KL divergence between discrete distributions. CoRR, abs/1605.09124, 2016.
  • [HJW18] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under wasserstein distance. arXiv preprint arXiv:1802.08405, 2018.
  • [HJWW17] Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu. Optimal rates of entropy estimation over Lipschitz balls. arXiv e-prints, page arXiv:1711.02141, Nov 2017.
  • [HO19] Yi Hao and Alon Orlitsky. The Broad Optimality of Profile Maximum Likelihood. arXiv e-prints, page arXiv:1906.03794, Jun 2019.
  • [JHW16] J. Jiao, Y. Han, and T. Weissman. Minimax estimation of the l1 distance. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 750–754, July 2016.
  • [JSV04] Mark Jerrum, Alistair Sinclair, and Eric Vigoda. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. J. ACM, 51(4):671–697, July 2004.
  • [JVHW15] J. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, May 2015.
  • [KLR99] Ian Kroes, Paul W Lepp, and David A Relman. Bacterial diversity within the human subgingival crevice. Proceedings of the National Academy of Sciences, 96(25):14547–14552, 1999.
  • [LSW98] Nathan Linial, Alex Samorodnitsky, and Avi Wigderson. A deterministic strongly polynomial algorithm for matrix scaling and approximate permanents. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pages 644–652, New York, NY, USA, 1998. ACM.
  • [OSS+04] A. Orlitsky, S. Sajama, N. P. Santhanam, K. Viswanathan, and Junan Zhang. Algorithms for modeling distributions over large alphabets. In International Symposium on Information Theory, 2004. ISIT 2004. Proceedings., pages 304–304, 2004.
  • [OSW16] Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016.
  • [OSZ03] A. Orlitsky, N. P. Santhanam, and J. Zhang. Always Good Turing: asymptotically optimal probability estimation. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pages 179–188, Oct 2003.
  • [PBG+01] Bruce J Paster, Susan K Boches, Jamie L Galvin, Rebecca E Ericson, Carol N Lau, Valerie A Levanos, Ashish Sahasrabudhe, and Floyd E Dewhirst. Bacterial diversity in human subgingival plaque. Journal of bacteriology, 183(12):3770–3783, 2001.
  • [PGM+01] A. Porta, S. Guzzetti, N. Montano, R. Furlan, M. Pagani, A. Malliani, and S. Cerutti. Entropy, entropy rate, and pattern classification as tools to typify complexity in short heart period variability series. IEEE Transactions on Biomedical Engineering, 48(11):1282–1291, Nov 2001.
  • [PJW17] D. S. Pavlichin, J. Jiao, and T. Weissman. Approximate Profile Maximum Likelihood. arXiv e-prints, December 2017.
  • [PW96] Nina T. Plotkin and Abraham J. Wyner. An Entropy Estimator Algorithm and Telecommunications Applications, pages 351–363. Springer Netherlands, Dordrecht, 1996.
  • [Rad97] Jaikumar Radhakrishnan. An entropy proof of Bregman’s theorem. Journal of Combinatorial Theory, Series A, 77(1):161–164, 1997.
  • [RCS+09] Harlan S Robins, Paulo V Campregher, Santosh K Srivastava, Abigail Wacher, Cameron J Turtle, Orsalem Kahsai, Stanley R Riddell, Edus H Warren, and Christopher S Carlson. Comprehensive assessment of T-cell receptor β-chain diversity in αβ T cells. Blood, 114(19):4099–4107, 2009.
  • [RRSS07] S. Raskhodnikova, D. Ron, A. Shpilka, and A. Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 559–569, Oct 2007.
  • [RVZ17] Aditi Raghunathan, Gregory Valiant, and James Zou. Estimating the unseen from multiple populations. CoRR, abs/1707.03854, 2017.
  • [RWdRvSB99] Fred Rieke, Davd Warland, Rob de Ruyter van Steveninck, and William Bialek. Spikes: Exploring the Neural Code. MIT Press, Cambridge, MA, USA, 1999.
  • [Sch78] A. Schrijver. A short proof of Minc’s conjecture. Journal of Combinatorial Theory, Series A, 25(1):80–83, 1978.
  • [Sch98] Alexander Schrijver. Counting 1-factors in regular bipartite graphs. Journal of Combinatorial Theory, Series B, 72(1):122–135, 1998.
  • [Spe82] E. Spence. H. Minc, Permanents (Encyclopedia of Mathematics and its Applications, Vol. 6, Addison-Wesley Advanced Book Programme, 1978), xviii + 205 pp., £21.50. Proceedings of the Edinburgh Mathematical Society, 25(1):110–110, 1982.
  • [TE87] Ronald Thisted and Bradley Efron. Did Shakespeare write a newly-discovered poem? Biometrika, 74(3):445–455, 1987.
  • [Val79] L. G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2):189–201, 1979.
  • [VBB+12] Martin Vinck, Francesco P. Battaglia, Vladimir B. Balakirsky, A. J. Han Vinck, and Cyriel M. A. Pennartz. Estimation of the entropy based on its polynomial representation. Phys. Rev. E, 85:051139, May 2012.
  • [Von11] Pascal O. Vontobel. The Bethe permanent of a non-negative matrix. CoRR, abs/1107.4196, 2011.
  • [Von12] Pascal O. Vontobel. The Bethe approximation of the pattern maximum likelihood distribution. In 2012 IEEE International Symposium on Information Theory (ISIT), pages 2012–2016, July 2012.
  • [Von13] P. O. Vontobel. The Bethe permanent of a nonnegative matrix. IEEE Transactions on Information Theory, 59(3):1866–1901, March 2013.
  • [Von14] P. O. Vontobel. The Bethe and Sinkhorn approximations of the pattern maximum likelihood estimate and their connections to the Valiant–Valiant estimate. In 2014 Information Theory and Applications Workshop (ITA), pages 1–10, Feb 2014.
  • [VV11a] G. Valiant and P. Valiant. The power of linear estimators. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 403–412, Oct 2011.
  • [VV11b] Gregory Valiant and Paul Valiant. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing, STOC ’11, pages 685–694, New York, NY, USA, 2011. ACM.
  • [WY15] Y. Wu and P. Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. arXiv e-prints, April 2015.
  • [WY16a] Y. Wu and P. Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, June 2016.
  • [WY16b] Yihong Wu and Pengkun Yang. Sample complexity of the distinct elements problem. arXiv e-prints, page arXiv:1612.03375, Dec 2016.
  • [YFW05] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282–2312, July 2005.
  • [ZVV+16] James Zou, Gregory Valiant, Paul Valiant, Konrad Karczewski, Siu On Chan, Kaitlin Samocha, Monkol Lek, Shamil Sunyaev, Mark Daly, and Daniel G. MacArthur. Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects. Nature Communications, 7:13293 EP –, 10 2016.

Appendix A Alternative proof for the distinct column case.

Here we provide an alternative, simpler proof of Lemma 4.1, which was pointed out to us by an anonymous reviewer. This alternative proof is derived from Corollary 3.4.5 in Barvinok’s book [Bar17] (which in turn is derived from the Bregman–Minc inequality), and we formally state that corollary below.

Lemma A.1 (Corollary 3.4.5 from [Bar17]).

Suppose that $\textbf{Q}$ is an $N\times N$ doubly stochastic matrix that satisfies

$$\textbf{Q}_{i,j}\leq\frac{1}{b_{i}}\quad\text{for all }i\in[N],\,j\in[N]$$

for some positive integers $b_{1},\dots,b_{N}$. Then,

$$\mathrm{perm}(\textbf{Q})\leq\prod_{i\in[N]}\frac{(b_{i}!)^{1/b_{i}}}{b_{i}}\,.$$
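As a quick numerical sanity check (not part of the original argument), the following Python sketch verifies the bound of Lemma A.1 on a small hand-constructed doubly stochastic matrix; the matrix $\textbf{Q}$ and the integers $b_{i}$ below are illustrative choices, not drawn from the paper.

```python
import itertools
import math

def permanent(M):
    # Brute-force permanent: sum over all N! permutations (fine for small N).
    n = len(M)
    return sum(
        math.prod(M[i][sigma[i]] for i in range(n))
        for sigma in itertools.permutations(range(n))
    )

# A 4x4 doubly stochastic matrix whose first two rows have entries <= 1/2 and
# whose last two rows have entries <= 1/4, so Lemma A.1 applies with b = (2, 2, 4, 4).
Q = [
    [0.50, 0.50, 0.00, 0.00],
    [0.00, 0.00, 0.50, 0.50],
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
]
b = [2, 2, 4, 4]

bound = math.prod(math.factorial(bi) ** (1.0 / bi) / bi for bi in b)
print(permanent(Q), "<=", bound)  # 0.125 <= 0.153..., consistent with the lemma
```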

Using the above result, we now prove Lemma 4.1, which we restate here for convenience.

Alternative proof of Lemma 4.1.

The lower bound follows from 2.7; in the remainder we prove the upper bound. Let $\textbf{Q}$ be the maximizer of the scaled Sinkhorn objective. It is a well-known fact that $\textbf{Q}$ satisfies

$$\textbf{Q}=\textbf{L}\textbf{A}\textbf{R}\,,$$

where $\textbf{L}$ and $\textbf{R}$ are non-negative diagonal matrices (the left and right scaling matrices). Further, by the symmetry of the objective there exists an optimal solution $\textbf{Q}$ with at most $k$ distinct columns, and we work with such an optimal solution. As $\textbf{L}$ and $\textbf{R}$ are diagonal matrices, the following two equalities are immediate:

$$\mathrm{perm}(\textbf{Q})=\mathrm{perm}(\textbf{L})\,\mathrm{perm}(\textbf{A})\,\mathrm{perm}(\textbf{R})\,,\tag{122}$$
$$\mathrm{scaledsinkhorn}(\textbf{Q})=\mathrm{perm}(\textbf{L})\,\mathrm{scaledsinkhorn}(\textbf{A})\,\mathrm{perm}(\textbf{R})\,.\tag{123}$$
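The scaling $\textbf{Q}=\textbf{L}\textbf{A}\textbf{R}$ can be computed by the classical Sinkhorn iteration, which alternates row and column normalizations. The following minimal Python sketch (illustrative only; the matrix $\textbf{A}$ and the iteration count are arbitrary choices, not from the paper) computes the scaling, and also illustrates the symmetry invoked above: repeated columns of $\textbf{A}$ remain repeated in $\textbf{Q}$, since $\textbf{R}$ only rescales columns.

```python
import numpy as np

def sinkhorn_scaling(A, iters=2000):
    # Alternate row and column normalizations; for a strictly positive A this
    # converges to diagonal scalings l, r with diag(l) @ A @ diag(r) doubly stochastic.
    l = np.ones(A.shape[0])
    r = np.ones(A.shape[1])
    for _ in range(iters):
        l = 1.0 / (A @ r)    # make row sums of diag(l) @ A @ diag(r) equal 1
        r = 1.0 / (A.T @ l)  # make column sums equal 1
    return l, r

# A positive matrix with k = 2 distinct columns (columns 0,1 and 2,3 repeat).
A = np.array([
    [1.0, 1.0, 3.0, 3.0],
    [2.0, 2.0, 1.0, 1.0],
    [1.0, 1.0, 2.0, 2.0],
    [3.0, 3.0, 1.0, 1.0],
])
l, r = sinkhorn_scaling(A)
Q = np.diag(l) @ A @ np.diag(r)
print(Q.sum(axis=0), Q.sum(axis=1))  # both vectors are (numerically) all ones
print(np.allclose(Q[:, 0], Q[:, 1]), np.allclose(Q[:, 2], Q[:, 3]))  # True True
```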

Further note that for every doubly stochastic matrix $\textbf{Q}$ we always have

$$\exp(-N)\leq\mathrm{scaledsinkhorn}(\textbf{Q})\,.\tag{124}$$

Therefore, combining Equations 122, 123 and 124, to prove the upper bound it suffices to show that

$$\mathrm{perm}(\textbf{Q})\leq\exp\left(O\left(k\log\frac{N}{k}\right)\right)\cdot\exp(-N)\,.$$

As the matrix $\textbf{Q}$ has at most $k$ distinct columns, let $\phi_{1},\ldots,\phi_{k}$ be the multiplicities of these distinct columns. Note that if a column has multiplicity $\phi_{i}$, then every element of this column is at most $1/\phi_{i}$: each row of the doubly stochastic matrix $\textbf{Q}$ contains $\phi_{i}$ copies of that element (one in each repeated column), and these copies sum to at most $1$. Now by Lemma A.1 (Corollary 3.4.5 in [Bar17]), applied to $\textbf{Q}^{\top}$ (recall that $\mathrm{perm}(\textbf{Q})=\mathrm{perm}(\textbf{Q}^{\top})$) with $b$ equal to $\phi_{i}$ for each of the $\phi_{i}$ rows arising from the $i$-th distinct column, we have

$$\mathrm{perm}(\textbf{Q})\leq\prod_{i=1}^{k}\frac{\phi_{i}!}{\phi_{i}^{\phi_{i}}}\leq\exp\left(O\left(k\log\frac{N}{k}\right)\right)\cdot\exp(-N)\,,$$

where the last inequality follows because the product $\prod_{i=1}^{k}\phi_{i}!/\phi_{i}^{\phi_{i}}$ is maximized when all the $\phi_{i}$’s are equal to $N/k$; by Stirling’s approximation, $(N/k)!/(N/k)^{N/k}=\exp(-N/k+O(\log(N/k)))$, so this maximum is $\exp(-N+O(k\log(N/k)))$. This concludes the proof. ∎
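As a final sanity check (again not part of the original argument), the following Python sketch exhaustively confirms, for small illustrative values of $N$ and $k$, that the balanced partition maximizes $\prod_{i}\phi_{i}!/\phi_{i}^{\phi_{i}}$, and compares the maximum to the Stirling-based estimate $\exp(-N)(2\pi N/k)^{k/2}$.

```python
import math
from itertools import combinations_with_replacement

def objective(phis):
    # The quantity prod_i phi_i! / phi_i^phi_i from the final bound.
    return math.prod(math.factorial(p) / p**p for p in phis)

N, k = 12, 3
partitions = [
    phis
    for phis in combinations_with_replacement(range(1, N + 1), k)
    if sum(phis) == N
]
best = max(partitions, key=objective)
print(best, objective(best))  # (4, 4, 4): the balanced partition is the maximizer
print(math.exp(-N) * (2 * math.pi * N / k) ** (k / 2))  # Stirling estimate, same order
```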