The Bethe and Sinkhorn Permanents of Low Rank Matrices
and Implications for Profile Maximum Likelihood
Abstract
In this paper we consider the problem of computing the likelihood of the profile of a discrete distribution, i.e., the probability of observing the multiset of element frequencies, and of computing a profile maximum likelihood (PML) distribution, i.e., a distribution with maximum profile likelihood. For each problem we provide polynomial time algorithms that, given i.i.d. samples from a discrete distribution, achieve an approximation factor of , improving upon the previous best-known bound achievable in polynomial time of (Charikar, Shiragur and Sidford, 2019). Through the work of Acharya, Das, Orlitsky and Suresh (2016), this implies a polynomial time universal estimator for symmetric properties of discrete distributions over a broader range of error parameters.
We achieve these results by providing new bounds on the quality of approximation of the Bethe and Sinkhorn permanents (Vontobel, 2012 and 2014). We show that each of these is an approximation to the permanent of matrices with non-negative rank at most , improving upon the previously known bounds of . To obtain our results on PML, we exploit the fact that the PML objective is proportional to the permanent of a certain Vandermonde matrix with distinct columns, i.e. with non-negative rank at most . As a by-product of our work we establish a surprising connection between the convex relaxation in prior work (CSS19) and the well-studied Bethe and Sinkhorn approximations.
1 Introduction
Symmetric property estimation of distributions (throughout this paper, we use the word distribution to refer to discrete distributions) is an important and well studied problem in statistics and theoretical computer science. Given access to i.i.d samples from a hidden discrete distribution p, the goal is to estimate , for a symmetric property . Formally, a property is symmetric if it is invariant to permuting the labels, i.e. it is a function of the multiset of probabilities and does not depend on the symbol labels. There are many well-studied such properties, including support size and coverage, entropy, distance to uniformity, Renyi entropy, and sorted distance. Understanding the computational and sample complexity of estimating these symmetric properties has led to an extensive line of interesting research over the past decade.
Symmetric property estimation spans applications in many different fields. For instance, entropy estimation has found applications in neuroscience [RWdRvSB99], physics [VBB+12] and others [PW96, PGM+01]. Support size and coverage estimation were initially used in estimating ecological diversity [Cha84, CL92, BF93, CCG+12] and subsequently applied to many different applications [ET76, TE87, Für05, KLR99, PBG+01, DS13, RCS+09, GTPB07, HHRB01]. For applications of other symmetric properties we refer the reader to [HJWW17, HJM17, AOST14, RVZ17, ZVV+16, WY16b, RRSS07, WY15, OSW16, VV11b, WY16a, JVHW15, JHW16, VV11a].
Early work on symmetric property estimation developed estimators tailored to the particular property of interest. Consequently, a fundamental and important open question was to come up with an estimator that is universal, i.e. the same estimator could be used for all symmetric properties. A natural approach for constructing universal estimators is the plug-in approach: given samples, we first compute a distribution independently of the property, and then output the value of the property on this computed distribution as our estimate.
Our approach is based on the observation (see [ADOS16]) that a sufficient statistic for estimating a symmetric property from a sequence of samples is the profile, i.e. the multiset of frequencies of symbols in the sequence; e.g. the sequence abca, in which a appears twice and b and c appear once each, has profile {2, 1, 1}. We provide an efficient universal estimator that is based on the plug-in approach applied to the profile maximum likelihood (PML) distribution introduced by Orlitsky et al. [OSS+04]: given a sequence of samples, PML is the distribution that maximizes the likelihood of the observed profile. The problem of computing the PML distribution has since been studied in several papers, applying heuristic approaches such as the Bethe/Sinkhorn approximation [Von12, Von14], the EM algorithm [OSS+04], dynamic programming [PJW17] and algebraic methods [ADM+10].
A recent paper of Acharya et al. [ADOS16] showed that a plug-in estimator using the optimal PML distribution is universal in estimating various symmetric properties of distributions. In fact, it suffices to compute a -approximate PML distribution (i.e. a distribution that approximates the PML objective to within a factor of ) for for constant . Previous work of the authors [CSS19] gave the first efficient algorithm to compute a -approximate PML for some non-trivial . In particular, [CSS19] gave a nearly linear running time algorithm to compute an -approximate PML distribution. In this work, we give an efficient algorithm to compute an -approximate PML distribution.
The parameter in -approximate PML affects the error parameter regime under which the estimator is sample complexity optimal. Smaller values of yield a universal estimator that is sample optimal over a broader parameter regime. For instance, [CSS19] show that -approximate PML (throughout this paper, the tilde notation hides poly terms) is sample complexity optimal for estimating certain symmetric properties within accuracy for . On the other hand, [ADOS16] showed that computing an -approximate PML is sample complexity optimal for . However, note that using current analysis techniques [ADOS16], we do not know how to exploit the computation of an exact PML any better than that of an -approximate PML: both are sample complexity optimal over the same error parameter regime.
In our work, we use the Bethe approximation of the permanent, or the Bethe permanent for short, a previously proposed heuristic to compute an approximate PML distribution. This is based on the Bethe free energy approximation originating in statistical physics and is closely connected to the belief propagation algorithm [YFW05, Von13]. The idea of using the Bethe permanent for computing an approximate PML distribution comes from the fact that the likelihood of a profile with respect to a distribution can be written as the permanent of a non-negative Vandermonde matrix (which we call the profile probability matrix). For a non-negative matrix, [GS14] showed that the ratio between the permanent and the Bethe permanent is upper bounded by , which was later improved to [AR18] (note that previous results on the Bethe permanent do not immediately imply non-trivial results for PML; for consistency with the literature, we use approximation factors for PML). A natural question is whether the approximation ratio of the Bethe permanent can be bounded in terms of a structural parameter finer than the input dimension of the matrix. In this work, we show that the approximation ratio between the permanent and the Bethe permanent is upper bounded by an exponential in the non-negative rank of the matrix (up to a logarithmic factor). We also give an explicit construction of a matrix to show that our result for this structural parameter is asymptotically tight. As the non-negative rank of any non-negative matrix is at most , our analysis implies an upper bound of for some constant on the approximation ratio. Therefore our work (asymptotically) generalizes previous results for general non-negative matrices.
To obtain our efficient algorithm, we prove a slightly stronger statement than the bound on the Bethe permanent of a matrix with non-negative rank at most . We show that a scaling of a simpler approximation of the permanent, known as the Sinkhorn permanent (Sinkhorn is also called capacity in the literature), also approximates the permanent up to an exponential in the non-negative rank of the matrix (up to log factors). This implies our bound for the Bethe permanent and shows that scaled Sinkhorn is a compelling alternative to Sinkhorn, with a tighter worst-case multiplicative approximation to the permanent.
An immediate application of our work on the Bethe and the scaled Sinkhorn permanent is to approximate PML. Given samples, the number of distinct columns in the profile probability matrix is always upper bounded by , i.e. its non-negative rank is at most . Therefore our analysis of the scaled Sinkhorn permanent immediately implies an approximation to the PML objective with respect to a fixed distribution. This result, combined with probability discretization, results in a convex program whose optimal solution is a fractional representation of an approximate PML distribution. We round this fractional solution to output a valid distribution that is an -approximate PML distribution. Surprisingly the resulting convex program is exactly the same as the one in [CSS19], where a completely different (combinatorial) technique was used to arrive at the convex program. Our work here provides a better analysis of the convex program in [CSS19] using a more delicate and sophisticated rounding algorithm.
Organization of the paper:
In Section 2 we present preliminaries. In Section 3, we provide the main results of the paper. In Section 4, we analyze the scaled Sinkhorn permanent of structured matrices. In Section 4.1, we prove an upper bound for the approximation ratio of the scaled Sinkhorn permanent to the permanent as a function of the number of distinct columns. In Section 4.2, we prove the generalized result of the scaled Sinkhorn permanent for the low non-negative rank matrices. In Section 5, we prove the lower bound for the Bethe and scaled Sinkhorn approximations of the permanent. In Section 6, we combine the result for the scaled Sinkhorn permanent with the idea of probability discretization to provide the convex program that returns a fractional representation of an approximate PML distribution. In the same section, we provide the rounding algorithm to return a valid approximate PML distribution.
1.1 Overview of Techniques
In [CSS19], the authors presented a convex relaxation for the PML objective. This was obtained from a combinatorial view of the PML problem. In a sequence of steps, they discretized the set of probabilities and the frequencies, grouped the terms in the objective into groups and developed a relaxation for the sum of terms in the largest group, giving an approximation. In this paper, we exploit the fact that the likelihood of a profile with respect to a distribution is the permanent of a certain non-negative Vandermonde matrix (referred to here as the profile probability matrix with respect to a distribution) and that the PML objective is an optimization problem over such permanents. We work with the same convex relaxation we derived earlier, but relate it to the well known Bethe and scaled Sinkhorn approximations of the permanent. In fact, Vontobel [Von12, Von14] proposed the Bethe and Sinkhorn permanents as heuristic approximations of the PML objective, but bounding the quality of the solution was an open problem [Von11]. We show that both the Bethe and scaled Sinkhorn permanents are within a factor of the PML objective. En route, we show that the approximation ratios of the Bethe and scaled Sinkhorn permanents for any non-negative matrix are upper bounded by an exponential in the non-negative rank of the matrix . This is a strengthening of the well known upper bound on the approximation ratio of both the Bethe and scaled Sinkhorn permanents of an matrix.
In [CSS19], the fact that the convex problem we obtained was a relaxation of the PML objective followed directly from the combinatorial derivation of our relaxation. By contrast, our analysis here exploits the non-trivial fact that the Bethe and scaled Sinkhorn approximations are lower bounds for the permanent of a non-negative matrix. The Bethe and scaled Sinkhorn permanents of the profile probability matrix with respect to a fixed distribution are optimum solutions to maximization problems over doubly stochastic matrices where the objective functions have entropy-like terms involving the entries of and . In order to obtain an upper bound on the Bethe and scaled Sinkhorn approximation as a function of the non-negative rank, we show the existence of a doubly stochastic matrix as a witness such that the objective of the Bethe and scaled Sinkhorn w.r.t. upper bounds the permanent of within the desired factor.
We first work with a simpler setting of matrices with at most distinct columns. (In the final preparation of this paper for posting, an anonymous reviewer showed that a simpler proof for the distinct column case can be derived using Corollary 3.4.5 of Barvinok’s book [Bar17]; the proof of that corollary in turn uses the famous Bregman–Minc inequality. We thank the anonymous reviewer for this and include the derivation in Appendix A. In contrast, our proof is self-contained and we believe it provides further insight into the structure of the Sinkhorn/Bethe approximations. See Section 3.1 for further details.) Here we consider a modified matrix that contains the distinct columns of . We define a distribution on permutations of the domain where the probability of a permutation is proportional to its contribution to the permanent of . There is a many-to-one mapping from such permutations to 0-1 matrices with row sums 1 and column sums , the number of times the th column of appears in . We next define a real-valued, non-negative matrix with row sums 1 and column sums , in terms of the marginals of the distribution . We also define a different distribution on 0-1 row-stochastic matrices by independent sampling from . Finally, we use the fact that the KL-divergence between and is non-negative to get the required upper bound on the scaled Sinkhorn approximation with a doubly stochastic witness (obtained from ). This proof technique is inspired by the recent work of Anari and Rezaei [AR18] that gives a tight bound on the approximation ratio of the Bethe approximation for the permanent of an non-negative matrix.
Though this bound on the quality of the Bethe and scaled Sinkhorn approximations for non-negative matrices with distinct columns suffices for our PML applications, interestingly we show that it can be extended to non-negative matrices with bounded rank. In order to obtain an upper bound on the Bethe and scaled Sinkhorn approximation as a function of the non-negative rank of , recall that we need to show the existence of a suitable doubly stochastic witness which certifies the required guarantee. We express the permanent of as the sum of terms of the form where matrices and have at most distinct columns. We focus on the largest of these terms, and construct a doubly stochastic witness for matrix from the witnesses for matrices and in this largest term. This doubly stochastic witness certifies the required guarantee and we get an upper bound on the scaled Sinkhorn approximation as a function of the non-negative rank. This result for the scaled Sinkhorn approximation further implies an upper bound for the Bethe approximation.
Even with this improved bound on the quality of the Bethe and scaled Sinkhorn approximations as applied to the PML objective, challenges remain in obtaining an improved approximate PML distribution. In particular, we do not know of an efficient algorithm to maximize the Bethe or the scaled Sinkhorn permanent of the profile probability matrix over a family of distributions, as would be needed to compute the Bethe or the scaled Sinkhorn approximation to the optimum of the PML objective. Prior work by Vontobel suggests an alternating maximization approach, but this is only guaranteed to produce a local optimum. To address this, we revisit the efficiently computable convex relaxation from [CSS19] and show that it is suitably close to the scaled Sinkhorn approximation. This is quite surprising as the prior derivation of this relaxation in [CSS19] was purely combinatorial and had nothing to do with the scaled Sinkhorn approximation.
The final challenge towards obtaining our PML results is to round the fractional solution produced so that the approximation guarantee is preserved. The rounding procedure from [CSS19] does not immediately suffice, but we present a more sophisticated and delicate rounding procedure that does give us the required approximation guarantee. The rounding algorithm proceeds in three steps: in the first step we apply a procedure analogous to [CSS19] to handle large probability values, and in the later steps we provide a new procedure for the smaller probability values; in each step, we ensure that the objective function does not drop significantly. The input to the rounding procedure is a matrix whose rows correspond to discretized probability values and whose columns correspond to distinct frequencies. We create rows corresponding to new probability values in the course of the rounding algorithm, maintain column sums, and eventually ensure that all row sums are integral and that the objective function has not dropped significantly.
2 Preliminaries
Let and denote the interval of integers and reals and respectively. Let be the domain of elements and be its size. Let be a non-negative matrix, where its ’th entry is denoted by . We further use and to denote the row and column corresponding to and respectively. The non-negative rank of a non-negative matrix is equal to the smallest number such that there exist non-negative vectors for such that . Let be the set of all permutations of domain and we denote a permutation in the following way . The permanent of a matrix A, denoted by , is defined as follows,
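As a concrete illustration of this definition (our own sketch, not code from the paper), the permanent can be evaluated directly as the sum over all permutations; this takes exponential time and is meant only to make the notation concrete.

```python
import itertools
import numpy as np

def permanent(A: np.ndarray) -> float:
    # per(A) = sum over permutations sigma of prod_i A[i, sigma(i)]
    n = A.shape[0]
    return sum(
        np.prod([A[i, sigma[i]] for i in range(n)])
        for sigma in itertools.permutations(range(n))
    )

# For the all-ones 3 x 3 matrix every permutation contributes 1, so per = 3! = 6.
print(permanent(np.ones((3, 3))))  # 6.0
```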
Let be the set of all non-negative matrices that are doubly stochastic. For any matrix and , we define the following set of functions:
(1) |
Further,
Using these definitions, we define the Bethe permanent of a matrix.
Definition 2.1.
For a matrix , the Bethe permanent of A is defined as follows,
A well known and important result about the Bethe permanent is that it lower bounds the value of the permanent of a non-negative matrix; we state this result next.
We next define the Sinkhorn permanent of a matrix and later we state the relationship between the Bethe and the Sinkhorn permanent.
Definition 2.3.
For a matrix , the Sinkhorn permanent of A is defined as follows,
To establish the relationship between the Bethe and the Sinkhorn permanent we need the following lemma from [GS14].
Lemma 2.4 (Proposition 3.1 in [GS14]).
For any distribution , the following holds,
For any matrix , each row of Q is a distribution; therefore the following holds,
As a corollary of the above inequality we have,
Corollary 2.5.
For any non-negative matrix , the following inequality holds,
Later we will see that it is convenient to work with rather than itself; we call this expression the scaled Sinkhorn permanent and formally define it next.
Definition 2.6.
For a matrix , the scaled Sinkhorn permanent of A is defined as follows,
The above expression can be equivalently stated as,
Corollary 2.7.
For any matrix , the following inequality holds,
which further implies,
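The displayed formulas in Definitions 2.1, 2.3 and 2.6 and in the corollaries above were lost in extraction. The sketch below (our own illustration, with assumed notation) evaluates the standard forms of these objectives from [Von13, Von14]: for a doubly stochastic Q, the Sinkhorn objective sums Q_ij log(A_ij / Q_ij), the Bethe objective additionally includes the (1 - Q_ij) log(1 - Q_ij) terms, and, consistent with the per-row bound of Lemma 2.4, we take the scaled Sinkhorn permanent to be e^{-n} times the Sinkhorn permanent; this scaling constant is an assumption, not a quote from the paper. For strictly positive A, Sinkhorn–Knopp scaling yields the Q maximizing the Sinkhorn objective.

```python
import numpy as np

def sinkhorn_scaling(A, iters=1000):
    # For strictly positive A, alternating row/column normalization converges
    # to the doubly stochastic Q that maximizes sum_ij Q_ij log(A_ij / Q_ij).
    Q = np.array(A, dtype=float)
    for _ in range(iters):
        Q /= Q.sum(axis=1, keepdims=True)
        Q /= Q.sum(axis=0, keepdims=True)
    return Q

def sinkhorn_objective(A, Q):
    return float(np.sum(Q * np.log(A / Q)))

def bethe_objective(A, Q):
    # Bethe objective at a given doubly stochastic Q; its maximum over Q is log bethe(A).
    mask = Q < 1
    extra = float(np.sum((1 - Q[mask]) * np.log(1 - Q[mask])))
    return sinkhorn_objective(A, Q) + extra

A = np.random.rand(6, 6) + 0.1
n = A.shape[0]
Q = sinkhorn_scaling(A)
print("log sinkhorn        ~", sinkhorn_objective(A, Q))
print("log scaledSinkhorn  ~", sinkhorn_objective(A, Q) - n)   # assumed e^{-n} scaling
print("Bethe objective at this Q (a lower bound on log bethe(A)):", bethe_objective(A, Q))
```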
Beyond approximations to the permanent of a matrix, we next state two important results that will be helpful throughout the paper. The first is Stirling's approximation for the factorial function and the second is the non-negativity of the KL divergence between two distributions.
Lemma 2.8 (Stirling’s approximation).
For all , the following holds:
Let and be distributions defined on some domain . The KL divergence denoted between distributions and is defined as follows,
Lemma 2.9 (Non-negativity of KL divergence).
For any distributions and defined on domain , the KL divergence between distributions and satisfies,
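A small numeric sanity check of these two facts (our own snippet; the exact constants of Lemma 2.8 were stripped above, and the bounds below are one standard form of Stirling's approximation, stated as an assumption):

```python
import numpy as np
from math import e, factorial, pi, sqrt

n = 20
stirling = sqrt(2 * pi * n) * (n / e) ** n
# One standard form of Stirling's bounds: sqrt(2*pi*n)(n/e)^n <= n! <= e * sqrt(2*pi*n)(n/e)^n.
assert stirling <= factorial(n) <= e * stirling

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
assert kl_divergence(p, q) >= 0 and kl_divergence(p, p) == 0  # Lemma 2.9
```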
In the remainder of this section we provide formal definitions related to PML.
2.1 Profile maximum likelihood
Let be the set of all discrete distributions supported on domain . From here on we use the word distribution to refer to discrete distributions. Throughout this paper we assume that we receive a sequence of independent samples from an underlying distribution . Let be the set of all length sequences and be one such sequence with denoting its th element. The probability of observing sequence is:
where is the frequency/multiplicity of symbol in sequence and is the probability of domain element .
For any given sequence one can define its profile (a histogram of a histogram, or fingerprint), which is a sufficient statistic for symmetric property estimation.
Definition 2.10 (Profile).
For any sequence , let be the set of all its non-zero distinct frequencies and be elements of the set M. The profile of a sequence denoted is
is the number of domain elements with frequency in (the profile does not contain information about the number of unseen domain elements). We call the length of profile and, as a function of profile , . Let denote the set of all profiles of length . We use to denote the number of distinct frequencies in the profile and (note that the number of distinct frequencies in a length sequence is always upper bounded by ). For convenience we use to denote the vector of observed frequencies, therefore for all .
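To make Definition 2.10 concrete, here is a minimal illustration (our own, not from the paper) of computing the profile of a sequence as the map from each distinct non-zero frequency to the number of symbols with that frequency:

```python
from collections import Counter

def profile(seq):
    freq = Counter(seq)            # frequency of each observed symbol
    return Counter(freq.values())  # distinct frequency -> number of symbols with that frequency

seq = list("abracadabra")          # a appears 5 times, b and r twice, c and d once
print(profile(seq))                # two symbols seen once, two seen twice, one seen five times
print(sum(f * m for f, m in profile(seq).items()))  # recovers the length n = 11
```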
For any distribution , the probability of a profile is defined as:
(2) |
Let be a sequence such that . We define a profile probability matrix with respect to sequence (therefore profile ) and distribution p as follows,
(3) |
where is the frequency of domain element in sequence and recall . We are interested in the permanent of the matrix , and note that it is invariant under the choice of sequence satisfying . Therefore we index the matrix with the profile rather than the sequence itself. The number of distinct columns in is equal to the number of distinct observed frequencies plus one (for the unseen), i.e. .
The probability of a profile with respect to distribution p (from Equation 20 in [OSZ03], Equation 15 in [PJW17]) in terms of permanent of matrix is given below:
(4) |
where and is the number of unseen domain elements (given a distribution p, we know its domain and therefore the value of ).
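The following sketch (our own illustration) builds the profile probability matrix of Equation 3 for a distribution p on a known domain and evaluates the profile probability of Equation 4 as a constant times its permanent. The combinatorial constant used here follows the standard formulas in [OSZ03, PJW17]; the paper's exact normalization was stripped above, so treat it as an assumption.

```python
import itertools
from collections import Counter
from math import factorial

import numpy as np

def permanent(A):
    n = A.shape[0]
    return sum(np.prod([A[i, s[i]] for i in range(n)])
               for s in itertools.permutations(range(n)))

def profile_probability(p, phi, n):
    # p: probabilities of the N domain elements; phi: {frequency: count of symbols with it}.
    N = len(p)
    unseen = N - sum(phi.values())                       # number of unseen domain elements
    freqs = [0] * unseen + [f for f, c in phi.items() for _ in range(c)]
    A = np.array([[px ** f for f in freqs] for px in p])  # N x N profile probability matrix
    counts = Counter(freqs)                               # includes the column group of unseen elements
    C = factorial(n)                                      # assumed normalization, following [OSZ03, PJW17]
    for f, c in counts.items():
        C /= factorial(f) ** c * factorial(c)
    return C * permanent(A)

p = [0.5, 0.3, 0.2]
print(profile_probability(p, {2: 1}, n=2))  # probability that one symbol is seen twice
```

For example, with p = (0.5, 0.3, 0.2) and the profile in which a single symbol is seen twice (n = 2), the output equals the sum of squared probabilities, 0.38, matching a direct calculation.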
The distribution which maximizes the probability of a profile is the profile maximum likelihood distribution which we formally define next.
Definition 2.11 (Profile maximum likelihood).
For any profile , a profile maximum likelihood (PML) distribution is:
and is the maximum PML objective value.
The central goal of this paper is to define efficient algorithms for computing approximate PML distributions defined as follows.
Definition 2.12 (Approximate PML).
For any profile , a distribution is a -approximate PML distribution if
3 Results
Here we state the main results of this paper. In our first class of main results, we improve the analysis of the scaled Sinkhorn permanent for structured non-negative matrices. We first show that the scaled Sinkhorn permanent approximates the permanent of a non-negative matrix A, where the approximation factor (up to log factors) depends exponentially on the non-negative rank of the matrix A. We formally state this result next.
Theorem 3.1 (Scaled Sinkhorn permanent approximation for low non-negative rank matrices).
For any matrix with non-negative rank at most , the following inequality holds,
(5) |
Further using (See 2.7) and (See Lemma 2.2) we immediately get the same result for the Bethe permanent.
Corollary 3.2 (Bethe permanent approximation for low non-negative rank matrices).
For any matrix with non-negative rank at most , the following inequality holds,
(6) |
Interestingly, in the worst case, Sinkhorn is an approximation to the permanent of , even when A has at most 1 distinct column (e.g. consider the all 1’s matrix). Consequently, for matrices with non-negative rank at most , whenever , scaled Sinkhorn is a compelling alternative to Sinkhorn, with a tighter worst-case multiplicative approximation to the permanent.
Our results improve the analysis of the Bethe permanent for such structured matrices. Previously, the best known analysis of the Bethe permanent showed an -approximation factor to the permanent [AR18]. The analysis in [AR18] is tight for general non-negative matrices, and the authors showed that this bound cannot be improved without leveraging further structure. Our next result is of a similar flavor: we provide an asymptotically tight example for Theorem 3.1 and Corollary 3.2.
Theorem 3.3 (Lower bound for the Bethe and the scaled Sinkhorn permanents approximation).
There exists a matrix with non-negative rank , that satisfies
(7) |
which further implies,
(8) |
An immediate application of the above stated results is to PML. Recall that for any fixed distribution p and profile , is proportional to the permanent of the non-negative matrix (see Section 2 for the definition of ). Note the number of distinct columns in the profile probability matrix is upper bounded by the number of distinct frequencies plus one, which in turn is always less than (where is the length of the profile). Therefore the non-negative rank of the profile probability matrix is always upper bounded by . Since can be computed in polynomial time [CSS19] ( corresponds to a convex optimization problem, and a minor modification of the approach in [CSS19] to solve a related, but slightly different, optimization problem yields a polynomial time algorithm to compute up to high accuracy), Theorem 3.1 implies an efficient algorithm to approximate the value for a fixed distribution p up to a multiplicative factor, which is also the best known approximation factor achieved by a deterministic algorithm.
Analyzing the relationship between the Bethe permanent and the permanent of the profile probability matrix was posed as an interesting research direction in [Von11] (see Section VII). Moreover, one of the primary interests in the area of algorithmic statistics/machine learning is to efficiently compute the PML distribution. Exploiting the structure of the doubly stochastic matrix Q that maximizes , combined with the probability discretization idea from [CSS19], we provide an efficient algorithm to compute an approximate PML distribution. We use Lemma 4.1 to argue the approximation guarantee of our approximate PML distribution, and we summarize this result below.
Theorem 3.4 (-approximate PML).
For any given profile , Algorithm 4 computes an -approximate PML distribution in time.
Prior to our work, the best known result [CSS19] gave an efficient algorithm to compute an -approximate PML distribution.
One important application of approximate PML is to symmetric property estimation. In [ADOS16], the authors showed that a plug-in estimator based on an approximate PML distribution is sample complexity optimal for estimating entropy, support, support coverage and distance to uniformity. Combining their result with our Theorem 3.4, we get an efficient version of Theorem 2 in [ADOS16], which we summarize next.
Theorem 3.5 (Efficient universal estimator using approximate PML).
Let be the optimal sample complexity of estimating entropy, support, support coverage and distance to uniformity. If for some constant , then there exists a PML based universal plug-in estimator that runs in time and is sample complexity optimal for estimating entropy, support, support coverage and distance to uniformity to accuracy .
Note that the dependency on in the above theorem and the approximation factor in Theorem 3.4 are strictly better than those of [CSS19], another efficient PML based approach for universal symmetric property estimation; [CSS19] works when the error parameter .
Recent work [HO19] further establishes the broad optimality of approximate PML. [HO19] shows optimality of approximate PML distribution based estimators for other symmetric properties, such as sorted distribution estimation (under distance), -Renyi entropy for non-integer , and a broad class of additive properties that are Lipschitz. [HO19] also provides a PML-based tester to test whether an unknown distribution is far from a given distribution in distance, achieving the optimal sample complexity up to logarithmic factors. Our result further implies an efficient version of all these results.
3.1 Related work
We divide the related work into two broad categories: permanent approximations and profile maximum likelihood.
Permanent approximations:
The first set of related work concerns computing the permanent of matrices. [Val79] showed that computing the permanent of a matrix is #P-hard even when its entries are in {0, 1}. This led to the study of computing approximations to the permanent. An additive approximation to the permanent for arbitrary A was given by [Gur05]. On the other hand, a multiplicative approximation to the permanent (or even determining its sign) is hard for general A [AA11, GS18]. These hardness results led to the study of multiplicative approximations to the permanent for special classes of matrices, one such class being the set of non-negative matrices. In this direction, [JSV04] gave the first efficient randomized algorithm to approximate the permanent within multiplicative accuracy. There is also a rich literature on deterministic approximations to the permanent of non-negative matrices. [LSW98] gave the first deterministic algorithm to approximate the permanent of non-negative matrices, with approximation ratio . [Gur11], using an inequality from [Sch98], showed that the Bethe permanent lower bounds the value of the permanent of non-negative matrices. We refer the reader to [Von13, GS14] for the polynomial computability of the Bethe permanent and to [AR18] for a more thorough literature survey on the Bethe permanent and other related work.
As discussed in the footnote in the introduction, an anonymous reviewer showed us an alternative and simpler proof of the upper bound on the scaled Sinkhorn approximation to the permanent of matrices with at most distinct columns (Lemma 4.1). This alternative proof is deferred to Appendix A and is derived using Corollary 3.4.5 in Barvinok’s book [Bar17]. That result, in turn, is proved using the Bregman–Minc inequality, conjectured by Minc (cf. [Spe82]) and later proved by Bregman [Bre73]. The Bregman–Minc inequality is well known and many different proofs [Sch78, Rad97, AS04] are known. In comparison to this alternative proof for matrices with distinct columns (Lemma 4.1), our proof is self-contained and intuitive. We believe it could help provide further insight into the Sinkhorn/Bethe approximations.
Profile maximum likelihood:
The second set of related work concerns profile maximum likelihood and its applications. As discussed in the introduction, PML was introduced by [OSS+04]. Many heuristic approaches, such as the EM algorithm [OSS+04], algebraic approaches [ADM+10] and a dynamic programming approach [PJW17], have been proposed to compute approximations to PML. Further, [Von12, Von14] used the Bethe permanent as a heuristic to compute the PML distribution. None of these approaches provide theoretical guarantees on the quality of the approximate PML distribution, and it was an open question to efficiently compute a non-trivial approximate PML distribution. [CSS19] gave the first efficient algorithm to compute an approximate PML distribution, where is the number of samples.
The connection between PML and universal estimators was first studied in [ADOS16], which showed that an approximate PML distribution can be used as a universal estimator for estimating symmetric properties, namely entropy, distance to uniformity, support size and coverage. See [HO19] for the broad applicability of approximate PML in property testing and in estimating other symmetric properties such as sorted distance, Renyi entropy, and a broad class of additive properties. [CSS19] combined with [ADOS16] gave the first efficient PML based universal estimator for symmetric property estimation. There have been several other approaches for designing universal estimators for symmetric properties. Valiant and Valiant [VV11b] adopted and rigorously analyzed a linear programming based approach for universal estimators proposed by [ET76] and showed that it is sample complexity optimal in the constant error regime for estimating certain symmetric properties (namely entropy, support size, support coverage, and distance to uniformity). Recent work of Han, Jiao and Weissman [HJW18] applied a local moment matching based approach to designing efficient universal symmetric property estimators for a single distribution. [HJW18] achieves the optimal sample complexity in broader error regimes for estimating the power sum function, support and entropy.
Estimating symmetric properties of a distribution is a rich field and extensive work has been dedicated to determining the optimal sample complexity of estimating each of these properties. The optimal sample complexities for estimating many symmetric properties were resolved in the past few years: support [VV11b, WY15], support coverage [OSW16, ZVV+16], entropy [VV11b, WY16a], distance to uniformity [VV11a, JHW16], sorted distance [VV11a, HJW18], Renyi entropy [AOST14, AOST17], KL divergence [BZLV16, HJW16] and many others.
Comparison to [CSS19]: [CSS19] provides the first efficient algorithm for computing a -approximate PML distribution for for some constant , where is the number of samples. Formally, [CSS19] computes an -approximate PML distribution. If and are the number of distinct probability values and frequencies respectively, then [CSS19] provides a convex program which, using combinatorial techniques, they show approximates the PML objective up to a multiplicative factor. This convex program outputs a fractional solution, and [CSS19] provides a rounding algorithm that outputs a valid integral solution (corresponding to a valid distribution); the rounding procedure incurs a further multiplicative loss. Using the discretization results, up to an -multiplicative loss one can assume , and therefore [CSS19] provides a -approximate PML distribution.
In our current work, by contrast, using results for the scaled Sinkhorn permanent, we show that the same convex program in [CSS19] approximates the PML objective up to a multiplicative factor. We also provide a better rounding algorithm that outputs a valid distribution and incurs a multiplicative loss. Using the discretization results, up to an -multiplicative loss one can assume , and therefore our work provides a -approximate PML distribution.
4 The Sinkhorn permanent for structured matrices.
In this section, we provide the proof of our first main theorem (Theorem 3.1). We show that the scaled Sinkhorn permanent of a non-negative matrix approximates its permanent, where the approximation factor is exponential in the non-negative rank of the matrix (up to log factors). Our proof is divided into two parts. First, in Section 4.1, we work with the simpler setting of matrices with at most distinct columns and prove the following lemma.
Lemma 4.1 (Scaled Sinkhorn permanent approximation).
For any matrix with at most distinct columns, the following holds,
(9) |
Further using the above result, in Section 4.2 we prove our main theorem (Theorem 3.1) for low non-negative rank matrices.
4.1 The Sinkhorn permanent for distinct column case.
We start this section by defining some notation that captures the structure of repetition of columns in a matrix. For the remainder of this section we fix a matrix . We let denote the number of distinct columns of A and use to denote these distinct columns. Further we let denote the matrix formed by these distinct columns. We use to denote the ’th column of matrix A and let denote the number of columns equal to . It is immediate that,
(10) |
where is the size of the domain. For any matrix define,
(11) |
In the first half of this section, we show existence of a matrix (See Lemma 4.4) such that , , and further (See Lemma 4.5),
(12) |
Later in the second half (See Lemma 4.6), we show that for any matrix that satisfies and , there exists a matrix (recall is the set of all doubly stochastic matrices) that satisfies,
(13) |
However, using 2.7 we already know that, . Further using the definition of and combining with Equations 12 and 13 we get,
In the remainder, we provide proofs for all the above mentioned inequalities, for which we need the following set of definitions. Let , be the subset of all matrices that are row stochastic, meaning there is exactly a single 1 in each row. Let be the set of matrices such that any satisfies .
Definition 4.2.
Let be the function that takes a permutation as input and returns a matrix in the following way,
(14) |
Remark: Note that as desired for all because of the following. For any , let . Since for all are distinct, we have . Further for any , .
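The displayed formula in Definition 4.2 was stripped; under the natural reading of the surrounding text (an assumption on our part), the map sends a permutation sigma to the 0-1 matrix whose (x, j) entry is 1 exactly when column sigma(x) of A equals the j-th distinct column, so that row sums are 1 and column sums equal the column multiplicities. A short sketch:

```python
import numpy as np

def to_zero_one_matrix(A, sigma):
    # Map a permutation sigma (array of length N) to an N x k 0-1 matrix:
    # entry (x, j) is 1 iff column sigma[x] of A equals the j-th distinct column of A.
    distinct, col_to_distinct = [], {}
    for y in range(A.shape[1]):
        key = tuple(A[:, y])
        if key not in col_to_distinct:
            col_to_distinct[key] = len(distinct)
            distinct.append(key)
    N, k = A.shape[0], len(distinct)
    M = np.zeros((N, k), dtype=int)
    for x in range(N):
        M[x, col_to_distinct[tuple(A[:, sigma[x]])]] = 1
    return M  # row sums are 1; column sum j equals the number of columns of A equal to distinct column j
```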
We next define the probability of a permutation with respect to matrix A as follows,
(15) |
Further we define a marginal distribution on and later we will establish that this is indeed a probability distribution, that is, probabilities add up to 1.
(16) |
For , we next provide another equivalent expression for .
(17) |
In the first and second equality, we used definitions of and (See Equation 15). For any , let . Further for any , let be such that , then that is further equal to because is equal to if and otherwise. Therefore the third equality holds. For the final equality, observe that for any if we let , then for each , any permutation within the subset of elements results in a permutation that satisfies . These permutations can be carried out independently for each that corresponds to number of permutations and all of them have the same value.
Using the derivation from above, the definition for can also be written as follows:
(18) |
Note for , the expression for can be equivalently written as follows:
(19) |
We next show that the defined above is a valid distribution.
Remark: The domain of distribution is , but its support is a subset of .
Definition 4.3.
For the distribution , we define a non-negative matrix with respect to as follows:
(20) |
Lemma 4.4.
The matrix P defined in Equation 20 satisfies the following conditions:
(21) |
Proof.
We first evaluate the row sum. For each ,
In the second inequality we used that , meaning for each , . Next we evaluate the column sum, for each ,
In the first equality we used the definition of . In the second inequality we interchanged the summations. In the final equality we used . ∎
The matrix P defined in Equation 20 is important because we can upper bound the permanent of matrix A in terms of entries of this matrix. We formalize this result in our next lemma.
Lemma 4.5.
For matrix , if P is the matrix defined in Equation 20, then
Proof.
We first calculate the expectation of and express it in terms of matrix P.
(22) |
The second equality holds because the support of distribution is subset of . In the third equality we used Equation 19. We now simplify the last term in the final expression from the above derivation.
(23) |
Combining Equation 22 and Equation 23 together we get,
(24) |
We next define a different distribution on using the following sampling procedure: For each , pick a column independently with probability . Note that this is a valid sampling procedure because for each , . The description of distribution is as follows: for each ,
(25) |
Remark: Note that and is a valid distribution.
We next calculate the expectation of with respect to distribution and express it in terms of matrix P. Note that because for all and we get,
We now calculate the KL divergence between distributions and .
Using Lemma 2.9, we have , that further implies,
(26) |
In the second inequality we used Lemma 2.8 on each and further in the third inequality we used and the fact that the function is always upper bounded by . Further using the definition of (See Equation 11), we conclude the proof. ∎
We provided an upper bound to the permanent of matrix A and all that remains is to relate this upper bound to the scaled Sinkhorn permanent of matrix A. Our next lemma will serve this purpose.
Lemma 4.6.
For any matrix that satisfies,
(27) |
there exists a doubly stochastic matrix such that,
(28) |
Proof.
Define matrix as follows,
where in the definition above is such that . Now we verify the row and column sums of matrix Q. For each ,
(29) |
We next evaluate the column sums. For each , let (note that is a function of ; for convenience, our notation does not capture this dependence) be such that , then
(30) |
Therefore the matrix Q is doubly stochastic and we next relate with . Recall the definition of (Equation 1),
(31) |
We analyze the above term and express it in terms of entries of matrices P and .
(32) |
The first equality follows because for all are distinct. The second equality follows because for each , consider any such that and note that for all such ’s, and . The third equality follows because .
We further simplify the final term in the above derivation.
(33) |
Combining Equation 32, Equation 33 and further substituting back in Equation 31 we get,
(34) |
In the final expression, we used the definition of and combined it with . ∎
We are now ready to prove the main lemma of this section, which is restated for convenience. See 4.1
Proof.
Consider the matrix P defined in Equation 20. By Lemma 4.4, matrix P satisfies the conditions of Lemma 4.6; therefore, there exists a doubly stochastic matrix such that . Combining it with Lemma 4.5 we get , which further implies . The lower bound for the follows from 2.7 and we conclude the proof. ∎
We next state another interesting property of the matrix P defined in Equation 20. This property will be useful for the purposes of PML (Section 6).
Theorem 4.7.
For matrix , the matrix P defined in Equation 20 satisfies the following:
Proof.
For any , recall by the definitions of terms and ,
(35) |
(36) |
For any we next construct a unique (and vice versa) such that,
Each corresponds to a bipartite graph where the vertices on the left side correspond to the set and those on the other side to , such that the degree of every left vertex is 1 and the degree of every right vertex is .
Consider , we divide the analysis into the following two cases,
-
1.
If , meaning both vertices are connected to in our bipartite graph representation. Then, .
-
2.
If , meaning vertex is connected to and to some other vertex . In this case we swap the edges, meaning we remove edges and add to construct . We formally define next,
(37)
In both cases, clearly . Further, implies for all and the following equality holds,
The same analysis also holds when we start with a and construct . We have a one to one correspondence between elements Y and in the sets and respectively, satisfying,
Therefore, and we conclude the proof. ∎
4.2 Generalization to low non-negative rank matrices
Here we prove our main result for the scaled Sinkhorn permanent of low non-negative rank matrices (Theorem 3.1). To prove this result, we use the performance result of the scaled Sinkhorn permanent for non-negative matrices with distinct columns. The following lemma relates the permanent of a matrix A of non-negative rank to matrices with at most distinct columns and will be crucial for our analysis.
Lemma 4.8 ([Bar96]).
For any matrix of non-negative rank , if for , then
where , .
As the number of terms in the above summation is low, the maximizing term is a good approximation to the permanent of A.
Corollary 4.9.
Given a non-negative matrix , let denote the non-negative rank of the matrix. If for is any non-negative matrix factorization of A, then
(38) |
Proof.
The number of feasible ’s in the set is at most and we conclude the proof. ∎
Lemma 4.10.
Let be any doubly stochastic matrices. Then is a doubly stochastic matrix.
Proof.
We first consider the row sums,
Therefore the matrix Q is row stochastic. In the above derivation, the second and third equalities follow because and are row stochastic matrices respectively. We now consider the column sums,
The above derivation follows because and are column stochastic and therefore the matrix Q is column stochastic. As the matrix Q is both row and column stochastic we conclude the proof. ∎
We are now ready to prove our main result of this section and we restate it for convenience. See 3.1
Proof.
Let be the maximizer of the optimization problem 38, then
(39) |
Recall to prove the theorem, we need to construct a doubly stochastic witness Q that satisfies:
We construct such a witness Q from the doubly stochastic witnesses for matrices and . For all define , equivalently and note that . Let and be the doubly stochastic matrices that maximize the scaled Sinkhorn permanent for matrices and respectively. Therefore by Lemma 4.1 the following inequalities hold,
(40) |
(41) |
where recall and . Without loss of generality by the symmetry (with respect to columns within ) and concavity of the scaled Sinkhorn objective, we can assume that the maximizing matrices and satisfy the following: for all and ,
(42) |
Note that the doubly stochastic matrix that we constructed for the proof of Lemma 4.1 also satisfies the above collection of equalities. Now combining Equations 39, 40 and 41 we get,
(43) |
In the second inequality we use Stirling's approximation (Lemma 2.8), and the error term due to this approximation is upper bounded by . In the third inequality we used .
Let , then by Lemma 4.10 the matrix Q is doubly stochastic. In the remainder of the proof we show that,
(44) |
where recall . As matrix Q is doubly stochastic, the above inequality combined with Equation 43 concludes the proof. Therefore, in the remainder we focus our attention on proving Equation 44, and we start by simplifying the above expression. Define,
(45) |
For all and the variables defined above satisfy the following,
(46) |
where in the third inequality we used the definition of . We next simplify and lower bound the term in terms of these newly defined variables.
(47) |
where in the first equality we used . In the second inequality we used the weighted AM-GM inequality. Now consider the term and substitute the above lower bound,
(48) |
Summing over all the pairs we get,
(49) |
In the above expression the following terms simplify,
(50) |
Similarly,
(51) |
Also note that,
(52) |
In the third and fourth inequality we used Equation 46 and Equation 45 respectively. Substituting Equations 50, 51 and 52 in Equation 49 we get,
(53) |
By rearranging terms the above expression can be equivalently written as,
(54) |
In the above expression we have a lower bound for the term and we relate it to terms and . Consider the following term,
(55) |
In the final equality we renamed the variables; the rest of the equalities are straightforward. Carrying out a similar derivation we also get,
(56) |
As before in the final equality we renamed variables. Substituting Equations 55 and 56 in Equation 54 we get,
(57) |
Therefore to prove Equation 44, all that remains is to show that,
(58) |
To prove the above inequality we use the symmetry in the solutions and . Recall from Equation 42, for all and we have . Define and for any . We next substitute these definitions and simplify terms on the left hand side of Equation 58,
(59) |
In the final equality we used and the rest of the equalities are straightforward. A similar argument to the above also gives,
(60) |
Note in the final equality we renamed variables. Finally,
(61) |
Again, each of the terms in the parentheses further simplifies as follows,
Similarly,
Substituting back all the above three expressions in Equation 61 we get,
(62) |
Further substituting Equations 59, 60 and 62 in the derivation below we get,
Therefore the above derivation proves Equation 58 and we further substitute it in Equation 57 to get,
(63) |
The above expression combined with Equation 43 gives the following upper bound on the log of permanent,
(64) |
The above expression combined with definition of the scaled Sinkhorn permanent concludes the proof. ∎
5 Lower bound for Bethe and scaled Sinkhorn permanent approximations
Here we provide the proof of Theorem 3.3 that is stated below for convenience. See 3.3
Proof.
Assume is divisible by . Let 1 and 0 be the all ones and all zeros matrices respectively. Note that , and the proof of this statement follows because is the maximizer of the optimization problem over all doubly stochastic matrices Q. On the other hand, , where in the last inequality we used Stirling's approximation. Now consider the following matrix,
In the above definition, A is a matrix, with blocks. For the matrix A we have, and . Therefore .
The proof for the case when is not divisible by is similar. Here matrix A is the block diagonal matrix where the first blocks correspond to the all ones matrix and the final block corresponds to the all ones matrix, where . For this definition of matrix A we have and . Therefore . The first condition of the theorem follows by taking exponentials on both sides of the previous inequality.
The second inequality in the theorem follows by using (See 2.7). As the matrix A constructed here is of non-negative rank , we conclude the proof. ∎
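A numeric illustration of this construction (our own, with two assumptions about the stripped formulas: A is block diagonal with k all-ones blocks of size n/k, and the scaled Sinkhorn permanent is e^{-n} times the Sinkhorn permanent): then per(A) = ((n/k)!)^k, the maximizing Q spreads mass k/n uniformly over each block so that log sinkhorn(A) = n log(n/k), and the logarithmic gap grows as Theta(k log(n/k)).

```python
from math import lgamma, log

def log_gap(n, k):
    # log per(A) - log scaledSinkhorn(A) for the assumed block-diagonal construction.
    log_per = k * lgamma(n // k + 1)            # log ((n/k)!)^k
    log_scaled_sinkhorn = n * log(n / k) - n    # log sinkhorn(A) - n, assumed e^{-n} scaling
    return log_per - log_scaled_sinkhorn

for n, k in [(16, 2), (64, 4), (256, 16)]:
    print(n, k, round(log_gap(n, k), 2), round(k * log(n / k), 2))
```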
6 Improved approximation to profile maximum likelihood
In this section, we provide an efficient algorithm to compute an -approximate PML distribution. We first introduce the setup and some new notation. For convenience, we also recall some definitions from Section 2.
We are given access to independent samples from a hidden distribution supported on domain . Let be this length sequence and be its corresponding profile. Let be the frequency for domain element in sequence . Let be the number of non-zero distinct frequencies and we use to denote these distinct frequencies. Note that the number of non-zero distinct frequencies is always upper bounded by . For , we define . Let be the PML distribution with respect to profile and is formally defined as follows,
Recall the definition of profile probability matrix with respect to profile and distribution p,
(65) |
where is the frequency of domain element in the observed sequence and recall . Note that the number of distinct columns is equal to the number of distinct observed frequencies plus one (for the unseen), and therefore it is .
From Equation 4, the probability of profile with respect to distribution p is,
(66) |
here denotes the number of unseen domain elements; note that it is not part of the profile. Given a distribution p we know its domain and therefore the unseen domain elements. Also, note that is independent of the term ; it depends just on the profile and not on the underlying distribution p.
We now provide the motivation behind the techniques used in this section. Recall that the goal of this section is to compute an approximate PML distribution and we wish to do this using the results from the previous section. A first attempt would be to use the scaled Sinkhorn (or the Bethe) permanent as a proxy for the term in Equation 66 and solve the following optimization problem:
The above optimization problem is indeed a good proxy for the PML objective, and it is equivalent to the following:
Taking log and ignoring the constants we get the following equivalent optimization problem,
Interestingly, the function is concave with respect to p for a fixed Q and concave with respect to Q for a fixed p (see [Von14]). Unfortunately, however, the function in general is not concave with respect to p and Q simultaneously [Von14], and we do not know how to solve the above optimization problem. Vontobel [Von14] proposed an alternating maximization algorithm to solve the above optimization problem, and studied its implementation and convergence to a stationary point; see [Von14] for the empirical performance of this approach. Using the Bethe permanent as a proxy in the above optimization problem has similar issues; see [Von12, Von14] for further details.
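As an illustration of the alternating approach (our own schematic, not necessarily the exact procedure of [Von14], which works with the Bethe permanent), the Sinkhorn proxy can be alternately maximized: for fixed p the maximization over Q is a matrix-scaling problem, and for fixed Q the maximization over p has a closed form, since the objective restricted to p is a weighted sum of log p_x with weights given by the frequency-weighted row sums of Q. Only a local optimum is guaranteed.

```python
import numpy as np

def alternating_max(f, outer=50, inner=300):
    # f: length-N vector of observed frequencies (zeros for unseen domain elements).
    N = len(f)
    p = np.full(N, 1.0 / N)
    for _ in range(outer):
        A = np.power(p[:, None], f[None, :])   # profile probability matrix A_xy = p_x^{f_y}
        Q = A.copy()
        for _ in range(inner):                 # Q-step: Sinkhorn scaling of A
            Q /= Q.sum(axis=1, keepdims=True)
            Q /= Q.sum(axis=0, keepdims=True)
        w = Q @ f                              # p-step: p_x proportional to sum_y Q_xy f_y
        p = w / w.sum()
    return p

print(alternating_max(np.array([3.0, 2.0, 1.0, 0.0, 0.0])))
```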
To address the above issue we use the idea of probability discretization from [CSS19], meaning we assume the distribution takes all its probability values from some fixed predefined set. We use this idea in a different way than [CSS19] and further exploit the structure of the optimal solution Q to write a convex optimization problem that approximates the PML objective. The solution of this convex optimization problem is a fractional representation of the distribution, which we later round to return an approximate PML distribution with the desired guarantees. Surprisingly, the final convex optimization problem we write is exactly the same as the one in [CSS19], and our work gives a better analysis of the same convex program by a completely different approach.
The rest of this section is organized as follows. In Section 6.1, we study probability discretization. In the same section, we also study the application of results from Section 4 to approximating the permanent of the profile probability matrix (). We further provide the convex optimization problem at the end of that section, which can be solved efficiently and returns a fractional representation of the approximate PML distribution. In Section 6.2, we provide the rounding algorithm that returns our final approximate PML distribution. Up to this point, all our results are independent of the choice of the probability discretization set. In Section 6.3, we choose an appropriate probability discretization set and combine the analysis from all the previous sections; there we state and analyze our final algorithm that returns a -approximate PML distribution. Note that our rounding algorithm is technical, and for continuity of reading we defer all the proofs of results in Section 6.2 to Section 6.4.
6.1 Probability discretization
Here we study the idea of probability discretization that is also used in [CSS19]. We combine this with other ideas from Section 4 to provide a convex program that approximates the PML objective.
Let be some discretization of the probability space; in this section we consider distributions that take all their probability values in the set R. All results in this section hold for any finite set R, and we specify the exact definition of R in Section 6.3.
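Purely for illustration (an assumption on our part, not the paper's choice of R, which is fixed only in Section 6.3), one natural discretization is a geometric grid with multiplicative spacing 1 + alpha down to a smallest value r_min, whose size is O(log(1/r_min)/alpha):

```python
def geometric_grid(alpha=0.05, r_min=1e-6):
    # A hypothetical discretization set R: 1, 1/(1+alpha), 1/(1+alpha)^2, ... >= r_min.
    R, r = [], 1.0
    while r >= r_min:
        R.append(r)
        r /= (1 + alpha)
    return R

R = geometric_grid()
print(len(R), R[:3])
```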
The discretization introduces the technicality that probability values may not sum to one, and we redefine pseudo-distributions and discrete pseudo-distributions from [CSS19] to deal with this.
Definition 6.1 (Pseudo-distribution).
is a pseudo-distribution if and a discrete pseudo-distribution with respect to R if all its entries are in R as well. We use and to denote the set of all such pseudo-distributions and discrete pseudo-distributions with respect to R respectively.
We extend and use the following definition for for any vector and therefore for pseudo-distributions as well,
Further, for any probability terms defined involving p, we define those terms for any vector just by replacing by everywhere. For convenience we refer to for any pseudo-distribution q as the “probability” of profile with respect to q.
For a scalar and set S, define and as follows:
Definition 6.2 (Discrete pseudo-distribution).
For any distribution , its discrete pseudo-distribution with respect to R is defined as:
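The displayed definition was stripped; under the assumption (following [CSS19]) that each probability value is rounded down to the largest element of R not exceeding it, a sketch is below. The rounded vector then sums to at most 1, so it is a pseudo-distribution in the sense of Definition 6.1.

```python
import numpy as np

def discretize(p, R):
    # Round each probability down to the largest r in R with r <= p_x
    # (values below min(R) are clipped up to min(R) in this sketch).
    R = np.sort(np.asarray(R, dtype=float))
    idx = np.searchsorted(R, p, side="right") - 1
    return R[np.clip(idx, 0, len(R) - 1)]

R = [0.001, 0.01, 0.1, 0.25, 0.5]
print(discretize(np.array([0.47, 0.30, 0.23]), R))  # [0.25 0.25 0.1], sums to at most 1
```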
We now define some additional definitions and notation that will help us lower and upper bound the permanent of profile probability matrix by a convex optimization problem.
-
•
Let be the cardinality of set R and be the ’th element of set R.
-
•
For any discrete pseudo-distribution q with respect to R, that is , we let , be the number of domain elements with probability .
-
•
Let be the set of non-negative matrices such that, for any the following holds:
(67) where is the number of unseen domain elements and we use to denote the corresponding frequency element ( is not part of the profile and is not given to us; later in this section, we get rid of this dependency on ).
-
•
For any define,
(68) -
•
Throughout this section, for convenience unless stated otherwise we abuse notation and use A to denote the matrix . The underlying pseudo-distribution q and profile with respect to matrix A will be clear from the context.
The first half of this section is dedicated to bounding the in terms of the function . For any fixed discrete pseudo-distribution q and profile , we will show that,
Later, in the second half, we use the above inequality and maximize over all discrete pseudo-distributions to find the approximate PML distribution; this result is summarized later. We start by showing the lower bound, and later in Theorem 6.4 we prove the upper bound.
Theorem 6.3.
For any discrete pseudo-distribution q with respect to R and profile , let A be the matrix defined (with respect to q and ) in Equation 65, then the following holds,
(69) |
Proof.
For any matrix , define matrix as follows,
where in the definition above and are such that and . We now establish that matrix Q is doubly stochastic. For each , let be such that , then
(70)
For each , let be such that , then
(71)
Since matrix Q is doubly stochastic, by the definition of the scaled Sinkhorn permanent (See 2.6) and 2.7 we have . To conclude the proof we show that .
(72)
We consider the final expression above and simplify it. First note that,
Similarly,
Using the above two expressions, the final expression of Equation 72 can be equivalently written as,
(73)
Combining Equation 72, Equation 73 and substituting , we get:
In the above equality we used for all and for any . Combining the above inequality with we get,
The above inequality holds for any (and therefore holds for the maximizer as well) and we conclude the proof. ∎
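The argument above hinges on verifying that the constructed matrix Q is doubly stochastic. As a small generic sanity check (not taken from the paper), the row- and column-sum conditions can be tested numerically as follows.

    import numpy as np

    def is_doubly_stochastic(Q, tol=1e-9):
        """Check that Q is (entrywise) non-negative and every row and column sums to 1."""
        Q = np.asarray(Q, dtype=float)
        return bool((Q >= -tol).all()
                    and np.allclose(Q.sum(axis=1), 1.0, atol=tol)
                    and np.allclose(Q.sum(axis=0), 1.0, atol=tol))

    # Example: any convex combination of permutation matrices is doubly stochastic.
    P1, P2 = np.eye(3), np.eye(3)[[1, 2, 0]]
    assert is_doubly_stochastic(0.4 * P1 + 0.6 * P2)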
We next give an upper bound on the log of the permanent of A in terms of .
Theorem 6.4.
For any discrete pseudo-distribution q with respect to R and profile , let A be the matrix defined (with respect to q and ) in Equation 65. Then,
Proof.
Here we construct a particular matrix such that , which immediately implies the theorem. Recall that by Lemmas 4.5 and 4.4, there exists a matrix such that and , and it satisfies (the inequality holds because matrix A has distinct columns and is asymptotically the same as ). Further using the definition of we get,
(74)
where for the matrix A defined (with respect to q and ) in Equation 65, we have,
We now define the matrix S that satisfies the conditions of the lemma.
By Theorem 4.7, for any fixed , all such that share the same probability value, and we use the notation to denote this value. Using this definition, we have:
(75)
Further note that for any , if is any element such that , then
We wish to show that . We first analyze the row sum constraint. For each ,
We now analyze the column constraint. For each ,
In the remainder of the proof we show that the matrix S defined earlier satisfies . We start by simplifying the term in Equation 74,
(76)
In the second equality, we used and further by the definition of we have for all that satisfy . In the third equality, we used . In the fourth equality we used Equation 75. In the final equality, we used and the final term further simplifies to the following, . ∎
Note that using Theorems 6.3 and 6.4, for the matrix A defined (with respect to q and ) in Equation 65, we showed the following,
(77)
Our final goal in this section is to maximize over discrete pseudo-distributions q, but let us take a step back and first focus on writing an upper bound. Consider the term in the expression above; it depends on the discrete pseudo-distribution q in two different places. The first is the constraint set and the second is the function (because it contains the term in its expression). We address the first issue by defining the following new set that encodes the constraint sets for all discrete pseudo-distributions simultaneously.
Definition 6.5.
Let be the set of non-negative matrices, such that any satisfies,
(78)
Note that in the definition of we removed the constraint related to , where recall denotes the number of unseen domain elements. Not having a constraint with respect to helps us encode discrete pseudo-distributions (with respect to R) of different domain sizes. Further, for any , there is a discrete pseudo-distribution associated with it, which we define next.
Definition 6.6.
For any , the discrete pseudo-distribution associated with S is defined as follows: For any arbitrary number of domain elements assign probability .
Note that in the definition above, is a valid pseudo-distribution because of the third condition in Equation 78. Further note that, for any discrete pseudo-distribution q and , the distribution associated with S is a permutation of the distribution q. Since the probability of a profile is invariant under permutations of the distribution, we treat all these distributions the same and do not distinguish between them.
We now handle the second issue, namely removing the dependency on the discrete pseudo-distribution q from the function . To handle this issue, we define a new function that, when maximized over the sets and , approximates the values and respectively (see the next theorem for the formal statement). For any , the function is defined as follows,
(79)
Note that we switch gears and define the function in exponential form: approximates the value rather than its logarithm, which helps with readability of the proofs. The following theorem summarizes this result.
Theorem 6.7.
Let R be a probability discretization set, and let be a profile and q a discrete pseudo-distribution with respect to R. The following inequality holds,
(80)
Further,
(81)
Proof.
For any discrete pseudo-distribution q with respect to R and profile , let A be the matrix defined (with respect to q and ) in Equation 65. Then, by Equation 77 we have,
Further by Equation 4 we have,
Combining the above two equations we have,
(82)
We now simplify the term in the above expression. First note that for any ,
Therefore,
(83)
By Lemma 2.8 (Stirling’s approximation) we have,
(84)
The first inequality follows because for each , we have , which by using is further lower bounded by . Equation 84 follows by taking product over all . Now combining Equation 84 and Equation 83 we have,
(85)
The first statement of the lemma follows by combining the above Equation 85 with Equation 82, that is we have,
(86)
Given a profile , for any discrete pseudo-distribution we have and further combining it with above inequality we get,
Note that for any , we also have , where is the discrete pseudo-distribution associated with respect to S (See 6.6). Therefore,
For the last inequality in the above derivation we used Equation 86. Now combining the previous two inequalities we conclude the proof. ∎
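The proof above relies on Stirling-type bounds on factorials (Lemma 2.8, whose exact constants are not restated here). As a quick numerical sanity check, the standard two-sided estimate sqrt(2*pi*n) (n/e)^n <= n! <= e*sqrt(n) (n/e)^n, which such lemmas typically package, can be verified as follows.

    import math

    def stirling_log_bounds(n):
        """Return (lower, upper) bounds on log(n!) coming from
        sqrt(2*pi*n)*(n/e)**n <= n! <= e*sqrt(n)*(n/e)**n, valid for n >= 1."""
        base = n * (math.log(n) - 1.0)                 # log of (n/e)^n
        lower = 0.5 * math.log(2.0 * math.pi * n) + base
        upper = 1.0 + 0.5 * math.log(n) + base
        return lower, upper

    for n in [1, 2, 5, 10, 50, 100]:
        lo, hi = stirling_log_bounds(n)
        assert lo <= math.lgamma(n + 1) <= hi          # lgamma(n + 1) = log(n!)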
The previous theorem provides an upper bound for the probability of a profile with respect to any discrete pseudo-distribution. However, one issue with this upper bound is that it is not efficiently computable, because the set is not convex (due to the integrality constraints). We relax these integrality constraints and define the following new set.
Definition 6.8.
Let be the set of non-negative matrices, such that any satisfies,
(87)
Lemma 6.9.
Let R be a probability discretization set. Given a profile , the following holds,
(88)
Proof.
Note that in the above lemma the upper bound depends only on the profile ( has no dependency on ), and we have removed all dependencies related to distributions (and also ). Next we show that this upper bound can be efficiently computed, using the result that the function is log concave in S.
Lemma 6.10 (Lemma 4.16 in [CSS19]).
Function is log concave in S.
Theorem 6.11 (Theorem 4.17 in [CSS19]).
Given a profile , the optimization problem can be solved in time (note that here we hide the logarithmic dependence on , the sample size).
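To illustrate how such a program can be set up, here is a minimal sketch, not the paper's implementation. It maximizes a caller-supplied concave objective log_g (standing in for the logarithm of the function in Equation 79, which is not reproduced here) over a relaxed constraint set of the form suggested by Definition 6.8: non-negative matrices with prescribed column sums and total probability mass at most one. This reading of the constraints, and the use of an off-the-shelf solver instead of the method behind the stated running time, are assumptions made for the sketch only.

    import numpy as np
    from scipy.optimize import minimize

    def solve_relaxation(r, phi, log_g):
        """Maximize a concave log_g(S) over {S >= 0 : sum_i S[i, j] = phi[j] for all j,
        sum_{i, j} r[i] * S[i, j] <= 1}.  Any concave callable can be plugged in."""
        r, phi = np.asarray(r, float), np.asarray(phi, float)
        ell, k = len(r), len(phi)
        # Starting point spreading each column uniformly (may violate the mass
        # constraint; SLSQP tolerates an infeasible start).
        x0 = np.outer(np.ones(ell) / ell, phi).ravel()
        cons = [{"type": "eq",
                 "fun": (lambda x, j=j: x.reshape(ell, k)[:, j].sum() - phi[j])} for j in range(k)]
        cons.append({"type": "ineq",
                     "fun": lambda x: 1.0 - float(r @ x.reshape(ell, k).sum(axis=1))})
        res = minimize(lambda x: -log_g(x.reshape(ell, k)), x0, method="SLSQP",
                       bounds=[(0.0, None)] * (ell * k), constraints=cons)
        return res.x.reshape(ell, k)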
6.2 Rounding Algorithm
In the previous section we provided an efficiently computable upper bound on the probability of a profile with respect to any discrete pseudo-distribution . Computing this upper bound yields a fractional solution , and we need to round this solution to construct a discrete pseudo-distribution that approximately attains the upper bound. In this section we provide a rounding algorithm that takes as input and returns a solution , where is an extended probability discretization set. Further, using , we construct a discrete pseudo-distribution with respect to such that approximates the upper bound and is therefore an approximate PML distribution. Our rounding algorithm is technical, and we next provide an overview to better understand it.
Overview of the rounding algorithm:
The goal of the rounding algorithm is to take a fractional solution as input and round each row sum to an integral value while preserving the column sums and value. Our rounding algorithm proceeds in three steps:
Step 1:
Consider the fractional solution and recall the rows are indexed by the elements of set R (which represent probability values). We first round the rows corresponding to the higher probability values by simply taking the floor (rounding down to the nearest integer) of each entry. This procedure ensures the integrality of the row sums (corresponding to higher probability values) but violates the column sum constraints. To satisfy the column sum constraints and the distributional constraint (i.e. last condition in Equation 78) simultaneously, we create rows corresponding to new probability values using Algorithm 2. However to ensure that all these new rows also have integral row sums, we modify the (old) rows corresponding to lower probability values accordingly. Let be the solution returned by the first step of the rounding algorithm. Algorithm 2 ensures that the value is not much smaller than . In , all the new rows and (old) rows corresponding to higher probability values have integral row sums and we round the remaining rows corresponding to smaller probability values next.
Step 2:
In this step, we round all the rows corresponding to the smaller probability values. For each of these rows, we scale all the entries by the same factor to ensure that the row sum is rounded down to the nearest integer. Similar to step 1, using Algorithm 2 we create rows corresponding to new probability values to maintain the column sum constraints and the distributional constraint; all these new rows again correspond to small probability values. Unlike in the previous step, the new rows created in step 2 may not have integral row sums, but these rows have a nice diagonal structure. Let be this intermediate solution created in step 2. Algorithm 2 ensures that the value is not much smaller than (and hence ). Note that all the row sums in are integral except those of the new rows created in step 2, which all have small probability values and a diagonal structure.
Step 3:
In this final step, using Algorithm 1 we round the new rows created in step 2. Algorithm 1 exploits the low probability and diagonal structure in these rows. The diagonal structure ensures that there is just one non-zero entry in any particular row and we modify the solution (from the previous step) as follows. We transfer the mass from a non-integral lower probability value row to an immediate higher probability value row until the (lower probability value) row sum is integral. This process might violate the distributional constraint and we rescale the probability values accordingly to satisfy this constraint. Let be the solution returned by step 3. We ensure that all column sums are preserved, all row sums are integral and the value is not much smaller than (and hence not much smaller than ).
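The following toy sketch mirrors the three steps at a very high level. It is schematic only: the threshold between "high" and "low" probability rows, and the omission of the new rows that Algorithms 1 and 2 create to restore the column sums and the distributional constraint, are simplifications, not the actual Algorithms 1-3.

    import numpy as np

    def toy_round(S, r, high_threshold):
        """Schematic rounding of a fractional solution S whose rows are indexed by the
        probability values r.  The real algorithm additionally creates new rows
        (Algorithm 2) and shifts mass between adjacent low-probability rows
        (Algorithm 1) so that column sums and total probability mass are preserved."""
        S = np.array(S, dtype=float)
        high = np.asarray(r, dtype=float) >= high_threshold

        # Step 1: floor every entry of the high-probability rows.
        S[high] = np.floor(S[high])

        # Step 2: scale each low-probability row so that its row sum becomes integral.
        for i in np.where(~high)[0]:
            s = S[i].sum()
            if s > 0:
                S[i] *= np.floor(s) / s

        # Step 3 (placeholder): the real algorithm fixes the remaining fractional rows,
        # which have a diagonal structure, by moving mass to an adjacent row.
        return S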
In the remainder of this section we state all three algorithms and the results corresponding to them. For continuity of reading, we defer the proofs of these results to Section 6.4. For convenience, we first state Algorithm 1, which rounds the rows corresponding to the low probability values in step 3 of our main rounding algorithm (Algorithm 3). We follow this algorithm with a lemma that summarizes the guarantees it provides. Later we state Algorithm 2, which creates rows corresponding to new probability values to preserve the column sums and the distributional constraint; this algorithm is invoked as a subroutine in both steps 1 and 2 of Algorithm 3. Finally, we state our main rounding algorithm, which consists of three different steps, followed by results analyzing each of these steps separately. The final result (Theorem 6.16) is the main theorem of this subsection and summarizes the final guarantees promised by our rounding algorithm.
(89)
The next lemma summarizes the quality of the solution produced by Algorithm 1.
Lemma 6.12.
Given a set of reals for all such that , weights for all , and exponents for all (here need not be equal to zero), using Algorithm 1 we can efficiently compute a matrix such that the following conditions hold:
1. and .
2. .
3. .
We next describe Algorithm 2. The algorithm takes input and creates a new probability discretization set (Lines 6-10). The solution output by the algorithm belongs to , has the same column sums as B, and its value is lower bounded by .
The next lemma summarizes the quality of the solution produced by Algorithm 2.
Lemma 6.13.
The solution returned by Algorithm 2 satisfies the following conditions:
1. for all .
2. For any , let be such that ; then for all and . (Diagonal Structure)
3. For any , let be such that ; then .
4. and .
5. Let for all and ; then
6. For each , the new row corresponds to the probability value .
In the remainder of this section, we state and analyze our rounding algorithm. Our algorithm works in three steps, and we show that the solutions produced in the intermediate and final steps all have the desired approximation guarantee. We divide the analysis into three parts: Lemma 6.14, Lemma 6.15 and Theorem 6.16 analyze the guarantees provided by the intermediate solutions , and the final solution , respectively.
The next lemma summarizes the quality of the intermediate solution produced by Step 1 of Algorithm 3.
Lemma 6.14.
The solution returned by the step 1 of Algorithm 3 satisfies the following:
1. for all .
2. and .
3. , where .
Using Lemma 6.14 we now provide the guarantees for the solution returned by the step 2 of Algorithm 3.
Lemma 6.15.
The solution returned by the step 2 of Algorithm 3 satisfies the following,
1. for all .
2. for all and (Diagonal Structure).
3. and .
4. .
5. For any , .
6. .
Using Lemma 6.15 we now provide the guarantees for the final solution returned by Algorithm 3.
Theorem 6.16.
The final solution returned by Algorithm 3 satisfies the following,
1. .
2. .
6.3 Combining everything together
Here we combine the analysis from the previous two sections to provide an efficient algorithm that computes an approximate PML distribution. The main contribution of this section is to define a probability discretization set R that guarantees the existence of a discrete pseudo-distribution q with respect to R that is also an approximate PML pseudo-distribution. We then use this probability discretization set R, combined with results from the previous two sections, to output an approximate PML distribution. To this end, we first define a set R with the desired guarantees; such a set R was already constructed in [CSS19], and we formally state the results from [CSS19] that help us define it.
Lemma 6.17 (Lemma 4.1 in [CSS19]).
For any profile , there exists a distribution such that is an -approximate PML distribution and .
The above lemma allows us to define a region in which our approximate PML distribution takes all its probability values, and we use an idea similar to [CSS19] to define this region.
Let be a discretization of the probability space, where is the smallest integer such that for some . Fix an arbitrary order for the elements of set R; we use to denote the ’th element of this set. We next state a result from [CSS19] that captures the effect of this discretization.
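A minimal sketch of such a discretization, assuming (consistent with [CSS19]) a geometric grid with smallest value alpha and multiplicative spacing (1 + beta); the concrete choices of alpha and beta used by the final algorithm are fixed only in the proof below.

    def probability_grid(alpha, beta):
        """Geometric grid R = {alpha * (1 + beta) ** i : i = 0, 1, ...} truncated at 1.
        Its size is O(log(1 / alpha) / beta)."""
        R, x = [], alpha
        while x <= 1.0:
            R.append(x)
            x *= 1.0 + beta
        return R

    # Example: probability_grid(1e-4, 0.1) has about 100 grid points.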
Lemma 6.18 (Lemma 4.4 in [CSS19]).
For any profile and distribution , its discrete pseudo-distribution satisfies:
We are now ready to state our final algorithm. Following this algorithm, we prove that it returns an approximate PML distribution.
See 3.4
Proof.
Choose and let the probability discretization space and be the smallest integer such that and therefore . Let be the ’th element of set R and we have .
Given profile , let be the PML distribution. Define and by Lemma 6.18 (and choice of ) we have,
(90)
Let , then by Lemma 6.9 we have,
(91)
Note , therefore and further combined with equations 90 and 91 we have,
(92)
Let and be the solution returned by Algorithm 3, then by the second condition of 6.16 we have,
(93)
Combining equations 92 and 93 we have,
(94)
We now simplify the above expression by providing bounds and values for the parameters and . We choose and recall . Given samples, the number of distinct frequencies is upper bounded by and therefore . By Lemma 6.17, up to a constant multiplicative loss we can assume that the minimum non-zero probability value of our approximate PML distribution is at least , and therefore the support . Recall that by the third condition of Lemma 6.14, we have . The condition implies , and further using for all we have . Therefore .
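For completeness, here is the counting argument behind the bound on the number of distinct frequencies: if a profile obtained from samples contains k distinct non-zero frequencies, these frequencies are distinct positive integers whose sum is at most the sample size n, so

    n ≥ 1 + 2 + ⋯ + k = k(k+1)/2,   and hence   k ≤ √(2n).

(The exact form of the bound used above may be stated with a slightly different constant.)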
Substituting these values in Equation 94 we get,
(95)
By the first condition of 6.16 we have . Let be the discrete pseudo-distribution with respect to , then the condition further implies and combined with Theorem 6.7 we have,
(96)
Combining equations 95, 96, and we have,
(97)
Define , then is a distribution, (because is a pseudo-distribution and ) and combined with Equation 97 we get,
(98)
Therefore is an -approximate PML distribution.
In the remainder of the proof we analyze the running time of our final algorithm for approximate PML. Step 4 of the algorithm, that is, the convex program, can be solved in time (see Theorem 6.11). Algorithm 2 () and Algorithm 1 () can be implemented in and time respectively; therefore Algorithm 3 (the rounding algorithm) can be implemented in time. Combining everything together, our final algorithm (Algorithm 4) can be implemented in time. Further using , we conclude the proof. ∎
6.4 Missing Proofs from Section 6.2
Here we provide the proofs of all the lemmas and theorems in Section 6.2.
Proof of Lemma 6.12.
Without loss of generality assume . Let ; we invoke Algorithm 1 with inputs . Let and be the output of Algorithm 1. We now provide the proof of the three conditions in the lemma.
Condition 1: By construction of Algorithm 1, for any we have (Line 6) and for any other we have . Therefore the first part of condition 1 holds.
For any , one of the following two cases holds:
1. If , and in this case let be such that . By line 6 (third case) of the algorithm we have,
(99)
We now analyze the term ,
The first equality follows because for we have , which follows from the second and third cases in line 6 of the algorithm. In the second equality we substituted the values of and using the second case (Line 6) and Equation 99 respectively.
2. Else , and in this case let be such that . Then by the first case in line 6 of the algorithm we have,
Condition 2: Consider ,
(100)
The first equality follows because the rest of the entries are zero. In the second inequality we used , and therefore , by our assumption at the beginning of the proof. In the remainder, we simplify both terms. Consider the first term in the final expression above,
(101)
In the first equality we interchanged the summations. In the second equality we used and further invoked condition 1 of the lemma. Now consider the second term in the final expression of Equation 100,
(102)
The second equality follows by line 6 of the algorithm. Condition 2 follows by combining equations 100, 101 and 102.
Condition 3: First we show that implies . Consider ,
1. If , in this case let be such that . Then by the second and third case in line 6 of the algorithm we have, implies . Further, using and we have .
2. Else and in this case let be such that . Then by the first case in line 6 of the algorithm we have, implies . Further, using we have .
Using the above implication we have,
(103)
In the first equality we used for all (Condition 1). In the final inequality, we used the result that implies and further combined it with the assumption at the beginning of the proof, that is, for all and . ∎
Proof of Lemma 6.13.
Define . In the following we provide the proof for each case.
Condition 1: For each , for all and the first condition holds.
Condition 2: Note is initialized to a zero matrix (Line 4). Further for any let be such that , then the algorithm only updates the ’th entry in the ’th row and keeps rest of the entries unchanged. Therefore the second condition holds.
Condition 3: For each let be such that , then . The first equality holds because of the Condition 2. The third equality follows from the Line 8 of the algorithm. The last equality holds because and we have .
Condition 4: Here we provide the proof for . For any , we first show that .
The second equality follows because for all and (Line 6) and (Condition 2). The third equality follows from the Condition 3.
We next show that .
(104)
In the first equality, we divided the summation into two parts and for the second part we used Condition 3. In the second equality we used Line 7 and 8 of the algorithm. In the third and fourth equality we simplified the expression. In the final inequality we used .
Combining all the conditions together we have . In the remainder we show that .
Recall we already showed that for all . Recall and implies for all . Therefore we have,
Condition 5: We first provide the explicit expressions for and below:
Note in the expression for we used Condition 2. In the above two definitions for and , we refer to the expression involving ’s as the probability term and the rest as the counting term. We start the analysis of Condition 5 by first bounding the probability term:
(105)
The first three inequalities simplify the expression. The fourth equality follows because for all and . The fifth inequality follows from the AM-GM inequality. The final expression above is the probability term associated with , and the equation above shows that our rounding procedure only increases the probability term; it remains to bound the counting term.
(106)
Consider the numerator in the above expression; for each let , then
(107)
In the third inequality we used for all . The final inequality follows because . Now consider the denominator in the above expression; let for all and , then
(108)
In the third inequality we used and therefore . In the fourth inequality we used . In the fifth inequality we used for all and further . Now consider the term and note that . The fifth inequality in Equation 108 follows by combining the previous two derivations together. The final inequality follows because .
Condition 6: This condition follows immediately from Line 7 of the algorithm. ∎
Proof of Lemma 6.14.
In the following we provide the proof for the claims in the lemma.
Condition 1: Note , where are the indices corresponding to the new rows created by the procedure (Algorithm 2). Consider any ; then one of the following two cases holds:
1. If , then by the first condition of Lemma 6.13 we have .
2. Else , and in this case we have . The second equality in the previous derivation follows from Lines 7 and 8 of the algorithm. Combining the previous derivation with the third condition of Lemma 6.13 we get .
In both cases , and condition 1 follows.
Condition 2: This condition follows immediately from the fourth condition of Lemma 6.13.
Condition 3: Let for all . First we upper bound the term . Consider . The last inequality follows because of the constraint () and for all .
We now upper bound the term . Consider . Further for all (Line 8 of the algorithm) and we get . ∎
Proof of Lemma 6.15.
In the following we provide proof for all the conditions in the lemma.
Condition 1: For all , one of the following two cases holds:
1. If , then by the first condition of Lemma 6.13 we have . The last expression follows from the first condition of Lemma 6.14.
2. Else , then again by the first condition of Lemma 6.13 we have . The last equality follows from Line 15 of the algorithm.
For all , we have , and therefore condition 1 holds.
Condition 2: This condition follows immediately from the second condition of Lemma 6.13.
Condition 3: This condition follows immediately from the fourth condition of Lemma 6.13.
Condition 4: Consider the term ,
(109)
In the first equality we add and subtract the term . The first term in the second equality follows because , and the last equality follows because (Condition 3). The second term in the second equality follows by the first condition of Lemma 6.13. In the third equality we divided the summation terms over and . In the fourth equality we used Line 14 of the algorithm, and further, for any , Line 15 implies . Finally, by the first condition of Lemma 6.14 we have for all and for all . Therefore , and condition 4 holds.
Condition 5: For any we have,
(110)
The first equality follows from the sixth condition of Lemma 6.13. The second equality follows because for all and (Line 14). The third inequality follows because for all and (Line 15) and further for all (Line 12).
Condition 6: For any , let . Note for all (Line 14) and for all (Line 15). Therefore and further combined with the fifth condition of Lemma 6.13 we have . Note by the third condition of Lemma 6.14 we have . Combining the previous two inequalities we get and condition 6 holds. ∎
Proof of Theorem 6.16.
In the following we provide proof for the two conditions of the theorem.
Condition 1: Here we provide the proof for the condition .
1. For all , consider . If , then . The first equality follows by line 22 of the algorithm and the last expression follows by the first condition of Lemma 6.15. Else , let be such that ; then . The second equality follows by line 23 of the algorithm. The third equality follows from the second condition of Lemma 6.15. Finally, by the first condition of Lemma 6.12 we have for all and therefore for any . Combining the analysis of the cases, the condition 1 holds.
2. For all ,
(111)
The second equality follows because for all (Line 22) and for all (Line 23). We next simplify the second term in the above expression.
(112)
In the first and final equality we used the second condition of Lemma 6.15 (Diagonal Structure). In the second equality we used the first condition of Lemma 6.12. In the third equality we used the definition of (Line 19). Combining equations 111 and 112 we get,
In the last inequality we used .
3. Let for all be the ’th element of . Consider ; we have,
(113)
The first equality follows from Line 24 of the algorithm. In the second equality we divided the summation into two parts and used for all and (Line 22) for the first part. We now simplify the second part of the above expression.
(114)
In the first equality we expanded the term. Further we used for all (Line 23). In the second equality we used the second condition of Lemma 6.15 (Diagonal Structure) and further combined it with the definitions of and from Line 19 of the algorithm. The third inequality follows from the second condition of Lemma 6.12. In the final inequality we used , which follows from the definition of and the fifth condition of Lemma 6.15, and we further combined it with , which follows from the second condition of Lemma 6.15.
Condition 1 holds by combining the analysis of all three cases above.
Condition 2: Recall the definition of ,
In the above expression consider the probability term,
(115)
In the first equality we used line 24 of the algorithm. In the second inequality we used that further implies . In the third equality we used for all and (Line 22). We now analyze the second product term in the final expression above,
(116)
The second equality follows from line 23 of the algorithm. The third equality follows from the second condition of Lemma 6.15 (Diagonal Structure).
Now consider the second product term in the above expression.
(117)
In the first equality we used the definition of (Line 19). The second inequality follows from the third condition of Lemma 6.12.
Combining equations 116, 117 and further using for all (Line 19) we have,
(118)
In the final inequality we used the second condition of Lemma 6.15 (Diagonal Structure).
Using the above expression we have,
(119)
In the second equality we used for all and (Line 22). The third inequality follows by the second condition of Lemma 6.15 (Diagonal Structure). In the remainder of the proof we lower bound the term in the final expression.
For each let be such that ; then . The first equality follows from line 23 of the algorithm. The second equality follows by the second condition of Lemma 6.15 (Diagonal Structure). Using the first condition of Lemma 6.12, one of the following two cases holds:
1. If , then for all . Using the second condition of Lemma 6.15 (Diagonal Structure), we have for all and . Further note that . Combining the previous two equalities we have . Therefore,
(120)
2. If , then for all . Using the second condition of Lemma 6.15 (Diagonal Structure), we have for all and . Therefore, . The final inequality follows because and for all .
Further note that . Combining the previous two inequalities we have . The last inequality follows because of the following: if , then the inequality follows because and ; else , and in this case we use the fact that is monotonically increasing for .
Therefore,
(121)
Combining equations 120 and 121, for all we have,
Substituting previous inequality in Equation 119 we get,
Further the condition 2 of the theorem follows by combining the above inequality with the sixth condition of Lemma 6.15. ∎
7 Acknowledgments
We thank Jayadev Acharya and Yanjun Han for helpful conversations. We thank the anonymous reviewer for pointing out the alternative proof of the quality of scaled Sinkhorn and Bethe approximations on approximating the permanent of matrices with a bounded number of distinct columns (see Section 3.1 and Appendix A).
References
- [AA11] Scott Aaronson and Alex Arkhipov. The computational complexity of linear optics. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing, STOC ’11, pages 333–342, New York, NY, USA, 2011. ACM.
- [ADM+10] J. Acharya, H. Das, H. Mohimani, A. Orlitsky, and S. Pan. Exact calculation of pattern probabilities. In 2010 IEEE International Symposium on Information Theory, pages 1498–1502, June 2010.
- [ADOS16] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for optimal distribution property estimation. CoRR, abs/1611.02960, 2016.
- [AOST14] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. The complexity of estimating rényi entropy. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, 2014.
- [AOST17] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. Estimating renyi entropy of discrete distributions. IEEE Trans. Inf. Theor., 63(1):38–56, January 2017.
- [AR18] Nima Anari and Alireza Rezaei. A tight analysis of bethe approximation for permanent. CoRR, abs/1811.02933, 2018.
- [AS04] Noga Alon and Joel H Spencer. The probabilistic method. John Wiley & Sons, 2004.
- [Bar96] Alexander I Barvinok. Two algorithmic results for the traveling salesman problem. Mathematics of Operations Research, 21(1):65–84, 1996.
- [Bar17] Alexander Barvinok. Combinatorics and Complexity of Partition Functions. Springer Publishing Company, Incorporated, 1st edition, 2017.
- [BF93] John Bunge and Michael Fitzpatrick. Estimating the number of species: a review. Journal of the American Statistical Association, 88(421):364–373, 1993.
- [Bre73] L. M. Bregman. Certain properties of nonnegative matrices and their permanents. 1973.
- [BZLV16] Y. Bu, S. Zou, Y. Liang, and V. V. Veeravalli. Estimation of kl divergence between large-alphabet distributions. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 1118–1122, July 2016.
- [CCG+12] Robert K Colwell, Anne Chao, Nicholas J Gotelli, Shang-Yi Lin, Chang Xuan Mao, Robin L Chazdon, and John T Longino. Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of plant ecology, 5(1):3–21, 2012.
- [Cha84] A. Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11:265–270, 1984.
- [CL92] Anne Chao and Shen-Ming Lee. Estimating the number of classes via sample coverage. Journal of the American statistical Association, 87(417):210–217, 1992.
- [CSS19] Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 780–791, New York, NY, USA, 2019. ACM.
- [DS13] Timothy Daley and Andrew D Smith. Predicting the molecular complexity of sequencing libraries. Nature methods, 10(4):325, 2013.
- [ET76] Bradley Efron and Ronald Thisted. Estimating the number of unseen species: How many words did shakespeare know? Biometrika, 63(3):435–447, 1976.
- [Für05] Johannes Fürnkranz. Web mining. In Data mining and knowledge discovery handbook, pages 899–920. Springer, 2005.
- [GS14] Leonid Gurvits and Alex Samorodnitsky. Bounds on the permanent and some applications. arXiv e-prints, page arXiv:1408.0976, Aug 2014.
- [GS18] Daniel Grier and Luke Schaeffer. New hardness results for the permanent using linear optics. In Proceedings of the 33rd Computational Complexity Conference, CCC ’18, pages 19:1–19:29, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
- [GTPB07] Zhan Gao, Chi-hong Tseng, Zhiheng Pei, and Martin J Blaser. Molecular analysis of human forearm superficial skin bacterial biota. Proceedings of the National Academy of Sciences, 104(8):2927–2932, 2007.
- [Gur05] Leonid Gurvits. On the complexity of mixed discriminants and related problems. In Proceedings of the 30th International Conference on Mathematical Foundations of Computer Science, MFCS’05, pages 447–458, Berlin, Heidelberg, 2005. Springer-Verlag.
- [Gur11] Leonid Gurvits. Unleashing the power of Schrijver’s permanental inequality with the help of the Bethe Approximation. arXiv e-prints, page arXiv:1106.2844, Jun 2011.
- [HHRB01] Jennifer B Hughes, Jessica J Hellmann, Taylor H Ricketts, and Brendan JM Bohannan. Counting the uncountable: statistical approaches to estimating microbial diversity. Appl. Environ. Microbiol., 67(10):4399–4406, 2001.
- [HJM17] Yanjun Han, Jiantao Jiao, and Rajarshi Mukherjee. On Estimation of $L_{r}$-Norms in Gaussian White Noise Models. arXiv e-prints, page arXiv:1710.03863, Oct 2017.
- [HJW16] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of KL divergence between discrete distributions. CoRR, abs/1605.09124, 2016.
- [HJW18] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under wasserstein distance. arXiv preprint arXiv:1802.08405, 2018.
- [HJWW17] Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu. Optimal rates of entropy estimation over Lipschitz balls. arXiv e-prints, page arXiv:1711.02141, Nov 2017.
- [HO19] Yi Hao and Alon Orlitsky. The Broad Optimality of Profile Maximum Likelihood. arXiv e-prints, page arXiv:1906.03794, Jun 2019.
- [JHW16] J. Jiao, Y. Han, and T. Weissman. Minimax estimation of the l1 distance. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 750–754, July 2016.
- [JSV04] Mark Jerrum, Alistair Sinclair, and Eric Vigoda. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. J. ACM, 51(4):671–697, July 2004.
- [JVHW15] J. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, May 2015.
- [KLR99] Ian Kroes, Paul W Lepp, and David A Relman. Bacterial diversity within the human subgingival crevice. Proceedings of the National Academy of Sciences, 96(25):14547–14552, 1999.
- [LSW98] Nathan Linial, Alex Samorodnitsky, and Avi Wigderson. A deterministic strongly polynomial algorithm for matrix scaling and approximate permanents. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pages 644–652, New York, NY, USA, 1998. ACM.
- [OSS+04] A. Orlitsky, S. Sajama, N. P. Santhanam, K. Viswanathan, and Junan Zhang. Algorithms for modeling distributions over large alphabets. In International Symposium on Information Theory, 2004. ISIT 2004. Proceedings., pages 304–304, 2004.
- [OSW16] Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016.
- [OSZ03] A. Orlitsky, N. P. Santhanam, and J. Zhang. Always good turing: asymptotically optimal probability estimation. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pages 179–188, Oct 2003.
- [PBG+01] Bruce J Paster, Susan K Boches, Jamie L Galvin, Rebecca E Ericson, Carol N Lau, Valerie A Levanos, Ashish Sahasrabudhe, and Floyd E Dewhirst. Bacterial diversity in human subgingival plaque. Journal of bacteriology, 183(12):3770–3783, 2001.
- [PGM+01] A. Porta, S. Guzzetti, N. Montano, R. Furlan, M. Pagani, A. Malliani, and S. Cerutti. Entropy, entropy rate, and pattern classification as tools to typify complexity in short heart period variability series. IEEE Transactions on Biomedical Engineering, 48(11):1282–1291, Nov 2001.
- [PJW17] D. S. Pavlichin, J. Jiao, and T. Weissman. Approximate Profile Maximum Likelihood. ArXiv e-prints, December 2017.
- [PW96] Nina T. Plotkin and Abraham J. Wyner. An Entropy Estimator Algorithm and Telecommunications Applications, pages 351–363. Springer Netherlands, Dordrecht, 1996.
- [Rad97] Jaikumar Radhakrishnan. An entropy proof of bregman’s theorem. Journal of Combinatorial Theory, Series A, 77(1):161 – 164, 1997.
- [RCS+09] Harlan S Robins, Paulo V Campregher, Santosh K Srivastava, Abigail Wacher, Cameron J Turtle, Orsalem Kahsai, Stanley R Riddell, Edus H Warren, and Christopher S Carlson. Comprehensive assessment of t-cell receptor -chain diversity in t cells. Blood, 114(19):4099–4107, 2009.
- [RRSS07] S. Raskhodnikova, D. Ron, A. Shpilka, and A. Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 559–569, Oct 2007.
- [RVZ17] Aditi Raghunathan, Gregory Valiant, and James Zou. Estimating the unseen from multiple populations. CoRR, abs/1707.03854, 2017.
- [RWdRvSB99] Fred Rieke, Davd Warland, Rob de Ruyter van Steveninck, and William Bialek. Spikes: Exploring the Neural Code. MIT Press, Cambridge, MA, USA, 1999.
- [Sch78] A Schrijver. A short proof of minc’s conjecture. Journal of Combinatorial Theory, Series A, 25(1):80 – 83, 1978.
- [Sch98] Alexander Schrijver. Counting 1-factors in regular bipartite graphs. Journal of Combinatorial Theory, Series B, 72(1):122 – 135, 1998.
- [Spe82] E. Spence. H. Minc, Permanents (Encyclopedia of Mathematics and its Applications, Vol. 6, Addison-Wesley Advanced Book Programme, 1978), xviii + 205 pp. Proceedings of the Edinburgh Mathematical Society, 25(1):110–110, 1982.
- [TE87] Ronald Thisted and Bradley Efron. Did shakespeare write a newly-discovered poem? Biometrika, 74(3):445–455, 1987.
- [Val79] L.G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2):189 – 201, 1979.
- [VBB+12] Martin Vinck, Francesco P. Battaglia, Vladimir B. Balakirsky, A. J. Han Vinck, and Cyriel M. A. Pennartz. Estimation of the entropy based on its polynomial representation. Phys. Rev. E, 85:051139, May 2012.
- [Von11] Pascal O. Vontobel. The bethe permanent of a non-negative matrix. CoRR, abs/1107.4196, 2011.
- [Von12] Pascal O. Vontobel. The Bethe approximation of the pattern maximum likelihood distribution. In 2012 IEEE International Symposium on Information Theory (ISIT), pages 2012–2016, July 2012.
- [Von13] P. O. Vontobel. The bethe permanent of a nonnegative matrix. IEEE Transactions on Information Theory, 59(3):1866–1901, March 2013.
- [Von14] P. O. Vontobel. The bethe and sinkhorn approximations of the pattern maximum likelihood estimate and their connections to the valiant-valiant estimate. In 2014 Information Theory and Applications Workshop (ITA), pages 1–10, Feb 2014.
- [VV11a] G. Valiant and P. Valiant. The power of linear estimators. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 403–412, Oct 2011.
- [VV11b] Gregory Valiant and Paul Valiant. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new clts. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing, STOC ’11, pages 685–694, New York, NY, USA, 2011. ACM.
- [WY15] Y. Wu and P. Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. ArXiv e-prints, April 2015.
- [WY16a] Y. Wu and P. Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, June 2016.
- [WY16b] Yihong Wu and Pengkun Yang. Sample complexity of the distinct elements problem. arXiv e-prints, page arXiv:1612.03375, Dec 2016.
- [YFW05] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282–2312, July 2005.
- [ZVV+16] James Zou, Gregory Valiant, Paul Valiant, Konrad Karczewski, Siu On Chan, Kaitlin Samocha, Monkol Lek, Shamil Sunyaev, Mark Daly, and Daniel G. MacArthur. Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects. Nature Communications, 7:13293 EP –, 10 2016.
Appendix A Alternative proof for the distinct column case.
Here we provide an alternative and simpler proof of Lemma 4.1, which was pointed out to us by an anonymous reviewer. This alternative proof is derived using Corollary 3.4.5 in Barvinok’s book [Bar17] (which in turn is derived using the Bregman-Minc inequality), and we formally state it below.
Lemma A.1 (Corollary 3.4.5 from [Bar17]).
Suppose that Q is a doubly stochastic matrix that satisfies,
for some positive integers . Then,
Alternative proof for Lemma 4.1.
The lower bound follows from 2.7, and in the remainder we prove the upper bound. Let Q be the maximizer of the scaled Sinkhorn objective; then it is a well-known fact that Q satisfies,
where matrices L and R are the left and right non-negative diagonal matrices. Further by the symmetry of the objective, there exists an optimum solution Q that has at most distinct columns and we work with such an optimum solution. As L and R are diagonal matrices, the following two inequalities are trivial,
(122)
(123)
Further note that for all doubly stochastic matrices Q we always have,
(124)
Therefore combining Equations 122, 123 and 124, to prove the upper bound it is enough to show that,
As matrix Q has at most distinct columns, let the multiplicities of these distinct columns be . Note that if a column has multiplicity , the maximal element in this column is at most . Now by Lemma A.1 (Corollary 3.4.5. in [Bar17]), we have
where the last inequality follows because the term is maximized when all ’s are equal and take value . Therefore we conclude the proof. ∎
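As a small numerical illustration of the Bregman-Minc inequality underlying Corollary 3.4.5 (for a 0-1 matrix with row sums r_i, the permanent is at most the product of (r_i!)^(1/r_i)), the following brute-force check can be run on random matrices; it is a sanity check only and plays no role in the proof.

    import itertools, math, random

    def permanent(A):
        """Brute-force permanent via the permutation expansion (fine for small n)."""
        n = len(A)
        return sum(math.prod(A[i][sigma[i]] for i in range(n))
                   for sigma in itertools.permutations(range(n)))

    def bregman_minc_bound(A):
        """Bregman-Minc upper bound for a 0-1 matrix: product over rows of (r_i!)^(1/r_i)."""
        bound = 1.0
        for row in A:
            r = sum(row)
            if r > 0:
                bound *= math.factorial(r) ** (1.0 / r)
        return bound

    random.seed(0)
    for _ in range(100):
        A = [[random.randint(0, 1) for _ in range(5)] for _ in range(5)]
        assert permanent(A) <= bregman_minc_bound(A) + 1e-9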