The Bethe and Sinkhorn Permanents of Low Rank Matrices
and Implications for Profile Maximum Likelihood
Abstract
In this paper we consider the problem of computing the likelihood of the profile of a discrete distribution, i.e., the probability of observing the multiset of element frequencies, and of computing a profile maximum likelihood (PML) distribution, i.e., a distribution with maximum profile likelihood. For each problem we provide polynomial time algorithms that, given i.i.d. samples from a discrete distribution, achieve an approximation factor of , improving upon the previous best-known bound achievable in polynomial time of (Charikar, Shiragur and Sidford, 2019). Through the work of Acharya, Das, Orlitsky and Suresh (2016), this implies a polynomial time universal estimator for symmetric properties of discrete distributions over a broader range of error parameters.
We achieve these results by providing new bounds on the quality of approximation of the Bethe and Sinkhorn permanents (Vontobel, 2012 and 2014). We show that each of these is an approximation to the permanent of matrices with non-negative rank at most , improving upon the previously known bounds of . To obtain our results on PML, we exploit the fact that the PML objective is proportional to the permanent of a certain Vandermonde matrix with distinct columns, i.e. with non-negative rank at most . As a by-product of our work we establish a surprising connection between the convex relaxation in prior work (CSS19) and the well-studied Bethe and Sinkhorn approximations.
1 Introduction
Symmetric property estimation of distributions (throughout this paper, we use the word distribution to refer to discrete distributions) is an important and well studied problem in statistics and theoretical computer science. Given access to i.i.d samples from a hidden discrete distribution p, the goal is to estimate , for a symmetric property . Formally, a property is symmetric if it is invariant to permuting the labels, i.e. it is a function of the multiset of probabilities and does not depend on the symbol labels. There are many well-studied such properties, including support size and coverage, entropy, distance to uniformity, Renyi entropy, and sorted distance. Understanding the computational and sample complexity of estimating these symmetric properties has led to an extensive line of interesting research over the past decade.
Symmetric property estimation spans applications in many different fields. For instance, entropy estimation has found applications in neuroscience [RWdRvSB99], physics [VBB+12] and others [PW96, PGM+01]. Support size and coverage estimation were initially used in estimating ecological diversity [Cha84, CL92, BF93, CCG+12] and subsequently applied to many different applications [ET76, TE87, Für05, KLR99, PBG+01, DS13, RCS+09, GTPB07, HHRB01]. For applications of other symmetric properties we refer the reader to [HJWW17, HJM17, AOST14, RVZ17, ZVV+16, WY16b, RRSS07, WY15, OSW16, VV11b, WY16a, JVHW15, JHW16, VV11a].
Early work on symmetric property estimation developed estimators tailored to the particular property of interest. Consequently, a fundamental and important open question was to come up with an estimator that is universal, i.e. the same estimator could be used for all symmetric properties. A natural approach for constructing universal estimators is the plug-in approach: given samples, we first compute a distribution independently of the property, and then output the value of the property on this computed distribution as our estimate.
Our approach is based on the observation (see [ADOS16]) that a sufficient statistic for estimating a symmetric property from a sequence of samples is the profile, i.e. the multiset of frequencies of symbols in the sequence; e.g. the sequence abca, in which a appears twice and b and c appear once each, has profile {2, 1, 1}. We provide an efficient universal estimator that is based on the plug-in approach applied to the profile maximum likelihood (PML) distribution introduced by Orlitsky et al. [OSS+04]: given a sequence of samples, PML is the distribution that maximizes the likelihood of the observed profile. The problem of computing the PML distribution has since been studied in several papers, applying heuristic approaches such as the Bethe/Sinkhorn approximation [Von12, Von14], the EM algorithm [OSS+04], dynamic programming [PJW17] and algebraic methods [ADM+10].
A recent paper of Acharya et al. [ADOS16] showed that a plug-in estimator using the optimal PML distribution is universal in estimating various symmetric properties of distributions. In fact, it suffices to compute a -approximate PML distribution (i.e. a distribution that approximates the PML objective to within a factor of ) for for constant . Previous work of the authors [CSS19] gave the first efficient algorithm to compute a -approximate PML for some non-trivial . In particular, [CSS19] gave a nearly linear running time algorithm to compute an -approximate PML distribution. In this work, we give an efficient algorithm to compute an -approximate PML distribution.
The parameter in -approximate PML affects the error parameter regime under which the estimator is sample complexity optimal. Smaller values of yield a universal estimator that is sample optimal over a broader parameter regime. For instance, [CSS19] show that -approximate PML (throughout this paper, the tilde notation hides poly terms) is sample complexity optimal for estimating certain symmetric properties within accuracy for . On the other hand, [ADOS16] showed that computing an -approximate PML is sample complexity optimal for . However, note that using current analysis techniques [ADOS16], we do not know how to exploit the computation of an exact PML any better than that of an -approximate PML: both are sample complexity optimal over the same error parameter regime.
In our work, we use the Bethe approximation of the permanent, or the Bethe permanent for short, a previously proposed heuristic to compute an approximate PML distribution. This is based on the Bethe free energy approximation originating in statistical physics and is closely connected to the belief propagation algorithm [YFW05, Von13]. The idea of using the Bethe permanent for computing an approximate PML distribution comes from the fact that the likelihood of a profile with respect to a distribution can be written as the permanent of a non-negative Vandermonde matrix (which we call the profile probability matrix). For a non-negative matrix, [GS14] showed that the ratio between the permanent and the Bethe permanent is upper bounded by , which was later improved to [AR18] (note that previous results on the Bethe permanent do not immediately imply non-trivial results for PML; for consistency with the literature, we use approximation factors for PML). A natural question is whether the approximation ratio of the Bethe permanent can be bounded in terms of a structural parameter finer than the input dimension of the matrix. In this work, we show that the approximation ratio between the permanent and the Bethe permanent is upper bounded by an exponential in the non-negative rank of the matrix (up to a logarithmic factor). We also give an explicit construction of a matrix to show that our result for this structural parameter is asymptotically tight. As the non-negative rank of any non-negative matrix is at most , our analysis implies an upper bound of for some constant on the approximation ratio. Therefore our work (asymptotically) generalizes previous results for general non-negative matrices.
To obtain our efficient algorithm, we prove a slightly stronger statement than the bound on the Bethe permanent of a matrix with non-negative rank at most . We show that a scaling of a simpler approximation of the permanent, known as the Sinkhorn permanent (Sinkhorn is also called capacity in the literature), also approximates the permanent up to an exponential in the non-negative rank of the matrix (up to log factors). This implies our bound for the Bethe permanent and shows that scaled Sinkhorn is a compelling alternative to Sinkhorn, with a tighter worst-case multiplicative approximation to the permanent.
An immediate application of our work on the Bethe and the scaled Sinkhorn permanent is to approximate PML. Given samples, the number of distinct columns in the profile probability matrix is always upper bounded by , i.e. its non-negative rank is at most . Therefore our analysis of the scaled Sinkhorn permanent immediately implies an approximation to the PML objective with respect to a fixed distribution. This result, combined with probability discretization, results in a convex program whose optimal solution is a fractional representation of an approximate PML distribution. We round this fractional solution to output a valid distribution that is an -approximate PML distribution. Surprisingly the resulting convex program is exactly the same as the one in [CSS19], where a completely different (combinatorial) technique was used to arrive at the convex program. Our work here provides a better analysis of the convex program in [CSS19] using a more delicate and sophisticated rounding algorithm.
Organization of the paper:
In Section 2 we present preliminaries. In Section 3, we provide the main results of the paper. In Section 4, we analyze the scaled Sinkhorn permanent of structured matrices. In Section 4.1, we prove an upper bound for the approximation ratio of the scaled Sinkhorn permanent to the permanent as a function of the number of distinct columns. In Section 4.2, we prove the generalized result of the scaled Sinkhorn permanent for the low non-negative rank matrices. In Section 5, we prove the lower bound for the Bethe and scaled Sinkhorn approximations of the permanent. In Section 6, we combine the result for the scaled Sinkhorn permanent with the idea of probability discretization to provide the convex program that returns a fractional representation of an approximate PML distribution. In the same section, we provide the rounding algorithm to return a valid approximate PML distribution.
1.1 Overview of Techniques
In [CSS19], the authors presented a convex relaxation for the PML objective. This was obtained from a combinatorial view of the PML problem. In a sequence of steps, they discretized the set of probabilities and the frequencies, grouped the terms in the objective into groups and developed a relaxation for the sum of terms in the largest group, giving an approximation. In this paper, we exploit the fact that the likelihood of a profile with respect to a distribution is the permanent of a certain non-negative Vandermonde matrix (referred to here as the profile probability matrix with respect to a distribution) and that the PML objective is an optimization problem over such permanents. We work with the same convex relaxation we derived earlier, but relate it to the well known Bethe and scaled Sinkhorn approximations of the permanent. In fact, Vontobel [Von12, Von14] proposed the Bethe and Sinkhorn permanents as heuristic approximations of the PML objective, but bounding the quality of the solution was an open problem [Von11]. We show that both the Bethe and scaled Sinkhorn permanents are within a factor of the PML objective. En route, we show that the approximation ratios of the Bethe and scaled Sinkhorn permanents for any non-negative matrix are upper bounded by an exponential in the non-negative rank of the matrix . This is a strengthening of the well known upper bound on the approximation ratio of both the Bethe and scaled Sinkhorn permanents of an matrix.
In [CSS19], the fact that the convex problem we obtained was a relaxation of the PML objective followed directly from the combinatorial derivation of our relaxation. By contrast, our analysis here exploits the non-trivial fact that the Bethe and scaled Sinkhorn approximations are lower bounds for the permanent of a non-negative matrix. The Bethe and scaled Sinkhorn permanents of the profile probability matrix with respect to a fixed distribution are optimum solutions to maximization problems over doubly stochastic matrices where the objective functions have entropy-like terms involving the entries of and . In order to obtain an upper bound on the Bethe and scaled Sinkhorn approximation as a function of the non-negative rank, we show the existence of a doubly stochastic matrix as a witness such that the objective of the Bethe and scaled Sinkhorn w.r.t. upper bounds the permanent of within the desired factor.
We first work with a simpler setting of matrices with at most distinct columns. (In the final preparation of this paper for posting, an anonymous reviewer showed that a simpler proof for the distinct column case can be derived using Corollary 3.4.5 of Barvinok’s book [Bar17]; the proof of that corollary in turn uses the famous Bregman–Minc inequality. We thank the anonymous reviewer for this and include the derivation in Appendix A. In contrast, our proof is self-contained and we believe it provides further insight into the structure of the Sinkhorn/Bethe approximations. See Section 3.1 for further details.) Here we consider a modified matrix that contains the distinct columns of . We define a distribution on permutations of the domain where the probability of a permutation is proportional to its contribution to the permanent of . There is a many-to-one mapping from such permutations to 0-1 matrices with row sums 1 and column sums , the number of times the th column of appears in . We next define a real-valued, non-negative matrix with row sums 1 and column sums , in terms of the marginals of the distribution . We also define a different distribution on 0-1 row-stochastic matrices by independent sampling from . Finally, we use the fact that the KL-divergence between and is non-negative to get the required upper bound on the scaled Sinkhorn approximation with a doubly stochastic witness (obtained from ). This proof technique is inspired by the recent work of Anari and Rezaei [AR18] that gives a tight bound on the approximation ratio of the Bethe approximation for the permanent of an non-negative matrix.
Though this bound on the quality of the Bethe and scaled Sinkhorn approximations for non-negative matrices with distinct columns suffices for our PML applications, interestingly we show that it can be extended to non-negative matrices with bounded rank. In order to obtain an upper bound on the Bethe and scaled Sinkhorn approximation as a function of the non-negative rank of , recall that we need to show the existence of a suitable doubly stochastic witness which certifies the required guarantee. We express the permanent of as the sum of terms of the form where matrices and have at most distinct columns. We focus on the largest of these terms, and construct a doubly stochastic witness for matrix from the witnesses for matrices and in this largest term. This doubly stochastic witness certifies the required guarantee and we get an upper bound on the scaled Sinkhorn approximation as a function of the non-negative rank. This result for the scaled Sinkhorn approximation further implies an upper bound for the Bethe approximation.
Even with this improved bound on the quality of the Bethe and scaled Sinkhorn approximations as applied to the PML objective, challenges remain in obtaining an improved approximate PML distribution. In particular, we do not know of an efficient algorithm to maximize the Bethe or the scaled Sinkhorn permanent of the profile probability matrix over a family of distributions, as would be needed to compute the Bethe or the scaled Sinkhorn approximation to the optimum of the PML objective. Prior work by Vontobel suggests an alternating maximization approach, but this is only guaranteed to produce a local optimum. To address this, we revisit the efficiently computable convex relaxation from [CSS19] and show that it is suitably close to the scaled Sinkhorn approximation. This is quite surprising as the prior derivation of this relaxation in [CSS19] was purely combinatorial and had nothing to do with the scaled Sinkhorn approximation.
The final challenge towards obtaining our PML results is to round the fractional solution produced so that the approximation guarantee is preserved. The rounding procedure from [CSS19] does not immediately suffice, but we present a more sophisticated and delicate rounding procedure that does give us the required approximation guarantee. The rounding algorithm proceeds in three steps: in the first step we apply a procedure analogous to [CSS19] to handle large probability values, and in the later steps we provide a new procedure for the smaller probability values; in each step, we ensure that the objective function does not drop significantly. The input to the rounding procedure is a matrix whose rows correspond to discretized probability values and whose columns correspond to distinct frequencies. We create rows corresponding to new probability values in the course of the rounding algorithm, maintain column sums, and eventually ensure that all row sums are integral and that the objective function has not dropped significantly.
2 Preliminaries
Let and denote the interval of integers and reals and respectively. Let be the domain of elements and be its size. Let be a non-negative matrix, where its ’th entry is denoted by . We further use and to denote the row and column corresponding to and respectively. The non-negative rank of a non-negative matrix is equal to the smallest number such that there exist non-negative vectors for such that . Let be the set of all permutations of domain and we denote a permutation in the following way . The permanent of a matrix A, denoted by , is defined as follows,
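As a concrete illustration of this definition (our own sketch, not code from the paper), the permanent can be evaluated directly as the sum over all permutations; this takes exponential time and is meant only to make the notation concrete.

```python
import itertools
import numpy as np

def permanent(A: np.ndarray) -> float:
    # per(A) = sum over permutations sigma of prod_i A[i, sigma(i)]
    n = A.shape[0]
    return sum(
        np.prod([A[i, sigma[i]] for i in range(n)])
        for sigma in itertools.permutations(range(n))
    )

# For the all-ones 3 x 3 matrix every permutation contributes 1, so per = 3! = 6.
print(permanent(np.ones((3, 3))))  # 6.0
```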
Let be the set of all non-negative matrices that are doubly stochastic. For any matrix and , we define the following set of functions:
(1) |
Further,
Using these definitions, we define the Bethe permanent of a matrix.
Definition 2.1.
For a matrix , the Bethe permanent of A is defined as follows,
A well known and important result about the Bethe permanent is that it lower bounds the value of the permanent of a non-negative matrix; we state this result next.
We next define the Sinkhorn permanent of a matrix and later we state the relationship between the Bethe and the Sinkhorn permanent.
Definition 2.3.
For a matrix , the Sinkhorn permanent of A is defined as follows,
To establish the relationship between the Bethe and the Sinkhorn permanent we need the following lemma from [GS14].
Lemma 2.4 (Proposition 3.1 in [GS14]).
For any distribution , the following holds,
For any matrix , each row of Q is a distribution; therefore the following holds,
As a corollary of the above inequality we have,
Corollary 2.5.
For any non-negative matrix , the following inequality holds,
Later we will see that it is convenient to work with rather than itself; we call this expression the scaled Sinkhorn permanent and formally define it next.
Definition 2.6.
For a matrix , the scaled Sinkhorn permanent of A is defined as follows,
The above expression can be equivalently stated as,
Corollary 2.7.
For any matrix , the following inequality holds,
which further implies,
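The displayed formulas in Definitions 2.1, 2.3 and 2.6 and in the corollaries above were lost in extraction. The sketch below (our own illustration, with assumed notation) evaluates the standard forms of these objectives from [Von13, Von14]: for a doubly stochastic Q, the Sinkhorn objective sums Q_ij log(A_ij / Q_ij), the Bethe objective additionally includes the (1 - Q_ij) log(1 - Q_ij) terms, and, consistent with the per-row bound of Lemma 2.4, we take the scaled Sinkhorn permanent to be e^{-n} times the Sinkhorn permanent; this scaling constant is an assumption, not a quote from the paper. For strictly positive A, Sinkhorn–Knopp scaling yields the Q maximizing the Sinkhorn objective.

```python
import numpy as np

def sinkhorn_scaling(A, iters=1000):
    # For strictly positive A, alternating row/column normalization converges
    # to the doubly stochastic Q that maximizes sum_ij Q_ij log(A_ij / Q_ij).
    Q = np.array(A, dtype=float)
    for _ in range(iters):
        Q /= Q.sum(axis=1, keepdims=True)
        Q /= Q.sum(axis=0, keepdims=True)
    return Q

def sinkhorn_objective(A, Q):
    return float(np.sum(Q * np.log(A / Q)))

def bethe_objective(A, Q):
    # Bethe objective at a given doubly stochastic Q; its maximum over Q is log bethe(A).
    mask = Q < 1
    extra = float(np.sum((1 - Q[mask]) * np.log(1 - Q[mask])))
    return sinkhorn_objective(A, Q) + extra

A = np.random.rand(6, 6) + 0.1
n = A.shape[0]
Q = sinkhorn_scaling(A)
print("log sinkhorn        ~", sinkhorn_objective(A, Q))
print("log scaledSinkhorn  ~", sinkhorn_objective(A, Q) - n)   # assumed e^{-n} scaling
print("Bethe objective at this Q (a lower bound on log bethe(A)):", bethe_objective(A, Q))
```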
Beyond approximations to the permanent of a matrix, we next state two important results that will be helpful throughout the paper. The first is Stirling's approximation for the factorial function and the second is the non-negativity of the KL divergence between two distributions.
Lemma 2.8 (Stirling’s approximation).
For all , the following holds:
Let and be distributions defined on some domain . The KL divergence denoted between distributions and is defined as follows,
Lemma 2.9 (Non-negativity of KL divergence).
For any distributions and defined on domain , the KL divergence between distributions and satisfies,
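A small numeric sanity check of these two facts (our own snippet; the exact constants of Lemma 2.8 were stripped above, and the bounds below are one standard form of Stirling's approximation, stated as an assumption):

```python
import numpy as np
from math import e, factorial, pi, sqrt

n = 20
stirling = sqrt(2 * pi * n) * (n / e) ** n
# One standard form of Stirling's bounds: sqrt(2*pi*n)(n/e)^n <= n! <= e * sqrt(2*pi*n)(n/e)^n.
assert stirling <= factorial(n) <= e * stirling

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
assert kl_divergence(p, q) >= 0 and kl_divergence(p, p) == 0  # Lemma 2.9
```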
In the remainder of this section we provide formal definitions related to PML.
2.1 Profile maximum likelihood
Let be the set of all discrete distributions supported on domain . From here on we use the word distribution to refer to discrete distributions. Throughout this paper we assume that we receive a sequence of independent samples from an underlying distribution . Let be the set of all length sequences and be one such sequence with denoting its th element. The probability of observing sequence is:
where is the frequency/multiplicity of symbol in sequence and is the probability of domain element .
For any given sequence one can define its profile (a histogram of a histogram, or fingerprint), which is a sufficient statistic for symmetric property estimation.
Definition 2.10 (Profile).
For any sequence , let be the set of all its non-zero distinct frequencies and be elements of the set M. The profile of a sequence denoted is
is the number of domain elements with frequency in (the profile does not contain information about the number of unseen domain elements). We call the length of profile and, as a function of profile , . Let denote the set of all profiles of length . We use to denote the number of distinct frequencies in the profile and (note that the number of distinct frequencies in a length sequence is always upper bounded by ). For convenience we use to denote the vector of observed frequencies, therefore for all .
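To make Definition 2.10 concrete, here is a minimal illustration (our own, not from the paper) of computing the profile of a sequence as the map from each distinct non-zero frequency to the number of symbols with that frequency:

```python
from collections import Counter

def profile(seq):
    freq = Counter(seq)            # frequency of each observed symbol
    return Counter(freq.values())  # distinct frequency -> number of symbols with that frequency

seq = list("abracadabra")          # a appears 5 times, b and r twice, c and d once
print(profile(seq))                # two symbols seen once, two seen twice, one seen five times
print(sum(f * m for f, m in profile(seq).items()))  # recovers the length n = 11
```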
For any distribution , the probability of a profile is defined as:
(2) |
Let be a sequence such that . We define a profile probability matrix with respect to sequence (therefore profile ) and distribution p as follows,
(3) |
where is the frequency of domain element in sequence and recall . We are interested in the permanent of the matrix , and note that it is invariant under the choice of sequence satisfying . Therefore we index the matrix with the profile rather than the sequence itself. The number of distinct columns in is equal to the number of distinct observed frequencies plus one (for the unseen), i.e. .
The probability of a profile with respect to distribution p (from Equation 20 in [OSZ03], Equation 15 in [PJW17]) in terms of permanent of matrix is given below:
(4) |
where and is the number of unseen domain elements (given a distribution p, we know its domain and therefore the value of ).
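The following sketch (our own illustration) builds the profile probability matrix of Equation 3 for a distribution p on a known domain and evaluates the profile probability of Equation 4 as a constant times its permanent. The combinatorial constant used here follows the standard formulas in [OSZ03, PJW17]; the paper's exact normalization was stripped above, so treat it as an assumption.

```python
import itertools
from collections import Counter
from math import factorial

import numpy as np

def permanent(A):
    n = A.shape[0]
    return sum(np.prod([A[i, s[i]] for i in range(n)])
               for s in itertools.permutations(range(n)))

def profile_probability(p, phi, n):
    # p: probabilities of the N domain elements; phi: {frequency: count of symbols with it}.
    N = len(p)
    unseen = N - sum(phi.values())                       # number of unseen domain elements
    freqs = [0] * unseen + [f for f, c in phi.items() for _ in range(c)]
    A = np.array([[px ** f for f in freqs] for px in p])  # N x N profile probability matrix
    counts = Counter(freqs)                               # includes the column group of unseen elements
    C = factorial(n)                                      # assumed normalization, following [OSZ03, PJW17]
    for f, c in counts.items():
        C /= factorial(f) ** c * factorial(c)
    return C * permanent(A)

p = [0.5, 0.3, 0.2]
print(profile_probability(p, {2: 1}, n=2))  # probability that one symbol is seen twice
```

For example, with p = (0.5, 0.3, 0.2) and the profile in which a single symbol is seen twice (n = 2), the output equals the sum of squared probabilities, 0.38, matching a direct calculation.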
The distribution which maximizes the probability of a profile is the profile maximum likelihood distribution which we formally define next.
Definition 2.11 (Profile maximum likelihood).
For any profile , a profile maximum likelihood (PML) distribution is:
and is the maximum PML objective value.
The central goal of this paper is to define efficient algorithms for computing approximate PML distributions defined as follows.
Definition 2.12 (Approximate PML).
For any profile , a distribution is a -approximate PML distribution if
3 Results
Here we state the main results of this paper. In our first class of main results, we improve the analysis of the scaled Sinkhorn permanent for structured non-negative matrices. We first show that the scaled Sinkhorn permanent approximates the permanent of a non-negative matrix A, where the approximation factor (up to log factors) depends exponentially on the non-negative rank of the matrix A. We formally state this result next.
Theorem 3.1 (Scaled Sinkhorn permanent approximation for low non-negative rank matrices).
For any matrix with non-negative rank at most , the following inequality holds,
(5) |
Further using (See 2.7) and (See Lemma 2.2) we immediately get the same result for the Bethe permanent.
Corollary 3.2 (Bethe permanent approximation for low non-negative rank matrices).
For any matrix with non-negative rank at most , the following inequality holds,
(6) |
Interestingly, in the worst case, Sinkhorn is an approximation to the permanent of , even when A has at most 1 distinct column (e.g. consider the all 1’s matrix). Consequently, for matrices with non-negative rank at most , whenever , scaled Sinkhorn is a compelling alternative to Sinkhorn, with a tighter worst-case multiplicative approximation to the permanent.
Our results improve the analysis of the Bethe permanent for such structured matrices. Previously, the best known analysis of the Bethe permanent showed an -approximation factor to the permanent [AR18]. The analysis in [AR18] is tight for general non-negative matrices, and the authors showed that this bound cannot be improved without leveraging further structure. Our next result is of a similar flavor: we provide an asymptotically tight example for Theorem 3.1 and Corollary 3.2.
Theorem 3.3 (Lower bound for the Bethe and the scaled Sinkhorn permanents approximation).
There exists a matrix with non-negative rank , that satisfies
(7) |
which further implies,
(8) |
An immediate application of the above stated results is to PML. Recall that for any fixed distribution p and profile , is proportional to the permanent of the non-negative matrix (see Section 2 for the definition of ). Note the number of distinct columns in the profile probability matrix is upper bounded by the number of distinct frequencies plus one, which in turn is always less than (where is the length of the profile). Therefore the non-negative rank of the profile probability matrix is always upper bounded by . Since can be computed in polynomial time [CSS19] ( corresponds to a convex optimization problem, and a minor modification of the approach in [CSS19] to solve a related, but slightly different, optimization problem yields a polynomial time algorithm to compute up to high accuracy), Theorem 3.1 implies an efficient algorithm to approximate the value for a fixed distribution p up to a multiplicative factor, which is also the best known approximation factor achieved by a deterministic algorithm.
Analyzing the relationship between the Bethe permanent and the permanent of the profile probability matrix was posed as an interesting research direction in [Von11] (see Section VII). Moreover, one of the primary interests in the area of algorithmic statistics/machine learning is to efficiently compute the PML distribution. Exploiting the structure of the doubly stochastic matrix Q that maximizes , combined with the probability discretization idea from [CSS19], we provide an efficient algorithm to compute an approximate PML distribution. We use Lemma 4.1 to argue the approximation guarantee of our approximate PML distribution, and we summarize this result below.
Theorem 3.4 (-approximate PML).
For any given profile , Algorithm 4 computes an -approximate PML distribution in time.
Prior to our work, the best known result [CSS19] gave an efficient algorithm to compute an -approximate PML distribution.
One important application of approximate PML is to symmetric property estimation. In [ADOS16], the authors showed that a plug-in estimator based on an approximate PML distribution is sample complexity optimal for estimating entropy, support, support coverage and distance to uniformity. Combining their result with our Theorem 3.4, we get an efficient version of Theorem 2 in [ADOS16], which we summarize next.
Theorem 3.5 (Efficient universal estimator using approximate PML).
Let be the optimal sample complexity of estimating entropy, support, support coverage and distance to uniformity. If for some constant , then there exists a PML based universal plug-in estimator that runs in time and is sample complexity optimal for estimating entropy, support, support coverage and distance to uniformity to accuracy .
Note that the dependency on in the above theorem and the approximation factor in Theorem 3.4 are strictly better than those of [CSS19], another efficient PML based approach for universal symmetric property estimation; [CSS19] works when the error parameter .
Recent work [HO19] further establishes the broad optimality of approximate PML. [HO19] shows optimality of approximate PML distribution based estimators for other symmetric properties, such as sorted distribution estimation (under distance), -Renyi entropy for non-integer , and a broad class of additive properties that are Lipschitz. [HO19] also provides a PML-based tester to test whether an unknown distribution is far from a given distribution in distance, achieving the optimal sample complexity up to logarithmic factors. Our result further implies an efficient version of all these results.
3.1 Related work
We divide the related work into two broad categories: permanent approximations and profile maximum likelihood.
Permanent approximations:
The first set of related work concerns computing the permanent of matrices. [Val79] showed that computing the permanent of a matrix is #P-hard even when its entries are in {0, 1}. This led to the study of computing approximations to the permanent. An additive approximation to the permanent for arbitrary A was given by [Gur05]. On the other hand, a multiplicative approximation to the permanent (or even determining its sign) is hard for general A [AA11, GS18]. These hardness results led to the study of multiplicative approximations to the permanent for special classes of matrices, one such class being the set of non-negative matrices. In this direction, [JSV04] gave the first efficient randomized algorithm to approximate the permanent within multiplicative accuracy. There is also a rich literature on deterministic approximations to the permanent of non-negative matrices. [LSW98] gave the first deterministic algorithm to approximate the permanent of non-negative matrices, with approximation ratio . [Gur11], using an inequality from [Sch98], showed that the Bethe permanent lower bounds the value of the permanent of non-negative matrices. We refer the reader to [Von13, GS14] for the polynomial computability of the Bethe permanent and to [AR18] for a more thorough literature survey on the Bethe permanent and other related work.
As discussed in the footnote in the introduction, an anonymous reviewer showed us an alternative and simpler proof of the upper bound on the scaled Sinkhorn approximation to the permanent of matrices with at most distinct columns (Lemma 4.1). This alternative proof is deferred to Appendix A and is derived using Corollary 3.4.5 in Barvinok’s book [Bar17]. That result, in turn, is proved using the Bregman–Minc inequality, conjectured by Minc (cf. [Spe82]) and later proved by Bregman [Bre73]. The Bregman–Minc inequality is well known and many different proofs [Sch78, Rad97, AS04] are known. In comparison to this alternative proof for matrices with distinct columns (Lemma 4.1), our proof is self-contained and intuitive. We believe it could help provide further insight into the Sinkhorn/Bethe approximations.
Profile maximum likelihood:
The second set of related work concerns profile maximum likelihood and its applications. As discussed in the introduction, PML was introduced by [OSS+04]. Many heuristic approaches, such as the EM algorithm [OSS+04], algebraic approaches [ADM+10] and a dynamic programming approach [PJW17], have been proposed to compute approximations to PML. Further, [Von12, Von14] used the Bethe permanent as a heuristic to compute the PML distribution. None of these approaches provide theoretical guarantees on the quality of the approximate PML distribution, and it was an open question to efficiently compute a non-trivial approximate PML distribution. [CSS19] gave the first efficient algorithm to compute an approximate PML distribution, where is the number of samples.
The connection between PML and universal estimators was first studied in [ADOS16], which showed that an approximate PML distribution can be used as a universal estimator for estimating symmetric properties, namely entropy, distance to uniformity, support size and coverage. See [HO19] for the broad applicability of approximate PML in property testing and in estimating other symmetric properties such as sorted distance, Renyi entropy, and a broad class of additive properties. [CSS19] combined with [ADOS16] gave the first efficient PML based universal estimator for symmetric property estimation. There have been several other approaches for designing universal estimators for symmetric properties. Valiant and Valiant [VV11b] adopted and rigorously analyzed a linear programming based approach for universal estimators proposed by [ET76] and showed that it is sample complexity optimal in the constant error regime for estimating certain symmetric properties (namely entropy, support size, support coverage, and distance to uniformity). Recent work of Han, Jiao and Weissman [HJW18] applied a local moment matching based approach to designing efficient universal symmetric property estimators for a single distribution. [HJW18] achieves the optimal sample complexity in broader error regimes for estimating the power sum function, support and entropy.
Estimating symmetric properties of a distribution is a rich field and extensive work has been dedicated to determining the optimal sample complexity of estimating each of these properties. The optimal sample complexities for estimating many symmetric properties were resolved in the past few years: support [VV11b, WY15], support coverage [OSW16, ZVV+16], entropy [VV11b, WY16a], distance to uniformity [VV11a, JHW16], sorted distance [VV11a, HJW18], Renyi entropy [AOST14, AOST17], KL divergence [BZLV16, HJW16] and many others.
Comparison to [CSS19]: [CSS19] provides the first efficient algorithm for computing a -approximate PML distribution for for some constant , where is the number of samples. Formally, [CSS19] computes an -approximate PML distribution. If and are the number of distinct probability values and frequencies respectively, then [CSS19] provides a convex program which, using combinatorial techniques, they show approximates the PML objective up to a multiplicative factor. This convex program outputs a fractional solution, and [CSS19] provides a rounding algorithm that outputs a valid integral solution (corresponding to a valid distribution); the rounding procedure incurs a further multiplicative loss. Using the discretization results, up to an -multiplicative loss one can assume , and therefore [CSS19] provides a -approximate PML distribution.
In our current work, by contrast, using results for the scaled Sinkhorn permanent, we show that the same convex program in [CSS19] approximates the PML objective up to a multiplicative factor. We also provide a better rounding algorithm that outputs a valid distribution and incurs a multiplicative loss. Using the discretization results, up to an -multiplicative loss one can assume , and therefore our work provides a -approximate PML distribution.
4 The Sinkhorn permanent for structured matrices.
In this section, we provide the proof of our first main theorem (Theorem 3.1). We show that the scaled Sinkhorn permanent of a non-negative matrix approximates its permanent, where the approximation factor is exponential in the non-negative rank of the matrix (up to log factors). Our proof is divided into two parts. First, in Section 4.1, we work with the simpler setting of matrices with at most distinct columns and prove the following lemma.
Lemma 4.1 (Scaled Sinkhorn permanent approximation).
For any matrix with at most distinct columns, the following holds,
(9) |
Further using the above result, in Section 4.2 we prove our main theorem (Theorem 3.1) for low non-negative rank matrices.
4.1 The Sinkhorn permanent for distinct column case.
We start this section by defining some notation that captures the structure of repetition of columns in a matrix. For the remainder of this section we fix a matrix . We let denote the number of distinct columns of A and use to denote these distinct columns. Further we let denote the matrix formed by these distinct columns. We use to denote the ’th column of matrix A and let denote the number of columns equal to . It is immediate that,
(10) |
where is the size of the domain. For any matrix define,
(11) |
In the first half of this section, we show existence of a matrix (See Lemma 4.4) such that , , and further (See Lemma 4.5),
(12) |
Later in the second half (See Lemma 4.6), we show that for any matrix that satisfies and , there exists a matrix (recall is the set of all doubly stochastic matrices) that satisfies,
(13) |
However, using 2.7 we already know that, . Further using the definition of and combining with Equations 12 and 13 we get,
In the remainder, we provide proofs for all the above mentioned inequalities, for which we need the following set of definitions. Let , be the subset of all matrices that are row stochastic, meaning there is exactly a single 1 in each row. Let be the set of matrices such that any satisfies .
Definition 4.2.
Let be the function that takes a permutation as input and returns a matrix in the following way,
(14) |
Remark: Note that as desired for all because of the following. For any , let . Since for all are distinct, we have . Further for any , .
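The displayed formula in Definition 4.2 was stripped; under the natural reading of the surrounding text (an assumption on our part), the map sends a permutation sigma to the 0-1 matrix whose (x, j) entry is 1 exactly when column sigma(x) of A equals the j-th distinct column, so that row sums are 1 and column sums equal the column multiplicities. A short sketch:

```python
import numpy as np

def to_zero_one_matrix(A, sigma):
    # Map a permutation sigma (array of length N) to an N x k 0-1 matrix:
    # entry (x, j) is 1 iff column sigma[x] of A equals the j-th distinct column of A.
    distinct, col_to_distinct = [], {}
    for y in range(A.shape[1]):
        key = tuple(A[:, y])
        if key not in col_to_distinct:
            col_to_distinct[key] = len(distinct)
            distinct.append(key)
    N, k = A.shape[0], len(distinct)
    M = np.zeros((N, k), dtype=int)
    for x in range(N):
        M[x, col_to_distinct[tuple(A[:, sigma[x]])]] = 1
    return M  # row sums are 1; column sum j equals the number of columns of A equal to distinct column j
```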
We next define the probability of a permutation with respect to matrix A as follows,
(15) |
Further we define a marginal distribution on and later we will establish that this is indeed a probability distribution, that is, probabilities add up to 1.
(16) |
For , we next provide another equivalent expression for .
(17) |
In the first and second equality, we used definitions of and (See Equation 15). For any , let . Further for any , let be such that , then that is further equal to because is equal to if and otherwise. Therefore the third equality holds. For the final equality, observe that for any if we let , then for each , any permutation within the subset of elements results in a permutation that satisfies . These permutations can be carried out independently for each that corresponds to number of permutations and all of them have the same value.
Using the derivation from above, the definition for can also be written as follows:
(18) |
Note for , the expression for can be equivalently written as follows:
(19) |
We next show that the defined above is a valid distribution.
Remark: The domain of distribution is , but its support is a subset of .
Definition 4.3.
For the distribution , we define a non-negative matrix with respect to as follows:
(20) |
Lemma 4.4.
The matrix P defined in Equation 20 satisfies the following conditions:
(21) |
Proof.
We first evaluate the row sum. For each ,
In the second inequality we used that , meaning for each , . Next we evaluate the column sum, for each ,
In the first equality we used the definition of . In the second inequality we interchanged the summations. In the final equality we used . ∎
The matrix P defined in Equation 20 is important because we can upper bound the permanent of matrix A in terms of entries of this matrix. We formalize this result in our next lemma.
Lemma 4.5.
For matrix , if P is the matrix defined in Equation 20, then
Proof.
We first calculate the expectation of and express it in terms of matrix P.
(22) |
The second equality holds because the support of distribution is subset of . In the third equality we used Equation 19. We now simplify the last term in the final expression from the above derivation.
(23) |
Combining Equation 22 and Equation 23 together we get,
(24) |
We next define a different distribution on using the following sampling procedure: For each , pick a column independently with probability . Note that this is a valid sampling procedure because for each , . The description of distribution is as follows: for each ,
(25) |
Remark: Note that and is a valid distribution.
We next calculate the expectation of with respect to distribution and express it in terms of matrix P. Note that because for all and we get,
We now calculate the KL divergence between distributions and .
Using Lemma 2.9, we have , that further implies,
(26) |
In the second inequality we used Lemma 2.8 on each and further in the third inequality we used and the fact that the function is always upper bounded by . Further using the definition of (See Equation 11), we conclude the proof. ∎
We provided an upper bound to the permanent of matrix A and all that remains is to relate this upper bound to the scaled Sinkhorn permanent of matrix A. Our next lemma will serve this purpose.
Lemma 4.6.
For any matrix that satisfies,
(27) |
there exists a doubly stochastic matrix such that,
(28) |
Proof.
Define matrix as follows,
where in the definition above is such that . Now we verify the row and column sums of matrix Q. For each ,
(29) |
We next evaluate the column sums. For each , let (note that is a function of ; for convenience, our notation does not capture this dependence) be such that , then
(30) |
Therefore the matrix Q is doubly stochastic and we next relate with . Recall the definition of (Equation 1),
(31) |
We analyze the above term and express it in terms of entries of matrices P and .
(32) |
The first equality follows because for all are distinct. The second equality follows because for each , consider any such that and note that for all such ’s, and . The third equality follows because .
We further simplify the final term in the above derivation.
(33) |
Combining Equation 32, Equation 33 and further substituting back in Equation 31 we get,
(34) |
In the final expression, we used the definition of and combined it with . ∎
We are now ready to prove the main lemma of this section, which is restated for convenience. See 4.1
Proof.
Consider the matrix P defined in Equation 20. By Lemma 4.4, matrix P satisfies the conditions of Lemma 4.6; therefore, there exists a doubly stochastic matrix such that . Combining it with Lemma 4.5 we get , which further implies . The lower bound for the follows from 2.7 and we conclude the proof. ∎
We next state another interesting property of the matrix P defined in Equation 20. This property will be useful for the purposes of PML (Section 6).
Theorem 4.7.
For matrix , the matrix P defined in Equation 20 satisfies the following:
Proof.
For any , recall by the definitions of terms and ,
(35) |
(36) |
For any we next construct a unique (and vice versa) such that,
Each corresponds to a bipartite graph where the vertices on the left side correspond to the set and those on the other side to , such that the degree of every left vertex is 1 and the degree of every right vertex is .
Consider , we divide the analysis into the following two cases,
-
1.
If , meaning both vertices are connected to in our bipartite graph representation. Then, .
-
2.
If , meaning vertex is connected to and to some other vertex . In this case we swap the edges, meaning we remove edges and add to construct . We formally define next,
(37)
In both cases, clearly . Further, implies for all and the following equality holds,
The same analysis also holds when we start with a and construct . We have a one to one correspondence between elements Y and in the sets and respectively, satisfying,
Therefore, and we conclude the proof. ∎
4.2 Generalization to low non-negative rank matrices
Here we prove our main result for the scaled Sinkhorn permanent of low non-negative rank matrices (Theorem 3.1). To prove this result, we use the performance result of the scaled Sinkhorn permanent for non-negative matrices with distinct columns. The following lemma relates the permanent of a matrix A of non-negative rank to matrices with at most distinct columns and will be crucial for our analysis.
Lemma 4.8 ([Bar96]).
For any matrix of non-negative rank , if for , then
where , .
As the number of terms in the above summation is low, the maximizing term is a good approximation to the permanent of A.
Corollary 4.9.
Given a non-negative matrix , let denote the non-negative rank of the matrix. If for is any non-negative matrix factorization of A, then
(38) |
Proof.
The number of feasible ’s in the set is at most and we conclude the proof. ∎
Lemma 4.10.
Let be any doubly stochastic matrices. Then is a doubly stochastic matrix.
Proof.
We first consider the row sums,
Therefore the matrix Q is row stochastic. In the above derivation, the second and third equalities follow because and are row stochastic matrices respectively. We now consider the column sums,
The above derivation follows because and are column stochastic and therefore the matrix Q is column stochastic. As the matrix Q is both row and column stochastic we conclude the proof. ∎
We are now ready to prove our main result of this section and we restate it for convenience. See 3.1
Proof.
Let be the maximizer of the optimization problem 38, then
(39) |
Recall to prove the theorem, we need to construct a doubly stochastic witness Q that satisfies:
We construct such a witness Q from the doubly stochastic witnesses for matrices and . For all define , equivalently and note that . Let and be the doubly stochastic matrices that maximize the scaled Sinkhorn permanent for matrices and respectively. Therefore by Lemma 4.1 the following inequalities hold,
(40) |
(41) |
where recall and . Without loss of generality by the symmetry (with respect to columns within ) and concavity of the scaled Sinkhorn objective, we can assume that the maximizing matrices and satisfy the following: for all and ,
(42) |
Note that the doubly stochastic matrix that we constructed for the proof of Lemma 4.1 also satisfies the above collection of equalities. Now combining Equations 39, 40 and 41 we get,
(43) |
In the second inequality we use Stirling's approximation (Lemma 2.8), and the error term due to this approximation is upper bounded by . In the third inequality we used .
Let , then by Lemma 4.10 the matrix Q is doubly stochastic. In the remainder of the proof we show that,
(44) |
where recall . As matrix Q is doubly stochastic, the above inequality combined with Equation 43 concludes the proof. Therefore, in the remainder we focus our attention on proving Equation 44, and we start by simplifying the above expression. Define,
(45) |
For all and the variables defined above satisfy the following,
(46) |
where in the third inequality we used the definition of . We next simplify and lower bound the term in terms of these newly defined variables.
(47) |
where in the first equality we used . In the second inequality we used the weighted AM-GM inequality. Now consider the term and substitute the above lower bound,
(48) |
Summing over all the pairs we get,
(49) |
In the above expression the following terms simplify,
(50) |
Similarly,
(51) |
Also note that,
(52) |
In the third and fourth inequality we used Equation 46 and Equation 45 respectively. Substituting Equations 50, 51 and 52 in Equation 49 we get,
(53) |
By rearranging terms the above expression can be equivalently written as,
(54) |
In the above expression we have a lower bound for the term and we relate it to terms and . Consider the following term,
(55) |
In the final equality we renamed the variables; the rest of the equalities are straightforward. Carrying out a similar derivation we also get,
(56) |
As before in the final equality we renamed variables. Substituting Equations 55 and 56 in Equation 54 we get,
(57) |
Therefore to prove Equation 44, all that remains is to show that,
(58) |
To prove the above inequality we use the symmetry in the solutions and . Recall from Equation 42, for all and we have . Define and for any . We next substitute these definitions and simplify terms on the left hand side of Equation 58,
(59) |
In the final equality we used and the rest of the equalities are straightforward. A similar argument to the above also gives,
(60) |
Note in the final equality we renamed variables. Finally,
(61) |
Again, each of the terms in the parentheses further simplifies as follows,
Similarly,
Substituting back all the above three expressions in Equation 61 we get,
(62) |
Further substituting Equations 59, 60 and 62 in the derivation below we get,
Therefore the above derivation proves Equation 58 and we further substitute it in Equation 57 to get,
(63) |
The above expression combined with Equation 43 gives the following upper bound on the log of permanent,
(64) |
The above expression combined with definition of the scaled Sinkhorn permanent concludes the proof. ∎
5 Lower bound for Bethe and scaled Sinkhorn permanent approximations
Here we provide the proof of Theorem 3.3 that is stated below for convenience. See 3.3
Proof.
Assume is divisible by . Let 1 and 0 be the all ones and all zeros matrices respectively. Note that , and the proof of this statement follows because is the maximizer of the optimization problem over all doubly stochastic matrices Q. On the other hand, , where in the last inequality we used Stirling's approximation. Now consider the following matrix,
In the above definition, A is a matrix, with blocks. For the matrix A we have, and . Therefore .
The proof for the case when is not divisible by is similar. Here matrix A is the block diagonal matrix where the first blocks correspond to the all ones matrix and the final block corresponds to the all ones matrix, where . For this definition of matrix A we have and . Therefore . The first condition of the theorem follows by taking exponentials on both sides of the previous inequality.
The second inequality in the theorem follows by using (See 2.7). As the matrix A constructed here is of non-negative rank , we conclude the proof. ∎
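A numeric illustration of this construction (our own, with two assumptions about the stripped formulas: A is block diagonal with k all-ones blocks of size n/k, and the scaled Sinkhorn permanent is e^{-n} times the Sinkhorn permanent): then per(A) = ((n/k)!)^k, the maximizing Q spreads mass k/n uniformly over each block so that log sinkhorn(A) = n log(n/k), and the logarithmic gap grows as Theta(k log(n/k)).

```python
from math import lgamma, log

def log_gap(n, k):
    # log per(A) - log scaledSinkhorn(A) for the assumed block-diagonal construction.
    log_per = k * lgamma(n // k + 1)            # log ((n/k)!)^k
    log_scaled_sinkhorn = n * log(n / k) - n    # log sinkhorn(A) - n, assumed e^{-n} scaling
    return log_per - log_scaled_sinkhorn

for n, k in [(16, 2), (64, 4), (256, 16)]:
    print(n, k, round(log_gap(n, k), 2), round(k * log(n / k), 2))
```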
6 Improved approximation to profile maximum likelihood
In this section, we provide an efficient algorithm to compute an -approximate PML distribution. We first introduce the setup and some new notation. For convenience, we also recall some definitions from Section 2.
We are given access to independent samples from a hidden distribution supported on domain . Let be this length sequence and be its corresponding profile. Let be the frequency for domain element in sequence . Let be the number of non-zero distinct frequencies and we use to denote these distinct frequencies. Note that the number of non-zero distinct frequencies is always upper bounded by . For , we define . Let be the PML distribution with respect to profile and is formally defined as follows,
Recall the definition of profile probability matrix with respect to profile and distribution p,
(65) |
where is the frequency of domain element in the observed sequence and recall . Note that the number of distinct columns is equal to the number of distinct observed frequencies plus one (for the unseen), and therefore it is .
From Equation 4, the probability of profile with respect to distribution p is,
(66) |
here denotes the number of unseen domain elements; note that it is not part of the profile. Given a distribution p we know its domain and therefore the unseen domain elements. Also, note that is independent of the term ; it depends just on the profile and not on the underlying distribution p.
We now provide the motivation behind the techniques used in this section. Recall that the goal of this section is to compute an approximate PML distribution and we wish to do this using the results from the previous section. A first attempt would be to use the scaled Sinkhorn (or the Bethe) permanent as a proxy for the term in Equation 66 and solve the following optimization problem:
The above optimization problem is indeed a good proxy for the PML objective, and it is equivalent to the following:
Taking log and ignoring the constants we get the following equivalent optimization problem,
Interestingly, the function is concave with respect to p for a fixed Q and concave with respect to Q for a fixed p (see [Von14]). Unfortunately, however, the function in general is not concave with respect to p and Q simultaneously [Von14], and we do not know how to solve the above optimization problem. Vontobel [Von14] proposed an alternating maximization algorithm to solve the above optimization problem, and studied its implementation and convergence to a stationary point; see [Von14] for the empirical performance of this approach. Using the Bethe permanent as a proxy in the above optimization problem has similar issues; see [Von12, Von14] for further details.
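As an illustration of the alternating approach (our own schematic, not necessarily the exact procedure of [Von14], which works with the Bethe permanent), the Sinkhorn proxy can be alternately maximized: for fixed p the maximization over Q is a matrix-scaling problem, and for fixed Q the maximization over p has a closed form, since the objective restricted to p is a weighted sum of log p_x with weights given by the frequency-weighted row sums of Q. Only a local optimum is guaranteed.

```python
import numpy as np

def alternating_max(f, outer=50, inner=300):
    # f: length-N vector of observed frequencies (zeros for unseen domain elements).
    N = len(f)
    p = np.full(N, 1.0 / N)
    for _ in range(outer):
        A = np.power(p[:, None], f[None, :])   # profile probability matrix A_xy = p_x^{f_y}
        Q = A.copy()
        for _ in range(inner):                 # Q-step: Sinkhorn scaling of A
            Q /= Q.sum(axis=1, keepdims=True)
            Q /= Q.sum(axis=0, keepdims=True)
        w = Q @ f                              # p-step: p_x proportional to sum_y Q_xy f_y
        p = w / w.sum()
    return p

print(alternating_max(np.array([3.0, 2.0, 1.0, 0.0, 0.0])))
```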
To address the above issue we use the idea of probability discretization from [CSS19], meaning we assume the distribution takes all its probability values from some fixed predefined set. We use this idea in a different way than [CSS19] and further exploit the structure of the optimal solution Q to write a convex optimization problem that approximates the PML objective. The solution of this convex optimization problem is a fractional representation of the distribution, which we later round to return an approximate PML distribution with the desired guarantees. Surprisingly, the final convex optimization problem we write is exactly the same as the one in [CSS19], and our work gives a better analysis of the same convex program by a completely different approach.
The rest of this section is organized as follows. In Section 6.1, we study probability discretization. In the same section, we also study the application of results from Section 4 to approximating the permanent of the profile probability matrix (). We further provide the convex optimization problem at the end of that section, which can be solved efficiently and returns a fractional representation of the approximate PML distribution. In Section 6.2, we provide the rounding algorithm that returns our final approximate PML distribution. Up to this point, all our results are independent of the choice of the probability discretization set. In Section 6.3, we choose an appropriate probability discretization set and combine the analysis from all the previous sections; there we state and analyze our final algorithm that returns a -approximate PML distribution. Note that our rounding algorithm is technical, and for continuity of reading we defer all the proofs of results in Section 6.2 to Section 6.4.
6.1 Probability discretization
Here we study the idea of probability discretization that is also used in [CSS19]. We combine this with other ideas from Section 4 to provide a convex program that approximates the PML objective.
Let be some discretization of the probability space; in this section we consider distributions that take all their probability values in the set R. All results in this section hold for any finite set R, and we specify the exact definition of R in Section 6.3.
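Purely for illustration (an assumption on our part, not the paper's choice of R, which is fixed only in Section 6.3), one natural discretization is a geometric grid with multiplicative spacing 1 + alpha down to a smallest value r_min, whose size is O(log(1/r_min)/alpha):

```python
def geometric_grid(alpha=0.05, r_min=1e-6):
    # A hypothetical discretization set R: 1, 1/(1+alpha), 1/(1+alpha)^2, ... >= r_min.
    R, r = [], 1.0
    while r >= r_min:
        R.append(r)
        r /= (1 + alpha)
    return R

R = geometric_grid()
print(len(R), R[:3])
```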
The discretization introduces the technicality that probability values may not sum to one, and we redefine pseudo-distributions and discrete pseudo-distributions from [CSS19] to deal with this.
Definition 6.1 (Pseudo-distribution).
is a pseudo-distribution if and a discrete pseudo-distribution with respect to R if all its entries are in R as well. We use and to denote the set of all such pseudo-distributions and discrete pseudo-distributions with respect to R respectively.
We extend and use the following definition for for any vector and therefore for pseudo-distributions as well,
Further, for any probability terms defined involving p, we define those terms for any vector just by replacing by everywhere. For convenience we refer to for any pseudo-distribution q as the “probability” of profile with respect to q.
For a scalar and set S, define and as follows:
Definition 6.2 (Discrete pseudo-distribution).
For any distribution , its discrete pseudo-distribution with respect to R is defined as:
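The displayed definition was stripped; under the assumption (following [CSS19]) that each probability value is rounded down to the largest element of R not exceeding it, a sketch is below. The rounded vector then sums to at most 1, so it is a pseudo-distribution in the sense of Definition 6.1.

```python
import numpy as np

def discretize(p, R):
    # Round each probability down to the largest r in R with r <= p_x
    # (values below min(R) are clipped up to min(R) in this sketch).
    R = np.sort(np.asarray(R, dtype=float))
    idx = np.searchsorted(R, p, side="right") - 1
    return R[np.clip(idx, 0, len(R) - 1)]

R = [0.001, 0.01, 0.1, 0.25, 0.5]
print(discretize(np.array([0.47, 0.30, 0.23]), R))  # [0.25 0.25 0.1], sums to at most 1
```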
We now define some additional definitions and notation that will help us lower and upper bound the permanent of profile probability matrix by a convex optimization problem.
-
•
Let be the cardinality of set R and be the ’th element of set R.
-
•
For any discrete pseudo-distribution q with respect to R, that is , we let , be the number of domain elements with probability .
-
•
Let be the set of non-negative matrices such that, for any the following holds:
(67) where is the number of unseen domain elements and we use to denote the corresponding frequency element ( is not part of the profile and is not given to us; later in this section, we get rid of this dependency on ).
-
•
For any define,
(68) -
•
Throughout this section, for convenience unless stated otherwise we abuse notation and use A to denote the matrix . The underlying pseudo-distribution q and profile with respect to matrix A will be clear from the context.
The first half of this section is dedicated to bounding the in terms of the function . For any fixed discrete pseudo-distribution q and profile , we will show that,
Later, in the second half, we use the above inequality and maximize over all discrete pseudo-distributions to find the approximate PML distribution; this result is summarized later. We start by showing the lower bound, and later in Theorem 6.4 we prove the upper bound.
Theorem 6.3.
For any discrete pseudo-distribution q with respect to R and profile , let A be the matrix defined (with respect to q and ) in Equation 65, then the following holds,
(69) |
Proof.
For any matrix , define matrix as follows,
where in the definition above and are such that and . We now establish that matrix Q is doubly stochastic. For each , let be such that , then
(70)
For each , let be such that , then
(71)
Since matrix Q is doubly stochastic, by the definition of the scaled Sinkhorn permanent (See 2.6) and 2.7 we have . To conclude the proof we show that .
(72)
We consider the final expression above and simplify it. First note that,
Similarly,
Using the above two expressions, the final expression of Equation 72 can be equivalently written as,
(73)
Combining Equation 72, Equation 73 and substituting , we get:
In the above equality we used for all and for any . Combining the above inequality with we get,
The above inequality holds for any (and therefore holds for the maximizer as well) and we conclude the proof. ∎
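The argument above hinges on verifying that the constructed matrix Q is doubly stochastic. As a small generic sanity check (not taken from the paper), the row- and column-sum conditions can be tested numerically as follows.

    import numpy as np

    def is_doubly_stochastic(Q, tol=1e-9):
        """Check that Q is (entrywise) non-negative and every row and column sums to 1."""
        Q = np.asarray(Q, dtype=float)
        return bool((Q >= -tol).all()
                    and np.allclose(Q.sum(axis=1), 1.0, atol=tol)
                    and np.allclose(Q.sum(axis=0), 1.0, atol=tol))

    # Example: any convex combination of permutation matrices is doubly stochastic.
    P1, P2 = np.eye(3), np.eye(3)[[1, 2, 0]]
    assert is_doubly_stochastic(0.4 * P1 + 0.6 * P2)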
We next give an upper bound on the log of the permanent of A in terms of .
Theorem 6.4.
For any discrete pseudo-distribution q with respect to R and profile , let A be the matrix defined (with respect to q and ) in Equation 65. Then,
Proof.
Here we construct a particular matrix such that , which immediately implies the theorem. Recall that by Lemmas 4.5 and 4.4, there exists a matrix such that and , and it satisfies (the inequality holds because matrix A has distinct columns and is asymptotically the same as ). Further using the definition of we get,
(74)
where for the matrix A defined (with respect to q and ) in Equation 65, we have,
We now define the matrix S that satisfies the conditions of the lemma.
By Theorem 4.7, for any fixed , all such that share the same probability value, and we use the notation to denote this value. Using this definition, we have:
(75)
Further note that for any , if is any element such that , then
We wish to show that . We first analyze the row sum constraint. For each ,
We now analyze the column constraint. For each ,
In the remainder of the proof we show that the matrix S defined earlier satisfies . We start by simplifying the term in Equation 74,
(76)
In the second equality, we used and further by the definition of we have for all that satisfy . In the third equality, we used . In the fourth equality we used Equation 75. In the final equality, we used and the final term further simplifies to the following, . ∎
Note that using Theorems 6.3 and 6.4, for the matrix A defined (with respect to q and ) in Equation 65, we showed the following,
(77)
Our final goal in this section is to maximize over discrete pseudo-distributions q, but let us take a step back and first focus on writing an upper bound. Consider the term in the expression above; it depends on the discrete pseudo-distribution q in two different places. The first is the constraint set and the second is the function (because it contains the term in its expression). We address the first issue by defining the following new set that encodes the constraint sets for all discrete pseudo-distributions simultaneously.
Definition 6.5.
Let be the set of non-negative matrices, such that any satisfies,
(78)
Note that in the definition of we removed the constraint related to , where recall denotes the number of unseen domain elements. Not having a constraint with respect to helps us encode discrete pseudo-distributions (with respect to R) of different domain sizes. Further, for any , there is a discrete pseudo-distribution associated with it, which we define next.
Definition 6.6.
For any , the discrete pseudo-distribution associated with S is defined as follows: For any arbitrary number of domain elements assign probability .
Note that in the definition above, is a valid pseudo-distribution because of the third condition in Equation 78. Further note that, for any discrete pseudo-distribution q and , the distribution associated with S is a permutation of the distribution q. Since the probability of a profile is invariant under permutations of the distribution, we treat all these distributions the same and do not distinguish between them.
We now handle the second issue, namely removing the dependency on the discrete pseudo-distribution q from the function . To handle this issue, we define a new function that, when maximized over the sets and , approximates the values and respectively (see the next theorem for the formal statement). For any , the function is defined as follows,
(79)
Note that we switch gears and define the function in exponential form: approximates the value rather than its logarithm, which helps with readability of the proofs. The following theorem summarizes this result.
Theorem 6.7.
Let R be a probability discretization set, and let be a profile and q a discrete pseudo-distribution with respect to R. The following inequality holds,
(80)
Further,
(81)
Proof.
For any discrete pseudo-distribution q with respect to R and profile , let A be the matrix defined (with respect to q and ) in Equation 65. Then, by Equation 77 we have,
Further by Equation 4 we have,
Combining the above two equations we have,
(82)
We now simplify the term in the above expression. First note that for any ,
Therefore,
(83)
By Lemma 2.8 (Stirling’s approximation) we have,
(84)
The first inequality follows because for each , we have , which by using is further lower bounded by . Equation 84 follows by taking product over all . Now combining Equation 84 and Equation 83 we have,
(85)
The first statement of the lemma follows by combining the above Equation 85 with Equation 82, that is we have,
(86)
Given a profile , for any discrete pseudo-distribution we have and further combining it with above inequality we get,
Note that for any , we also have , where is the discrete pseudo-distribution associated with respect to S (See 6.6). Therefore,
For the last inequality in the above derivation we used Equation 86. Now combining the previous two inequalities we conclude the proof. ∎
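The proof above relies on Stirling-type bounds on factorials (Lemma 2.8, whose exact constants are not restated here). As a quick numerical sanity check, the standard two-sided estimate sqrt(2*pi*n) (n/e)^n <= n! <= e*sqrt(n) (n/e)^n, which such lemmas typically package, can be verified as follows.

    import math

    def stirling_log_bounds(n):
        """Return (lower, upper) bounds on log(n!) coming from
        sqrt(2*pi*n)*(n/e)**n <= n! <= e*sqrt(n)*(n/e)**n, valid for n >= 1."""
        base = n * (math.log(n) - 1.0)                 # log of (n/e)^n
        lower = 0.5 * math.log(2.0 * math.pi * n) + base
        upper = 1.0 + 0.5 * math.log(n) + base
        return lower, upper

    for n in [1, 2, 5, 10, 50, 100]:
        lo, hi = stirling_log_bounds(n)
        assert lo <= math.lgamma(n + 1) <= hi          # lgamma(n + 1) = log(n!)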
The previous theorem provides an upper bound for the probability of a profile with respect to any discrete pseudo-distribution. However, one issue with this upper bound is that it is not efficiently computable, because the set is not convex (due to the integrality constraints). We relax these integrality constraints and define the following new set.
Definition 6.8.
Let be the set of non-negative matrices, such that any satisfies,
(87)
Lemma 6.9.
Let R be a probability discretization set. Given a profile , the following holds,
(88)
Proof.
Note that in the above lemma the upper bound depends only on the profile ( has no dependency on ), and we have removed all dependencies related to distributions (and also ). Next we show that this upper bound can be efficiently computed, using the result that the function is log concave in S.
Lemma 6.10 (Lemma 4.16 in [CSS19]).
Function is log concave in S.
Theorem 6.11 (Theorem 4.17 in [CSS19]).
Given a profile , the optimization problem can be solved in time (note that here we hide the logarithmic dependence on , the sample size).
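To illustrate how such a program can be set up, here is a minimal sketch, not the paper's implementation. It maximizes a caller-supplied concave objective log_g (standing in for the logarithm of the function in Equation 79, which is not reproduced here) over a relaxed constraint set of the form suggested by Definition 6.8: non-negative matrices with prescribed column sums and total probability mass at most one. This reading of the constraints, and the use of an off-the-shelf solver instead of the method behind the stated running time, are assumptions made for the sketch only.

    import numpy as np
    from scipy.optimize import minimize

    def solve_relaxation(r, phi, log_g):
        """Maximize a concave log_g(S) over {S >= 0 : sum_i S[i, j] = phi[j] for all j,
        sum_{i, j} r[i] * S[i, j] <= 1}.  Any concave callable can be plugged in."""
        r, phi = np.asarray(r, float), np.asarray(phi, float)
        ell, k = len(r), len(phi)
        # Starting point spreading each column uniformly (may violate the mass
        # constraint; SLSQP tolerates an infeasible start).
        x0 = np.outer(np.ones(ell) / ell, phi).ravel()
        cons = [{"type": "eq",
                 "fun": (lambda x, j=j: x.reshape(ell, k)[:, j].sum() - phi[j])} for j in range(k)]
        cons.append({"type": "ineq",
                     "fun": lambda x: 1.0 - float(r @ x.reshape(ell, k).sum(axis=1))})
        res = minimize(lambda x: -log_g(x.reshape(ell, k)), x0, method="SLSQP",
                       bounds=[(0.0, None)] * (ell * k), constraints=cons)
        return res.x.reshape(ell, k)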
6.2 Rounding Algorithm
In the previous section we provided an efficiently computable upper bound on the probability of a profile with respect to any discrete pseudo-distribution . Computing this upper bound yields a fractional solution , and we need to round this solution to construct a discrete pseudo-distribution that approximately attains the upper bound. In this section we provide a rounding algorithm that takes as input and returns a solution , where is an extended probability discretization set. Further, using , we construct a discrete pseudo-distribution with respect to such that approximates the upper bound and is therefore an approximate PML distribution. Our rounding algorithm is technical, and we next provide an overview to better understand it.
Overview of the rounding algorithm:
The goal of the rounding algorithm is to take a fractional solution as input and round each row sum to an integral value while preserving the column sums and value. Our rounding algorithm proceeds in three steps:
Step 1:
Consider the fractional solution and recall the rows are indexed by the elements of set R (which represent probability values). We first round the rows corresponding to the higher probability values by simply taking the floor (rounding down to the nearest integer) of each entry. This procedure ensures the integrality of the row sums (corresponding to higher probability values) but violates the column sum constraints. To satisfy the column sum constraints and the distributional constraint (i.e. last condition in Equation 78) simultaneously, we create rows corresponding to new probability values using Algorithm 2. However to ensure that all these new rows also have integral row sums, we modify the (old) rows corresponding to lower probability values accordingly. Let be the solution returned by the first step of the rounding algorithm. Algorithm 2 ensures that the value is not much smaller than . In , all the new rows and (old) rows corresponding to higher probability values have integral row sums and we round the remaining rows corresponding to smaller probability values next.
Step 2:
In this step, we round all the rows corresponding to the smaller probability values. For each of these rows, we scale all the entries by the same factor to ensure that the row sum is rounded down to the nearest integer. Similar to step 1, using Algorithm 2 we create rows corresponding to new probability values to maintain the column sum constraints and the distributional constraint; all these new rows again correspond to small probability values. Unlike in the previous step, the new rows created in step 2 may not have integral row sums, but these rows have a nice diagonal structure. Let be this intermediate solution created in step 2. Algorithm 2 ensures that the value is not much smaller than (and hence ). Note that all the row sums in are integral except those of the new rows created in step 2, which all have small probability values and a diagonal structure.
Step 3:
In this final step, using Algorithm 1 we round the new rows created in step 2. Algorithm 1 exploits the low probability and diagonal structure in these rows. The diagonal structure ensures that there is just one non-zero entry in any particular row and we modify the solution (from the previous step) as follows. We transfer the mass from a non-integral lower probability value row to an immediate higher probability value row until the (lower probability value) row sum is integral. This process might violate the distributional constraint and we rescale the probability values accordingly to satisfy this constraint. Let be the solution returned by step 3. We ensure that all column sums are preserved, all row sums are integral and the value is not much smaller than (and hence not much smaller than ).
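The following toy sketch mirrors the three steps at a very high level. It is schematic only: the threshold between "high" and "low" probability rows, and the omission of the new rows that Algorithms 1 and 2 create to restore the column sums and the distributional constraint, are simplifications, not the actual Algorithms 1-3.

    import numpy as np

    def toy_round(S, r, high_threshold):
        """Schematic rounding of a fractional solution S whose rows are indexed by the
        probability values r.  The real algorithm additionally creates new rows
        (Algorithm 2) and shifts mass between adjacent low-probability rows
        (Algorithm 1) so that column sums and total probability mass are preserved."""
        S = np.array(S, dtype=float)
        high = np.asarray(r, dtype=float) >= high_threshold

        # Step 1: floor every entry of the high-probability rows.
        S[high] = np.floor(S[high])

        # Step 2: scale each low-probability row so that its row sum becomes integral.
        for i in np.where(~high)[0]:
            s = S[i].sum()
            if s > 0:
                S[i] *= np.floor(s) / s

        # Step 3 (placeholder): the real algorithm fixes the remaining fractional rows,
        # which have a diagonal structure, by moving mass to an adjacent row.
        return S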
In the remainder of this section we state all three algorithms and the results corresponding to them. For continuity of reading, we defer the proofs of these results to Section 6.4. For convenience, we first state Algorithm 1, which rounds the rows corresponding to the low probability values in step 3 of our main rounding algorithm (Algorithm 3). We follow this algorithm with a lemma that summarizes the guarantees it provides. Later we state Algorithm 2, which creates rows corresponding to new probability values to preserve the column sums and the distributional constraint; this algorithm is invoked as a subroutine in both steps 1 and 2 of Algorithm 3. Finally, we state our main rounding algorithm, which consists of three different steps, followed by results analyzing each of these steps separately. The final result (Theorem 6.16) is the main theorem of this subsection and summarizes the final guarantees promised by our rounding algorithm.
(89)
The next lemma summarizes the quality of the solution produced by Algorithm 1.
Lemma 6.12.
Given a set of reals for all such that , weights for all , and exponents for all (here need not be equal to zero), using Algorithm 1 we can efficiently compute a matrix such that the following conditions hold:
1. and .
2. .
3. .
We next describe Algorithm 2. The algorithm takes input and creates a new probability discretization set (Lines 6-10). The solution output by the algorithm belongs to , has the same column sums as B, and its value is lower bounded by .
The next lemma summarizes the quality of the solution produced by Algorithm 2.
Lemma 6.13.
The solution returned by Algorithm 2 satisfies the following conditions:
1. for all .
2. For any , let be such that ; then for all and . (Diagonal Structure)
3. For any , let be such that ; then .
4. and .
5. Let for all and ; then
6. For each , the new row corresponds to the probability value .
In the remainder of this section, we state and analyze our rounding algorithm. Our algorithm works in three steps, and we show that the solutions produced in the intermediate and final steps all have the desired approximation guarantee. We divide the analysis into three parts: Lemma 6.14, Lemma 6.15 and Theorem 6.16 analyze the guarantees provided by the intermediate solutions , and the final solution , respectively.
The next lemma summarizes the quality of the intermediate solution produced by Step 1 of Algorithm 3.
Lemma 6.14.
The solution returned by the step 1 of Algorithm 3 satisfies the following:
1. for all .
2. and .
3. , where .
Using Lemma 6.14 we now provide the guarantees for the solution returned by the step 2 of Algorithm 3.
Lemma 6.15.
The solution returned by the step 2 of Algorithm 3 satisfies the following,
1. for all .
2. for all and (Diagonal Structure).
3. and .
4. .
5. For any , .
6. .
Using Lemma 6.15 we now provide the guarantees for the final solution returned by Algorithm 3.
Theorem 6.16.
The final solution returned by Algorithm 3 satisfies the following,
1. .
2. .
6.3 Combining everything together
Here we combine the analysis from the previous two sections to provide an efficient algorithm that computes an approximate PML distribution. The main contribution of this section is to define a probability discretization set R that guarantees the existence of a discrete pseudo-distribution q with respect to R that is also an approximate PML pseudo-distribution. We then use this probability discretization set R, combined with results from the previous two sections, to output an approximate PML distribution. To this end, we first define a set R with the desired guarantees; such a set R was already constructed in [CSS19], and we formally state the results from [CSS19] that help us define it.
Lemma 6.17 (Lemma 4.1 in [CSS19]).
For any profile , there exists a distribution such that is an -approximate PML distribution and .
The above lemma allows us to define a region in which our approximate PML distribution takes all its probability values, and we use an idea similar to [CSS19] to define this region.
Let be a discretization of the probability space, where is the smallest integer such that for some . Fix an arbitrary order for the elements of set R; we use to denote the ’th element of this set. We next state a result from [CSS19] that captures the effect of this discretization.
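A minimal sketch of such a discretization, assuming (consistent with [CSS19]) a geometric grid with smallest value alpha and multiplicative spacing (1 + beta); the concrete choices of alpha and beta used by the final algorithm are fixed only in the proof below.

    def probability_grid(alpha, beta):
        """Geometric grid R = {alpha * (1 + beta) ** i : i = 0, 1, ...} truncated at 1.
        Its size is O(log(1 / alpha) / beta)."""
        R, x = [], alpha
        while x <= 1.0:
            R.append(x)
            x *= 1.0 + beta
        return R

    # Example: probability_grid(1e-4, 0.1) has about 100 grid points.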
Lemma 6.18 (Lemma 4.4 in [CSS19]).
For any profile and distribution , its discrete pseudo-distribution satisfies:
We are now ready to state our final algorithm. Following this algorithm, we prove that it returns an approximate PML distribution.
See 3.4
Proof.
Choose and let the probability discretization space and be the smallest integer such that and therefore . Let be the ’th element of set R and we have .
Given profile , let be the PML distribution. Define and by Lemma 6.18 (and choice of ) we have,
(90)
Let , then by Lemma 6.9 we have,
(91)
Note , therefore and further combined with equations 90 and 91 we have,
(92)
Let and be the solution returned by Algorithm 3, then by the second condition of 6.16 we have,
(93)
Combining equations 92 and 93 we have,
(94)
We now simplify the above expression by providing bounds and values for the parameters and . We choose and recall . Given samples, the number of distinct frequencies is upper bounded by and therefore . By Lemma 6.17, up to a constant multiplicative loss we can assume that the minimum non-zero probability value of our approximate PML distribution is at least , and therefore the support . Recall that by the third condition of Lemma 6.14, we have . The condition implies , and further using for all we have . Therefore .
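For completeness, here is the counting argument behind the bound on the number of distinct frequencies: if a profile obtained from samples contains k distinct non-zero frequencies, these frequencies are distinct positive integers whose sum is at most the sample size n, so

    n ≥ 1 + 2 + ⋯ + k = k(k+1)/2,   and hence   k ≤ √(2n).

(The exact form of the bound used above may be stated with a slightly different constant.)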
Substituting these values in Equation 94 we get,
(95)
By the first condition of 6.16 we have . Let be the discrete pseudo-distribution with respect to , then the condition further implies and combined with Theorem 6.7 we have,
(96)
Combining equations 95, 96, and we have,
(97)
Define , then is a distribution, (because is a pseudo-distribution and ) and combined with Equation 97 we get,
(98)
Therefore is an -approximate PML distribution.
In the remainder of the proof we analyze the running time of our final algorithm for approximate PML. Step 4 of the algorithm, that is, the convex program, can be solved in time (see Theorem 6.11). Algorithm 2 () and Algorithm 1 () can be implemented in and time respectively; therefore Algorithm 3 (the rounding algorithm) can be implemented in time. Combining everything together, our final algorithm (Algorithm 4) can be implemented in time. Further using , we conclude the proof. ∎
6.4 Missing Proofs from Section 6.2
Here we provide the proofs of all the lemmas and theorems in Section 6.2.
Proof of Lemma 6.12.
Without loss of generality assume . Let ; we invoke Algorithm 1 with inputs . Let and be the output of Algorithm 1. We now provide the proof of the three conditions in the lemma.
Condition 1: By construction of Algorithm 1, for any we have (Line 6) and for any other we have . Therefore the first part of condition 1 holds.
For any , one of the following two cases holds:
1. If , and in this case let be such that . By line 6 (third case) of the algorithm we have,
(99)
We now analyze the term ,
The first equality follows because for we have , which follows from the second and third cases in line 6 of the algorithm. In the second equality we substituted the values of and using the second case (Line 6) and Equation 99 respectively.
2. Else , and in this case let be such that . Then by the first case in line 6 of the algorithm we have,
Condition 2: Consider ,
(100)
The first equality follows because the rest of the entries are zero. In the second inequality we used , and therefore , by our assumption at the beginning of the proof. In the remainder, we simplify both terms. Consider the first term in the final expression above,
(101)
In the first equality we interchanged the summations. In the second equality we used and further invoked condition 1 of the lemma. Now consider the second term in the final expression of Equation 100,
(102)
The second equality follows by line 6 of the algorithm. Condition 2 follows by combining equations 100, 101 and 102.
Condition 3: First we show that implies . Consider ,
1. If , in this case let be such that . Then by the second and third case in line 6 of the algorithm we have, implies . Further, using and we have .
2. Else and in this case let be such that . Then by the first case in line 6 of the algorithm we have, implies . Further, using we have .
Using the above implication we have,
(103)
In the first equality we used for all (Condition 1). In the final inequality, we used the result that implies and further combined it with the assumption at the beginning of the proof, that is, for all and . ∎
Proof of Lemma 6.13.
Define . In the following we provide the proof for each case.
Condition 1: For each , for all and the first condition holds.
Condition 2: Note is initialized to a zero matrix (Line 4). Further for any let be such that , then the algorithm only updates the ’th entry in the ’th row and keeps rest of the entries unchanged. Therefore the second condition holds.
Condition 3: For each let be such that , then . The first equality holds because of the Condition 2. The third equality follows from the Line 8 of the algorithm. The last equality holds because and we have .
Condition 4: Here we provide the proof for . For any , we first show that .
The second equality follows because for all and (Line 6) and (Condition 2). The third equality follows from the Condition 3.
We next show that .
(104)
In the first equality, we divided the summation into two parts and for the second part we used Condition 3. In the second equality we used Line 7 and 8 of the algorithm. In the third and fourth equality we simplified the expression. In the final inequality we used .
Combining all the conditions together we have . In the remainder we show that .
Recall we already showed that for all . Recall and implies for all . Therefore we have,
Condition 5: We first provide the explicit expressions for and below:
Note in the expression for we used Condition 2. In the above two definitions for and , we refer to the expression involving ’s as the probability term and the rest as the counting term. We start the analysis of Condition 5 by first bounding the probability term:
(105)
The first three inequalities simplify the expression. The fourth equality follows because for all and . The fifth inequality follows from the AM-GM inequality. The final expression above is the probability term associated with , and the equation above shows that our rounding procedure only increases the probability term; it remains to bound the counting term.
(106)
Consider the numerator in the above expression; for each let , then
(107)
In the third inequality we used for all . The final inequality follows because . Now consider the denominator in the above expression; let for all and , then
(108)
In the third inequality we used and therefore . In the fourth inequality we used . In the fifth inequality we used for all and further . Now consider the term and note that . The fifth inequality in Equation 108 follows by combining the previous two derivations together. The final inequality follows because .
Condition 6: This condition follows immediately from Line 7 of the algorithm. ∎
Proof of Lemma 6.14.
In the following we provide the proof for the claims in the lemma.
Condition 1: Note , where are the indices corresponding to the new rows created by the procedure (Algorithm 2). Consider any ; then one of the following two cases holds:
1. If , then by the first condition of Lemma 6.13 we have .
2. Else , and in this case we have . The second equality in the previous derivation follows from Lines 7 and 8 of the algorithm. Combining the previous derivation with the third condition of Lemma 6.13 we get .
In both cases , and condition 1 follows.
Condition 2: This condition follows immediately from the fourth condition of Lemma 6.13.
Condition 3: Let for all . First we upper bound the term . Consider . The last inequality follows because of the constraint () and for all .
We now upper bound the term . Consider . Further for all (Line 8 of the algorithm) and we get . ∎
Proof of Lemma 6.15.
In the following we provide proof for all the conditions in the lemma.
Condition 1: For all , one of the following two cases holds:
1. If , then by the first condition of Lemma 6.13 we have . The last expression follows from the first condition of Lemma 6.14.
2. Else , then again by the first condition of Lemma 6.13 we have . The last equality follows from Line 15 of the algorithm.
For all , we have , and therefore condition 1 holds.
Condition 2: This condition follows immediately from the second condition of Lemma 6.13.
Condition 3: This condition follows immediately from the fourth condition of Lemma 6.13.
Condition 4: Consider the term ,
(109)
In the first equality we add and subtract the term . The first term in the second equality follows because , and the last equality follows because (Condition 3). The second term in the second equality follows by the first condition of Lemma 6.13. In the third equality we divided the summation terms over and . In the fourth equality we used Line 14 of the algorithm, and further, for any , Line 15 implies . Finally, by the first condition of Lemma 6.14 we have for all and for all . Therefore , and condition 4 holds.
Condition 5: For any we have,
(110)
The first equality follows from the sixth condition of Lemma 6.13. The second equality follows because for all and (Line 14). The third inequality follows because for all and (Line 15) and further for all (Line 12).
Condition 6: For any , let . Note for all (Line 14) and for all (Line 15). Therefore and further combined with the fifth condition of Lemma 6.13 we have . Note by the third condition of Lemma 6.14 we have . Combining the previous two inequalities we get and condition 6 holds. ∎
Proof of Theorem 6.16.
In the following we provide proof for the two conditions of the theorem.
Condition 1: Here we provide the proof for the condition .
1. For all , consider . If , then . The first equality follows by line 22 of the algorithm and the last expression follows by the first condition of Lemma 6.15. Else , let be such that ; then . The second equality follows by line 23 of the algorithm. The third equality follows from the second condition of Lemma 6.15. Finally, by the first condition of Lemma 6.12 we have for all and therefore for any . Combining the analysis of the cases, the condition 1 holds.
2. For all ,
(111)
The second equality follows because for all (Line 22) and for all (Line 23). We next simplify the second term in the above expression.
(112)
In the first and final equality we used the second condition of Lemma 6.15 (Diagonal Structure). In the second equality we used the first condition of Lemma 6.12. In the third equality we used the definition of (Line 19). Combining equations 111 and 112 we get,
In the last inequality we used .
3. Let for all be the ’th element of . Consider ; we have,
(113)
The first equality follows from Line 24 of the algorithm. In the second equality we divided the summation into two parts and used for all and (Line 22) for the first part. We now simplify the second part of the above expression.
(114)
In the first equality we expanded the term. Further we used for all (Line 23). In the second equality we used the second condition of Lemma 6.15 (Diagonal Structure) and further combined it with the definitions of and from Line 19 of the algorithm. The third inequality follows from the second condition of Lemma 6.12. In the final inequality we used , which follows from the definition of and the fifth condition of Lemma 6.15, and we further combined it with , which follows from the second condition of Lemma 6.15.
Condition 1 holds by combining the analysis of all three cases above.
Condition 2: Recall the definition of ,
In the above expression consider the probability term,
(115)
In the first equality we used line 24 of the algorithm. In the second inequality we used that further implies . In the third equality we used for all and (Line 22). We now analyze the second product term in the final expression above,
(116)
The second equality follows from line 23 of the algorithm. The third equality follows from the second condition of Lemma 6.15 (Diagonal Structure).
Now consider the second product term in the above expression.
(117)
In the first equality we used the definition of (Line 19). The second inequality follows from the third condition of Lemma 6.12.
Combining equations 116, 117 and further using for all (Line 19) we have,
(118)
In the final inequality we used the second condition of Lemma 6.15 (Diagonal Structure).
Using the above expression we have,
(119)
In the second equality we used for all and (Line 22). The third inequality follows by the second condition of Lemma 6.15 (Diagonal Structure). In the remainder of the proof we lower bound the term in the final expression.
For each let be such that ; then . The first equality follows from line 23 of the algorithm. The second equality follows by the second condition of Lemma 6.15 (Diagonal Structure). Using the first condition of Lemma 6.12, one of the following two cases holds:
1. If , then for all . Using the second condition of Lemma 6.15 (Diagonal Structure), we have for all and . Further note that . Combining the previous two equalities we have . Therefore,
(120)
2. If , then for all . Using the second condition of Lemma 6.15 (Diagonal Structure), we have for all and . Therefore, . The final inequality follows because and for all .
Further note that . Combining the previous two inequalities we have . The last inequality follows because of the following: if , then the inequality follows because and ; else , and in this case we use the fact that is monotonically increasing for .
Therefore,
(121)
Combining equations 120 and 121, for all we have,
Substituting previous inequality in Equation 119 we get,
Further the condition 2 of the theorem follows by combining the above inequality with the sixth condition of Lemma 6.15. ∎
7 Acknowledgments
We thank Jayadev Acharya and Yanjun Han for helpful conversations. We thank the anonymous reviewer for pointing out the alternative proof of the quality of scaled Sinkhorn and Bethe approximations on approximating the permanent of matrices with a bounded number of distinct columns (see Section 3.1 and Appendix A).
References
- [AA11] Scott Aaronson and Alex Arkhipov. The computational complexity of linear optics. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing, STOC ’11, pages 333–342, New York, NY, USA, 2011. ACM.
- [ADM+10] J. Acharya, H. Das, H. Mohimani, A. Orlitsky, and S. Pan. Exact calculation of pattern probabilities. In 2010 IEEE International Symposium on Information Theory, pages 1498–1502, June 2010.
- [ADOS16] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for optimal distribution property estimation. CoRR, abs/1611.02960, 2016.
- [AOST14] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. The complexity of estimating rényi entropy. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, 2014.
- [AOST17] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. Estimating renyi entropy of discrete distributions. IEEE Trans. Inf. Theor., 63(1):38–56, January 2017.
- [AR18] Nima Anari and Alireza Rezaei. A tight analysis of bethe approximation for permanent. CoRR, abs/1811.02933, 2018.
- [AS04] Noga Alon and Joel H Spencer. The probabilistic method. John Wiley & Sons, 2004.
- [Bar96] Alexander I Barvinok. Two algorithmic results for the traveling salesman problem. Mathematics of Operations Research, 21(1):65–84, 1996.
- [Bar17] Alexander Barvinok. Combinatorics and Complexity of Partition Functions. Springer Publishing Company, Incorporated, 1st edition, 2017.
- [BF93] John Bunge and Michael Fitzpatrick. Estimating the number of species: a review. Journal of the American Statistical Association, 88(421):364–373, 1993.
- [Bre73] L. M. Bregman. Certain properties of nonnegative matrices and their permanents. 1973.
- [BZLV16] Y. Bu, S. Zou, Y. Liang, and V. V. Veeravalli. Estimation of kl divergence between large-alphabet distributions. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 1118–1122, July 2016.
- [CCG+12] Robert K Colwell, Anne Chao, Nicholas J Gotelli, Shang-Yi Lin, Chang Xuan Mao, Robin L Chazdon, and John T Longino. Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of plant ecology, 5(1):3–21, 2012.
- [Cha84] A. Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11:265–270, 1984.
- [CL92] Anne Chao and Shen-Ming Lee. Estimating the number of classes via sample coverage. Journal of the American statistical Association, 87(417):210–217, 1992.
- [CSS19] Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 780–791, New York, NY, USA, 2019. ACM.
- [DS13] Timothy Daley and Andrew D Smith. Predicting the molecular complexity of sequencing libraries. Nature methods, 10(4):325, 2013.
- [ET76] Bradley Efron and Ronald Thisted. Estimating the number of unseen species: How many words did shakespeare know? Biometrika, 63(3):435–447, 1976.
- [Für05] Johannes Fürnkranz. Web mining. In Data mining and knowledge discovery handbook, pages 899–920. Springer, 2005.
- [GS14] Leonid Gurvits and Alex Samorodnitsky. Bounds on the permanent and some applications. arXiv e-prints, page arXiv:1408.0976, Aug 2014.
- [GS18] Daniel Grier and Luke Schaeffer. New hardness results for the permanent using linear optics. In Proceedings of the 33rd Computational Complexity Conference, CCC ’18, pages 19:1–19:29, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
- [GTPB07] Zhan Gao, Chi-hong Tseng, Zhiheng Pei, and Martin J Blaser. Molecular analysis of human forearm superficial skin bacterial biota. Proceedings of the National Academy of Sciences, 104(8):2927–2932, 2007.
- [Gur05] Leonid Gurvits. On the complexity of mixed discriminants and related problems. In Proceedings of the 30th International Conference on Mathematical Foundations of Computer Science, MFCS’05, pages 447–458, Berlin, Heidelberg, 2005. Springer-Verlag.
- [Gur11] Leonid Gurvits. Unleashing the power of Schrijver’s permanental inequality with the help of the Bethe Approximation. arXiv e-prints, page arXiv:1106.2844, Jun 2011.
- [HHRB01] Jennifer B Hughes, Jessica J Hellmann, Taylor H Ricketts, and Brendan JM Bohannan. Counting the uncountable: statistical approaches to estimating microbial diversity. Appl. Environ. Microbiol., 67(10):4399–4406, 2001.
- [HJM17] Yanjun Han, Jiantao Jiao, and Rajarshi Mukherjee. On Estimation of $L_{r}$-Norms in Gaussian White Noise Models. arXiv e-prints, page arXiv:1710.03863, Oct 2017.
- [HJW16] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of KL divergence between discrete distributions. CoRR, abs/1605.09124, 2016.
- [HJW18] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under wasserstein distance. arXiv preprint arXiv:1802.08405, 2018.
- [HJWW17] Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu. Optimal rates of entropy estimation over Lipschitz balls. arXiv e-prints, page arXiv:1711.02141, Nov 2017.
- [HO19] Yi Hao and Alon Orlitsky. The Broad Optimality of Profile Maximum Likelihood. arXiv e-prints, page arXiv:1906.03794, Jun 2019.
- [JHW16] J. Jiao, Y. Han, and T. Weissman. Minimax estimation of the l1 distance. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 750–754, July 2016.
- [JSV04] Mark Jerrum, Alistair Sinclair, and Eric Vigoda. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. J. ACM, 51(4):671–697, July 2004.
- [JVHW15] J. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, May 2015.
- [KLR99] Ian Kroes, Paul W Lepp, and David A Relman. Bacterial diversity within the human subgingival crevice. Proceedings of the National Academy of Sciences, 96(25):14547–14552, 1999.
- [LSW98] Nathan Linial, Alex Samorodnitsky, and Avi Wigderson. A deterministic strongly polynomial algorithm for matrix scaling and approximate permanents. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pages 644–652, New York, NY, USA, 1998. ACM.
- [OSS+04] A. Orlitsky, S. Sajama, N. P. Santhanam, K. Viswanathan, and Junan Zhang. Algorithms for modeling distributions over large alphabets. In International Symposium on Information Theory, 2004. ISIT 2004. Proceedings., pages 304–304, 2004.
- [OSW16] Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016.
- [OSZ03] A. Orlitsky, N. P. Santhanam, and J. Zhang. Always good turing: asymptotically optimal probability estimation. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pages 179–188, Oct 2003.
- [PBG+01] Bruce J Paster, Susan K Boches, Jamie L Galvin, Rebecca E Ericson, Carol N Lau, Valerie A Levanos, Ashish Sahasrabudhe, and Floyd E Dewhirst. Bacterial diversity in human subgingival plaque. Journal of bacteriology, 183(12):3770–3783, 2001.
- [PGM+01] A. Porta, S. Guzzetti, N. Montano, R. Furlan, M. Pagani, A. Malliani, and S. Cerutti. Entropy, entropy rate, and pattern classification as tools to typify complexity in short heart period variability series. IEEE Transactions on Biomedical Engineering, 48(11):1282–1291, Nov 2001.
- [PJW17] D. S. Pavlichin, J. Jiao, and T. Weissman. Approximate Profile Maximum Likelihood. ArXiv e-prints, December 2017.
- [PW96] Nina T. Plotkin and Abraham J. Wyner. An Entropy Estimator Algorithm and Telecommunications Applications, pages 351–363. Springer Netherlands, Dordrecht, 1996.
- [Rad97] Jaikumar Radhakrishnan. An entropy proof of bregman’s theorem. Journal of Combinatorial Theory, Series A, 77(1):161 – 164, 1997.
- [RCS+09] Harlan S Robins, Paulo V Campregher, Santosh K Srivastava, Abigail Wacher, Cameron J Turtle, Orsalem Kahsai, Stanley R Riddell, Edus H Warren, and Christopher S Carlson. Comprehensive assessment of t-cell receptor -chain diversity in t cells. Blood, 114(19):4099–4107, 2009.
- [RRSS07] S. Raskhodnikova, D. Ron, A. Shpilka, and A. Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 559–569, Oct 2007.
- [RVZ17] Aditi Raghunathan, Gregory Valiant, and James Zou. Estimating the unseen from multiple populations. CoRR, abs/1707.03854, 2017.
- [RWdRvSB99] Fred Rieke, Davd Warland, Rob de Ruyter van Steveninck, and William Bialek. Spikes: Exploring the Neural Code. MIT Press, Cambridge, MA, USA, 1999.
- [Sch78] A Schrijver. A short proof of minc’s conjecture. Journal of Combinatorial Theory, Series A, 25(1):80 – 83, 1978.
- [Sch98] Alexander Schrijver. Counting 1-factors in regular bipartite graphs. Journal of Combinatorial Theory, Series B, 72(1):122 – 135, 1998.
- [Spe82] E. Spence. H. Minc, Permanents (Encyclopedia of Mathematics and its Applications, Vol. 6, Addison-Wesley Advanced Book Programme, 1978), xviii + 205 pp. Proceedings of the Edinburgh Mathematical Society, 25(1):110–110, 1982.
- [TE87] Ronald Thisted and Bradley Efron. Did shakespeare write a newly-discovered poem? Biometrika, 74(3):445–455, 1987.
- [Val79] L.G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2):189 – 201, 1979.
- [VBB+12] Martin Vinck, Francesco P. Battaglia, Vladimir B. Balakirsky, A. J. Han Vinck, and Cyriel M. A. Pennartz. Estimation of the entropy based on its polynomial representation. Phys. Rev. E, 85:051139, May 2012.
- [Von11] Pascal O. Vontobel. The bethe permanent of a non-negative matrix. CoRR, abs/1107.4196, 2011.
- [Von12] Pascal O. Vontobel. The Bethe approximation of the pattern maximum likelihood distribution. In 2012 IEEE International Symposium on Information Theory (ISIT), pages 2012–2016, July 2012.
- [Von13] P. O. Vontobel. The bethe permanent of a nonnegative matrix. IEEE Transactions on Information Theory, 59(3):1866–1901, March 2013.
- [Von14] P. O. Vontobel. The bethe and sinkhorn approximations of the pattern maximum likelihood estimate and their connections to the valiant-valiant estimate. In 2014 Information Theory and Applications Workshop (ITA), pages 1–10, Feb 2014.
- [VV11a] G. Valiant and P. Valiant. The power of linear estimators. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 403–412, Oct 2011.
- [VV11b] Gregory Valiant and Paul Valiant. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new clts. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing, STOC ’11, pages 685–694, New York, NY, USA, 2011. ACM.
- [WY15] Y. Wu and P. Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. ArXiv e-prints, April 2015.
- [WY16a] Y. Wu and P. Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, June 2016.
- [WY16b] Yihong Wu and Pengkun Yang. Sample complexity of the distinct elements problem. arXiv e-prints, page arXiv:1612.03375, Dec 2016.
- [YFW05] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282–2312, July 2005.
- [ZVV+16] James Zou, Gregory Valiant, Paul Valiant, Konrad Karczewski, Siu On Chan, Kaitlin Samocha, Monkol Lek, Shamil Sunyaev, Mark Daly, and Daniel G. MacArthur. Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects. Nature Communications, 7:13293 EP –, 10 2016.
Appendix A Alternative proof for the distinct column case.
Here we provide an alternative and simpler proof of Lemma 4.1, which was pointed out to us by an anonymous reviewer. This alternative proof is derived using Corollary 3.4.5 in Barvinok’s book [Bar17] (which in turn is derived using the Bregman-Minc inequality), and we formally state it below.
Lemma A.1 (Corollary 3.4.5 from [Bar17]).
Suppose that Q is a doubly stochastic matrix that satisfies,
for some positive integers . Then,
Alternative proof for Lemma 4.1.
The lower bound follows from 2.7, and in the remainder we prove the upper bound. Let Q be the maximizer of the scaled Sinkhorn objective; then it is a well-known fact that Q satisfies,
where matrices L and R are the left and right non-negative diagonal matrices. Further by the symmetry of the objective, there exists an optimum solution Q that has at most distinct columns and we work with such an optimum solution. As L and R are diagonal matrices, the following two inequalities are trivial,
(122)
(123)
Further note that for all doubly stochastic matrices Q we always have,
(124)
Therefore combining Equations 122, 123 and 124, to prove the upper bound it is enough to show that,
As matrix Q has at most distinct columns, let the multiplicities of these distinct columns be . Note that if a column has multiplicity , the maximal element in this column is at most . Now by Lemma A.1 (Corollary 3.4.5. in [Bar17]), we have
where the last inequality follows because the term is maximized when all ’s are equal and take value . Therefore we conclude the proof. ∎
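As a small numerical illustration of the Bregman-Minc inequality underlying Corollary 3.4.5 (for a 0-1 matrix with row sums r_i, the permanent is at most the product of (r_i!)^(1/r_i)), the following brute-force check can be run on random matrices; it is a sanity check only and plays no role in the proof.

    import itertools, math, random

    def permanent(A):
        """Brute-force permanent via the permutation expansion (fine for small n)."""
        n = len(A)
        return sum(math.prod(A[i][sigma[i]] for i in range(n))
                   for sigma in itertools.permutations(range(n)))

    def bregman_minc_bound(A):
        """Bregman-Minc upper bound for a 0-1 matrix: product over rows of (r_i!)^(1/r_i)."""
        bound = 1.0
        for row in A:
            r = sum(row)
            if r > 0:
                bound *= math.factorial(r) ** (1.0 / r)
        return bound

    random.seed(0)
    for _ in range(100):
        A = [[random.randint(0, 1) for _ in range(5)] for _ in range(5)]
        assert permanent(A) <= bregman_minc_bound(A) + 1e-9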