Response Letter for “Convergence Rate of Multiple-Try Metropolis Independent Sampler”

We thank the reviewers for their thoroughness in reading our manuscript and their many constructive suggestions. Besides correcting minor typos and polishing the language throughout, we have made the following major changes in this revision:

Major Changes:

  • We added a specific case study for multivariate Gaussian and Gaussian mixture examples in Section 4. Combined with the previous univariate examples, these numerical illustrations show the behavior of the IMH and MTM-IS samplers in high dimensions.

  • We rearranged the subsections in Section 5 and added a new subsection, Section 5.1, to describe an efficient variant called partial MTM-IS, whose more general form has been observed to be particularly useful in molecular simulations (frenkel1996understanding). Proposition 5 provides a preliminary analysis for its convergence rate.

  • We improved the results for the subsection on using multiple different proposals (Section 5.3), and conducted extensive simulation studies (with one reported) to examine the complexity of the situation.

  • We made a significant effort to improve the clarity and readability of the article by rephrasing and rearranging various sentences and definitions, especially for the remarks, examples, and the conclusion section.

The reviewers’ many insightful suggestions motivated us to make these new attempts summarized in the “Major Changes”, and we are deeply grateful for that. Below, we provide itemized responses to each of the reviewers.

Response to Reviewer #2

  • “Proof of Theorem 1:
    – Is it correct that $A(x,B)=P[y\in B\mid x]$? If yes, then how do we get the expression on line 32 (for which “x” disappeared)? Maybe adding one more step would help.
    – Shouldn’t $dy$ and $dy_{i}$ be at the end of the equation?”

    Response: The definition $A(x,B)=P[y\in B\mid x]$ appears correct to us. However, it seems that we made a mistake here by omitting the event $\{y_{J}\text{ gets accepted}\}$ from the expression. The accept-or-reject procedure involves computing the quotient $\pi(x)/p(x)$, which depends on $x$. In addition to fixing the integral notation, we have also edited the whole proof of Theorem 1 to make it more readable.

  • “Theorem 3, proof of the upper bound. I am not very comfortable with the given proof. Could you maybe give more details (and also avoid using flipping coins?).”

    Response: Thanks for your kind suggestion. A key point of the proof is to realize that decomposition (6) is a mixture distribution, and coin flipping offers a way to describe it. We have now removed the coin-flipping metaphor and added more descriptive words to the construction of the coupling chain to improve the readability.

  • “Equation (11): I am surprised not to have an ess-sup here. Is that because $\mathcal{X}=\operatorname{supp}(\pi)$?”

    Response: Sorry for our previous sloppiness. We have now added an ‘ess’ here. Thank you for pointing out this measure-theoretical typo.

  • “Equation (12) involves a “lim” where I was expecting a “sup”, which would have implied $d(n)<r^{n}$ for all $n$. Alternatively, $r:=\sup_{n}(d(n)/C)^{1/n}<1$ for some $C<\infty$ would imply $d(n)<Cr^{n}$. Then the denomination “exact convergence rate” would make more sense to me, but I’d like to hear your opinion on that.”

    Response: We have switched to ‘sup’ following your suggestion. In practice this makes no difference for our results: Theorem 3 already shows that $d(n)=r^{n}$ exactly, so $d(n)^{1/n}=r$ for every $n$, and hence $\sup_{n}d(n)^{1/n}=\lim_{n\to\infty}d(n)^{1/n}=r$; that is, ‘lim’ and ‘sup’ lead to the same quantity.

  • Minor remarks and typos.
    Response: We thank the reviewer for the careful reading and pointing out these errors. We have made the corrections and modifications accordingly. We have also made many other minor changes in order to improve the clarity and readability.

Response to Reviewer #3

Implications of the Convergence Results

“My main concern about the paper is that the convergence results provided in the paper are very implicit. In particular, practitioners are often interested in how the convergence rates scale with the complexities of the posterior and the space. For instance, in continuous spaces, one often wonders about the dependence of the convergence rates on the dimension of the space, the smoothness of the log posterior, and the convexity or other intrinsic properties of the posterior such as concentration or isoperimetry. In discrete spaces, one often considers the dependence on the number of states and similarly the isoperimetric constants of the posterior distribution. The characterization provided in the current paper does not illustrate the scaling of the convergence rates with respect to the complexities of the posterior and the space.

Because of the implicit nature of the convergence rates being provided, some of the claims in the paper are not definitive. For example, the claim that ‘the Multiple-Try Metropolis strategy is most helpful in help jumping among multiple modes of the target distribution’ is not clear from the discussion of the paper. Instead we only know that the independent Multiple-Try Metropolis samplers are not as efficient as their vanilla Metropolis counterparts with the corresponding amount of thinning.”

Response: Thank you very much for your insightful and ambitious comments. We agree that the properties of an MCMC algorithm you mentioned are highly valuable and desirable. Unfortunately, for most practically used algorithms in real applications such results are very scarce. We would also like to argue, respectfully, that since the IMH and MTM-IS algorithms are not as flexible and general as the Metropolis-Hastings or Gibbs sampling algorithms, our explicit expressions for the convergence rates of IMH and MTM-IS are equally helpful for practitioners. For example, our rate results suggest explicitly that one needs to choose a fatter-tailed (but not overly fat-tailed) sampling distribution relative to the target. In contrast, some published rate results on general MCMC often involve unknown constants and unrealistic, unverifiable assumptions on the target distribution, which cannot be used directly by practitioners. Furthermore, the efficiency of an MCMC sampler can also depend strongly on the parameterization, as recently revealed in the work of zanella2021multilevel, which cannot be reflected well by dimension-related rate results.

We are fortunate to be dealing with a much simpler setting in this paper: the independence sampler. As shown in liu1996metropolized and here, the convergence rates of IMH and MTM-IS are precisely related to the importance weight in importance sampling and are largely determined by how well the proposal distribution matches the target, especially in the tails. For example, if our proposal distribution $p$ perfectly matches the target distribution, IMH converges in one step regardless of the dimension! This is not the case for a standard MCMC algorithm. We also know trivially that a generic importance sampling algorithm (and likewise IMH and MTM-IS) scales badly (exponentially) with dimension. Thus, there seems to be no unambiguous and non-trivial formulation of the dimension-scaling property for an importance sampling or IMH/MTM-IS algorithm. A key to making these algorithms practically useful for high-dimensional problems is to design the sampling distribution carefully in a problem-specific way, such as the sequential importance sampler for 0-1 tables (chen2005sequential) and sequential Monte Carlo methods for dynamic systems (doucet2001sequential). However, once we start changing both the dimension of the target and the sampling distribution in a case-specific way, the question of how the convergence rate of an IMH algorithm scales with dimension is no longer well posed. Indeed, as shown in much work related to sequential Monte Carlo, including our own, there is a huge difference between a carefully designed importance sampler and a naive one. Since the proposal in our current work is not specified exactly, our convergence results do not address the complexities of such sampling problems.
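To make the tail-matching point concrete, here is a minimal sketch (not code from the manuscript) for target $N(0,1)$ and proposal $N(0,\sigma^{2})$, using the closed-form IMH rate $1-1/w^{\ast}$ with $w^{\ast}=\sup_{x}\pi(x)/p(x)$ as in liu1996metropolized:

```python
def imh_rate(sigma):
    """Geometric convergence rate 1 - 1/w* of the independence sampler
    for target N(0, 1) and proposal N(0, sigma^2).

    For sigma >= 1 the importance weight
        pi(x)/p(x) = sigma * exp(-x**2 * (1 - 1/sigma**2) / 2)
    is maximized at x = 0, so w* = sigma.  For sigma < 1 the weight is
    unbounded (thinner-tailed proposal): no geometric rate, return 1.0.
    """
    if sigma < 1.0:
        return 1.0
    return 1.0 - 1.0 / sigma

# Perfect match (sigma = 1): one-step convergence.  A slightly fatter-tailed
# proposal mixes fast; an overly fat-tailed one mixes slowly.
for s in [1.0, 1.1, 2.0, 10.0, 0.9]:
    print(s, imh_rate(s))
```

The printout illustrates the advice above: $\sigma$ slightly above 1 gives a rate near 0, while $\sigma$ below 1 destroys geometric ergodicity altogether.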

One may argue, and we agree, that for fixed classes of target and proposal distributions, it is useful and possible to obtain results on how convergence rates scale with dimension. Motivated by your comments, we investigate a case (Section 4.2) where the target is a Gaussian mixture but we are only allowed to use $N(0,\sigma^{2}I_{p})$ as the sampling distribution. In this case, the convergence rate indeed scales exponentially badly even if we optimize $\sigma$. We also rearranged Section 5 and added Subsection 5.1 with a new proposition to show how MTM can help in real molecular simulation applications.
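A single-Gaussian analogue of this calculation (a simplification of the mixture setting of Section 4.2, sketched here purely for illustration) already exhibits the exponential degradation: for target $N(0,I_{p})$ and proposal $N(0,\sigma^{2}I_{p})$ with $\sigma>1$, the weight ratio peaks at the origin with $w^{\ast}=\sigma^{p}$.

```python
def imh_rate_gaussian(p, sigma):
    """Rate 1 - 1/w* for target N(0, I_p), proposal N(0, sigma^2 I_p).

    For sigma > 1 the importance weight
        pi(x)/p(x) = sigma**p * exp(-|x|**2 * (1 - 1/sigma**2) / 2)
    peaks at x = 0, so w* = sigma**p and the rate 1 - sigma**(-p)
    approaches 1 exponentially fast as the dimension p grows.
    """
    if sigma <= 1.0:
        raise ValueError("need sigma > 1 for a finite w*")
    return 1.0 - sigma ** (-p)

for p in [1, 10, 100]:
    print(p, imh_rate_gaussian(p, 1.5))
```

Even with a modest mismatch ($\sigma=1.5$), the rate is already essentially 1 by $p=100$, matching the exponential scaling discussed above.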

Comparing the Rates

“In the framework proposed within the paper, one can perhaps try to provide inequalities of the form $1-H_{k_{1}}(w^{\ast})\leq(1-\frac{1}{w^{\ast}})^{k_{2}}$, where in different cases the ratios between $k_{1}$ and $k_{2}$ are different. For example, for a multi-modal distribution, $k_{1}/k_{2}$ in the above inequality may be much smaller than that for a strongly log-concave distribution. In discrete space, one can design a distribution that is only supported on a graph with one single vertex connecting two subgraphs. By changing the probability mass of this vertex, one can similarly quantify how “multimodal” the distribution is.”

Response: When $k\geq 2$, the function $H_{k}$ involves the integral of a complicated function over the proposal distribution $p$, which is often analytically intractable, so we perform numerical illustrations instead. Readers can find pairs $(k_{1},k_{2})$ by drawing horizontal lines at different levels in our plots in Section 4.

We feel that simply letting $k_{1}\neq k_{2}$ may not lead to useful findings, since we believe one should compare the convergence performance of two algorithms under the same (or nearly the same) computational budget. However, if one allows the algorithm to take advantage of parallel computing infrastructure, it adds another interesting dimension to the problem and makes the suggested comparison meaningful.

Inspired by your comment, we added Section 5.1 to illustrate how MTM may help in a partial MTM framework adopted in the molecular simulation literature. This partial MTM-IS spiritually resembles a case with $k_{1}=1$ and $k_{2}\gg k_{1}$ for a low-cost part of the whole distribution.

“More effort is needed for this work to provide richer insights for practitioners. I believe that an inequality of the form $1-H_{k_{1}}(w^{\ast})\leq(1-\frac{1}{w^{\ast}})^{k_{2}}$ can be achieved for a special class of posterior (and proposal) distributions. The explicit dependence on dimension, etc., may be difficult to achieve within the current framework. In such a case, I suggest performing a comprehensive experiment for a class of distributions (e.g., Gaussian) with different dimensions and conditioning. Such results will do a much greater service to the readers of the paper.”

Response: Thanks for your constructive comments and suggestions. We have now included a comprehensive numerical study on multivariate Gaussians and Gaussian mixtures (Section 4) to gain more insight into the dimension-dependence of the convergence rates of IMH and MTM-IS, which provides useful guidance to practitioners.