Novel Change of Measure Inequalities with Applications to PAC-Bayesian Bounds and Monte Carlo Estimation
Abstract
We introduce several novel change of measure inequalities for two families of divergences: f-divergences and α-divergences. We show how the variational representation for f-divergences leads to novel change of measure inequalities. We also present a multiplicative change of measure inequality for α-divergences and a generalized version of the Hammersley-Chapman-Robbins inequality. Finally, we present several applications of our change of measure inequalities, including PAC-Bayesian bounds for various classes of losses and non-asymptotic intervals for Monte Carlo estimates.
1 Introduction
The Probably Approximately Correct (PAC) Bayesian inequality was introduced by [32] and [23]. This framework allows us to produce PAC performance bounds (in the sense of a loss function) for Bayesian-flavored estimators [12], and several extensions have been proposed to date (see e.g., [30, 24, 21, 31, 2]). The core of these theoretical results is summarized by a change of measure inequality: an expectation inequality involving two probability measures, in which the expectation with respect to one measure is upper-bounded by the divergence between the two measures and the moments with respect to the other measure. The change of measure inequality also plays a major role in information theory. For instance, [17] derived robust uncertainty quantification bounds for statistical estimators of interest with change of measure inequalities.

Recent research efforts have pursued more general routes to PAC-Bayes bounds and have sought to remove assumptions such as boundedness of the loss function (e.g., [19, 34, 25, 13, 11]). All PAC-Bayesian bounds contained in these works rely heavily on one of the most famous change of measure inequalities, the Donsker-Varadhan representation of the KL-divergence. Several other change of measure inequalities have recently been proposed along with PAC-Bayes bounds. [20] proposed a proof scheme for PAC-Bayesian bounds based on the Rényi divergence. [15] proposed an inequality for the χ²-divergence and derived a PAC-Bayesian bound for linear classification. [1] proposed a novel change of measure inequality and PAC-Bayesian bounds based on the α-divergence. The aforementioned works were developed for specific purposes, and a comprehensive study of change of measure inequalities has not been performed yet. Our work proposes several novel and general change of measure inequalities for two families of divergences: f-divergences and α-divergences.

It is a well-known fact that the f-divergence can be variationally characterized as the maximum of an optimization problem rooted in the convexity of the generator f. This variational representation has recently been used in various applications of information theory, such as f-divergence estimation [26] and quantification of the bias in adaptive data analysis [16]. Recently, [29] showed that the variational representation of the f-divergence can be tightened when constrained to the space of probability densities.
Our main contributions are as follows:
• By using the variational representation for f-divergences, we derive several change of measure inequalities. We perform the analysis for the constrained regime (restricted to the space of probability densities) as well as the unconstrained regime.

• We generalize prior results for the Hammersley-Chapman-Robbins inequality [18] from the particular case of the χ²-divergence to a family of α-divergences.

• We provide new PAC-Bayesian bounds with the χ²-divergence and the α-divergence from our novel change of measure inequalities for bounded, sub-Gaussian, sub-exponential and bounded-variance loss functions. Our results are either novel, or have a tighter complexity term than existing results in the literature, and pertain to important machine learning prediction problems, such as regression, classification and structured prediction.

• We provide a new scheme for the estimation of non-asymptotic intervals for Monte Carlo estimates. Our results indicate that the empirical mean over a sampling distribution concentrates around an expectation with respect to an arbitrary target distribution.
2 Change of Measure Inequalities
| Bound Type | Divergence | Upper Bound on $\mathbb{E}_{Q}[\phi]$, for every $Q$ and a fixed $P$ | Reference |
|---|---|---|---|
| Constrained variational representation | KL | — | [23] |
| | Pearson $\chi^2$ | — | Lemma 2 |
| | Total Variation | — | Lemma 4 |
| Unconstrained variational representation | KL | — | [23] |
| | Pearson $\chi^2$ | — | Lemma 3 |
| | Total Variation | — | Lemma 4 |
| | $\alpha$ | — | Lemma 5 |
| | Squared Hellinger | — | Lemma 6 |
| | Reverse KL | — | Lemma 7 |
| | Neyman $\chi^2$ | — | Lemma 8 |
| Multiplicative | Pearson $\chi^2$ | — | [15] |
| | $\alpha$ | — | [1] |
| Generalized HCR | Pearson $\chi^2$ | — | [18] |
| | Pseudo $\alpha$ | — | Lemma 12 |
| Divergence | Formula (with probability measures $Q$ and $P$ defined on a common space $\Omega$) | Corresponding Generator $f(t)$ |
|---|---|---|
| KL | $\mathbb{E}_{P}\big[\tfrac{dQ}{dP}\log\tfrac{dQ}{dP}\big]$ | $t\log t$ |
| Reverse KL | $\mathbb{E}_{P}\big[-\log\tfrac{dQ}{dP}\big]$ | $-\log t$ |
| Pearson $\chi^2$ | $\mathbb{E}_{P}\big[\big(\tfrac{dQ}{dP}-1\big)^2\big]$ | $(t-1)^2$ |
| Neyman $\chi^2$ | $\mathbb{E}_{P}\big[\big(1-\tfrac{dQ}{dP}\big)^2\big/\tfrac{dQ}{dP}\big]$ | $(1-t)^2/t$ |
| Total Variation | $\tfrac{1}{2}\,\mathbb{E}_{P}\big[\lvert\tfrac{dQ}{dP}-1\rvert\big]$ | $\tfrac{1}{2}\lvert t-1\rvert$ |
| Squared Hellinger | $\mathbb{E}_{P}\big[\big(\sqrt{\tfrac{dQ}{dP}}-1\big)^2\big]$ | $(\sqrt{t}-1)^2$ |
| Pseudo $\alpha$ | — | — |
| $\alpha$ [1] | — | — |
In this section, we formalize the definition of f-divergences and present the constrained variational representation (restricted to the space of probability densities) as well as the unconstrained representation. Then, we provide different change of measure inequalities for several divergences. We also provide multiplicative bounds as well as a generalized Hammersley-Chapman-Robbins bound. Table 1 summarizes our results.
2.1 Change of Measure Inequality from the Variational Representation of f-divergences
Let $f: \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ be a convex function. The convex conjugate $f^*$ of $f$ is defined by:
$$f^*(y) = \sup_{x} \{\, xy - f(x) \,\}.$$
(1)
The definition of $f^*$ yields the following Young-Fenchel inequality
$$f(x) + f^*(y) \ge xy,$$
which holds for any $x$ and $y$.
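To make the conjugate concrete, here is a minimal Python sketch (ours, not from the paper) that approximates the conjugate of the Pearson generator $f(t) = (t-1)^2$ on a grid and spot-checks the Young-Fenchel inequality; the closed form $f^*(y) = y + y^2/4$ is derived later in Appendix A.3.

```python
import numpy as np

# Pearson chi^2 generator f(t) = (t - 1)^2 and a grid approximation of its
# convex conjugate f*(y) = sup_t { t*y - f(t) }.  Illustrative sketch only;
# the closed form f*(y) = y + y^2/4 is derived in Appendix A.3.
f = lambda t: (t - 1.0) ** 2
t_grid = np.linspace(0.0, 10.0, 200001)       # generator domain t >= 0

def f_star(y):
    return np.max(y * t_grid - f(t_grid))     # grid approximation of the supremum

for y in np.linspace(-1.0, 3.0, 9):
    approx, exact = f_star(y), y + y ** 2 / 4.0
    assert abs(approx - exact) < 1e-3, (y, approx, exact)

# Young-Fenchel inequality: f(x) + f*(y) >= x*y for all x, y.
rng = np.random.default_rng(0)
x_s, y_s = rng.uniform(0, 5, 1000), rng.uniform(-1, 3, 1000)
assert np.all(f(x_s) + (y_s + y_s ** 2 / 4.0) >= x_s * y_s - 1e-12)
print("conjugate and Young-Fenchel checks passed")
```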
Using the notation of convex conjugates, the f-divergence and its variational representation are defined as follows.
Definition 1 (f-divergence and its variational representation).
Let $\Omega$ be any arbitrary domain, and let $P$ and $Q$ denote probability measures over the Borel $\sigma$-field on $\Omega$. Additionally, let $f: \mathbb{R}^+ \to \mathbb{R}$ be a convex and lower semi-continuous function that satisfies $f(1) = 0$. The f-divergence from $Q$ to $P$ is defined as
$$D_f(Q \| P) = \mathbb{E}_{P}\Big[ f\Big(\tfrac{dQ}{dP}\Big) \Big].$$
For simplicity, we shorten this notation in the sequel. Many common divergences, such as the KL-divergence and the Hellinger divergence, are members of the family of f-divergences, each coinciding with a particular choice of the generator $f$. Table 2 presents the definition of each divergence with the corresponding generator $f$. It is well known that the f-divergence admits the following variational representation.
Lemma 1 (Variational representation of the f-divergence, Lemma 1 in [27]).
Let $P$, $Q$, $\Omega$ and $f$ be defined as in Definition 1, and let $\phi: \Omega \to \mathbb{R}$ range over measurable real-valued functions. The f-divergence from $Q$ to $P$ is characterized as
$$D_f(Q \| P) = \sup_{\phi: \Omega \to \mathbb{R}} \big\{ \mathbb{E}_{Q}[\phi] - \mathbb{E}_{P}[f^*(\phi)] \big\}.$$
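As a numerical illustration (ours, not from the paper), the following sketch checks the "easy" direction of Lemma 1 on a finite space: for every function $\phi$, $\mathbb{E}_Q[\phi] - \mathbb{E}_P[f^*(\phi)]$ is at most $D_f(Q\|P)$. It assumes the standard generators $t\log t$ (KL, with conjugate $e^{y-1}$) and $(t-1)^2$ (Pearson $\chi^2$, with conjugate $y + y^2/4$), and uses randomly drawn discrete distributions.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 6
p = rng.dirichlet(np.ones(k))            # "prior" P on a finite space
q = rng.dirichlet(np.ones(k))            # "posterior" Q on the same space

# f-divergences D_f(Q || P) = E_P[ f(dQ/dP) ] for two standard generators.
kl   = np.sum(p * (q / p) * np.log(q / p))    # f(t) = t log t
chi2 = np.sum(p * (q / p - 1.0) ** 2)         # f(t) = (t - 1)^2

# Variational lower bound: E_Q[phi] - E_P[f*(phi)] <= D_f(Q || P) for every phi.
for _ in range(1000):
    phi = rng.normal(size=k)
    assert np.sum(q * phi) - np.sum(p * np.exp(phi - 1.0)) <= kl + 1e-12
    assert np.sum(q * phi) - np.sum(p * (phi + phi ** 2 / 4.0)) <= chi2 + 1e-12
print("variational lower bounds hold for all sampled phi")
```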
[29] showed that this variational representation for f-divergences can be tightened.
Theorem 1 (Change of measure inequality from the constrained variational representation for f-divergences [29]).
Let $P$, $Q$, $\Omega$ and $f$ be defined as in Definition 1, and let $\phi: \Omega \to \mathbb{R}$ be a real-valued function. Let $L^1$ denote the space of probability densities with respect to a given measure over $\Omega$. The general form of the change of measure inequality for f-divergences is then given by the tightened variational representation of [29], in which the optimization variable $p$ is constrained to be a probability density function.
The famous Donsker-Varadhan representation for the KL-divergence, which is used in most PAC-Bayesian bounds, can actually be derived from this tighter representation by choosing the KL generator $f(t) = t \log t$. However, it is not always easy to find a closed-form solution for Theorem 1, as it requires resorting to variational calculus, and in some cases there is no closed-form solution. In such a case, we can use the following corollary, which yields looser bounds but only requires finding a convex conjugate.
Corollary 1 (Change of measure inequality from the unconstrained variational representation for f-divergences).
Let $P$, $Q$, $\Omega$ and $f$ be defined as in Definition 1, and let $\phi: \Omega \to \mathbb{R}$ be a real-valued function. Then
$$\mathbb{E}_{Q}[\phi] \le D_f(Q \| P) + \mathbb{E}_{P}[f^*(\phi)].$$
Detailed proofs can be found in Appendix A. By choosing an appropriate generator $f$ and solving the constrained maximization with the help of variational calculus, we can derive an upper bound based on the corresponding divergence $D_f$. Next, we discuss the case of the Pearson $\chi^2$-divergence.
Lemma 2 (Change of measure inequality from the constrained representation of the Pearson $\chi^2$-divergence).
Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have
$$\mathbb{E}_{Q}[\phi] \le \mathbb{E}_{P}[\phi] + \sqrt{\chi^2(Q \| P)\,\mathrm{Var}_{P}[\phi]}.$$
The bound in Lemma 2 is slightly tighter than the one without the constraint. The change of measure inequality without the constraint is given as follows.
Lemma 3 (Change of measure inequality from the unconstrained representation of the Pearson $\chi^2$-divergence).
Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have
$$\mathbb{E}_{Q}[\phi] \le \mathbb{E}_{P}[\phi] + \tfrac{1}{4}\mathbb{E}_{P}[\phi^2] + \chi^2(Q \| P).$$
As might be apparent, the bound in Lemma 2 is tighter than the one in Lemma 3, since the constrained representation can only decrease the resulting upper bound. A numerical comparison of the two bounds is sketched below; we then discuss the case of the total variation divergence.
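The following sketch (ours) compares the two $\chi^2$-based upper bounds on $\mathbb{E}_Q[\phi]$, in the forms stated above, on randomly drawn discrete distributions. The first inequality also follows directly from Cauchy-Schwarz, and the fact that the constrained bound never exceeds the unconstrained one follows from the AM-GM inequality, so this is a sanity check rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 8
p = rng.dirichlet(np.ones(k))
q = rng.dirichlet(np.ones(k))
chi2 = np.sum(p * (q / p - 1.0) ** 2)     # Pearson chi^2 divergence, generator (t-1)^2

for _ in range(1000):
    phi = rng.normal(size=k)
    eq_phi = np.sum(q * phi)
    ep_phi = np.sum(p * phi)
    var_p  = np.sum(p * phi ** 2) - ep_phi ** 2
    constrained   = ep_phi + np.sqrt(chi2 * var_p)              # Lemma 2 (form stated above)
    unconstrained = ep_phi + np.sum(p * phi ** 2) / 4.0 + chi2  # Lemma 3 (form stated above)
    assert eq_phi <= constrained + 1e-10 <= unconstrained + 1e-9
print("constrained bound dominates; unconstrained bound is looser")
```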
Lemma 4 (Change of measure inequality from the constrained representation of the total variation divergence).
Let $\phi: \Omega \to [0, 1]$ be a real-valued function, and let $P$ and $Q$ be defined as in Theorem 1. Then we have
$$\mathbb{E}_{Q}[\phi] \le \mathbb{E}_{P}[\phi] + D_{TV}(Q \| P).$$
Interestingly, we can obtain the same bound on the total variation divergence even if we use the unconstrained variational representation. Next, we state our result for α-divergences.
Lemma 5 (Change of measure inequality from the unconstrained representation of the α-divergence).
Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. For $\alpha > 1$, we have
We can obtain the bounds based on the squared Hellinger divergence, the reverse KL-divergence and the Neyman $\chi^2$-divergence in a similar fashion.
Lemma 6 (Change of measure inequality from the unconstrained representation of the squared Hellinger divergence).
Let $\phi: \Omega \to \mathbb{R}$ be a real-valued function with $\phi < 1$, and let $P$ and $Q$ be defined as in Theorem 1. Then we have
$$\mathbb{E}_{Q}[\phi] \le H^2(Q \| P) + \mathbb{E}_{P}\Big[\tfrac{\phi}{1 - \phi}\Big].$$
Similarly, we obtain the following bound for the reverse-KL divergence.
Lemma 7 (Change of measure inequality from the unconstrained representation of the reverse KL-divergence).
Let $\phi: \Omega \to \mathbb{R}$ be a real-valued function with $\phi < 0$, and let $P$ and $Q$ be defined as in Theorem 1. Then we have
$$\mathbb{E}_{Q}[\phi] \le \mathrm{KL}(P \| Q) - \mathbb{E}_{P}[\log(-\phi)] - 1.$$
Finally, we prove our result for the Neyman $\chi^2$-divergence based on a similar approach.
Lemma 8 (Change of measure inequality from the unconstrained representation of the Neyman $\chi^2$-divergence).
Let $\phi: \Omega \to \mathbb{R}$ be a real-valued function with $\phi < 1$, and let $P$ and $Q$ be defined as in Theorem 1. Then we have
$$\mathbb{E}_{Q}[\phi] \le \chi^2_N(Q \| P) + 2 - 2\,\mathbb{E}_{P}\big[\sqrt{1 - \phi}\,\big].$$
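As a sanity check of the unconstrained bounds of the form given by Corollary 1, instantiated here with the standard generators from Table 2 (an assumption on conventions, not the paper's code), the following sketch verifies the squared Hellinger, reverse KL and Neyman $\chi^2$ inequalities on a finite space, respecting the domain restrictions $\phi < 1$ and $\phi < 0$.

```python
import numpy as np

rng = np.random.default_rng(3)
k = 8
p = rng.dirichlet(np.ones(k))
q = rng.dirichlet(np.ones(k))
r = q / p                                    # density ratio dQ/dP on a finite space

hellinger2 = np.sum(p * (np.sqrt(r) - 1.0) ** 2)   # generator (sqrt(t) - 1)^2
reverse_kl = np.sum(p * (-np.log(r)))              # generator -log t, equals KL(P||Q)
neyman     = np.sum(p * (1.0 - r) ** 2 / r)        # generator (1 - t)^2 / t

for _ in range(1000):
    phi = rng.uniform(-3.0, 0.99, size=k)    # phi < 1 (Hellinger, Neyman)
    assert np.sum(q * phi) <= hellinger2 + np.sum(p * phi / (1.0 - phi)) + 1e-10
    assert np.sum(q * phi) <= neyman + 2.0 - 2.0 * np.sum(p * np.sqrt(1.0 - phi)) + 1e-10
    psi = rng.uniform(-3.0, -0.01, size=k)   # psi < 0 (reverse KL)
    assert np.sum(q * psi) <= reverse_kl + np.sum(p * (-1.0 - np.log(-psi))) + 1e-10
print("unconstrained change of measure bounds hold for all sampled functions")
```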
2.2 Multiplicative Change of Measure Inequality for α-divergences
First, we state a known result for the $\chi^2$-divergence.
Lemma 9 (Multiplicative change of measure inequality for the $\chi^2$-divergence [15]).
Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have
$$\mathbb{E}_{Q}[\phi] \le \sqrt{\big(\chi^2(Q \| P) + 1\big)\,\mathbb{E}_{P}[\phi^2]}.$$
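A quick numerical check (ours) of the multiplicative bound in the form stated above, which also follows from Cauchy-Schwarz since $\mathbb{E}_P[(dQ/dP)^2] = \chi^2(Q\|P) + 1$:

```python
import numpy as np

rng = np.random.default_rng(4)
k = 8
p = rng.dirichlet(np.ones(k))
q = rng.dirichlet(np.ones(k))
chi2 = np.sum(p * (q / p - 1.0) ** 2)

for _ in range(1000):
    phi = rng.normal(size=k)
    lhs = np.sum(q * phi)                                    # E_Q[phi]
    rhs = np.sqrt((chi2 + 1.0) * np.sum(p * phi ** 2))       # sqrt((chi^2 + 1) E_P[phi^2])
    assert lhs <= rhs + 1e-12
print("multiplicative chi^2 bound holds")
```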
First, we note that the $\chi^2$-divergence is an α-divergence for $\alpha = 2$. Next, we generalize the above bound to any α-divergence.
Lemma 10 (Multiplicative change of measure inequality for the α-divergence).
Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. For any $\alpha > 1$, we have
2.3 A Generalized Hammersley-Chapman-Robbins (HCR) Inequality
The HCR inequality is a famous information-theoretic inequality based on the $\chi^2$-divergence.
Next, we generalize the above bound to the pseudo α-divergence.
Lemma 12 (The generalization of the HCR inequality).
3 Application to PAC-Bayesian Theory
| Loss & Divergence | Generalization upper bound on the Gibbs risk, for every posterior and a fixed prior | Reference |
|---|---|---|
| Bounded loss | — | Proposition 5 |
| | — | Proposition 5 |
| | — | Corollary 3 |
| | — | Corollary 3 |
| 0-1 loss | — | [15, 20] |
| Sub-Gaussian | — | Proposition 6 |
| | — | Proposition 6 |
| | — | Theorem 1 & Proposition 6 in [1] |
| | — | Corollary 4 |
| | — | Corollary 4 |
| | — | Theorem 1 & Proposition 6 in [1] |
| Sub-exponential | — | Proposition 7 |
| | — | Proposition 7 |
| | — | Corollary 5 |
| | — | Corollary 5 |
| Bounded variance | — | Proposition 1 |
| | — | Proposition 1 |
| | — | Proposition 4 in [1] |
| | — | Corollary 2 |
| | — | Corollary 2 |
| | — | Corollary 1 in [1] |
In this section, we explore applications of our change of measure inequalities. We consider an arbitrary input space and an output space; the samples are input-output pairs. Each example is drawn i.i.d. according to a fixed but unknown data distribution. Let a generic loss function be given. The risk of any predictor is defined as the expected loss over samples drawn according to the data distribution. Given a training set of samples, the empirical risk of any predictor is defined by the empirical average of the loss over the training set.
In the PAC-Bayesian framework, we consider a hypothesis space of predictors, a prior distribution on it, and a posterior distribution on it. The prior is specified before exploiting the information contained in the training set, while the posterior is obtained by running a learning algorithm on the training set. PAC-Bayesian theory usually studies the stochastic Gibbs predictor. Given a distribution on the hypothesis space, the Gibbs predictor labels an example by drawing a predictor according to the posterior and returning that predictor's output. The risk of the Gibbs predictor is then defined as follows. For any probability distribution on a set of predictors, the Gibbs risk is the expected risk of the Gibbs predictor relative to that distribution. Hence,
(2) |
Usual PAC-Bayesian bounds give guarantees on the generalization (Gibbs) risk. Typically, these bounds rely on the empirical (Gibbs) risk, defined as follows.
(3) |
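As a concrete (and entirely illustrative) companion to Equations (2) and (3), the following sketch estimates the empirical Gibbs risk of a Gaussian posterior over linear classifiers by Monte Carlo. The data, the posterior parameters and the 0-1 loss are assumptions made for illustration only, not the paper's construction.

```python
import numpy as np

# Monte Carlo estimate of the empirical Gibbs risk for linear classifiers
# h_w(x) = sign(w.x), with a Gaussian "posterior" Q over the weight vector w.
rng = np.random.default_rng(5)
d, m = 5, 200
X = rng.normal(size=(m, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true + 0.3 * rng.normal(size=m))      # noisy labels

def zero_one_loss(w, X, y):
    return np.mean(np.sign(X @ w) != y)

mu_Q, sigma_Q = w_true * 0.8, 0.5                        # a hypothetical posterior Q
ws = mu_Q + sigma_Q * rng.normal(size=(1000, d))         # draws h ~ Q
empirical_gibbs_risk = np.mean([zero_one_loss(w, X, y) for w in ws])
print("empirical Gibbs risk (cf. Eq. (3)) ~", round(empirical_gibbs_risk, 3))
```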
Due to space constraints, we fully present PAC-Bayes generalization bounds for losses with bounded variance. Other results for bounded losses, sub-Gaussian losses and sub-exponential losses are included in Appendix C. Still, we briefly discuss our new results in Section 3.2.
3.1 Loss Function with Bounded Variance
In this section, we present our PAC-Bayesian bounds for loss functions with bounded variance. Assuming that the loss has bounded variance under the data distribution (i.e., the variance of the loss is bounded by a finite constant for any hypothesis), we have the following PAC-Bayesian bounds.
Proposition 1 (The PAC-Bayesian bounds for loss functions with bounded variance).
Let the prior be a fixed distribution over a hypothesis space. For a given posterior distribution over an infinite hypothesis space, let the Gibbs risk and the empirical Gibbs risk be defined as in Equations (2) and (3), respectively. Then for any sample size and any $\alpha > 1$, with probability at least $1 - \delta$, simultaneously for all posterior distributions, we have
(4) |
By setting $\alpha = 2$ in Proposition 1, we have the following claim.
Corollary 2 (The PAC-Bayesian bounds with $\chi^2$-divergence for bounded-variance loss functions).
Let the prior be any fixed distribution over an infinite hypothesis space. For a given posterior distribution over an infinite hypothesis space, let the Gibbs risk and the empirical Gibbs risk be defined as in Equations (2) and (3), respectively. Then for any sample size, with probability at least $1 - \delta$, simultaneously for all posterior distributions, we have
(5) |
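The complexity term of Corollary 2 involves the $\chi^2$-divergence between the posterior and the prior. The sketch below (ours; the Gaussian prior and posterior are illustrative assumptions) shows how this term could be estimated by Monte Carlo using $\chi^2(Q\|P) = \mathbb{E}_{w \sim Q}[\,q(w)/p(w)\,] - 1$.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5
mu_p, s_p = np.zeros(d), 1.0          # prior  P = N(0, s_p^2 I)   (assumption)
mu_q, s_q = 0.3 * np.ones(d), 0.8     # posterior Q = N(mu_q, s_q^2 I)  (assumption)

def log_gauss(w, mu, s):
    # log density of N(mu, s^2 I) evaluated at rows of w
    return (-0.5 * np.sum((w - mu) ** 2, axis=-1) / s ** 2
            - d * np.log(s) - 0.5 * d * np.log(2 * np.pi))

ws = mu_q + s_q * rng.normal(size=(200000, d))           # w ~ Q
ratio = np.exp(log_gauss(ws, mu_q, s_q) - log_gauss(ws, mu_p, s_p))   # q(w)/p(w)
chi2_hat = np.mean(ratio) - 1.0
print("estimated chi^2(Q||P) ~", round(float(chi2_hat), 3))
# This estimate would be plugged into the chi^2 complexity term of Corollary 2.
```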
3.2 Discussion
Table 3 presents various PAC-Bayesian bounds based on our change of measure inequalities, depending on different assumptions on the loss function, and compares them with existing results in the literature. The importance of the various types of PAC-Bayes bounds is justified by the connection between PAC-Bayes bounds and regularization in a learning problem. PAC-Bayesian theory provides a guarantee that upper-bounds the risk of Gibbs predictors simultaneously for all posterior distributions. PAC-Bayes bounds enjoy this property due to change of measure inequalities. It is a well-known fact that KL-regularized objective functions are obtained from PAC-Bayes risk bounds, in which the minimizer is guaranteed to exist since the risk bounds hold simultaneously for all posterior distributions. The complexity term, partially controlled by the divergence, serves as a regularizer [10, 9, 3]. Additionally, the link between Bayesian inference techniques and PAC-Bayesian risk bounds was shown by [8]: the minimization of PAC-Bayesian risk bounds maximizes the Bayesian marginal likelihood. Our results pertain to important machine learning prediction problems, such as regression, classification and structured prediction, and indicate which regularizer to use. All PAC-Bayes bounds presented here are either novel, or have a tighter complexity term than existing results in the literature. Since our bounds are based on either the χ²-divergence or the α-divergence, we excluded the comparison with bounds based on the KL-divergence ([30, 4, 13, 25, 8, 11, 33]). Our results for sub-exponential losses are entirely novel. For the other cases, our bounds are tighter than existing bounds in terms of the complexity term: our bound for bounded losses has a tighter complexity term than those of [15, 20], and our bounds for sub-Gaussian loss functions have tighter complexity terms than those of [1] (see Table 3). In addition, our additive bounds, such as Equation (4), have better rates than the existing bounds in [1], since in our bounds the divergence term and the moment term are added, while in [1] they are multiplied.
4 Non-Asymptotic Interval for Monte Carlo Estimates
| Divergence | Non-asymptotic interval for Monte Carlo estimates | Reference |
|---|---|---|
| Pseudo $\alpha$ | — | Proposition 2 |
| Pearson $\chi^2$ | — | Proposition 3 |
| KL | — | Proposition 4 |
We now turn to another application of change of measure inequalities: a methodology that enables us to find a non-asymptotic interval for a Monte Carlo (MC) estimate. All results shown in this section are entirely novel and are summarized in Table 4. Under some circumstances, our methodology could be a promising alternative to existing MC methods (e.g., importance sampling and rejection sampling [28]), whose non-asymptotic intervals are hard to analyze. In this section, we consider the problem of estimating the expectation of a Lipschitz function with respect to a complicated target distribution that we are not able to sample from, by using the sample mean over a tractable sampling distribution. For a motivating application, consider the target being a probabilistic graphical model (see e.g., [14]). Assume that the sampling distribution is strongly log-concave with some parameter (see Appendix D for the definition). Under the above conditions, we claim the following proposition.
Proposition 2 (A general expression of the non-asymptotic interval for Monte Carlo estimates).
Let the samples be d-dimensional random vectors, and let the target and sampling distributions be probability measures on the same space. Let the sampling distribution be strongly log-concave with some parameter, and let the function of interest be L-Lipschitz with respect to the Euclidean norm. Suppose we draw i.i.d. samples from the sampling distribution. Then, with probability at least $1 - \delta$, we have
The second term on the right-hand side quantifies the bias of the empirical mean under the sampling distribution. Next, we present a more informative bound.
Proposition 3 ($\chi^2$-based expression of the non-asymptotic interval for Monte Carlo estimates).
Let the quantities be defined as in Proposition 2. Suppose we have i.i.d. samples from the sampling distribution. Then, with probability at least $1 - \delta$, we have
where
This result provides insight into how good a proposal distribution is: the effect of the deviation between the target and the sampling distribution, as measured by the $\chi^2$-divergence, is inflated by the empirical variance. This implication might support some results in the literature ([5, 6]). So far we have considered the pseudo α-divergence as well as the $\chi^2$-divergence. [7] presented an approximation of an arbitrary distribution by distributions with log-concave densities with respect to the KL-divergence, which motivates the following result.
Proposition 4 (KL-based expression of the non-asymptotic interval for Monte Carlo estimates).
Let the quantities be defined as in Proposition 2. Suppose we have i.i.d. samples from the sampling distribution. Then, with probability at least $1 - \delta$, we have
This result shows that, under the assumptions of Proposition 4, the empirical mean over the sampling distribution asymptotically differs from the target expectation by at most the quantity given above.
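To make the setting of this section concrete, the following sketch (ours; the Gaussian target, Gaussian sampler and test function are illustrative assumptions, not the paper's examples) compares the empirical mean under the sampling distribution with the target expectation and estimates the $\chi^2$-divergence that controls the bias term.

```python
import numpy as np

# Target Q = N(1, 1) is treated as "hard to sample", the sampling distribution
# P = N(0, 1) is strongly log-concave, and f = tanh is 1-Lipschitz.
rng = np.random.default_rng(7)
f = np.tanh
n = 50000
xs_p = rng.normal(0.0, 1.0, size=n)           # i.i.d. samples from P
empirical_mean_P = np.mean(f(xs_p))
target_mean_Q = np.mean(f(rng.normal(1.0, 1.0, size=10 ** 6)))   # reference value of E_Q[f]

# chi^2(Q || P) estimated under P: log dQ/dP for N(1,1) vs N(0,1) is x - 1/2.
log_ratio = xs_p - 0.5
chi2_hat = np.mean(np.exp(2 * log_ratio)) - 1.0

print("empirical mean under P:", round(empirical_mean_P, 3))
print("E_Q[f] (reference):    ", round(target_mean_Q, 3))
print("estimated chi^2(Q||P): ", round(float(chi2_hat), 3))
```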
References
- [1] Pierre Alquier and Benjamin Guedj. Simpler pac-bayesian bounds for hostile data. Machine Learning, 107(5):887–902, May 2018.
- [2] Amiran Ambroladze, Emilio Parrado-hernández, and John S. Shawe-taylor. Tighter pac-bayes bounds. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 9–16. MIT Press, 2007.
- [3] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 06 2002.
- [4] Olivier Catoni. Pac-bayesian supervised classification: The thermodynamics of statistical learning. 01 2007.
- [5] Julien Cornebise, Éric Moulines, and Jimmy Olsson. Adaptive methods for sequential importance sampling with application to state space models. Statistics and Computing, 18(4):461–480, 2008.
- [6] Adji Bousso Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David Blei. Variational inference via chi upper bound minimization. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2732–2741. Curran Associates, Inc., 2017.
- [7] Lutz Duembgen, Richard Samworth, and Dominic Schuhmacher. Approximation by log-concave distributions, with applications to regression. The Annals of Statistics, 39, 02 2010.
- [8] Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. Pac-bayesian theory meets bayesian inference. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1884–1892. Curran Associates, Inc., 2016.
- [9] Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. A pac-bayes risk bound for general loss functions. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 449–456. MIT Press, 2007.
- [10] Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, and François Laviolette. From pac-bayes bounds to kl regularization. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 603–610. Curran Associates, Inc., 2009.
- [11] Peter D. Grünwald and Nishant A. Mehta. A tight excess risk bound via a unified PAC-Bayesian–Rademacher–Shtarkov–MDL complexity. In Aurélien Garivier and Satyen Kale, editors, Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pages 433–465, Chicago, Illinois, 22–24 Mar 2019. PMLR.
- [12] Benjamin Guedj. A primer on pac-bayesian learning. Preprint arXiv:1901.05353, 05 2019.
- [13] Matthew Holland. Pac-bayes under potentially heavy tails. In Advances in Neural Information Processing Systems 32, pages 2715–2724. Curran Associates, Inc., 2019.
- [14] Jean Honorio. Lipschitz parametrization of probabilistic graphical models. pages 347–354. Conference on Uncertainty in Artificial Intelligence, 07 2011.
- [15] Jean Honorio and Tommi Jaakkola. Tight bounds for the expected risk of linear classifiers and pac-bayes finite-sample guarantees. In International Conference on Artificial Intelligence and Statistics, page 384–392, 04 2014.
- [16] J. Jiao, Y. Han, and T. Weissman. Dependence measures bounding the exploration bias for general measurements. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 1475–1479, 2017.
- [17] Markos A. Katsoulakis, Luc Rey-Bellet, and Jie Wang. Scalable information inequalities for uncertainty quantification. Journal of Computational Physics, 336:513 – 545, 2017.
- [18] E.L. Lehmann and G. Casella. Theory of Point Estimation. Springer Verlag, 1998.
- [19] Guy Lever, Francois Laviolette, and John Shawe-Taylor. Tighter pac-bayes bounds through distribution-dependent priors. Theor. Comput. Sci., 473:4–28, 02 2013.
- [20] Luc Bégin, Pascal Germain, François Laviolette, and Jean-Francis Roy. PAC-Bayesian bounds based on the Rényi divergence. In International Conference on Artificial Intelligence and Statistics, 2016.
- [21] David McAllester. Simplified pac-bayesian margin bounds. In Bernhard Schölkopf and Manfred K. Warmuth, editors, Learning Theory and Kernel Machines, pages 203–215, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.
- [22] David A. McAllester. Some pac-bayesian theorems. In Machine Learning, pages 230–234. ACM Press, 1998.
- [23] David A. McAllester. Pac-bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT ’99, pages 164–170, New York, NY, USA, 1999. ACM.
- [24] David A. McAllester. Pac-bayesian stochastic model selection. Machine Learning, 51(1):5–21, Apr 2003.
- [25] Zakaria Mhammedi, Peter Grünwald, and Benjamin Guedj. Pac-bayes un-expected bernstein inequality. In Advances in Neural Information Processing Systems 32, pages 12202–12213. Curran Associates, Inc., 2019.
- [26] XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theor., 56(11):5847–5861, November 2010.
- [27] XuanLong Nguyen, M.J. Wainwright, and Michael Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. Information Theory, IEEE Transactions on, 56:5847 – 5861, 12 2010.
- [28] Christian P. Robert and George Casella. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag, Berlin, Heidelberg, 2005.
- [29] Avraham Ruderman, Mark D. Reid, Darío García-García, and James Petterson. Tighter variational representations of f-divergences via restriction to probability measures. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1155–1162, USA, 2012. Omnipress.
- [30] Matthias Seeger. Pac-bayesian generalisation error bounds for gaussian process classification. J. Mach. Learn. Res., 3:233–269, March 2003.
- [31] Y. Seldin, F. Laviolette, N. Cesa-Bianchi, J. Shawe-Taylor, and P. Auer. Pac-bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086–7093, Dec 2012.
- [32] John Shawe-Taylor and Robert C. Williamson. A pac analysis of a bayesian estimator. In Proceedings of the Tenth Annual Conference on Computational Learning Theory, COLT ’97, pages 2–9, New York, NY, USA, 1997. ACM.
- [33] Rishit Sheth and Roni Khardon. Excess risk bounds for the bayes risk using variational inference in latent gaussian models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5151–5161. Curran Associates, Inc., 2017.
- [34] Ilya O Tolstikhin and Yevgeny Seldin. Pac-bayes-empirical-bernstein inequality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 109–117. Curran Associates, Inc., 2013.
- [35] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
Appendix A Detailed Proofs
A.1 Proof of Corollary 1
Proof.
Removing the supremum and rearranging the terms in Lemma 1, we prove our claim. ∎
A.2 Proof of Lemma 2
Proof.
By Theorem 1, we want to find
(7) |
where the optimization variable is a measurable function on $\Omega$. In order to find the supremum on the right-hand side, we consider the following Lagrangian:
where the Lagrange multiplier enforces that $p$ is a probability density over $\Omega$. Taking the functional derivative with respect to $p$, renormalizing by dropping the factor multiplying all terms, and setting the result to zero, we obtain the form of the optimal $p$. Since $p$ is constrained to integrate to one, the multiplier is determined, and we obtain the optimum $p$. Plugging it into Equation (7), we have
which simplifies to the bound claimed in Lemma 2. ∎
A.3 Proof of Lemma 3
Proof.
Notice that the Pearson $\chi^2$-divergence is obtained by setting $f(t) = (t-1)^2$. In order to find the convex conjugate from Equation (1), let $g(t) = ty - (t-1)^2$. We need to find the supremum of $g$ with respect to $t$. Differentiating with respect to $t$ and setting the derivative to zero, we have $t = 1 + y/2$. Thus, plugging this value in for $t$, we obtain $f^*(y) = y + y^2/4$. Plugging $f$ and $f^*$ into Corollary 1, we prove our claim. ∎
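The conjugate computation above can also be reproduced symbolically; the following sketch (ours, not the paper's code) verifies the stationary point and the closed form with SymPy.

```python
import sympy as sp

# For f(t) = (t - 1)^2, the stationary point of t*y - f(t) is t = 1 + y/2 and
# the conjugate is f*(y) = y + y^2/4.
t, y = sp.symbols('t y', real=True)
g = t * y - (t - 1) ** 2
t_star = sp.solve(sp.diff(g, t), t)[0]
f_star = sp.simplify(g.subs(t, t_star))
assert sp.simplify(t_star - (1 + y / 2)) == 0
assert sp.simplify(f_star - (y + y ** 2 / 4)) == 0
print("f*(y) =", f_star)
```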
A.4 Proof of Lemma 4
Proof.
By Theorem 1, we want to find
In order to find the supremum on the right hand side, we consider the following Lagrangian
Then, it is not hard to see that if the multiplier satisfies the required condition, the objective is maximized at a finite point, and otherwise it diverges to infinity. Thus, Lemma 4 holds for the stated class of functions. If we add a suitable constant to both sides, then the function is bounded between 0 and 1, as claimed. ∎
A.5 Proof of Lemma 5
A.6 Proof of Lemma 6
A.7 Proof of Lemma 7
A.8 Proof of Lemma 8
A.9 Proof of Lemma 10
Proof.
The third line is due to Hölder's inequality. ∎
A.10 Proof of Lemma 12
Proof.
Consider the following covariance.
On the other hand,
which proves our claim. ∎
A.11 Proof of Proposition 1
A.12 Proof of Proposition 2
Proof.
We start from Lemma 12; it remains to bound the two terms appearing on its right-hand side. Since the sampling distribution is strongly log-concave and the function is L-Lipschitz, Theorem 3.16 in [35] implies that the function value is a sub-Gaussian random variable with the corresponding parameter. Therefore, we have
Thus,
On the other hand, with an appropriate auxiliary definition, we have
The inequality is due to sub-Gaussianity. Putting everything together with Lemma 12, we prove our claim. ∎
A.13 Proof of Proposition 3
Proof.
By Lemma 11, we have
where the first term is bounded as in Proposition 2.
It remains to bound the second term. Note that the square of a sub-Gaussian random variable is a sub-exponential random variable with corresponding parameters (see Appendix B in [15] for details). Therefore,
Thus,
Putting everything together, we prove our claim. ∎
A.14 Proof of Proposition 4
Proof.
Note that, from the Donsker-Varadhan representation of the KL-divergence, we have the following change of measure inequality for any function $\phi$ and any probability measures $Q \ll P$:
$$\mathbb{E}_{Q}[\phi] \le \mathrm{KL}(Q \| P) + \log \mathbb{E}_{P}\big[e^{\phi}\big].$$
(8) |
Suppose first that the function satisfies the following condition.
By plugging it into Equation (8), we have
On the other hand, suppose the function satisfies the following condition.
By plugging it into Equation (8), we have
Putting both cases together, we have
Suppose the samples are i.i.d. random variables. We can easily see that we have
(9)
where the first quantity is bounded as in Proposition 2.
(10)
As shown in Proposition 2, the function value is a sub-Gaussian random variable with the corresponding parameter. Therefore, by Definition 2,
(11)
since both values in the maximum admit the same bound. By (9), (10) and (11), we prove our claim. ∎
Appendix B Pseudo α-divergence is a member of the family of f-divergences
Proof.
Consider the generator associated with the pseudo α-divergence.
The first inequality is due to the triangle inequality and the second inequality is due to Hölder's inequality. By the definition of convexity, the generator is convex. Also, it satisfies $f(1) = 0$. By the definition of the f-divergence, we prove our claim. ∎
Appendix C PAC-Bayesian bounds for bounded, sub-Gaussian and sub-exponential losses
Here we complement our results in Section 3, by showing our PAC-Bayesian generalization bounds for bounded, sub-Gaussian and sub-exponential losses.
C.1 Bounded Loss Function
First, let us assume that the loss function is bounded, i.e., it takes values in a bounded interval for every hypothesis and example. Note that, for such losses, we cannot use the total variation, squared Hellinger, reverse KL and Neyman $\chi^2$ divergences, because the corresponding change of measure inequalities constrain the function to a restricted range.
Proposition 5 (The PAC-Bayesian bounds for bounded loss function).
Let the prior be any fixed distribution over an infinite hypothesis space. For a given posterior distribution over an infinite hypothesis space, let the Gibbs risk and the empirical Gibbs risk be defined as in Equations (2) and (3), respectively. Then for any sample size and any $\alpha > 1$, with probability at least $1 - \delta$, simultaneously for all posterior distributions, we have
(12) |
(13) |
Proof.
Suppose that we have a convex function that measures the discrepancy between the observed empirical Gibbs risk and the true Gibbs risk on the data distribution. Given that, the purpose of the PAC-Bayesian theorem is to upper-bound this discrepancy for any posterior. We introduce two auxiliary quantities, where a subscript indicates the dependency on the data distribution. By applying Jensen's inequality to the convex function in the first step,
(14) |
where the remaining quantity is defined analogously. By Hoeffding's inequality, for any hypothesis and any threshold,
Setting , we have
Here the relation denotes that the inequality holds with probability at least $1 - \delta$. The second line holds due to Hoeffding's inequality. Consequently, we have
Also note that
(15) |
By applying Equations (14) and (15) to Lemma 10,
which proves Equation (12). Applying Equations (14) and (15) to Lemma 5 in the same manner proves Equation (13). ∎
Equation (12) gives a tighter bound than Proposition 2 in [1]. Also, Equation (13) gives a novel expression of the PAC-Bayesian theorem with the α-divergence. The same arguments apply to the bounds in Propositions 6, 7 and 1.
The following corollary is an immediate consequence of this proposition for $\alpha = 2$.
Corollary 3 (The PAC-Bayesian bounds with $\chi^2$-divergence for bounded loss functions).
Let the prior be any fixed distribution over an infinite hypothesis space. For a given posterior distribution over an infinite hypothesis space, let the Gibbs risk and the empirical Gibbs risk be defined as in Equations (2) and (3), respectively. Then for any sample size, with probability at least $1 - \delta$, simultaneously for all posterior distributions, we have
(16) |
(17) |
These are novel PAC-Bayesian bounds for bounded loss functions based on the $\chi^2$-divergence. Most PAC-Bayesian bounds for bounded loss functions use the KL-divergence for the complexity term (e.g., [4, 30, 22]). [15] and [20] contain bounds for bounded loss functions with the $\chi^2$-divergence. Compared to their bounds, the bound (16) has a tighter complexity term. Also, the PAC-Bayes bound (17) has a unique characteristic: two of its terms are independent of each other, since one is not multiplied by a factor involving the other. Please see Table 3 for details.
C.2 Sub-Gaussian Loss Function
In some contexts, such as regression, considering bounded loss functions is restrictive. Next, we relax the restrictions on the loss function to deal with unbounded losses. We assume that, for any hypothesis, the loss is sub-Gaussian. First, we recall the definition of a sub-Gaussian random variable [35].
Definition 2.
A random variable $X$ with expectation $\mu = \mathbb{E}[X]$ is said to be sub-Gaussian with variance proxy $\sigma^2$ if for any $\lambda \in \mathbb{R}$,
$$\mathbb{E}\big[e^{\lambda (X - \mu)}\big] \le e^{\sigma^2 \lambda^2 / 2}.$$
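For instance, a Rademacher variable ($X = \pm 1$ with probability $1/2$) is sub-Gaussian with variance proxy $1$, since its moment generating function is $\cosh(\lambda) \le e^{\lambda^2/2}$; a small numerical spot-check (ours):

```python
import numpy as np

# Rademacher variable: E[exp(lambda X)] = cosh(lambda) <= exp(lambda^2 / 2).
lambdas = np.linspace(-10, 10, 2001)
assert np.all(np.cosh(lambdas) <= np.exp(lambdas ** 2 / 2) + 1e-12)
print("Rademacher MGF check passed")
```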
Next, we present our PAC-Bayesian bounds.
Proposition 6 (The PAC-Bayesian bounds for sub-Gaussian loss function).
Let the prior be a fixed distribution over an infinite hypothesis space. For a given posterior distribution over an infinite hypothesis space, let the Gibbs risk and the empirical Gibbs risk be defined as in Equations (2) and (3), respectively. Then for any sample size and any $\alpha > 1$, with probability at least $1 - \delta$, simultaneously for all posterior distributions, we have
Proof.
Suppose the convex discrepancy function is defined as in Proposition 5. Employing Chernoff's bound, the tail probability bound for sub-Gaussian random variables [35] is given as follows.
(18) |
Substituting the relevant quantities into the tail bound in Equation (18), for any threshold we have
Setting the threshold appropriately, we have
where the first quantity is defined as in Equation (14). By Equation (15), we have
(19) |
On the other hand, we may upper-bound the remaining term in the following way (as in Proposition 6 in [1]).
With the auxiliary variables defined accordingly, the resulting quantity is a sub-Gaussian random variable by our assumption.
The following corollary is an immediate consequence of Proposition 6 for $\alpha = 2$.
Corollary 4 (The PAC-Bayesian bounds with $\chi^2$-divergence for sub-Gaussian loss functions).
Let the prior be any fixed distribution over an infinite hypothesis space. For a given posterior distribution over an infinite hypothesis space, let the Gibbs risk and the empirical Gibbs risk be defined as in Equations (2) and (3), respectively. Then for any sample size, with probability at least $1 - \delta$, simultaneously for all posterior distributions, we have
(21) |
(22) |
[1] proved a PAC-Bayes bound for sub-Gaussian loss functions with the α-divergence. It is noteworthy that the generator may be any convex function, and a different choice of generator leads to different bounds. For instance, a different choice of the function in Lemma 10 results in
a bound which is quite similar to the bounds in Proposition 6 in [1] and looser than Equation (21). We obtained a tighter bound due to our choice of the function.
C.3 Sub-Exponential Loss Function
We now turn to a more general class, where the loss is sub-exponential for every hypothesis. First, we define sub-exponentiality [35].
Definition 3.
A random variable $X$ with expectation $\mu = \mathbb{E}[X]$ is said to be sub-exponential with parameters $(\nu, b)$ if for any $|\lambda| < 1/b$,
$$\mathbb{E}\big[e^{\lambda (X - \mu)}\big] \le e^{\nu^2 \lambda^2 / 2}.$$
Next, we provide our PAC-Bayesian bounds.
Proposition 7 (The PAC-Bayesian bounds for sub-exponential loss function).
Let the prior be a fixed distribution over an infinite hypothesis space. For a given posterior distribution over an infinite hypothesis space, let the Gibbs risk and the empirical Gibbs risk be defined as in Equations (2) and (3), respectively. Then for any sample size and any $\alpha > 1$, with probability at least $1 - \delta$, simultaneously for all posterior distributions, we have
where
Proof.
Note that, for sufficiently large sample sizes, the sub-exponential case behaves like the sub-Gaussian one. However, when the sample size is small, a tighter bound can be obtained. This shows the advantage of assuming sub-exponentiality over sub-Gaussianity. By setting $\alpha = 2$ in Proposition 7, we have the following corollary.
Corollary 5 (The PAC-Bayesian bounds with $\chi^2$-divergence for sub-exponential loss functions).
Let the prior be any fixed distribution over an infinite hypothesis space. For a given posterior distribution over an infinite hypothesis space, let the Gibbs risk and the empirical Gibbs risk be defined as in Equations (2) and (3), respectively. Then for any sample size, with probability at least $1 - \delta$, simultaneously for all posterior distributions, we have
where
To the best of our knowledge, our bounds for sub-exponential losses are entirely novel.
Appendix D Log-concave distribution
We say that a distribution with a density (with respect to the Lebesgue measure) is a strongly log-concave distribution if the logarithm of the density is strongly concave. Equivalently stated, this condition means that the density can be expressed as $\exp(-\psi(x))$, where the function $\psi$ is strongly convex, meaning that there is some $\gamma > 0$ such that
$$\psi\big(\lambda x + (1 - \lambda) y\big) \le \lambda \psi(x) + (1 - \lambda) \psi(y) - \frac{\gamma}{2} \lambda (1 - \lambda) \| x - y \|_2^2$$
for all $\lambda \in [0, 1]$ and all $x, y$.
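For example, the standard Gaussian density $\exp(-\|x\|^2/2)$ corresponds to $\psi(x) = \|x\|^2/2$, which satisfies the inequality above with $\gamma = 1$ (in fact with equality); a small numerical spot-check (ours):

```python
import numpy as np

# psi(x) = ||x||^2 / 2 is strongly convex with gamma = 1, so the standard
# Gaussian is strongly log-concave with parameter 1.
rng = np.random.default_rng(8)
psi = lambda x: 0.5 * np.sum(x ** 2)
gamma = 1.0
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lam = rng.uniform()
    lhs = psi(lam * x + (1 - lam) * y)
    rhs = (lam * psi(x) + (1 - lam) * psi(y)
           - 0.5 * gamma * lam * (1 - lam) * np.sum((x - y) ** 2))
    assert lhs <= rhs + 1e-10
print("strong convexity check passed")
```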