
Novel Change of Measure Inequalities with Applications to PAC-Bayesian Bounds and Monte Carlo Estimation

Yuki Ohnishi Department of Statistics, Purdue University Jean Honorio Department of Computer Science, Purdue University
Abstract

We introduce several novel change of measure inequalities for two families of divergences: $f$-divergences and $\alpha$-divergences. We show how the variational representation for $f$-divergences leads to novel change of measure inequalities. We also present a multiplicative change of measure inequality for $\alpha$-divergences and a generalized version of the Hammersley-Chapman-Robbins inequality. Finally, we present several applications of our change of measure inequalities, including PAC-Bayesian bounds for various classes of losses and non-asymptotic intervals for Monte Carlo estimates.

1 Introduction

The Probably Approximately Correct (PAC) Bayesian inequality was introduced by [32] and [23]. This framework allows us to produce PAC performance bounds (in the sense of a loss function) for Bayesian-flavored estimators [12], and several extensions have been proposed to date (see e.g., [30, 24, 21, 31, 2]). The core of these theoretical results is summarized by a change of measure inequality. A change of measure inequality is an expectation inequality involving two probability measures, in which the expectation with respect to one measure is upper-bounded by the divergence between the two measures and the moments with respect to the other measure. The change of measure inequality also plays a major role in information theory. For instance, [17] derived robust uncertainty quantification bounds for statistical estimators of interest with change of measure inequalities.

Recent research efforts have been put into more generic perspectives to obtain PAC-Bayes bounds and to get rid of assumptions such as boundedness of the loss function (e.g., [19, 34, 25, 13, 11]). The PAC-Bayesian bounds contained in these works rely heavily on one of the most famous change of measure inequalities, the Donsker-Varadhan representation for the KL-divergence. Several change of measure inequalities have recently been proposed along with PAC-Bayes bounds. [20] proposed a proof scheme of PAC-Bayesian bounds based on the Rényi divergence. [15] proposed an inequality for the $\chi^{2}$-divergence and derived a PAC-Bayesian bound for linear classification. [1] proposed a novel change of measure inequality and PAC-Bayesian bounds based on the $\alpha$-divergence. The aforementioned works were proposed for specific purposes; a comprehensive study of change of measure inequalities has not been performed yet.

Our work proposes several novel and general change of measure inequalities for two families of divergences: $f$-divergences and $\alpha$-divergences. It is a well-known fact that the $f$-divergence can be variationally characterized as the maximum of an optimization problem rooted in the convexity of the function $f$. This variational representation has recently been used in various applications of information theory, such as $f$-divergence estimation [26] and quantification of the bias in adaptive data analysis [16]. Recently, [29] showed that the variational representation of the $f$-divergence can be tightened when constrained to the space of probability densities.

Our main contributions are as follows:

  • By using the variational representation for $f$-divergences, we derive several change of measure inequalities. We perform the analysis for the constrained regime (restricted to the space of probability densities) as well as the unconstrained regime.

  • We present a multiplicative change of measure inequality for the family of $\alpha$-divergences. This generalizes the previous results [1, 15] for the $\alpha$-divergence, which in turn applies to PAC-Bayes inequalities for types of losses not considered before.

  • We also generalize prior results for the Hammersley-Chapman-Robbins inequality [18] from the particular $\chi^{2}$-divergence to the family of $\alpha$-divergences.

  • We provide new PAC-Bayesian bounds with the $\alpha$-divergence and the $\chi^{2}$-divergence from our novel change of measure inequalities for bounded, sub-Gaussian, sub-exponential and bounded-variance loss functions. Our results are either novel or have a tighter complexity term than existing results in the literature, and pertain to important machine learning prediction problems, such as regression, classification and structured prediction.

  • We provide a new scheme for estimation of non-asymptotic intervals for Monte Carlo estimates. Our results indicate that the empirical mean over a sampling distribution concentrates around an expectation with respect to any arbitrary distribution.

2 Change of Measure Inequalities

Table 1: Summary of the change of measure inequalities. For simplicity, we denote $\mathbb{E}_{P}[\cdot]\equiv\mathbb{E}_{h\sim P}[\cdot]$ and $\phi\equiv\phi(h)$. Each upper bound holds for every $Q$ and a fixed $P$; the reference is given in parentheses.

Constrained variational representation:
  • KL: $\mathbb{E}_{Q}[\phi]\leq KL(Q\|P)+\log(\mathbb{E}_{P}[e^{\phi}])$ ([23])
  • Pearson $\chi^{2}$: $\mathbb{E}_{Q}[\phi]\leq\chi^{2}(Q\|P)+\mathbb{E}_{P}[\phi]+\frac{1}{4}\mathbb{V}\mathrm{ar}_{P}[\phi]$ (Lemma 2)
  • Total variation: $\mathbb{E}_{Q}[\phi]\leq TV(Q\|P)+\mathbb{E}_{P}[\phi]$ for $\phi\in[0,1]$ (Lemma 4)

Unconstrained variational representation:
  • KL: $\mathbb{E}_{Q}[\phi]\leq KL(Q\|P)+(\mathbb{E}_{P}[e^{\phi}]-1)$ ([23])
  • Pearson $\chi^{2}$: $\mathbb{E}_{Q}[\phi]\leq\chi^{2}(Q\|P)+\mathbb{E}_{P}[\phi]+\frac{1}{4}\mathbb{E}_{P}[\phi^{2}]$ (Lemma 3)
  • Total variation: $\mathbb{E}_{Q}[\phi]\leq TV(Q\|P)+\mathbb{E}_{P}[\phi]$ for $\phi\in[0,1]$
  • $\alpha$: $\mathbb{E}_{Q}[\phi]\leq D_{\alpha}(Q\|P)+\frac{(\alpha-1)^{\frac{\alpha}{\alpha-1}}}{\alpha}\mathbb{E}_{P}[\phi^{\frac{\alpha}{\alpha-1}}]+\frac{1}{\alpha(\alpha-1)}$ (Lemma 5)
  • Squared Hellinger: $\mathbb{E}_{Q}[\phi]\leq H^{2}(Q\|P)+\mathbb{E}_{P}[\frac{\phi}{1-\phi}]$ for $\phi<1$ (Lemma 6)
  • Reverse KL: $\mathbb{E}_{Q}[\phi]\leq\overline{KL}(Q\|P)+\mathbb{E}_{P}[\log(\frac{1}{1-\phi})]$ for $\phi<1$ (Lemma 7)
  • Neyman $\chi^{2}$: $\mathbb{E}_{Q}[\phi]\leq\overline{\chi^{2}}(Q\|P)+2-2\mathbb{E}_{P}[\sqrt{1-\phi}]$ for $\phi<1$ (Lemma 8)

Multiplicative:
  • Pearson $\chi^{2}$: $\mathbb{E}_{Q}[\phi]\leq\sqrt{(\chi^{2}(Q\|P)+1)\mathbb{E}_{P}[\phi^{2}]}$ ([15])
  • $\alpha$: $\mathbb{E}_{Q}[\phi]\leq(\alpha(\alpha-1)D_{\alpha}(Q\|P)+1)^{\frac{1}{\alpha}}(\mathbb{E}_{P}[|\phi|^{\frac{\alpha}{\alpha-1}}])^{\frac{\alpha-1}{\alpha}}$ ([1])

Generalized HCR:
  • Pearson $\chi^{2}$: $\chi^{2}(Q\|P)\geq\frac{(\mathbb{E}_{Q}[\phi]-\mathbb{E}_{P}[\phi])^{2}}{\mathbb{V}\mathrm{ar}_{P}[\phi]}$ ([18])
  • Pseudo $\alpha$: $|\mathbb{E}_{Q}[\phi]-\mathbb{E}_{P}[\phi]|\leq\widetilde{\mathcal{D}_{\alpha}}(Q\|P)^{\frac{1}{\alpha}}(\mathbb{E}_{P}[|\phi-\mu_{P}|^{\frac{\alpha}{\alpha-1}}])^{\frac{\alpha-1}{\alpha}}$ (Lemma 12)
Table 2: Some common $f$-divergences with their corresponding generators, for probability measures $P$ and $Q$ defined on a common space $\mathcal{H}$.
  • KL: $KL(Q\|P)=\int_{\mathcal{H}}\log\frac{dQ}{dP}\,dQ$, generator $f(t)=t\log t-t+1$
  • Reverse KL: $\overline{KL}(Q\|P)=\int_{\mathcal{H}}\log\frac{dP}{dQ}\,dP$, generator $f(t)=-\log t$
  • Pearson $\chi^{2}$: $\chi^{2}(Q\|P)=\int_{\mathcal{H}}(\frac{dQ}{dP}-1)^{2}\,dP$, generator $f(t)=(t-1)^{2}$
  • Neyman $\chi^{2}$: $\overline{\chi^{2}}(Q\|P)=\int_{\mathcal{H}}(\frac{dP}{dQ}-1)^{2}\,dQ$, generator $f(t)=\frac{(1-t)^{2}}{t}$
  • Total variation: $TV(Q\|P)=\frac{1}{2}\int_{\mathcal{H}}|\frac{dQ}{dP}-1|\,dP$, generator $f(t)=\frac{1}{2}|t-1|$
  • Squared Hellinger: $H^{2}(Q\|P)=\int_{\mathcal{H}}(\sqrt{\frac{dQ}{dP}}-1)^{2}\,dP$, generator $f(t)=(\sqrt{t}-1)^{2}$
  • $\alpha$: $D_{\alpha}(Q\|P)=\frac{1}{\alpha(\alpha-1)}\int_{\mathcal{H}}(|\frac{dQ}{dP}|^{\alpha}-1)\,dP$, generator $f(t)=\frac{t^{\alpha}-1}{\alpha(\alpha-1)}$
  • Pseudo $\alpha$: $\widetilde{\mathcal{D}_{\alpha}}(Q\|P)=\int_{\mathcal{H}}|\frac{dQ}{dP}-1|^{\alpha}\,dP$, generator $f(t)=|t-1|^{\alpha}$
  • $\phi_{p}$ [1]: $D_{\phi_{p}-1}(Q\|P)=\int_{\mathcal{H}}(|\frac{dQ}{dP}|^{p}-1)\,dP$, generator $f(t)=t^{p}-1$

In this section, we formalize the definition of $f$-divergences and present the constrained variational representation (restricted to the space of probability measures) as well as the unconstrained representation. Then, we provide different change of measure inequalities for several divergences. We also provide multiplicative bounds as well as a generalized Hammersley-Chapman-Robbins bound. Table 1 summarizes our results.

2.1 Change of Measure Inequality from the Variational Representation of $f$-divergences

Let $f:(0,+\infty)\rightarrow\mathbb{R}$ be a convex function. The convex conjugate $f^{*}$ of $f$ is defined by:

$$f^{*}(y)=\sup_{x\in\mathbb{R}}(xy-f(x))$$ (1)

The definition of $f^{*}$ yields the following Young-Fenchel inequality

$$f(x)\geq xy-f^{*}(y)$$

which holds for any $y$. Using the notation of convex conjugates, the $f$-divergence and its variational representation are defined as follows.

Definition 1 ($f$-divergence and its variational representation).

Let $\mathcal{H}$ be any arbitrary domain. Let $P$ and $Q$ denote probability measures over the Borel $\sigma$-field on $\mathcal{H}$. Additionally, let $f:[0,\infty)\rightarrow\mathbb{R}$ be a convex and lower semi-continuous function that satisfies $f(1)=0$. The $f$-divergence from $P$ to $Q$ is defined as

$$D_{f}(Q\|P):=\mathbb{E}_{P}\bigg[f\bigg(\frac{dQ}{dP}\bigg)\bigg]$$

For simplicity, we denote $\mathbb{E}_{P}[\cdot]\equiv\mathbb{E}_{h\sim P}[\cdot]$ in the sequel. Many common divergences, such as the KL-divergence and the Hellinger divergence, are members of the family of $f$-divergences, each corresponding to a particular choice of $f(t)$. Table 2 presents the definition of each divergence with the corresponding generator $f(t)$. It is well known that the $f$-divergence can be characterized by the following variational representation.

Lemma 1 (Variational representation of the $f$-divergence, Lemma 1 in [27]).

Let $\mathcal{H}$, $P$, $Q$ and $f$ be defined as in Definition 1. Let $\phi:\mathcal{H}\rightarrow\mathbb{R}$ be a real-valued function. The $f$-divergence from $P$ to $Q$ satisfies

$$D_{f}(Q\|P)\geq\sup_{\phi}\ \mathbb{E}_{Q}[\phi]-\mathbb{E}_{P}[f^{*}(\phi)]$$

[29] shows that this variational representation for $f$-divergences can be tightened.

Theorem 1 (Change of measure inequality from the constrained variational representation for $f$-divergences [29]).

Let $\mathcal{H}$, $P$, $Q$ and $f$ be defined as in Definition 1. Let $\phi:\mathcal{H}\rightarrow\mathbb{R}$ be a real-valued function. Let $\Delta(\mu):=\{g:\mathcal{H}\rightarrow\mathbb{R}\,:\,g\geq 0,\ \|g\|_{1}=1\}$ denote the space of probability densities with respect to $\mu$, where the norm is defined as $\|g\|_{1}:=\int_{\mathcal{H}}|g|\,d\mu$, given a measure $\mu$ over $\mathcal{H}$. The general form of the change of measure inequality for $f$-divergences is given by

$$\mathbb{E}_{Q}[\phi]\leq D_{f}(Q\|P)+(\mathbb{I}^{R}_{f,P})^{*}(\phi),\qquad(\mathbb{I}^{R}_{f,P})^{*}(\phi)=\sup_{p\in\Delta(P)}\ \mathbb{E}_{P}[\phi p]-\mathbb{E}_{P}[f(p)]$$

where $p$ is constrained to be a probability density function.

The famous Donsker-Varadhan representation for the KL-divergence, which is used in most PAC-Bayesian bounds, can actually be derived from this tighter representation by setting $f(t)=t\log(t)$. However, it is not always easy to obtain a closed-form solution from Theorem 1, since it requires variational calculus, and in some cases no closed-form solution exists. In such cases, we can use the following corollary, which yields looser bounds but only requires finding a convex conjugate.

Corollary 1 (Change of measure inequality from the unconstrained variational representation for $f$-divergences).

Let $P$, $Q$, $f$ and $\phi$ be defined as in Theorem 1. By Definition 1, we have

$$\forall Q\text{ on }\mathcal{H}:\ \mathbb{E}_{Q}[\phi]\leq D_{f}(Q\|P)+\mathbb{E}_{P}[f^{*}(\phi)]$$
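As a short illustration of Corollary 1 (our own worked example, not part of the original statements), take the KL generator $f(t)=t\log t-t+1$ from Table 2. Its convex conjugate is

$$f^{*}(y)=\sup_{x>0}\big(xy-x\log x+x-1\big)=e^{y}-1,$$

with the supremum attained at $x=e^{y}$, so Corollary 1 yields $\mathbb{E}_{Q}[\phi]\leq KL(Q\|P)+\mathbb{E}_{P}[e^{\phi}]-1$, which is exactly the unconstrained KL bound of [23] listed in Table 1.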

Detailed proofs can be found in Appendix A. By choosing an appropriate function $f$ and deriving the constrained maximization term $(\mathbb{I}^{R}_{f,P})^{*}(\phi)$ with the help of variational calculus, we can obtain an upper bound based on the corresponding divergence $D_{f}(Q\|P)$. Next, we discuss the case of the $\chi^{2}$-divergence.

Lemma 2 (Change of measure inequality from the constrained representation of the $\chi^{2}$-divergence).

Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\forall Q\text{ on }\mathcal{H}:\ \mathbb{E}_{Q}[\phi]\leq\chi^{2}(Q\|P)+\mathbb{E}_{P}[\phi]+\frac{1}{4}\mathbb{V}\mathrm{ar}_{P}[\phi]$$

The bound in Lemma 2 is slightly tighter than the one without the constraint. The change of measure inequality without the constraint is given as follows.

Lemma 3 (Change of measure inequality from the unconstrained representation of the Pearson $\chi^{2}$-divergence).

Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\forall Q\text{ on }\mathcal{H}:\ \mathbb{E}_{Q}[\phi]\leq\chi^{2}(Q\|P)+\mathbb{E}_{P}[\phi]+\frac{1}{4}\mathbb{E}_{P}[\phi^{2}]$$
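As a quick numerical sanity check of Lemma 3 (our own illustration, not part of the paper's analysis), the following Python sketch evaluates both sides of the bound for two arbitrary discrete distributions; the distributions, the test function $\phi$, and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arbitrary discrete distributions P and Q over 10 atoms.
p = rng.random(10); p /= p.sum()
q = rng.random(10); q /= q.sum()

# An arbitrary test function phi evaluated on the atoms.
phi = rng.normal(size=10)

# Pearson chi-square divergence: integral of (dQ/dP - 1)^2 dP.
chi2 = np.sum((q - p) ** 2 / p)

lhs = np.sum(q * phi)                                      # E_Q[phi]
rhs = chi2 + np.sum(p * phi) + 0.25 * np.sum(p * phi**2)   # Lemma 3 bound

print(f"E_Q[phi] = {lhs:.4f}  <=  bound = {rhs:.4f}")
assert lhs <= rhs + 1e-12
```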

As might be apparent, the bound in Lemma 2 is tighter than the one in Lemma 3 by $\frac{1}{4}(\mathbb{E}_{P}[\phi])^{2}$, because $\mathbb{V}\mathrm{ar}_{P}[\phi]\leq\mathbb{E}_{P}[\phi^{2}]$. Next, we discuss the case of the total variation divergence.

Lemma 4 (Change of measure inequality from the constrained representation of the total variation divergence).

Let $\phi:\mathcal{H}\rightarrow[0,1]$ be a real-valued function. Let $P$ and $Q$ be defined as in Theorem 1. Then we have

$$\forall Q\text{ on }\mathcal{H}:\ \mathbb{E}_{Q}[\phi]\leq TV(Q\|P)+\mathbb{E}_{P}[\phi]$$

Interestingly, we can obtain the same bound on the total variation divergence even if we use the unconstrained variational representation. Next, we state our result for $\alpha$-divergences.

Lemma 5 (Change of measure inequality from the unconstrained representation of the $\alpha$-divergence).

Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. For $\alpha>1$, we have

$$\forall Q\text{ on }\mathcal{H}:\ \mathbb{E}_{Q}[\phi]\leq D_{\alpha}(Q\|P)+\frac{(\alpha-1)^{\frac{\alpha}{\alpha-1}}}{\alpha}\mathbb{E}_{P}[\phi^{\frac{\alpha}{\alpha-1}}]+\frac{1}{\alpha(\alpha-1)}$$

We can obtain bounds based on the squared Hellinger divergence $H^{2}(Q\|P)$, the reverse KL-divergence $\overline{KL}(Q\|P)$ and the Neyman $\chi^{2}$-divergence $\overline{\chi^{2}}(Q\|P)$ in a similar fashion.

Lemma 6 (Change of measure inequality from the unconstrained representation of the squared Hellinger divergence).

Let $\phi:\mathcal{H}\rightarrow(-\infty,1)$ be a real-valued function. Let $P$ and $Q$ be defined as in Theorem 1. Then we have

$$\forall Q\text{ on }\mathcal{H}:\ \mathbb{E}_{Q}[\phi]\leq H^{2}(Q\|P)+\mathbb{E}_{P}\bigg[\frac{\phi}{1-\phi}\bigg]$$

Similarly, we obtain the following bound for the reverse-KL divergence.

Lemma 7 (Change of measure inequality from the unconstrained representation of the reverse KL-divergence).

Let $\phi:\mathcal{H}\rightarrow(-\infty,1)$ be a real-valued function. Let $P$ and $Q$ be defined as in Theorem 1. Then we have

$$\forall Q\text{ on }\mathcal{H}:\ \mathbb{E}_{Q}[\phi]\leq\overline{KL}(Q\|P)+\mathbb{E}_{P}\bigg[\log\bigg(\frac{1}{1-\phi}\bigg)\bigg]$$

Finally, we prove our result for the Neyman χ2\chi^{2} divergence based on a similar approach.

Lemma 8 (Change of measure inequality from the unconstrained representation of the Neyman $\chi^{2}$-divergence).

Let $\phi:\mathcal{H}\rightarrow(-\infty,1)$ be a real-valued function. Let $P$ and $Q$ be defined as in Theorem 1. Then we have

$$\forall Q\text{ on }\mathcal{H}:\ \mathbb{E}_{Q}[\phi]\leq\overline{\chi^{2}}(Q\|P)+2-2\mathbb{E}_{P}[\sqrt{1-\phi}]$$

2.2 Multiplicative Change of Measure Inequality for $\alpha$-divergences

First, we state a known result for the $\chi^{2}$-divergence.

Lemma 9 (Multiplicative change of measure inequality for the $\chi^{2}$-divergence [15]).

Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\forall Q\text{ on }\mathcal{H}:\ \mathbb{E}_{Q}[\phi]\leq\sqrt{(\chi^{2}(Q\|P)+1)\,\mathbb{E}_{P}[\phi^{2}]}$$

Note that the $\chi^{2}$-divergence is an $\alpha$-divergence with $\alpha=2$. Next, we generalize the above bound to any $\alpha$-divergence.

Lemma 10 (Multiplicative change of measure inequality for the $\alpha$-divergence).

Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. For any $\alpha>1$, we have

$$\forall Q\text{ on }\mathcal{H}:\ \mathbb{E}_{Q}[\phi]\leq\big(\alpha(\alpha-1)D_{\alpha}(Q\|P)+1\big)^{\frac{1}{\alpha}}\big(\mathbb{E}_{P}[|\phi|^{\frac{\alpha}{\alpha-1}}]\big)^{\frac{\alpha-1}{\alpha}}$$

Our bound is stated in terms of the $\alpha$-divergence. By choosing $\alpha=2$, we recover the bound of Lemma 9, since $\chi^{2}(Q\|P)=2D_{2}(Q\|P)$. We will later apply the above $\alpha$-divergence change of measure inequality to obtain PAC-Bayes inequalities for types of losses not considered before [1, 15].
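For illustration, the following Python sketch (our own, with arbitrarily chosen discrete distributions and test function) evaluates both sides of the multiplicative bound of Lemma 10 for a particular $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 3.0

# Arbitrary discrete P and Q, and an arbitrary test function phi.
p = rng.random(20); p /= p.sum()
q = rng.random(20); q /= q.sum()
phi = rng.normal(size=20)

# alpha-divergence: D_alpha = (sum_i p_i (q_i/p_i)^alpha - 1) / (alpha*(alpha-1)).
d_alpha = (np.sum(p * (q / p) ** alpha) - 1.0) / (alpha * (alpha - 1.0))

lhs = np.sum(q * phi)  # E_Q[phi]
rhs = (alpha * (alpha - 1.0) * d_alpha + 1.0) ** (1.0 / alpha) * \
      np.sum(p * np.abs(phi) ** (alpha / (alpha - 1.0))) ** ((alpha - 1.0) / alpha)

print(f"E_Q[phi] = {lhs:.4f}  <=  multiplicative bound = {rhs:.4f}")
assert lhs <= rhs + 1e-12
```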

2.3 A Generalized Hammersley-Chapman-Robbins (HCR) Inequality

The HCR inequality is a well-known information-theoretic inequality for the $\chi^{2}$-divergence.

Lemma 11 (HCR inequality [18]).

Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. Then we have

$$\forall Q\text{ on }\mathcal{H}:\ \chi^{2}(Q\|P)\geq\frac{(\mathbb{E}_{Q}[\phi]-\mathbb{E}_{P}[\phi])^{2}}{\mathbb{V}\mathrm{ar}_{P}[\phi]}$$

Next, we generalize the above bound to the $\alpha$-divergence.

Lemma 12 (Generalization of the HCR inequality).

Let $P$, $Q$ and $\phi$ be defined as in Theorem 1. For any $\alpha>1$, we have

$$\forall Q\text{ on }\mathcal{H}:\ \big|\mathbb{E}_{Q}[\phi]-\mathbb{E}_{P}[\phi]\big|\leq\widetilde{\mathcal{D}_{\alpha}}(Q\|P)^{\frac{1}{\alpha}}\big(\mathbb{E}_{P}[|\phi-\mu_{P}|^{\frac{\alpha}{\alpha-1}}]\big)^{\frac{\alpha-1}{\alpha}}$$

where $\widetilde{\mathcal{D}_{\alpha}}(Q\|P)=\int_{\mathcal{H}}|\frac{dQ}{dP}-1|^{\alpha}\,dP$ and $\mu_{P}=\mathbb{E}_{P}[\phi]$.

We call $\widetilde{\mathcal{D}_{\alpha}}(Q\|P)$ a pseudo $\alpha$-divergence; it is a member of the family of $f$-divergences with $f(t)=|t-1|^{\alpha}$ for $\alpha>1$ (see Appendix B for the formal proof). Lemma 11 is recovered by choosing $\alpha=2$, as shown below.
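To spell out that last claim (a one-line check using only the definitions above): with $\alpha=2$, the pseudo $\alpha$-divergence is the Pearson $\chi^{2}$-divergence and the moment term is the standard deviation, so Lemma 12 becomes

$$\big|\mathbb{E}_{Q}[\phi]-\mathbb{E}_{P}[\phi]\big|\leq\bigg(\int_{\mathcal{H}}\Big|\frac{dQ}{dP}-1\Big|^{2}dP\bigg)^{\frac{1}{2}}\big(\mathbb{E}_{P}[|\phi-\mu_{P}|^{2}]\big)^{\frac{1}{2}}=\sqrt{\chi^{2}(Q\|P)\,\mathbb{V}\mathrm{ar}_{P}[\phi]},$$

and squaring both sides gives exactly Lemma 11.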

3 Application to PAC-Bayesian Theory

Table 3: Summary of PAC-Bayesian bounds with the $\alpha$-divergence and the $\chi^{2}$-divergence. Each entry is a generalization upper bound for $R_{D}(G_{Q})-R_{S}(G_{Q})$, holding for every $Q$ and a fixed $P$; the reference is given in parentheses.

Bounded loss:
  • $\alpha$: $O\big(R[\frac{1}{m}\log(\frac{1}{\delta})]^{\frac{1}{2}}D_{\alpha}(Q\|P)^{\frac{1}{2\alpha}}\big)$ (Proposition 5)
  • $\alpha$: $O\big((\frac{1}{m})^{\frac{1}{2}}[D_{\alpha}(Q\|P)+(R^{2}\log(\frac{1}{\delta}))^{\frac{\alpha}{\alpha-1}}]^{\frac{1}{2}}\big)$ (Proposition 5)
  • $\chi^{2}$: $O\big(R[\frac{1}{m}\log(\frac{1}{\delta})]^{\frac{1}{2}}\chi^{2}(Q\|P)^{\frac{1}{4}}\big)$ (Corollary 3)
  • $\chi^{2}$: $O\big((\frac{1}{m})^{\frac{1}{2}}[\chi^{2}(Q\|P)+(R^{2}\log(\frac{1}{\delta}))^{2}]^{\frac{1}{2}}\big)$ (Corollary 3)

0-1 loss:
  • $\chi^{2}$: $O\big((\frac{1}{m\delta})^{\frac{1}{2}}\chi^{2}(Q\|P)^{\frac{1}{2}}\big)$ ([15, 20])

Sub-Gaussian loss:
  • $\alpha$: $O\big(\sigma[\frac{1}{m}\log(\frac{1}{\delta})]^{\frac{1}{2}}D_{\alpha}(Q\|P)^{\frac{1}{2\alpha}}\big)$ (Proposition 6)
  • $\alpha$: $O\big((\frac{1}{m})^{\frac{1}{2}}[D_{\alpha}(Q\|P)+(\sigma^{2}\log(\frac{1}{\delta}))^{\frac{\alpha}{\alpha-1}}]^{\frac{1}{2}}\big)$ (Proposition 6)
  • $\alpha$: $O\big(\sigma(\frac{1}{m})^{\frac{1}{2}}(\frac{1}{\delta})^{\frac{\alpha-1}{\alpha}}D_{\alpha}(Q\|P)^{\frac{1}{\alpha}}\big)$ (Theorem 1 & Proposition 6 in [1])
  • $\chi^{2}$: $O\big(\sigma[\frac{1}{m}\log(\frac{1}{\delta})]^{\frac{1}{2}}\chi^{2}(Q\|P)^{\frac{1}{4}}\big)$ (Corollary 4)
  • $\chi^{2}$: $O\big((\frac{1}{m})^{\frac{1}{2}}[\chi^{2}(Q\|P)+(\sigma^{2}\log(\frac{1}{\delta}))^{2}]^{\frac{1}{2}}\big)$ (Corollary 4)
  • $\chi^{2}$: $O\big(\sigma(\frac{1}{m\delta})^{\frac{1}{2}}\chi^{2}(Q\|P)^{\frac{1}{2}}\big)$ (Theorem 1 & Proposition 6 in [1])

Sub-exponential loss, for $m<\frac{2\beta^{2}\log(\frac{2}{\delta})}{\sigma^{2}}$:
  • $\alpha$: $O\big(\frac{\beta}{m}\log(\frac{1}{\delta})D_{\alpha}(Q\|P)^{\frac{1}{2\alpha}}\big)$ (Proposition 7)
  • $\alpha$: $O\big((\frac{1}{m})^{\frac{1}{2}}[D_{\alpha}(Q\|P)+m^{\frac{-\alpha}{\alpha-1}}(\beta\log(\frac{1}{\delta}))^{\frac{2\alpha}{\alpha-1}}]^{\frac{1}{2}}\big)$ (Proposition 7)
  • $\chi^{2}$: $O\big(\frac{\beta}{m}\log(\frac{1}{\delta})\chi^{2}(Q\|P)^{\frac{1}{4}}\big)$ (Corollary 5)
  • $\chi^{2}$: $O\big((\frac{1}{m})^{\frac{1}{2}}[\chi^{2}(Q\|P)+\frac{1}{m^{2}}(\beta\log(\frac{1}{\delta}))^{4}]^{\frac{1}{2}}\big)$ (Corollary 5)

Bounded-variance loss:
  • $\alpha$: $O\big(\sigma(\frac{1}{m\delta})^{\frac{1}{2}}D_{\alpha}(Q\|P)^{\frac{1}{2\alpha}}\big)$ (Proposition 1)
  • $\alpha$: $O\big((\frac{1}{m})^{\frac{1}{2}}[D_{\alpha}(Q\|P)+(\frac{\sigma^{2}}{\delta})^{\frac{\alpha}{\alpha-1}}]^{\frac{1}{2}}\big)$ (Proposition 1)
  • $\alpha$: $O\big(\sigma(\frac{1}{m})^{\frac{1}{2}}(\frac{D_{\alpha}(Q\|P)}{\delta^{\alpha-1}})^{\frac{1}{\alpha}}\big)$ (Proposition 4 in [1])
  • $\chi^{2}$: $O\big(\sigma(\frac{1}{m\delta})^{\frac{1}{2}}\chi^{2}(Q\|P)^{\frac{1}{4}}\big)$ (Corollary 2)
  • $\chi^{2}$: $O\big((\frac{1}{m})^{\frac{1}{2}}[\chi^{2}(Q\|P)+(\frac{\sigma^{2}}{\delta})^{2}]^{\frac{1}{2}}\big)$ (Corollary 2)
  • $\chi^{2}$: $O\big(\sigma(\frac{1}{m\delta})^{\frac{1}{2}}\chi^{2}(Q\|P)^{\frac{1}{2}}\big)$ (Corollary 1 in [1])

In this section, we explore applications of our change of measure inequalities. We consider an arbitrary input space $\mathcal{X}$ and an output space $\mathcal{Y}$. The samples $(x,y)\in\mathcal{X}\times\mathcal{Y}$ are input-output pairs. Each example $(x,y)$ is drawn i.i.d. according to a fixed, but unknown, distribution $D$ on $\mathcal{X}\times\mathcal{Y}$. Let $\ell:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$ denote a generic loss function. The risk $R_{D}(h)$ of any predictor $h:\mathcal{X}\rightarrow\mathcal{Y}$ is defined as the expected loss induced by samples drawn according to $D$. Given a training set $S$ of $m$ samples, the empirical risk $R_{S}(h)$ of any predictor $h$ is defined as the empirical average of the loss. That is,

$$R_{D}(h)=\mathop{\mathbb{E}}_{(x,y)\sim D}\ell(h(x),y),\qquad R_{S}(h)=\frac{1}{|S|}\sum_{(x,y)\in S}\ell(h(x),y)$$

In the PAC-Bayesian framework, we consider a hypothesis space $\mathcal{H}$ of predictors, a prior distribution $P$ on $\mathcal{H}$, and a posterior distribution $Q$ on $\mathcal{H}$. The prior is specified before exploiting the information contained in $S$, while the posterior is obtained by running a learning algorithm on $S$. PAC-Bayesian theory usually studies the stochastic Gibbs predictor $G_{Q}$. Given a distribution $Q$ on $\mathcal{H}$, $G_{Q}$ predicts an example $x$ by drawing a predictor $h$ according to $Q$ and returning $h(x)$. The risk of $G_{Q}$ is then defined as follows. For any probability distribution $Q$ on a set of predictors, the Gibbs risk $R_{D}(G_{Q})$ is the expected risk of the Gibbs predictor $G_{Q}$ relative to $D$. Hence,

$$R_{D}(G_{Q})=\mathop{\mathbb{E}}_{(x,y)\sim D}\mathop{\mathbb{E}}_{h\sim Q}\ell(h(x),y)$$ (2)

Usual PAC-Bayesian bounds give guarantees on the generalization risk $R_{D}(G_{Q})$. Typically, these bounds rely on the empirical risk $R_{S}(G_{Q})$ defined as follows:

$$R_{S}(G_{Q})=\frac{1}{|S|}\sum_{(x,y)\in S}\mathop{\mathbb{E}}_{h\sim Q}\ell(h(x),y)$$ (3)
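For intuition about the Gibbs predictor, the following Python sketch estimates the empirical Gibbs risk $R_{S}(G_{Q})$ of Equation (3) by averaging over hypotheses drawn from $Q$; the Gaussian posterior over linear classifiers, the 0-1 loss, and the synthetic data are illustrative assumptions, not constructions from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy dataset S: m samples in R^d with binary labels in {-1, +1}.
m, d = 200, 5
X = rng.normal(size=(m, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)

# Posterior Q over linear classifiers h_w(x) = sign(w . x): an isotropic Gaussian.
q_mean, q_std, n_draws = w_true, 0.5, 1000

def zero_one_loss(w, X, y):
    return np.mean(np.sign(X @ w) != y)

# Empirical Gibbs risk: average the empirical risk over hypotheses drawn from Q.
draws = q_mean + q_std * rng.normal(size=(n_draws, d))
R_S_gibbs = np.mean([zero_one_loss(w, X, y) for w in draws])
print(f"Empirical Gibbs risk R_S(G_Q) ~= {R_S_gibbs:.3f}")
```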

Due to space constraints, we present in full only the PAC-Bayes generalization bounds for losses with bounded variance. The results for bounded, sub-Gaussian and sub-exponential losses are included in Appendix C. Still, we briefly discuss our new results in Section 3.2.

3.1 Loss Function with Bounded Variance

In this section, we present our PAC-Bayesian bounds for loss functions with bounded variance. Assuming an arbitrary distribution under which the loss function $\ell$ has bounded variance (i.e., $\mathbb{V}\mathrm{ar}_{(x,y)\sim D}[\ell(h(x),y)]\leq\sigma^{2}$ for any $h\in\mathcal{H}$), we have the following PAC-Bayesian bounds.

Proposition 1 (The PAC-Bayesian bounds for loss functions with bounded variance).

Let $P$ be a fixed prior distribution over a hypothesis space $\mathcal{H}$. For a given posterior distribution $Q$ over an infinite hypothesis space $\mathcal{H}$, let $R_{D}(G_{Q})$ and $R_{S}(G_{Q})$ be the Gibbs risk and the empirical Gibbs risk as in Equations (2) and (3), respectively. For sample size $m$ and $\alpha>1$, with probability at least $1-\delta$, simultaneously for all posterior distributions $Q$, we have

$$R_{D}(G_{Q})\leq R_{S}(G_{Q})+\sqrt{\frac{\sigma^{2}}{m\delta}\big(\alpha(\alpha-1)D_{\alpha}(Q\|P)+1\big)^{\frac{1}{\alpha}}}$$
$$R_{D}(G_{Q})\leq R_{S}(G_{Q})+\sqrt{\frac{1}{m}\bigg(D_{\alpha}(Q\|P)+\frac{1}{\alpha(\alpha-1)}\bigg)+\frac{1}{m\alpha}\bigg(\frac{\sigma^{2}(\alpha-1)}{\delta}\bigg)^{\frac{\alpha}{\alpha-1}}}$$ (4)

By setting $\alpha=2$ in Proposition 1, we have the following claim.

Corollary 2 (The PAC-Bayesian bounds with the $\chi^{2}$-divergence for loss functions with bounded variance).

Let $P$ be any prior distribution over an infinite hypothesis space $\mathcal{H}$. For a given posterior distribution $Q$ over an infinite hypothesis space $\mathcal{H}$, let $R_{D}(G_{Q})$ and $R_{S}(G_{Q})$ be the Gibbs risk and the empirical Gibbs risk as in Equations (2) and (3), respectively. For sample size $m>0$, with probability at least $1-\delta$, simultaneously for all posterior distributions $Q$, we have

$$R_{D}(G_{Q})\leq R_{S}(G_{Q})+\sqrt{\frac{\sigma^{2}}{m\delta}\sqrt{\chi^{2}(Q\|P)+1}}$$ (5)
$$R_{D}(G_{Q})\leq R_{S}(G_{Q})+\sqrt{\frac{1}{2m}\bigg(\chi^{2}(Q\|P)+1+\bigg(\frac{\sigma^{2}}{\delta}\bigg)^{2}\bigg)}$$

Now, note that choosing $\Delta(q,p)=|q-p|$ for Equation (5) results in

$$R_{D}(G_{Q})\leq R_{S}(G_{Q})+\sqrt{\frac{\sigma^{2}}{m\delta}(\chi^{2}(Q\|P)+1)}$$ (6)

which was shown in Corollary 1 of [1]. We can easily see that (5) is tighter than (6) due to the choice of $\Delta(q,p)=(q-p)^{2}$. Please see Table 3 for details.
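As a small numerical illustration of this comparison (our own, with arbitrarily chosen values of $\sigma^{2}$, $m$, $\delta$ and $\chi^{2}(Q\|P)$), the complexity terms of (5) and (6) evaluate as follows:

```python
import numpy as np

sigma2, m, delta, chi2 = 1.0, 1000, 0.05, 9.0  # illustrative values

gap_eq5 = np.sqrt(sigma2 / (m * delta) * np.sqrt(chi2 + 1.0))   # Equation (5)
gap_eq6 = np.sqrt(sigma2 / (m * delta) * (chi2 + 1.0))          # Equation (6), Corollary 1 of [1]

print(f"Equation (5) gap: {gap_eq5:.4f}")   # ~0.25
print(f"Equation (6) gap: {gap_eq6:.4f}")   # ~0.45
```

The gap term of (5) is never larger than that of (6), since $\sqrt{\chi^{2}(Q\|P)+1}\leq\chi^{2}(Q\|P)+1$.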

3.2 Discussion

Table 3 presents various PAC-Bayesian bounds based on our change of measure inequalities, depending on different assumptions on the loss function $\ell$, and compares them with existing results in the literature. The importance of the various types of PAC-Bayes bounds is justified by the connection between PAC-Bayes bounds and regularization in a learning problem. PAC-Bayesian theory provides a guarantee that upper-bounds the risk of Gibbs predictors simultaneously for all posterior distributions $Q$; PAC-Bayes bounds enjoy this property precisely because of change of measure inequalities. It is a well-known fact that KL-regularized objective functions are obtained from PAC-Bayes risk bounds, in which the minimizer $\hat{Q}$ is guaranteed to exist since the risk bounds hold simultaneously for all posterior distributions $Q$. The complexity term, partially controlled by the divergence, serves as a regularizer [10, 9, 3]. Additionally, the link between Bayesian inference techniques and PAC-Bayesian risk bounds was shown by [8]: the minimization of PAC-Bayesian risk bounds maximizes the Bayesian marginal likelihood. Our results pertain to important machine learning prediction problems, such as regression, classification and structured prediction, and indicate which regularizer to use.

All PAC-Bayes bounds presented here are either novel or have a tighter complexity term than existing results in the literature. Since our bounds are based on either the $\chi^{2}$-divergence or the $\alpha$-divergence, we excluded the comparison with bounds based on the KL-divergence ([30, 4, 13, 25, 8, 11, 33]). Our results for sub-exponential losses are entirely novel. For the other cases, our bounds are tighter than existing bounds in terms of the complexity term. For instance, our bound for bounded losses is tighter than those of [15, 20], since our bound has the complexity terms $\chi^{2}(Q\|P)^{1/4}$ and $\log(1/\delta)$, while [15, 20] have $\chi^{2}(Q\|P)^{1/2}$ and $1/\delta$. For sub-Gaussian loss functions, our bound has the complexity terms $D_{\alpha}(Q\|P)^{1/(2\alpha)}$ and $\log(1/\delta)$, whereas [1] has $D_{\alpha}(Q\|P)^{1/\alpha}$ and $1/\delta$, respectively. In addition, our additive bounds, such as Equation (4), have better rates than the existing bounds in [1], since in our bound $D_{\alpha}(Q\|P)$ and $1/\delta$ are added, while in [1] they are multiplied.

4 Non-Asymptotic Interval for Monte Carlo Estimates

Table 4: Summary of non-asymptotic intervals for Monte Carlo estimates. Each result has the form

$$\bigg|\mathbb{E}_{Q}[\phi(X)]-\frac{1}{n}\sum^{n}_{i=1}\phi(X_{i})\bigg|\leq\frac{4L^{2}\log(\frac{2}{\delta})}{n\gamma}+\mathcal{K}$$

with the term $\mathcal{K}$ depending on the divergence used (reference in parentheses):
  • Pseudo $\alpha$: $\mathcal{K}=\frac{2^{\frac{2\alpha-1}{\alpha}}L\,\widetilde{\mathcal{D}_{\alpha}}(Q\|P)^{\frac{1}{\alpha}}}{\sqrt{\gamma}}\Gamma(\frac{3\alpha-2}{2(\alpha-1)})^{\frac{\alpha-1}{\alpha}}$ (Proposition 2)
  • $\chi^{2}$: $\mathcal{K}=\sqrt{\chi^{2}(Q\|P)\{\frac{1}{n}\sum_{i=1}^{n}\phi^{2}(X_{i})+\frac{16L^{2}}{\gamma}\sqrt{\frac{1}{n}\log\frac{2}{\delta}}\}}$ if $\log(\frac{2}{\delta})\leq n$, and $\mathcal{K}=\sqrt{\chi^{2}(Q\|P)\{\frac{1}{n}\sum_{i=1}^{n}\phi^{2}(X_{i})+\frac{16L^{2}}{n\gamma}\log\frac{2}{\delta}\}}$ if $n<\log(\frac{2}{\delta})$ (Proposition 3)
  • KL: $\mathcal{K}=KL(Q\|P)+\frac{L^{2}}{n\gamma}$ (Proposition 4)

We now turn to another application of change of measure inequalities. We introduce a methodology that enables us to find a non-asymptotic interval for Monte Carlo (MC) estimates. All results shown in this section are entirely novel and are summarized in Table 4. Under some circumstances, our methodology could be a promising alternative to existing MC methods (e.g., importance sampling and rejection sampling [28]), whose non-asymptotic intervals are hard to analyze. In this section, we consider the problem of estimating the expectation of a Lipschitz function $\phi:\mathbb{R}^{d}\rightarrow\mathbb{R}$ with respect to a complicated distribution $Q$, namely $\mathbb{E}_{Q}[\phi(X)]$, by the sample mean $\frac{1}{n}\sum^{n}_{i=1}\phi(X_{i})$ over a distribution $P$, where $Q$ is any distribution we are not able to sample from. For a motivating application, consider $Q$ being a probabilistic graphical model (see e.g., [14]). Assume that we have a strongly log-concave distribution $P$ with parameter $\gamma$ as the sampling distribution (see Appendix D for the definition). Under the above conditions, we claim the following proposition.

Proposition 2 (A general expression of non-asymptotic interval for Monte Carlo estimates).

Let $X$ be a $d$-dimensional random vector. Let $P$ and $Q$ be the probability measures induced by $X$. Let $P$ be a strongly log-concave distribution with parameter $\gamma>0$. Let $\phi:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be any $L$-Lipschitz function with respect to the Euclidean norm. Suppose we draw i.i.d. samples $X_{1},X_{2},\dots,X_{n}\sim P$. For $\alpha>1$, with probability at least $1-\delta$, we have

$$\bigg|\mathbb{E}_{Q}[\phi(X)]-\frac{1}{n}\sum^{n}_{i=1}\phi(X_{i})\bigg|\leq\frac{4L^{2}\log(\frac{2}{\delta})}{n\gamma}+\frac{2^{\frac{2\alpha-1}{\alpha}}L\,\widetilde{\mathcal{D}_{\alpha}}(Q\|P)^{\frac{1}{\alpha}}}{\sqrt{\gamma}}\Gamma\bigg(\frac{3\alpha-2}{2(\alpha-1)}\bigg)^{\frac{\alpha-1}{\alpha}}$$

The second term on the right-hand side captures the bias of the empirical mean $\frac{1}{n}\sum^{n}_{i=1}\phi(X_{i})$ under the sampling distribution $P$. Next, we present a more informative bound.

Proposition 3 ($\chi^{2}$-based expression of the non-asymptotic interval for Monte Carlo estimates).

Let $X,P,Q,L,\gamma$ and $\phi$ be defined as in Proposition 2. Suppose we have i.i.d. samples $X_{1},X_{2},\dots,X_{n}\sim P$. Then, with probability at least $(1-\delta)^{2}$, we have

$$\bigg|\mathbb{E}_{Q}[\phi(X)]-\frac{1}{n}\sum^{n}_{i=1}\phi(X_{i})\bigg|\leq\frac{4L^{2}\log(\frac{2}{\delta})}{n\gamma}+\mathcal{K}$$

where

$$\mathcal{K}=\begin{cases}\sqrt{\chi^{2}(Q\|P)\Big\{\frac{1}{n}\sum_{i=1}^{n}\phi^{2}(X_{i})+\frac{16L^{2}}{\gamma}\sqrt{\frac{1}{n}\log\frac{2}{\delta}}\Big\}}&\text{if }\log(\frac{2}{\delta})\leq n\\ \sqrt{\chi^{2}(Q\|P)\Big\{\frac{1}{n}\sum_{i=1}^{n}\phi^{2}(X_{i})+\frac{16L^{2}}{n\gamma}\log\frac{2}{\delta}\Big\}}&\text{if }n<\log(\frac{2}{\delta})\end{cases}$$

This result provides an insight into how good a proposal distribution $P$ is: the effect of the deviation between $P$ and $Q$, namely $\chi^{2}(Q\|P)$, is inflated by the empirical variance. This implication might support some results in the literature ([5, 6]). So far, we have considered the pseudo $\alpha$-divergence as well as the $\chi^{2}$-divergence. [7] presented an approximation of an arbitrary distribution $Q$ by distributions with log-concave densities with respect to the KL-divergence, which motivates the following result.

Proposition 4 (KL-based expression of non-asymptotic interval for Monte Carlo estimates).

Let $X,P,Q,L,\gamma$ and $\phi$ be defined as in Proposition 2. Suppose we have i.i.d. samples $X_{1},X_{2},\dots,X_{n}\sim P$. Then, with probability at least $1-\delta$, we have

$$\bigg|\mathbb{E}_{Q}[\phi(X)]-\frac{1}{n}\sum^{n}_{i=1}\phi(X_{i})\bigg|\leq\frac{4L^{2}\log(\frac{2}{\delta})}{n\gamma}+KL(Q\|P)+\frac{L^{2}}{n\gamma}$$

This result shows that, under the assumptions of Proposition 4, the empirical mean $\frac{1}{n}\sum^{n}_{i=1}\phi(X_{i})$ over the sampling distribution $P$ asymptotically differs from $\mathbb{E}_{Q}[\phi(X)]$ by at most $KL(Q\|P)$.
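As an illustration of how Proposition 4 can be used in practice, the following Python sketch computes the KL-based interval for an isotropic Gaussian sampling distribution $P$ (strongly log-concave with $\gamma=1/\sigma_P^{2}$) and a Gaussian target $Q$ whose KL-divergence to $P$ is available in closed form; the distributions, the 1-Lipschitz test function $\phi(x)=\|x\|$, and all parameter values are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

d, n, delta, L = 5, 10_000, 0.05, 1.0            # phi(x) = ||x|| is 1-Lipschitz
mu_q, sig_q, sig_p = 0.3 * np.ones(d), 0.8, 1.0  # Q = N(mu_q, sig_q^2 I), P = N(0, sig_p^2 I)
gamma = 1.0 / sig_p**2                           # strong log-concavity parameter of P

# Closed-form KL(Q || P) between isotropic Gaussians.
kl = 0.5 * (d * sig_q**2 / sig_p**2 + mu_q @ mu_q / sig_p**2 - d
            + d * np.log(sig_p**2 / sig_q**2))

# Monte Carlo estimate of E_Q[phi(X)] using samples from P only.
samples = sig_p * rng.normal(size=(n, d))
estimate = np.mean(np.linalg.norm(samples, axis=1))

# Non-asymptotic interval half-width from Proposition 4.
half_width = 4 * L**2 * np.log(2 / delta) / (n * gamma) + kl + L**2 / (n * gamma)

print(f"estimate = {estimate:.3f}, interval half-width = {half_width:.3f}")
```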

References

  • [1] Pierre Alquier and Benjamin Guedj. Simpler pac-bayesian bounds for hostile data. Machine Learning, 107(5):887–902, May 2018.
  • [2] Amiran Ambroladze, Emilio Parrado-hernández, and John S. Shawe-taylor. Tighter pac-bayes bounds. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 9–16. MIT Press, 2007.
  • [3] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 06 2002.
  • [4] Olivier Catoni. Pac-bayesian supervised classification: The thermodynamics of statistical learning. 01 2007.
  • [5] Julien Cornebise, Éric Moulines, and Jimmy Olsson. Adaptive methods for sequential importance sampling with application to state space models. Statistics and Computing, 18(4):461–480, 2008.
  • [6] Adji Bousso Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David Blei. Variational inference via chi upper bound minimization. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2732–2741. Curran Associates, Inc., 2017.
  • [7] Lutz Duembgen, Richard Samworth, and Dominic Schuhmacher. Approximation by log-concave distributions, with applications to regression. The Annals of Statistics, 39, 02 2010.
  • [8] Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. Pac-bayesian theory meets bayesian inference. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1884–1892. Curran Associates, Inc., 2016.
  • [9] Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. A pac-bayes risk bound for general loss functions. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 449–456. MIT Press, 2007.
  • [10] Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, and François Laviolette. From pac-bayes bounds to kl regularization. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 603–610. Curran Associates, Inc., 2009.
  • [11] Peter D. Grünwald and Nishant A. Mehta. A tight excess risk bound via a unified PAC-Bayesian–Rademacher–Shtarkov–MDL complexity. In Aurélien Garivier and Satyen Kale, editors, Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pages 433–465, Chicago, Illinois, 22–24 Mar 2019. PMLR.
  • [12] Benjamin Guedj. A primer on pac-bayesian learning. Preprint arXiv:1901.05353, 05 2019.
  • [13] Matthew Holland. Pac-bayes under potentially heavy tails. In Advances in Neural Information Processing Systems 32, pages 2715–2724. Curran Associates, Inc., 2019.
  • [14] Jean Honorio. Lipschitz parametrization of probabilistic graphical models. pages 347–354. Conference on Uncertainty in Artificial Intelligence, 07 2011.
  • [15] Jean Honorio and Tommi Jaakkola. Tight bounds for the expected risk of linear classifiers and pac-bayes finite-sample guarantees. In International Conference on Artificial Intelligence and Statistics, pages 384–392, 2014.
  • [16] J. Jiao, Y. Han, and T. Weissman. Dependence measures bounding the exploration bias for general measurements. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 1475–1479, 2017.
  • [17] Markos A. Katsoulakis, Luc Rey-Bellet, and Jie Wang. Scalable information inequalities for uncertainty quantification. Journal of Computational Physics, 336:513 – 545, 2017.
  • [18] E.L. Lehmann and G. Casella. Theory of Point Estimation. Springer Verlag, 1998.
  • [19] Guy Lever, Francois Laviolette, and John Shawe-Taylor. Tighter pac-bayes bounds through distribution-dependent priors. Theor. Comput. Sci., 473:4–28, 02 2013.
  • [20] Luc Bégin, Pascal Germain, François Laviolette, and Jean-Francis Roy. PAC-Bayesian bounds based on the Rényi divergence. In International Conference on Artificial Intelligence and Statistics, 2016.
  • [21] David McAllester. Simplified pac-bayesian margin bounds. In Bernhard Schölkopf and Manfred K. Warmuth, editors, Learning Theory and Kernel Machines, pages 203–215, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.
  • [22] David A. McAllester. Some pac-bayesian theorems. In Machine Learning, pages 230–234. ACM Press, 1998.
  • [23] David A. McAllester. Pac-bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT ’99, pages 164–170, New York, NY, USA, 1999. ACM.
  • [24] David A. McAllester. Pac-bayesian stochastic model selection. Machine Learning, 51(1):5–21, Apr 2003.
  • [25] Zakaria Mhammedi, Peter Grünwald, and Benjamin Guedj. Pac-bayes un-expected bernstein inequality. In Advances in Neural Information Processing Systems 32, pages 12202–12213. Curran Associates, Inc., 2019.
  • [26] XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theor., 56(11):5847–5861, November 2010.
  • [27] XuanLong Nguyen, M.J. Wainwright, and Michael Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. Information Theory, IEEE Transactions on, 56:5847 – 5861, 12 2010.
  • [28] Christian P. Robert and George Casella. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag, Berlin, Heidelberg, 2005.
  • [29] Avraham Ruderman, Mark D. Reid, Darío García-García, and James Petterson. Tighter variational representations of f-divergences via restriction to probability measures. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1155–1162, USA, 2012. Omnipress.
  • [30] Matthias Seeger. Pac-bayesian generalisation error bounds for gaussian process classification. J. Mach. Learn. Res., 3:233–269, March 2003.
  • [31] Y. Seldin, F. Laviolette, N. Cesa-Bianchi, J. Shawe-Taylor, and P. Auer. Pac-bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086–7093, Dec 2012.
  • [32] John Shawe-Taylor and Robert C. Williamson. A pac analysis of a bayesian estimator. In Proceedings of the Tenth Annual Conference on Computational Learning Theory, COLT ’97, pages 2–9, New York, NY, USA, 1997. ACM.
  • [33] Rishit Sheth and Roni Khardon. Excess risk bounds for the bayes risk using variational inference in latent gaussian models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5151–5161. Curran Associates, Inc., 2017.
  • [34] Ilya O Tolstikhin and Yevgeny Seldin. Pac-bayes-empirical-bernstein inequality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 109–117. Curran Associates, Inc., 2013.
  • [35] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

Appendix A Detailed Proofs

A.1 Proof of Corollary 1

Proof.

Removing the supremum and rearranging the terms in Lemma 1, we prove our claim. ∎

A.2 Proof of Lemma 2

Proof.

By Theorem 1, we want to find

$$(\mathbb{I}^{R}_{f,P})^{*}(\phi)=\sup_{p\in\Delta(P)}\int_{\mathcal{H}}\phi p-(p-1)^{2}\,dP=\sup_{p\in\Delta(P)}\mathbb{E}_{P}[\phi p-(p-1)^{2}]$$ (7)

where $\phi$ is a measurable function on $\mathcal{H}$. In order to find the supremum on the right-hand side, we consider the following Lagrangian:

$$L(p,\lambda)=\mathbb{E}_{P}[\phi p-(p-1)^{2}]+\lambda(\mathbb{E}_{P}[p]-1)=\int_{\mathcal{H}}\phi p-(p-1)^{2}\,dP+\lambda\bigg(\int_{\mathcal{H}}p\,dP-1\bigg)$$

where $\lambda\in\mathbb{R}$ and $p$ is constrained to be a probability density over $\mathcal{H}$. Taking the functional derivative with respect to $p$ (normalizing by dropping the $dP$ factor that multiplies all terms) and setting it to zero, we have $\frac{\partial L}{\partial p\,dP}=\phi-2(p-1)+\lambda=0$. Thus, we have $p=\frac{\phi+\lambda}{2}+1$. Since $p$ is constrained to satisfy $\int_{\mathcal{H}}p\,dP=1$, we have $\int_{\mathcal{H}}(\frac{\phi+\lambda}{2}+1)\,dP=1$. Then, we obtain $\lambda=-\mathbb{E}_{P}[\phi]$ and the optimum $p=\frac{\phi-\mathbb{E}_{P}[\phi]}{2}+1$. Plugging it into Equation (7), we have

$$(\mathbb{I}^{R}_{f,P})^{*}(\phi)=\mathbb{E}_{P}\bigg[\phi\bigg(\frac{\phi-\mathbb{E}_{P}[\phi]}{2}+1\bigg)-\bigg(\frac{\phi-\mathbb{E}_{P}[\phi]}{2}\bigg)^{2}\bigg]$$

which simplifies to $\mathbb{E}_{P}[\phi]+\frac{1}{4}\mathbb{V}\mathrm{ar}_{P}[\phi]$. ∎

A.3 Proof of Lemma 3

Proof.

Notice that the $\chi^{2}$-divergence is obtained by setting $f(x)=(x-1)^{2}$. In order to find the convex conjugate $f^{*}(y)$ from Equation (1), let $g(x,y)=xy-(x-1)^{2}$. We need to find the supremum of $g(x,y)$ with respect to $x$. Differentiating with respect to $x$ and setting the derivative to zero, we have $\frac{\partial g}{\partial x}=y-2(x-1)=0$. Thus, plugging $x=\frac{y}{2}+1$ into $g(x,y)=xy-(x-1)^{2}$, we obtain $f^{*}(y)=y+\frac{y^{2}}{4}$. Plugging $f(x)$ and $f^{*}(y)$ into Corollary 1, we prove our claim. ∎
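As a quick symbolic cross-check of this conjugate (our own aside, assuming SymPy is available), the computation $f^{*}(y)=y+\frac{y^{2}}{4}$ can be verified directly:

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
g = x * y - (x - 1) ** 2                 # objective of the conjugate, xy - f(x)
x_star = sp.solve(sp.diff(g, x), x)[0]   # stationary point: x = y/2 + 1
f_conj = sp.expand(g.subs(x, x_star))    # f*(y) = y + y**2/4
print(f_conj)
```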

A.4 Proof of Lemma 4

Proof.

By Theorem 1, we want to find

$$(\mathbb{I}^{R}_{f,P})^{*}(\phi)=\sup_{p\in\Delta(P)}\mathbb{E}_{P}\Big[\phi p-\tfrac{1}{2}|p-1|\Big]$$

In order to find the supremum on the right hand side, we consider the following Lagrangian

$$L(p,\lambda)=\mathbb{E}_{P}\Big[\phi p-\tfrac{1}{2}|p-1|\Big]+\lambda(\mathbb{E}_{P}[p]-1)=\begin{cases}\mathbb{E}_{P}[\phi p-\frac{1}{2}p+\frac{1}{2}]+\lambda(\mathbb{E}_{P}[p]-1)&\text{if }1\leq p\\ \mathbb{E}_{P}[\phi p+\frac{1}{2}p-\frac{1}{2}]+\lambda(\mathbb{E}_{P}[p]-1)&\text{if }0<p<1\end{cases}$$

Then, it is not hard to see that if $|\phi|\leq\frac{1}{2}$, $L(p,\lambda)$ is maximized at $p=1$; otherwise $L\rightarrow\infty$ as $p\rightarrow 1$. Thus, Lemma 4 holds for $|\phi|\leq\frac{1}{2}$. If we add $\frac{1}{2}$ on both sides, then $\phi$ is bounded between 0 and 1, as claimed in Lemma 4. ∎

A.5 Proof of Lemma 5

Proof.

The corresponding convex function $f(x)$ for the $\alpha$-divergence is defined as $f(x)=\frac{x^{\alpha}-1}{\alpha(\alpha-1)}$. Applying the same procedure as in Lemma 3, we get the convex conjugate $f^{*}(y)=\frac{(\alpha-1)^{\frac{\alpha}{\alpha-1}}}{\alpha}y^{\frac{\alpha}{\alpha-1}}+\frac{1}{\alpha(\alpha-1)}$ for $\alpha>1$. Plugging $f(x)$ and $f^{*}(y)$ into Corollary 1, we prove our claim. ∎

A.6 Proof of Lemma 6

Proof.

The corresponding convex function $f(x)$ for the squared Hellinger divergence is defined as $f(x)=(\sqrt{x}-1)^{2}$. Let $g(x,y)=xy-(\sqrt{x}-1)^{2}$. We have $\frac{\partial g}{\partial x}=y-1+\frac{1}{\sqrt{x}}=0$. Since we require $x>0$, this forces $y<1$. Applying the same procedure as in Lemma 3, we get the convex conjugate $f^{*}(y)=\frac{y}{1-y}$. Plugging $f(x)$ and $f^{*}(y)$ into Corollary 1, we prove our claim. ∎

A.7 Proof of Lemma 7

Proof.

The corresponding convex function $f(x)$ for the reverse KL-divergence is defined as $f(x)=-\log(x)$. Applying the same procedure as in Lemma 6, we get the convex conjugate $f^{*}(y)=\log(-\frac{1}{y})-1$. Plugging $f(x)$ and $f^{*}(y)$ into Corollary 1, we have

$$\mathbb{E}_{Q}[\psi]\leq\overline{KL}(Q\|P)+\mathbb{E}_{P}\bigg[\log\bigg(-\frac{1}{\psi}\bigg)\bigg]-1$$

where $\psi<0$. Letting $\psi=\phi-1$, we prove our claim. ∎

A.8 Proof of Lemma 8

Proof.

The corresponding convex function $f(x)$ for the Neyman $\chi^{2}$-divergence is defined as $f(x)=\frac{(x-1)^{2}}{x}$. Applying the same procedure as in Lemma 6, we get the convex conjugate $f^{*}(y)=2-2\sqrt{1-y}$. Plugging $f(x)$ and $f^{*}(y)$ into Corollary 1, we prove our claim. ∎

A.9 Proof of Lemma 10

Proof.
$$\begin{split}\mathbb{E}_{Q}[\phi]&=\int_{\mathcal{H}}\phi\,dQ=\int_{\mathcal{H}}\phi\frac{dQ}{dP}\,dP\\&\leq\bigg(\int_{\mathcal{H}}\bigg|\frac{dQ}{dP}\bigg|^{\alpha}dP\bigg)^{\frac{1}{\alpha}}\bigg(\int_{\mathcal{H}}|\phi|^{\frac{\alpha}{\alpha-1}}dP\bigg)^{\frac{\alpha-1}{\alpha}}\\&=\big(\alpha(\alpha-1)D_{\alpha}(Q\|P)+1\big)^{\frac{1}{\alpha}}\big(\mathbb{E}_{P}[|\phi|^{\frac{\alpha}{\alpha-1}}]\big)^{\frac{\alpha-1}{\alpha}}\end{split}$$

The inequality is due to Hölder's inequality. ∎

A.10 Proof of Lemma 12

Proof.

Consider the covariance of $\phi$ and $\frac{dQ}{dP}$ under $P$:

$$\bigg|\mathrm{Cov}_{P}\bigg(\phi,\frac{dQ}{dP}\bigg)\bigg|=\bigg|\int_{\mathcal{H}}\phi\frac{dQ}{dP}\,dP-\int_{\mathcal{H}}\phi\,dP\int_{\mathcal{H}}\frac{dQ}{dP}\,dP\bigg|=\big|\mathbb{E}_{Q}[\phi]-\mathbb{E}_{P}[\phi]\big|$$

On the other hand,

$$\begin{split}\bigg|\mathrm{Cov}_{P}\bigg(\phi,\frac{dQ}{dP}\bigg)\bigg|&=\bigg|\mathbb{E}_{P}\bigg[(\phi-\mu_{P})\bigg(\frac{dQ}{dP}-\mathbb{E}_{P}\bigg[\frac{dQ}{dP}\bigg]\bigg)\bigg]\bigg|\\&\leq\mathbb{E}_{P}\bigg[\bigg|(\phi-\mu_{P})\bigg(\frac{dQ}{dP}-\mathbb{E}_{P}\bigg[\frac{dQ}{dP}\bigg]\bigg)\bigg|\bigg]=\int_{\mathcal{H}}\bigg|(\phi-\mu_{P})\bigg(\frac{dQ}{dP}-1\bigg)\bigg|\,dP\\&\leq\bigg(\int_{\mathcal{H}}\bigg|\frac{dQ}{dP}-1\bigg|^{\alpha}dP\bigg)^{\frac{1}{\alpha}}\bigg(\int_{\mathcal{H}}|\phi-\mu_{P}|^{\frac{\alpha}{\alpha-1}}dP\bigg)^{\frac{\alpha-1}{\alpha}}=\widetilde{\mathcal{D}_{\alpha}}(Q\|P)^{\frac{1}{\alpha}}\bigg(\mathbb{E}_{P}[|\phi-\mu_{P}|^{\frac{\alpha}{\alpha-1}}]\bigg)^{\frac{\alpha-1}{\alpha}}\end{split}$$

which proves our claim. ∎

A.11 Proof of Proposition 1

Proof.

Suppose the convex function $\Delta:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}$ is defined as in Proposition 5. By Chebyshev's inequality, for any $\epsilon>0$ and any $h\in\mathcal{H}$,

$$\begin{split}\Pr_{(x,y)\sim D}(\phi_{D}(h)\geq\epsilon)&=\Pr_{(x,y)\sim D}\bigg((R_{S}(h)-R_{D}(h))^{2}\geq\frac{\epsilon}{t}\bigg)\\&=\Pr_{(x,y)\sim D}\bigg(\bigg|\frac{1}{m}\sum_{i=1}^{m}\ell(h(x_{i}),y_{i})-\mathop{\mathbb{E}}_{(x,y)\sim D}\ell(h(x),y)\bigg|\geq\sqrt{\frac{\epsilon}{t}}\bigg)\\&\leq\frac{t\sigma^{2}}{m\epsilon}\end{split}$$

Setting $\delta=\frac{t\sigma^{2}}{m\epsilon}$, we have

$$\phi_{D}(h)\mathop{\leq}_{1-\delta}\frac{t\sigma^{2}}{m\delta}$$

Now that we have the upper bound for $\phi_{D}(h)$, we can apply the same procedure as in Proposition 5. ∎

A.12 Proof of Proposition 2

Proof.

We start from Lemma 12. It remains to bound $\mathbb{E}_{P}[\phi(X)]$ and $\mathbb{E}_{P}[|\phi(X)-\mathbb{E}_{P}[\phi(X)]|^{\frac{\alpha}{\alpha-1}}]$. Since $P$ is a strongly log-concave distribution and $\phi$ is an $L$-Lipschitz function, Theorem 3.16 in [35] implies that $\phi(X)-\mathbb{E}_{P}[\phi(X)]$ is a sub-Gaussian random variable with parameter $\sigma^{2}=\frac{2L^{2}}{\gamma}$. Therefore, we have

$$\Pr\bigg(\bigg|\frac{1}{n}\sum^{n}_{i=1}\phi(X_{i})-\mathbb{E}_{P}[\phi(X)]\bigg|\geq\epsilon\bigg)\leq 2e^{-\frac{n\gamma\epsilon^{2}}{4L^{2}}}$$

Thus,

$$\bigg|\mathbb{E}_{P}[\phi(X)]-\frac{1}{n}\sum^{n}_{i=1}\phi(X_{i})\bigg|\mathop{\leq}_{1-\delta}\frac{4L^{2}}{n\gamma}\log\Big(\frac{2}{\delta}\Big)$$

On the other hand, letting $U=\phi(X)-\mathbb{E}_{P}[\phi(X)]$, we have

$$\begin{split}\mathbb{E}_{P}[|U|^{\frac{\alpha}{\alpha-1}}]&=\int_{0}^{\infty}\Pr\{|U|^{\frac{\alpha}{\alpha-1}}>u\}\,du=\int_{0}^{\infty}\Pr\{|U|>u^{\frac{\alpha-1}{\alpha}}\}\,du\\&=\frac{\alpha}{\alpha-1}\int_{0}^{\infty}u^{\frac{1}{\alpha-1}}\Pr\{|U|>u\}\,du=\frac{\alpha}{\alpha-1}\int_{0}^{\infty}u^{\frac{1}{\alpha-1}}\Pr\{|\phi(X)-\mathbb{E}_{P}[\phi(X)]|>u\}\,du\\&\leq\frac{\alpha}{\alpha-1}\int_{0}^{\infty}u^{\frac{1}{\alpha-1}}\,2e^{-\frac{\gamma u^{2}}{4L^{2}}}\,du=\frac{\alpha}{\alpha-1}\int_{0}^{\infty}u^{\frac{\alpha}{2(\alpha-1)}-1}e^{-\frac{\gamma u}{4L^{2}}}\,du\\&=\frac{\alpha}{\alpha-1}\Gamma\bigg(\frac{\alpha}{2(\alpha-1)}\bigg)\bigg(\frac{4L^{2}}{\gamma}\bigg)^{\frac{\alpha}{2(\alpha-1)}}=2\Gamma\bigg(\frac{3\alpha-2}{2(\alpha-1)}\bigg)\bigg(\frac{4L^{2}}{\gamma}\bigg)^{\frac{\alpha}{2(\alpha-1)}}\end{split}$$

The inequality is due to sub-Gaussianity. Putting everything together with Lemma 12, we prove our claim. ∎

A.13 Proof of Proposition 3

Proof.

By Lemma 11, we have

$$|\mathbb{E}_{Q}[\phi(X)]-\mathbb{E}_{P}[\phi(X)]|\leq\sqrt{\chi^{2}(Q\|P)\,\mathbb{E}_{P}[\phi^{2}(X)]}$$

where $\mathbb{E}_{P}[\phi(X)]$ is bounded as in Proposition 2:

$$\bigg|\mathbb{E}_{P}[\phi(X)]-\frac{1}{n}\sum^{n}_{i=1}\phi(X_{i})\bigg|\mathop{\leq}_{1-\delta}\frac{4L^{2}}{n\gamma}\log\Big(\frac{2}{\delta}\Big)$$

It remains to bound $\mathbb{E}_{P}[\phi^{2}(X)]$. Note that $U^{2}$ is a sub-exponential random variable with parameters $(4\sqrt{2}\sigma^{2},4\sigma^{2})$ when $U$ is a sub-Gaussian random variable with parameter $\sigma$ (see Appendix B in [15] for details). Therefore,

$$\Pr\bigg(\bigg|\frac{1}{n}\sum^{n}_{i=1}\phi^{2}(X_{i})-\mathbb{E}_{P}[\phi^{2}(X)]\bigg|\geq\epsilon\bigg)\leq\begin{cases}2e^{-\frac{n\gamma^{2}\epsilon^{2}}{256L^{4}}}&\text{if }\log\frac{2}{\delta}\leq n\\ 2e^{-\frac{n\gamma\epsilon}{16L^{2}}}&\text{if }\log\frac{2}{\delta}>n\end{cases}$$

Thus,

$$\bigg|\mathbb{E}_{P}[\phi^{2}(X)]-\frac{1}{n}\sum^{n}_{i=1}\phi^{2}(X_{i})\bigg|\mathop{\leq}_{1-\delta}\begin{cases}\frac{16L^{2}}{\gamma}\sqrt{\frac{1}{n}\log(\frac{2}{\delta})}&\text{if }\log\frac{2}{\delta}\leq n\\ \frac{16L^{2}}{n\gamma}\log(\frac{2}{\delta})&\text{if }\log\frac{2}{\delta}>n\end{cases}$$

Putting everything together, we prove our claim. ∎

A.14 Proof of Proposition 4

Proof.

Note that we have the following change of measure inequality for any function $\psi:\mathbb{R}^{d}\rightarrow\mathbb{R}$ from the Donsker-Varadhan representation for the KL-divergence:

$$\mathbb{E}_{Q}[\psi(X)]\leq KL(Q\|P)+\log(\mathbb{E}_{P}[e^{\psi(X)}])$$ (8)

Suppose $\psi_{1}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ satisfies the following condition:

$$\psi_{1}(X)=\phi(X)-\mathbb{E}_{P}[\phi(X)]$$

By plugging this into Equation (8), we have

$$\mathbb{E}_{Q}[\phi(X)]-\mathbb{E}_{P}[\phi(X)]\leq KL(Q\|P)+\log(\mathbb{E}_{P}[e^{\phi(X)-\mathbb{E}_{P}[\phi(X)]}])$$

On the other hand, suppose $\psi_{2}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ satisfies the following condition:

$$\psi_{2}(X)=\mathbb{E}_{P}[\phi(X)]-\phi(X)$$

By plugging in Equation (8), we have

𝔼P[ϕ(X)]𝔼Q[ϕ(X)]KL(QP)+log(𝔼P[e𝔼P[ϕ(X)]ϕ(X)])\begin{split}\mathbb{E}_{P}[\phi(X)]-\mathbb{E}_{Q}[\phi(X)]&\leq KL(Q\|P)+\log(\mathbb{E}_{P}[e^{\mathbb{E}_{P}[\phi(X)]-\phi(X)}])\end{split}

Putting the two bounds together, we have

\begin{split}&|\mathbb{E}_{Q}[\phi(X)]-\mathbb{E}_{P}[\phi(X)]|\leq KL(Q\|P)+\log(\mathbb{E}_{P}[e^{\phi(X)-\mathbb{E}_{P}[\phi(X)]}])\vee\log(\mathbb{E}_{P}[e^{\mathbb{E}_{P}[\phi(X)]-\phi(X)}])\end{split}

Suppose $X_{1},X_{2},\dots,X_{n}$ are i.i.d. random variables. Applying the same argument with $\phi(X)$ replaced by the sample mean $\frac{1}{n}\sum_{i=1}^{n}\phi(X_{i})$, we have

\begin{split}|\mathbb{E}_{Q}[\phi(X)]-\mathbb{E}_{P}[\phi(X)]|&\leq KL(Q\|P)\\ &+\log(\mathbb{E}_{P}[e^{\frac{1}{n}\sum_{i=1}^{n}\phi(X_{i})-\mathbb{E}_{P}[\phi(X)]}])\vee\log(\mathbb{E}_{P}[e^{\mathbb{E}_{P}[\phi(X)]-\frac{1}{n}\sum_{i=1}^{n}\phi(X_{i})}])\end{split} (9)

The term $\mathbb{E}_{P}[\phi(X)]$ is bounded as in Proposition 2:

\bigg{|}\mathbb{E}_{P}[\phi(X)]-\frac{1}{n}\sum^{n}_{i=1}\phi(X_{i})\bigg{|}\mathop{\leq}_{1-\delta}\frac{4L^{2}}{n\gamma}\log(\frac{2}{\delta}) (10)

As shown in Proposition 2, $\phi(X)-\mathbb{E}_{P}[\phi(X)]$ is a sub-Gaussian random variable with variance proxy $\sigma^{2}=\frac{2L^{2}}{\gamma}$. Therefore, by Definition 2,

\log(\mathbb{E}_{P}[e^{\frac{1}{n}\sum_{i=1}^{n}\phi(X_{i})-\mathbb{E}_{P}[\phi(X)]}])\vee\log(\mathbb{E}_{P}[e^{\mathbb{E}_{P}[\phi(X)]-\frac{1}{n}\sum_{i=1}^{n}\phi(X_{i})}])\leq\frac{L^{2}}{n\gamma} (11)

since both terms in the maximum are bounded by $\frac{L^{2}}{n\gamma}$. By (9), (10) and (11), we prove our claim. ∎
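As a sanity check of the Donsker-Varadhan inequality (8) used in this proof (not part of the paper), the following Monte Carlo sketch evaluates both sides for an assumed toy example with Gaussian $P$ and $Q$ and $\psi(x)=\sin(x)$.

```python
# Monte Carlo illustration (assumed toy example) of Equation (8):
#   E_Q[psi(X)] <= KL(Q || P) + log E_P[exp(psi(X))],
# with P = N(0,1), Q = N(1,1) (so KL(Q||P) = 1/2) and psi(x) = sin(x).
import math
import random

random.seed(0)
n = 200_000
psi = math.sin
kl = 0.5  # closed-form KL divergence between N(1,1) and N(0,1)

lhs = sum(psi(random.gauss(1.0, 1.0)) for _ in range(n)) / n                     # E_Q[psi(X)]
rhs = kl + math.log(sum(math.exp(psi(random.gauss(0.0, 1.0))) for _ in range(n)) / n)
print(f"E_Q[psi] ~ {lhs:.3f}  <=  KL + log E_P[e^psi] ~ {rhs:.3f}")
```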

Appendix B Pseudo $\alpha$-divergence is a member of the family of $f$-divergences.

Proof.

Let $f(t)=|t-1|^{\alpha}$ and $\lambda\in[0,1]$. Let $\frac{1}{\alpha}+\frac{1}{\beta}=1$ for $\alpha\geq 1$.

\begin{split}&f(\lambda x+(1-\lambda)y)\\ &=|\lambda x+(1-\lambda)y-1|^{\alpha}\\ &=|\lambda(x-1)+(1-\lambda)(y-1)|^{\alpha}\\ &\leq\{\lambda|x-1|+(1-\lambda)|y-1|\}^{\alpha}\\ &=\{\lambda^{\frac{1}{\alpha}}|x-1|\lambda^{\frac{1}{\beta}}+(1-\lambda)^{\frac{1}{\alpha}}|y-1|(1-\lambda)^{\frac{1}{\beta}}\}^{\alpha}\\ &\leq(\lambda|x-1|^{\alpha}+(1-\lambda)|y-1|^{\alpha})(\lambda+(1-\lambda))^{\frac{\alpha}{\beta}}\\ &=\lambda|x-1|^{\alpha}+(1-\lambda)|y-1|^{\alpha}\\ &=\lambda f(x)+(1-\lambda)f(y)\end{split}

The first inequality is due to the triangle inequality and the second inequality is due to Hölder’s inequality. By the definition of convexity, $f(t)$ is convex. Also, $f(1)=0$. By the definition of $f$-divergence, we prove our claim. ∎
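As an illustration only, the following sketch numerically spot-checks the convexity inequality established above for $f(t)=|t-1|^{\alpha}$ on randomly sampled points; the chosen values of $\alpha$ are arbitrary assumptions.

```python
# Numerical spot check (illustration only) of the convexity of f(t) = |t - 1|^alpha for alpha >= 1:
#   f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y).
import random

random.seed(0)
f = lambda t, alpha: abs(t - 1.0) ** alpha
for alpha in [1.0, 1.5, 2.0, 3.7]:
    for _ in range(10_000):
        x, y = random.uniform(-5, 5), random.uniform(-5, 5)
        lam = random.random()
        assert f(lam * x + (1 - lam) * y, alpha) <= lam * f(x, alpha) + (1 - lam) * f(y, alpha) + 1e-9
print("Convexity inequality holds on all sampled triples (x, y, lam).")
```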

Appendix C PAC-Bayesian bounds for bounded, sub-Gaussian and sub-exponential losses

Here we complement our results in Section 3 by presenting our PAC-Bayesian generalization bounds for bounded, sub-Gaussian, and sub-exponential losses.

C.1 Bounded Loss Function

First, let us assume here that the loss function is bounded, i.e., for any $h\in\mathcal{H}$ and $(x,y)\in\mathcal{X}\times\mathcal{Y}$, $\ell(h(x),y)\in[0,R]$ for some $R>0$. Note that, for $R>1$, we cannot use the total variation, squared Hellinger, reverse KL and Neyman $\chi^{2}$ divergences because $\phi$ is constrained to be in $[0,1]$.

Proposition 5 (The PAC-Bayesian bounds for bounded loss function).

Let $P$ be any prior distribution over an infinite hypothesis space $\mathcal{H}$. For a given posterior distribution $Q$ over an infinite hypothesis space $\mathcal{H}$, let $R_{D}(G_{Q})$ and $R_{S}(G_{Q})$ be the Gibbs risk and the empirical Gibbs risk as in Equations (2) and (3), respectively. For the sample size $m>0$ and $\alpha>1$, with probability at least $1-\delta$, simultaneously for all posterior distributions $Q$, we have

\begin{split}R_{D}(G_{Q})\leq R_{S}(G_{Q})+\sqrt{\frac{R^{2}}{2m}\log(\frac{2}{\delta})\big{(}\alpha(\alpha-1)D_{\alpha}(Q\|P)+1\big{)}^{\frac{1}{\alpha}}}\end{split} (12)
\begin{split}R_{D}(G_{Q})\leq R_{S}(G_{Q})+\sqrt{\frac{1}{m}\bigg{(}D_{\alpha}(Q\|P)+\frac{1}{\alpha(\alpha-1)}\bigg{)}+\frac{1}{m\alpha}\bigg{(}\frac{R^{2}(\alpha-1)}{2}\log(\frac{2}{\delta})\bigg{)}^{\frac{\alpha}{\alpha-1}}}\end{split} (13)
Proof.

Suppose that we have a convex function $\Delta:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}$ that measures the discrepancy between the observed empirical Gibbs risk $R_{S}(G_{Q})$ and the true Gibbs risk $R_{D}(G_{Q})$ on distribution $Q$. Given that, the purpose of the PAC-Bayesian theorem is to upper-bound the discrepancy $t\Delta(R_{D}(G_{Q}),R_{S}(G_{Q}))$ for any $t>0$. Let $\phi_{D}(h):=t\Delta(R_{D}(h),R_{S}(h))$, where the subscript of $\phi_{D}$ indicates the dependence on the data distribution $D$. Let $\Delta(q,p)=(q-p)^{2}$. By applying Jensen’s inequality to the convex function $\Delta$ in the first step,

\begin{split}t\Delta(R_{D}(G_{Q}),R_{S}(G_{Q}))&=t\Delta(\mathop{\mathbb{E}}_{h\sim Q}R_{D}(h),\mathop{\mathbb{E}}_{h\sim Q}R_{S}(h))\\ &\leq\mathop{\mathbb{E}}_{h\sim Q}t\Delta(R_{D}(h),R_{S}(h))\\ &=\mathop{\mathbb{E}}_{h\sim Q}\phi_{D}(h)\end{split} (14)

where $\phi_{D}(h)=t\Delta(R_{D}(h),R_{S}(h))=t(R_{S}(h)-R_{D}(h))^{2}$. By Hoeffding’s inequality, for any $\epsilon>0$ and any $h\in\mathcal{H}$,

\begin{split}&\mathop{Pr}_{(x,y)\sim D}(\phi_{D}(h)\geq\epsilon)\\ &=\mathop{Pr}_{(x,y)\sim D}\bigg{(}\big{(}R_{S}(h)-R_{D}(h)\big{)}^{2}\geq\frac{\epsilon}{t}\bigg{)}\\ &=\mathop{Pr}_{(x,y)\sim D}\bigg{(}\bigg{|}\frac{1}{m}\sum_{i=1}^{m}\ell(h(x_{i}),y_{i})-\mathop{\mathbb{E}}_{(x,y)\sim D}\ell(h(x),y)\bigg{|}\geq\sqrt{\frac{\epsilon}{t}}\bigg{)}\\ &\leq 2e^{-\frac{2m\epsilon}{R^{2}t}}\end{split}

Setting $\delta=2e^{-\frac{2m\epsilon}{R^{2}t}}$, we have

\begin{split}\phi_{D}(h)\mathop{\leq}_{1-\delta}\frac{tR^{2}}{2m}\log(\frac{2}{\delta})\end{split}

The symbol $\mathop{\leq}_{1-\delta}$ denotes that the inequality holds with probability at least $1-\delta$; the bound itself is a consequence of Hoeffding’s inequality above. For $\alpha>1$, we have

\begin{split}(\phi_{D}(h))^{\frac{\alpha}{\alpha-1}}\mathop{\leq}_{1-\delta}\bigg{(}\frac{tR^{2}}{2m}\log(\frac{2}{\delta})\bigg{)}^{\frac{\alpha}{\alpha-1}}\end{split}

Also note that

\begin{split}\mathop{\mathbb{E}}_{h\sim P}[(\phi_{D}(h))^{\frac{\alpha}{\alpha-1}}]&\leq\sup_{(x,y)\sim D}\bigg{\{}\mathop{\mathbb{E}}_{h\sim P}[(\phi_{D}(h))^{\frac{\alpha}{\alpha-1}}]\bigg{\}}\\ &\leq\mathop{\mathbb{E}}_{h\sim P}\bigg{[}\sup_{(x,y)\sim D}(\phi_{D}(h))^{\frac{\alpha}{\alpha-1}}\bigg{]}\\ &\mathop{\leq}_{1-\delta}\bigg{(}\frac{tR^{2}}{2m}\log(\frac{2}{\delta})\bigg{)}^{\frac{\alpha}{\alpha-1}}\end{split} (15)

By applying Equations (14) and (15) to Lemma 10,

\begin{split}t(R_{D}(G_{Q})-R_{S}(G_{Q}))^{2}\mathop{\leq}_{1-\delta}\frac{tR^{2}}{2m}\log(\frac{2}{\delta})\bigg{(}\alpha(\alpha-1)D_{\alpha}(Q\|P)+1\bigg{)}^{\frac{1}{\alpha}}\end{split}

which proves Equation (12). By applying Equations (14) and (15) to Lemma 5 and setting $t=m$, we prove Equation (13). ∎

Equation (12) gives a tighter bound than Proposition 2 in [1]. Also, Equation (13) provides a novel expression of the PAC-Bayesian theorem with the $\chi^{2}$-divergence (obtained by setting $\alpha=2$). The same arguments apply to the bounds in Propositions 6, 7 and 1.

The following corollary is an immediate consequence of this proposition for $\alpha=2$.

Corollary 3 (The PAC-Bayesian bounds with $\chi^{2}$-divergence for bounded loss function).

Let $P$ be any prior distribution over an infinite hypothesis space $\mathcal{H}$. For a given posterior distribution $Q$ over an infinite hypothesis space $\mathcal{H}$, let $R_{D}(G_{Q})$ and $R_{S}(G_{Q})$ be the Gibbs risk and the empirical Gibbs risk as in Equations (2) and (3), respectively. For the sample size $m>0$, with probability at least $1-\delta$, simultaneously for all posterior distributions $Q$, we have

\begin{split}R_{D}(G_{Q})&\leq R_{S}(G_{Q})+\sqrt{\frac{R^{2}}{2m}\log(\frac{2}{\delta})\sqrt{\chi^{2}(Q\|P)+1}}\end{split} (16)
\begin{split}R_{D}(G_{Q})\leq R_{S}(G_{Q})+\sqrt{\frac{1}{2m}\bigg{(}\chi^{2}(Q\|P)+1+\bigg{(}\frac{R^{2}}{2}\log(\frac{2}{\delta})\bigg{)}^{2}\bigg{)}}\end{split} (17)

These are novel PAC-Bayesian bounds for bounded loss functions based on the $\chi^{2}$-divergence. Most PAC-Bayesian bounds for bounded loss functions are based on the KL-divergence for the complexity term (e.g., [4, 30, 22]). [15] and [20] contain bounds for bounded loss functions with the $\chi^{2}$-divergence. Compared to their bounds, the bound (16) is tighter due to the power of $\frac{1}{4}$ on the complexity term and the dependence on $\log(\frac{1}{\delta})$ instead of $\frac{1}{\delta}$. Also, the PAC-Bayes bound (17) has a unique characteristic: $\chi^{2}(Q\|P)$ and $\frac{1}{\delta}$ are decoupled, since $\chi^{2}(Q\|P)$ is not multiplied by any factor involving $\frac{1}{\delta}$. Please see Table 3 for details.
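For illustration (with assumed, hypothetical values of $m$, $R$, $\delta$ and $\chi^{2}(Q\|P)$; these are not experimental results), the following sketch evaluates the gap terms of the bounds (16) and (17) to show how the complexity and confidence terms enter each bound.

```python
# Illustrative evaluation (assumed values) of the bound gaps in Corollary 3:
# gap of (16) vs. gap of (17) as chi^2(Q||P) grows, for fixed m, R, delta.
from math import log, sqrt

m, R, delta = 1000, 1.0, 0.05
for chi2 in [0.1, 1.0, 10.0, 100.0]:
    gap16 = sqrt(R**2 / (2 * m) * log(2 / delta) * sqrt(chi2 + 1))
    gap17 = sqrt(1 / (2 * m) * (chi2 + 1 + (R**2 / 2 * log(2 / delta)) ** 2))
    print(f"chi^2 = {chi2:6.1f}:  gap (16) = {gap16:.4f},  gap (17) = {gap17:.4f}")
```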

C.2 Sub-Gaussian Loss Function

In some contexts, such as regression, considering bounded loss functions is restrictive. Next, we relax the restrictions on the loss function to deal with unbounded losses. We assume that, for any $h\in\mathcal{H}$, $\ell(h(x),y)$ is sub-Gaussian. First, we recall the definition of a sub-Gaussian random variable [35].

Definition 2.

A random variable $Z$ is said to be sub-Gaussian with the expectation $\mathbb{E}[Z]=\mu$ and variance proxy $\sigma^{2}$ if for any $\lambda\in\mathbb{R}$,

\begin{split}\mathbb{E}[e^{\lambda(Z-\mu)}]\leq e^{\frac{\lambda^{2}\sigma^{2}}{2}}\end{split}
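As a minimal illustration of Definition 2 (not part of the paper), the following sketch checks numerically that a Rademacher variable, an example we add here, is sub-Gaussian with variance proxy $\sigma^{2}=1$, using its closed-form MGF $\mathbb{E}[e^{\lambda Z}]=\cosh(\lambda)$.

```python
# Illustration only: a Rademacher variable Z (P(Z=+1)=P(Z=-1)=1/2, so mu = 0) satisfies
# Definition 2 with variance proxy sigma^2 = 1:  cosh(lam) <= exp(lam^2 / 2) for all lam.
from math import cosh, exp

for k in range(-100, 101):
    lam = k / 10.0
    assert cosh(lam) <= exp(lam**2 / 2) + 1e-9, lam
print("cosh(lam) <= exp(lam^2/2) verified on the grid lam in [-10, 10].")
```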

Next, we present our PAC-Bayesian bounds.

Proposition 6 (The PAC-Bayesian bounds for sub-Gaussian loss function).

Let $P$ be a fixed prior distribution over an infinite hypothesis space $\mathcal{H}$. For a given posterior distribution $Q$ over an infinite hypothesis space $\mathcal{H}$, let $R_{D}(G_{Q})$ and $R_{S}(G_{Q})$ be the Gibbs risk and the empirical Gibbs risk as in Equations (2) and (3), respectively. For the sample size $m$ and $\alpha>1$, with probability at least $1-\delta$, simultaneously for all posterior distributions $Q$, we have

\begin{split}R_{D}(G_{Q})\leq R_{S}(G_{Q})+\sqrt{\frac{2\sigma^{2}}{m}\log(\frac{2}{\delta})\big{(}\alpha(\alpha-1)D_{\alpha}(Q\|P)+1\big{)}^{\frac{1}{\alpha}}}\end{split}
\begin{split}R_{D}(G_{Q})\leq R_{S}(G_{Q})+\sqrt{\frac{1}{m}\bigg{(}D_{\alpha}(Q\|P)+\frac{1}{\alpha(\alpha-1)}\bigg{)}+\frac{1}{m\alpha}\bigg{(}2\sigma^{2}(\alpha-1)\log(\frac{2}{\delta})\bigg{)}^{\frac{\alpha}{\alpha-1}}}\end{split}
Proof.

Suppose the convex function $\Delta:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}$ is defined as in Proposition 5. Employing Chernoff’s bound, the tail probability for the average $\overline{Z}$ of $m$ i.i.d. sub-Gaussian random variables [35] is given as follows

\begin{split}Pr(|\overline{Z}-\mu|\geq\epsilon)\leq 2e^{-\frac{m\epsilon^{2}}{2\sigma^{2}}}\end{split} (18)

Setting $\overline{Z}=R_{S}(h)$ and $\mu=R_{D}(h)$ in the tail bound in Equation (18), for any $h\in\mathcal{H}$, we have

\begin{split}&\mathop{Pr}_{(x,y)\sim D}(\phi_{D}(h)\geq\epsilon)\\ &=\mathop{Pr}_{(x,y)\sim D}\bigg{(}(R_{S}(h)-R_{D}(h))^{2}\geq\frac{\epsilon}{t}\bigg{)}\\ &=\mathop{Pr}_{(x,y)\sim D}\bigg{(}\bigg{|}\frac{1}{m}\sum_{i=1}^{m}\ell(h(x_{i}),y_{i})-\mathop{\mathbb{E}}_{(x,y)\sim D}\ell(h(x),y)\bigg{|}\geq\sqrt{\frac{\epsilon}{t}}\bigg{)}\\ &\leq 2e^{-\frac{m\epsilon}{2\sigma^{2}t}}\end{split}

Setting $\delta=2e^{-\frac{m\epsilon}{2\sigma^{2}t}}$, we have

\phi_{D}(h)\mathop{\leq}_{1-\delta}\frac{2t\sigma^{2}}{m}\log(\frac{2}{\delta})

where $\phi_{D}(h)$ is defined as in Equation (14). By Equation (15), we have

\mathop{\mathbb{E}}_{h\sim P}[(\phi_{D}(h))^{\frac{\alpha}{\alpha-1}}]\mathop{\leq}_{1-\delta}\bigg{(}\frac{2t\sigma^{2}}{m}\log(\frac{2}{\delta})\bigg{)}^{\frac{\alpha}{\alpha-1}} (19)

On the other hand, we may upper-bound $\mathop{\mathbb{E}}_{h\sim P}[(\phi_{D}(h))^{\frac{\alpha}{\alpha-1}}]$ in the following way (as in Proposition 6 in [1]).

\begin{split}\mathop{\mathbb{E}}_{h\sim P}[(\phi_{D}(h))^{\frac{\alpha}{\alpha-1}}]&\mathop{\leq}_{1-\delta}\frac{t^{\frac{\alpha}{\alpha-1}}}{\delta}\mathop{\mathbb{E}}_{(x,y)\sim D}\mathop{\mathbb{E}}_{h\sim P}[(R_{S}(h)-R_{D}(h))^{\frac{2\alpha}{\alpha-1}}]\\ &=\frac{t^{\frac{\alpha}{\alpha-1}}}{\delta}\mathop{\mathbb{E}}_{h\sim P}\mathop{\mathbb{E}}_{(x,y)\sim D}[(R_{S}(h)-R_{D}(h))^{\frac{2\alpha}{\alpha-1}}]\end{split}

Let $U=R_{S}(h)-R_{D}(h)$ and $q=\frac{\alpha}{\alpha-1}$. Then, $U$ is a sub-Gaussian random variable by our assumption.

\begin{split}&\mathop{\mathbb{E}}_{(x,y)\sim D}[(R_{S}(h)-R_{D}(h))^{\frac{2\alpha}{\alpha-1}}]=\mathop{\mathbb{E}}_{(x,y)\sim D}[U^{2q}]\\ &=\int_{0}^{\infty}Pr\{|U|^{2q}>u\}du=2q\int_{0}^{\infty}u^{2q-1}Pr\{|U|>u\}du\\ &\leq 4q\int_{0}^{\infty}u^{2q-1}e^{-\frac{mu^{2}}{2\sigma^{2}}}du\end{split}

By the substitution $u=\sqrt{s}$, the previous inequality becomes

\begin{split}&\mathop{\mathbb{E}}_{(x,y)\sim D}[(R_{S}(h)-R_{D}(h))^{\frac{2\alpha}{\alpha-1}}]\leq 2q\int_{0}^{\infty}s^{q-1}e^{-\frac{ms}{2\sigma^{2}}}ds\\ &=2q\Gamma(q)\bigg{(}\frac{m}{2\sigma^{2}}\bigg{)}^{-q}=2\frac{\alpha}{\alpha-1}\Gamma\bigg{(}\frac{\alpha}{\alpha-1}\bigg{)}\bigg{(}\frac{2\sigma^{2}}{m}\bigg{)}^{\frac{\alpha}{\alpha-1}}\end{split}

Therefore, we have

\begin{split}\mathop{\mathbb{E}}_{h\sim P}[(\phi_{D}(h))^{\frac{\alpha}{\alpha-1}}]\mathop{\leq}_{1-\delta}\frac{2\alpha t^{\frac{\alpha}{\alpha-1}}}{\delta(\alpha-1)}\Gamma\bigg{(}\frac{\alpha}{\alpha-1}\bigg{)}\bigg{(}\frac{2\sigma^{2}}{m}\bigg{)}^{\frac{\alpha}{\alpha-1}}\end{split} (20)

Although one may use either Equation (19) or (20), we can easily see that Equation (19) is always tighter than (20). Putting everything together with Lemma 5, we prove our claim. ∎
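As an illustration of the tightness claim above (not part of the proof), the following sketch compares the two upper bounds (19) and (20) numerically for assumed values of $t$, $\sigma$, $m$, $\alpha$ and $\delta$.

```python
# Numerical comparison (assumed values) of the two upper bounds (19) and (20) on
# E_{h~P}[(phi_D(h))^{alpha/(alpha-1)}].
from math import gamma, log

t, sigma, m = 100.0, 1.0, 100.0
for alpha in [1.5, 2.0, 5.0]:
    q = alpha / (alpha - 1)
    for delta in [0.01, 0.05, 0.5, 0.99]:
        bound19 = (2 * t * sigma**2 / m * log(2 / delta)) ** q
        bound20 = 2 * alpha * t**q / (delta * (alpha - 1)) * gamma(q) * (2 * sigma**2 / m) ** q
        assert bound19 <= bound20
        print(f"alpha={alpha:3.1f}, delta={delta:4.2f}: (19)={bound19:10.4f}  (20)={bound20:10.4f}")
```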

The following corollary is an immediate consequence of Proposition 6 for $\alpha=2$.

Corollary 4 (The PAC-Bayesian bounds with $\chi^{2}$-divergence for sub-Gaussian loss function).

Let $P$ be any prior distribution over an infinite hypothesis space $\mathcal{H}$. For a given posterior distribution $Q$ over an infinite hypothesis space $\mathcal{H}$, let $R_{D}(G_{Q})$ and $R_{S}(G_{Q})$ be the Gibbs risk and the empirical Gibbs risk as in Equations (2) and (3), respectively. For the sample size $m>0$, with probability at least $1-\delta$, simultaneously for all posterior distributions $Q$, we have

\begin{split}R_{D}(G_{Q})&\leq R_{S}(G_{Q})+\sqrt{\frac{2\sigma^{2}}{m}\log(\frac{2}{\delta})\sqrt{\chi^{2}(Q\|P)+1}}\end{split} (21)
\begin{split}R_{D}(G_{Q})\leq R_{S}(G_{Q})+\sqrt{\frac{1}{2m}\bigg{(}\chi^{2}(Q\|P)+1+\bigg{(}2\sigma^{2}\log(\frac{2}{\delta})\bigg{)}^{2}\bigg{)}}\end{split} (22)

[1] proved a PAC-Bayes bound for sub-Gaussian loss functions with the $\chi^{2}$-divergence. It is noteworthy that $\Delta$ may be any convex function, and a different choice of $\Delta$ leads to different bounds. For instance, choosing $\Delta(q,p)=|q-p|$ for Lemma 10 results in

\begin{split}R_{D}(G_{Q})&\leq R_{S}(G_{Q})+\sqrt{\frac{2\sigma^{2}}{m}\log(\frac{2}{\delta})(\chi^{2}(Q\|P)+1)}\end{split}

which is quite similar to the bounds in Proposition 6 in [1] and looser than Equation (21). We obtained a tighter bound due to our choice of $\Delta(q,p)=(q-p)^{2}$.
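For illustration (assumed values, not experimental results), the following sketch compares the gap term of (21) with the gap term obtained from the choice $\Delta(q,p)=|q-p|$ above, showing the $(\chi^{2}(Q\|P)+1)^{1/4}$ versus $(\chi^{2}(Q\|P)+1)^{1/2}$ scaling.

```python
# Illustration (assumed values) of why Delta(q,p) = (q-p)^2 yields a tighter bound than
# Delta(q,p) = |q-p|: the gap in (21) scales as (chi^2+1)^(1/4), the alternative as (chi^2+1)^(1/2).
from math import log, sqrt

m, sigma2, delta = 1000, 1.0, 0.05
for chi2 in [0.1, 1.0, 10.0, 100.0]:
    gap_sq = sqrt(2 * sigma2 / m * log(2 / delta) * sqrt(chi2 + 1))   # Equation (21)
    gap_abs = sqrt(2 * sigma2 / m * log(2 / delta) * (chi2 + 1))      # Delta(q,p) = |q - p|
    print(f"chi^2 = {chi2:6.1f}:  (21) gap = {gap_sq:.4f}   |q-p| gap = {gap_abs:.4f}")
```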

C.3 Sub-Exponential Loss Function

We now turn to a more general class where $\ell(h(x),y)$ is sub-exponential for any $h\in\mathcal{H}$. First, we define sub-exponentiality [35].

Definition 3.

A random variable $Z$ is said to be sub-exponential with the expectation $\mathbb{E}[Z]=\mu$ and parameters $\sigma^{2}$ and $\beta>0$ if

\begin{split}\mathbb{E}[e^{\lambda(Z-\mu)}]\leq e^{\frac{\lambda^{2}\sigma^{2}}{2}}\quad\text{for all }|\lambda|<\frac{1}{\beta}\end{split}

Next, we provide our PAC-Bayesian bounds.

Proposition 7 (The PAC-Bayesian bounds for sub-exponential loss function).

Let $P$ be a fixed prior distribution over an infinite hypothesis space $\mathcal{H}$. For a given posterior distribution $Q$ over an infinite hypothesis space $\mathcal{H}$, let $R_{D}(G_{Q})$ and $R_{S}(G_{Q})$ be the Gibbs risk and the empirical Gibbs risk as in Equations (2) and (3), respectively. For the sample size $m$ and $\alpha>1$, with probability at least $1-\delta$, simultaneously for all posterior distributions $Q$, we have

\begin{split}R_{D}(G_{Q})\leq R_{S}(G_{Q})+\sqrt{\mathcal{K}^{1}_{\delta}\big{(}\alpha(\alpha-1)D_{\alpha}(Q\|P)+1\big{)}^{\frac{1}{\alpha}}}\end{split}
\begin{split}R_{D}(G_{Q})\leq R_{S}(G_{Q})+\sqrt{\frac{1}{m}\bigg{(}D_{\alpha}(Q\|P)+\frac{1}{\alpha(\alpha-1)}\bigg{)}+\mathcal{K}^{2}_{\alpha,\delta}}\end{split}

where

\begin{split}\mathcal{K}^{1}_{\delta}=\begin{cases}\frac{2\sigma^{2}}{m}\log(\frac{2}{\delta}),\quad\frac{2\beta^{2}\log(\frac{2}{\delta})}{\sigma^{2}}\leq m\\ (\frac{2\beta}{m}\log\frac{2}{\delta})^{2},\quad 0<m<\frac{2\beta^{2}\log(\frac{2}{\delta})}{\sigma^{2}}\end{cases}\qquad\mathcal{K}^{2}_{\alpha,\delta}=\frac{m^{\frac{1}{\alpha-1}}(\alpha-1)^{\frac{\alpha}{\alpha-1}}}{\alpha}(\mathcal{K}^{1}_{\delta})^{\frac{\alpha}{\alpha-1}}\end{split}
Proof.

Suppose the convex function $\Delta:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}$ is defined as in Proposition 5. For the average $\overline{Z}$ of $m$ i.i.d. random variables satisfying Definition 3, we have the following concentration inequality [35].

\begin{split}Pr(|\overline{Z}-\mu|\geq\epsilon)\leq\begin{cases}2e^{-\frac{m\epsilon^{2}}{2\sigma^{2}}},\quad 0<\epsilon\leq\frac{\sigma^{2}}{\beta}\\ 2e^{-\frac{m\epsilon}{2\beta}},\quad\frac{\sigma^{2}}{\beta}<\epsilon\end{cases}\end{split} (23)

Following the proof of Proposition 6, for $0<\sqrt{\frac{\epsilon}{t}}\leq\frac{\sigma^{2}}{\beta}$, we have

\phi_{D}(h)\mathop{\leq}_{1-\delta}\frac{2t\sigma^{2}}{m}\log(\frac{2}{\delta})

For $\frac{\sigma^{2}}{\beta}<\sqrt{\frac{\epsilon}{t}}$, we have

\begin{split}&\mathop{Pr}_{(x,y)\sim D}(\phi_{D}(h)\geq\epsilon)\\ &=\mathop{Pr}_{(x,y)\sim D}\bigg{(}(R_{S}(h)-R_{D}(h))^{2}\geq\frac{\epsilon}{t}\bigg{)}\\ &=\mathop{Pr}_{(x,y)\sim D}\bigg{(}\bigg{|}\frac{1}{m}\sum_{i=1}^{m}\ell(h(x_{i}),y_{i})-\mathop{\mathbb{E}}_{(x,y)\sim D}\ell(h(x),y)\bigg{|}\geq\sqrt{\frac{\epsilon}{t}}\bigg{)}\\ &\leq 2e^{-\frac{m}{2\beta}\sqrt{\frac{\epsilon}{t}}}\end{split}

Setting $\delta=2e^{-\frac{m}{2\beta}\sqrt{\frac{\epsilon}{t}}}$, we have, for $\frac{\sigma^{2}}{\beta}<\sqrt{\frac{\epsilon}{t}}$,

\phi_{D}(h)\mathop{\leq}_{1-\delta}t\bigg{(}\frac{2\beta}{m}\log\frac{2}{\delta}\bigg{)}^{2}

Thus,

\begin{split}\phi_{D}(h)\mathop{\leq}_{1-\delta}\begin{cases}\frac{2t\sigma^{2}}{m}\log(\frac{2}{\delta}),\quad\frac{2\beta^{2}\log(\frac{2}{\delta})}{\sigma^{2}}\leq m\\ t(\frac{2\beta}{m}\log\frac{2}{\delta})^{2},\quad 0<m<\frac{2\beta^{2}\log(\frac{2}{\delta})}{\sigma^{2}}\end{cases}\end{split}

where $\phi_{D}(h)$ is defined as in Equation (14). Now that we have the upper bound for $\phi_{D}(h)$, we can apply the same procedure as in Propositions 5 and 6. ∎

Note that, for $\frac{2\beta^{2}\log(\frac{2}{\delta})}{\sigma^{2}}\leq m$, $\phi_{D}(h)$ enjoys the same bound as in the sub-Gaussian case. When the sample size is small, the bound $t(\frac{2\beta}{m}\log\frac{2}{\delta})^{2}$ applies instead, so the sub-exponential assumption still yields a valid bound in a regime where the sub-Gaussian analysis is not available. This illustrates the greater generality of assuming sub-exponentiality rather than sub-Gaussianity. By setting $\alpha=2$ in Proposition 7, we have the following corollary.

Corollary 5 (The PAC-Bayesian bounds with $\chi^{2}$-divergence for sub-exponential loss function).

Let $P$ be any prior distribution over an infinite hypothesis space $\mathcal{H}$. For a given posterior distribution $Q$ over an infinite hypothesis space $\mathcal{H}$, let $R_{D}(G_{Q})$ and $R_{S}(G_{Q})$ be the Gibbs risk and the empirical Gibbs risk as in Equations (2) and (3), respectively. For the sample size $m>0$, with probability at least $1-\delta$, simultaneously for all posterior distributions $Q$, we have

\begin{split}R_{D}(G_{Q})&\leq R_{S}(G_{Q})+\sqrt{\mathcal{K}^{1}_{\delta}\sqrt{\chi^{2}(Q\|P)+1}}\end{split}
\begin{split}R_{D}(G_{Q})&\leq R_{S}(G_{Q})+\sqrt{\frac{1}{2m}\big{(}\chi^{2}(Q\|P)+1+(m\mathcal{K}^{1}_{\delta})^{2}\big{)}}\end{split}

where

\begin{split}&\mathcal{K}^{1}_{\delta}=\begin{cases}\frac{2\sigma^{2}}{m}\log(\frac{2}{\delta}),\quad\frac{2\beta^{2}\log(\frac{2}{\delta})}{\sigma^{2}}\leq m\\ (\frac{2\beta}{m}\log\frac{2}{\delta})^{2},\quad 0<m<\frac{2\beta^{2}\log(\frac{2}{\delta})}{\sigma^{2}}\end{cases}\end{split}

To the best of our knowledge, our bounds for sub-exponential losses are entirely novel.
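For illustration (assumed values of $\sigma^{2}$, $\beta$ and $\delta$; not experimental results), the following sketch evaluates $\mathcal{K}^{1}_{\delta}$ in both regimes and shows the switch at $m=\frac{2\beta^{2}\log(\frac{2}{\delta})}{\sigma^{2}}$.

```python
# Illustration (assumed values) of the two regimes of K^1_delta in Proposition 7 / Corollary 5.
from math import log

sigma2, beta, delta = 1.0, 2.0, 0.05
threshold = 2 * beta**2 * log(2 / delta) / sigma2
for m in [5, 10, 20, 50, 100, 500]:
    if m >= threshold:
        k1 = 2 * sigma2 / m * log(2 / delta)          # sub-Gaussian-like regime
    else:
        k1 = (2 * beta / m * log(2 / delta)) ** 2     # small-sample regime
    print(f"m={m:4d} (threshold ~ {threshold:.1f}): K^1_delta = {k1:.4f}")
```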

Appendix D Log-concave distribution

We say that a distribution $P$ with a density $p$ (with respect to the Lebesgue measure) is strongly log-concave if the function $\log p$ is strongly concave. Equivalently, this condition means that the density can be expressed as $p(x)=\exp(-\psi(x))$, where the function $\psi:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is strongly convex, meaning that there is some $\gamma>0$ such that

\lambda\psi(x)+(1-\lambda)\psi(y)-\psi(\lambda x+(1-\lambda)y)\geq\frac{\gamma}{2}\lambda(1-\lambda)\|x-y\|_{2}^{2}

for all $\lambda\in[0,1]$ and $x,y\in\mathbb{R}^{d}$.
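As an illustration (not part of the paper), the following sketch numerically spot-checks this strong-convexity inequality for the assumed toy case of a Gaussian density, for which $\psi(x)=\frac{1}{2}x^{\top}\Sigma^{-1}x$ up to an additive constant and $\gamma$ can be taken to be the smallest eigenvalue of $\Sigma^{-1}$.

```python
# Spot check (illustration only): a Gaussian N(0, Sigma) is strongly log-concave with
# gamma = lambda_min(Sigma^{-1}); the precision matrix below is an assumed toy example.
import random

random.seed(0)
Sigma_inv = [[2.0, 0.5], [0.5, 1.0]]           # positive definite precision matrix
a, b, c = Sigma_inv[0][0], Sigma_inv[0][1], Sigma_inv[1][1]
gamma = (a + c - ((a - c) ** 2 + 4 * b * b) ** 0.5) / 2   # smallest eigenvalue (2x2 closed form)

def psi(x):
    # quadratic potential (1/2) x^T Sigma_inv x
    return 0.5 * sum(x[i] * Sigma_inv[i][j] * x[j] for i in range(2) for j in range(2))

for _ in range(10_000):
    x = [random.uniform(-3, 3) for _ in range(2)]
    y = [random.uniform(-3, 3) for _ in range(2)]
    lam = random.random()
    z = [lam * x[i] + (1 - lam) * y[i] for i in range(2)]
    lhs = lam * psi(x) + (1 - lam) * psi(y) - psi(z)
    rhs = gamma / 2 * lam * (1 - lam) * sum((x[i] - y[i]) ** 2 for i in range(2))
    assert lhs >= rhs - 1e-9
print(f"Strong convexity with gamma = {gamma:.4f} verified on all sampled points.")
```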