Cosme Louart
School of Data Science, The Chinese University of Hong Kong (Shenzhen), Shenzhen, China
Abstract
Following the concentration of measure theory formalism, we consider the transformation of a random variable having a general concentration function. If the transformation is Lipschitz with a deterministic Lipschitz parameter, the concentration function of the image is immediately deduced. If the variations of the transformation are instead bounded by a random variable having its own concentration function, this paper shows that the image has a concentration function analogous to the so-called parallel product of the two concentration functions. With this result at hand, (i) we express the concentration of random vectors with independent heavy-tailed entries, (ii) given a transformation with bounded successive differentials, we express the so-called “multi-level” concentration of the image as a function of the concentration of the input and of the operator norms of the successive differentials, and (iii) we obtain a heavy-tailed version of the Hanson-Wright inequality.
MSC classification: 60-08, 60B20, 62J07
Keywords: Concentration of Measure, Hanson-Wright inequality, Heavy-tailed probabilities, Parallel sum
Notations
We denote by (resp. ) the set of nonnegative (resp. nonpositive) real numbers, and . Given , we denote , the closure of , , the interior of , and , the boundary of ().
We extend the relation “” to a relation between sets and introduce a new order “” defined as follows for any sets :
and we denote (resp. ) when (resp. ). The above definitions easily extend to cases where or are scalars, which are then identified with the singletons or .
We consider below a set-valued mapping (also called “operator”) . The graph of is the set . The inverse of is the set-valued mapping satisfying:
Placing ourselves in the framework of monotone operators, we will say that is increasing if:
and is said to be a decreasing operator if is increasing. Increasing and decreasing operators are both called monotone operators.
The domain of is denoted and the range .
(Scalar-valued) functions are identified with operators whose image are all singletons. Given a set , we naturally define:
Then, given two operators , the composition of and is the operator defined for any as .
The sum (resp. the product) of two operators is simply denoted (resp. ); their domain is exactly .
We denote the function defined for all as . Given , we denote the increment operator whose domain is and defined as:
Given two euclidean vector spaces and and an integer , we denote , the set of times differentiable mappings from to and , the set of -linear mappings from to . Given a linear mapping , we further denote the operator norm of defined as:
Given , , and , we denote , the differential of at point .
Introduction
A fundamental result of the concentration of measure theory states that for any Gaussian random vector , for any mapping , -Lipschitz for the euclidean norm on :
the mapping is here called a concentration function of .
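For concreteness, one classical normalization of this Gaussian result (the constants may differ from the ones adopted in this paper) reads as follows: for a standard Gaussian vector $Z \sim \mathcal{N}(0, I_n)$ and any $1$-Lipschitz mapping $f : \mathbb{R}^n \to \mathbb{R}$,
\[
\forall t \geq 0: \qquad \mathbb{P}\big( \left| f(Z) - \mathbb{E}[f(Z)] \right| \geq t \big) \;\leq\; 2 e^{-t^2/2},
\]
so that $t \mapsto 2 e^{-t^2/2}$ is a concentration function of $Z$ in the sense above.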
Given a transformation , -Lipschitz for the euclidean norm on , we have for any , an independent copy of :
and it is easy to deduce, with a slight modification of , that the concentration function of the random vector is . The aim of our paper is to express the concentration of when is not constant but a certain random variable for , measurable. The following result is stated for general concentration functions of and .
Theorem 0.1.
Let us consider two metric spaces , , a random variable , a measurable mapping such that there exist two strictly decreasing mappings such that for any -Lipschitz mapping , :
and
for any independent copy of , and a transformation such that, for all :
then for any -Lipschitz mapping :
(1)
This theorem allows us to extend the famous Hanson-Wright concentration inequality to a wider range of concentration functions (Theorems 6.1 and 6.7).
A more general version of this theorem is given in Theorem 3.5 for which are possibly constant on some parts of their domain, and for a random variable which can be written as a product of random variables, each of which has its own concentration function. Theorem 3.10 gives a similar result when the random variable satisfies weaker, so-called “convex concentration” hypotheses and takes values in .
The mapping can be rewritten with the introduction of and .
The expression can be recognized as the so-called parallel sum, originally introduced in electrical engineering to model parallel resistor networks, which was then generalized to matrices in [5] and to nonlinear operators in convex analysis in [4] (see [7, Chapter 24] for a presentation in the context of set-valued functions).
The parallel sum is traditionally denoted , but since we will also be presenting a parallel product, we found it more convenient to denote it as (and the parallel product as ). The operation on graphs can easily be represented as shown in Figure 1.
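As an elementary illustration of the electrical origin of this operation: under the classical convention that the parallel sum is obtained by adding the inverses and inverting back, the linear maps $f(t) = R_1 t$ and $g(t) = R_2 t$ (Ohm's law for two resistors) combine into
\[
\big( f^{-1} + g^{-1} \big)^{-1}(t) \;=\; \Big( \tfrac{1}{R_1} + \tfrac{1}{R_2} \Big)^{-1} t \;=\; \frac{R_1 R_2}{R_1 + R_2}\, t,
\]
which is the usual formula for two resistors in parallel.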
Figure 1: Classical sum (Left) and parallel sum (Right) of two operators. One can wonder from this second graph whether the parallel sum should not rather be called the perpendicular sum
The control of non-Lipschitz functionals has been studied by different authors. To be brief, one can mention the famous works of Vu in [25] on binary variables, then the introduction of specific operator norms on tensors to obtain the concentration of polynomials of Gaussian variables by Latała in [16], later generalized to more general variables and functionals in [3], [11] and [12].
Their studies brought out the notion of “multilevel concentration inequalities”, where the concentration rate usually takes the form:
for a finite set and . This multi-level scheme appears naturally when one computes the parallel product of decreasing mappings
for a decreasing function (possibly heavy-tailed) replacing . We present in Theorem 5.5 a first setting where this multi-level concentration can appear.
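To see where this multilevel shape comes from, here is a small computation under the convention that the parallel product inverts the product of the inverses (a sketch, not the exact normalization used later): for the two Gaussian-type profiles $f(t) = g(t) = e^{-t^2}$ one has $f^{-1}(p) = g^{-1}(p) = \sqrt{\log(1/p)}$, hence
\[
\big( f^{-1} \cdot g^{-1} \big)(p) \;=\; \log(1/p)
\qquad \Longrightarrow \qquad
\big( f^{-1} \cdot g^{-1} \big)^{-1}(t) \;=\; e^{-t}.
\]
The parallel product of two Gaussian-type decays is thus an exponential decay, which is exactly the second level appearing in Hanson-Wright-type inequalities.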
However, a more natural setting allows us to continue the long-running study of the concentration of chaos and other functionals with bounded -differential, . Two approaches have been developed for these objects.
The older one is based on Hermite polynomials and was introduced by Christer Borell [9]. This work is not available online but was cited in [17], which studies chaoses of maximal order of Gaussian variables. Concentration inequalities closer to the ones we will establish were then proved for Gaussian random variables in [6] and for random vectors with independent log-concave entries in [19], both with the appearance of quantiles instead of medians.
The more recent approach was initiated by Latała in [16] and relies on moment bounds to express a two-sided concentration inequality for Gaussian chaoses. The tensor-product matrix norms involved in this type of concentration can be numerous, but the expectation is taken inside the norms, which can be an improvement. This approach was followed in [3] for general functionals with bounded derivatives of higher order and for random vectors satisfying log-Sobolev inequalities; then [12] extended the results to so-called “-subexponential random variables”.
The two approaches are known to be “relatively equivalent” in the case of (see the discussion initiated at the end of Section 3 in [2]); the question remains open for larger orders. Our present work is in line with the first approach, and in this respect, our main contribution is to extend previous results to functionals with bounded -derivative and to any random vector of a metric space admitting a concentration function that could be defined as:
(following the idea of Ledoux presented at the beginning of his book [18]).
Theorem 0.2.
Let us consider a random vector such that for any -Lipschitz:
for a certain median of , and a certain decreasing mapping .
Then, for any -differentiable mapping and any , -Lipschitz one can bound:
(2)
where is a median of , for all , is a median of and .
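To illustrate the mechanism in the simplest non-Lipschitz case $k = 2$ (a sketch with generic notation, not the exact statement above): if $\sup_x \| d^2 f_x \| \leq L_2$, the Taylor formula with integral remainder gives, for an independent copy $X'$ of $X$,
\[
\big| f(X) - f(X') \big| \;\leq\; \| d f_{X'} \| \, \| X - X' \| \;+\; \tfrac{L_2}{2}\, \| X - X' \|^2,
\]
so the variations of $f$ are controlled by the product of the random variable $\| d f_{X'} \|$ (itself an $L_2$-Lipschitz functional of $X'$) with a concentrated quantity, plus a quadratic term; applying the parallel operations to the corresponding concentration functions produces the two-level decay.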
Finally, let us mention here that an elementary result (Lemma 2.5) provides a sufficient condition to replace medians or independent copies with expectations in concentration inequalities similar to (2) when is integrable. In “natural settings”, it essentially consists in assuming the integrability of . This lemma allows us to generalize the Hanson-Wright inequality to random vectors having a concentration function satisfying (bounded second moment), see Theorem 6.7.
In order to be able to treat rigorously the case of non-invertible mappings and , we decided to work with maximal operators of , i.e. set-valued mappings that always admit an inverse. To our knowledge, this connection between probabilities and set-valued mappings was only explored by Rockafellar (see for instance [22]) to study financial tools like superquantiles.
Section 1 provides basic tools for dealing with operators and expresses the sum (resp. the product) of two concentration inequalities with the parallel sum (resp. the parallel product). A further presentation of operators is provided in the appendix, with the definition of the minimum of operators and its connection with the parallel sum, to provide a more intuitive understanding of this operation. It is an easy exercise to check that these results on operators become trivial in the case of singleton-valued invertible mappings.
Section 2 briefly explains how to deal with different concentration pivots such as the median, independent copies or the expectation. Section 3 introduces important results of concentration in high dimension and expresses the concentration of a transformation with concentrated variations. In Section 4 we present a general example of heavy-tailed concentration in high dimension. In Section 5 we introduce relevant notations and parallel operation tools to justify the appearance of multilevel concentration, and we further prove Theorem 0.2. Finally, in Section 6, we apply our results to obtain a heavy-tailed version of the Hanson-Wright theorem.
1 Parallel operations and concentration in
Definition 1.1.
Given two operators , we denote the parallel sum and the parallel product of and as follows:
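For orientation, in the elementary case of invertible increasing scalar functions, and under the classical convention that the inverse of the parallel sum is the sum of the inverses (cf. [5, 4, 7]) and, by analogy, that the inverse of the parallel product is the product of the inverses (an assumption on the normalization, given here only as a sketch), one gets for instance with $f(t) = 2t$ and $g(t) = 3t$:
\[
f^{-1}(p) + g^{-1}(p) = \tfrac{p}{2} + \tfrac{p}{3} = \tfrac{5p}{6}
\;\Longrightarrow\; (\text{parallel sum})(t) = \tfrac{6t}{5},
\qquad
f^{-1}(p)\, g^{-1}(p) = \tfrac{p^2}{6}
\;\Longrightarrow\; (\text{parallel product})(t) = \sqrt{6t}.
\]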
To establish some first elementary properties of the parallel sum and product, let us provide without proof basic lemmas on the inverse of operators.
Given an operator , and is monotone if and only if is monotone and has the same monotonicity.
One can easily determine the domain and range of the parallel sum and parallel product from their definition.
Proposition 1.3.
Given two increasing (resp. decreasing) operators , is also increasing (resp. decreasing) and:
•
and ,
•
.
To ensure the monotonicity of , one has to require the ranges of to be subsets of (or ), as will be done later.
As a simple consequence of Lemma 1.19, note that the parallel sum and product are both associative and commutative, and the parallel product distributes over the parallel sum as classical sums and products of functions would do.
Lemma 1.4.
Given three operators , one has the identities:
•
and ,
•
and ,
•
.
Unlike the classical sum or product of functions, the composition distributes over the sum and the product on the left.
Proposition 1.5.
Given three operators :
It is a simple inference of the following lemma.
Lemma 1.6.
Given two operators : .
Be careful: increasing operators (resp. decreasing operators) satisfy , but of course . In particular, if , note that ; one sees that is decreasing but is not monotone. Composition with invertible functions, though, preserves monotonicity.
It will be more convenient below to work with so-called “maximally monotone operators”, principally because of Proposition 1.10 and Theorems 1.18 and 1.12 given below. Although it is not completely trivial, maximality for operators on is quite elementary, mainly because the class of convex sets coincides with the class of connected sets (it is the class of intervals of ).
Definition 1.7.
A monotone operator is maximally monotone if there exists no monotone operator such that is strictly included in .
In other words, is maximally increasing if and only if the following equivalence is satisfied:
(3)
One has to replace “” with “” to get a characterization of maximally decreasing operators.
Given a maximally monotone operator , , is a closed interval; besides, and are both intervals of , and more generally, for any interval , is also an interval.
Definition 1.11.
An operator is locally bounded at if and only if there exists such that is bounded.
A maximally monotone operator is locally bounded at a point if and only if .
Since we are working in , it is possible to weaken the hypotheses, given in Example 1.9, Item 5, that are sufficient for the sum (or parallel sum) of two operators to be maximally monotone.
Proposition 1.13.
Given two maximally monotone operators with the same monotonicity:
•
maximally monotone,
•
maximally monotone,
Note that the operator such that is rigorously monotone but not maximally monotone (its graph is contained in the graph of any monotone operator).
Proof.
We know from Proposition 1.10 that are two intervals of . If contains more than one element, then it contains an open set and therefore ; one can then employ Example 1.9, Item 5 to deduce that is maximally monotone.
If , for some , then Theorem 1.12 allows us to conclude if, say, are maximally increasing, and that , and therefore, , is maximally increasing (and decreasing).
The result on parallel sum is deduced from Example 1.9, Item 1.
∎
One could be tempted to transfer the maximality of the parallel sum to the parallel product thanks to the following identity, true for any such that (the identity is only true for operators such that ):
(4)
However the maximality properties raise some difficulties because there are some examples where is maximally monotone but is not (for instance for any for ). Besides the condition is arguably too restrictive for our application where one would more naturally merely assume .
Still it is possible to show the maximality of the product and the parallel product from scratch if one restricts the ranges to stay in , mainly to preserve monotonicity.
Proposition 1.14.
Given two maximally monotone operators with the same monotonicity:
•
maximally monotone,
•
maximally monotone,
Proof.
Let us do the proof in the case increasing. We will use the characterization with given by Theorem 1.8. Proposition 1.10 ensures that and are both connected sets of . Then it is not hard to see that is also connected and non empty (it is simply a consequence of the continuity of the product in ). By continuity of the projection on the first variable, one can deduce that and therefore are both intervals of .
Now, let us note from Theorem 1.12 that (otherwise ) and therefore . If one denotes ( by hypothesis), either and then , or , and again . We have exactly shown, by convexity of , that ; one can conclude with Theorem 1.8 that is maximal.
The previous arguments can be applied to the mappings and when and are maximally decreasing.
∎
With all those results on maximally monotone operators established, we reach the central objective of this section: to relate the cumulative distribution function of the sum (resp. of the product) of random variables to the parallel sum (resp. the parallel product). Although rather trivial (at least when working with invertible functions and not operators), it is at the basis of the main theorems of the next sections. Since we are trying to bound probabilities, we decide to slightly restrict the class of operators considered in this kind of result. This will simplify the settings and the proofs.
The class of maximally decreasing operators such that is called the class of probabilistic operators and is denoted .
The set is called the class of positive probabilistic operators and is denoted .
Lemma 1.16.
Given two operators :
Proof.
The first implication is a simple consequence of Propositions 1.3 and Example 1.9, Item 5.
For the second implication, let us simply note from Lemma 1.3 that:
The same holds for .
∎
Proposition 1.17.
Given and random variables satisfying for all , : , we have the concentration (recall that the parallel sum and the parallel product are both associative operations thanks to Lemma 1.4; there is therefore no need for parentheses):
If one assumes in addition that , and almost surely,
then, we have the concentration:
where
When and , note that . In particular, the example depicted on Figure 2 shows that the inequality can be reached for some random variables and and some values of .
Figure 2: If the law of is uniformly distributed on the gray triangles, then : . One can also unbalance the weights between and to get probabilities different from .
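In the elementary case of invertible tail functions, the mechanism behind Proposition 1.17 is a plain union bound (a sketch, ignoring the exact constants of the statement): if $\mathbb{P}(X \geq t) \leq f(t)$ and $\mathbb{P}(Y \geq t) \leq g(t)$ with $f, g$ decreasing and invertible, then for any $p$ in their common range,
\[
\mathbb{P}\big( X + Y \geq f^{-1}(p) + g^{-1}(p) \big)
\;\leq\; \mathbb{P}\big( X \geq f^{-1}(p) \big) + \mathbb{P}\big( Y \geq g^{-1}(p) \big)
\;\leq\; 2p,
\]
that is, $\mathbb{P}(X + Y \geq t) \leq 2\, (f^{-1} + g^{-1})^{-1}(t)$ on the range of $f^{-1} + g^{-1}$: the parallel sum of the two tails appears. The parallel product plays the same role for the product $XY$ when $X, Y \geq 0$, since $\{ XY \geq ab \} \subset \{ X \geq a \} \cup \{ Y \geq b \}$.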
The proof of this proposition requires two additional results that will mainly allow us to deal with the multiplicity of the images of operators. Maximally monotone operators are actually almost always singleton-valued, as seen in the next theorem.
that is maximally decreasing thanks to Proposition 1.13 (by definition of probabilistic operators, ).
Lemma 1.19 allows us to set:
Considering , we know from Proposition 1.10 that is an interval as well as any .
Let us first assume that , for a given . Then for all , we denote such that (we know from Proposition 1.10 that is a closed interval, and it is not empty since ). Of course ,
one can then bound:
Now, , , therefore the last set inequality becomes as expected .
When contains more than one element, we still know from Theorem 1.12 that there exists , (otherwise would be unbounded from below, which would contradict the fact that ). One can then employ Theorem 1.18 to introduce a set , dense in and such that , is a singleton. We know from the first part of the proof that , ; the lower semi-continuity of (or right continuity, since the mapping is decreasing) then allows us to conclude:
The proof is analogous for the parallel product thanks to the almost sure implication:
∎
In the setting of Proposition 1.17, note that if are all invertible mappings (singleton-valued operators) defined on , one can deduce that :
This inference is fully justified for general operators in the appendix.
2 Pivot of a concentration
Proposition 1.17 could be called a “tail concentration result”; one might more naturally be interested in the concentration around a deterministic value, as depicted in the next results.
Proposition 2.1.
Given two real-valued random variables and two deterministic variables , such that and , with , we have the concentrations:
•
,
•
Proof.
It is a mere application of Proposition 1.17. The first result is a simple consequence of the triangular inequality and the second result is deduced from the bound:
∎
Remark 2.2.
Note that in the setting of Proposition 2.1 above, if is a decreasing mapping and , then Proposition A.17 provides:
Therefore, one obtains:
which shares some similarity with the Hanson-Wright results presented in Theorems 6.1 and 6.7.
The concentration around an independent copy, although less intuitive, presents some advantages like the one provided in the next lemma.
The concentration rate of a -Lipschitz transformation of a random variable is proportional to the Lipschitz parameter since:
Lemma 2.3.
Given a random variable , an independent copy, , , a parameter and a -Lipschitz mapping , one has the implication:
Concentration around an independent copy is somehow equivalent to concentration around a median as stated in next lemma.
Lemma 2.4.
Given a random variable , an independent copy, , a median, and :
Proof.
The second implication can be taken from [18], Corollary 1.5. The first implication is just a consequence of the fact that:
∎
One can then wonder if the same is true for the expectation of replacing the median (at least when the expectation is defined). Given , we define its integral with the introduction of the set:
where the mappings are interpreted as singleton-valued operators to give sense to the inequality .
If , otherwise .
Lemma 2.5.
Given a random variable , a deterministic scalar , and such that , one can bound:
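As an illustration of the kind of estimate involved here, assume that $\mathbb{P}(|X - a| \geq t) \leq \alpha(t)$ for a decreasing integrable function $\alpha$ (a sketch, not the exact statement of the lemma); the layer-cake formula then yields
\[
\big| \mathbb{E}[X] - a \big| \;\leq\; \mathbb{E}\, |X - a| \;=\; \int_0^{\infty} \mathbb{P}\big( |X - a| \geq t \big)\, dt \;\leq\; \int_0^{\infty} \alpha(t)\, dt,
\]
which is the typical way a median or an independent copy can be traded for an expectation when the tail is integrable.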
The concentration of measure theory is only relevant in high dimension where, as Talagrand noted in [24]: “A random variable that depends (in a “smooth” way) on the influence of many independent variables (but not too much on any of them) is essentially constant”. We give in the next two theorems two fundamental results of the theory; we refer the reader to [18] for a wider list of examples. Recent powerful characterizations of product measures in very general settings can be found in [14, 13].
The concentration of a random variable of a metric space is expressed through the concentration of any for belonging to a certain class of regularity.
The random variables are classically called “observations” of . Depending on the class of regularity of the set of observations that should satisfy the concentration inequality, one obtains different classes of concentration, typically, from the strongest to the weakest notion: the Lipschitz (see Theorem 3.1), the convex (see Theorem 3.2) and the linear (see Theorems 6.1, 6.7) concentration.
Without particular specification, is endowed with the euclidean norm.
Given a Gaussian vector with covariance identity, for any -Lipschitz mapping and any independent copy of , :
One can easily deduce, as was done for Lemma 2.3, that any random variable , where is a -Lipschitz transformation from to some metric space , satisfies a similar result with a decay
A second result was proven by Talagrand and only provides the concentration of convex observations. It is a weaker result but a very useful one since it allows us to study discrete distributions (which cannot be obtained through Lipschitz transformations of the Gaussian vectors mentioned in Theorem 3.1).
Given a random vector with independent entries and a -Lipschitz mapping :
The upper concentration could equivalently have been restricted to -Lipschitz and concave .
Remark 3.3.
This theorem can be generalized to any random vector , for two deterministic and such that (the convexity of when is convex cannot be ensured by a general transformation ). One could sum up this remark by saying that the class of convexly concentrated random vectors is stable under bounded affine transformations.
For some specific transformations that preserve some convexity properties, it is sometimes possible to show the linear concentration of (for instance when is built with some entry-wise product or matrix product as in Theorems 6.1 and 6.7; one can refer to [20, Theorem 1] for more general results on polynomials of random matrices).
To end our short presentation, it is interesting to note that, in finite dimension, the concentration of a linear observation happens around .
Lemma 3.4.
Given a finite-dimensional vector space (one could easily provide a definition of the expectation in any reflexive space or even any vector space of functions taking values in a reflexive space; however, for the definition, we require to be continuous on (the dual set of ); without further information on (like a bound), the lemma can only be true in a finite-dimensional space, where all linear forms are continuous; if, instead of the assumption , 1-Lipschitz, one adopts the assumption , , then is in a sense centered, and it is possible to deduce the result in a general reflexive space), a random vector , an integrable such that for all -Lipschitz mapping , and any median of , , ,
then is well defined and one has the concentration for any linear form , such that :
This lemma is a direct consequence of Lemma 2.5, so we proceed directly to the main result of this section: the extension of the concentration provided in Theorems 3.1, 3.2 to some non-Lipschitz observations.
Theorem 3.5.
Let us consider a metric space , a random variable , measurable mappings such that there exist probabilistic operators , such that for all -Lipschitz mapping and any independent copy of , :
and
Given a supplementary metric space , and , if we assume that :
(5)
then for any , -Lipschitz:
In the Hanson-Wright setting, , , for some deterministic matrix and (then ).
Theorem 3.5 is a simple consequence of the following result that we took from [15].
Lemma 3.6.
In a metric space , given a subset and a mapping , -Lipschitz, the extension defined as:
(6)
is -Lipschitz and satisfies .
Proof.
First note that since is -Lipschitz on , given , for all :
and in addition , therefore, .
Now, given , the triangle inequality provides:
For simplicity, we do the proof for invertible mappings ; we leave the proof for general probabilistic operators as an exercise for the reader, the different arguments having already been provided in the proof of Proposition 1.17.
Let us introduce the notation , and consider . We know from Lemma 1.19 that:
(7)
Introducing the set , one gets:
(8)
We know from (5) that is -Lipschitz on .
Let us then denote , the -Lipschitz extension of defined in Lemma 3.6; one can bound thanks to Lemma 1.19 and (7):
One can besides bound from the hypotheses on the concentration of :
Combining the two last inequalities with (8) one finally obtains the result of the theorem.
∎
To adapt Theorem 3.5 to the case of the convex concentration, one needs to find a -Lipschitz and convex mapping that could play the role of the mapping defined in (6). When the original mapping is differentiable, a good suggestion was provided in [1]; we will adapt their definition to merely convex settings in Lemma 3.9 thanks to the notion of subgradient that we recall below.
Definition 3.7(Subgradient).
Given a euclidean vector space , a convex mapping and a point , the subgradient of at is the set:
The subgradient is well suited to the study of convex Lipschitz mappings thanks to the following property.
Lemma 3.8.
Given a euclidean vector space , an open set and convex and -Lipschitz,
for all , and .
Proof.
The non-emptiness is provided for instance in [8], Proposition 5.4.1. Now, note that given and :
∎
This lemma allows us to rigorously define our Lipschitz convex extension.
Lemma 3.9.
Given a euclidean vector space , a non-empty open set and a -Lipschitz, convex mapping , the mapping:
(9)
is convex, -Lipschitz and satisfies:
If , then only takes finite values.
Proof.
First, the convexity is obvious since convexity is stable under suprema. Second, the triangle inequality satisfied by suprema allows us to state that :
Finally, for all and for all :
by definition of the subgradient. In other words which implies .
Let us assume that there exists such that ; then there exist two sequences of vectors and such that , , and:
That means in particular that , since and .
∎
Theorem 3.10.
Let us consider a euclidean vector space , a random vector , continuous mappings such that there exist probabilistic operators , satisfying for any -Lipschitz and convex mapping and any independent copy of , :
and
Given a convex mapping , if we assume that for all :
then:
Proof.
The proof is of course quite similar to the one of Theorem 3.5, except that in this case one must check that the sets , for a given , are open in order to employ Lemma 3.9 instead of Lemma 3.6. To show that is open, first note that:
thanks to Proposition 1.10, which states that and are both closed. One can then conclude with the continuity hypothesis on . The rest of the proof is strictly the same.
∎
4 Heavy-tailed random vector concentration
The aim of this section is not to establish optimal heavy-tailed concentration results but merely to present a straightforward and efficient way to show the Lipschitz concentration of random vectors having independent heavy-tailed entries. The simple idea is to employ the Gaussian concentration given in Theorem 3.1 as a pivot to reach more general concentration decays.
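The following small numerical sketch (an illustration, not part of the statement of Proposition 4.1 below) shows the pivot idea in action: the heavy-tailed entries are generated as images of independent standard Gaussian variables, here with the hypothetical choice $\lambda(g) = \mathrm{sign}(g)\,(e^{g^2/(2\beta)} - 1)$, whose tail is, up to logarithmic factors, polynomial of order $\beta$, and a $1$-Lipschitz observation of the resulting vector exhibits a far lighter empirical tail than a single entry.
\begin{verbatim}
# Numerical illustration (hypothetical transform, for illustration only):
# heavy-tailed entries obtained as images of independent standard Gaussians,
# compared with a 1-Lipschitz observation of the whole vector.
import numpy as np

rng = np.random.default_rng(0)
n, n_samples, beta = 500, 5000, 3.0

def lam(g):
    # sign(g) * (exp(g^2 / (2*beta)) - 1): tail of order t^(-beta) up to logs
    return np.sign(g) * (np.exp(g**2 / (2.0 * beta)) - 1.0)

X = lam(rng.standard_normal((n_samples, n)))   # i.i.d. copies of the vector

# f(x) = n^{-1/2} * sum_i |x_i| is 1-Lipschitz for the euclidean norm
f = np.abs(X).sum(axis=1) / np.sqrt(n)

for t in (1.0, 2.0, 4.0):
    entry_tail = np.mean(np.abs(X[:, 0] - np.median(X[:, 0])) > t)
    obs_tail = np.mean(np.abs(f - np.median(f)) > t)
    print(f"t = {t}: entry tail {entry_tail:.4f}, observation tail {obs_tail:.4f}")
\end{verbatim}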
To simplify the notations, let us introduce the operator defined, for all , as:
(10)
Proposition 4.1.
Let us consider a random vector such that there exists a set of independent Gaussian random variables , and piecewise differentiable mappings satisfying:
Given an increasing invertible mapping such that and for all , , (in other words, we require to be sub-additive after ; this is true in particular for ), for any , -Lipschitz and independent copy of , , we have the concentration:
(11)
In particular, given , introducing the quantity , possibly infinite but independent of , one can bound:
(12)
Remark 4.2.
1.
One might be tempted to replace the mapping with where . This is generally not possible because one rather has : , which implies, thanks to Lemma A.6, and therefore . However, for heavy-tailed distributions, is exponentially increasing, which implies that the tail behaviors of and are very similar.
2.
In the setting of Proposition 4.1, Lemma 2.4 allows us to replace the random variable by any of its medians simply adding a factor in inequalities (11) and (12).
Corollary 4.3.
In the setting of Proposition 4.1, if , is well defined and one can further bound:
Proof.
Introducing the notation , we know from Proposition 4.1 and Lemma 2.4 that . Besides, let us bound:
Now, if , Jensen’s inequality provides:
and given two random variables , . If Jensen’s inequality implies here ,
and the triangular inequality provides which allows us to conclude in all cases that:
∎
Before proving Proposition 4.1, let us apply it to the case of random variables with tail , . The next example is completely artificial but gives a simple illustration of Proposition 4.1.
Example 4.4.
Consider the setting of Proposition 4.1 with , , and:
for a given .
We first consider an entry of , , for , to understand which examples are concerned by this setting.
Having in mind the elementary identities , and , one can perform the change of variable in order to get:
With this last identity, we see that the moment of order of is finite if and only if ; this setting can therefore concern heavy-tailed distributions.
We now check the hypotheses of Proposition 4.1 on the mapping .
Given , . Then, for any :
The (like the identity mapping ) is therefore sub additive on and consequently the same holds for .
Besides, and one can employ Proposition 4.1 to set the concentration, for any , -Lipschitz:
Given a parameter , let us compute:
We see that if , then and for any and any -Lipschitz mapping :
With this last example, we saw that there exists a random vector having independent entries with minimal order of infinite moment equal to and such that any Lipschitz observation has a finite moment for any .
More developments are required here to see how far one can extend this class of heavy-tailed random vectors. In particular, some links could be made with the Fuk-Nagaev inequality [10, 21], which only concerns linear observations . In some respects, Fuk-Nagaev is stronger because it states that when the entries have a finite moment of order , then the first moments are of powers of (it would be in our case since the sum is a -Lipschitz mapping for the euclidean norm on ). However, to our knowledge, the best Fuk-Nagaev result is only given for ([1]) and, more importantly, it only concerns linear mappings, which is an important limitation.
The proof of Proposition 4.1 relies on the following elementary lemma that justifies the sub-additive condition required on .
Lemma 4.5.
Given an increasing mapping such that for all , , , one can bound:
Proof.
Starting with the implication:
we consider and try to bound from below:
(13)
Now starting from the hypothesis on applied to and one gets:
Applying on both sides ( like is increasing) one obtains:
One can then inject this last inequality in (13) to finally obtain:
to obtain the identity (where we naturally defined ).
We want to apply Theorem 3.5 to the random vector and the mapping . Given an independent copy of , let us bound (recall that is increasing):
Let us bound for any :
A probability being smaller than ,
one can employ Lemma 4.5 to get (with the notation :
For the second statement,
note that if we introduce , one can bound:
with the changes of variable and (note that thus ).
∎
5 Multilevel concentration
Let us study the particular settings of Theorems 3.5 and 3.10 when all can be written as , for satisfying for any , -Lipschitz:
(14)
given a random vector .
We say that are “right compositions” of . It is in particular the case for the Hanson-Wright inequality that we start to depict below.
Given a deterministic matrix , the application of Theorem 3.5 requires us to express the concentration of for a functional that allows us to bound the variations:
It is not hard to see that, as a -Lipschitz functional of , satisfies , where is a median of .
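For concreteness, the typical variation bound used at this stage for a quadratic form is of the following kind (a sketch stated with the euclidean operator norm, up to the exact normalization of the paper): for $x, y \in \mathbb{R}^n$ and a deterministic matrix $A$,
\[
\big| x^T A x - y^T A y \big|
\;=\; \big| (x - y)^T A\, x + y^T A\, (x - y) \big|
\;\leq\; \| A \| \, \| x - y \| \, \big( \| x \| + \| y \| \big),
\]
so that the quadratic form is Lipschitz with a random Lipschitz parameter driven by the concentrated variable $\| x \|$.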
The lemma below allows us to see that the concentration function of is actually a right composition of :
(15)
Lemma 5.1.
Given a random variable , a probabilistic operator , and two parameters , :
Let us introduce some useful notations before proving this lemma. Given an increasing operator such that , we denote , the operator having domain equal to and satisfying:
In particular, given , note that:
(16)
Besides, the fact that (naturally, and ) and Theorem A.16 imply:
Note that , thus (one could be tempted to take advantage of Proposition 1.17 and the concentration , being a constant, but that would introduce an unnecessary coefficient in the decay; besides, Proposition 1.17 actually cannot be applied since is not maximally monotone), for all :
∎
Theorem 3.5 applied to the concentration (14) and (15) then leads us to compute thanks to Proposition 1.5 the parallel product:
thanks to Lemma 1.4.
Here, with the new notations we introduced and Lemma 5.2, let us bound:
•
,
•
, ,
One can finally conclude, since is decreasing and the parallel sum (like the parallel product) preserves the order:
(17)
We see that the new concentration is again some variation of a right composition of . This last concentration function can be involved in new concentration inequalities; that is why we will now describe the mechanism systematically. Let us introduce the “short notations”, for all :
With these notations at hand, one sees that the concentration functions appearing in (17) can be expressed in a general way for finite sets and . It is then possible to identify some simple calculation rules, provided in the next lemma. The proof is a simple consequence of the distributive property of the parallel product provided by Lemma 1.4.
Lemma 5.3.
Given finite subsets , and families of parameters one has the identity:
and , we recognize here the inverse of the convex conjugate of . This remark leads to some interesting, yet more laborious, proofs of Theorem 5.5 below.
Theorem 5.5.
Let us consider a metric space , a random variable , measurable mappings such that there exist , finite index sets containing , , and families of positive parameters such that for all , -Lipschitz and for any independent copy of , :
(18)
and:
Given a supplementary metric space , and a measurable mapping , if we assume that for any :
(19)
then for any , -Lipschitz:
The result is also true in a convex concentration setting ( euclidean vector space, and (18) is true for any -Lipschitz and convex).
Although practical realizations of this setting do not seem very frequent, it partly explains the frequent appearance of multilevel concentration (in particular in [11], whose setting is quite far from the literature around our Theorem 0.2 that we briefly presented in the introduction).
Let us first give some remarks on this theorem before providing its proof.
•
If , then, by convention, denoting one has:
thus we see that the contribution of will be nonexistent in the computation of the parallel sum.
•
If there exists such that , then it means that is a constant equal to , and it is indeed treated as such in the final formula.
One can then conclude with Proposition 1.5 combined with Lemma 5.3.
∎
The last result of multilevel concentration relies on the Taylor approximation of -differentiable mappings and on the notion of modulus of continuity. To remain consistent with our framework, we introduce this definition for operators.
Definition 5.6.
A modulus of continuity is a maximally increasing operator satisfying and .
Given two metric spaces , , a mapping is said to be -continuous if and only if:
One then has the following characterization of the concentration of measure phenomenon with a modulus of continuity, already provided in [18]; it is a particular case of Lemma 5.9 given just below.
Given a random vector if, for any , -Lipschitz and for any median of , , then for any , -continuous, and any median of , one has:
(the converse is obvious).
It does not seem easy (if at all possible) to find analogues of Lemmas 3.6 and 3.9 to extend a -continuous mapping when is not concave (it is a well-known fact that moduli of continuity on convex bodies can be assumed to be concave or sub-additive, but the question is then to show that our restriction space is convex, which is generally not the case), as will be the case in the proof of the next theorem.
Fortunately, this difficulty can easily be overcome since the condition that will be met to rely on the Taylor approximation is a localized notion of -continuity that we define below.
Definition 5.8.
Given two metric spaces and , a modulus of continuity , and , we say that a mapping is -continuous from if for all , for all :
One can then take inspiration from the beginning of Section 1.3 in [18] to get the following lemma.
Lemma 5.9.
Let us consider a metric space , a random variable , and a decreasing mapping such that:
(20)
for , a median of , then
for any subsets , any modulus of continuity
, and any measurable mapping , -continuous from :
(21)
for any , a median of .
In [18], most of the results are stated in the measure theory framework; the next proof is mainly an adaptation of the arguments of [18] with probabilistic notation.
Proof.
Introducing the set , note that :
since is -continuous from and:
We then rely on the mapping to remove the condition :
(22)
Now, first note that is -Lipschitz on . Given , if or , the Lipschitz character of the distance (it satisfies the triangle inequality) allows us to deduce that . We then consider the case and , and assume, without loss of generality, that ; one can then bound:
Given , there exists and such that and then:
thanks to the triangular inequality. Letting tend to zero, one can deduce the -Lipschitz character of .
Second, note that admits as a median:
One can then deduce from the hypothesis of the lemma that:
In the case where is a (singleton-valued) invertible function, one can directly conclude:
for the general set-valued case, one can employ similar arguments to the ones given in the proof of Proposition 1.17.
∎
We are now almost ready to state our main result, Theorem 0.2, on the concentration of finitely differentiable mappings.
To sharpen the concentration bound, a solution is to work with a sequence of polynomials satisfying:
(23)
where the parameters were defined in the setting of Theorem 0.2 (actually, was not defined, but it could be any positive number; to stay in line with the definition of , one could choose as a median of ), but Lemma 5.10 below is independent of this choice.
Note that for all , has degree .
Lemma 5.10.
Given the sequence of polynomials defined in (23) (for a given sequence ), if one introduces the coefficients satisfying:
(24)
then:
Proof.
Let us first find a relation between the coefficients from the expression of . Of course, one has , thus . Now injecting (24) in (23), one obtains (with the changes of variable and ):
One then gets the following recurrence relation between the coefficients:
The recursion formula reveals that the lower index is independent of the upper index; we then denote , , (for any ).
Let us then show recursively that . Of course , then given , if we assume that this inequality is true for , one can bound:
And the Stirling formula allows us to conclude:
∎
We will prove below a stronger result than Theorem 0.2; the latter is then merely deduced from it thanks to Lemma 5.10 and Proposition A.17.
Theorem 5.11.
Let us consider a random vector such that for any -Lipschitz:
for a certain median of , and a certain decreasing mapping .
Then, for any -differentiable mapping and any , -Lipschitz, one can bound:
where is a median of , for all , is a median of , and are the parameters introduced in Lemma 5.10.
Proof.
We will show recursively that, for all , for all , -Lipschitz:
(25)
and is a median of .
Note from Lemma 5.1 that the iteration hypothesis implies in particular:
(26)
Given :
which means that is a -Lipschitz transformation of and therefore, for any , -Lipschitz:
and indeed .
Let us now assume that (25) is true from down to a certain . One can bound thanks to the Taylor expansion around :
One retrieves the iteration hypothesis (25) by combining (28) with (5); the result is obtained from the identity:
∎
To obtain a version of Theorem 0.2 in a convex concentration setting, one would first need to establish an analogue of Lemma 5.9 under a convex concentration hypothesis (which is not straightforward: it would only be true for convex sets , and would then be the difference of two convex mappings, which would impact the final constants); one would also have to assume that all the mappings are convex, which seems quite restrictive. To limit the content of the present article, we leave these developments to interested readers.
6 Consequences for the Hanson-Wright concentration inequality
We formulate below the Hanson-Wright inequality as a linear concentration result on random matrices with the widest possible hypotheses on (a result with the expectation is provided in Theorem 6.7). This result is completely equivalent to the concentration of quadratic forms for (see Remark 6.2), but having already presented the stronger notions of Lipschitz and convex concentration in the previous sections, we find it interesting to provide some examples of the class of linearly concentrated vectors.
Theorem 6.1.
Given a decreasing mapping and a random matrix , if one assumes that for all -Lipschitz mapping and for all median of , :
then for all deterministic , one has the concentration:
for a given median of , and any , median of .
In a convex concentration setting, the same result is obtained with some numerical constants replacing the “” and “” in the last result.
Remark 6.2(Vectorization of a matricial identity).
Given , let us introduce the notation satisfying and an operation having value in defined for any and as:
Through a mere regrouping of indexes in sums, one obtains:
(30)
One sees with this vectorized identity that the study of boils down to a mere study of where is a random vector and a deterministic matrix.
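A vectorized identity of this flavor, given here only as an illustration under the assumption that the functional under study is of the form $\operatorname{Tr}(A X B X^T)$, is the classical Kronecker-product formula:
\[
\operatorname{Tr}\big( A X B X^T \big) \;=\; \operatorname{vec}(X)^T \big( B^T \otimes A \big) \operatorname{vec}(X),
\]
which reduces the matrix statement to the concentration of a quadratic form $x^T M x$ in the vectorized variable $x = \operatorname{vec}(X)$.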
Let us assume, without loss of generality that in (30) is a symmetric matrix (the anti-symmetric contribution cancels since ).
Theorem 3.5 and Theorem 5.11 (the strong version of Theorem 0.2) can both be applied here, but in order to get the best concentration constants possible, we instead check the conditions of the latter.
Let us introduce , then one has for any :
Therefore:
Applying Theorem 5.11, one can deduce the expected result (note that and ).
In the case of a convex concentration of , one can still obtain a similar result by expressing as the difference of two positive symmetric matrices in order to be able to consider convex mappings, and by combining Theorem 3.10 and Lemma 2.4 to conclude.
∎
The rest of the section aims at rewriting Theorem 6.1 in the cases where admits an expectation, which is linked to some integrability properties of (see Lemma 2.5). The first lemma helps us bound , which will be close to the median “” in Theorem 6.1.
Lemma 6.3.
Given a random matrix and two deterministic matrices and :
Then one can bound thanks to Cauchy-Schwarz inequality and Jensen inequality:
since .
∎
Let us now express the conditions for which can be bounded.
Lemma 6.4.
Given a random vector and , let us denote:
If we assume that for all linear form , s.t. :
then one can bound:
This lemma partly relies on a trivial result which is a direct consequence of Cauchy-Schwarz inequality (it is quite similar to , for a random variable ).
Lemma 6.5.
Given :
where .
Remark 6.6.
The term might look quite bothersome. Actually, it is not an issue since probabilities are bounded by , and one can generally replace with where . Looking at Lemma 6.4 and Theorem 6.7, which we will state later, one sees it is sufficient to show that:
•
, which is obvious,
•
, which comes from the fact that, introducing (non-empty by definition of ), one knows from Lemma 6.5 that , and therefore .
Following Remark 6.6 we know that it is sufficient to show our result for . In particular, since then , Lemma 6.5 provides the inequality .
Given such that , one can then merely bound thanks to Fubini Theorem:
One can conclude that .
∎
We have now all the elements to prove:
Theorem 6.7.
Given a decreasing mapping and a random matrix , denoting again , we assume that and that for any -Lipschitz and convex mapping and any , a median of :
then for any deterministic , one has the concentration:
The hypothesis might seem a bit elaborate, but one can always adapt to satisfy this condition depending on the setting. Of course, this theorem is only relevant when .
In the case where and for a given , one obtains the classical Hanson-Wright inequality (see [1]) with some numerical constants independent of and :
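For reference, the classical inequality alluded to here, as it appears in the literature (cf. [1]), reads, with $K$ the sub-Gaussian norm of the entries and $c$ an absolute constant (up to the precise constants and dependency assumptions):
\[
\forall t \geq 0: \qquad
\mathbb{P}\Big( \big| x^T A x - \mathbb{E}\big[ x^T A x \big] \big| \geq t \Big)
\;\leq\; 2 \exp\Big( - c \min\Big( \frac{t^2}{K^4 \| A \|_F^2},\; \frac{t}{K^2 \| A \|} \Big) \Big).
\]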
Proof.
Following Remark 6.6, we can assume without loss of generality that .
We already know from Theorem 6.1 and Lemma 3.4 (to be a direct application of Lemma 3.4, one should actually start from the Lipschitz concentration of , but Theorem 6.1 just provides the concentration of , ; that is however not an issue since, in Lemma 3.4, the only relevant assumption is the concentration of the observations , ) that :
(31)
where we recall that is a median of .
Besides:
A simple application of Fubini and Lemma 6.5 provides:
(32)
Now, starting from the linear concentration inequality (a consequence of Lemma 3.4):
after computing , we can deduce from Lemmas 6.3 and 6.4 that:
and inject this bound in (31) to obtain the result of the theorem.
∎
Appendix A Minimum of operators
The aim of this section is to bound the parallel sum and the parallel product by a simpler expression involving the minimum of operators. Before defining the minimum, we naturally introduce a relation on the set of operators with the same monotonicity.
This relation relies on inequalities between the images of operators; it also requires some conditions on the domains. To express this last condition, given a subset , let us introduce the notations:
Definition A.1(Operator order relation).
Given two increasing operators , we say that is lower than , and we denote , if and only if (when and are maximally monotone, the domains are convex (see Proposition 1.10), and the first conditions rewrite as and ):
If for all , instead of , one has , then we say that is bigger than and we denote .
The definitions for decreasing operators are the same but one needs to interchange the symbols “” and “”.
Note that the hypothesis on the domains implies inequalities on the domains (they are equivalences when the domains are non-empty intervals of , for instance for maximally monotone operators).
Given two sets :
Definition A.1 simplifies when dealing with maximally monotone operators:
Proposition A.2.
Given two maximally monotone operators with the same monotonicity one has the equivalence:
Proof.
Let us assume that are both maximally increasing, that , and consider . Considering , if , one has ; if , then , . If , then the hypothesis , which implies in particular , allows us to consider some such that . We have just shown:
the maximality of then exactly implies . In other words, .
∎
The condition on the domain could actually be replaced (in the case of maximally monotone operators) by a similar condition on the range, as partly justified by the next lemma.
Note that the result is here the same for increasing or decreasing operators (mainly because ).
Lemma A.3.
Given two monotone operators :
•
If is maximally monotone: ,
•
If is maximally monotone: .
Proof.
Let us do the proof for increasing operators. Considering , we know that there exists such that . If , then we can immediately conclude that and therefore that . If , then Proposition 1.10 implies that either or . Only the first case is possible since, by hypothesis, . One can then conclude with Theorem 1.12 that , which again implies .
The same kind of argument works for decreasing operators (this time ). The second item is shown in a symmetric way.
∎
As will be our goal in Theorem A.16 below, note that equality between maximal operators can be deduced from a two-sided inequality.
Proposition A.4.
Given two maximally monotone operators with the same monotonicity one has the equivalence:
It is a simple consequence of the following lemma and the fact that the values of maximally monotone operators are all intervals of (see Proposition 1.10).
Lemma A.5.
Given two closed intervals , if we assume that , , and , then .
Proof.
Let us introduce , such that and , then:
and we can conclude that and .
∎
Let us end with a last property on the relation between operators that allows us to combine it with the inverse operation.
Lemma A.6.
Given two monotone operators
Proof.
Let us only prove the result for . Given (if or , one always has ), for all we know from the hypothesis that there exists such that . We know from Lemma 1.2 that is decreasing, then , and since , we have exactly shown that .
∎
Definition A.7.
Given increasing operators , we introduce the minimum of and through the following definition of its graph:
In the same way, we define the maximum of operators as follows:
To define the minimum and the maximum of decreasing operators, one needs to interchange the expressions “” and “”.
Note that, as expected, with these definitions, one has for any monotone operators with the same monotonicity:
Proposition A.8.
Given maximally monotone operators :
If are increasing:
If are decreasing:
one needs to interchange the signs “” with “” to obtain the domain of the maximum.
The first identity can be rewritten:
(33)
Proof.
Let us do the proof for the minimum and for increasing mappings. The inclusion is obvious; now, considering , let us show that . Given , if , we know from Proposition 1.10 that is convex and therefore either satisfies or . If the latter case holds, then Theorem 1.12 allows us to state that ( is increasing and not locally bounded on the left boundary of its domain), thus for any , there exists some such that , which is impossible by definition of the minimum of operators. Therefore and, as a consequence, .
Let us now consider . For all :
•
either , then Proposition 1.10 implies since and , ,
•
or , one can consider and automatically has for any , .
Let us then consider . By construction and , , that exactly means that .
∎
Lemma A.9.
Given maximally monotone operators (with same monotonicity), and :
The proof of this lemma relies on the following elementary result that we provide without proof.
Let us do the proof for the minimum and increasing mappings. We know from Proposition A.8, and more precisely from identity (33), that . Besides, Lemma A.10 allows us to write:
since for all , . As a consequence, .
Let us then consider and ; if there exists such that , then, by definition of the minimum, for all , which implies (in other words, ). The same proof clearly works for decreasing mappings, considering this time all , and it works the same way for the maximum.
To show that , one cannot use Proposition A.2, since we have not yet proven that is maximally monotone.
Given , and , if there exists such that , then, considering any , if , by definition of the minimum, we know that and for all , , that directly implies by maximality of that , and therefore that .
∎
To show maximality properties of the maximum and minimum, we will employ the simple characterization on (resp. on ) given by Theorem 1.8 for increasing (resp. decreasing) operators. Let us first study the range of the minimum and the maximum.
Proposition A.11.
Given maximally monotone operators :
and one can replace the sign “” with “” to obtain the range of the maximum.
Proof.
By definition of the minimum, one obviously has .
Besides, we know from the first result of Lemma A.3 combined with Lemma A.9 that for all , , which finally implies .
The other inclusion is more laborious since it is closely related to the maximality of that will be proven in Proposition A.14. Let us assume that .
The set:
is non-empty since and closed thanks to Proposition 1.10. Let us now consider two cases:
•
If , then one can consider such that there exists a sequence satisfying:
The operator being maximally increasing, it is not hard to see that it is constant on , equal to . For all , there exists such that ; otherwise, that would mean that . Therefore, noting , we have, for all , . We have exactly shown that .
•
If , let us then consider .
Given , if , then Proposition 1.10 would imply either or . Only the first case is possible, and it would imply in particular that for all , . If , then ; otherwise, there would exist such that (since and is convex), which contradicts the definition of . Then for all , and, as before, .
∎
We want to show that those two operations are stable on the set of maximally monotone operators.
The monotonicity is a simple consequence of the definition.
Proposition A.12.
Given increasing (resp. decreasing) monotone operators , and are both increasing (resp. decreasing) monotone operators.
Lemma A.13.
Given increasing (resp. decreasing) monotone operators :
Proposition A.14.
Given increasing (resp. decreasing) maximally monotone operators , and are both increasing (resp. decreasing) maximally monotone operators.
Proof.
Theorem 1.8 allows us to state that for all , and Proposition A.11 combined with Lemma A.13 provides:
which allows us to conclude the proof thanks to Theorem 1.8 (we already know from Proposition A.12 that is increasing monotone).
∎
One needs a last result to prove the main result of the Section.
Proposition A.15.
Given maximally monotone operators with the same monotonicity:
Proof.
We just prove the results for maximally increasing operators and for the minimum. The converse implication is obvious thanks to Lemma A.9 (). Let us then assume that . Looking at the domain, that implies:
That directly implies thanks to Lemma A.10 and Proposition A.8 that:
Now considering any and , we know that for all .
Then for any and , (inequalities still hold when the sets are empty), and therefore if , then ; if , that means that and therefore .
∎
Theorem A.16.
Given maximally monotone operators , if are decreasing:
if are increasing:
Proof.
Let us only show the first identity (for decreasing operators). We know from Lemma A.9 that for all , ; therefore Lemma A.6 allows us to conclude that , and we can then deduce from Proposition A.15 that:
The same inequality is true for replacing , which then provides ,
which implies (again with Lemma A.6):
One can then conclude with Proposition A.4 (all the mappings are maximally monotone thanks to Example 1.9 and Proposition A.14).
∎
It is now immediate to prove:
Proposition A.17.
Given probabilistic operators :
if we assume in addition that :
Proof.
It is a simple consequence of Theorem A.16 ( are all maximally decreasing) combined with Lemma A.6 and the fact that:
and that:
when .
∎
References
[1]
Radosław Adamczak.
A note on the Hanson-Wright inequality for random vectors with dependencies.
Electronic Communications in Probability, 20(72):1–13, 2015.
[2]
Radosław Adamczak, Witold Bednorz, and Paweł Wolff.
Moment estimates implied by modified log-Sobolev inequalities.
ESAIM: Probability and Statistics, 21:467–494, 2017.
[3]
Radosław Adamczak and Paweł Wolff.
Concentration inequalities for non-Lipschitz functions with bounded derivatives of higher order.
Probability Theory and Related Fields, 162:531–586, 2015.
[4]
W Anderson, T Morley, and G Trapp.
Fenchel duality of nonlinear networks.
IEEE Transactions on Circuits and Systems, 25(9):762–765,
1978.
[5]
William N Anderson Jr and Richard James Duffin.
Series and parallel addition of matrices.
Journal of Mathematical Analysis and Applications,
26(3):576–594, 1969.
[6]
Miguel A Arcones and Evarist Giné.
On decoupling, series expansions, and tail behavior of chaos
processes.
Journal of Theoretical Probability, 6(1):101–122, 1993.
[7]
HH Bauschke and PL Combettes.
Convex Analysis and Monotone Operator Theory in Hilbert Spaces.
CMS books in mathematics, Springer, 2011.
[9]
Christer Borell.
On the Taylor series of a Wiener polynomial.
In Seminar Notes on multiple stochastic integration, polynomial
chaos and their integration. Case Western Reserve University, Cleveland,
1984.
[10]
Dao Ha Fuk.
Certain probabilistic inequalities for martingales.
Sibirskii Matematicheskii Zhurnal, 14(1):185–193, 1973.
[11]
Friedrich Götze, Holger Sambale, and Arthur Sinulis.
Concentration inequalities for bounded functionals via log-Sobolev-type inequalities.
Journal of Theoretical Probability, 34:1623–1652, 2021.
[12]
Friedrich Götze, Holger Sambale, and Arthur Sinulis.
Concentration inequalities for polynomials in
-sub-exponential random variables.
2021.
[13]
Nathael Gozlan, Cyril Roberto, Paul-Marie Samson, Yan Shu, and Prasad Tetali.
Characterization of a class of weak transport-entropy inequalities on
the line.
2018.
[14]
Nathael Gozlan, Cyril Roberto, Paul-Marie Samson, and Prasad Tetali.
Kantorovich duality for general transport costs and applications.
Journal of Functional Analysis, 273(11):3327–3405, 2017.
[15]
Alice Guionnet.
Large random matrices, volume 36.
Springer Science & Business Media, 2009.
[16]
Rafał Latała.
Estimates of moments and tails of Gaussian chaoses.
2006.
[17]
Michel Ledoux.
A note on large deviations for Wiener chaos.
Séminaire de probabilités de Strasbourg, 22:249–259,
1988.
[18]
Michel Ledoux.
The concentration of measure phenomenon. Ed. by Peter Landweber et al. Vol. 89. Mathematical Surveys and Monographs. Providence, Rhode Island: American Mathematical Society, 2005.
[19]
R Lochowski.
Moment and tail estimates for multidimensional chaoses generated by
symmetric random variables with logarithmically concave tails.
Banach Center Publications, 72:161, 2006.
[20]
Elizabeth S Meckes and Mark W Meckes.
Concentration and convergence rates for spectral measures of random
matrices.
Probability Theory and Related Fields, 156:145–164, 2013.
[21]
Sergey V Nagaev.
Large deviations of sums of independent random variables.
The Annals of Probability, pages 745–789, 1979.
[22]
R Tyrrell Rockafellar and Johannes O Royset.
Random variables, monotone relations, and convex analysis.
Mathematical Programming, 148:297–331, 2014.
[23]
Michel Talagrand.
Concentration of measure and isoperimetric inequalities in product
spaces.
Publications mathématiques de l’IHÉS, 104:905–909, 1995.
[24]
Michel Talagrand.
A new look at independence.
The Annals of Probability, pages 1–34, 1996.
[25]
Van H Vu.
Concentration of non-Lipschitz functions and applications.
Random Structures & Algorithms, 20(3):262–316, 2002.