We prove the theorem in Section 2.3. Before doing so, we provide some
commentary, first discussing subset regression estimators, as in the
paper [4, Sec. 4], then discussing more general smoothers
and estimators [2]. As we shall see, the assumption that the
matrices have bounded operator norm entails little loss of generality [2, Sec. 2.5].
2.1 Subset regression and projection estimators
We first revisit the setting that Tibshirani and Rosset [4, Sec. 4] consider,
in which each matrix is an orthogonal projector. As a motivating
example, in linear regression, we may take , where denotes the submatrix of the design whose columns indexes. In this case, as , we
have and .
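To make the projector structure concrete, here is a minimal numerical sketch, assuming the standard choice in which the smoother for a subset is the orthogonal projector onto the span of the corresponding columns of the design (the function and variable names are ours, for illustration):

```python
import numpy as np

def projection_smoother(X, S):
    """Orthogonal projector onto the column span of X[:, S] (sketch).

    For a full-column-rank submatrix this is idempotent, and its trace equals
    |S|, the degrees of freedom that enter the SURE criterion.
    """
    XS = X[:, list(S)]
    return XS @ np.linalg.pinv(XS)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
H = projection_smoother(X, [0, 2, 5])
print(np.allclose(H @ H, H), np.isclose(np.trace(H), 3))  # True True
```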
First consider the case that there is satisfying , that is, the true model belongs to the set . Then
because minimizes , we have
Whenever , we have the simplified bound
An alternative way to look at the result is by replacing
with . In this case, combining [4, Thm. 1]
with Theorem 1, we obtain
for a (numerical) constant .
In particular, if , then we obtain that
there exists a numerical constant such that for all ,
(5)
Additional perspective comes from considering some of the bounds that
Tibshirani and Rosset [4, Sec. 4] develop. While we cannot recover
identical results because of the correction
term (4) in the excess degrees of freedom, we can
recover analogues. As one example, consider their Theorem 2. It assumes that
in the sequence model (1), and
(implicitly letting and depend on ) that
. Then
inequality (5) immediately gives the oracle
inequality , and Theorem 1 moreover
shows that
In brief, Tibshirani and Rosset’s major conclusions on subset
regression estimators—and projection estimators more generally—hold, but
with a few tweaks.
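To illustrate the flavor of these conclusions, the following sketch simulates SURE-tuned selection among a handful of projection estimators in a toy sequence model; the sparse mean, the nested candidate subsets, and the noise level are all illustrative choices of ours, not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 200, 1.0
mu = np.concatenate([np.full(5, 3.0), np.zeros(n - 5)])   # sparse true mean (toy choice)
subsets = [range(k) for k in (1, 5, 20, 80, n)]           # nested candidate coordinate subsets

def risk_and_sure(S, y):
    """In-sample risk (needs mu) and SURE (does not) for projection onto coordinates S."""
    mu_hat = np.zeros(n)
    idx = np.fromiter(S, dtype=int)
    mu_hat[idx] = y[idx]                                   # projection onto the selected coordinates
    risk = np.sum((mu_hat - mu) ** 2)
    sure = np.sum((y - mu_hat) ** 2) + 2 * sigma**2 * len(idx) - n * sigma**2
    return risk, sure

y = mu + sigma * rng.standard_normal(n)
risks, sures = zip(*(risk_and_sure(S, y) for S in subsets))
print("SURE picks subset", int(np.argmin(sures)), "| oracle subset", int(np.argmin(risks)))
```

In runs like this one, SURE typically lands on the same subset as the oracle; the gap between the two risks is the kind of quantity the oracle inequalities above control.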
2.2 Examples of more general smoothers
More generally, the matrices defining
the estimators may be
arbitrary. In common cases, however, they are reasonably
well-behaved. Take as motivation nonparametric regression, where , and we let for and , so that measures the
in-sample risk of an estimator of . Here,
standard estimators include kernel regression and
local averaging [2, 6], both of which we touch on.
First consider kernel ridge regression (KRR). For a reproducing kernel , the (PSD) Gram matrix has entries . Then for , we define , which is symmetric and positive semidefinite, and
satisfies . Assume the collection is finite and define the effective dimension , yielding that . Then
SURE-tuned KRR, with regularization and
, satisfies
where as usual, .
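A small sketch of this construction follows; the n·lam normalization of the ridge term, the Gaussian kernel, and the grid of regularization values are assumptions made for illustration rather than the paper's exact choices:

```python
import numpy as np

def krr_smoother(K, lam):
    """KRR smoother A_lam = K (K + n*lam*I)^{-1}; symmetric PSD with operator norm < 1."""
    n = K.shape[0]
    return K @ np.linalg.inv(K + n * lam * np.eye(n))

def sure(y, A, sigma2):
    """SURE criterion for the linear smoother A with known noise variance sigma2."""
    resid = y - A @ y
    return resid @ resid + 2 * sigma2 * np.trace(A) - len(y) * sigma2

rng = np.random.default_rng(0)
n, sigma2 = 100, 0.25
x = np.sort(rng.uniform(size=n))
y = np.sin(2 * np.pi * x) + np.sqrt(sigma2) * rng.standard_normal(n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.05)        # Gaussian-kernel Gram matrix

lams = np.logspace(-6, 0, 25)                             # finite collection of tuning values
smoothers = [krr_smoother(K, lam) for lam in lams]
print("max operator norm:", max(np.linalg.norm(A, 2) for A in smoothers))   # stays below 1
print("effective dimension at smallest lam:", np.trace(smoothers[0]))
print("SURE-tuned lam:", lams[int(np.argmin([sure(y, A, sigma2) for A in smoothers]))])
```

The trace of each smoother is the effective dimension referenced in the bound above.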
A second example arises from -nearest-neighbor (-nn) estimators. We
take to indicate the number of nearest neighbors to
average, and for let denote the indices of the
nearest neighbors to in , so
are the indices for which is a neighbor of . Then the
matrix satisfies , and we claim that .
Indeed,
and as is elementwise nonnegative,
the Gershgorin circle theorem guarantees
as for each .
Additionally, we have
.
The normalized risk of the -nn estimator is then ,
and certainly whenever .
Under a few restrictions, we can therefore obtain an oracle-type inequality:
assume the points are regular enough that
for .
Then the SURE-tuned -nearest-neighbor
estimator
satisfies the bound
on its excess degrees of freedom.
Via inequality (2), this implies the oracle
inequality
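The following sketch checks these structural claims numerically; the convention that each point counts itself among its own k nearest neighbors is an assumption here (it makes the trace exactly n/k):

```python
import numpy as np

def knn_smoother(x, k):
    """k-nearest-neighbor smoother: S[i, j] = 1/k if j is among the k nearest
    neighbors of i (including i itself), so each row sums to one."""
    n = len(x)
    dist = np.abs(x[:, None] - x[None, :])
    nbrs = np.argsort(dist, axis=1)[:, :k]                 # indices of the k nearest neighbors
    S = np.zeros((n, n))
    np.put_along_axis(S, nbrs, 1.0 / k, axis=1)
    return S

x = np.sort(np.random.default_rng(2).uniform(size=40))
S = knn_smoother(x, k=5)
print("trace:", np.trace(S))                               # = n/k = 8 here
# Schur/Gershgorin-type bound: ||S||_op^2 <= ||S||_1 * ||S||_inf,
# i.e. (max column sum) * (max row sum) for a nonnegative matrix
print(np.linalg.norm(S, 2) ** 2 <= S.sum(axis=0).max() * S.sum(axis=1).max() + 1e-12)
```

Roughly, if no point is among the nearest neighbors of too many other points, the column sums (and hence the operator norm) stay bounded, which is the role of the regularity assumption above.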
2.3 Proof of Theorem 1
Our proof strategy is familiar from high-dimensional
statistics [cf. 6, Chs. 7 & 9]: we develop a basic
inequality relating the risk of to that of , then
apply a peeling argument [5] to bound the probability
that relative error bounds deviate far from their expectations.
Throughout, we let denote a universal (numerical) constant
whose value may change from line to line.
To prove the theorem, we first recall a
definition and state two auxiliary lemmas.
Definition 2.1.
A mean-zero random variable $X$ is $(\tau^2, b)$-sub-exponential
if $\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2 \tau^2 / 2}$ for
$|\lambda| \le 1/b$. If $b = 0$ (so the bound holds for all $\lambda$), then $X$ is $\tau^2$-sub-Gaussian.
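For orientation, the standard Chernoff argument converts this moment-generating-function condition into a tail bound; a sketch in generic parameters $(\tau^2, b)$, whose constants are not necessarily those used later:
\[
\mathbb{P}(X \ge t)
\le \inf_{0 < \lambda \le 1/b} e^{-\lambda t}\,\mathbb{E}\big[e^{\lambda X}\big]
\le \inf_{0 < \lambda \le 1/b} \exp\Big(\frac{\lambda^2 \tau^2}{2} - \lambda t\Big)
\le \exp\Big(-\min\Big\{\frac{t^2}{2\tau^2}, \frac{t}{2b}\Big\}\Big),
\]
where the last step takes $\lambda = t/\tau^2$ when $t \le \tau^2/b$ and $\lambda = 1/b$ otherwise.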
Lemma 2.1.
Let , , be -sub-Gaussian and . Then
Let , , be -sub-exponential
and . Then
The second statement of the lemma generalizes the first without specifying
constants. We also use that
quadratic forms of Gaussian random vectors are sub-exponential.
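For comparison with the first statement, the familiar route to a sub-Gaussian maxima bound goes through Jensen's inequality and the moment generating function; a sketch for $\sigma^2$-sub-Gaussian $X_1, \dots, X_m$ (the proof in Section 2.4 instead integrates tail bounds, and its constants may differ):
\[
\exp\Big(\lambda\,\mathbb{E}\Big[\max_{i \le m} X_i\Big]\Big)
\le \mathbb{E}\Big[e^{\lambda \max_{i \le m} X_i}\Big]
\le \sum_{i=1}^m \mathbb{E}\big[e^{\lambda X_i}\big]
\le m\, e^{\lambda^2 \sigma^2 / 2},
\]
so that $\mathbb{E}[\max_{i \le m} X_i] \le \frac{\log m}{\lambda} + \frac{\lambda \sigma^2}{2}$, and taking $\lambda = \sqrt{2 \log m}/\sigma$ gives $\mathbb{E}[\max_{i \le m} X_i] \le \sigma \sqrt{2 \log m}$.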
Lemma 2.2.
Let and .
Then is -sub-exponential.
Additionally, is -sub-exponential.
We defer the proofs of Lemmas 2.1 and
2.2 to Sections 2.4 and
2.5, respectively.
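Before turning to the main argument, it may help to record the computation underlying Lemma 2.2 in generic notation: for $Z \sim \mathsf{N}(0, I_n)$ and a symmetric matrix $A$ with eigenvalues $a_1, \dots, a_n$ (a sketch; the lemma itself may use a different scaling),
\[
\mathbb{E}\Big[\exp\big(\lambda (Z^\top A Z - \operatorname{tr} A)\big)\Big]
= \prod_{j=1}^{n} \frac{e^{-\lambda a_j}}{\sqrt{1 - 2 \lambda a_j}}
\qquad \text{whenever } 2|\lambda| \max_j |a_j| < 1,
\]
and the Taylor bound $-\tfrac{1}{2}\log(1 - 2\lambda a_j) - \lambda a_j \le 2 \lambda^2 a_j^2$, valid for $|2\lambda a_j| \le \tfrac12$, shows the right-hand side is at most $\exp\big(2\lambda^2 \sum_j a_j^2\big)$; the sub-exponential parameters are thus governed by $\sum_j a_j^2 = \operatorname{tr}(A^2)$ and $\max_j |a_j| = \|A\|_{\mathrm{op}}$.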
Recall the notation
, and for
define the centered variables
(6)
A quick calculation shows that for every ,
As minimizes the SURE criterion,
we therefore have the basic inequality
(7)
We now provide a peeling argument using
inequality (7). For each define the shell
The key result is the following lemma.
Lemma 2.3.
There exists a numerical constant such that
Proof
Note that
if , we have
and so it must be the case that at least one of
(8)
occurs. We can thus bound the probability that
by bounding the probabilities of each of the events (8);
to do this, we show that and concentrate for .
As promised, we now show that and are sub-exponential and sub-Gaussian, respectively (recall
Definition 2.1). Observe that
(9)
by assumption that
and that . In particular,
inequality (9)
shows that for each we have (and we always have ), so that
|
(10)
For each we have
while
by assumption that for all .
Lemma 2.2 and the
bounds (10) on thus
give that is -sub-exponential, so that for , a Chernoff bound
implies
valid for .
Taking
and yields that
for a numerical constant ,
We can provide a similar bound on for . It
is immediate that . Using
inequality (9) and that ,
for each we have
Similarly, and
.
Using that ,
for each we have
for some
. This yields the bound
where are
numerical constants. Returning to the events (8),
we have shown
as desired.
∎
We leverage the probability bound in Lemma 2.3
to give our final guarantees.
Define the (centered) linear and quadratic terms
so that
.
Expanding this equality, we have
As in the proof of Lemma 2.3, the
bounds (10) that and Lemma 2.2 guarantee that
is -sub-exponential. Thus we have
where the first inequality is Cauchy-Schwarz and the second inequality follows
by combining Lemma 2.1 (take ) and
Lemma 2.3.
We similarly have that is -sub-Gaussian,
yielding
Temporarily introduce the shorthand
.
Substituting these bounds into the expansion
above and naively bounding yields
where we made the substitution .
We break each of
the integrals into the
regions
and . Thus
where we have used
for and
that whenever . For
the second integral, we have
where we have used that for any . Replacing ,
Theorem 1
follows.
2.4 Proof of Lemma 2.1
For the first statement, without loss of generality by scaling, we may assume
. Then for any , we may write
where we have made the substitution , or
. Using [1, Eq. (1.5)],
which states that for ,
we obtain that
whenever . Take to achieve the result.
The second bound is more subtle. First, we note
that is -sub-exponential. We
therefore prove the bound for -sub-exponential random
variables, rescaling at the end.
Following a similar argument to that above,
note that , and so
Making the substitution , or
in the first integral, and
, or
in the second, we obtain the bounds
The first integral term always has upper bound
,
while the second has bound [1, Eq. (1.5)]
In the former case—when is small enough that
—we note that
Combining the preceding bounds therefore yields
(11)
We leverage the moment bound (11) to
give the bound on the maxima. Let be arbitrary. Then
If , take to obtain that
. Otherwise,
take , and note that
.
To obtain the result with appropriate scaling, use the mapping
to see that if the are
-sub-exponential, then
and multiply through by .
2.5 Proof of Lemma 2.2
Note that ; we prove the result
leveraging .
As is symmetric, we can write for a diagonal matrix and
orthogonal , and as we can further simplify (with no
loss of generality) by assuming is diagonal with . Then . As , we have
We use the Taylor approximation that if , then
, so
whenever .
If , an identical calculation holds when
.
This yields the first result of the lemma.
For the second, note that
, while
where we have used that is an inner product
on matrices and the Cauchy-Schwarz inequality.
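In generic notation, the inner product and Cauchy-Schwarz step referenced here is the Frobenius (trace) form:
\[
\langle A, B \rangle := \operatorname{tr}(A^\top B),
\qquad
\big|\operatorname{tr}(A^\top B)\big| \le \sqrt{\operatorname{tr}(A^\top A)}\,\sqrt{\operatorname{tr}(B^\top B)}.
\]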