Monotone single index models with general non-symmetric design
1 Introduction
Single index models (SIMs) provide a semi-parametric extension of linear models, where a scalar response variable $y \in \mathbb{R}$ is related to a predictor vector $x \in \mathbb{R}^{d}$ via
(1.1)  $y = f(x^{\top}\beta^{*}) + \epsilon$
for some unknown parameter vector $\beta^{*} \in \mathbb{R}^{d}$ and unknown link function $f$. In this article, we consider a shape-constrained single index model (1.1) in which $f$ belongs to a class of monotonic (but not necessarily smooth) functions and the parameter $\beta^{*}$ is high-dimensional and sparse. Prior theoretical guarantees for the high-dimensional single index model rely on the distribution of $x$ being known and symmetric (see e.g. [Plan2013-ew, Yang2017-cl]). In a number of settings, however, non-symmetric mixture distributions arise and these guarantees do not apply. In this paper, we develop a scalable algorithm with theoretical guarantees for the high-dimensional single index model under more general design assumptions that include non-symmetric distributions.
We consider a general setting in which only mild assumptions are made on the design matrix $X$, the error $\epsilon$, and their interaction. In particular, $X$ can be deterministic or random; if $X$ is random, its distribution can be asymmetric; and the distribution of $\epsilon$ may depend on $X$. In addition, we assume that the unknown vector $\beta^{*}$ is sparse and high-dimensional. As $f$ is not assumed to be known a priori, the model (1.1) provides a more flexible specification than a parametric model, while still circumventing the curse of dimensionality by avoiding estimation of a $d$-dimensional non-parametric function. Two estimation problems arise in the single index model framework: estimation of the unknown index vector $\beta^{*}$ and estimation of the one-dimensional regression function $f$. Both are addressed in this paper.
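To make the setup concrete, the following minimal sketch generates data from the model (1.1); the particular link, design distribution, and noise level are illustrative choices of ours, not assumptions of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 1000, 500, 5

beta_star = np.zeros(d)
beta_star[:s] = 1.0 / np.sqrt(s)         # s-sparse unit vector

X = rng.standard_normal((n, d)) ** 2     # asymmetric (chi-squared) design is allowed
f = np.tanh                              # monotone Lipschitz link (illustrative)
eps = 0.1 * rng.standard_normal(n)       # noise; in general it may depend on X
y = f(X @ beta_star) + eps
```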
To begin, we provide three motivating examples for the single index model in (1.1). Each example reduces to the problem of estimating $\beta^{*}$ or $f$, or both, under the model (1.1).
1.1 Examples
Mis-specified generalized linear models (GLMs)
The generalized linear model (GLM) is a parametric extension of the linear model, which includes many popular models such as logistic and Poisson regression. GLMs relate the conditional expectation of $y$ to $x$ via a link function $g$, such that $\mathbb{E}[y \mid x] = g(x^{\top}\beta^{*})$, where the link function $g$ is assumed to be known. The parameter of interest $\beta^{*}$ is usually estimated via an iterative procedure that uses the knowledge of $g$; therefore, when $g$ is mis-specified, the output is inevitably biased for $\beta^{*}$, even in an asymptotic sense. Using the single index model framework, we can produce an estimate of $\beta^{*}$ without specifying a particular link function.
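As a toy illustration (ours, not taken from the paper), the simulation below generates data from a probit model and fits a logistic GLM, a mis-specified link; the fitted coefficients differ systematically from $\beta^{*}$, whereas the single index framework avoids committing to either link.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 50_000
beta_star = np.array([1.0, -0.5, 0.25])

X = rng.standard_normal((n, 3))
y = rng.binomial(1, norm.cdf(X @ beta_star))   # true link: probit

# Fit with a logistic link; large C makes the penalty negligible.
fit = LogisticRegression(C=1e6).fit(X, y)
print(beta_star)           # [ 1.   -0.5   0.25]
print(fit.coef_.ravel())   # inflated by roughly the logit/probit scale factor
```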
Classification with corrupted labels
Another interesting example involves classification with corrupted labels: instead of observing a sample $(x, y)$ with $y \in \{0, 1\}$, we observe the pair $(x, \tilde{y})$, where $\tilde{y}$ is a systematically corrupted version of $y$. Here $y$ is related to $x$ via $\mathbb{P}(y = 1 \mid x) = f(x^{\top}\beta^{*})$ for some unknown parameter vector $\beta^{*}$ and a monotonic function $f$, which is potentially unknown.
We assume the corrupted label $\tilde{y}$ is independent of $x$ given $y$, which corresponds to a missing-at-random assumption. The corruption structure is completely specified by a confounding matrix $Q$ with entries $q_{ab} = \mathbb{P}(\tilde{y} = a \mid y = b)$ for $a, b \in \{0, 1\}$, whose off-diagonal terms are smaller than the diagonal terms. Under these assumptions, $\mathbb{E}[\tilde{y} \mid x]$ is given by
(1.2)  $\mathbb{E}[\tilde{y} \mid x] = q_{10} + (q_{11} - q_{10}) f(x^{\top}\beta^{*}) = h \circ f(x^{\top}\beta^{*}),$
where we let $h(u) = q_{10} + (q_{11} - q_{10})\,u$. Note that $h \circ f$ is monotonic, since $f$ is monotonic and we assume that $q_{11} > q_{10}$. In many practical situations the confounding matrix $Q$ is unknown, which limits the parametric approach even under the assumption of a specific $f$, such as the sigmoid function $f(u) = 1/(1 + e^{-u})$.
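The corruption mechanism and the identity (1.2) can be verified with a short Monte Carlo sketch; the values of $q_{10}$ and $q_{11}$ and the logistic choice of $f$ below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
q10, q11 = 0.1, 0.8      # P(ytilde = 1 | y = 0) and P(ytilde = 1 | y = 1), q11 > q10
f = lambda t: 1.0 / (1.0 + np.exp(-t))    # illustrative monotone link

n = 200_000
u = rng.standard_normal(n)                # stand-in for x^T beta*
y = rng.binomial(1, f(u))                 # clean labels
ytilde = np.where(y == 1,
                  rng.binomial(1, q11, size=n),
                  rng.binomial(1, q10, size=n))

# Empirical check of (1.2): E[ytilde | u near 0] should match
# h(f(0)) = q10 + (q11 - q10) * f(0).
mask = np.abs(u) < 0.05
print(ytilde[mask].mean(), q10 + (q11 - q10) * f(0.0))
```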
Classification with presence-only data and unknown prevalence
In various applications such as ecology, text mining, and biochemistry, negative samples are often hard or even impossible to obtain. In Positive-Unlabeled (PU) learning, we use positive and unlabeled samples, where positive samples are taken from the sub-population whose responses are known to be positive, and unlabeled samples are random samples from the population.
More concretely, let $(x, y)$ be a random sample from the population. As in Example 2, we assume $y$ is related to $x$ via $\mathbb{P}(y = 1 \mid x) = f(x^{\top}\beta^{*})$ for some unknown parameter vector $\beta^{*}$ and a monotonic function $f$. Instead of $(x, y)$, we observe the pair $(x, \tilde{y})$, which is defined as follows. If a sample is labeled ($\tilde{y} = 1$), the associated response is positive ($y = 1$) and the sample is taken from the positive sub-population, i.e. $x \sim \mathbb{P}(x \mid y = 1)$. On the contrary, if a sample is unlabeled ($\tilde{y} = 0$), we know that the sample is randomly drawn from the population, i.e. $x \sim \mathbb{P}(x)$, but the associated outcome $y$ is unknown.
Regarding $\tilde{y}$ as a corrupted version of $y$, the PU problem can be viewed as a special case of Example 2 in which only positive samples are corrupted. In fact, the confounding matrix $Q$ is given in the following form:
(1.3)  $Q = \begin{pmatrix} 1 & \frac{\pi_1}{\pi + \pi_1} \\ 0 & \frac{\pi}{\pi + \pi_1} \end{pmatrix}$
where $\pi$ is the ratio of the number of positive to unlabeled samples and $\pi_1 = \mathbb{P}(y = 1)$ is the population prevalence. Therefore, following the same lines as in (1.2) and noting that $q_{10} = 0$, we have $\mathbb{E}[\tilde{y} \mid x] = h \circ f(x^{\top}\beta^{*})$ for $h$ defined in Example 2 with $Q$ defined in (1.3).
We note that the conditional mean function in (1.1) is then known up to the contamination proportion $q_{11}$. If this proportion is known from an alternative source, we may use a parametric approach to estimate $\beta^{*}$. However, if the contamination proportion is unknown, reliable estimation of this proportion is very challenging and heavily relies on model specification ([Hastie2013-wq]). The single index model framework translates the problem of an unknown contamination proportion into that of an unknown link function $f$, and consequently provides a framework for addressing the PU-learning problem with unknown prevalence.
Another interesting aspect of the PU problem is that the distribution of the observed design $x$ cannot be symmetric, regardless of the distribution of the original design. This is because the observed $x$ follows a mixture of the two distributions $\mathbb{P}(x \mid y = 1)$ and $\mathbb{P}(x)$; by the monotonicity assumption on $f$, this mixture cannot be symmetric, which precludes existing methods relying on the symmetric design assumption.
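The following sketch (our illustration, with a one-dimensional design and an arbitrary logistic link) shows this effect: the population design is symmetric, but the observed design, a mixture of draws from $\mathbb{P}(x \mid y = 1)$ and $\mathbb{P}(x)$, is not.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda t: 1.0 / (1.0 + np.exp(-t))    # illustrative monotone link

N, n_l, n_u = 1_000_000, 50_000, 50_000
x_pop = rng.standard_normal(N)            # symmetric population design (d = 1)
y = rng.binomial(1, f(4.0 * x_pop))

x_labeled = rng.choice(x_pop[y == 1], size=n_l)   # draws from P(x | y = 1)
x_unlabeled = rng.choice(x_pop, size=n_u)         # draws from P(x)
x_obs = np.concatenate([x_labeled, x_unlabeled])

# A symmetric distribution has P(x > 1) = P(x < -1).
print((x_pop > 1).mean(), (x_pop < -1).mean())   # approximately equal
print((x_obs > 1).mean(), (x_obs < -1).mean())   # visibly unequal: asymmetric
```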
1.2 Prior work
The single index model has been an active area of research in statistics for a number of decades, since its introduction in econometrics and statistics [Ichimura1993-gx, Horowitz1996-jv, Klein1993-is]. There is an extensive literature on estimation of the index vector $\beta^{*}$, which can be broadly classified into two categories: methods which directly estimate the index vector while avoiding the “nuisance” infinite-dimensional function, and methods which involve estimation of both the infinite-dimensional function and the index vector.
Methods of the first type usually require strong design assumptions, such as a Gaussian or an elliptically symmetric distribution. [Duan1991-nk] proposed an estimator of $\beta^{*}$ using sliced inverse regression in the context of sufficient dimension reduction; $\sqrt{n}$-consistency and asymptotic normality of the estimator are established in the low-dimensional regime, under the elliptically symmetric design assumption. This result is extended in [Lin2018-rk] to the high-dimensional regime under similar design assumptions. [Plan2013-ew] studied sparse recovery of the index vector when $x$ is Gaussian, and established a non-asymptotic mean-squared error bound that scales with the sparsity level $s$, i.e. $\|\beta^{*}\|_{0} \le s$. [Ai2014-lo] studied the error bound of the same estimator when the design is not Gaussian and showed that the estimator is biased, with a bias that depends on the size of $\|\beta^{*}\|_{\infty}$. The estimator proposed in [Plan2013-ew] has been generalized in several directions: [Yang2017-cl] propose an estimator based on the score function of the covariate distribution, which relies on prior knowledge of that distribution, and [Chen2017-py] proposed estimators based on U-statistics, which include the estimator of [Plan2013-ew] as a special case under the standard Gaussian design assumption.
Methods of the second type usually do not require strong design assumptions, and our method falls into this category. One popular approach is via M-estimation or estimating equations. Under smoothness assumptions on $f$, methods for link function estimation have been proposed via kernel estimation ([Ichimura1993-gx, Hardle1993-oj, Klein1993-is, Delecroix2003-fv, Cui2011-go]), local polynomials ([Carroll1997-iz, Chiou1998-zw]), or splines ([Yu2002-hq, Kuchibhotla2016-ef]). An estimator for the index vector is then obtained by minimizing a criterion function such as the squared loss ([Ichimura1993-gx, Hardle1993-oj, Yu2002-hq]) or by solving an estimating equation ([Chiou1998-zw, Cui2011-go]). $\sqrt{n}$-consistency and asymptotic normality were established for these estimators in the low-dimensional regime. Another approach is the average derivative method ([Powell1989-je, Hristache2001-om]), which takes advantage of the fact that $\mathbb{E}[\nabla_{x}\, \mathbb{E}(y \mid x)] = \mathbb{E}[f'(x^{\top}\beta^{*})]\, \beta^{*} \propto \beta^{*}$; an estimator is then obtained through non-parametric estimation of this derivative, also under smoothness assumptions on $f$. There are also studies of the single index model in the high-dimensional regime, using a PAC-Bayesian approach ([Alquier2013-ij]) or incorporating sparsity-inducing penalties ([Wang2012-mb, Radchenko2015-wa]).
There has been an increasing number of studies in which the link function is restricted to have a particular shape, such as monotonicity, but is not necessarily smooth. [Kalai2009-ew] and [Kakade2011-yk] investigated the single index problem in the low-dimensional regime under a monotonic and Lipschitz-continuous link assumption, and proposed iterative perceptron-type algorithms with prediction error guarantees. In particular, isotonic regression is used for the estimation of $f$, and $\beta$ is then updated based on the estimated link. [Ganti2017-dq] extended [Kalai2009-ew, Kakade2011-yk] to the high-dimensional setting by incorporating a projection step onto a sparse subset of the parameter space. They empirically demonstrated good predictive performance, but no theoretical guarantees were provided. [Balabdaoui2016-sd] study the least squares estimator under a monotonicity constraint and establish its consistency. Also, [Balabdaoui2017-at] propose a consistent estimator for the index vector $\beta^{*}$ based on solving a score equation.
Our work focuses on estimation of the index vector $\beta^{*}$ in the high-dimensional regime, where the unknown link function $f$ is assumed to be Lipschitz-continuous and monotonic. We provide an efficient algorithm via iterative hard-thresholding, together with non-asymptotic error bounds for both the parameter vector and the mean function.
1.3 Our contributions
Our major contributions can be summarized as follows:
• Develop a scalable projection-based iterative approach (see the sketch following this list). At each iteration $t$, we first perform isotonic regression of $y$ on $X\beta^{(t)}$ using the PAVA algorithm, then take an orthogonal gradient step to update $\beta^{(t)}$, and finally enforce sparsity using a thresholding operator. The algorithm stops once the mean-squared error drops below a threshold.
• Provide theoretical guarantees for our algorithm, both for the parameter vector $\beta^{*}$ and for the mean function, with error bounds that scale as $\left(\frac{s \log d}{n}\right)^{2/3}$; this is our main contribution. Note that the minimax rate for low-dimensional isotonic regression is $n^{-2/3}$, and our algorithm achieves this rate up to the $s \log d$ factor arising from high-dimensional estimation.
• Since our goal is largely to estimate the parameter vector $\beta^{*}$, we also show that, under additional assumptions including a $\beta$-min condition, a one-step correction achieves the parametric low-dimensional rate. The one-step correction is motivated by the earlier work of [Balabdaoui2017-at], which proves $\sqrt{n}$ rates in the low-dimensional case.
• Finally, we support our theoretical results with a simulation study that corroborates the derived rates and shows that our algorithm compares favorably to existing approaches.
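The sketch below illustrates the iterative procedure from the first bullet. The initialization, the plain Isotron-style surrogate gradient, the fixed step size, and the stopping rule are our illustrative choices to make the steps concrete, not the paper's exact specification.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def hard_threshold(b, s):
    """Keep the s largest-magnitude entries of b; zero out the rest."""
    out = np.zeros_like(b)
    keep = np.argsort(np.abs(b))[-s:]
    out[keep] = b[keep]
    return out

def monotone_sim_iht(X, y, s, step=1.0, max_iter=100, tol=1e-6):
    n, d = X.shape
    beta = hard_threshold(X.T @ y / n, s)       # crude initialization
    beta /= np.linalg.norm(beta)
    pava = IsotonicRegression(increasing=True, out_of_bounds="clip")
    for _ in range(max_iter):
        z = X @ beta
        fhat = pava.fit(z, y).predict(z)        # isotonic fit of y on X beta (PAVA)
        grad = X.T @ (fhat - y) / n             # surrogate gradient of the squared loss
        beta_new = hard_threshold(beta - step * grad, s)
        beta_new /= np.linalg.norm(beta_new)    # project back onto the unit sphere
        if np.mean((y - fhat) ** 2) < tol or np.allclose(beta, beta_new):
            return beta_new, pava
        beta = beta_new
    return beta, pava
```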
2 Problem setup and algorithm
Let $x_1, \dots, x_n \in \mathbb{R}^{d}$ be the feature vectors and $y_1, \dots, y_n \in \mathbb{R}$ the response values, which we assume satisfy the model
(2.1)  $y_i = f^{*}(x_i^{\top}\beta^{*}) + \epsilon_i, \qquad i = 1, \dots, n,$
where $f^{*}$ is $L$-Lipschitz and monotone non-decreasing, $\beta^{*} \in \mathbb{R}^{d}$ is an $s$-sparse unit vector, and $\epsilon_i$ denotes the noise term. We write $X \in \mathbb{R}^{n \times d}$ to denote the matrix with rows $x_1, \dots, x_n$, and $y$ and $\epsilon$ to denote the vectors with entries $y_i$ and $\epsilon_i$, respectively. For any function $g : \mathbb{R} \to \mathbb{R}$ and any $v \in \mathbb{R}^{n}$, we write $g(v)$ to mean that $g$ is applied element-wise to the vector $v$, i.e. $g(v) = (g(v_1), \dots, g(v_n))$.