
Monotone single index models with general non-symmetric design

Ran Dai1, Hyebin Song2, Rina Foygel Barber1, and Garvesh Raskutti2
1University of Chicago
2University of Wisconsin-Madison

1 Introduction

Single index models (SIMs) provide a semi-parametric extension of linear models, where a scalar response variable $Y\in\mathbb{R}$ is related to a predictor vector $X\in\mathbb{R}^{p}$ via

$$\mathbb{E}[Y|X]=g_{\star}(X^{T}u_{\star})\qquad(1.1)$$

for some unknown parameter vector $u_{\star}$ and unknown link function $g_{\star}$. In this article, we consider a shape-constrained single index model (1.1) in which $g_{\star}$ belongs to a class of monotonic (but not necessarily smooth) functions and the parameter $u_{\star}$ is high-dimensional and sparse. We develop a scalable algorithm and provide theoretical guarantees for this high-dimensional single index model. Prior theoretical guarantees for the high-dimensional single index model rely on the distribution of $X$ being known and symmetric (see e.g. [Plan2013-ew, Yang2017-cl]). However, a number of settings induce non-symmetric mixture distributions, for which these guarantees do not apply. We therefore provide a scalable algorithm with theoretical guarantees under more general design assumptions that include non-symmetric distributions.

We consider a general setting where only mild assumptions are made on the design matrix $X$, the error $\epsilon:=Y-\mathbb{E}[Y|X]$, and their interaction. In particular, $X$ can be deterministic or random; if $X$ is random, its distribution can be asymmetric; and the distribution of $\epsilon$ may depend on $X$. In addition, we assume that the unknown vector $u_{\star}$ is sparse and high-dimensional. Since $g_{\star}$ is not assumed to be known a priori, the model (1.1) provides a more flexible specification than a parametric model, while still circumventing the curse of dimensionality by avoiding $p$-dimensional non-parametric function estimation. Two estimation problems arise in the single index model framework: estimation of the unknown vector $u_{\star}$ and estimation of the one-dimensional regression function $g_{\star}$. Both are addressed in this paper.

To begin, we provide three motivating examples for the single index model in (1.1). Each example reduces to the estimation of $u_{\star}$, of $g_{\star}$, or of both under the model (1.1).

1.1 Examples

Mis-specified generalized linear models (GLMs)
The generalized linear model is a parametric extension of the linear model, which includes many popular models such as logistic and Poisson regression. GLMs relate the conditional expectation of $Y$ to $X$ via a link function $\psi$, such that $\mathbb{E}[Y|X]=\psi^{-1}(X^{T}u_{\star})$, where the link function $\psi$ is assumed to be known. The parameter of interest $u_{\star}$ is usually estimated via an iterative procedure using knowledge of $\psi$; therefore, when $\psi$ is mis-specified, the resulting estimate is inevitably biased for $u_{\star}$, even asymptotically. Using the single index model framework, we produce an estimate of $u_{\star}$ without specifying a particular link function.
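To make the bias concrete, the following minimal simulation sketch (ours, not from the paper; the skewed design, the particular monotone link, and the sample sizes are all illustrative assumptions) fits a logistic GLM to data generated with a different monotone link and a non-symmetric design, and measures how far the estimated direction is from $u_{\star}$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 50_000, 5
u_star = np.zeros(p); u_star[:2] = [0.8, -0.6]       # unit-norm true index

X = rng.exponential(scale=1.0, size=(n, p)) - 1.0    # skewed, non-symmetric design
t = X @ u_star
g = 0.1 + 0.8 * (t > 0) * np.tanh(t)                 # a monotone, non-logistic link
Y = rng.binomial(1, g)

fit = LogisticRegression(C=1e6).fit(X, Y)            # mis-specified logistic link
u_hat = fit.coef_.ravel() / np.linalg.norm(fit.coef_)
print(u_hat @ u_star)    # cosine with true direction; values below 1 indicate bias
```

Under an elliptically symmetric design the logistic fit would typically recover the correct direction up to scale; it is the asymmetry of the design that drives the directional bias in this sketch.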

Classification with corrupted labels
Another interesting example involves classification with corrupted labels. Instead of observing a sample $(X,\tilde{Y})\in\mathbb{R}^{p}\times\mathbb{R}$, we observe the pair $(X,Y)\in\mathbb{R}^{p}\times\mathbb{R}$ where $Y$ is a systematically corrupted version of $\tilde{Y}$. In particular, $\tilde{Y}$ is related to $X$ via $\mathbb{E}[\tilde{Y}|X]=q(X^{T}u_{\star})$ for some unknown parameter vector $u_{\star}\in\mathbb{R}^{p}$ and a monotonic function $q$, which is potentially unknown.

We assume the corrupted sample $Y$ is independent of $X$ given $\tilde{Y}$, which corresponds to a missing-at-random assumption. The corruption structure is completely specified by a $2\times 2$ confounding matrix $Q$ with $Q_{k+1,l+1}=\mathbb{P}(Y=k|\tilde{Y}=l)$, whose off-diagonal entries are smaller than its diagonal entries. Under these assumptions, $\mathbb{E}[Y|X]=\mathbb{P}(Y=1|X)$ is given by

$$\begin{aligned}
\mathbb{P}(Y=1|X) &= q(X^{T}u_{\star})\,\mathbb{P}(Y=1|\tilde{Y}=1)+\big(1-q(X^{T}u_{\star})\big)\,\mathbb{P}(Y=1|\tilde{Y}=0)\\
&= q(X^{T}u_{\star})\,Q_{22}+\big(1-q(X^{T}u_{\star})\big)\,Q_{21}\\
&= g_{\star}(X^{T}u_{\star}) \qquad (1.2)
\end{aligned}$$

where we let $g_{\star}(s):=q(s)Q_{22}+(1-q(s))Q_{21}$. Note that $g_{\star}$ is monotonic, since $g_{\star}(s)=q(s)(Q_{22}-Q_{21})+Q_{21}$ and we assume that $Q_{22}>Q_{21}$. In many practical situations the confounding matrix $Q$ is unknown, which limits the parametric approach even when a specific $q$ is assumed, such as the sigmoid function $q(u)=1/(1+\exp(-u))$.
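As a quick sanity check on this reduction, the following sketch (our illustration; the sigmoid $q$ and the entries of $Q$ are hypothetical choices) simulates corrupted labels and verifies that the observed labels have conditional mean $g_{\star}(s)=q(s)Q_{22}+(1-q(s))Q_{21}$:

```python
import numpy as np

rng = np.random.default_rng(1)
Q21, Q22 = 0.1, 0.8                     # P(Y=1|Ytilde=0), P(Y=1|Ytilde=1); Q22 > Q21
q = lambda s: 1 / (1 + np.exp(-s))      # an assumed monotone q (sigmoid)
g_star = lambda s: q(s) * Q22 + (1 - q(s)) * Q21    # as defined after (1.2)

s = rng.normal(size=100_000)            # stand-in for X^T u_star
Y_tilde = rng.binomial(1, q(s))         # clean labels
Y = rng.binomial(1, np.where(Y_tilde == 1, Q22, Q21))   # corrupted labels

# E[Y | s] should match g_star(s): check on a coarse bin around s = 0.
mask = np.abs(s) < 0.1
print(Y[mask].mean(), g_star(0.0))      # both approximately 0.45
```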

Classification with presence-only data and unknown prevalence $\pi$
In various applications such as ecology, text mining, and biochemistry, negative samples are often hard or even impossible to obtain. In Positive-Unlabeled (PU) learning, we use positive and unlabeled samples, where positive samples are taken from the sub-population whose responses are known to be positive, and unlabeled samples are random samples from the population.

More concretely, let $(\tilde{X},\tilde{Y})\in\mathbb{R}^{p}\times\mathbb{R}$ be a random sample from the population. As in Example 2, we assume $\tilde{Y}$ is related to $\tilde{X}$ via $\mathbb{E}[\tilde{Y}|\tilde{X}]=g(\tilde{X}^{T}u_{\star})$ for some unknown parameter vector $u_{\star}\in\mathbb{R}^{p}$ and a monotonic function $g$. Instead of $(\tilde{X},\tilde{Y})$, we observe the pair $(X,Y)$, defined as follows. If a sample is labeled ($Y=1$), its associated response is positive ($\tilde{Y}=1$) and the sample is drawn from the positive sub-population, i.e. $X\overset{d}{=}(\tilde{X}|\tilde{Y}=1)$. On the contrary, if a sample is unlabeled ($Y=0$), the sample is drawn at random from the population, i.e. $X\overset{d}{=}\tilde{X}$, but its associated outcome is unknown.

Regarding $Y$ as a corrupted version of $\tilde{Y}$, the PU problem can be viewed as a special case of Example 2, where only positive samples ($\tilde{Y}=1$) are corrupted. In fact, the confounding matrix $Q$ takes the following form:

$$Q_{11}=\mathbb{P}(Y=0|\tilde{Y}=0)=1,\qquad Q_{22}=\mathbb{P}(Y=1|\tilde{Y}=1)=\frac{\gamma}{\gamma+\mathbb{P}(\tilde{Y}=1)},\qquad(1.3)$$

where $\gamma$ is the ratio of the number of positive to unlabeled samples. Therefore, following the same lines as in (1.2) and noting that $\tilde{Y}|X\overset{d}{=}\tilde{Y}|\tilde{X}$, we have $\mathbb{E}[Y|X]=g_{\star}(X^{T}u_{\star})$ for $g_{\star}$ defined in Example 2 with $Q$ defined in (1.3).
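For concreteness, here is a small sketch (our illustration with hypothetical sample sizes; $\pi$ is treated as known purely for this computation, whereas in practice it is not) of how $\gamma$ and $\pi$ determine $Q$ via (1.3), and hence the observed-label mean function from Example 2:

```python
import numpy as np

n_pos, n_unl = 2_000, 10_000
gamma = n_pos / n_unl                   # ratio of positive to unlabeled samples = 0.2
pi = 0.3                                # pi = P(Ytilde = 1); unknown in practice

Q11 = 1.0                               # unlabeled samples are never labeled
Q21 = 0.0                               # so P(Y=1 | Ytilde=0) = 0
Q22 = gamma / (gamma + pi)              # P(Y=1 | Ytilde=1), as in (1.3)

g = lambda s: 1 / (1 + np.exp(-s))      # an assumed monotone link for Ytilde
g_star = lambda s: g(s) * Q22 + (1 - g(s)) * Q21    # = Q22 * g(s); still monotone
print(Q22, g_star(0.0))                 # 0.4 and 0.2 for these hypothetical numbers
```

Since $Q_{21}=0$ here, $g_{\star}(s)=Q_{22}\,g(s)$: an unknown $\pi$ simply rescales the monotone link, which is exactly why treating $g_{\star}$ as unknown absorbs the unknown prevalence.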

We note that the conditional mean function $g_{\star}$ in (1.1) is known up to the contamination proportion $\pi=\mathbb{P}(\tilde{Y}=1)$. If the contamination proportion is known from an alternative source, we may use a parametric approach to estimate $u_{\star}$. However, if the contamination proportion is unknown, reliable estimation of this proportion is very challenging and relies heavily on model specification ([Hastie2013-wq]). The single index model framework translates the problem of unknown $\pi$ into the problem of unknown $g_{\star}$, and consequently provides a framework for addressing the PU-learning problem with unknown $\pi$.

Another interesting aspect of the PU problem is that the distribution of $X$ cannot be symmetric, regardless of the distribution of the original design $\tilde{X}$. This is because $X$ follows a mixture of the two distributions $\tilde{X}|\tilde{Y}=1$ and $\tilde{X}$; by the monotonicity assumption on $g$, the distribution of $\tilde{X}|\tilde{Y}=1$ cannot be symmetric, which precludes existing methods relying on the symmetric design assumption.

1.2 Prior work

The single index model has been an active area of research in statistics for a number of decades, since its introduction in econometrics and statistics [Ichimura1993-gx, Horowitz1996-jv, Klein1993-is]. There is an extensive literature on estimation of the index vector $u_{\star}$, which can be broadly classified into two categories: methods which directly estimate the index vector while avoiding a "nuisance" infinite-dimensional function, and methods which involve estimation of both the infinite-dimensional function and the index vector.

Methods of the first type usually require strong design assumptions, such as a Gaussian or elliptically symmetric distribution. [Duan1991-nk] proposed an estimator of $u_{\star}$ using sliced inverse regression in the context of sufficient dimension reduction; $\sqrt{n}$-consistency and asymptotic normality of the estimator are established in the low-dimensional regime, under the elliptically symmetric design assumption. This result is extended to the high-dimensional regime in [Lin2018-rk] under similar design assumptions. [Plan2013-ew] studied sparse recovery of the index vector $u_{\star}$ when $X$ is Gaussian, and established a non-asymptotic $\mathcal{O}((s\log(2p/s)/n)^{1/2})$ mean-squared error bound, where $s$ is the sparsity level, i.e. $s:=\|u_{\star}\|_{0}$. [Ai2014-lo] studied the error bound of the same estimator when the design is not Gaussian, and showed that the estimator is biased, with a bias depending on the size of $\|u_{\star}\|_{\infty}$. The estimator of [Plan2013-ew] has been generalized in several directions: [Yang2017-cl] propose an estimator based on the score function of the covariate distribution, which relies on prior knowledge of that distribution, and [Chen2017-py] proposed estimators based on U-statistics which include [Plan2013-ew] as a special case under the standard Gaussian design assumption.

Methods of the second type usually do not require strong design assumptions, and our method falls into this category. One popular approach is via M-estimation or solving estimating equations. Under smoothness assumptions on $g_{\star}$, methods for link function estimation have been proposed via kernel estimation ([Ichimura1993-gx, Hardle1993-oj, Klein1993-is, Delecroix2003-fv, Cui2011-go]), local polynomials ([Carroll1997-iz, Chiou1998-zw]), or splines ([Yu2002-hq, Kuchibhotla2016-ef]). An estimator for the index vector is then obtained by minimizing a criterion function such as the squared loss ([Ichimura1993-gx, Hardle1993-oj, Yu2002-hq]) or by solving an estimating equation ([Chiou1998-zw, Cui2011-go]); $\sqrt{n}$-consistency and asymptotic normality of these estimators were established in the low-dimensional regime. Another approach is the average derivative method ([Powell1989-je, Hristache2001-om]), which takes advantage of the fact that $\frac{d}{dx}\mathbb{E}[Y|X]=u_{\star}g_{\star}^{\prime}(X^{T}u_{\star})$. An estimator is then obtained through non-parametric estimation of $\frac{d}{dx}\mathbb{E}[Y|X]$, again under smoothness assumptions on $g_{\star}$. There are also studies of the single index model in the high-dimensional regime, using a PAC-Bayesian approach ([Alquier2013-ij]) or incorporating penalties ([Wang2012-mb, Radchenko2015-wa]).

An increasing number of studies restrict the link function to have a particular shape, such as monotonicity, without requiring smoothness. [Kalai2009-ew] and [Kakade2011-yk] investigated the single index problem in the low-dimensional regime under a monotonic and Lipschitz-continuous link assumption, and proposed iterative perceptron-type algorithms with prediction error guarantees; in particular, isotonic regression is used to estimate $g_{\star}$, and $u_{\star}$ is then updated based on the estimated $g_{\star}$. [Ganti2017-dq] extended [Kalai2009-ew, Kakade2011-yk] to the high-dimensional setting by incorporating a projection step onto a sparse subset of the parameter space; they empirically demonstrated good predictive performance, but no theoretical guarantees were provided. [Balabdaoui2016-sd] study least-squares estimators under a monotonicity constraint and establish $n^{1/3}$-consistency of the least-squares estimator. In [Balabdaoui2017-at], a $\sqrt{n}$-consistent estimator for the index vector $u_{\star}$, based on solving a score equation, is proposed.

Our work focuses on estimation of the index vector in the high-dimensional regime, where the unknown link function is assumed to be Lipschitz-continuous and monotonic. We provide an efficient algorithm via iterative hard-thresholding, together with non-asymptotic $\ell_{2}$ and mean function error bounds.

1.3 Our contributions

Our major contributions can be summarized as follows:

  • We develop a scalable projection-based iterative approach. At each iteration $t$, with current iterate $u^{t}$, we first perform isotonic regression of $Y$ onto $\mathbf{X}u^{t}$ using the PAVA algorithm, then take an orthogonal gradient step to update $u$, and finally enforce sparsity using a thresholding operator. The algorithm is stopped once the mean-squared error drops below a threshold $\tau$.

  • Our main contribution is to provide theoretical guarantees for our algorithm, showing that the error in both the parameter vector $u_{\star}$ and the mean function $g_{\star}(X^{T}u_{\star})$ scales as $\left(\frac{s\log p}{n}\right)^{1/3}$. Note that the minimax rate for low-dimensional isotonic regression is $n^{-1/3}$, and our algorithm achieves this rate up to the $\log p$ factor.

  • Since our primary goal is to estimate the parameter vector $u_{\star}$, we also show that under additional assumptions, including a $\beta$-min condition, a one-step correction achieves the parametric low-dimensional $\frac{1}{\sqrt{n}}$ rate. The one-step correction is motivated by the earlier work of [Balabdaoui2017-at], which proves $\frac{1}{\sqrt{n}}$ rates in the low-dimensional case.

  • Finally, we present a simulation study which supports our theoretical rates and also shows that our algorithm compares favorably to existing approaches.

The remainder of the paper is organized as follows: Section 2 provides the problem setup and develops our algorithm; Section 3 provides theoretical guarantees for our algorithm, which are supported by experimental results in Section 4. Proofs are provided in Section 5.

2 Problem setup and algorithm

Let $X_{1},\dots,X_{n}\in\mathbb{R}^{p}$ be the feature vectors and $Y_{i}\in\mathbb{R}$ the response values, which we assume satisfy the model

$$Y_{i}=g_{\star}(X_{i}^{\top}u_{\star})+Z_{i},\qquad(2.1)$$

where $g_{\star}$ is $L$-Lipschitz and monotone non-decreasing, $u_{\star}$ is an $s_{\star}$-sparse unit vector, and $Z_{i}\in\mathbb{R}$ denotes the noise term. We write $\mathbf{X}\in\mathbb{R}^{n\times p}$ to denote the matrix with rows $X_{i}$, and $\mathbf{Y}\in\mathbb{R}^{n}$ and $\mathbf{Z}\in\mathbb{R}^{n}$ to denote the vectors with entries $Y_{i}$ and $Z_{i}$, respectively. For any function $g:\mathbb{R}\rightarrow\mathbb{R}$ and any $v\in\mathbb{R}^{p}$, we write $g(\mathbf{X}v)$ to mean that $g$ is applied element-wise to the vector $\mathbf{X}v$, i.e. $g(\mathbf{X}v)=\big(g(X_{1}^{\top}v),\dots,g(X_{n}^{\top}v)\big)$.

2.1 Algorithm
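Before the formal description, here is a minimal runnable sketch of the iteration outlined in Section 1.3 (our simplified rendering, not the exact algorithm analyzed below: the step size $\eta$, the sparsity level $s$, the random initialization, and the particular form of the gradient step are illustrative assumptions):

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: isotonic (non-decreasing) least-squares fit,
    assuming y is already ordered by the predictor."""
    blocks = []                               # stack of [mean, count] blocks
    for v in y.astype(float):
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] >= blocks[-1][0]:
            v1, w1 = blocks.pop()             # merge adjacent violating blocks
            v0, w0 = blocks.pop()
            blocks.append([(v0 * w0 + v1 * w1) / (w0 + w1), w0 + w1])
    return np.concatenate([np.full(w, v) for v, w in blocks])

def isotonic_sim_iht(X, Y, s, eta=1.0, tau=1e-3, max_iter=100, seed=0):
    """Hedged sketch of the iterative procedure described in Section 1.3."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    u = rng.normal(size=p); u /= np.linalg.norm(u)
    for _ in range(max_iter):
        t = X @ u
        order = np.argsort(t)
        ghat = np.empty(n)
        ghat[order] = pava(Y[order])          # isotonic regression of Y on Xu
        resid = Y - ghat
        if np.mean(resid ** 2) < tau:         # stop once MSE drops below tau
            break
        u = u + (eta / n) * (X.T @ resid)     # one plausible gradient step
        keep = np.argsort(np.abs(u))[-s:]     # hard-threshold: keep top-s entries
        u_new = np.zeros(p); u_new[keep] = u[keep]
        u = u_new / np.linalg.norm(u_new)     # project back to the unit sphere
    return u, ghat

# Tiny usage example on data simulated from the model (2.1):
rng = np.random.default_rng(1)
n, p, s = 2000, 100, 5
u_true = np.zeros(p); u_true[:s] = 1 / np.sqrt(s)
X = rng.normal(size=(n, p))
Y = np.tanh(X @ u_true) + 0.1 * rng.normal(size=n)   # tanh: Lipschitz, monotone
u_hat, _ = isotonic_sim_iht(X, Y, s=s)
print(np.abs(u_hat @ u_true))                         # alignment with the truth
```

The theoretical guarantees of Section 3 apply to the algorithm as specified in this section; the sketch above is only meant to make the PAVA / gradient / thresholding cycle concrete.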

3 Theoretical guarantees

4 Numerical experiments

5 Proofs

Appendix A Proofs