
Monotone single index models with general non-symmetric design

Ran Dai1, Hyebin Song2, Rina Foygel Barber1, and Garvesh Raskutti2
1University of Chicago
2University of Wisconsin-Madison

1 Introduction

Single index models (SIMs) provide a semi-parametric extension of linear models, where a scalar response variable $Y\in\mathbb{R}$ is related to a predictor vector $X\in\mathbb{R}^{p}$ via

$$\mathbb{E}[Y|X]=g_{\star}(X^{T}u_{\star})\qquad(1.1)$$

for some unknown parameter vector $u_{\star}$ and unknown link function $g_{\star}$. In this article, we consider a shape-constrained single index model (1.1) in which $g_{\star}$ belongs to a class of monotonic (but not necessarily smooth) functions and the parameter $u_{\star}$ is high-dimensional and sparse. We develop a scalable algorithm and provide theoretical guarantees for this high-dimensional single index model. Prior theoretical guarantees for the high-dimensional single index model rely on the distribution of $X$ being known and symmetric (see e.g. [Plan2013-ew, Yang2017-cl]). However, a number of settings induce non-symmetric mixture distributions, for which these guarantees do not apply. We therefore provide a scalable algorithm with theoretical guarantees under more general design assumptions that include non-symmetric distributions.

We consider a general setting where only mild assumptions are made on the design matrix $X$, the error $\epsilon:=Y-\mathbb{E}[Y|X]$, and their interaction. In particular, $X$ can be deterministic or random; if $X$ is random, its distribution can be asymmetric; and the distribution of $\epsilon$ may depend on $X$. In addition, we assume that the unknown vector $u_{\star}$ is sparse and high-dimensional. Since $g_{\star}$ is not assumed to be known a priori, the model (1.1) provides a more flexible specification than a parametric model, while still circumventing the curse of dimensionality by avoiding $p$-dimensional non-parametric function estimation. Two estimation problems arise in the single index model framework: estimation of the unknown vector $u_{\star}$ and estimation of the one-dimensional regression function $g_{\star}$. Both are addressed in this paper.

To begin, we provide three motivating examples for the single index model in (1.1). Each example reduces to the estimation of $u_{\star}$, of $g_{\star}$, or of both under the model (1.1).

1.1 Examples

Mis-specified generalized linear models (GLMs)
The generalized linear model is a parametric extension of the linear model, which includes many popular models such as logistic and Poisson regression. GLMs relate the conditional expectation of $Y$ to $X$ via a link function $\psi$, such that $\mathbb{E}[Y|X]=\psi^{-1}(X^{T}u_{\star})$, where the link function $\psi$ is assumed to be known. The parameter of interest $u_{\star}$ is usually estimated via an iterative procedure using knowledge of $\psi$; therefore, when $\psi$ is mis-specified, the resulting estimate is inevitably biased for $u_{\star}$, even asymptotically. Using the single index model framework, we produce an estimate of $u_{\star}$ without specifying a particular link function.
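To make the bias concrete, the following minimal simulation sketch (ours, not from the paper; the skewed design, the particular monotone link, and the sample sizes are all illustrative assumptions) fits a logistic GLM to data generated with a different monotone link and a non-symmetric design, and measures how far the estimated direction is from $u_{\star}$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 50_000, 5
u_star = np.zeros(p); u_star[:2] = [0.8, -0.6]       # unit-norm true index

X = rng.exponential(scale=1.0, size=(n, p)) - 1.0    # skewed, non-symmetric design
t = X @ u_star
g = 0.1 + 0.8 * (t > 0) * np.tanh(t)                 # a monotone, non-logistic link
Y = rng.binomial(1, g)

fit = LogisticRegression(C=1e6).fit(X, Y)            # mis-specified logistic link
u_hat = fit.coef_.ravel() / np.linalg.norm(fit.coef_)
print(u_hat @ u_star)    # cosine with true direction; values below 1 indicate bias
```

Under an elliptically symmetric design the logistic fit would typically recover the correct direction up to scale; it is the asymmetry of the design that drives the directional bias in this sketch.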

Classification with corrupted labels
Another interesting example involves classification with corrupted labels. Instead of observing a sample $(X,\tilde{Y})\in\mathbb{R}^{p}\times\mathbb{R}$, we observe the pair $(X,Y)\in\mathbb{R}^{p}\times\mathbb{R}$ where $Y$ is a systematically corrupted version of $\tilde{Y}$. In particular, $\tilde{Y}$ is related to $X$ via $\mathbb{E}[\tilde{Y}|X]=q(X^{T}u_{\star})$ for some unknown parameter vector $u_{\star}\in\mathbb{R}^{p}$ and a monotonic function $q$, which is potentially unknown.

We assume the corrupted sample $Y$ is independent of $X$ given $\tilde{Y}$, which corresponds to a missing-at-random assumption. The corruption structure is completely specified by a $2\times 2$ confounding matrix $Q$ with $Q_{k+1,l+1}=\mathbb{P}(Y=k|\tilde{Y}=l)$, whose off-diagonal entries are smaller than its diagonal entries. Under these assumptions, $\mathbb{E}[Y|X]=\mathbb{P}(Y=1|X)$ is given by

$$\begin{aligned}
\mathbb{P}(Y=1|X) &= q(X^{T}u_{\star})\,\mathbb{P}(Y=1|\tilde{Y}=1)+\big(1-q(X^{T}u_{\star})\big)\,\mathbb{P}(Y=1|\tilde{Y}=0)\\
&= q(X^{T}u_{\star})\,Q_{22}+\big(1-q(X^{T}u_{\star})\big)\,Q_{21}\\
&= g_{\star}(X^{T}u_{\star}) \qquad (1.2)
\end{aligned}$$

where we let $g_{\star}(s):=q(s)Q_{22}+(1-q(s))Q_{21}$. Note that $g_{\star}$ is monotonic, since $g_{\star}(s)=q(s)(Q_{22}-Q_{21})+Q_{21}$ and we assume that $Q_{22}>Q_{21}$. In many practical situations the confounding matrix $Q$ is unknown, which limits the parametric approach even when a specific $q$ is assumed, such as the sigmoid function $q(u)=1/(1+\exp(-u))$.
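As a quick sanity check on this reduction, the following sketch (our illustration; the sigmoid $q$ and the entries of $Q$ are hypothetical choices) simulates corrupted labels and verifies that the observed labels have conditional mean $g_{\star}(s)=q(s)Q_{22}+(1-q(s))Q_{21}$:

```python
import numpy as np

rng = np.random.default_rng(1)
Q21, Q22 = 0.1, 0.8                     # P(Y=1|Ytilde=0), P(Y=1|Ytilde=1); Q22 > Q21
q = lambda s: 1 / (1 + np.exp(-s))      # an assumed monotone q (sigmoid)
g_star = lambda s: q(s) * Q22 + (1 - q(s)) * Q21    # as defined after (1.2)

s = rng.normal(size=100_000)            # stand-in for X^T u_star
Y_tilde = rng.binomial(1, q(s))         # clean labels
Y = rng.binomial(1, np.where(Y_tilde == 1, Q22, Q21))   # corrupted labels

# E[Y | s] should match g_star(s): check on a coarse bin around s = 0.
mask = np.abs(s) < 0.1
print(Y[mask].mean(), g_star(0.0))      # both approximately 0.45
```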

Classification with presence-only data and unknown prevalence $\pi$
In various applications such as ecology, text mining, and biochemistry, negative samples are often hard or even impossible to obtain. In Positive-Unlabeled (PU) learning, we use positive and unlabeled samples, where positive samples are taken from the sub-population whose responses are known to be positive, and unlabeled samples are random samples from the population.

More concretely, let $(\tilde{X},\tilde{Y})\in\mathbb{R}^{p}\times\mathbb{R}$ be a random sample from the population. As in Example 2, we assume $\tilde{Y}$ is related to $\tilde{X}$ via $\mathbb{E}[\tilde{Y}|\tilde{X}]=g(\tilde{X}^{T}u_{\star})$ for some unknown parameter vector $u_{\star}\in\mathbb{R}^{p}$ and a monotonic function $g$. Instead of $(\tilde{X},\tilde{Y})$, we observe the pair $(X,Y)$, defined as follows. If a sample is labeled ($Y=1$), its associated response is positive ($\tilde{Y}=1$) and the sample is drawn from the positive sub-population, i.e. $X\overset{d}{=}(\tilde{X}|\tilde{Y}=1)$. On the contrary, if a sample is unlabeled ($Y=0$), the sample is drawn at random from the population, i.e. $X\overset{d}{=}\tilde{X}$, but its associated outcome is unknown.

Regarding $Y$ as a corrupted version of $\tilde{Y}$, the PU problem can be viewed as a special case of Example 2, where only positive samples ($\tilde{Y}=1$) are corrupted. In fact, the confounding matrix $Q$ takes the following form:

$$Q_{11}=\mathbb{P}(Y=0|\tilde{Y}=0)=1,\qquad Q_{22}=\mathbb{P}(Y=1|\tilde{Y}=1)=\frac{\gamma}{\gamma+\mathbb{P}(\tilde{Y}=1)},\qquad(1.3)$$

where $\gamma$ is the ratio of the number of positive to unlabeled samples. Therefore, following the same lines as in (1.2) and noting that $\tilde{Y}|X\overset{d}{=}\tilde{Y}|\tilde{X}$, we have $\mathbb{E}[Y|X]=g_{\star}(X^{T}u_{\star})$ for $g_{\star}$ defined in Example 2 with $Q$ defined in (1.3).
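For concreteness, here is a small sketch (our illustration with hypothetical sample sizes; $\pi$ is treated as known purely for this computation, whereas in practice it is not) of how $\gamma$ and $\pi$ determine $Q$ via (1.3), and hence the observed-label mean function from Example 2:

```python
import numpy as np

n_pos, n_unl = 2_000, 10_000
gamma = n_pos / n_unl                   # ratio of positive to unlabeled samples = 0.2
pi = 0.3                                # pi = P(Ytilde = 1); unknown in practice

Q11 = 1.0                               # unlabeled samples are never labeled
Q21 = 0.0                               # so P(Y=1 | Ytilde=0) = 0
Q22 = gamma / (gamma + pi)              # P(Y=1 | Ytilde=1), as in (1.3)

g = lambda s: 1 / (1 + np.exp(-s))      # an assumed monotone link for Ytilde
g_star = lambda s: g(s) * Q22 + (1 - g(s)) * Q21    # = Q22 * g(s); still monotone
print(Q22, g_star(0.0))                 # 0.4 and 0.2 for these hypothetical numbers
```

Since $Q_{21}=0$ here, $g_{\star}(s)=Q_{22}\,g(s)$: an unknown $\pi$ simply rescales the monotone link, which is exactly why treating $g_{\star}$ as unknown absorbs the unknown prevalence.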

We note that the conditional mean function $g_{\star}$ in (1.1) is known up to the contamination proportion $\pi=\mathbb{P}(\tilde{Y}=1)$. If the contamination proportion is known from an alternative source, we may use a parametric approach to estimate $u_{\star}$. However, if the contamination proportion is unknown, reliable estimation of this proportion is very challenging and relies heavily on model specification ([Hastie2013-wq]). The single index model framework translates the problem of unknown $\pi$ into the problem of unknown $g_{\star}$, and consequently provides a framework for addressing the PU-learning problem with unknown $\pi$.

Another interesting aspect of the PU problem is that the distribution of $X$ cannot be symmetric, regardless of the distribution of the original design $\tilde{X}$. This is because $X$ follows a mixture of the two distributions $\tilde{X}|\tilde{Y}=1$ and $\tilde{X}$; by the monotonicity assumption on $g$, the distribution of $\tilde{X}|\tilde{Y}=1$ cannot be symmetric, which precludes existing methods relying on the symmetric design assumption.

1.2 Prior work

The single index model has been an active area of research in statistics for a number of decades, since its introduction in econometrics and statistics [Ichimura1993-gx, Horowitz1996-jv, Klein1993-is]. There is an extensive literature on estimation of the index vector $u_{\star}$, which can be broadly classified into two categories: methods which directly estimate the index vector while avoiding a "nuisance" infinite-dimensional function, and methods which involve estimation of both the infinite-dimensional function and the index vector.

Methods of the first type usually require strong design assumptions, such as a Gaussian or elliptically symmetric distribution. [Duan1991-nk] proposed an estimator of $u_{\star}$ using sliced inverse regression in the context of sufficient dimension reduction; $\sqrt{n}$-consistency and asymptotic normality of the estimator are established in the low-dimensional regime, under the elliptically symmetric design assumption. This result is extended to the high-dimensional regime in [Lin2018-rk] under similar design assumptions. [Plan2013-ew] studied sparse recovery of the index vector $u_{\star}$ when $X$ is Gaussian, and established a non-asymptotic $\mathcal{O}((s\log(2p/s)/n)^{1/2})$ mean-squared error bound, where $s$ is the sparsity level, i.e. $s:=\|u_{\star}\|_{0}$. [Ai2014-lo] studied the error bound of the same estimator when the design is not Gaussian, and showed that the estimator is biased, with a bias depending on the size of $\|u_{\star}\|_{\infty}$. The estimator of [Plan2013-ew] has been generalized in several directions: [Yang2017-cl] propose an estimator based on the score function of the covariate distribution, which relies on prior knowledge of that distribution, and [Chen2017-py] proposed estimators based on U-statistics which include [Plan2013-ew] as a special case under the standard Gaussian design assumption.

Methods of the second type usually do not require strong design assumptions, and our method falls into this category. One popular approach is via M-estimation or solving estimating equations. Under smoothness assumptions on $g_{\star}$, methods for link function estimation have been proposed via kernel estimation ([Ichimura1993-gx, Hardle1993-oj, Klein1993-is, Delecroix2003-fv, Cui2011-go]), local polynomials ([Carroll1997-iz, Chiou1998-zw]), or splines ([Yu2002-hq, Kuchibhotla2016-ef]). An estimator for the index vector is then obtained by minimizing a criterion function such as the squared loss ([Ichimura1993-gx, Hardle1993-oj, Yu2002-hq]) or by solving an estimating equation ([Chiou1998-zw, Cui2011-go]); $\sqrt{n}$-consistency and asymptotic normality of these estimators were established in the low-dimensional regime. Another approach is the average derivative method ([Powell1989-je, Hristache2001-om]), which takes advantage of the fact that $\frac{d}{dx}\mathbb{E}[Y|X]=u_{\star}g_{\star}^{\prime}(X^{T}u_{\star})$. An estimator is then obtained through non-parametric estimation of $\frac{d}{dx}\mathbb{E}[Y|X]$, again under smoothness assumptions on $g_{\star}$. There are also studies of the single index model in the high-dimensional regime, using a PAC-Bayesian approach ([Alquier2013-ij]) or incorporating penalties ([Wang2012-mb, Radchenko2015-wa]).

An increasing number of studies restrict the link function to have a particular shape, such as monotonicity, without requiring smoothness. [Kalai2009-ew] and [Kakade2011-yk] investigated the single index problem in the low-dimensional regime under a monotonic and Lipschitz-continuous link assumption, and proposed iterative perceptron-type algorithms with prediction error guarantees; in particular, isotonic regression is used to estimate $g_{\star}$, and $u_{\star}$ is then updated based on the estimated $g_{\star}$. [Ganti2017-dq] extended [Kalai2009-ew, Kakade2011-yk] to the high-dimensional setting by incorporating a projection step onto a sparse subset of the parameter space; they empirically demonstrated good predictive performance, but no theoretical guarantees were provided. [Balabdaoui2016-sd] study least-squares estimators under a monotonicity constraint and establish $n^{1/3}$-consistency of the least-squares estimator. In [Balabdaoui2017-at], a $\sqrt{n}$-consistent estimator for the index vector $u_{\star}$, based on solving a score equation, is proposed.

Our work focuses on estimation of the index vector in the high-dimensional regime, where the unknown link function is assumed to be Lipschitz-continuous and monotonic. We provide an efficient algorithm via iterative hard-thresholding, together with non-asymptotic $\ell_{2}$ and mean function error bounds.

1.3 Our contributions

Our major contributions can be summarized as follows:

  • We develop a scalable projection-based iterative approach. At each iteration $t$, with current iterate $u^{t}$, we first perform isotonic regression of $Y$ onto $\mathbf{X}u^{t}$ using the PAVA algorithm, then take an orthogonal gradient step to update $u$, and finally enforce sparsity using a thresholding operator. The algorithm is stopped once the mean-squared error drops below a threshold $\tau$.

  • Our main contribution is to provide theoretical guarantees for our algorithm, showing that the error in both the parameter vector $u_{\star}$ and the mean function $g_{\star}(X^{T}u_{\star})$ scales as $\left(\frac{s\log p}{n}\right)^{1/3}$. Note that the minimax rate for low-dimensional isotonic regression is $n^{-1/3}$, and our algorithm achieves this rate up to the $\log p$ factor.

  • Since our primary goal is to estimate the parameter vector $u_{\star}$, we also show that under additional assumptions, including a $\beta$-min condition, a one-step correction achieves the parametric low-dimensional $\frac{1}{\sqrt{n}}$ rate. The one-step correction is motivated by the earlier work of [Balabdaoui2017-at], which proves $\frac{1}{\sqrt{n}}$ rates in the low-dimensional case.

  • Finally, we present a simulation study which supports our theoretical rates and also shows that our algorithm compares favorably to existing approaches.

The remainder of the paper is organized as follows: Section 2 provides the problem setup and develops our algorithm; Section 3 provides theoretical guarantees for our algorithm, which are supported by experimental results in Section 4. Proofs are provided in Section 5.

2 Problem setup and algorithm

Let $X_{1},\dots,X_{n}\in\mathbb{R}^{p}$ be the feature vectors and $Y_{i}\in\mathbb{R}$ the response values, which we assume satisfy the model

$$Y_{i}=g_{\star}(X_{i}^{\top}u_{\star})+Z_{i},\qquad(2.1)$$

where $g_{\star}$ is $L$-Lipschitz and monotone non-decreasing, $u_{\star}$ is an $s_{\star}$-sparse unit vector, and $Z_{i}\in\mathbb{R}$ denotes the noise term. We write $\mathbf{X}\in\mathbb{R}^{n\times p}$ to denote the matrix with rows $X_{i}$, and $\mathbf{Y}\in\mathbb{R}^{n}$ and $\mathbf{Z}\in\mathbb{R}^{n}$ to denote the vectors with entries $Y_{i}$ and $Z_{i}$, respectively. For any function $g:\mathbb{R}\rightarrow\mathbb{R}$ and any $v\in\mathbb{R}^{p}$, we write $g(\mathbf{X}v)$ to mean that $g$ is applied element-wise to the vector $\mathbf{X}v$, i.e. $g(\mathbf{X}v)=\big(g(X_{1}^{\top}v),\dots,g(X_{n}^{\top}v)\big)$.

2.1 Algorithm
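Before the formal description, here is a minimal runnable sketch of the iteration outlined in Section 1.3 (our simplified rendering, not the exact algorithm analyzed below: the step size $\eta$, the sparsity level $s$, the random initialization, and the particular form of the gradient step are illustrative assumptions):

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: isotonic (non-decreasing) least-squares fit,
    assuming y is already ordered by the predictor."""
    blocks = []                               # stack of [mean, count] blocks
    for v in y.astype(float):
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] >= blocks[-1][0]:
            v1, w1 = blocks.pop()             # merge adjacent violating blocks
            v0, w0 = blocks.pop()
            blocks.append([(v0 * w0 + v1 * w1) / (w0 + w1), w0 + w1])
    return np.concatenate([np.full(w, v) for v, w in blocks])

def isotonic_sim_iht(X, Y, s, eta=1.0, tau=1e-3, max_iter=100, seed=0):
    """Hedged sketch of the iterative procedure described in Section 1.3."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    u = rng.normal(size=p); u /= np.linalg.norm(u)
    for _ in range(max_iter):
        t = X @ u
        order = np.argsort(t)
        ghat = np.empty(n)
        ghat[order] = pava(Y[order])          # isotonic regression of Y on Xu
        resid = Y - ghat
        if np.mean(resid ** 2) < tau:         # stop once MSE drops below tau
            break
        u = u + (eta / n) * (X.T @ resid)     # one plausible gradient step
        keep = np.argsort(np.abs(u))[-s:]     # hard-threshold: keep top-s entries
        u_new = np.zeros(p); u_new[keep] = u[keep]
        u = u_new / np.linalg.norm(u_new)     # project back to the unit sphere
    return u, ghat

# Tiny usage example on data simulated from the model (2.1):
rng = np.random.default_rng(1)
n, p, s = 2000, 100, 5
u_true = np.zeros(p); u_true[:s] = 1 / np.sqrt(s)
X = rng.normal(size=(n, p))
Y = np.tanh(X @ u_true) + 0.1 * rng.normal(size=n)   # tanh: Lipschitz, monotone
u_hat, _ = isotonic_sim_iht(X, Y, s=s)
print(np.abs(u_hat @ u_true))                         # alignment with the truth
```

The theoretical guarantees of Section 3 apply to the algorithm as specified in this section; the sketch above is only meant to make the PAVA / gradient / thresholding cycle concrete.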

3 Theoretical guarantees

4 Numerical experiments

5 Proofs

Appendix A Proofs