
Jian Liang
Department of Automation, Tsinghua University
State Key Lab of Intelligent Technologies and Systems
Tsinghua National Laboratory for Information Science and Technology (TNList)
Beijing 100084, P.R. China
E-mail: [email protected]

Kun Chen
Department of Statistics, University of Connecticut
Storrs, CT 06269, USA
E-mail: [email protected]

Ming Lin
Department of Computational Medicine and Bioinformatics, University of Michigan
Ann Arbor, MI 48109, USA
E-mail: [email protected]

Changshui Zhang
Department of Automation, Tsinghua University
State Key Lab of Intelligent Technologies and Systems
Tsinghua National Laboratory for Information Science and Technology (TNList)
Beijing 100084, P.R. China
E-mail: [email protected]

Fei Wang
Department of Healthcare Policy and Research, Cornell University
New York, NY 10065, USA
E-mail: [email protected]

Robust Finite Mixture Regression for Heterogeneous Targets

Jian Liang, Kun Chen, Ming Lin, Changshui Zhang, Fei Wang
(Received: date / Accepted: date)
Abstract

Finite Mixture Regression (FMR) refers to the mixture modeling scheme that learns multiple regression models from the training data set, each of which is responsible for a subset of the samples. FMR is an effective scheme for handling sample heterogeneity, where a single regression model is not enough to capture the complexities of the conditional distribution of the observed samples given the features. In this paper, we propose an FMR model that 1) finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously, 2) achieves shared feature selection among tasks and cluster components, and 3) detects anomaly tasks or clustered structure among tasks, and accommodates outlier samples. We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework. The proposed model is evaluated on both synthetic and real-world data sets. The results show that our model achieves state-of-the-art performance.

Keywords:
Finite Mixture Regression · Mixed-type Response · Incomplete Targets · Anomaly Detection · Task Clustering
Journal: Data Mining and Knowledge Discovery

1 Introduction

Regression modeling, which refers to building models that learn the conditional relationship between output targets and input features from training samples, is a fundamental problem in statistics and machine learning. Classical regression modeling approaches include least squares regression, logistic regression and Poisson regression; see, e.g., Bishop (2006), Kubat (2015), Fahrmeir et al (2013) and the references therein.

The aforementioned classical approaches usually train a single model for a single target over the entire data set. However, real-world problems can be much more complicated. In particular, the need to utilize high-dimensional features, population heterogeneity, and multiple interrelated targets are among the most prominent complications. To handle high-dimensional data, the celebrated regularized estimation approaches have undergone exciting developments in recent years; see, e.g., Fan and Lv (2010) and Huang et al (2012). In the presence of population heterogeneity, the samples may form several distinct clusters corresponding to mixed relationships between the targets and the features. A popular modeling strategy in such a scenario is Finite Mixture Regression (FMR) (McLachlan and Peel, 2004), which is capable of adaptively learning multiple models, each of which is responsible for one subset/cluster of the data. FMR models have been widely used in market segmentation studies, patients' disease progression subtyping, motif gene-expression research, etc.; see, e.g., Städler et al (2010), Khalili (2011), Khalili and Chen (2007), Doğru and Arslan (2017), and the references therein. The problem of jointly learning multiple targets is usually referred to as Multi-Task Learning (MTL) in machine learning or multivariate learning in statistics; see, e.g., Argyriou et al (2007a), Argyriou et al (2007b), Chen et al (2011), and Gong et al (2012b). We stress that the general definition of MTL considers tasks that do not necessarily share the same set of samples (and features), and that this paper focuses on a special case of MTL, where the multivariate outcomes are collected from the same set of samples and share the same set of features; the reason will be explained later. There have also been multi-task FMR models, e.g., Wedel and DeSarbo (1995), Xian Wang et al (2004), Kyung Lim et al (2016) and Bai et al (2016), which are mainly built on certain multivariate probability distributions such as the Gaussian or multivariate $t$ distribution.

Thus far, a comprehensive study on multi-task mixture-regression modeling with high-dimensional data is still lacking. To tackle this problem in real-world applications, several challenges and practical concerns remain.

  • Task Heterogeneity. Current MTL approaches usually assume that the targets are of the same type. However, it is common that the multiple targets are of different types, such as continuous, binary, count, etc., which we refer to as task heterogeneity. For example, in anesthesia decision making (Tan et al, 2010), anesthesia drugs affect multiple indicators of the patient, such as anesthesia depth, blood pressure, heart rate, etc. The anesthesiologist needs to consider all these different aspects as well as their intrinsic dependence before making a decision.

  • Task Integration. As in the anesthesiology example, the multiple tasks are typically inter-related, and the potential benefit of an MTL approach needs to be realized by properly exploring and taking advantage of these relationships. In existing high-dimensional MTL approaches, the tasks are usually integrated by assuming certain shared conditional mean structures between the targets and the features. The problem becomes more difficult in the presence of both task and population heterogeneity.

  • Task Robustness. Similar to the idea in the robust MTL approaches (Passos et al, 2012; Gong et al, 2012a; Chen et al, 2011), it is not always the case that jointly considering all tasks by assuming certain shared structures among them would be helpful. Certain tasks, referred to as anomaly tasks, may not follow the assumed shared structure and thus can ruin the overall model performance. More generally, tasks may even cluster into groups with different shared structures.

In this paper, we propose a novel method named HEterogeneous-target Robust MIxTure regression (Hermit) to address the above challenges in a unified framework. We mainly consider the setting where the multivariate outcomes are collected from the same set of samples and share the same set of features, because our main objective is to learn potentially shared sample clusters and feature sets among tasks. Rigorous theoretical analysis and performance guarantees are provided. It is worthwhile to highlight the key aspects of our approach as follows.

  • Our method handles mixed types of targets simultaneously. Each target follows an exponential dispersion family model (Jorgensen, 1987), so that multiple different types of targets, e.g., continuous, binary, and counts, can be handled jointly. The tasks are naturally integrated through sharing the same clustering structure arising from population heterogeneity. Our theory allows Hermit to cover sub-exponential distributions, including the commonly-encountered Bernoulli, Poisson and Gaussian distributions as special cases.

  • Our method imposes structural constraints in each mixture component of Hermit, to deal with the curse of dimensionality and, at the same time, to further take advantage of the interrelationship of the tasks. In particular, the group $\ell_1$ penalization is adopted to perform shared feature selection among tasks within each mixture component.

  • Our method adopts three strategies for robustness. First, we adopt a mean-shift regularization technique (She and Chen, 2017) to detect outlier samples automatically and adjust for their outlying effects in model estimation. The second strategy measures the discrepancy between different conditional distributions to detect anomaly tasks. The third strategy measures the similarity between each pair of tasks to discover a clustered structure among tasks. Moreover, our model can work with incomplete data and impute entry-wise missing values in the multiple targets.

The aforementioned key elements, e.g., multi-task learning, sample clustering, shared feature selection, and anomaly detection, are integrated in a unified mixture model setup, so that they can benefit from and reinforce each other. A generalized Expectation-Maximization (GEM) (Neal and Hinton, 1998) algorithm is developed to conduct model estimation efficiently. For theoretical analysis, we generalize the results of Städler et al (2010) to establish non-asymptotic oracle performance bounds for Hermit under a high-dimensional learning framework. This is nontrivial due to the non-convexity of the problem (caused by the population heterogeneity) and the target heterogeneity.

The rest of this paper is organized as follows. Section 2 provides a brief review of the background and the related works to our method. Section 3 presents the details of the proposed Hermit  model and the computational algorithm. Section 4 discusses several extensions of our method. Section 5 shows the theoretical analysis. The empirical evaluations are presented in Section 6, followed by the discussions and conclusions in Section 7.

2 Background & Related Work

Let $Y\in\mathcal{Y}\subset\mathbb{R}$ be the output target and $\mathbf{x}\in\mathcal{X}\subset\mathbb{R}^{d}$ the input feature vector. GLMs (Nelder and Baker, 1972) postulate that the conditional probability density function of $Y$ given $\mathbf{x}$ is

f(y\mid\mathbf{x},\theta)=f(y\mid\varphi,\phi)=\exp\biggl\{\frac{y\varphi-b(\varphi)}{a(\phi)}+c(y,\phi)\biggr\},

where $\varphi=\mathbf{x}^{\rm T}\boldsymbol{\beta}$ with $\boldsymbol{\beta}$ being the regression coefficient vector, $\phi$ is a dispersion parameter, and $a(\cdot),b(\cdot),c(\cdot)$ are known functions whose forms are determined by the specific distribution. Here we use $\theta$ to denote the collection of all the unknown parameters, i.e., $\theta=(\boldsymbol{\beta},\phi)$. Least squares regression, logistic regression and Poisson regression are all special cases of GLMs. In the presence of population heterogeneity, a standard finite mixture of GLMs postulates that the conditional probability density function of $Y$ given $\mathbf{x}$ is

f(y\mid\mathbf{x},\theta)=\sum_{r=1}^{k}\pi_{r}f(y\mid\varphi_{r},\phi_{r}),

where $\varphi_{r}=\mathbf{x}^{\rm T}\boldsymbol{\beta}_{r}$ with $\boldsymbol{\beta}_{r}$ being the regression coefficient vector for the $r$th mixture component, and $\pi_{r}>0\ (r=1,\dots,k)$ with $\sum_{r=1}^{k}\pi_{r}=1$. The FMR model thus assumes that there are $k$ sub-populations, each of which admits a different conditional relationship between $Y$ and $\mathbf{x}$.
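For concreteness, the Poisson case can be written in this exponential dispersion form as follows (a standard special case, shown here only as an illustration):

f(y\mid\varphi)=\frac{e^{-\mu}\mu^{y}}{y!}=\exp\bigl\{y\varphi-e^{\varphi}-\log y!\bigr\},\qquad \varphi=\log\mu,

so that $b(\varphi)=e^{\varphi}$, $a(\phi)=1$ and $c(y,\phi)=-\log y!$; similarly, the Bernoulli case corresponds to $b(\varphi)=\log(1+e^{\varphi})$ and the Gaussian case with unit variance to $b(\varphi)=\varphi^{2}/2$.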

McLachlan and Peel (2004) introduced finite mixtures of GLMs. Bartolucci and Scaccia (2005) considered a special case in which $\boldsymbol{\beta}_{1},\dots,\boldsymbol{\beta}_{k}$ differ only in their first entries. Khalili and Chen (2007) discussed using sparse penalties such as the Lasso and SCAD to perform feature selection for FMR models and showed asymptotic properties of the penalized estimators. Städler et al (2010) reparameterized the finite mixture of Gaussian regressions and used $\ell_1$ penalization to achieve a bounded log-likelihood and consistent feature selection. These methods only consider the univariate-outcome case. For multiple targets, Wedel and DeSarbo (1995) proposed finite mixtures of GLMs with multivariate targets, and Weruaga and Vía (2015) proposed multivariate Gaussian mixture regression with an $\ell_1$ penalty for sparseness of the parameters. Besides mixtures of GLMs, there have been many works on mixtures of other continuous distributions such as the $t$ and Laplace distributions, mainly motivated by the need for robust estimation of heavy-tailed or skewed continuous distributions; see, e.g., Xian Wang et al (2004); Doğru and Arslan (2017); Alfò et al (2016); Doğru and Arslan (2016); Kyung Lim et al (2016); Bai et al (2016). However, these methods assume that the targets are of the same type, and only consider interrelationships among tasks with continuous outcomes. Additionally, they all assume that their FMR model is shared by all the tasks.

In MTL, Kumar and Daumé III (2012), Passos et al (2012), Gong et al (2012a), Chen et al (2011), Jacob et al (2009), Chen et al (2010) and He and Lawrence (2011) tackled the problem that different groups of tasks share different information, providing methods to handle anomaly tasks, clustered structure or graph-based structure among tasks. Yang et al (2009) proposed a multi-task framework to jointly learn tasks with Gaussian and multinomial output types. Zhang et al (2012) proposed a multi-modal multi-task model to jointly predict clinical variables for regression and categorical variables for classification. Li et al (2014) proposed a heterogeneous multi-task learning framework to simultaneously learn a pose-joint regressor and a sliding-window body-part detector in a deep network architecture. Nevertheless, these MTL methods cannot handle heterogeneity in the conditional relationship between features and targets.

By contrast, the proposed FMR framework Hermit is effective for handling sample heterogeneity with mixed types of tasks, whose interrelationships are harnessed by structural constraints. Non-asymptotic theoretical guarantees are provided. It also handles anomaly tasks or clustered structure among tasks, for the case where not all tasks share the same FMR structure.

3 HEterogeneous-target Robust MIxTure regression

In this section, we first present the formulation of the main Hermit model, followed by penalized likelihood estimators with sparse and structural constraints, respectively. We then introduce the associated optimization procedures, and describe how to perform sample clustering and impute missing/unobserved outcomes in incomplete multi-target data based on the main model. Hyper-parameter tuning is discussed last. Various extensions of the main methodology, including strategies to handle anomaly tasks or clustered tasks, will be introduced in Section 4.

3.1 Model Formulation and Estimation Criterion

Let $\mathbf{Y}\in\mathbb{R}^{n\times m}$ be the output/target data matrix and $\mathbf{X}\in\mathbb{R}^{n\times d}$ the input/feature data matrix, consisting of $n$ independent samples $(\mathbf{y}_{i},\mathbf{x}_{i})$, $i=1,\ldots,n$. As such, there are $m$ different targets with a common set of $d$ features. We allow $\mathbf{Y}$ to contain values missing at random; let $\Omega_{i}=\{j\in\{1,\ldots,m\}: y_{ij}\mbox{ is observed}\}$ be the collection of indices of the observed outcomes in the $i$th sample $\mathbf{y}_{i}$ ($\Omega_{i}\neq\emptyset$), for $i=1,\ldots,n$.

To model multiple types of targets, such as continuous, binary, count, etc., we allow $y_{ij}$ to potentially follow different distributions in the exponential-dispersion family, for each $j=1,\ldots,m$. Specifically, we assume that given $\mathbf{x}_{i}$, the joint probability density function of $\tilde{\mathbf{y}}_{i}=\{y_{ij};j\in\Omega_{i}\}$ is

f(\tilde{\mathbf{y}}_{i}\mid\mathbf{x}_{i},\theta)=f(y_{ij},j\in\Omega_{i}\mid\mathbf{x}_{i},\theta)=\sum_{r=1}^{k}\pi_{r}\prod_{j\in\Omega_{i}}f(y_{ij}\mid\varphi_{ijr},\phi_{jr}), (1)

where

f(y_{ij}\mid\varphi_{ijr},\phi_{jr})=\exp\biggl\{\frac{y_{ij}\varphi_{ijr}-b_{j}(\varphi_{ijr})}{a_{j}(\phi_{jr})}+c_{j}(y_{ij},\phi_{jr})\biggr\},

$\varphi_{ijr}$ is the natural parameter for the $i$th sample of the $j$th target in the $r$th mixture component, $\phi_{jr}$ is the dispersion parameter of the $j$th target in the $r$th mixture component, and the functions $a_{j},b_{j},c_{j}\ (j=1,\ldots,m)$ are determined by the specific distribution of the $j$th target. Here, the key assumption is that the $m$ tasks all correspond to the same cluster structure (e.g., the $m$ tasks all have $k$ clusters) determined by the underlying population heterogeneity; given the shared cluster label (e.g., $r$), the tasks within each mixture component then become independent of each other (depicted by the product of their probability density functions). As such, by allowing cluster label sharing, the model provides an effective way to genuinely integrate the learning of the multiple tasks.

Following the setup of GLMs, we assume a linear structure in the natural parameters, i.e.,

\varphi_{ijr}=\mathbf{x}_{i}\boldsymbol{\beta}_{jr}, (2)

where $\boldsymbol{\beta}_{jr}$ is the regression coefficient vector of the $j$th response in the $r$th mixture component. Since $\mathbf{x}_{i}$ is possibly of high dimensionality, the $\boldsymbol{\beta}_{jr}$s are potentially sparse vectors. For example, when the $\boldsymbol{\beta}_{jr}$s for $j=1,\ldots,m$ share the same sparsity pattern, the tasks share the same set of relevant features within each mixture component. For $r=1,\ldots,k$, write $\boldsymbol{\beta}_{r}=[\boldsymbol{\beta}_{1r},\boldsymbol{\beta}_{2r},\ldots,\boldsymbol{\beta}_{mr}]\in\mathbb{R}^{d\times m}$ and $\phi_{r}=[\phi_{1r},\ldots,\phi_{mr}]^{T}$. Also write $\boldsymbol{\beta}=[\boldsymbol{\beta}_{1},\ldots,\boldsymbol{\beta}_{k}]\in\mathbb{R}^{(d\times m)\times k}$. Let $\theta=\{\boldsymbol{\beta},\phi_{1},\ldots,\phi_{k},\pi_{1},\ldots,\pi_{k}\}$ collect all the unknown parameters, with the parameter space given by $\Theta=\mathbb{R}^{(d\times m)\times k}\times\mathbb{R}^{m\times k}_{>0}\times\Pi$, where $\Pi=\{\pi;\pi_{r}>0\mbox{ for }r=1,\ldots,k\mbox{ and }\sum_{r=1}^{k}\pi_{r}=1\}$.

The data log-likelihood of the proposed model is

\begin{split}\ell(\theta\mid\mathbf{Y},\mathbf{X})&=\sum_{i=1}^{n}\log\left(\sum_{r=1}^{k}\pi_{r}\prod_{j\in\Omega_{i}}\exp\biggl\{\frac{y_{ij}\varphi_{ijr}-b_{j}(\varphi_{ijr})}{a_{j}(\phi_{jr})}+c_{j}(y_{ij},\phi_{jr})\biggr\}\right).\end{split} (3)

The missing values in $\mathbf{Y}$ simply do not contribute to the likelihood, which follows the same spirit as in matrix completion (Candès and Recht, 2009). The proposed model indeed possesses a genuine multivariate flavor, as the different outcomes share the same underlying latent cluster pattern of the heterogeneous population. We then propose to estimate $\theta$ by the following penalized likelihood criterion:

\hat{\theta}={\arg\min}_{\theta\in\Theta}\ -\ell(\theta\mid\mathbf{Y},\mathbf{X})/n+\mathcal{R}(\boldsymbol{\beta};\lambda), (4)

where $\mathcal{R}(\boldsymbol{\beta};\lambda)$ is a penalty term on the regression coefficients with $\lambda$ being a tuning parameter.

We thus name our proposed method HEterogeneous-target Robust MIxTure regression (Hermit). The penalty $\mathcal{R}(\boldsymbol{\beta};\lambda)$ can be flexibly chosen based on the specific needs of feature selection. The first sparse penalty adopted by our model is the entry-wise $\ell_1$ norm (lasso-type) penalty,

\mathcal{R}(\boldsymbol{\beta};\lambda,\pi)=\lambda\sum_{r=1}^{k}\pi_{r}^{\gamma}\|\boldsymbol{\beta}_{r}\|_{1}, (5)

where $\lambda$ is the tuning parameter, $\|\cdot\|_{1}$ is the entry-wise $\ell_1$ norm, and the $\pi_{r}^{\gamma}$s $(r=1,\ldots,k)$ are penalty weights with $\gamma\in\{0,1/2,1\}$ being a pre-specified constant. Here the penalty also depends on the unknown mixture proportions $\pi$; when the cluster sizes are expected to be imbalanced, using this weighted penalization with some $\gamma>0$ is preferred (Städler et al, 2010). This entry-wise regularization approach allows the tasks to have independent sets of relevant features. Alternatively, in order to enhance the integrative learning and potentially boost the clustering performance, it could be beneficial to encourage internal similarity within each sub-population. Certain group-wise regularizations of the features, widely adopted in multi-task learning, could then be considered. In particular, we consider a component-specific group sparsity pattern to achieve shared feature selection among different tasks, in which the group $\ell_1$ norm penalty is used (Gong et al, 2012a; Jalali et al, 2010),

\mathcal{R}(\boldsymbol{\beta};\lambda,\pi)=\lambda\sum_{r=1}^{k}\pi_{r}^{\gamma}\|\boldsymbol{\beta}_{r}\|_{1,2}, (6)

where $\|\cdot\|_{1,2}$ denotes the sum of the row $\ell_2$ norms of the enclosed matrix, and the weights are constructed as in (5). The shared feature set in each sub-population can be used to characterize that sub-population and render the whole model more interpretable.

3.2 Optimization

We propose a generalized EM (GEM) algorithm (Dempster et al, 1977) to solve the minimization problem in (4). For each $i=1,\ldots,n$, let $(\delta_{i,1},\ldots,\delta_{i,k})$ be a set of latent indicator variables, where $\delta_{i,r}=1$ if the $i$th sample $(\mathbf{y}_{i},\mathbf{x}_{i})$ belongs to the $r$th component of the mixture model (1) and $\delta_{i,r}=0$ otherwise, so that $\sum_{r=1}^{k}\delta_{i,r}=1$ for all $i$. These indicators are not observed since the cluster labels of the samples are unknown. Let $\delta$ denote the collection of all the indicator variables. By treating $\delta$ as missing, the EM algorithm proceeds by iteratively optimizing the conditional expectation of the complete log-likelihood criterion.

The complete log-likelihood is given by

\ell_{c}(\theta\mid\mathbf{Y},\mathbf{X},\delta)=\sum_{r=1}^{k}\biggl\{\sum_{i=1}^{n}\sum_{j\in\Omega_{i}}\delta_{i,r}\left(\frac{y_{ij}\varphi_{ijr}-b_{j}(\varphi_{ijr})}{a_{j}(\phi_{jr})}+c_{j}(y_{ij},\phi_{jr})\right)+\sum_{i=1}^{n}\delta_{i,r}\log(\pi_{r})\biggr\},

where $\varphi_{ijr}=\mathbf{x}_{i}\boldsymbol{\beta}_{jr}$, for $i=1,\ldots,n$, $j=1,\ldots,m$, and $r=1,\ldots,k$. The conditional expectation of the penalized complete negative log-likelihood is then given by

Q_{pen}(\theta\mid\theta^{\prime})=-\mathbb{E}[\ell_{c}(\theta\mid\mathbf{Y},\mathbf{X},\delta)\mid\mathbf{Y},\mathbf{X},\theta^{\prime}]/n+\mathcal{R}(\boldsymbol{\beta};\lambda,\pi),

where $\mathcal{R}(\boldsymbol{\beta};\lambda,\pi)$ can be either of the penalties in (5) or (6). It is easy to show that deriving $Q_{pen}(\theta\mid\theta^{\prime})$ boils down to the computation of $\mathbb{E}[\delta_{i,r}\mid\mathbf{Y},\mathbf{X},\theta^{\prime}]$, which admits an explicit form.

The EM algorithm proceeds as follows. Let $\theta^{(0)}$ be some given initial values. We repeat the following steps for $t=0,1,2,\ldots$, until convergence of the parameters or until the pre-specified maximum number of iterations $T_{out}$ is reached.

E-Step: Compute $\hat{\rho}_{i,r}^{(t+1)}=\mathbb{E}[\delta_{i,r}\mid\mathbf{Y},\mathbf{X},\theta^{(t)}]$. With $\varphi_{ijr}^{(t)}=\mathbf{x}_{i}\boldsymbol{\beta}_{jr}^{(t)}$,

\hat{\rho}_{i,r}^{(t+1)}=\frac{\pi_{r}^{(t)}\prod_{j\in\Omega_{i}}\exp\{(y_{ij}\varphi_{ijr}^{(t)}-b_{j}(\varphi_{ijr}^{(t)}))/a_{j}(\phi_{jr}^{(t)})+c_{j}(y_{ij},\phi_{jr}^{(t)})\}}{\sum_{r^{\prime}=1}^{k}\pi_{r^{\prime}}^{(t)}\prod_{j\in\Omega_{i}}\exp\{(y_{ij}\varphi_{ijr^{\prime}}^{(t)}-b_{j}(\varphi_{ijr^{\prime}}^{(t)}))/a_{j}(\phi_{jr^{\prime}}^{(t)})+c_{j}(y_{ij},\phi_{jr^{\prime}}^{(t)})\}}. (7)
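For illustration only, a minimal NumPy sketch of this E-step is given below; the array layout (beta of shape (d, m, k)) and the helper log_density, assumed to return the entry-wise log-density $\log f(y_{ij}\mid\varphi_{ijr},\phi_{jr})$, are our own conventions rather than part of the paper.

import numpy as np

def e_step(X, Y, observed, beta, log_pi, log_density):
    """Responsibilities rho[i, r] = E[delta_{i,r} | Y, X, theta], cf. (7)."""
    n = X.shape[0]
    k = beta.shape[2]
    log_resp = np.tile(log_pi, (n, 1)).astype(float)       # start from log pi_r
    for r in range(k):
        nat = X @ beta[:, :, r]                             # natural parameters, shape (n, m)
        ll = np.where(observed, log_density(Y, nat), 0.0)   # only observed entries contribute
        log_resp[:, r] += ll.sum(axis=1)
    log_resp -= log_resp.max(axis=1, keepdims=True)         # log-sum-exp trick for stability
    resp = np.exp(log_resp)
    return resp / resp.sum(axis=1, keepdims=True)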

M-Step: Minimize $Q_{pen}(\theta\mid\theta^{(t)})$.

a) Update $\pi=(\pi_{1},\ldots,\pi_{k})$ by solving

\pi^{(t+1)}=\arg\min_{\pi}\ -\frac{1}{n}\sum_{r=1}^{k}\sum_{i=1}^{n}\hat{\rho}^{(t+1)}_{i,r}\log(\pi_{r})+\mathcal{R}(\boldsymbol{\beta}^{(t)};\lambda,\pi)\quad\mbox{s.t.}\ \sum_{r=1}^{k}\pi_{r}=1,\ \pi_{r}>0,\ \forall r.

b) Update $\boldsymbol{\beta},\boldsymbol{\Phi}$ by solving

(\boldsymbol{\beta}^{(t+1)},\boldsymbol{\Phi}^{(t+1)})=\arg\min_{\boldsymbol{\beta},\boldsymbol{\Phi}}\ -\frac{1}{n}\sum_{r=1}^{k}\sum_{i=1}^{n}\hat{\rho}^{(t+1)}_{i,r}\sum_{j\in\Omega_{i}}\biggl(\frac{y_{ij}\mathbf{x}_{i}\boldsymbol{\beta}_{jr}-b_{j}(\mathbf{x}_{i}\boldsymbol{\beta}_{jr})}{a_{j}(\phi_{jr})}+c_{j}(y_{ij},\phi_{jr})\biggr)+\mathcal{R}(\boldsymbol{\beta};\lambda,\pi^{(t+1)}).

For the problem in a), Städler et al (2010) proposed a procedure to lower the objective function via a feasible point, and we find that simply setting $\pi_{r}^{(t+1)}=\sum_{i=1}^{n}\hat{\rho}_{i,r}/n$ works well. For the problem in b), we use an accelerated proximal gradient (APG) method introduced in Nesterov et al (2007) with a maximum number of iterations $T_{in}$. The proximal update steps correspond to the chosen penalty form. For the entry-wise $\ell_1$ norm penalty in (5),

\widehat{\boldsymbol{\beta}}_{r}^{(t+1)}=\mbox{sign}(\widetilde{\boldsymbol{\beta}}_{r}^{(t)})\circ\max\{0,|\widetilde{\boldsymbol{\beta}}_{r}^{(t)}|-\tau\lambda(\pi_{r}^{(t+1)})^{\gamma}\}, (8)

where $\circ$ denotes the entry-wise product, $\widetilde{\boldsymbol{\beta}}_{r}^{(t)}=\boldsymbol{\beta}_{r}^{(t)}+\tau\triangle\boldsymbol{\beta}_{r}^{(t)}$, $\tau$ denotes the step size, and $\triangle\boldsymbol{\beta}_{r}^{(t)}$ denotes the update direction of $\boldsymbol{\beta}_{r}^{(t)}$ determined by APG. For the group $\ell_1$ norm penalty in (6),

\widehat{\boldsymbol{\beta}}_{r,j}^{(t+1)}=\widetilde{\boldsymbol{\beta}}_{r,j}^{(t)}\circ\max\{0,1-\tau\lambda(\pi_{r}^{(t+1)})^{\gamma}/\|\widetilde{\boldsymbol{\beta}}_{r,j}^{(t)}\|_{2}\},

where $\boldsymbol{\beta}_{r,j}$ denotes the $j$th row of $\boldsymbol{\beta}_{r}$, i.e., the coefficients of the $j$th feature across the $m$ tasks, consistent with the row-wise group norm in (6). We adopt the active-set algorithm in Städler et al (2010) to speed up the computation.
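These two proximal updates can be sketched as follows in NumPy (B_tilde stands for the gradient-updated matrix $\widetilde{\boldsymbol{\beta}}_{r}^{(t)}$ and thresh for $\tau\lambda(\pi_{r}^{(t+1)})^{\gamma}$; this is an illustration under our own naming, not the authors' implementation).

import numpy as np

def prox_l1(B_tilde, thresh):
    """Entry-wise soft-thresholding, cf. the update in (8)."""
    return np.sign(B_tilde) * np.maximum(0.0, np.abs(B_tilde) - thresh)

def prox_group_rows(B_tilde, thresh):
    """Row-wise group soft-thresholding for the group l1 penalty in (6)."""
    row_norms = np.linalg.norm(B_tilde, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - thresh / np.maximum(row_norms, 1e-12))
    return B_tilde * scale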

The time complexity of our algorithm with the speed-up technique is $O(T_{out}kmnsT_{in})$, where $s$ is the number of non-zero parameters. The algorithm performs well in practice, and we have not observed any convergence issues in our extensive numerical studies.

3.3 Clustering of Samples & Imputation of Missing Targets

From the model estimation, we can obtain estimates of both the conditional probabilities $p(\delta_{i,r}=1\mid y_{ij^{\prime}},j^{\prime}\in\Omega_{i},\mathbf{x}_{i},\theta)$ and the conditional means $\mathbb{E}[Y_{ij}\mid\mathbf{x}_{i},\theta,\delta_{i,r}=1]$, where $\mathbb{E}[Y_{ij}\mid\mathbf{x}_{i},\theta,\delta_{i,r}=1]=\mu_{j}(\varphi_{ijr})=b^{\prime}_{j}(\mathbf{x}_{i}\boldsymbol{\beta}_{jr})$. Specifically, the conditional probabilities can be estimated by $p(\delta_{i,r}=1\mid y_{ij^{\prime}},j^{\prime}\in\Omega_{i},\mathbf{x}_{i},\hat{\theta})$, which corresponds to (7) with $t=T_{out}$. The conditional expectations can be estimated as $\mathbb{E}[Y_{ij}\mid\mathbf{x}_{i},\hat{\theta},\delta_{i,r}=1]=b^{\prime}_{j}(\mathbf{x}_{i}\hat{\boldsymbol{\beta}}_{jr})$.

For clustering the samples, we adopt the Bayes rule, i.e., for $i=1,\ldots,n$,

\hat{r}_{i}=\operatorname*{argmax}_{r}\ p(\delta_{i,r}=1\mid y_{ij^{\prime}},j^{\prime}\in\Omega_{i},\mathbf{x}_{i},\hat{\theta}). (9)

Following the idea of Jacobs et al (1991), we propose to impute the missing outcomes by

\hat{y}_{ij}=\sum_{r=1}^{k}p(\delta_{i,r}=1\mid y_{ij^{\prime}},j^{\prime}\in\Omega_{i},\mathbf{x}_{i},\hat{\theta})\,\mathbb{E}[Y_{ij}\mid\mathbf{x}_{i},\hat{\theta},\delta_{i,r}=1],\ \mbox{for}\ j\notin\Omega_{i}. (10)
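A small sketch of (9) and (10), given the final E-step responsibilities; the helper cond_mean, standing for the inverse link $b^{\prime}_{j}(\mathbf{x}_{i}\hat{\boldsymbol{\beta}}_{jr})$, and the array layout are our own assumptions for illustration.

import numpy as np

def cluster_and_impute(X, Y, observed, beta_hat, resp, cond_mean):
    """resp[i, r]: posterior probabilities from (7) at the final EM iteration."""
    k = resp.shape[1]
    labels = resp.argmax(axis=1)                    # Bayes rule, cf. (9)
    # posterior-weighted mixture of component conditional means, cf. (10)
    mix_mean = sum(resp[:, r:r+1] * cond_mean(X @ beta_hat[:, :, r]) for r in range(k))
    Y_imputed = np.where(observed, Y, mix_mean)     # impute only the missing entries
    return labels, Y_imputed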

3.4 Tuning Hyper-Parameters

Unless otherwise specified, all the hyper-parameters, including the regularization coefficients $\lambda$ and the number of clusters $k$, are tuned to maximize the data log-likelihood in (3) on a held-out validation data set. In other words, we fit models on the training data under different hyper-parameter settings, and the optimal model is chosen as the one that gives the highest log-likelihood (3) on the held-out validation data. This approach is fairly standard and has been widely used in existing works (Städler et al, 2010). Moreover, cross validation and various information criteria (Bhat and Kumar, 2010; Aho et al, 2014) can also be applied to determine the hyper-parameters.
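A simple grid-search sketch of this procedure; fit_hermit and validation_loglik are hypothetical wrappers around the estimator (4) and the log-likelihood (3), respectively.

from itertools import product

def tune(X_tr, Y_tr, X_val, Y_val, lambdas, ks, fit_hermit, validation_loglik):
    """Pick (lambda, k) maximizing the held-out log-likelihood (3)."""
    best, best_ll = None, -float("inf")
    for lam, k in product(lambdas, ks):
        model = fit_hermit(X_tr, Y_tr, lam=lam, k=k)      # penalized estimator (4)
        ll = validation_loglik(model, X_val, Y_val)       # data log-likelihood (3)
        if ll > best_ll:
            best, best_ll = (lam, k, model), ll
    return best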

4 Extensions

We provide several extensions of the proposed Hermit  approach described in Section 3, including robust estimation against outlier samples, handling anomaly tasks or clustered structure among tasks, and modeling mixture probabilities for feature-based prediction.

4.1 Robust Estimation

To perform robust estimation for parameters in the presence of outlier samples, we propose to adopt the mean shift penalization approach (She and Owen, 2011). Specifically, we extend the natural parameter model to the following additive form,

\varphi_{ijr}=\mathbf{x}_{i}\boldsymbol{\beta}_{jr}+\zeta_{ijr}, (11)

where $\zeta_{ijr}$ is a case-specific mean-shift parameter that captures the potential deviation from the linear model. Clearly, when $\zeta_{ijr}$ is allowed to vary without any constraint, it can make the model fit every observed $y_{ij}$ arbitrarily well. The merit of this approach is realized by assuming a certain sparsity structure on the $\zeta_{ijr}$s, so that only a few of them, corresponding to anomalies, have nonzero values. Write $\boldsymbol{\zeta}_{r}=[\boldsymbol{\zeta}_{1r},\boldsymbol{\zeta}_{2r},\ldots,\boldsymbol{\zeta}_{mr}]\in\mathbb{R}^{n\times m}$ for $r=1,\ldots,k$, and $\boldsymbol{\zeta}=[\boldsymbol{\zeta}_{1},\ldots,\boldsymbol{\zeta}_{k}]\in\mathbb{R}^{(n\times m)\times k}$. We can then conduct joint model estimation and outlier detection by extending (4) to

(\hat{\theta},\hat{\boldsymbol{\zeta}})={\arg\min}_{\theta\in\Theta,\boldsymbol{\zeta}}\ -\ell(\theta\mid\mathbf{Y},\mathbf{X})/n+\mathcal{R}(\boldsymbol{\beta};\lambda_{1})+\mathcal{R}(\boldsymbol{\zeta};\lambda_{2}), (12)

where, for example, the penalty on $\boldsymbol{\zeta}$ can be chosen as the group $\ell_1$ penalty,

\mathcal{R}(\boldsymbol{\zeta};\lambda_{2})=\lambda_{2}\sum_{i=1}^{n}\sqrt{\sum_{j,r}\zeta_{ijr}^{2}}, (13)

so that the entries of $\boldsymbol{\zeta}$ are nonzero for only a few data samples.

The proposed GEM algorithm can be readily extended to handle the inclusion of 𝜻\zeta, for which we omit the details.

4.2 Handling Anomaly Tasks

Besides outlier samples, certain tasks, referred to as anomaly tasks, may not follow the assumed shared structure and can thus ruin the overall model performance. Although it is intuitive to handle anomaly tasks with the mean-shift approach above, our numerical study suggests that its performance is sensitive to the tuning parameters. Here, we instead adopt the idea of Koller and Sahami (1996), utilizing the estimated conditional probabilities to measure how well a task is concordant with the estimated mixture structure. Consider the $h$th task. The main idea is to measure the discrepancy between $p(\delta_{ir}=1\mid y_{ij},j\in\Omega_{i},\mathbf{x}_{i},\theta)$, the conditional probability based on data from all observed targets, and $p(\delta_{ir}=1\mid y_{ih},\mathbf{x}_{i},\theta)$, the conditional probability based on only the $h$th task. If the $h$th task is an anomaly task, the two conditional probabilities are expected to differ more from each other (Koller and Sahami, 1996; Law et al, 2002).

For $r=1,\ldots,k$ and $i=1,\ldots,n$, let

P_{\Omega,ir}=p(\delta_{ir}=1\mid y_{ij},j\in\Omega_{i},\mathbf{x}_{i},\hat{\theta})=\frac{\hat{\pi}_{r}\prod_{j\in\Omega_{i}}f(y_{ij}\mid\hat{\varphi}_{ijr},\hat{\phi}_{jr})}{\sum_{r^{\prime}=1}^{k}\hat{\pi}_{r^{\prime}}\prod_{j\in\Omega_{i}}f(y_{ij}\mid\hat{\varphi}_{ijr^{\prime}},\hat{\phi}_{jr^{\prime}})}, (14)
P_{h,ir}=p(\delta_{ir}=1\mid y_{ih},\mathbf{x}_{i},\hat{\theta})=\frac{\hat{\pi}_{r}f(y_{ih}\mid\hat{\varphi}_{ihr},\hat{\phi}_{hr})}{\sum_{r^{\prime}=1}^{k}\hat{\pi}_{r^{\prime}}f(y_{ih}\mid\hat{\varphi}_{ihr^{\prime}},\hat{\phi}_{hr^{\prime}})}.

Define $\mathbf{P}_{\Omega}=[P_{\Omega,ir}]_{n\times k}$ and $\mathbf{P}_{h}=[P_{h,ir}]_{n\times k}$. We then define the concordant score of the $h$th task as

score(h)=-(D_{KL}(\mathbf{P}_{\Omega}\parallel\mathbf{P}_{h})+D_{KL}(\mathbf{P}_{h}\parallel\mathbf{P}_{\Omega}))/(2n),\ h=1,\ldots,m, (15)

where $D_{KL}$ is the widely used Kullback-Leibler divergence (Cover and Thomas, 2012).

The tasks can then be ranked by their concordant scores. As such, detecting anomaly tasks boils down to a one-dimensional outlier detection problem. After the anomaly tasks are detected, their own FMR models can be built separately.
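For illustration, a sketch of the symmetrized score (15) given the two posterior matrices from (14); interpreting $D_{KL}$ here as the row-wise KL divergence summed over samples is our own reading of the formula.

import numpy as np

def concordant_score(P_omega, P_h, eps=1e-12):
    """Negative symmetric KL between all-target and single-task posteriors, cf. (15)."""
    P = np.clip(P_omega, eps, 1.0)
    Q = np.clip(P_h, eps, 1.0)
    kl_pq = np.sum(P * np.log(P / Q))   # D_KL(P_Omega || P_h), summed over samples
    kl_qp = np.sum(Q * np.log(Q / P))   # D_KL(P_h || P_Omega)
    n = P.shape[0]
    return -(kl_pq + kl_qp) / (2.0 * n)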

4.3 Handling Clustered Structure Among Tasks

In practice, tasks may be clustered into groups such that each task group has its own model structure. Here we assume that each cluster of tasks shares an FMR structure defined in (1), and propose to construct a similarity matrix to discover the potential cluster pattern among tasks.

We consider a two-stage strategy. First, an FMR model is learned for each task independently on the training data, with the same pre-fixed $k$. This yields $\mathbf{P}_{h}=[P_{h,ir}]_{n\times k}$ for all $h=1,\ldots,m$, where

P_{h,ir}=p(\delta_{h,ir}=1\mid y_{ih},\mathbf{x}_{i},\hat{\theta}_{h})=\frac{\hat{\pi}_{hr}f(y_{ih}\mid\hat{\varphi}_{ihr},\hat{\phi}_{hr})}{\sum_{r^{\prime}=1}^{k}\hat{\pi}_{hr^{\prime}}f(y_{ih}\mid\hat{\varphi}_{ihr^{\prime}},\hat{\phi}_{hr^{\prime}})},

and $\delta_{h,ir}\ (i=1,\dots,n,\ r=1,\dots,k)$ and $\hat{\pi}_{hr}\ (r=1,\dots,k)$ are the latent variables and the estimated prior probabilities of the $h$th task, respectively.

Second, we adopt Normalized Mutual Information (NMI) (Strehl and Ghosh, 2002a) to measure the similarity between each pair of tasks. We choose NMI instead of the Kullback-Leibler divergence because NMI can handle the case where the cluster orderings of two $k$-cluster structures differ. Specifically, given two methods for estimating the latent variables, denoted by method $u$ and method $v$, let $\mathbf{P}_{u}=[P_{u,ir}]_{n\times k}$ and $\mathbf{P}_{v}=[P_{v,ir}]_{n\times k}$ denote the estimated probabilities of the latent variables under methods $u$ and $v$, respectively, where $P_{u,ir}=p(\delta_{u,ir}=1)$ and $P_{v,ir}=p(\delta_{v,ir}=1)$ for $i=1,\ldots,n$, $r=1,\ldots,k$. NMI is defined as

NMI(\mathbf{P}_{u},\mathbf{P}_{v})=\frac{I(\mathbf{P}_{u},\mathbf{P}_{v})}{\sqrt{I(\mathbf{P}_{u},\mathbf{P}_{u})I(\mathbf{P}_{v},\mathbf{P}_{v})}},\ u=1,\ldots,m,\ v=1,\ldots,m, (16)

where $I(\mathbf{P}_{u},\mathbf{P}_{v})$ denotes the mutual information between $\mathbf{P}_{u}$ and $\mathbf{P}_{v}$ such that

I(\mathbf{P}_{u},\mathbf{P}_{v})=\sum_{a=1}^{k}\sum_{b=1}^{k}p(\delta_{u,a}=1,\delta_{v,b}=1)\log\left(\frac{p(\delta_{u,a}=1,\delta_{v,b}=1)}{p(\delta_{u,a}=1)p(\delta_{v,b}=1)}\right).

Following Strehl and Ghosh (2002a), we approximate $p(\delta_{u,a}=1,\delta_{v,b}=1)$, $p(\delta_{u,a}=1)$ and $p(\delta_{v,b}=1)$ by

\begin{split}p(\delta_{u,a}=1)&\approx\frac{1}{n}\sum_{i=1}^{n}p(\delta_{u,ia}=1)=\frac{1}{n}\sum_{i=1}^{n}P_{u,ia},\\ p(\delta_{v,b}=1)&\approx\frac{1}{n}\sum_{i=1}^{n}p(\delta_{v,ib}=1)=\frac{1}{n}\sum_{i=1}^{n}P_{v,ib},\\ p(\delta_{u,a}=1,\delta_{v,b}=1)&\approx\frac{1}{n}\sum_{i=1}^{n}p(\delta_{u,ia}=1,\delta_{v,ib}=1)\approx\frac{1}{n}\sum_{i=1}^{n}p(\delta_{u,ia}=1)p(\delta_{v,ib}=1)=\frac{1}{n}\sum_{i=1}^{n}P_{u,ia}P_{v,ib}.\end{split}

As such, given the estimated models $\hat{\theta}_{u},\hat{\theta}_{v}$ for the $u$th and $v$th tasks, respectively, we treat $p(\delta_{u,ir}=1\mid y_{iu},\mathbf{x}_{i},\hat{\theta}_{u})$ and $p(\delta_{v,ir}=1\mid y_{iv},\mathbf{x}_{i},\hat{\theta}_{v})$ as $p(\delta_{u,ir}=1)$ and $p(\delta_{v,ir}=1)$, respectively, for $i=1,\ldots,n$, $r=1,\ldots,k$. The NMI between each pair of tasks is then computed by (16). We note that for simplicity we set the pre-fixed $k$ to be the same across tasks, but in general $k$ can differ across tasks by the definition of mutual information.
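A sketch of this pairwise task-similarity computation via (16), using the plug-in approximations above (the array conventions are our own):

import numpy as np

def mutual_information(P_u, P_v):
    """Plug-in mutual information between two soft k-cluster assignments."""
    n = P_u.shape[0]
    joint = (P_u.T @ P_v) / n              # approx. p(delta_u,a=1, delta_v,b=1), shape (k, k)
    pa = P_u.mean(axis=0)                  # approx. p(delta_u,a=1)
    pb = P_v.mean(axis=0)                  # approx. p(delta_v,b=1)
    ratio = joint / np.maximum(np.outer(pa, pb), 1e-12)
    return np.sum(joint * np.log(np.maximum(ratio, 1e-12)))

def nmi(P_u, P_v):
    """Normalized mutual information between two tasks, cf. (16)."""
    return mutual_information(P_u, P_v) / np.sqrt(
        mutual_information(P_u, P_u) * mutual_information(P_v, P_v))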

Given the similarity between each pair of tasks, any similarity-based clustering method can be applied to cluster the $m$ tasks into groups. Empirically, the performance of task clustering is not sensitive to the pre-fixed $k$; we therefore set the pre-fixed $k$ to $20$ in this paper. We then apply the proposed Hermit approach separately to each task group.

4.4 Modeling Mixture Probabilities

In real applications, one may need to use $\mathbf{x}_{i}$ alone to infer the latent variables $\delta_{i,r}$ and then to predict $\mathbf{y}_{i}$, for $i=1,\ldots,n$, $r=1,\ldots,k$. Here we further extend our method following the idea of the Mixture-Of-Experts (MOE) model (Yuksel et al, 2012); the only modification is that $\pi_{r}$ in (1) is assumed to be a function of $\mathbf{x}_{i}$, for $i=1,\ldots,n$.

To be specific, let $\boldsymbol{\alpha}=[\boldsymbol{\alpha}_{1},\ldots,\boldsymbol{\alpha}_{k}]\in\mathbb{R}^{d\times k}$ collect the regression coefficient vectors of a multinomial linear model. We assume that given $\mathbf{x}_{i}$, the joint probability density function of $\tilde{\mathbf{y}}_{i}=\{y_{ij};j\in\Omega_{i}\}$ in (1) is replaced by

\begin{split}f(\tilde{\mathbf{y}}_{i}\mid\mathbf{x}_{i},\theta,\boldsymbol{\alpha})&=f(y_{ij},j\in\Omega_{i}\mid\mathbf{x}_{i},\theta,\boldsymbol{\alpha})\\ &=\sum_{r=1}^{k}p(\delta_{i,r}=1\mid\mathbf{x}_{i},\alpha_{r})\prod_{j\in\Omega_{i}}f(y_{ij}\mid\varphi_{ijr},\phi_{jr}),\end{split}

where

p(\delta_{i,r}=1\mid\mathbf{x}_{i},\alpha_{r})=\frac{\exp(\mathbf{x}_{i}\boldsymbol{\alpha}_{r})}{\sum_{r^{\prime}=1}^{k}\exp(\mathbf{x}_{i}\boldsymbol{\alpha}_{r^{\prime}})} (17)

is referred to as the gating probability. All the other terms are defined the same as in (1).

Let $\theta_{2}=\{\boldsymbol{\beta},\phi_{1},\ldots,\phi_{k}\}$, with the parameter space $\Theta_{2}=\mathbb{R}^{(d\times m)\times k}\times\mathbb{R}^{m\times k}_{>0}$. The data log-likelihood of the MOE model is

\begin{split}\ell(\theta_{2},\boldsymbol{\alpha}\mid\mathbf{Y},\mathbf{X})=\sum_{i=1}^{n}\log\biggl(&\sum_{r=1}^{k}p(\delta_{i,r}=1\mid\mathbf{x}_{i},\alpha_{r})\\ &\prod_{j\in\Omega_{i}}\exp\biggl\{\frac{y_{ij}\varphi_{ijr}-b_{j}(\varphi_{ijr})}{a_{j}(\phi_{jr})}+c_{j}(y_{ij},\phi_{jr})\biggr\}\biggr).\end{split} (18)

The model estimation is conducted by extending (4) to

(\hat{\theta}_{2},\hat{\boldsymbol{\alpha}})={\arg\min}_{\theta_{2}\in\Theta_{2},\boldsymbol{\alpha}}\ -\ell(\theta_{2},\boldsymbol{\alpha}\mid\mathbf{Y},\mathbf{X})/n+\mathcal{R}(\boldsymbol{\beta};\lambda_{1})+\mathcal{R}(\boldsymbol{\alpha};\lambda_{2}), (19)

where, for example, the penalty on $\boldsymbol{\alpha}$ can be chosen as the lasso-type penalty,

\mathcal{R}(\boldsymbol{\alpha};\lambda_{2})=\lambda_{2}\|\boldsymbol{\alpha}\|_{1}. (20)

The minimization problem in (19) is also solved by GEM, with an optimization procedure similar to that in Section 3.2. The differences occur in the E-step:

\hat{\rho}_{i,r}^{(t+1)}=\frac{p(\delta_{i,r}=1\mid\mathbf{x}_{i},\alpha_{r}^{(t)})\prod_{j\in\Omega_{i}}f(y_{ij}\mid\varphi_{ijr}^{(t)},\phi_{jr}^{(t)})}{\sum_{r^{\prime}=1}^{k}p(\delta_{i,r^{\prime}}=1\mid\mathbf{x}_{i},\alpha_{r^{\prime}}^{(t)})\prod_{j\in\Omega_{i}}f(y_{ij}\mid\varphi_{ijr^{\prime}}^{(t)},\phi_{jr^{\prime}}^{(t)})}, (21)

where $f(y_{ij}\mid\varphi_{ijr}^{(t)},\phi_{jr}^{(t)})=\exp\{(y_{ij}\varphi_{ijr}^{(t)}-b_{j}(\varphi_{ijr}^{(t)}))/a_{j}(\phi_{jr}^{(t)})+c_{j}(y_{ij},\phi_{jr}^{(t)})\}$; in the optimization for $\boldsymbol{\alpha}$:

\boldsymbol{\alpha}^{(t+1)}=\arg\min_{\boldsymbol{\alpha}}\ -\frac{1}{n}\sum_{r=1}^{k}\sum_{i=1}^{n}\hat{\rho}^{(t+1)}_{i,r}\log p(\delta_{i,r}=1\mid\mathbf{x}_{i},\alpha_{r})+\mathcal{R}(\boldsymbol{\alpha};\lambda_{2}), (22)

and in the computation of $\pi$: $\pi_{r}^{(t+1)}=\frac{1}{n}\sum_{i=1}^{n}p(\delta_{i,r}=1\mid\mathbf{x}_{i},\alpha_{r}^{(t+1)})$ for $r=1,\ldots,k$.
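A sketch of the gating probability (17) and the resulting feature-only prediction (softmax over linear scores; cond_mean is the same hypothetical inverse-link helper as in the earlier sketches):

import numpy as np

def gating_probs(X, alpha):
    """Softmax gating p(delta_{i,r}=1 | x_i, alpha_r), cf. (17); alpha has shape (d, k)."""
    scores = X @ alpha
    scores -= scores.max(axis=1, keepdims=True)     # for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum(axis=1, keepdims=True)

def predict_from_features(X, alpha, beta_hat, cond_mean):
    """Feature-only prediction: gate-weighted mixture of component conditional means."""
    gate = gating_probs(X, alpha)                   # shape (n, k)
    k = gate.shape[1]
    return sum(gate[:, r:r+1] * cond_mean(X @ beta_hat[:, :, r]) for r in range(k))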

5 Theoretical Analysis

We study the estimation and variable selection performance of Hermit under the high-dimensional framework with $d\gg n$. Both $m$ and $k$, on the other hand, are considered fixed, because the number of targets of interest and the number of desired clusters are usually not large in real problems. Here we only present the setup and the main results, namely non-asymptotic oracle inequalities bounding the excess risk and false selection, leaving the detailed derivations to the Appendix. Our results generalize Städler et al (2010) to cover mixture regression models with 1) multivariate, heterogeneous (mixed-type) and incomplete responses and 2) shared feature-grouping sparse structure. This is nontrivial due to the non-convexity and the triple heterogeneity of the problem. It turns out that an additional condition on the tail behavior of the conditional density $f(\mathbf{y}\mid\mathbf{x},\theta)$ is required. Fortunately, the required conditions are still satisfied by a broad range of distributions.

5.1 Notations and Conditions on the Conditional Density

We first introduce some notation. Denote the regression parameters that are subject to regularization by $\beta=\mbox{vec}(\boldsymbol{\beta}_{1},\ldots,\boldsymbol{\beta}_{k})$, $\phi=\mbox{vec}(\Phi_{1},\ldots,\Phi_{k})$, where $\mbox{vec}(\cdot)$ is the vectorization operator. The other parameters in the mixture model are denoted by $\eta=\mbox{vec}(\log(\phi),\log(\pi))$, where $\log(\cdot)$ is applied entry-wise. Denote by $\theta_{0}=(\boldsymbol{\beta}_{0},\Phi_{0,1},\ldots,\Phi_{0,k},\pi_{0,1},\ldots,\pi_{0,k-1})$ the true parameter to be estimated under the FMR model defined in (1) and (2). In the sequel, we always use the subscript "0" to represent parameters or structures under the true model. To study sparsity recovery, denote by $S$ the set of indices of non-zero entries of the true parameter. We use $\lesssim$ to indicate that an inequality holds up to some multiplicative numerical constant. To focus on the main idea, we consider the case $\gamma=0$ in the following analysis.

We define the average excess risk for fixed design points $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$, based on the Kullback-Leibler divergence, as

\bar{\varepsilon}(\theta\mid\theta_{0})=\frac{1}{n}\sum_{i=1}^{n}\varepsilon(\theta\mid\mathbf{x}_{i},\theta_{0}),\quad \varepsilon(\theta\mid\mathbf{x}_{i},\theta_{0})=-\int\log\left(\frac{f(\tilde{\mathbf{y}}_{i}\mid\mathbf{x}_{i},\theta)}{f(\tilde{\mathbf{y}}_{i}\mid\mathbf{x}_{i},\theta_{0})}\right)f(\tilde{\mathbf{y}}_{i}\mid\mathbf{x}_{i},\theta_{0})\,d\tilde{\mathbf{y}}_{i},

where $f(\tilde{\mathbf{y}}_{i}\mid\mathbf{x}_{i},\theta)$ is defined in (1).

To impose the conditions on $f(\tilde{\mathbf{y}}_{i}\mid\mathbf{x}_{i},\theta)$, denote $\psi_{i}=\mbox{vec}(\varphi_{i},\eta)$, where $\varphi_{i}=\mbox{vec}(\{\varphi_{ijr};j\in\Omega_{i},r=1,\ldots,k\})$, and denote $\psi=\mbox{vec}(\psi_{1},\ldots,\psi_{n})$. As such, we may write $f(\tilde{\mathbf{y}}_{i}\mid\mathbf{x}_{i},\theta)=f(\tilde{\mathbf{y}}_{i}\mid\psi_{i})$, $\ell(\theta\mid\tilde{\mathbf{y}}_{i},\mathbf{x}_{i})=\log f(\tilde{\mathbf{y}}_{i}\mid\mathbf{x}_{i},\theta)=\ell(\psi_{i}\mid\tilde{\mathbf{y}}_{i})$, and $\bar{\varepsilon}(\psi\mid\psi_{0})=\frac{1}{n}\sum_{i=1}^{n}\varepsilon(\psi_{i}\mid\psi_{0,i})=\bar{\varepsilon}(\theta\mid\theta_{0})$.

Without much loss of generality, the model parameters are assumed to lie in a bounded parameter space, for a constant $K$:

\begin{split}\tilde{\Theta}\subset\{\theta;&\max_{i=1,\ldots,n}\|\varphi_{i}(\mathbf{x}_{i},\boldsymbol{\beta})\|_{\infty}\leq K,\ \max_{j=1,\ldots,m}|\log a_{j}(\phi)|\leq K,\\ &\max_{j=1,\ldots,m}\log|b^{\prime}_{j}(\phi)|\leq K,\ \|\log\phi\|_{\infty}\leq K,\\ &-K\leq\log\pi_{1}\leq 0,\ldots,-K\leq\log\pi_{k-1}\leq 0,\\ &\sum_{r=1}^{k-1}\pi_{r}<1,\ \pi_{k}=1-\sum_{r=1}^{k-1}\pi_{r}\}.\end{split} (23)

We present the following conditions on $f(\tilde{\mathbf{y}}_{i}\mid\psi_{i})$.

Condition 1

For some function $G_{1}(\cdot)\in\mathbb{R}$ and for $i=1,\ldots,n$,

\sup_{\theta\in\tilde{\Theta}}\left\|\frac{\partial\ell(\psi_{i}\mid\tilde{\mathbf{y}}_{i})}{\partial\psi_{i}}\right\|_{\infty}\leq G_{1}(\tilde{\mathbf{y}}_{i}).
Condition 2

For a constant $c_{1}\geq 0$, some constants $c_{2},c_{3},c_{4},c_{5}\geq 0$ depending on $K$, and for $M>c_{4}$, we assume that for $i=1,\ldots,n$,

\mathbb{E}[|G_{1}(\tilde{\mathbf{y}}_{i})|1\{|G_{1}(\tilde{\mathbf{y}}_{i})|>M\}]\leq\biggl[c_{3}\biggl(\frac{M}{c_{2}}\biggr)^{c^{\prime}}+c_{5}\biggr]\exp\biggl\{-\biggl(\frac{M}{c_{2}}\biggr)^{1/c_{1}}\biggr\},
\mathbb{E}[|G_{1}(\tilde{\mathbf{y}}_{i})|^{2}1\{|G_{1}(\tilde{\mathbf{y}}_{i})|>M\}]\leq\biggl[c_{3}\biggl(\frac{M}{c_{2}}\biggr)^{c^{\prime}}+c_{5}\biggr]^{2}\exp\biggl\{-2\biggl(\frac{M}{c_{2}}\biggr)^{1/c_{1}}\biggr\},

where $\tilde{\mathbf{y}}_{i}=\{y_{ij};j\in\Omega_{i}\}$, $c^{\prime}=2+3/c_{1}$ and $1\{\cdot\}$ denotes the indicator function.

Condition 3

It holds that,

\min_{i=1,\ldots,n}\Lambda_{\min}(I(\psi_{0,i}))>1/c_{0}>0,

where $c_{0}$ is a constant, $\Lambda_{\min}^{2}(A)$ is the smallest eigenvalue of a symmetric, positive semi-definite matrix $A$, and for $i=1,\ldots,n$, $I(\psi_{0,i})$ is the Fisher information matrix such that

I(\psi_{0,i})=-\int\frac{\partial^{2}\ell(\psi_{0,i}\mid\tilde{\mathbf{y}}_{i})}{\partial\psi_{0,i}\partial\psi_{0,i}^{T}}f(\tilde{\mathbf{y}}_{i}\mid\psi_{0,i})\,d\tilde{\mathbf{y}}_{i}.

The first condition follows Städler et al (2010), and aims to bound $\partial\ell(\psi_{i}\mid\tilde{\mathbf{y}}_{i})/\partial\psi_{i}$ for known $\tilde{\mathbf{y}}_{i}$, $i=1,\ldots,n$. The second condition concerns the tail behavior of $f(\tilde{\mathbf{y}}_{i}\mid\mathbf{x}_{i},\theta)$. The third condition depicts the local convexity of $\ell$ at the point $\theta_{0}$. Conditions 1 and 2 cover a broad range of distributions for $f$, including but not limited to mixtures of sub-exponential distributions, such as our proposed Hermit model with known dispersion parameters; cf. Lemma 1.

Lemma 1

Conditions 1 and 2 hold for the heterogeneous mixture distribution $f(\tilde{\mathbf{y}}_{i}\mid\mathbf{x}_{i},\theta)$ defined in (1) with known dispersion parameters.

The following two quantities will be used.

\lambda_{0}=\sqrt{mk}\,M_{n}\log n\sqrt{\frac{\log(d\vee n)}{n}},\quad M_{n}=c_{2}(\log n)^{c_{1}}, (24)

where $c_{1},c_{2}$ are the same constants as in Condition 2. More specifically, we choose $c_{1}=1/2,0,1$ for Gaussian, Bernoulli and Poisson tasks, respectively.

5.2 Results for Lasso-Type Estimator

Consider first the penalized estimator defined in (4) with the $\ell_1$ penalty in (5). Following Bickel et al (2009) and Städler et al (2010), we impose the following restricted eigenvalue condition on the design.

Condition 4

(Restricted eigenvalue condition). For all $w\in\mathbb{R}^{dmk}$ satisfying $\|w_{S^{c}}\|_{1}\leq 6\|w_{S}\|_{1}$, it holds that for some constant $\kappa\geq 1$,

\|w_{S}\|_{2}^{2}\leq\kappa^{2}\|\varphi\|_{Q_{n}}^{2}=\frac{\kappa^{2}}{n}\sum_{i=1}^{n}\sum_{j\in\Omega_{i}}\sum_{r=1}^{k}(\mathbf{x}_{i}\boldsymbol{\beta}_{jr})^{2}.
Theorem 1

Consider the Hermit model in (1) with $\theta_{0}\in\tilde{\Theta}$, and consider the penalized estimator (4) with the $\ell_1$ penalty in (5). Assume Conditions 1-4 hold. Suppose $\sqrt{mk}\lesssim n/M_{n}$, and take $\lambda>2T\lambda_{0}$ for some constant $T>1$. Then for some constant $c>0$ and large enough $n$, with probability

1-c\exp\left(-\frac{\log^{2}n\log(d\vee n)}{c}\right)-\frac{1}{n}, (25)

we have

\bar{\varepsilon}(\hat{\theta}\mid\theta_{0})+2(\lambda-T\lambda_{0})\|\hat{\beta}_{S^{c}}\|_{1}\leq 4(\lambda+T\lambda_{0})^{2}\kappa^{2}c_{0}^{2}s, (26)

where $s$ is the number of non-zero parameters of $w_{0}$.

Theorem 1 suggests that the average excess risk has a convergence rate of the order $O(s\lambda_{0}^{2})=O((\log n)^{2+2c_{1}}\log(d\vee n)smk/n)$, by taking $\lambda=2T\lambda_{0}$ and using $\lambda_{0}$ and $M_{n}$ as defined in (24). Also, the degree of false selection, measured by $\|\hat{\beta}_{S^{c}}\|_{1}$, converges to zero at rate $O(s\lambda_{0})=O(s\sqrt{(\log n)^{2+2c_{1}}\log(d\vee n)mk/n})$.

Similar to Städler et al (2010), under weaker conditions without the restricted eigenvalue assumption on the design, we still achieve consistency of the average excess risk.

Theorem 2

Consider the Hermit model in (1) with $\theta_{0}\in\tilde{\Theta}$, and consider the penalized estimator (4) with the $\ell_1$ penalty in (5). Assume Conditions 1-3 hold. Suppose

\|\beta_{0}\|_{1}=o\left(\sqrt{n/((\log n)^{2+2c_{1}}\log(d\vee n)mk)}\right),\quad \sqrt{mk}=o\left(\sqrt{n/((\log n)^{2+2c_{1}}\log(d\vee n))}\right)

as $n\rightarrow\infty$, and take $\lambda=C\sqrt{(\log n)^{2+2c_{1}}\log(d\vee n)mk/n}$ for some constant $C>0$ sufficiently large. Then for some constant $c>0$ and large enough $n$, with probability $1-c\exp\left(-\frac{(\log n)^{2}\log(d\vee n)}{c}\right)-\frac{1}{n}$, we have $\bar{\varepsilon}(\hat{\theta}\mid\theta_{0})=o_{P}(1)$.

5.3 Results for Group-Lasso Type Estimator

Consider the following general form of the group $\ell_1$ penalty,

\mathcal{R}(\boldsymbol{\beta})=\lambda\sum_{p=1}^{P}\|\boldsymbol{\beta}_{\mathcal{G}_{p}}\|_{F}, (27)

where $\mathcal{G}_{1},\ldots,\mathcal{G}_{P}$ are index collections such that $\mathcal{G}_{p}\bigcap\mathcal{G}_{p^{\prime}}=\emptyset$ for $p\neq p^{\prime}$ and $\bigcup_{p=1}^{P}\mathcal{G}_{p}=\bigcup_{l=1}^{d}\bigcup_{j=1}^{m}\bigcup_{r=1}^{k}(l,j,r)$ equals the universal set of indices of $\boldsymbol{\beta}\in\mathbb{R}^{(d\times m)\times k}$, i.e., $\boldsymbol{\beta}_{\mathcal{G}_{p}}$ is the $p$th group of $\boldsymbol{\beta}$. Here $\|\cdot\|_{F}$ denotes the Frobenius norm, and for $p=1,\ldots,P$, $\|\boldsymbol{\beta}_{\mathcal{G}_{p}}\|_{F}=\sqrt{\sum_{(l,j,r)\in\mathcal{G}_{p}}\beta_{ljr}^{2}}$. This penalty form generalizes the row-wise group sparsity in (6).

Denote $\mathcal{I}=\{p:\boldsymbol{\beta}_{0,\mathcal{G}_{p}}\neq\mathbf{0}\}$ and $\mathcal{I}^{c}=\{p:\boldsymbol{\beta}_{0,\mathcal{G}_{p}}=\mathbf{0}\}$, where $\boldsymbol{\beta}_{0,\mathcal{G}_{p}}$ is the $p$th group of $\boldsymbol{\beta}_{0}$; here $\mathcal{I}$ collects the non-zero groups, playing the role of $S$ in Condition 4. With some abuse of notation, now denote by $s$ the size of $\mathcal{I}$. We impose the following group-version restricted eigenvalue condition.

Condition 5

For all $\boldsymbol{\beta}\in\mathbb{R}^{(d\times m)\times k}$ satisfying

\sum_{p\in\mathcal{I}^{c}}\|\boldsymbol{\beta}_{\mathcal{G}_{p}}\|_{F}\leq 6\sum_{p\in\mathcal{I}}\|\boldsymbol{\beta}_{\mathcal{G}_{p}}\|_{F},

it holds that for some constant $\kappa\geq 1$,

\sum_{p\in\mathcal{I}}\|\boldsymbol{\beta}_{\mathcal{G}_{p}}\|_{F}^{2}\leq\kappa^{2}\|\varphi\|_{Q_{n}}^{2}.
Theorem 3

Consider the Hermit  model in (1) with θ0Θ~\theta_{0}\in\tilde{\Theta}, and consider the penalized estimator (4) with the group 1\ell_{1} penalty in (27).

(a) Assume conditions 1-3 and 5 hold. Suppose mkn/Mn\sqrt{mk}\lesssim n/M_{n}, and take λ>2Tλ0\lambda>2T\lambda_{0} for some constant T>1T>1. For some constant c>0c>0 and large enough nn, with the following probability 1cexp((logn)2log(dn)c)1n,1-c\exp\left(-\frac{(\log n)^{2}\log(d\vee n)}{c}\right)-\frac{1}{n}, we have

ε¯(θ^θ0)+2(λTλ0)pc𝜷^𝒢pF4(λ+Tλ0)2κ2c02s.\displaystyle\bar{\varepsilon}(\hat{\theta}\mid\theta_{0})+2(\lambda-T\lambda_{0})\sum_{p\in\mathcal{I}^{c}}\|\widehat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}\|_{F}\leq 4(\lambda+T\lambda_{0})^{2}\kappa^{2}c_{0}^{2}s.

(b) Assume conditions 1-3 hold (without condition 5), and assume

p=1P𝜷0,𝒢pF\displaystyle\sum_{p=1}^{P}\|\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F} =o(n/((logn)2+2c1log(dn)mk)),\displaystyle=o(\sqrt{n/((\log n)^{2+2c_{1}}\log(d\vee n)mk)}),
mk\displaystyle\sqrt{mk} =o(n/((logn)2+2c1log(dn)))\displaystyle=o(\sqrt{n/((\log n)^{2+2c_{1}}\log(d\vee n))})

as n\rightarrow\infty. Let \lambda=C\sqrt{(\log n)^{2+2c_{1}}\log(d\vee n)mk/n} for some C>0 sufficiently large. Then, for some constant c>0 and large enough n, with probability at least 1-c\exp\left(-\frac{(\log n)^{2}\log(d\vee n)}{c}\right)-\frac{1}{n}, we have \bar{\varepsilon}(\hat{\theta}\mid\theta_{0})=o_{P}(1).

Thus the average excess risk converges at rate O(s\lambda_{0}^{2}), and the degree of false group selection, as measured by \sum_{p\in\mathcal{I}^{c}}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}\|_{F}, converges to zero at rate O(s\lambda_{0}). The estimator in (4) with other group \ell_{1} penalties such as (6) is a special case, so the results of Theorem 3 still apply.

Remark. Our results can be extended to the mean-shifted natural parameter model as in (11), with a modified restricted eigenvalue condition. See the Appendix for some details.

6 Experiments

In this section, we present empirical studies on both synthetic and real-world data sets.

6.1 Methods for Comparison

We evaluate the following versions of the proposed Hermit  approach.

(1) Single task learning (Single): It is a special case of the Hermit  estimator (4) with (5), where each task is learned separately.

(2) Separately learning (Sep): It is a special case of the Hermit  estimator (4) with (5), where each type (Gaussian, Bernoulli or Poisson) of tasks is learned separately.

(3) Mixed learning with entry-wise sparsity (Mix): It is the proposed Hermit  estimator (4) with (5) where all the tasks are jointly learned. To compare with Sep, we allow different tuning parameters for different types of outcomes.

(4) Mixed learning with group sparsity (Mix GS): It is the proposed Hermit  estimator (4) with (6).

(5) Mixed learning Mixture-Of-Experts model with entry-wise sparsity (Mix MOE): It is the proposed Hermit  estimator (19) with (5) and (20).

(6) Mixed learning Mixture-Of-Experts model with group sparsity (Mix MOE GS): It is the proposed Hermit  estimator (19) with (6) and (20).

Besides the above FMR methods, we also evaluate several non-FMR multi-task methods below for comparison, some of which handle certain kinds of heterogeneities, such as anomaly tasks, clustered tasks and heterogeneous responses. Since they are non-FMR, they learn a single regression coefficient matrix 𝜷d×m\mbox{\boldmath$\beta$}\in\mathbb{R}^{d\times m}.

  • LASSO: \ell_{1}-norm multi-task regression with \lambda\|\mbox{\boldmath$\beta$}\|_{1} as the penalty. Each type of tasks is learned independently. It is a special case of Sep with the pre-fixed \hat{k}=1.

  • Sep L2: ridge multi-task regression with \lambda\|\mbox{\boldmath$\beta$}\|_{F}^{2} as the penalty. Each type of tasks is learned independently.

  • Group LASSO: \ell_{1,2}-norm multi-task regression with \lambda\|\mbox{\boldmath$\beta$}\|_{1,2} as the penalty (Yang et al, 2009), which handles heterogeneous responses and is a special case of Mix GS with the pre-fixed \hat{k}=1.

  • TraceReg: trace-norm multi-task regression (Ji and Ye, 2009).

  • Dirty: dirty-model multi-task regression with \lambda_{1}\|\mathbf{S}\|_{1}+\lambda_{2}\|\mathbf{L}\|_{1,\infty} (\mbox{\boldmath$\beta$}=\mathbf{L}+\mathbf{S}) as the penalty (Jalali et al, 2010), handling entry-wise heterogeneity in \mbox{\boldmath$\beta$} compared with Group LASSO.

  • MSMTFL: multi-stage multi-task feature learning (Gong et al, 2012b), whose penalty is \lambda_{1}\sum_{l=1}^{d}\min(\|\mbox{\boldmath$\beta$}^{l}\|_{1},\lambda_{2}), where \mbox{\boldmath$\beta$}^{l} denotes the lth row of \mbox{\boldmath$\beta$}. It also handles entry-wise heterogeneity in \mbox{\boldmath$\beta$} compared with Group LASSO.

  • SparseTrace: multi-task regression learning sparse and low-rank patterns with \lambda_{1}\|\mathbf{S}\|_{1}+\lambda_{2}\|\mathbf{L}\|_{*} (\mbox{\boldmath$\beta$}=\mathbf{L}+\mathbf{S}) as the penalty (Chen et al, 2012a), handling entry-wise heterogeneity in \mbox{\boldmath$\beta$} compared with TraceReg, where \|\cdot\|_{*} denotes the nuclear norm of the enclosed matrix.

  • rMTFL: robust multi-task feature learning with \lambda_{1}\|\mathbf{S}\|_{2,1}+\lambda_{2}\|\mathbf{L}\|_{1,2} (\mbox{\boldmath$\beta$}=\mathbf{L}+\mathbf{S}) as the penalty (Gong et al, 2012a), handling anomaly tasks compared with Group LASSO.

  • RMTL: robust multi-task regression with \lambda_{1}\|\mathbf{S}\|_{2,1}+\lambda_{2}\|\mathbf{L}\|_{*} (\mbox{\boldmath$\beta$}=\mathbf{L}+\mathbf{S}) as the penalty (Chen et al, 2011), handling anomaly tasks compared with TraceReg.

  • CMTL: clustered multi-task learning (Zhou et al, 2011), handling clustered tasks.

  • GO-MTL: multi-task regression, handling overlapping clustered tasks (Kumar and Daumé III, 2012).

6.2 Experimental Setting

In our experiments, for the E-step of GEM, we follow Städler et al (2010) to initialize \rho. For the M-step, we initialize the entries of \mbox{\boldmath$\beta$} from \mathcal{N}(0,10^{-10}). We fix \sigma=1 for Gaussian tasks and set \gamma=1. In the APG algorithm, the step size is initialized by the Barzilai-Borwein rule (Barzilai and Borwein, 1988) and updated by TFOCS-style backtracking (Becker et al, 2011).

We terminate the APG algorithm after a maximum of T_{in}=200 iterations, or when the relative \ell_{2}-norm distance between two consecutive parameter iterates falls below 10^{-6}. We terminate the GEM after a maximum of T_{out}=50 iterations, when the relative change of -\ell(\theta\mid\mathbf{Y},\mathbf{X})/n between two consecutive iterations falls below 10^{-6}, or when the relative \ell_{\infty}-norm distance between two consecutive parameter iterates falls below 10^{-3}.
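For concreteness, the following sketch shows one way these stopping rules could be coded; the helper names are hypothetical and the tolerances are the ones stated above, so this is an illustration rather than the authors' released implementation.

import numpy as np

T_IN, T_OUT = 200, 50                       # maximum inner (APG) and outer (GEM) iterations
TOL_APG, TOL_NLL, TOL_PAR = 1e-6, 1e-6, 1e-3

def apg_stop(theta_new, theta_old, it):
    # stop APG at T_IN iterations or when the relative l2 distance of consecutive iterates is small
    rel = np.linalg.norm(theta_new - theta_old) / max(np.linalg.norm(theta_old), 1e-12)
    return it >= T_IN or rel < TOL_APG

def gem_stop(nll_new, nll_old, theta_new, theta_old, it):
    # stop GEM at T_OUT iterations, on a small relative change of -loglik/n,
    # or on a small relative l-infinity distance of consecutive parameters
    rel_nll = abs(nll_new - nll_old) / max(abs(nll_old), 1e-12)
    rel_par = np.max(np.abs(theta_new - theta_old)) / max(np.max(np.abs(theta_old)), 1e-12)
    return it >= T_OUT or rel_nll < TOL_NLL or rel_par < TOL_PAR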

In the experiments on both simulated and real-world data sets, we partition the entire data set into three parts: a training set for model fitting, a validation set for tuning hyper-parameters, and a testing set for assessing the generalization performance of the selected models. The only exception is Section 6.4.1, where we do not generate testing data because the models are evaluated by comparing the estimation results to the ground truth.

In hyper-parameter tuning, the regularization parameters, i.e., the \lambdas, are tuned over [10^{-6},10^{3}], and the number of clusters is tuned over \{1,\ldots,10\}. Hyper-parameters of the baseline methods are tuned according to the descriptions in their respective references.
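As an illustration only, the search grids described above could be instantiated as follows; the grid resolution is our own choice and is not specified in the paper.

import numpy as np

lambda_grid = np.logspace(-6, 3, 19)   # regularization parameters spanning [1e-6, 1e3]
k_grid = range(1, 11)                  # candidate numbers of mixture components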

All the experiments are replicated 100 times under each model setting.

6.3 Evaluation Metrics

Prediction of the latent variables is evaluated by Normalized Mutual Information (NMI) (Strehl and Ghosh, 2002b; Fern and Brodley, 2003; Strehl and Ghosh, 2002a). In detail, we compute NMI scores by (16), treating the estimated conditional probabilities [\mathbf{P}_{\Omega,ir}]_{n\times k} defined in (14) and the ground-truth latent variables [\delta_{i,r}]_{n\times k} as [P_{1,ir}]_{n\times k} and [P_{2,ir}]_{n\times k}, respectively.
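A minimal sketch of one common way to compute such an NMI score between two soft membership matrices is given below; the paper's exact formula is the one in (16), so the function name and construction here should be read as an assumption-laden illustration rather than a reproduction of it.

import numpy as np

def soft_nmi(P1, P2, eps=1e-12):
    """NMI between two row-stochastic membership matrices P1 (n x k1) and P2 (n x k2)."""
    n = P1.shape[0]
    joint = P1.T @ P2 / n                        # joint distribution over cluster pairs
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)
    mi = np.sum(joint * (np.log(joint + eps) - np.log(np.outer(p1, p2) + eps)))
    h1 = -np.sum(p1 * np.log(p1 + eps))
    h2 = -np.sum(p2 * np.log(p2 + eps))
    return mi / max(np.sqrt(h1 * h2), eps)       # normalized to lie in [0, 1]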

For feature selection, the estimated components are first reordered to best match the true components. Feature selection is then evaluated by the Area Under the ROC Curve (AUC), measured by the Wilcoxon-Mann-Whitney statistic as in Hanley and McNeil (1982). Concretely, the absolute values of the vectorized estimated regression parameters, i.e., \hat{\beta}_{\mbox{abs}}=|\mbox{vec}(\hat{\mbox{\boldmath$\beta$}})|, and the binarized vectorized ground-truth regression parameters, i.e., \beta_{0,\mbox{sign}}=\mbox{sign}(|\mbox{vec}(\mbox{\boldmath$\beta$}_{0})|), are used as the score and label inputs to the AUC, where \mbox{\boldmath$\beta$}_{0} denotes the ground-truth regression parameter and \mbox{vec}(\cdot) is the vectorization operator.
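The AUC computation just described can be sketched as follows, using the rank-based form of the Wilcoxon-Mann-Whitney statistic; the function name and array layout are our own.

import numpy as np
from scipy.stats import rankdata

def feature_selection_auc(beta_hat, beta_true):
    scores = np.abs(beta_hat).ravel()                      # |vec(beta_hat)|
    labels = (np.abs(beta_true).ravel() > 0).astype(int)   # sign(|vec(beta_0)|)
    ranks = rankdata(scores)                               # mid-ranks handle ties
    n_pos, n_neg = labels.sum(), len(labels) - labels.sum()
    # Wilcoxon-Mann-Whitney statistic rescaled to the AUC
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)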

To show the existence of mixed relationships between features and targets, imputation performance for incomplete targets is used to compare FMR methods with non-FMR MTL methods. Concretely, the goal is to predict one half of the targets, chosen at random, while the other half may be used as inputs. FMR methods use the observed half of the targets to compute the conditional probabilities p(\delta_{i,r}=1\mid y_{ij^{\prime}},j^{\prime}\in\Omega_{i},\mathbf{x}_{i},\hat{\theta}) (i=1,\ldots,n, r=1,\dots,k) and make predictions as described in Section 3.3. Non-FMR MTL methods perform feature-based prediction.
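A minimal sketch of the mixture-weighted imputation used by the FMR methods is shown below, assuming the component posteriors rho (computed from the observed half of the targets) and a fitted coefficient array beta are available; the helper name, array shapes, and the use of canonical-link means are our own illustration of the prediction step in Section 3.3.

import numpy as np

def impute_targets(X, beta, rho, task_types):
    """X: (n, d); beta: (k, d, m); rho: (n, k) posteriors; task_types: length-m list."""
    n, k = rho.shape
    m = beta.shape[2]
    preds = np.zeros((n, m))
    for r in range(k):
        eta = X @ beta[r]                                  # natural parameters of component r
        for j, t in enumerate(task_types):
            if t == "bern":
                mu = 1.0 / (1.0 + np.exp(-eta[:, j]))      # logistic mean
            elif t == "pois":
                mu = np.exp(eta[:, j])                     # log-link mean
            else:
                mu = eta[:, j]                             # Gaussian mean
            preds[:, j] += rho[:, r] * mu                  # posterior-weighted prediction
    return preds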

Feature-based prediction performance is also compared between non-FMR MTL methods and our MOE methods, where only the features may be used to predict the testing targets. In this case, the goal is to predict all the targets.

For target prediction, Gaussian outcomes are evaluated by nMSE (Chen et al, 2011; Gong et al, 2012a), defined as the mean over tasks of each task's mean squared error (MSE) divided by the variance of its target vector. Bernoulli outcomes are evaluated by average AUC (aAUC), defined as the mean AUC over tasks. For Poisson tasks, we first take the logarithms of the outcomes and then use nMSE for evaluation.
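These two metrics can be computed as below (a small sketch; the variable layout is ours, and Poisson targets would be log-transformed before calling nmse).

import numpy as np
from sklearn.metrics import roc_auc_score

def nmse(Y_true, Y_pred):
    # mean over tasks of (per-task MSE / variance of that task's target vector)
    mse = np.mean((Y_true - Y_pred) ** 2, axis=0)
    return np.mean(mse / np.var(Y_true, axis=0))

def aauc(Y_true, Y_score):
    # mean AUC over binary tasks
    return np.mean([roc_auc_score(Y_true[:, j], Y_score[:, j])
                    for j in range(Y_true.shape[1])])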

Since the objective functions in (4) and (19) are non-convex, the estimated parameters may correspond to local minima of the objective functions. Therefore, we try different initializations and report the results ranking in the best 20% on the validation data set out of the 100 replications, which avoids results that are stuck at poor local minima; one can select any result within the best 20%.

6.4 Simulation

6.4.1 Latent Variable Prediction and Feature Selection

We consider both a low-dimensional and a high-dimensional case for latent variable prediction and feature selection. For the low-dimensional case, we set the number of samples n=100, the feature dimension d=15, the number of non-zero features (sparsity) s=3, and the number of tasks (responses) m=15. The data set includes 3 Gaussian tasks, 10 Bernoulli tasks, and 2 Poisson tasks. The number of latent components is k=2. For r=1,\dots,k, in the rth component, the first row (biases) and the (s(r-1)+2)th to the (sr+1)th rows (a block of s rows) of the true \mbox{\boldmath$\beta$}_{r}\in\mathbb{R}^{d\times m} are non-zero (so that different components have different sets of relevant features). The non-zero parameters in \mbox{\boldmath$\beta$} are in the range [-3,-1]\cup[1,3], except those of the Poisson tasks, which are in the range [-0.3,-0.1]\cup[0.1,0.3]. The biases are all set to 1, except those of the Poisson tasks, which are set to 3. For Gaussian tasks, all \sigmas are set to 1. The entries of \mathbf{X}\in\mathbb{R}^{d\times n} are drawn from \mathcal{N}(0,1), with the first dimension fixed to 1, and \pi=(0.5,0.5). Validation data are independently generated likewise and have n samples. For the high-dimensional case, we set n=180, d=320 and m=20; the data set includes 8 Gaussian tasks, 10 Bernoulli tasks, and 2 Poisson tasks, with the other settings the same as in the low-dimensional case. We set the pre-fixed \hat{k} equal to the true k=2. For the targets of the training data, we try different missing rates in \{0,0.05,0.1,0.15,0.2\}. We compare the performance of the \hat{\theta}s estimated by Single, Sep, Mix and Mix GS, respectively, with that of \theta_{0} (denoted by "True").
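For readers who wish to reproduce the low-dimensional design, the following sketch generates data under the stated configuration; the indexing conventions are our own, and missing targets and the high-dimensional variant are omitted.

import numpy as np

rng = np.random.default_rng(0)
n, d, s, m, k = 100, 15, 3, 15, 2
task_types = ["gauss"] * 3 + ["bern"] * 10 + ["pois"] * 2

def signed_uniform(size, lo, hi):
    # magnitudes in [lo, hi] with random signs
    return rng.uniform(lo, hi, size) * rng.choice([-1.0, 1.0], size)

beta = np.zeros((k, d, m))
for r in range(k):
    block = np.arange(s * r + 1, s * (r + 1) + 1)        # component-specific feature rows
    for j, t in enumerate(task_types):
        beta[r, 0, j] = 3.0 if t == "pois" else 1.0      # bias row
        lo, hi = (0.1, 0.3) if t == "pois" else (1.0, 3.0)
        beta[r, block, j] = signed_uniform(s, lo, hi)

X = rng.standard_normal((n, d))
X[:, 0] = 1.0                                            # first dimension is the intercept
z = rng.choice(k, size=n, p=[0.5, 0.5])                  # latent component labels

eta = np.zeros((n, m))
for r in range(k):
    eta[z == r] = X[z == r] @ beta[r]

Y = np.empty((n, m))
for j, t in enumerate(task_types):
    if t == "gauss":
        Y[:, j] = eta[:, j] + rng.standard_normal(n)     # sigma = 1
    elif t == "bern":
        Y[:, j] = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta[:, j])))
    else:
        Y[:, j] = rng.poisson(np.exp(eta[:, j]))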

The results are shown in Fig 1, where the horizontal axis is the missing rate. Intuitively, larger missing rates may lead to worse performance because fewer target values are available. Single performs poorly and is not sensitive to the missing rate, because (1) the data samples are deficient for single-task learning and (2) the influence of the missing rate is not significant at this sample size. Sep outperforms Single and is affected significantly by the missing rate, because (1) Sep exploits the prior knowledge that multiple tasks share the same FMR structure, but (2) Sep constructs separate FMR models, so the tasks available to each model are deficient and the benefit of jointly learning multiple tasks is easily eroded when some targets are missing. Our Hermit method Mix outperforms Sep and is robust against growing missing rates, because (1) Mix exploits the prior knowledge that all tasks share the same FMR structure and (2) Mix takes advantage of all the tasks, so the number of tasks remains sufficient even when some targets are missing. Our Hermit method Mix GS outperforms Mix, even rivals the true model, and is also robust against growing missing rates, because (1) compared with Mix, Mix GS further exploits the prior knowledge that all tasks share the same feature set within each cluster, and (2) Mix GS likewise takes advantage of all the tasks.

(a) Latent variable prediction accuracy; (b) Feature selection accuracy; (c) Latent variable prediction accuracy; (d) Feature selection accuracy
Figure 1: Latent variable prediction and feature selection performance. (a) and (b) are results on low-dimensional data; (c) and (d) are results on high-dimensional data.

6.4.2 Performance When the Pre-Fixed \hat{k} Differs from the True k

We test the target imputation performance when the pre-fixed \hat{k} differs from the true k. Four data sets are generated with the true k=1,2,3,4, respectively. We set n=1000, d=32, and m=15. There are 3 Gaussian tasks, 10 Bernoulli tasks, and 2 Poisson tasks. For each k=1,2,3,4, the sparsity s is set to \lfloor d/(2k)\rfloor so that the total number of relevant features is the same across data sets. The non-zero regression parameters for the Gaussian and Bernoulli tasks in \mbox{\boldmath$\beta$} are in the range [-6,-2]\cup[2,6]. We set \pi_{1}=\pi_{2}=\ldots=\pi_{k}. Validation and testing data are independently generated likewise and each have n samples. We randomly set 20% of the targets to be missing in the training, validation and testing data. Other settings are the same as in Section 6.4.1. One intuitive expectation is that the imputation performance is maximized when the pre-fixed \hat{k} equals the true k. We therefore set the pre-fixed \hat{k}\in\{1,2,3,4,5,6,7,8\}. We test the Mix model in this experiment; results for the Mix GS model are similar.

In Fig 2, the imputation performance is indeed maximized when the pre-fixed \hat{k} equals the true k. When the pre-fixed \hat{k} is larger than the true k, the imputation performances remain similar. When the true k>1 and the pre-fixed \hat{k} is smaller than the true k, the imputation performance grows with the pre-fixed \hat{k}. One might expect performance to deteriorate when the pre-fixed \hat{k} exceeds the true k, since each component would then be fitted on fewer data samples; however, no such deterioration is observed. We believe this is because (1) the simulated data are simple, and (2) the information sharing among tasks makes our Hermit method robust against decreasing sample size, which is consistent with the results in Section 6.4.1 under increasing missing rates (a larger missing rate also implies fewer available data).

(a) nMSE of Gaussian targets; (b) aAUC of Bernoulli targets; (c) nMSE of log of Poisson targets
Figure 2: Imputation performance when the pre-fixed \hat{k} differs from the true k.

6.4.3 Comparison with Non-FMR Methods

We compare the imputation performance of our Hermit methods Mix and Mix GS with all the non-FMR methods. We use the data set from Section 6.4.2 with the true k=3. The Poisson targets are removed because many of the other methods cannot handle them. The tuned \hat{k}=3.

In Table 1, our Hermit  methods Mix and Mix GS not only outperform their special cases, i.e., LASSO and Group LASSO, respectively, but also outperform other multi-task learning methods, including those handling certain kinds of heterogeneities.

nMSE aAUC
LASSO 0.6892 0.7384
Mix 0.1181 0.9525
Group LASSO 0.6850 0.7482
Mix GS 0.1212 0.9559
Sep L2 0.6912 0.7355
GO-MTL 0.8055 0.7259
CMTL 0.6916 0.7344
MSMTFL 0.6890 0.7381
TraceReg 0.6913 0.7362
SparseTrace 0.6904 0.7374
RMTL 0.6913 0.7362
Dirty 0.6850 0.7482
rMTFL 0.6850 0.7482
Table 1: Comparison with non-FMR methods

6.4.4 Detection of Anomaly Tasks

We set n=2000 and the number of tasks (responses) m=30. The true ks and the numbers of tasks of each type are given in Table 2. Other settings are the same as in Section 6.4.2. As shown in Table 2, the true k of the majority of tasks (the first 20 tasks) is 4. These first 20 tasks are referred to as concordant tasks, while the other 10 tasks are referred to as anomaly tasks.

Group True k #Gaussian #Bernoulli #Poisson
1 4 5 10 5
2 1 1 1 1
3 6 1 0 0
4 2 1 1 0
5 3 0 1 1
6 5 1 1 0
Table 2: True ks and numbers of different types of tasks.

We compute the concordant scores using (15) for the tasks. In Fig 3, the concordant scores separate concordant tasks and anomaly tasks quite well. Scores of Poisson tasks are similar to scores of Bernoulli tasks, because they all provide less accurate information than Gaussian tasks do.

(a) Mix; (b) Mix GS
Figure 3: Concordant scores of tasks, associated with Table 2. (a) estimated by Mix; (b) estimated by Mix GS. The first 20 tasks are concordant tasks and the last 10 tasks are anomaly tasks. The first 5 tasks are Gaussian tasks, the subsequent 10 tasks are Bernoulli tasks, and the following 5 tasks are Poisson tasks.

6.4.5 Handling Clustered Relationships among Tasks

We construct 4 groups of tasks, with a total of m=60 tasks (responses). The true ks and the numbers of tasks of each type are given in Table 3. Other settings are the same as in Section 6.4.4. We first apply Single to each task, setting \hat{k}=20. We then apply the strategy in Section 4.3 to construct a similarity matrix using the NMI defined in (16), and apply Kernel PCA (Schölkopf et al, 1998; Van Der Maaten et al, 2009) with this similarity matrix as the kernel matrix; a sketch of this two-step procedure is given below. The similarity matrix and the Kernel PCA result are shown in Fig 4.
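A minimal sketch of this two-step procedure is given here, assuming soft_nmi from the earlier metric sketch and a list memberships whose jth entry is the (n x k_hat) posterior matrix obtained by fitting Single to task j; the kernel PCA step is written out explicitly since the NMI matrix is used directly as the kernel.

import numpy as np

def task_similarity(memberships):
    # pairwise NMI between the per-task posterior membership matrices
    m = len(memberships)
    S = np.ones((m, m))
    for a in range(m):
        for b in range(a + 1, m):
            S[a, b] = S[b, a] = soft_nmi(memberships[a], memberships[b])
    return S

def kernel_pca_2d(S):
    # classical kernel PCA on the precomputed similarity (kernel) matrix
    m = S.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m          # centering matrix
    vals, vecs = np.linalg.eigh(H @ S @ H)
    top = np.argsort(vals)[::-1][:2]             # two leading components
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))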

Group True k #Gaussian #Bernoulli #Poisson
1 1 3 10 2
2 2 3 10 2
3 3 3 10 2
4 4 3 10 2
Table 3: True ks and numbers of different types of tasks. Tasks are clustered into 4 groups.
(a) Similarity Matrix; (b) Dimension Reduction by Kernel PCA
Figure 4: (a) Similarity matrix among the tasks described in Table 3; (b) Relationship among tasks shown by Kernel PCA.

In Fig 4(a), Groups 2, 3 and 4 can be recognized as three groups. In Group 1, each task shows no similarity with the other tasks because, with the true k=1, the data samples are partitioned into \hat{k}=20 sub-populations essentially at random, which results in low NMI scores. In Fig 4(b), the 4 groups of tasks are largely clustered into 4 different regions.

6.4.6 Handling Outlier Samples

We choose the data set used in Section 6.4.2 with the true k=2, randomly shuffle the data pairs (\mathbf{y}_{i},\mathbf{x}_{i}), i=1,\ldots,n, and contaminate the true targets by the following procedure. For each outlier ratio p_{\mbox{outlier}}=0\%,1\%,2\%,5\%,8\%,10\%, (1) for Gaussian targets, we set all the targets of a fraction p_{\mbox{outlier}} of the data samples to 100; (2) for Bernoulli targets, we set all the targets of a fraction p_{\mbox{outlier}} of the data samples to 1. Such contamination is performed only on the training and validation data, leaving the testing data clean.

We then evaluate two groups of methods. For the group of non-robust methods, we choose our Hermit methods Mix and Mix GS. For the group of robust methods, we first run the robust version of each non-robust method, adding \mbox{\boldmath$\zeta$} to the natural parameter model as in (11) and adding (13) as an additional penalty; we then clean the data by removing the fraction p_{\mbox{outlier}} of data samples with the largest values of \sqrt{\sum_{jr}\zeta_{ijr}^{2}} (i\in\{1,\ldots,n\}), as sketched below. Finally, we run the corresponding non-robust methods on the "cleaned" data. We follow Gong et al (2012a) in adopting this two-stage strategy.
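The cleaning step of this two-stage strategy can be sketched as follows, given the mean-shift estimates zeta_hat from the robust fit; the helper name and array layout are our own.

import numpy as np

def drop_outlier_samples(X, Y, zeta_hat, p_outlier):
    """zeta_hat: (n, m, k) estimated mean shifts; drop the p_outlier fraction with largest norm."""
    scores = np.sqrt((zeta_hat ** 2).sum(axis=(1, 2)))   # sqrt(sum_{j,r} zeta_ijr^2) per sample
    n_drop = int(np.floor(p_outlier * len(scores)))
    keep = np.argsort(scores)[: len(scores) - n_drop]    # keep the samples with smallest scores
    return X[keep], Y[keep]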

The imputation performances are reported in Table 4, from which it can be seen that 1) when p_{\mbox{outlier}}=0\%, the robust methods are over-parameterized and may underperform the non-robust methods; 2) when p_{\mbox{outlier}}>0\%, the robust methods significantly outperform the non-robust methods.

Metric Method Variant 0% 1% 2% 5% 8% 10%
nMSE for Gaussian Mix non-robust 0.0625 0.6754 0.6894 1.0122 1.3250 1.4953
nMSE for Gaussian Mix robust 0.0620 0.0627 0.0626 0.0737 0.0632 0.0635
nMSE for Gaussian Mix GS non-robust 0.0658 0.6434 0.6741 0.7505 1.0736 1.2939
nMSE for Gaussian Mix GS robust 0.0599 0.0611 0.0673 0.0694 0.0602 0.0607
aAUC Mix non-robust 0.9571 0.7954 0.7961 0.7982 0.7986 0.7981
aAUC Mix robust 0.9570 0.9571 0.9574 0.9519 0.9568 0.9567
aAUC Mix GS non-robust 0.9509 0.7979 0.7984 0.7982 0.7979 0.7952
aAUC Mix GS robust 0.9581 0.9577 0.9519 0.9482 0.9578 0.9574
nMSE for Poisson Mix non-robust 0.2089 0.7368 0.6905 0.6528 0.6642 0.6736
nMSE for Poisson Mix robust 0.2086 0.2105 0.2099 0.2345 0.2222 0.2230
nMSE for Poisson Mix GS non-robust 0.2136 0.7416 0.7795 0.6587 0.6688 0.8665
nMSE for Poisson Mix GS robust 0.2087 0.2109 0.2169 0.2202 0.2212 0.2236
Table 4: Comparison between methods handling outlier samples on synthetic data; columns correspond to the outlier ratio p_{\mbox{outlier}}.

6.4.7 Feature-Based Prediction by MOE

We set the true k=3. The true \mbox{\boldmath$\alpha$}\in\mathbb{R}^{d\times k} has its first four rows non-zero, with the non-zero entries drawn from \mathcal{N}(0,1). The number of data samples is n=1000. For all i=1,\ldots,n and r=1,\ldots,k, the ith data sample belongs to the rth sub-population according to a multinomial distribution with the probabilities defined in (17). Other settings are the same as in Section 6.4.3.

We compare our Hermit  methods Mix MOE and Mix MOE GS. The prediction performances are shown in Table 5, which are consistent with the results in Section 6.4.3.

We further show in Table 6 the concordance between p(\delta_{i,r}=1\mid\mathbf{x}_{i},\hat{\alpha}_{r}) and p(\delta_{i,r}=1\mid\mathbf{x}_{i},\alpha_{0,r}), where \mbox{\boldmath$\alpha$}_{0} denotes the true \mbox{\boldmath$\alpha$}, for all i=1,\ldots,n and r=1,\ldots,k, on both the training and testing data. In (22), \mbox{\boldmath$\alpha$} is optimized by partially minimizing the discrepancy between p(\delta_{i,r}=1\mid\mathbf{x}_{i},\hat{\alpha}_{r}) and \hat{\rho}^{(t+1)}_{i,r} for t=0,\ldots,T-1. We therefore also show the concordance between p(\delta_{i,r}=1\mid\mathbf{x}_{i},\hat{\alpha}_{r}) and \hat{\rho}^{(T)}_{i,r}=p(\delta_{i,r}=1\mid y_{ij^{\prime}},j^{\prime}\in\Omega_{i},\mathbf{x}_{i},\hat{\theta}_{2}). The concordances are measured by the NMI defined in (16); we use NMI instead of the KL-divergence because NMI is normalized to [0,1].

Both p(\delta_{i,r}=1\mid\mathbf{x}_{i},\alpha_{0,r}) and p(\delta_{i,r}=1\mid y_{ij^{\prime}},j^{\prime}\in\Omega_{i},\mathbf{x}_{i},\hat{\theta}_{2}) are approximated accurately on the training data. The approximation accuracies are lower on the testing data because of the deficiency of data samples relative to the dimension.

nMSE aAUC
LASSO 0.6390 0.7834
Mix MOE 0.0656 0.9466
Group LASSO 0.6348 0.7878
Mix MOE GS 0.0579 0.9502
Sep L2 0.6481 0.7794
GO-MTL 0.6946 0.7778
CMTL 0.6496 0.7796
MSMTFL 0.6397 0.7831
TraceReg 0.6509 0.7790
SparseTrace 0.6473 0.7805
RMTL 0.6511 0.7797
Dirty 0.6348 0.7878
rMTFL 0.6483 0.7787
Table 5: Prediction performances based on only features.
Training Testing
C(𝜶^𝜶0)C(\hat{\mbox{\boldmath$\alpha$}}\parallel\mbox{\boldmath$\alpha$}_{0}) C(𝜶^θ^2)C(\hat{\mbox{\boldmath$\alpha$}}\parallel\hat{\theta}_{2}) C(𝜶^𝜶0)C(\hat{\mbox{\boldmath$\alpha$}}\parallel\mbox{\boldmath$\alpha$}_{0}) C(𝜶^θ^2)C(\hat{\mbox{\boldmath$\alpha$}}\parallel\hat{\theta}_{2})
Mix MOE 0.9863 0.9918 0.8440 0.8457
Mix MOE GS 0.9962 0.9933 0.8455 0.8516
Table 6: Approximation performances based on only features. C(\hat{\mbox{\boldmath$\alpha$}}\parallel\mbox{\boldmath$\alpha$}_{0}) denotes the concordance between p(\delta_{i,r}=1\mid\mathbf{x}_{i},\hat{\alpha}_{r}) and p(\delta_{i,r}=1\mid\mathbf{x}_{i},\alpha_{0,r}), where \mbox{\boldmath$\alpha$}_{0} denotes the true \mbox{\boldmath$\alpha$}. C(\hat{\mbox{\boldmath$\alpha$}}\parallel\hat{\theta}_{2}) denotes the concordance between p(\delta_{i,r}=1\mid\mathbf{x}_{i},\hat{\alpha}_{r}) and p(\delta_{i,r}=1\mid y_{ij^{\prime}},j^{\prime}\in\Omega_{i},\mathbf{x}_{i},\hat{\theta}_{2}). The concordances are measured by the NMI defined in (16).

6.4.8 Scalability

We discuss the scalability of our method as the numbers of features and tasks increase, evaluated by running time. We choose the data set used in Section 6.4.2 with the true k=4 and fix the sparsity s at 4. We set the number of features d\in\{32,64,128,256,512\} and the number of tasks m\in\{15,30,60,120,240\}. The ratios between the numbers of Gaussian, Bernoulli and Poisson tasks are the same as in Section 6.4.2. We randomly generate 100 data sets for each pair (d,m). We report results for the method Mix GS only, as the results for Mix are similar. In each case, \hat{k} is tuned and equals the true k=4. Since the estimated parameters in different cases may have different numbers of relevant features, for a fair comparison we report the running time per feature, i.e., the running time divided by the number of non-zero features of the estimated parameters in each case.

Fig 5 shows that neither the dimension d nor the number of tasks m has a significant influence on the running time per feature, especially when d and m are large, which is consistent with our time-complexity analysis in Section 3.2.

(a) Training Time; (b) Testing Time
Figure 5: Running time per feature as the dimension and the number of tasks grow. (a) running time per feature for training the model; (b) running time per feature for testing.

6.5 Application

On the real-world data sets, we first demonstrate the existence of heterogeneity in the conditional relationships, and then report the superiority of our Hermit method over the other methods considered in Section 6.1. We further interpret the advantage of our method by presenting the selected features. The effectiveness of the anomaly-task detection and task clustering strategies is also validated.

6.5.1 Data Description

Both real-world data sets introduced below are longitudinal surveys of elderly patients, each consisting of a set of questions. Some of the question answers are treated as input features, and some of the questions related to indices of geriatric assessments are treated as targets.

LSOA II Data. This data set is from the Second Longitudinal Study of Aging (LSOA II, https://www.cdc.gov/nchs/lsoa/lsoa2.htm). LSOA II is a collaborative study by the National Center for Health Statistics (NCHS) and the National Institute on Aging, conducted from 1994 to 2000. A nationally representative sample of 9447 subjects aged 70 years and over was selected and interviewed. Three separate interviews were conducted during 1994-1996, 1997-1998, and 1999-2000, referred to as the WAVE 1, WAVE 2, and WAVE 3 interviews, respectively. We use data from WAVE 2 and WAVE 3, which include a total of 4299 sample subjects and 44 targets; 188 features are extracted from the WAVE 2 interview.

Among the targets, three self-rated health measures, namely overall health status, memory status and depression status, can be regarded as continuous outcomes; the remaining 41 outcomes are binary and fall into several categories: fundamental daily activity, extended daily activity, social involvement, medical condition, cognitive ability, and sensation condition. The features include records of demographics, family structure, daily personal care, medical history, social activity, health opinion, behavior, nutrition, health insurance, and income and assets, the majority of which are binary measurements. Both targets and features have missing values due to non-response and questionnaire filtering. The average missing-value rates in the targets and features are 13.7% and 20.2%, respectively. For the missing values in the features, we adopt the following pre-processing procedure. For continuous features, missing values are imputed with the sample mean. For binary features, a better approach is to treat missingness as a third category, as it may also carry important information; hence two dummy variables are created from each binary feature with missing values (the third one is unnecessary). This results in a total of d=293 features. We randomly select 30% of the samples for training, 30% for validation, and the rest for testing.

easySHARE Data. This is a simplified data set from the Survey of Health, Aging, and Retirement in Europe (SHARE, http://www.share-project.org/data-access-documentation.html). SHARE includes multidisciplinary and cross-national panel databases on health, socio-economic status, and social and family networks of more than 85,000 individuals aged 50 or over from 20 European countries. Four waves of interviews were conducted during 2004-2011, referred to as the WAVE 1 to WAVE 4 interviews. We use WAVE 1 and WAVE 2, which include 20,449 sample persons and 15 targets (11 binary and 4 continuous), and a total of 75 features constructed from the WAVE 1 interview.

The targets come from four interview modules: social support, mental health, functional limitation indices, and cognitive function indices. The features cover a wide range of assessments, including demographics, household composition, social support and network, physical health, mental health, behavioral risk, healthcare, occupation, and income. Detailed feature descriptions are not listed in this paper. Both targets and features have missing values due to non-response and questionnaire filtering. The average missing-value rates in the targets and features are 6.9% and 5.1%, respectively. The same pre-processing procedure as for the LSOA II data is adopted, resulting in a total of d=118 features. We randomly select 10% of the samples for training, 10% for validation, and the rest for testing.

6.5.2 Comparison with FMR Method

In this experiment, we compare our proposed Hermit methods Mix and Mix GS, which handle mixed types of outcomes, with Sep, which learns the different types of tasks separately. Single is excluded because it learns each task independently and cannot use the targets of other tasks to improve imputation performance.

Results are reported in Fig 6, where 1) for both real data sets, the best pre-fixed \hat{k}>1, except for the Bernoulli tasks of the easySHARE data, suggesting that heterogeneity of the conditional relationship exists in the LSOA II data and in the Gaussian tasks of the easySHARE data; 2) FMR models benefit Gaussian targets more than Bernoulli targets; and 3) Mix and Mix GS outperform Sep on the Gaussian tasks, while their performance is comparable with Sep on the Bernoulli tasks, possibly because the Bernoulli tasks greatly outnumber the Gaussian tasks, so the benefit gained from the Gaussian tasks is limited.

(a) nMSE of Gaussian targets; (b) aAUC of Bernoulli targets; (c) nMSE of Gaussian targets; (d) aAUC of Bernoulli targets
Figure 6: Comparison with Sep on real data sets. (a) and (b) are results on LSOA II data; (c) and (d) are results on easySHARE data.

6.5.3 Comparison with Non-FMR Methods

In this experiment we test imputation performance, comparing our Hermit  methods Mix and Mix GS with all the non-FMR methods.

Results are reported in Table 7, where 1) our Hermit methods Mix and Mix GS not only outperform their special cases, LASSO and Group LASSO, respectively, but also outperform the other methods, including those handling certain kinds of heterogeneities, except for aAUC on easySHARE, for the reason discussed in Section 6.5.2; and 2) Mix GS improves nMSE by 9.76% and 14.37% on the LSOA II data and the easySHARE data, respectively, compared with its non-FMR counterpart Group LASSO. Similar improvements are observed for Mix.

LSOA II easySHARE
nMSE aAUC nMSE aAUC
LASSO 0.7051 0.7474 0.7869 0.7386
Mix 0.6408 0.7525 0.6601 0.7419
Group LASSO 0.6975 0.7413 0.7897 0.7413
Mix GS 0.6294 0.7481 0.6548 0.7402
Sep L2 0.7176 0.7392 0.7796 0.7464
GO-MTL 0.8516 0.6972 0.8231 0.7288
CMTL 0.8186 0.7089 0.7958 0.7364
MSMTFL 0.7028 0.7473 0.7803 0.7411
TraceReg 0.7150 0.7408 0.7809 0.7496
SparseTrace 0.6972 0.7475 0.7791 0.7475
RMTL 0.7145 0.7418 0.7808 0.7496
Dirty 0.7032 0.7480 0.7781 0.7486
rMTFL 0.6953 0.7418 0.7781 0.7486
Table 7: Comparison with non-FMR methods on real data sets

6.5.4 Feature Selection

We now demonstrate the advantage of our Hermit method for feature selection. We compare our Hermit method Mix GS with its non-FMR counterpart Group LASSO. Both methods select features shared across tasks. We collect the unique features that are selected for only one sub-population.

For the LSOA II data set, the tuned \hat{k}=2. Mix GS selects 47/294 features (summing the selected features over both sub-populations), while Group LASSO selects 48/294 features. Descriptions of the unique features of both sub-populations are listed in Table 8. Sub-population 1 appears to correspond to patients in worse condition.

Sub-population 1 (\pi_{1}=70.09\%) Sub-population 2 (\pi_{2}=29.91\%)
able or prevented to leave house times seen doctor in past 3 months
have problems with balance easier or harder to walk 1/4 mile
total number of living children widowed
easier/harder than before: in/out of bed follow regular physical routine
#(ADL activities) SP is unable to perform present social activities
easier or harder to walk 10 steps ever had a stress test
do you take aspirin do you take vitamins
often troubled with pain necessary to use steps or stairs
visit homebound friend for others had flu shot
ever had a hysterectomy ever had cataract surgery
physical activity more/less/same
Table 8: Descriptions of the unique features of each sub-population of the LSOA II data. Features are sorted in descending order of \|\mbox{\boldmath$\beta$}_{r}^{l}\|_{2}, where \mbox{\boldmath$\beta$}_{r}^{l} is the lth row of \mbox{\boldmath$\beta$}_{r} (l=1,\ldots,d). Boldface denotes features that are not selected by Group LASSO. "#(\cdot)" denotes the number of the enclosed events. "ADL" denotes Activities of Daily Living. "SP" denotes Standardized Patients.

For the easySHARE data set, the tuned \hat{k}=5. Mix GS selects 58/118 features, while Group LASSO selects 57/118 features. Descriptions of the unique features of two sub-populations are listed in Table 9. Sub-population 1 appears to relate more to personality and experience, while sub-population 2 appears to relate more to political and educational activities.

For both real data sets, our Hermit  method Mix GS recalls more useful features than Group LASSO does.

Sub-population 1 (\pi_{1}=21.31\%) Sub-population 2 (\pi_{2}=26.36\%)
fatigue taken part in a political organization
guilt attended an educational or training course
enjoyment taken part in religious organization
suicidality none of social activities
tearfullness cared for a sick or disabled adult
interest done voluntary or charity work
current job situation:sick education: lower secondary
education: first tertiary
education: post secondary
education: upper secondary
education: primary
education: second tertiary
Table 9: Descriptions of the unique features of two sub-populations of the easySHARE data. Features are sorted in descending order of \|\mbox{\boldmath$\beta$}_{r}^{l}\|_{2}, where \mbox{\boldmath$\beta$}_{r}^{l} is the lth row of \mbox{\boldmath$\beta$}_{r} (l=1,\ldots,d). Boldface denotes features that are not selected by Group LASSO.

6.5.5 Detection of Anomaly Tasks

We first use (15) to compute the concordant scores of the tasks, reported in Fig 7. Clear separations are observed on both data sets. We select the one-third of tasks with the highest scores as concordant tasks and the one-third with the lowest scores as anomaly tasks. Descriptions of the concordant and anomaly tasks are listed in Tables 10 and 11, respectively.

(a) LSOA II data set; (b) easySHARE data set
Figure 7: Concordant scores of tasks, estimated by Mix GS. The tasks are reordered according to the scores.
Concordant tasks (top 7) Anomaly tasks (top 8)
have difficulty dressing go to movies, sports, events, etc.
have difficulty doing light housewrk now have asthma
have difficulty using toilet now have arthritis
have difficulty managing medication now have hypertension
have difficulty bathing or showering injured from fall(s)
have difficulty managing money memory of year
have difficulty preparing meals have deafness
get together with relatives
Table 10: Descriptions of tasks of LSOA II data
Concordant tasks (top 5) Anomaly tasks (top 5)
activities of daily living index numeracy score
instrumental activities of daily living indices gone to sport social or other kind of club
mobility index recall of words first trial
appetite give help to others outside the household
orientation to date provided help to family friends or neighbors
Table 11: Descriptions of tasks of easySHARE data

The concordant tasks detected by our methods indeed appear intuitively correlated with one another, whereas the detected anomaly tasks are diverse and carry information that differs from that of the concordant tasks.

For each data set, we apply our Hermit method Mix (and Mix GS) to build two models, one for the non-anomaly tasks (the first two-thirds of the tasks) and one for the anomaly tasks. For the LSOA II data set, the tuned \hat{k}=4 and 1 for the non-anomaly and anomaly tasks, respectively. For the easySHARE data set, the tuned \hat{k}=6 and 2 for the non-anomaly and anomaly tasks, respectively.

The average imputation performances are shown in Table 12. By providing separate models for the anomaly tasks, the performances improve significantly, and Mix GS outperforms Mix, perhaps because the non-anomaly tasks share some relevant features.

LSOA II easySHARE
nMSE aAUC nMSE aAUC
Mix - All tasks 0.6408 0.7525 0.6601 0.7419
Mix - Handle anomalies 0.5979 0.7602 0.6569 0.7370
Mix GS - All tasks 0.6294 0.7481 0.6548 0.7402
Mix GS - Handle anomalies 0.5923 0.7649 0.6462 0.7447
Table 12: Comparison for imputation performances. “All tasks” denotes building one FMR model for all the tasks. “Handle anomalies” denotes building two models for non-anomaly tasks and anomaly tasks, respectively.

6.5.6 Handling Clustered Relationships among Tasks

We adopt the same strategy as that in Section 6.4.5 to construct a similarity matrix and perform dimension reduction for each of the real-world data sets.

For the LSOA II data set, the similarity matrix and the 2D reduction results are shown in Fig 8. In Fig 8(b), the tasks are partitioned into groups, and we apply the k-means algorithm to separate them into 4 groups. Tasks in Group 1 are mainly about current status. The tasks in Group 2 are described as "how often felt sad or depressed in the past 12 months" and "self rated memory". Tasks in Groups 3 and 4 are about having difficulty performing certain actions; Group 3 is similar to Group 4, as reflected in Fig 8(b).

(a) Similarity Matrix; (b) Dimension Reduction by Kernel PCA
Figure 8: Clustered relationships among tasks on LSOA II. (a) Similarity matrix among tasks; the first three tasks are Gaussian tasks and the others are Bernoulli tasks. (b) Relationship among tasks shown by Kernel PCA.

For the easySHARE data set, the similarity matrix and the 2D reduction results are shown in Fig 9. In Fig 9(b), the tasks are likewise partitioned into groups, and we again apply the k-means algorithm to separate them into 4 groups. The task descriptions for each group are shown in Table 13, where the 4 types of interview modules are essentially separated into the 4 groups. The only "misclassified" task, "Orientation to date", appears to be more related to the other tasks in Group 2 than to the tasks in Group 4.

(a) Similarity Matrix; (b) Dimension Reduction by Kernel PCA
Figure 9: Clustered relationships among tasks on easySHARE. (a) Similarity matrix among tasks; the first four tasks are Gaussian tasks and the others are Bernoulli tasks. (b) Relationship among tasks shown by Kernel PCA.
Group Targets Interview module
1 Activities of daily living index Functional Limitation Indices
Instrumental activities of daily living index Functional Limitation Indices
Mobility index Functional Limitation Indices
2 Depression Mental Health
Pessimism Mental Health
Sleep Mental Health
Irritability Mental Health
Appetite Mental Health
Concentration Mental Health
Orientation to date Cognitive Function Indices
3 Provided help to family friends or neighbors Social Support & Network
Gone to sport social or other kind of club Social Support & Network
Give help to others outside the household Social Support & Network
4 Recall of words score Cognitive Function Indices
Numeracy score Cognitive Function Indices
Table 13: Clustered tasks of easySHARE. 15 tasks are clustered into 4 groups.

For each data set, we further apply our Hermit methods Mix and Mix GS to each group of tasks. For the LSOA II data set, the tuned \hat{k}=3, 3, 5 and 2 for Groups 1, 2, 3 and 4, respectively. For the easySHARE data set, the tuned \hat{k}=5, 2, 1 and 1 for Groups 1, 2, 3 and 4, respectively. Imputation performances are shown in Table 14. Performance increases when separate models are built for each group, suggesting that separate models for clustered tasks are more accurate.

LSOA II easySHARE
nMSE aAUC nMSE aAUC
Mix - All tasks 0.6408 0.7525 0.6601 0.7419
Mix - Clustered tasks 0.6370 0.7592 0.6552 0.7439
Mix GS - All tasks 0.6294 0.7481 0.6548 0.7402
Mix GS - Clustered tasks 0.6202 0.7559 0.6533 0.7474
Table 14: Comparison for imputation performances. “All tasks” denotes building one FMR model for all the tasks. “Clustered tasks” denotes building different FMR models for different groups of tasks.

6.5.7 Feature-Based Prediction by MOE

We compare methods that use only the features to predict the targets on both real-world data sets. Our proposed MOE-type Hermit methods, Mix MOE and Mix MOE GS, are compared with the non-FMR methods. We also integrate our strategies for handling anomaly tasks and clustered structure among tasks into both Mix MOE and Mix MOE GS; concretely, we use the anomaly-task detection results from Section 6.5.5 and the task clustering results from Section 6.5.6.

The prediction results are reported in Table 15. Our proposed Hermit methods Mix MOE and Mix MOE GS outperform the baseline methods on LSOA II and on the Gaussian tasks of easySHARE, which is consistent with the results in Table 7. In addition, by integrating our task clustering strategy, the Hermit methods Mix MOE TC and Mix MOE GS TC outperform the other methods on the Gaussian targets while providing comparable results on the Bernoulli tasks. Mix MOE TC and Mix MOE GS TC even outperform Mix MOE Robust and Mix MOE GS Robust on the Gaussian targets, suggesting that it is more accurate to build a specific model for each cluster of tasks.

Comparing Table 15 with Table 7, our MOE methods do not rival our FMR methods. We investigate the reason by showing in Table 16 the concordance between p(\delta_{i,r}=1\mid\mathbf{x}_{i},\hat{\alpha}_{r}) and p(\delta_{i,r}=1\mid y_{ij^{\prime}},j^{\prime}\in\Omega_{i},\mathbf{x}_{i},\hat{\theta}_{2})=\hat{\rho}_{i,r}^{(T)} (\hat{\rho}_{i,r}^{(T)} is defined in equation (21)). In Table 16, the concordances of the conditional probabilities measured by NMI are generally low, especially compared with the results in Table 6, suggesting that on both real-world data sets it is difficult to learn the mixture probabilities from the features.

LSOA II easySHARE
nMSE aAUC nMSE aAUC
LASSO 0.7051 0.7474 0.7869 0.7386
Group LASSO 0.6975 0.7413 0.7897 0.7413
MSMTFL 0.7028 0.7473 0.7803 0.7411
Sep L2 0.7176 0.7392 0.7796 0.7464
GO-MTL 0.8516 0.6972 0.8231 0.7288
CMTL 0.8186 0.7089 0.7958 0.7364
TraceReg 0.7150 0.7408 0.7809 0.7496
SparseTrace 0.6972 0.7475 0.7791 0.7475
RMTL 0.7145 0.7418 0.7808 0.7496
Dirty 0.7032 0.7480 0.7781 0.7486
rMTFL 0.6953 0.7418 0.7781 0.7486
Mix MOE 0.6935 0.7504 0.7991 0.7395
Mix MOE GS 0.7054 0.7438 0.7774 0.7387
Mix MOE Robust 0.6906 0.7436 0.7642 0.7351
Mix MOE GS Robust 0.6981 0.7430 0.7668 0.7344
Mix MOE TC 0.6859 0.7333 0.7584 0.7389
Mix MOE GS TC 0.6925 0.7379 0.7657 0.7367
Table 15: Comparison for prediction performance with non-FMR methods on real data sets. “Robust” denotes adopting the strategy to handle anomaly tasks. “TC” denotes task clustering strategy.
LSOA II easySHARE
Training Testing Training Testing
Mix MOE 0.2745 0.1301 0.1314 0.1068
Mix MOE GS 0.1060 0.0673 0.2527 0.2054
Table 16: The concordance between p(\delta_{i,r}=1\mid\mathbf{x}_{i},\hat{\alpha}_{r}) and p(\delta_{i,r}=1\mid y_{ij^{\prime}},j^{\prime}\in\Omega_{i},\mathbf{x}_{i},\hat{\theta}_{2}). The concordances are measured by the NMI defined in (16).

7 Discussions & Conclusions

In this paper, we propose a novel model, Hermit, to explore heterogeneities in the conditional relationships, the output types, and the information shared among tasks. Based on multivariate-target FMR and MOE models, our model jointly learns tasks with mixed output types, allows incomplete output data, imposes an inner component-wise group \ell_{1} constraint, and handles anomaly tasks and clustered structure among tasks. These key elements are integrated in a unified generalized mixture model setup so that they can benefit from and reinforce each other in discovering the triple heterogeneities in the data. Rigorous theoretical analyses under the high-dimensional framework are provided.

We mainly consider the special setting of MTL in which the multivariate outcomes share the same set of instances and the same set of features, because our main objective is to learn potentially shared sample clusters and feature sets among tasks. However, as stressed in the introduction, the general definition of MTL considers tasks that do not necessarily share the same samples/instances or the same features, such as distributed learning systems (different tasks have entirely different data instances; see Jin et al 2006 and Boyd et al 2011) and multi-source learning systems (different tasks have entirely different feature spaces; see Zhang and Yeung 2011 and Jin et al 2015). For such cases, one can define the specific shared information expected among tasks and then extend our methodology. For example, although tasks may not share the same instances, they could share the same mixture model structure, in which case our model in Section 3 can still be applied to distributed learning systems. Additionally, the tasks could still share the pattern/sparsity in feature selection even if the feature sets differ, e.g., Liu et al (2009) and Gong et al (2012b); one can then build FMR models in which the regression coefficient vectors of the tasks share the same sparsity pattern, achieved by group \ell_{1} penalization. The case of multi-source learning systems can be handled similarly by embedding the features into a shared feature space, e.g., Zhang and Yeung (2011) and Jin et al (2015).

There are many interesting future directions. It is worthwhile to explore the theoretical and empirical performance of non-convex penalties. In real applications, different components may share some features, and overlapping cluster patterns of the conditional relationships should also be considered; both require further investigation. It is also interesting to explore other low-dimensional structures in the natural parameters, e.g., a low-rank structure and its sparse composition (Chen et al, 2012b). Our strategies for handling anomaly tasks and clustered structure among tasks require two stages; it is worthwhile to explore one-stage models that handle such task heterogeneities within a single learning process. More complicated structures among tasks, such as graph-based structures, should also be explored. Our theoretical results cover the method introduced in Section 3 and the robust estimation introduced in Section 4.1. Nonetheless, theoretical guarantees for the other extensions in Section 4 remain challenging because they involve jointly learning complicated relationships among tasks and population heterogeneity, which we will focus on in future research.

Acknowledgements.
The authors would like to thank the editors and reviewers for their valuable suggestions on improving this paper. This work of Jian Liang and Changshui Zhang is (jointly or partly) funded by National Natural Science Foundation of China under Grant No.61473167 and Beijing Natural Science Foundation under Grant No. L172037. Kun Chen’s work is partially supported by U.S. National Science Foundation under Grants DMS-1613295 and IIS-1718798. The work of Fei Wang is supported by National Science Foundation under Grants IIS-1650723 and IIS-1716432.

Appendix A Definitions

Definition 1

Z=(Z_{1},\ldots,Z_{m^{\prime}})^{\rm T}\in\mathbb{R}^{m^{\prime}} has a sub-exponential distribution with parameters (\sigma,v,t) if, for M>t, it holds that

\mathbb{P}(\|Z\|_{\infty}>M)\leq\left\{\begin{array}{ll}\exp\left(-\frac{M^{2}}{\sigma^{2}}\right),&t\leq M\leq\frac{\sigma^{2}}{v},\\ \exp\left(-\frac{M}{v}\right),&M>\frac{\sigma^{2}}{v}.\end{array}\right.

Appendix B The Empirical Process

In order to prove the first part of Theorem 1, namely that the bound in (26) holds with the probability in (25), we first follow Städler et al (2010) to define the empirical process for fixed data points \mathbf{x}_{1},\ldots,\mathbf{x}_{n}. For \tilde{\mathbf{y}}_{i}=(y_{ij},j\in\Omega_{i})^{\rm T}\in\mathbb{R}^{|\Omega_{i}|} and X=(X_{1},\ldots,X_{d}), let

V_{n}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\left(\ell_{\theta}(\mathbf{x}_{i},\tilde{\mathbf{y}}_{i})-\mathbb{E}[\ell_{\theta}(\mathbf{x}_{i},\tilde{\mathbf{y}}_{i})\mid X=\mathbf{x}_{i}]\right).

Fixing some T\geq 1 and \lambda_{0}\geq 0, we define the event \mathcal{T} below, on which the bound in (26) can be proved; the probability of the event \mathcal{T} is therefore the probability in (25).

\mathcal{T}=\left\{\sup_{\theta\in\tilde{\Theta}}\frac{|V_{n}(\theta)-V_{n}(\theta_{0})|}{(\|\mbox{\boldmath$\beta$}-\mbox{\boldmath$\beta$}_{0}\|_{1}+\|\eta-\eta_{0}\|_{2})\vee\lambda_{0}}\leq T\lambda_{0}\right\}. (21)

Note that (21) defines an event in terms of the empirical process, and the bound in (26) will be proved for \hat{\theta} on this event.

For the group-lasso type estimator, we define an event analogous to (21) as follows.

\mathcal{T}_{group}=\left\{\sup_{\theta\in\tilde{\Theta}}\frac{|V_{n}(\theta)-V_{n}(\theta_{0})|}{(\sum_{p}\|\mbox{\boldmath$\beta$}_{\mathcal{G}_{p}}-\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{2}+\|\eta-\eta_{0}\|_{2})\vee\lambda_{0}}\leq T\lambda_{0}\right\}. (22)

Appendix C Lemmas

In order to show that the event \mathcal{T} has large probability, we first invoke the following lemma.

Lemma 2

Under Condition 2, for model (1) with \theta_{0}\in\tilde{\Theta}, with M_{n} and \lambda_{0} defined in (24), there exist constants c_{6},c_{7} depending on K such that, for n\geq c_{7}, we have

\mathbb{P}_{\mathbf{X}}\left(\frac{1}{n}\sum_{i=1}^{n}F(\tilde{\mathbf{y}}_{i})>c_{6}\lambda_{0}^{2}/(mk)\right)\leq\frac{1}{n},

where \mathbb{P}_{\mathbf{X}} denotes the conditional probability given (X_{1}^{\rm T},\ldots,X_{n}^{\rm T})^{\rm T}=(\mathbf{x}_{1}^{\rm T},\ldots,\mathbf{x}_{n}^{\rm T})^{\rm T}=\mathbf{X}, and F(\tilde{\mathbf{y}}_{i})=G_{1}(\tilde{\mathbf{y}}_{i})1\{G_{1}(\tilde{\mathbf{y}}_{i})>M_{n}\}+\mathbb{E}[G_{1}(\tilde{\mathbf{y}}_{i})1\{G_{1}(\tilde{\mathbf{y}}_{i})>M_{n}\}\mid X=\mathbf{x}_{i}] for all i.

A proof is given in Appendix F.

Then, following Corollary 1 in Städler et al (2010), we show below that the event \mathcal{T} has large probability.

Lemma 3

Assume Lemma 2 holds. For model (1) with \theta_{0}\in\tilde{\Theta}, there exist constants c_{7},c_{8},c_{9},c_{10} depending on K such that, with \mathcal{T} defined in (21), for all T\geq c_{10} we have

\mathbb{P}_{\mathbf{X}}(\mathcal{T})\geq 1-c_{9}\exp\left(-\frac{T^{2}(\log n)^{2}\log(d\vee n)}{c_{8}}\right)-\frac{1}{n},\quad\forall n\geq c_{7}.

A proof is given in Appendix G.

Appendix D Corollaries for Models Considering Outlier Samples

When considering outlier samples and modifying the natural parameter model as in (11), we show in this section that similar results hold.

First, as \mbox{\boldmath$\beta$} and \mbox{\boldmath$\zeta$} are treated in a similar way, we denote them jointly by \mbox{\boldmath$\xi$}\in\mathbb{R}^{((d+n)\times m)\times k}, with \xi=\mbox{vec}(\mbox{\boldmath$\xi$})\in\mathbb{R}^{(d+n)mk}, such that for all r=1,\ldots,k,

\mbox{\boldmath$\varphi$}_{r}=\mathbf{X}\mbox{\boldmath$\beta$}_{r}+\mbox{\boldmath$\zeta$}_{r}\ \Rightarrow\ \mbox{\boldmath$\varphi$}_{r}=\mathbf{A}\mbox{\boldmath$\xi$}_{r},\qquad \mathbf{A}=[\mathbf{X},\mathbf{I}_{n}]\in\mathbb{R}^{n\times(d+n)},\quad \mbox{\boldmath$\xi$}_{r}=[\mbox{\boldmath$\beta$}_{r}^{\rm T},\mbox{\boldmath$\zeta$}_{r}^{\rm T}]^{\rm T}\in\mathbb{R}^{(d+n)\times m},

where \mathbf{I}_{n}\in\mathbb{R}^{n\times n} is an identity matrix.

Thus the modification only results in a new design matrix and a new regression coefficient matrix; therefore, we can apply Theorems 1-3 to obtain similar results for the modified models. A sketch of this reformulation is given below.
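As a small illustration of this reformulation (with our own variable names), the augmented design and the recovery of the two coefficient blocks can be written as:

import numpy as np

def augment_design(X):
    # mean-shift model: phi_r = X beta_r + zeta_r = A xi_r with A = [X, I_n]
    n = X.shape[0]
    return np.hstack([X, np.eye(n)])

def split_coefficients(xi_r, d):
    # xi_r: ((d + n), m) stacked coefficients; recover beta_r and zeta_r
    return xi_r[:d], xi_r[d:]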

For lasso-type penalties, denote the set of indices of the non-zero entries of \beta_{0} by S_{\beta}, and the set of indices of the non-zero entries of \zeta_{0} by S_{\zeta}, where \zeta=\mbox{vec}(\mbox{\boldmath$\zeta$}_{1},\ldots,\mbox{\boldmath$\zeta$}_{k}). Denote s=|S_{\beta}|+|S_{\zeta}|. Then, for the entry-wise \ell_{1} penalty in (5) (for \mbox{\boldmath$\beta$}) with \gamma=0 and \mathcal{R}(\mbox{\boldmath$\zeta$})=\lambda\|\zeta\|_{1} (for \mbox{\boldmath$\zeta$}), we need the following modified restricted eigenvalue condition.

Condition 6

For all \beta\in\mathbb{R}^{dmk} and all \zeta\in\mathbb{R}^{nmk} satisfying \|\beta_{S_{\beta}^{c}}\|_{1}+\|\zeta_{S_{\zeta}^{c}}\|_{1}\leq 6(\|\beta_{S_{\beta}}\|_{1}+\|\zeta_{S_{\zeta}}\|_{1}), it holds for some constant \kappa\geq 1 that

\|\beta_{S_{\beta}}\|_{2}^{2}+\|\zeta_{S_{\zeta}}\|_{2}^{2}\leq\kappa^{2}\|\varphi\|_{Q_{n}}^{2}=\frac{\kappa^{2}}{n}\sum_{i=1}^{n}\sum_{j\in\Omega_{i}}\sum_{r=1}^{k}(\mathbf{x}_{i}\mbox{\boldmath$\beta$}_{jr}+\zeta_{ijr})^{2}.
Corollary 1

Consider the Hermit model in (1) with \theta_{0}\in\tilde{\Theta}, and consider the penalized estimator (12) with the \ell_{1} penalties in (5) and \mathcal{R}(\mbox{\boldmath$\zeta$})=\lambda\|\zeta\|_{1}.

(a) Assume Conditions 1-3 and 6 hold. Suppose \sqrt{mk}\lesssim n/M_{n}, and take \lambda>2T\lambda_{0} for some constant T>1. Then, for some constant c>0 and large enough n, with probability at least 1-c\exp\left(-\frac{(\log n)^{2}\log(d\vee n)}{c}\right)-\frac{1}{n}, we have

\bar{\varepsilon}(\hat{\theta}\mid\theta_{0})+2(\lambda-T\lambda_{0})(\|\hat{\beta}_{S_{\beta}^{c}}\|_{1}+\|\hat{\zeta}_{S_{\zeta}^{c}}\|_{1})\leq 4(\lambda+T\lambda_{0})^{2}\kappa^{2}c_{0}^{2}s.

(b) Assume conditions 1-3 hold (without condition 6), and assume

β01+ζ01\displaystyle\|\beta_{0}\|_{1}+\|\zeta_{0}\|_{1} =o(n/((logn)2+2c1log(dn)mk)),\displaystyle=o(\sqrt{n/((\log n)^{2+2c_{1}}\log(d\vee n)mk)}),
mk\displaystyle\sqrt{mk} =o(n/((logn)2+2c1log(dn)))\displaystyle=o(\sqrt{n/((\log n)^{2+2c_{1}}\log(d\vee n))})

as nn\rightarrow\infty. If λ=C(logn)2+2c1log(dn)mk/n\lambda=C\sqrt{(\log n)^{2+2c_{1}}\log(d\vee n)mk/n} for some sufficiently large C>0C>0, then for some constant c>0c>0 and large enough nn, with probability 1cexp((logn)2log(dn)c)1n,1-c\exp\left(-\frac{(\log n)^{2}\log(d\vee n)}{c}\right)-\frac{1}{n}, we have ε¯(θ^θ0)=oP(1).\bar{\varepsilon}(\hat{\theta}\mid\theta_{0})=o_{P}(1).

For group-lasso type penalties, denote

β={p:𝜷0,𝒢β,p𝟎},βc={p:𝜷0,𝒢β,p=𝟎},\displaystyle\mathcal{I}_{\beta}=\{p:\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{\beta,p}}\neq\mathbf{0}\},\ \mathcal{I}_{\beta}^{c}=\{p:\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{\beta,p}}=\mathbf{0}\},
ζ={q:𝜻0,𝒢ζ,q𝟎},ζc={q:𝜻0,𝒢ζ,q=𝟎},\mathcal{I}_{\zeta}=\{q:\mbox{\boldmath$\zeta$}_{0,\mathcal{G}_{\zeta,q}}\neq\mathbf{0}\},\ \mathcal{I}_{\zeta}^{c}=\{q:\mbox{\boldmath$\zeta$}_{0,\mathcal{G}_{\zeta,q}}=\mathbf{0}\},

where 𝜷0,𝒢β,p\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{\beta,p}} and 𝜻0,𝒢ζ,q\mbox{\boldmath$\zeta$}_{0,\mathcal{G}_{\zeta,q}} denote the ppth group of 𝜷0\mbox{\boldmath$\beta$}_{0} and the qqth group of 𝜻0\mbox{\boldmath$\zeta$}_{0}, respectively. Now denote s=|β|+|ζ|s=|\mathcal{I}_{\beta}|+|\mathcal{I}_{\zeta}| with some abuse of notation.
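As a purely illustrative sketch (arranging the coefficient array as a d×(mk) matrix with one group per feature row is an assumption made here for concreteness; the actual groups are defined in the main text), the group Frobenius norms, the group ℓ1 penalty, and the set of nonzero groups could be computed as follows.

import numpy as np

rng = np.random.default_rng(0)
d, m, k = 8, 3, 2
B0 = np.zeros((d, m * k))                         # true coefficient matrix, one row-group per feature
B0[[1, 4], :] = rng.standard_normal((2, m * k))   # only features 1 and 4 are active

group_norms = np.linalg.norm(B0, axis=1)          # Frobenius norm of each (row) group
group_l1_penalty = group_norms.sum()              # sum of group Frobenius norms
nonzero_groups = np.flatnonzero(group_norms > 0)  # groups carrying nonzero coefficients
print(group_l1_penalty, nonzero_groups)           # the count of nonzero groups parallels s in the lasso case above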

Then for group 1\ell_{1} penalties in (27) (for 𝜷\beta) and (𝜻)=qQ𝜻𝒢ζ,qF\mathcal{R}(\mbox{\boldmath$\zeta$})=\sum_{q}^{Q}\|\mbox{\boldmath$\zeta$}_{\mathcal{G}_{\zeta,q}}\|_{F} (for 𝜻\zeta), we need the following modified restricted eigenvalue condition.

Condition 7

For all 𝛃d×mk\mbox{\boldmath$\beta$}\in\mathbb{R}^{d\times mk} and all 𝛇n×mk\mbox{\boldmath$\zeta$}\in\mathbb{R}^{n\times mk} satisfying

pβc𝜷𝒢β,pF+qζc𝜻𝒢ζ,qF6(pβ𝜷𝒢β,pF+qζ𝜻𝒢ζ,qF),\displaystyle\sum_{p\in\mathcal{I}_{\beta}^{c}}\|\mbox{\boldmath$\beta$}_{\mathcal{G}_{\beta,p}}\|_{F}+\sum_{q\in\mathcal{I}_{\zeta}^{c}}\|\mbox{\boldmath$\zeta$}_{\mathcal{G}_{\zeta,q}}\|_{F}\leq 6\left(\sum_{p\in\mathcal{I}_{\beta}}\|\mbox{\boldmath$\beta$}_{\mathcal{G}_{\beta,p}}\|_{F}+\sum_{q\in\mathcal{I}_{\zeta}}\|\mbox{\boldmath$\zeta$}_{\mathcal{G}_{\zeta,q}}\|_{F}\right),

it holds that for some constant κ1\kappa\geq 1,

pβ𝜷𝒢β,pF2+qζ𝜻𝒢ζ,qF2κ2φQn2=κ2ni=1njΩir=1k(𝐱i𝜷jr+ζijr)2.\displaystyle\sum_{p\in\mathcal{I}_{\beta}}\|\mbox{\boldmath$\beta$}_{\mathcal{G}_{\beta,p}}\|_{F}^{2}+\sum_{q\in\mathcal{I}_{\zeta}}\|\mbox{\boldmath$\zeta$}_{\mathcal{G}_{\zeta,q}}\|_{F}^{2}\leq\kappa^{2}\|\varphi\|_{Q_{n}}^{2}=\frac{\kappa^{2}}{n}\sum_{i=1}^{n}\sum_{j\in\Omega_{i}}\sum_{r=1}^{k}(\mathbf{x}_{i}\mbox{\boldmath$\beta$}_{jr}+\zeta_{ijr})^{2}.
Corollary 2

Consider the Hermit  model in (1) with θ0Θ~\theta_{0}\in\tilde{\Theta}, and consider estimator (12) with the group 1\ell_{1} penalties in (27) and (𝛇)=qQ𝛇𝒢ζ,qF\mathcal{R}(\mbox{\boldmath$\zeta$})=\sum_{q}^{Q}\|\mbox{\boldmath$\zeta$}_{\mathcal{G}_{\zeta,q}}\|_{F}.

(a) Assume conditions 1-3 and 7 hold. Suppose mkn/Mn\sqrt{mk}\lesssim n/M_{n}, and take λ>2Tλ0\lambda>2T\lambda_{0} for some constant T>1T>1. For some constant c>0c>0 and large enough nn, with probability 1cexp((logn)2log(dn)c)1n,1-c\exp\left(-\frac{(\log n)^{2}\log(d\vee n)}{c}\right)-\frac{1}{n}, we have

ε¯(θ^θ0)+2(λTλ0)(pβc𝜷^𝒢β,pF+qζc𝜻^𝒢ζ,qF)4(λ+Tλ0)2κ2c02s,\displaystyle\bar{\varepsilon}(\hat{\theta}\mid\theta_{0})+2(\lambda-T\lambda_{0})\biggl{(}\sum_{p\in\mathcal{I}_{\beta}^{c}}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{\beta,p}}\|_{F}+\sum_{q\in\mathcal{I}_{\zeta}^{c}}\|\hat{\mbox{\boldmath$\zeta$}}_{\mathcal{G}_{\zeta,q}}\|_{F}\biggr{)}\leq 4(\lambda+T\lambda_{0})^{2}\kappa^{2}c_{0}^{2}s,

(b) Assume conditions 1-3 hold (without condition 7), and assume

p=1P𝜷0,𝒢β,pF+q=1Q𝜻0,𝒢ζ,qF\displaystyle\sum_{p=1}^{P}\|\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{\beta,p}}\|_{F}+\sum_{q=1}^{Q}\|\mbox{\boldmath$\zeta$}_{0,\mathcal{G}_{\zeta,q}}\|_{F} =o(n/((logn)2+2c1log(dn)mk)),\displaystyle=o(\sqrt{n/((\log n)^{2+2c_{1}}\log(d\vee n)mk)}),
mk\displaystyle\sqrt{mk} =o(n/((logn)2+2c1log(dn)))\displaystyle=o(\sqrt{n/((\log n)^{2+2c_{1}}\log(d\vee n))})

as nn\rightarrow\infty. If λ=C(logn)2+2c1log(dn)mk/n\lambda=C\sqrt{(\log n)^{2+2c_{1}}\log(d\vee n)mk/n} for some sufficiently large C>0C>0, then for some constant c>0c>0 and large enough nn, with probability 1cexp((logn)2log(dn)c)1n,1-c\exp\left(-\frac{(\log n)^{2}\log(d\vee n)}{c}\right)-\frac{1}{n}, we have ε¯(θ^θ0)=oP(1).\bar{\varepsilon}(\hat{\theta}\mid\theta_{0})=o_{P}(1).

Appendix E Proof of Lemma 1

Proof

For a non-negative continuous random variable XX, we have

𝔼[X1{X>M}]\displaystyle\mathbb{E}[X1\{X>M\}] =MtfX(t)𝑑t=M0tfX(t)𝑑x𝑑t\displaystyle=\int_{M}^{\infty}tf_{X}(t)dt=\int_{M}^{\infty}\int_{0}^{t}f_{X}(t)dxdt
=0MMfX(t)𝑑t𝑑x+MxfX(t)𝑑t𝑑x\displaystyle=\int_{0}^{M}\int_{M}^{\infty}f_{X}(t)dtdx+\int_{M}^{\infty}\int_{x}^{\infty}f_{X}(t)dtdx
=M(X>M)+M(X>x)𝑑x.\displaystyle=M\mathbb{P}(X>M)+\int_{M}^{\infty}\mathbb{P}(X>x)dx.

Similarly, we have 𝔼[X21{X>M}]=M2(X>M)+M2x(X>x)𝑑x.\mathbb{E}[X^{2}1\{X>M\}]=M^{2}\mathbb{P}(X>M)+\int_{M}^{\infty}2x\mathbb{P}(X>x)dx.
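As a quick numerical sanity check of these tail identities (illustrative only; the Exponential(1) distribution, the sample size, and the integration grid are arbitrary choices, not part of the model), one can compare a Monte Carlo estimate of the truncated mean with the right-hand side, whose exact value here is (M+1)e^{-M}:

import numpy as np

M = 2.0
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10**5)           # X ~ Exponential(1), chosen for illustration

lhs = np.mean(x * (x > M))                           # Monte Carlo estimate of E[X 1{X > M}]
grid = np.linspace(M, 25.0, 1000)
survival = np.array([np.mean(x > t) for t in grid])  # empirical P(X > x) on a grid
rhs = M * np.mean(x > M) + np.trapz(survival, grid)  # M P(X > M) + integral of the tail probability

print(lhs, rhs, (M + 1.0) * np.exp(-M))              # all three are close to about 0.406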

Now suppose XX is sub-exponential with parameters (σ,v,t)(\sigma,v,t), such that for M>tM>t

(X>M){exp(M2σ2),tMσ2vexp(Mv),Mσ2v,\displaystyle\mathbb{P}(X>M)\leq\left\{\begin{array}[]{ll}\exp\biggl{(}-\frac{M^{2}}{\sigma^{2}}\biggr{)},&t\leq M\leq\frac{\sigma^{2}}{v}\\ \exp\biggl{(}-\frac{M}{v}\biggr{)},&M\geq\frac{\sigma^{2}}{v},\end{array}\right.

we have the following.

If Mσ2vM\leq\frac{\sigma^{2}}{v}, we have

𝔼[X1{X>M}]\displaystyle\mathbb{E}[X1\{X>M\}] =M(X>M)+M(X>x)𝑑x\displaystyle=M\mathbb{P}(X>M)+\int_{M}^{\infty}\mathbb{P}(X>x)dx
Mexp(M2σ2)+Mσ2vexp(x2σ2)𝑑x+σ2vexp(xv)𝑑x\displaystyle\leq M\exp\biggl{(}-\frac{M^{2}}{\sigma^{2}}\biggr{)}+\int_{M}^{\frac{\sigma^{2}}{v}}\exp\biggl{(}-\frac{x^{2}}{\sigma^{2}}\biggr{)}dx+\int_{\frac{\sigma^{2}}{v}}^{\infty}\exp\biggl{(}-\frac{x}{v}\biggr{)}dx
Mexp(M2σ2)+(σ2vM)exp(M2σ2)+vexp(Mv)\displaystyle\leq M\exp\biggl{(}-\frac{M^{2}}{\sigma^{2}}\biggr{)}+(\frac{\sigma^{2}}{v}-M)\exp\biggl{(}-\frac{M^{2}}{\sigma^{2}}\biggr{)}+v\exp\biggl{(}-\frac{M}{v}\biggr{)}
=Mexp(M2σ2)+vexp(Mv)(M+v)exp(M2σ2),\displaystyle=M\exp\biggl{(}-\frac{M^{2}}{\sigma^{2}}\biggr{)}+v\exp\biggl{(}-\frac{M}{v}\biggr{)}\leq(M+v)\exp\biggl{(}-\frac{M^{2}}{\sigma^{2}}\biggr{)},

and similarly, 𝔼[X21{X>M}](M2+2v2+2σ2)exp(M2σ2).\mathbb{E}[X^{2}1\{X>M\}]\leq\biggl{(}M^{2}+2v^{2}+2\sigma^{2}\biggr{)}\exp\biggl{(}-\frac{M^{2}}{\sigma^{2}}\biggr{)}.

If M>σ2vM>\frac{\sigma^{2}}{v}, we have 𝔼[X1{X>M}](M+v)exp(Mv)\mathbb{E}[X1\{X>M\}]\leq(M+v)\exp\biggl{(}-\frac{M}{v}\biggr{)} and 𝔼[X21{X>M}](M2+2v2+2vM)exp(Mv)\mathbb{E}[X^{2}1\{X>M\}]\leq(M^{2}+2v^{2}+2vM)\exp\biggl{(}-\frac{M}{v}\biggr{)}.
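For instance, the second-moment bound in the case M>\sigma^{2}/v follows from the tail formula for \mathbb{E}[X^{2}1\{X>M\}] above by integrating the exponential tail (a short verification):

\begin{aligned}
\mathbb{E}[X^{2}1\{X>M\}] &= M^{2}\mathbb{P}(X>M)+\int_{M}^{\infty}2x\,\mathbb{P}(X>x)\,dx\\
&\leq M^{2}\exp\Bigl(-\frac{M}{v}\Bigr)+\int_{M}^{\infty}2x\exp\Bigl(-\frac{x}{v}\Bigr)\,dx\\
&= M^{2}\exp\Bigl(-\frac{M}{v}\Bigr)+(2vM+2v^{2})\exp\Bigl(-\frac{M}{v}\Bigr)=(M^{2}+2vM+2v^{2})\exp\Bigl(-\frac{M}{v}\Bigr).
\end{aligned}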

Then, for some constants c1,c2,c3,c4,c5>0c_{1},c_{2},c_{3},c_{4},c_{5}>0 and a non-negative continuous random variable XX which is sub-exponential with parameters (σ,v,t)(\sigma,v,t), for M>c4>tM>c_{4}>t and c=2+3c1c^{\prime}=2+\frac{3}{c_{1}}, we have

𝔼[X1{X>M}][c3(Mc2)c+c5]exp{(Mc2)1/c1},\displaystyle\mathbb{E}[X1\{X>M\}]\leq\biggl{[}c_{3}\biggl{(}\frac{M}{c_{2}}\biggr{)}^{c^{\prime}}+c_{5}\biggr{]}\exp\biggl{\{}-\biggl{(}\frac{M}{c_{2}}\biggr{)}^{1/c_{1}}\biggr{\}},
𝔼[X21{X>M}][c3(Mc2)c+c5]2exp{2(Mc2)1/c1}.\displaystyle\mathbb{E}[X^{2}1\{X>M\}]\leq\biggl{[}c_{3}\biggl{(}\frac{M}{c_{2}}\biggr{)}^{c^{\prime}}+c_{5}\biggr{]}^{2}\exp\biggl{\{}-2\biggl{(}\frac{M}{c_{2}}\biggr{)}^{1/c_{1}}\biggr{\}}.

If tMσ2vt\leq M\leq\frac{\sigma^{2}}{v}, take c1=1/2,c2=2σ,c3=16σ8c_{1}=1/2,c_{2}=\sqrt{2}\sigma,c_{3}=16\sigma^{8}; and if Mσ2vM\geq\frac{\sigma^{2}}{v}, take c1=1,c2=2v,c3=32v5c_{1}=1,c_{2}=2v,c_{3}=32v^{5}. In both cases, c5=2(v+σ)c_{5}=\sqrt{2}(v+\sigma).

For non-negative discrete variables, the result is the same.

The result of Lemma 1 then follows from the result above, the fact that 𝐲~i\tilde{\mathbf{y}}_{i} has a finite mixture distribution for i=1,,ni=1,\ldots,n, and the following observation.

When the dispersion parameter ϕ\phi is known, for a constant cKc_{K} depending on KK, we have

G1(𝐲~i)=eKmaxjΩi|yij|+cK,i=1,,n.G_{1}(\tilde{\mathbf{y}}_{i})=e^{K}\max_{j\in\Omega_{i}}|y_{ij}|+c_{K},\ i=1,\ldots,n.

Appendix F Proof of Lemma 2

Proof

Under Condition 2, with Mn=c2(logn)c1M_{n}=c_{2}(\log n)^{c_{1}} and λ0\lambda_{0} defined in (24), there is a constant c6c_{6} depending on KK such that, for i=1,,ni=1,\ldots,n, we have

𝔼[|G1(𝐲~i)|1{|G1(𝐲~i)|>Mn}]c6λ02/(mk),\displaystyle\mathbb{E}[|G_{1}(\tilde{\mathbf{y}}_{i})|1\{|G_{1}(\tilde{\mathbf{y}}_{i})|>M_{n}\}]\leq c_{6}\lambda_{0}^{2}/(mk),
𝔼[|G1(𝐲~i)|21{|G1(𝐲~i)|>Mn}]c62λ04/(mk)2.\displaystyle\mathbb{E}[|G_{1}(\tilde{\mathbf{y}}_{i})|^{2}1\{|G_{1}(\tilde{\mathbf{y}}_{i})|>M_{n}\}]\leq c_{6}^{2}\lambda_{0}^{4}/(mk)^{2}.

Then, using the bounds above and Chebyshev's inequality, we can get

𝐗(1ni=1nG1(𝐲~i)1{G1(𝐲~i)>Mn}+𝔼[G1(𝐲~i)1{G1(𝐲~i)>Mn}]>3c6λ02/(mk))\displaystyle\mathbb{P}_{\mathbf{X}}\biggl{(}\frac{1}{n}\sum_{i=1}^{n}G_{1}(\tilde{\mathbf{y}}_{i})1\{G_{1}(\tilde{\mathbf{y}}_{i})>M_{n}\}+\mathbb{E}[G_{1}(\tilde{\mathbf{y}}_{i})1\{G_{1}(\tilde{\mathbf{y}}_{i})>M_{n}\}]>3c_{6}\lambda_{0}^{2}/(mk)\biggr{)}
𝐗(1ni=1nG1(𝐲~i)1{G1(𝐲~i)>Mn}𝔼[G1(𝐲~i)1{G1(𝐲~i)>Mn}]>c6λ02/(mk))\displaystyle\leq\mathbb{P}_{\mathbf{X}}\biggl{(}\frac{1}{n}\sum_{i=1}^{n}G_{1}(\tilde{\mathbf{y}}_{i})1\{G_{1}(\tilde{\mathbf{y}}_{i})>M_{n}\}-\mathbb{E}[G_{1}(\tilde{\mathbf{y}}_{i})1\{G_{1}(\tilde{\mathbf{y}}_{i})>M_{n}\}]>c_{6}\lambda_{0}^{2}/(mk)\biggr{)}
𝔼[|G1(𝐲~i)|21{|G1(𝐲~i)|>Mn}]nm2k2c62λ041n.\displaystyle\leq\frac{\mathbb{E}[|G_{1}(\tilde{\mathbf{y}}_{i})|^{2}1\{|G_{1}(\tilde{\mathbf{y}}_{i})|>M_{n}\}]}{n}\frac{m^{2}k^{2}}{c_{6}^{2}\lambda_{0}^{4}}\leq\frac{1}{n}.

Appendix G Proof of Lemma 3

Proof

We follow Städler et al (2010) to first establish an Entropy Lemma and then prove Lemma 3.

We use the following norm Pn\|\cdot\|_{P_{n}}, introduced in the proof of Lemma 2 in Städler et al (2010), and write H(,,Pn)H(\cdot,\mathcal{H},\|\cdot\|_{P_{n}}) for the metric entropy (the logarithm of the covering number; see Van de Geer (2000)) of a collection \mathcal{H} of functions on 𝒳×𝒴\mathcal{X}\times\mathcal{Y}, equipped with the metric induced by this norm:

h(,)Pn=1ni=1nh2(𝐱i,𝐲~i).\displaystyle\|h(\cdot,\cdot)\|_{P_{n}}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}h^{2}(\mathbf{x}_{i},\tilde{\mathbf{y}}_{i})}.
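As a small illustrative aside (the function h and the toy data below are placeholders, not objects from the paper), this empirical norm is simply a root-mean-square over the sample:

import numpy as np

def empirical_norm(h, X, Y):
    # ||h||_{P_n} = sqrt( (1/n) * sum_i h(x_i, y_i)^2 )
    values = np.array([h(x, y) for x, y in zip(X, Y)])
    return np.sqrt(np.mean(values ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))   # toy features
Y = rng.standard_normal((20, 3))   # toy responses
print(empirical_norm(lambda x, y: x[0] * y[0], X, Y))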

Define Θ~(ϵ)={θΘ~:𝜷𝜷01+ηη02ϵ}\tilde{\Theta}(\epsilon)=\{\theta\in\tilde{\Theta}:\|\mbox{\boldmath$\beta$}-\mbox{\boldmath$\beta$}_{0}\|_{1}+\|\eta-\eta_{0}\|_{2}\leq\epsilon\}.

Lemma 4

(Entropy Lemma) For a constant c12>0c_{12}>0, for all u>0u>0 and Mn>0M_{n}>0, we have

H(u,{(θθ)1{G1Mn}:θΘ~(ϵ)},Pn)c12mkϵ2Mn2u2log(mkϵMnu).\displaystyle H\biggl{(}u,\biggl{\{}(\ell_{\theta}-\ell_{\theta^{\star}})1\{G_{1}\leq M_{n}\}:\theta\in\tilde{\Theta}(\epsilon)\biggr{\}},\|\cdot\|_{P_{n}}\biggr{)}\leq c_{12}\frac{mk\epsilon^{2}M_{n}^{2}}{u^{2}}\log\biggl{(}\frac{\sqrt{mk}\epsilon M_{n}}{u}\biggr{)}.
Proof

(of the Entropy Lemma) This proof differs from that of the Entropy Lemma in the proof of Lemma 2 of Städler et al (2010) only in the notation and in accounting for multivariate responses.

For multivariate responses we have for i=1,,ni=1,\ldots,n,

|θ(𝐱i,𝐲~i)θ(𝐱i,𝐲~i)|2\displaystyle|\ell_{\theta}(\mathbf{x}_{i},\tilde{\mathbf{y}}_{i})-\ell_{\theta^{\prime}}(\mathbf{x}_{i},\tilde{\mathbf{y}}_{i})|^{2} G12(𝐲~i)ψiψi12dψG12(𝐲~i)ψiψi22\displaystyle\leq G_{1}^{2}(\tilde{\mathbf{y}}_{i})\|\psi_{i}-\psi^{\prime}_{i}\|_{1}^{2}\leq d_{\psi}G_{1}^{2}(\tilde{\mathbf{y}}_{i})\|\psi_{i}-\psi^{\prime}_{i}\|_{2}^{2}
=dψG12(𝐲~i)[r=1kjΩi|𝐱i(𝜷rj𝜷rj)|2+ηη22],\displaystyle=d_{\psi}G_{1}^{2}(\tilde{\mathbf{y}}_{i})\biggl{[}\sum_{r=1}^{k}\sum_{j\in\Omega_{i}}|\mathbf{x}_{i}(\mbox{\boldmath$\beta$}_{rj}-\mbox{\boldmath$\beta$}^{\prime}_{rj})|^{2}+\|\eta-\eta^{\prime}\|_{2}^{2}\biggr{]},

where dψ=(2m+1)kd_{\psi}=(2m+1)k is the maximal dimension of ψi\psi_{i} over i=1,,ni=1,\ldots,n.

Under the definition of the norm Pn\|\cdot\|_{P_{n}} we have

(θθ)1{G1Mn}Pn2dψMn2[1ni=1nr=1kjΩi|𝐱i(𝜷rj𝜷rj)|2+ηη22].\displaystyle\|(\ell_{\theta}-\ell_{\theta^{\prime}})1\{G_{1}\leq M_{n}\}\|_{P_{n}}^{2}\leq d_{\psi}M_{n}^{2}\left[\frac{1}{n}\sum_{i=1}^{n}\sum_{r=1}^{k}\sum_{j\in\Omega_{i}}|\mathbf{x}_{i}(\mbox{\boldmath$\beta$}_{rj}-\mbox{\boldmath$\beta$}^{\prime}_{rj})|^{2}+\|\eta-\eta^{\prime}\|_{2}^{2}\right].

Then by the result of Städler et al (2010) we have

H(u,{ηdη:ηη02ϵ},2)dηlog(5ϵu),\displaystyle H(u,\{\eta\in\mathbb{R}^{d_{\eta}}:\|\eta-\eta_{0}\|_{2}\leq\epsilon\},\|\cdot\|_{2})\leq d_{\eta}\log\biggl{(}\frac{5\epsilon}{u}\biggr{)},

where dη=(m+1)kd_{\eta}=(m+1)k is the dimension of η\eta.

Following Städler et al (2010), we apply Lemma 2.6.11 of Van Der Vaart and Wellner (1996) to obtain the bound

H(2u,{r=1kjΩi𝐱i(𝜷rj𝜷0,rj):𝜷𝜷01ϵ},Pn)(ϵ2u2+1)log(1+kmd).\displaystyle H\biggl{(}2u,\biggl{\{}\sum_{r=1}^{k}\sum_{j\in\Omega_{i}}\mathbf{x}_{i}(\mbox{\boldmath$\beta$}_{rj}-{\mbox{\boldmath$\beta$}}_{0,rj}):\|\mbox{\boldmath$\beta$}-\mbox{\boldmath$\beta$}_{0}\|_{1}\leq\epsilon\biggr{\}},\|\cdot\|_{P_{n}}\biggr{)}\leq\biggl{(}\frac{\epsilon^{2}}{u^{2}}+1\biggr{)}\log(1+kmd).

Thus we can get

H(3dψMnu,{(θθ0)1{G1Mn}:θΘ~(ϵ)},Pn)\displaystyle H\biggl{(}3\sqrt{d_{\psi}}M_{n}u,\biggl{\{}(\ell_{\theta}-\ell_{\theta_{0}})1\{G_{1}\leq M_{n}\}:\theta\in\tilde{\Theta}(\epsilon)\biggr{\}},\|\cdot\|_{P_{n}}\biggr{)}
(ϵ2u2+1+dη)(log(1+kmd)+log(5ϵu)).\displaystyle\leq\biggl{(}\frac{\epsilon^{2}}{u^{2}}+1+d_{\eta}\biggr{)}\biggl{(}\log(1+kmd)+\log\biggl{(}\frac{5\epsilon}{u}\biggr{)}\biggr{)}.

Now we turn to prove Lemma 3.

We follow Städler et al (2010) to use the truncated version of the empirical process below.

Vntrunc(θ)=1ni=1n(θ(𝐱i,𝐲~i)1{G1(𝐲~i)Mn}𝔼[θ(𝐱i,𝐲~i)1{G1(𝐲~i)Mn}X=𝐱i].)\displaystyle V_{n}^{trunc}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\biggl{(}\ell_{\theta}(\mathbf{x}_{i},\tilde{\mathbf{y}}_{i})1\{G_{1}(\tilde{\mathbf{y}}_{i})\leq M_{n}\}-\mathbb{E}[\ell_{\theta}(\mathbf{x}_{i},\tilde{\mathbf{y}}_{i})1\{G_{1}(\tilde{\mathbf{y}}_{i})\leq M_{n}\}\mid X=\mathbf{x}_{i}].\biggr{)}

We follow Städler et al (2010) and apply Lemma 3.2 in Van de Geer (2000) and a conditional version of Lemma 3.3 in Van de Geer (2000) to the class

{(θθ0)1{G1Mn}:θΘ~(ϵ)},ϵ>0.\displaystyle\biggl{\{}(\ell_{\theta}-\ell_{\theta_{0}})1\{G_{1}\leq M_{n}\}:\theta\in\tilde{\Theta}(\epsilon)\biggr{\}},\forall\epsilon>0.

For some constants {ct}t>12\{c_{t}\}_{t>12} depending on KK and Λmax\Lambda_{\max} in Condition 2 of Städler et al (2010), using the notation of Lemma 3.2 in Van de Geer (2000), we follow Städler et al (2010) to choose δ=c13Tϵλ0\delta=c_{13}T\epsilon\lambda_{0} and R=c14(mkϵ1)MnR=c_{14}(\sqrt{mk}\epsilon\wedge 1)M_{n}.

Thus, by choosing Mn=c2(logn)c1M_{n}=c_{2}(\log n)^{c_{1}}, we can satisfy the condition of Lemma 3.2 of Van de Geer (2000) and obtain

ϵ/c15RH1/2(u,{(θθ)1{G1Mn}:θΘ~(ϵ)},Pn)duR\displaystyle\int_{\epsilon/c_{15}}^{R}H^{1/2}\biggl{(}u,\biggl{\{}(\ell_{\theta}-\ell_{\theta^{\star}})1\{G_{1}\leq M_{n}\}:\theta\in\tilde{\Theta}(\epsilon)\biggr{\}},\|\cdot\|_{P_{n}}\biggr{)}du\vee R
=\displaystyle= ϵ/c15c14mk(ϵ1)Mnc12(mkϵMnu)log1/2(mkϵMnu)𝑑u(c14(ϵ1)Mn)\displaystyle\int_{\epsilon/c_{15}}^{c_{14}\sqrt{mk}(\epsilon\wedge 1)M_{n}}c_{12}\biggl{(}\frac{\sqrt{mk}\epsilon M_{n}}{u}\biggr{)}\log^{1/2}\biggl{(}\frac{\sqrt{mk}\epsilon M_{n}}{u}\biggr{)}du\vee(c_{14}(\epsilon\wedge 1)M_{n})
\displaystyle\leq 23c12mkϵMn[log3/2(c15mkMn)log3/2(mkϵMnc14mk(ϵ1)Mn)](c14mk(ϵ1)Mn)\displaystyle\frac{2}{3}c_{12}\sqrt{mk}\epsilon M_{n}[\log^{3/2}(c_{15}\sqrt{mk}M_{n})-\log^{3/2}(\frac{\sqrt{mk}\epsilon M_{n}}{c_{14}\sqrt{mk}(\epsilon\wedge 1)M_{n}})]\vee(c_{14}\sqrt{mk}(\epsilon\wedge 1)M_{n})
\displaystyle\leq 23c12mkϵMnlog3/2(c15mkMn)\displaystyle\frac{2}{3}c_{12}\sqrt{mk}\epsilon M_{n}\log^{3/2}(c_{15}\sqrt{mk}M_{n})
\displaystyle\leq c16mkϵMnlog3/2(n)(by choosingMn=c2(logn)c1,andmkc17nMn)\displaystyle c_{16}\sqrt{mk}\epsilon M_{n}\log^{3/2}(n)\quad(\mbox{by choosing}\ M_{n}=c_{2}(\log n)^{c_{1}},\mbox{and}\ \sqrt{mk}\leq c_{17}\frac{n}{M_{n}})
\displaystyle\leq c18nTϵλ0n(δϵ).\displaystyle c_{18}\sqrt{n}T\epsilon\lambda_{0}\leq\sqrt{n}(\delta-\epsilon).

For the rest, we can apply Lemma 3.2 of Van de Geer (2000) to obtain the same result as Lemma 2 of Städler et al (2010).

So we have

supθΘ~|Vntrunc(θ)Vntrunc(θ0)|(𝜷𝜷01+ηη02)λ02c23Tλ0\displaystyle\sup_{\theta\in\tilde{\Theta}}\frac{|V_{n}^{trunc}(\theta)-V_{n}^{trunc}(\theta_{0})|}{(\|\mbox{\boldmath$\beta$}-\mbox{\boldmath$\beta$}_{0}\|_{1}+\|\eta-\eta_{0}\|_{2})\vee\lambda_{0}}\leq 2c_{23}T\lambda_{0}

with probability at least 1c9exp[T2(logn)2log(dn)c82].1-c_{9}\exp\biggl{[}-\frac{T^{2}(\log n)^{2}\log(d\vee n)}{c_{8}^{2}}\biggr{]}.

Finally, for the case G1(𝐲~i)>MnG_{1}(\tilde{\mathbf{y}}_{i})>M_{n}, for i=1,,ni=1,\ldots,n, we have

|(θ(𝐱i,𝐲~i)θ0(𝐱i,𝐲~i))1{G1(𝐲~i)>Mn}|dψKG1(𝐲~i)1{G1(𝐲~i)>Mn},\displaystyle|(\ell_{\theta}(\mathbf{x}_{i},\tilde{\mathbf{y}}_{i})-\ell_{\theta_{0}}(\mathbf{x}_{i},\tilde{\mathbf{y}}_{i}))1\{G_{1}(\tilde{\mathbf{y}}_{i})>M_{n}\}|\leq d_{\psi}KG_{1}(\tilde{\mathbf{y}}_{i})1\{G_{1}(\tilde{\mathbf{y}}_{i})>M_{n}\},

and

|(Vntrunc(θ)Vntrunc(θ0))(Vn(θ)Vn(θ0))|(𝜷𝜷01+ηη02)λ0\displaystyle\frac{|(V_{n}^{trunc}(\theta)-V_{n}^{trunc}(\theta_{0}))-(V_{n}(\theta)-V_{n}(\theta_{0}))|}{(\|\mbox{\boldmath$\beta$}-\mbox{\boldmath$\beta$}_{0}\|_{1}+\|\eta-\eta_{0}\|_{2})\vee\lambda_{0}}
dψKnλ0i=1n(G1(𝐲~i)1{G1(𝐲~i)>Mn}+𝔼[G1(𝐲~i)1{G1(𝐲~i)>Mn}X=𝐱i]).\displaystyle\leq\frac{d_{\psi}K}{n\lambda_{0}}\sum_{i=1}^{n}\biggl{(}G_{1}(\tilde{\mathbf{y}}_{i})1\{G_{1}(\tilde{\mathbf{y}}_{i})>M_{n}\}+\mathbb{E}[G_{1}(\tilde{\mathbf{y}}_{i})1\{G_{1}(\tilde{\mathbf{y}}_{i})>M_{n}\}\mid X=\mathbf{x}_{i}]\biggr{)}.

The probability with which the following inequality holds under our model is given by Lemma 2.

dψKnλ0i=1n(G1(𝐲~i)1{G1(𝐲~i)>Mn}+𝔼[G1(𝐲~i)1{G1(𝐲~i)>Mn}X=𝐱i])c23Tλ0,\displaystyle\frac{d_{\psi}K}{n\lambda_{0}}\sum_{i=1}^{n}\biggl{(}G_{1}(\tilde{\mathbf{y}}_{i})1\{G_{1}(\tilde{\mathbf{y}}_{i})>M_{n}\}+\mathbb{E}[G_{1}(\tilde{\mathbf{y}}_{i})1\{G_{1}(\tilde{\mathbf{y}}_{i})>M_{n}\}\mid X=\mathbf{x}_{i}]\biggr{)}\leq c_{23}T\lambda_{0},

where dψ=2(m+1)kd_{\psi}=2(m+1)k.

Appendix H Proof of Theorem 1

Proof

This proof mostly follows that of Theorem 3 of Städler et al (2010); the only difference is in the notation, so we omit the details.

Appendix I Proof of Theorem 2

Proof

This proof also mostly follows that of Theorem 5 of Städler et al (2010); the differences are in the notation and the choice of MnM_{n}.

If the event 𝒯\mathcal{T} happens, then with Mn=c2(logn)c1M_{n}=c_{2}(\log n)^{c_{1}} for some constants 0c1,c2<0\leq c_{1},c_{2}<\infty, where c2c_{2} depends on KK, and

λ0=mkMnlognlog(dn)/n=c2mk(logn)2+2c1log(dn)/n,\lambda_{0}=\sqrt{mk}M_{n}\log n\sqrt{\log(d\vee n)/n}=c_{2}\sqrt{mk(\log n)^{2+2c_{1}}\log(d\vee n)/n},

we have

ε¯(ψ^ψ0)+λβ^1Tλ0[(β^β01+ηη02)λ0]+λβ01+ε¯(ψ0ψ0).\displaystyle\bar{\varepsilon}(\hat{\psi}\mid\psi_{0})+\lambda\|\hat{\beta}\|_{1}\leq T\lambda_{0}[(\|\hat{\beta}-\beta_{0}\|_{1}+\|\eta-\eta_{0}\|_{2})\vee\lambda_{0}]+\lambda\|\beta_{0}\|_{1}+\bar{\varepsilon}(\psi_{0}\mid\psi_{0}).

Under the definition of θΘ~\theta\in\tilde{\Theta} in (23) we have ηη022K\|\eta-\eta_{0}\|_{2}\leq 2K. Since ε¯(ψ0ψ0)=0\bar{\varepsilon}(\psi_{0}\mid\psi_{0})=0, we have, for nn sufficiently large,

ε¯(ψ^ψ0)+λβ^1Tλ0(β^1+β01+2K)+λβ01\displaystyle\bar{\varepsilon}(\hat{\psi}\mid\psi_{0})+\lambda\|\hat{\beta}\|_{1}\leq T\lambda_{0}(\|\hat{\beta}\|_{1}+\|\beta_{0}\|_{1}+2K)+\lambda\|\beta_{0}\|_{1}
ε¯(ψ^ψ0)+(λTλ0)β^1Tλ02K+(λ+Tλ0)β01\displaystyle\rightarrow\bar{\varepsilon}(\hat{\psi}\mid\psi_{0})+(\lambda-T\lambda_{0})\|\hat{\beta}\|_{1}\leq T\lambda_{0}2K+(\lambda+T\lambda_{0})\|\beta_{0}\|_{1}

Since C>0C>0 is sufficiently large, we have λ2Tλ0\lambda\geq 2T\lambda_{0}.

Using the conditions on β01\|\beta_{0}\|_{1} and mk\sqrt{mk}, we have both Tλ02K=o(1)T\lambda_{0}2K=o(1) and (λ+Tλ0)β01=o(1)(\lambda+T\lambda_{0})\|\beta_{0}\|_{1}=o(1), so ε¯(ψ^ψ0)0(n)\bar{\varepsilon}(\hat{\psi}\mid\psi_{0})\rightarrow 0\ (n\rightarrow\infty).

Finally, since the event 𝒯\mathcal{T} holds with high probability, we have ε¯(θ^λθ0)=oP(1)(n).\bar{\varepsilon}(\hat{\theta}_{\lambda}\mid\theta_{0})=o_{P}(1)\ (n\rightarrow\infty).

Appendix J Proof of Theorem 3

Proof

First we discuss the bound for the probability of 𝒯group\mathcal{T}_{group} in (22).

The difference between 𝒯group\mathcal{T}_{group} and 𝒯\mathcal{T} in (21) only involves the following entropy term from the Entropy Lemma in the proof of Lemma 3.

H(2u,{r=1kjΩi𝐱i(𝜷rj𝜷0,rj):p𝜷𝒢p𝜷0,𝒢pFϵ},Pn),fori=1,n,\displaystyle H\biggl{(}2u,\biggl{\{}\sum_{r=1}^{k}\sum_{j\in\Omega_{i}}\mathbf{x}_{i}(\mbox{\boldmath$\beta$}_{rj}-{\mbox{\boldmath$\beta$}}_{0,rj}):\sum_{p}\|\mbox{\boldmath$\beta$}_{\mathcal{G}_{p}}-\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F}\leq\epsilon\biggr{\}},\|\cdot\|_{P_{n}}\biggr{)},\ \mbox{for}\ i=1\ldots,n,

where the constraint p𝜷𝒢p𝜷0,𝒢pFϵ\sum_{p}\|\mbox{\boldmath$\beta$}_{\mathcal{G}_{p}}-\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F}\leq\epsilon still defines a convex set for 𝜷\beta in the metric space equipped with the metric induced by the norm Pn\|\cdot\|_{P_{n}} defined in the proof of Lemma 3. Thus the condition of Lemma 2.6.11 of Van Der Vaart and Wellner (1996) is still satisfied, and that lemma can again be applied to give

H(2u,{r=1kjΩi𝐱i(𝜷rj𝜷0,rj):p𝜷𝒢p𝜷0,𝒢pFϵ},Pn)\displaystyle H\biggl{(}2u,\biggl{\{}\sum_{r=1}^{k}\sum_{j\in\Omega_{i}}\mathbf{x}_{i}(\mbox{\boldmath$\beta$}_{rj}-{\mbox{\boldmath$\beta$}}_{0,rj}):\sum_{p}\|\mbox{\boldmath$\beta$}_{\mathcal{G}_{p}}-\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F}\leq\epsilon\biggr{\}},\|\cdot\|_{P_{n}}\biggr{)}
(ϵ2u2+1)log(1+kmd),fori=1,n.\displaystyle\leq\biggl{(}\frac{\epsilon^{2}}{u^{2}}+1\biggr{)}\log(1+kmd),\ \mbox{for}\ i=1\ldots,n.

So the probability bound for the event 𝒯group\mathcal{T}_{group} takes the same form as that in Lemma 3.

Next we establish the bound on the average excess risk and the feature selection result.

If the event 𝒯group\mathcal{T}_{group} happens, we have

ε¯(ψ^ψ0)+λp𝜷^𝒢pF\displaystyle\bar{\varepsilon}(\hat{\psi}\mid\psi_{0})+\lambda\sum_{p}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}\|_{F} Tλ0[(𝒢p𝜷^𝒢p𝜷0,𝒢pF+ηη02)λ0]\displaystyle\leq T\lambda_{0}\biggl{[}\biggl{(}\sum_{\mathcal{G}_{p}}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}-\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F}+\|\eta-\eta_{0}\|_{2}\biggr{)}\vee\lambda_{0}\biggr{]}
+λp𝜷0,𝒢pF+ε¯(ψ0ψ0).\displaystyle+\lambda\sum_{p}\|\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F}+\bar{\varepsilon}(\psi_{0}\mid\psi_{0}).

Using Condition 3 we have ε¯(ψ0ψ0)=0\bar{\varepsilon}(\psi_{0}\mid\psi_{0})=0 and ε¯(ψ^ψ0)ψ^ψ0Qn2/c02.\bar{\varepsilon}(\hat{\psi}\mid\psi_{0})\geq{\|\hat{\psi}-\psi_{0}\|_{Q_{n}}^{2}}/{c_{0}^{2}}.

Case 1 When the following is true:

p𝜷^𝒢p𝜷0,𝒢pF+η^η02λ0,\displaystyle\sum_{p}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}-\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F}+\|\hat{\eta}-\eta_{0}\|_{2}\leq\lambda_{0},

we have

ε¯(ψ^ψ0)\displaystyle\bar{\varepsilon}(\hat{\psi}\mid\psi_{0}) Tλ02+λp𝜷^𝒢p𝜷0,𝒢pF+ε¯(ψ0ψ0)(λ+Tλ0)λ0.\displaystyle\leq T\lambda_{0}^{2}+\lambda\sum_{p}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}-\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F}+\bar{\varepsilon}(\psi_{0}\mid\psi_{0})\leq(\lambda+T\lambda_{0})\lambda_{0}.

Case 2 When the following is true:

p𝜷^𝒢p𝜷0,𝒢pF+η^η02λ0,\displaystyle\sum_{p}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}-\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F}+\|\hat{\eta}-\eta_{0}\|_{2}\geq\lambda_{0},
Tλ0η^η02(λ+Tλ0)p𝜷^𝒢p𝜷0,𝒢pF.\displaystyle T\lambda_{0}\|\hat{\eta}-\eta_{0}\|_{2}\geq(\lambda+T\lambda_{0})\sum_{p\in\mathcal{I}}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}-\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F}.

As pc𝜷0,𝒢pF=0\sum_{p\in\mathcal{I}^{c}}\|\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F}=0, we have

ε¯(ψ^ψ0)+(λTλ0)pc𝜷^𝒢pF2Tλ0η^η02\displaystyle\bar{\varepsilon}(\hat{\psi}\mid\psi_{0})+(\lambda-T\lambda_{0})\sum_{p\in\mathcal{I}^{c}}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}\|_{F}\leq 2T\lambda_{0}\|\hat{\eta}-\eta_{0}\|_{2}
2T2λ02c02+η^η022/(2c02)2T2λ02c02+ε¯(ψ^ψ0)/2.\displaystyle\leq 2T^{2}\lambda_{0}^{2}c_{0}^{2}+\|\hat{\eta}-\eta_{0}\|_{2}^{2}/(2c_{0}^{2})\leq 2T^{2}\lambda_{0}^{2}c_{0}^{2}+\bar{\varepsilon}(\hat{\psi}\mid\psi_{0})/2.

Then we get

ε¯(ψ^ψ0)+2(λTλ0)pc𝜷^𝒢pF4T2λ02c02.\displaystyle\bar{\varepsilon}(\hat{\psi}\mid\psi_{0})+2(\lambda-T\lambda_{0})\sum_{p\in\mathcal{I}^{c}}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}\|_{F}\leq 4T^{2}\lambda_{0}^{2}c_{0}^{2}.

Case 3 When the following is true:

p𝜷^𝒢p𝜷0,𝒢pF+η^η02λ0,\displaystyle\sum_{p}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}-\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F}+\|\hat{\eta}-\eta_{0}\|_{2}\geq\lambda_{0},
Tλ0η^η02(λ+Tλ0)p𝜷^𝒢p𝜷0,𝒢pF,\displaystyle T\lambda_{0}\|\hat{\eta}-\eta_{0}\|_{2}\leq(\lambda+T\lambda_{0})\sum_{p\in\mathcal{I}}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}-\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F},

we have

ε¯(ψ^ψ0)+(λTλ0)pc𝜷^𝒢pF2(λ+Tλ0)p𝜷^𝒢p𝜷0,𝒢pF.\displaystyle\bar{\varepsilon}(\hat{\psi}\mid\psi_{0})+(\lambda-T\lambda_{0})\sum_{p\in\mathcal{I}^{c}}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}\|_{F}\leq 2(\lambda+T\lambda_{0})\sum_{p\in\mathcal{I}}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}-\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F}.

Thus we have

pc𝜷^𝒢pF6p𝜷^𝒢p𝜷0,𝒢pF,\displaystyle\sum_{p\in\mathcal{I}^{c}}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}\|_{F}\leq 6\sum_{p\in\mathcal{I}}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}-\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F},

so we can apply Condition 5 to 𝜷^𝜷0\hat{\mbox{\boldmath$\beta$}}-\mbox{\boldmath$\beta$}_{0} to obtain

ε¯(ψ^ψ0)+(λTλ0)pc𝜷^𝒢pF2(λ+Tλ0)sp𝜷^𝒢p𝜷^0,𝒢pF\displaystyle\bar{\varepsilon}(\hat{\psi}\mid\psi_{0})+(\lambda-T\lambda_{0})\sum_{p\in\mathcal{I}^{c}}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}\|_{F}\leq 2(\lambda+T\lambda_{0})\sqrt{s}\sum_{p\in\mathcal{I}}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}-\hat{\mbox{\boldmath$\beta$}}_{0,\mathcal{G}_{p}}\|_{F}
2(λ+Tλ0)sκφ^(φ0)Qn2(λ+Tλ0)2sκ2c02+φ^(φ0)Qn2/(2c02)\displaystyle\leq 2(\lambda+T\lambda_{0})\sqrt{s}\kappa\|\hat{\varphi}-(\varphi_{0})\|_{Q_{n}}\leq 2(\lambda+T\lambda_{0})^{2}s\kappa^{2}c_{0}^{2}+\|\hat{\varphi}-(\varphi_{0})\|_{Q_{n}}^{2}/(2c_{0}^{2})
2(λ+Tλ0)2sκ2c02+ε¯(ψ^ψ0)/2.\displaystyle\leq 2(\lambda+T\lambda_{0})^{2}s\kappa^{2}c_{0}^{2}+\bar{\varepsilon}(\hat{\psi}\mid\psi_{0})/2.

So we have

ε¯(ψ^ψ0)+2(λTλ0)pc𝜷^𝒢pF4(λ+Tλ0)2sκ2c02.\displaystyle\bar{\varepsilon}(\hat{\psi}\mid\psi_{0})+2(\lambda-T\lambda_{0})\sum_{p\in\mathcal{I}^{c}}\|\hat{\mbox{\boldmath$\beta$}}_{\mathcal{G}_{p}}\|_{F}\leq 4(\lambda+T\lambda_{0})^{2}s\kappa^{2}c_{0}^{2}.

Finally, without the restricted eigenvalue condition 5, we can argue as in Appendix I, assuming the event 𝒯group\mathcal{T}_{group} happens and using the conditions on p𝜷0,𝒢pF\sum_{p}\|\mbox{\boldmath$\beta$}_{0,\mathcal{G}_{p}}\|_{F} and mk\sqrt{mk}.


References

  • Aho et al (2014) Aho K, Derryberry D, Peterson T (2014) Model selection for ecologists: the worldviews of AIC and BIC. Ecology 95(3):631–636
  • Alfò et al (2016) Alfò M, Salvati N, Ranallli MG (2016) Finite mixtures of quantile and M-quantile regression models. Statistics and Computing pp 1–24
  • Argyriou et al (2007a) Argyriou A, Evgeniou T, Pontil M (2007a) Multi-task feature learning. In: Advances in Neural Information Processing Systems, pp 41–48
  • Argyriou et al (2007b) Argyriou A, Pontil M, Ying Y, Micchelli CA (2007b) A spectral regularization framework for multi-task structure learning. In: Advances in Neural Information Processing Systems, pp 25–32
  • Bai et al (2016) Bai X, Chen K, Yao W (2016) Mixture of linear mixed models using multivariate t distribution. Journal of Statistical Computation and Simulation 86(4):771–787
  • Bartolucci and Scaccia (2005) Bartolucci F, Scaccia L (2005) The use of mixtures for dealing with non-normal regression errors. Computational Statistics & Data Analysis 48(4):821–834
  • Barzilai and Borwein (1988) Barzilai J, Borwein JM (1988) Two-point step size gradient methods. IMA Journal of Numerical Analysis 8(1):141–148
  • Becker et al (2011) Becker SR, Candès EJ, Grant MC (2011) Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation 3(3):165–218
  • Bhat and Kumar (2010) Bhat HS, Kumar N (2010) On the derivation of the Bayesian Information Criterion. School of Natural Sciences, University of California
  • Bickel et al (2009) Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics pp 1705–1732
  • Bishop (2006) Bishop CM (2006) Pattern recognition. Machine Learning 128:1–58
  • Boyd et al (2011) Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3(1):1–122
  • Candès and Recht (2009) Candès EJ, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math 9(6):717–772
  • Chen et al (2011) Chen J, Zhou J, Ye J (2011) Integrating low-rank and group-sparse structures for robust multi-task learning. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 42–50
  • Chen et al (2012a) Chen J, Liu J, Ye J (2012a) Learning incoherent sparse and low-rank patterns from multiple tasks. ACM Transactions on Knowledge Discovery from Data (TKDD) 5(4):22
  • Chen et al (2012b) Chen K, Chan KS, Stenseth NC (2012b) Reduced rank stochastic regression with a sparse singular value decomposition. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74(2):203–221
  • Chen et al (2010) Chen X, Kim S, Lin Q, Carbonell JG, Xing EP (2010) Graph-structured multi-task regression and an efficient optimization method for general fused lasso. arXiv preprint arXiv:10053579
  • Cover and Thomas (2012) Cover TM, Thomas JA (2012) Elements of information theory. John Wiley & Sons
  • Dempster et al (1977) Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B (methodological) pp 1–38
  • Doğru and Arslan (2016) Doğru FZ, Arslan O (2016) Robust mixture regression using mixture of different distributions. In: Recent Advances in Robust Statistics: Theory and Applications, Springer, pp 57–79
  • Doğru and Arslan (2017) Doğru FZ, Arslan O (2017) Parameter estimation for mixtures of skew Laplace normal distributions and application in mixture regression modeling. Communications in Statistics-Theory and Methods 46(21):10,879–10,896
  • Fahrmeir et al (2013) Fahrmeir L, Kneib T, Lang S, Marx B (2013) Regression: models, methods and applications. Springer Science & Business Media
  • Fan and Lv (2010) Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Statistica Sinica 20(1):101–148
  • Fern and Brodley (2003) Fern XZ, Brodley CE (2003) Random projection for high dimensional data clustering: A cluster ensemble approach. In: Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp 186–193
  • Van de Geer (2000) Van de Geer SA (2000) Applications of empirical process theory, vol 91. Cambridge University Press Cambridge
  • Gong et al (2012a) Gong P, Ye J, Zhang C (2012a) Robust multi-task feature learning. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 895–903
  • Gong et al (2012b) Gong P, Ye J, Zhang Cs (2012b) Multi-stage multi-task feature learning. In: Advances in Neural Information Processing Systems, pp 1988–1996
  • Hanley and McNeil (1982) Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36
  • He and Lawrence (2011) He J, Lawrence R (2011) A graph-based framework for multi-task multi-view learning. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp 25–32
  • Huang et al (2012) Huang J, Breheny P, Ma S (2012) A selective review of group selection in high-dimensional models. Statistical Science 27(4):481–499
  • Jacob et al (2009) Jacob L, Vert Jp, Bach FR (2009) Clustered multi-task learning: A convex formulation. In: Advances in Neural Information Processing Systems, pp 745–752
  • Jacobs et al (1991) Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE (1991) Adaptive mixtures of local experts. Neural Computation 3(1):79–87
  • Jalali et al (2010) Jalali A, Sanghavi S, Ruan C, Ravikumar PK (2010) A dirty model for multi-task learning. In: Advances in Neural Information Processing Systems, pp 964–972
  • Ji and Ye (2009) Ji S, Ye J (2009) An accelerated gradient method for trace norm minimization. In: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, pp 457–464
  • Jin et al (2006) Jin R, Goswami A, Agrawal G (2006) Fast and exact out-of-core and distributed k-means clustering. Knowledge and Information Systems 10(1):17–40
  • Jin et al (2015) Jin X, Zhuang F, Pan SJ, Du C, Luo P, He Q (2015) Heterogeneous multi-task semantic feature learning for classification. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, ACM, pp 1847–1850
  • Jorgensen (1987) Jorgensen B (1987) Exponential dispersion models. Journal of the Royal Statistical Society Series B (Methodological) pp 127–162
  • Khalili (2011) Khalili A (2011) An overview of the new feature selection methods in finite mixture of regression models. Journal of Iranian Statistical Society 10(2):201–235
  • Khalili and Chen (2007) Khalili A, Chen J (2007) Variable selection in finite mixture of regression models. Journal of the American Statistical Association 102(479):1025–1038
  • Koller and Sahami (1996) Koller D, Sahami M (1996) Toward optimal feature selection
  • Kubat (2015) Kubat M (2015) An introduction to machine learning. Springer
  • Kumar and Daumé III (2012) Kumar A, Daumé III H (2012) Learning task grouping and overlap in multi-task learning. In: Proceedings of the 29th International Conference on Machine Learning, Omnipress, pp 1723–1730
  • Kyung Lim et al (2016) Kyung Lim H, Narisetty NN, Cheon S (2016) Robust multivariate mixture regression models with incomplete data. Journal of Statistical Computation and Simulation pp 1–20
  • Law et al (2002) Law MH, Jain AK, Figueiredo M (2002) Feature selection in mixture-based clustering. In: Advances in Neural Information Processing Systems, pp 625–632
  • Li et al (2014) Li S, Liu ZQ, Chan AB (2014) Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 482–489
  • Liu et al (2009) Liu J, Ji S, Ye J (2009) Multi-task feature learning via efficient 2,1\ell_{2,1}-norm minimization. In: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, AUAI Press, pp 339–348
  • McLachlan and Peel (2004) McLachlan G, Peel D (2004) Finite mixture models. John Wiley & Sons
  • Neal and Hinton (1998) Neal RM, Hinton GE (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models, Springer, pp 355–368
  • Nelder and Baker (1972) Nelder JA, Baker RJ (1972) Generalized linear models. Encyclopedia of Statistical Sciences
  • Nesterov et al (2007) Nesterov Y, et al (2007) Gradient methods for minimizing composite objective function. Tech. rep., UCL
  • Passos et al (2012) Passos A, Rai P, Wainer J, Daumé III H (2012) Flexible modeling of latent task structures in multitask learning. In: Proceedings of the 29th International Conference on Machine Learning, Omnipress, pp 1283–1290
  • Schölkopf et al (1998) Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5):1299–1319
  • She and Chen (2017) She Y, Chen K (2017) Robust reduced-rank regression. Biometrika 104(3):633–647
  • She and Owen (2011) She Y, Owen AB (2011) Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association 106(494):626–639
  • Städler et al (2010) Städler N, Bühlmann P, Van De Geer S (2010) 1\ell_{1}-penalization for mixture regression models. Test 19(2):209–256
  • Strehl and Ghosh (2002a) Strehl A, Ghosh J (2002a) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3(Dec):583–617
  • Strehl and Ghosh (2002b) Strehl A, Ghosh J (2002b) Cluster ensembles: a knowledge reuse framework for combining partitionings. In: 18th National Conference on Artificial Intelligence, American Association for Artificial Intelligence, pp 93–98
  • Tan et al (2010) Tan Z, Kaddoum R, Le Yi Wang HW (2010) Decision-oriented multi-outcome modeling for anesthesia patients. The Open Biomedical Engineering Journal 4:113
  • Van Der Maaten et al (2009) Van Der Maaten L, Postma E, Van den Herik J (2009) Dimensionality reduction: A comparative review. J Mach Learn Res 10:66–71
  • Van Der Vaart and Wellner (1996) Van Der Vaart AW, Wellner JA (1996) Weak convergence and empirical processes: with applications to statistics. Springer
  • Wedel and DeSarbo (1995) Wedel M, DeSarbo WS (1995) A mixture likelihood approach for generalized linear models. Journal of Classification 12(1):21–55
  • Weruaga and Vía (2015) Weruaga L, Vía J (2015) Sparse multivariate gaussian mixture regression. IEEE Transactions on Neural Networks and Learning Systems 26(5):1098–1108
  • Xian Wang et al (2004) Xian Wang H, bing Zhang Q, Luo B, Wei S (2004) Robust mixture modelling using multivariate t-distribution with missing information. Pattern Recognition Letters 25(6):701–710
  • Yang et al (2009) Yang X, Kim S, Xing EP (2009) Heterogeneous multitask learning with joint sparsity constraints. In: Advances in Neural Information Processing Systems, pp 2151–2159
  • Yuksel et al (2012) Yuksel SE, Wilson JN, Gader PD (2012) Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems 23(8):1177–1193
  • Zhang et al (2012) Zhang D, Shen D, Initiative ADN, et al (2012) Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in alzheimer’s disease. NeuroImage 59(2):895–907
  • Zhang and Yeung (2011) Zhang Y, Yeung DY (2011) Multi-task learning in heterogeneous feature spaces. In: 25th AAAI Conference on Artificial Intelligence and the 23rd Innovative Applications of Artificial Intelligence Conference, AAAI-11/IAAI-11, San Francisco, CA, United States, 7-11 August 2011, Code 87049, Proceedings of the National Conference on Artificial Intelligence, p 574
  • Zhou et al (2011) Zhou J, Chen J, Ye J (2011) Clustered multi-task learning via alternating structure optimization. In: Advances in Neural Information Processing Systems, pp 702–710