
Latent Group Structured Multi-task Learning

Xiangyu Niu Department of Electrical Engineering
and Computer Science
University of Tennessee, Knoxville
Knoxville, Tennessee 37996
Email: [email protected]
   Yifan Sun Department of Computer Science
University of British Columbia
Vancouver, B.C. V6T 1Z4
Canada
Email: [email protected]
   Jinyuan Sun Department of Electrical Engineering
and Computer Science
University of Tennessee, Knoxville
Knoxville, Tennessee 37996
Email: [email protected]
Abstract

In multi-task learning (MTL), we improve the performance of key machine learning algorithms by training various tasks jointly. When the number of tasks is large, modeling task structure can further refine the task relationship model. For example, tasks can often be grouped based on metadata, or via simple preprocessing steps such as K-means clustering. In this paper, we present our group structured latent-space multi-task learning model, which encourages group-structured tasks defined by prior information. We use an alternating minimization method to learn the model parameters. Experiments are conducted on both synthetic and real-world datasets, showing competitive performance over single-task learning (where each task is trained separately) and other MTL baselines.

I Introduction

Multi-task learning (MTL) [6, 2] seeks to improve the performance of a specific task by sharing information across multiple related tasks. Specifically, this is done by simultaneously training many tasks and promoting relatedness across each task's feature weights. MTL continues to be a promising tool in many applications, including medicine [24, 29], imaging [12], and transportation [25], and has recently regained popularity in the field of deep learning (e.g. [31, 26, 8, 7]).

A common approach to multi-task learning [2, 10] is to provide task-specific weights on features. There are two main approaches toward enforcing task relatedness. One is by promoting similarity between task weights, either through regularization in convex models [3, 17, 11] or structured priors in statistical inference [28, 27, 19]. The other is through enforcing a low-dimensional latent space representation of the task weights [32, 18, 21, 1, 9].

Where the main strength of multi-task learning is shared learning across tasks, a main weakness is contamination from unrelated tasks. Many MTL models have been proposed to combat this contamination. Weighted task relatedness can be imposed via the Gram matrix of the features [23] or a kernel matrix [10], or probabilistically using a joint probability distribution with a given covariance matrix. Pairwise relatedness can also be learned by optimizing for a sparse covariance or inverse covariance matrix expressing the relatedness between weights on different tasks; this can be done either by alternating minimization [30, 13, 16] or variational inference [27]. To encode specifically for group or cluster structure, one approach is to pose a combinatorial problem in which group identity is represented by an integer [17, 4]. Another approach is to provide multiple sets of weights for each task, regularized separately to encode different hierarchies of relatedness [14].

For latent space models, task relatedness and grouping can be imposed more simply, without many added variables or discrete optimization. Concretely, each task-specific feature vector can be modeled as $x^{t}\approx L s^{t}$, where the columns of $L$ encode the latent space and a sparse vector $s^{t}$ decides how the latent vectors are shared across tasks. In this context, [21] uses dictionary learning to learn a low-rank $L$ and a sparse $s^{t}$ for each task; this representation is then applied to online learning tasks. Similarly, [18] uses the same model, but learns $L$ and the $s^{t}$ by minimizing the task loss function. In both cases, the supports of the $s^{t}$ for different $t$ may overlap, enabling this model to capture overlapping group structure in a completely unsupervised way.

In this paper, we take a step back and decompose this problem into two steps: first, learning the task relatedness structure, and then performing MTL. The main motivation for this two-step approach is that oftentimes, task relatedness is already given in the metadata. For example, in the task of interpreting geological photos, it may be helpful to use terrain labels (forest, desert, ocean) to group the types of images. Similarly, in regressing school test scores, where each task corresponds to a school, a task group may be a specific district, a designation (private/public/parochial), or a cluster discovered from ethnic or socioeconomic makeup; in this way, the groups are intentionally interpretable. Once the (possibly overlapping) groups are identified, we solve the latent-space model with an overlapping group norm regularization on the variables $s^{t}$. Although the first step can be made significantly more sophisticated, we find this simple approach already gives superior performance on several benchmark tasks.

II Notation

For a vector $x\in\mathbb{R}^{n}$, we denote by $x_{i}$ the $i$th element of $x$, and for a group $G\subseteq\{1,\ldots,n\}$, we denote by $x_{G}=\{x_{i}\}_{i\in G}$ the subvector of $x$ indexed by $G$. We denote the Euclidean projection of a vector $x$ onto a set $\mathcal{S}$ as $\textbf{proj}_{\mathcal{S}}(x)$. For a vector or matrix $x$, we use $x^{\prime}$ to represent its transpose.

III Multi-task Learning

III-A Linear Multi-Task Learning

Consider $T$ tasks, each with $n_{t}$, $t=1,2,\ldots,T$, labeled training samples $\{x_{i}^{t}\in\mathbb{R}^{d},\,y_{i}^{t}\in\mathbb{R}\}_{i=1,2,\ldots,n_{t}}$. Following [2], we minimize the function

$$\min_{\{\textbf{w}^{t}\}_{t}}\left\{\sum_{t=1}^{T}\left(\sum_{i=1}^{n_{t}}\ell\big((\textbf{w}^{t})^{\prime}x_{i}^{t},\,y_{i}^{t}\big)\right)+\mathcal{R}(\textbf{w}^{1},\ldots,\textbf{w}^{T})\right\}\qquad(1)$$

over task-specific feature weights $\textbf{w}^{t}$, one per task. The loss function $\ell$ is any smooth convex loss, such as the squared loss for regression or the logistic loss for classification. The regularization $\mathcal{R}$ enforces task relatedness.

III-B Latent Subspace Multi-task Learning

In latent subspace MTL [21, 18], task relatedness is expressed through a common low-dimensional subspace: $\textbf{w}^{t}=\textbf{L}\textbf{s}^{t}$, where $\textbf{L}\in\mathbb{R}^{d\times k}$ is common to all tasks, and a sparse $\textbf{s}^{t}$ weights the contribution of each latent component to task $t$. In [21, 18], parsimony is enforced via the regularization

$$\mathcal{R}(\textbf{L},\{\textbf{s}^{t}\}_{t})=\mu\|\textbf{L}\|_{F}^{2}+\lambda\sum_{t=1}^{T}\|\textbf{s}^{t}\|_{1}.$$

Here, $\|\textbf{L}\|_{F}^{2}=\sum_{ij}L_{ij}^{2}$ is the squared Frobenius norm of $\textbf{L}$, frequently used to promote low rank (since $\|\textbf{L}\|_{F}^{2}=\text{tr}(\textbf{L}^{\prime}\textbf{L})$), and $\|\textbf{s}^{t}\|_{1}=\sum_{i}|\textbf{s}^{t}_{i}|$ promotes sparsity in each $\textbf{s}^{t}$. In words, the weights $\textbf{w}^{t}=\textbf{L}\textbf{s}^{t}$ for two tasks $t=t_{1}$ and $t=t_{2}$ are related through the latent direction $\textbf{L}_{k}$ only if both $\textbf{s}^{t_{1}}_{k}$ and $\textbf{s}^{t_{2}}_{k}$ are nonzero.
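As a concrete illustration, the following minimal numpy sketch (with toy dimensions, values, and regularization weights of our own choosing, not code from the experiments) builds the factorization $\textbf{w}^{t}=\textbf{L}\textbf{s}^{t}$ and evaluates the regularizer above:

```python
import numpy as np

d, k, T = 5, 3, 4                     # features, latent tasks, tasks (toy sizes)
mu, lam = 0.1, 0.01                   # example regularization weights
rng = np.random.default_rng(0)

L = rng.standard_normal((d, k))       # shared latent basis (columns span the subspace)
S = np.zeros((k, T))                  # sparse per-task combination weights
S[0, 0] = S[0, 1] = 1.0               # tasks 0 and 1 both load on latent direction 0
S[2, 3] = -0.5                        # task 3 uses latent direction 2 alone

W = L @ S                             # column t of W is w^t = L s^t
reg = mu * np.sum(L**2) + lam * np.abs(S).sum()   # mu*||L||_F^2 + lam*sum_t ||s^t||_1
```

In this toy example, tasks 0 and 1 are coupled because their columns of S share a nonzero entry in row 0, while task 3 shares no latent direction with the others.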

III-C Group Structured Latent Subspace Multi-task Learning

We now describe our group-regularized latent space MTL model. As before, we assume the task parameters $\textbf{w}^{t}$ within a group lie in a low-dimensional subspace, and the penalty function $\mathcal{R}$ promotes group structure. We first assign the set of tasks to $g$ groups $\mathcal{G}=\{G_{1},\ldots,G_{g}\}$. The assignment need not be a partition; i.e., we allow groups to overlap. The weight parameter of every task is a linear combination of $k<T$ latent tasks. Mathematically, stacking $\textbf{W}=[\textbf{w}^{1},\ldots,\textbf{w}^{T}]$ and $\textbf{S}=[\textbf{s}^{1},\ldots,\textbf{s}^{T}]$, we write $\textbf{W}=\textbf{L}\textbf{S}$ and

$$\mathcal{R}(\textbf{L},\{\textbf{s}^{t}\}_{t})=\lambda\|\textbf{L}\|_{F}^{2}+\mu\|\textbf{S}\|_{\mathcal{G},1},\qquad\|\textbf{S}\|_{\mathcal{G},1}=\sum_{i=1}^{k}\|\textbf{S}_{i,:}\|_{\mathcal{G}},$$

where $\textbf{S}_{i,:}$ is the $i$th row of $\textbf{S}$, and

$$\|x\|_{\mathcal{G}}=\min_{w_{1},\ldots,w_{g}}\left\{\sum_{G\in\mathcal{G}}\|w_{G}\|_{2}\;:\;x=\sum_{G\in\mathcal{G}}w_{G}\right\}$$

is the group norm proposed in [15]. The overall optimization problem can be expressed as

$$\min_{\textbf{L},\textbf{S}}\;\sum_{t=1}^{T}\sum_{i=1}^{n_{t}}\ell\big(y_{i}^{t},\,(\textbf{s}^{t})^{\prime}\textbf{L}^{\prime}x_{i}^{t}\big)+\mu\left\|\textbf{S}\right\|_{\mathcal{G},1}+\lambda\|\textbf{L}\|_{F}^{2}.\qquad(2)$$
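For intuition on the group norm, note that when the groups are disjoint and cover all indices, the minimization in the definition of $\|x\|_{\mathcal{G}}$ is attained by taking each $w_{G}$ equal to $x$ on $G$ and zero elsewhere, so the norm reduces to a sum of per-group Euclidean norms. As a small illustrative example (ours, not from the paper): for $x=(3,4,12)$ with non-overlapping groups $G_{1}=\{1,2\}$ and $G_{2}=\{3\}$,
$$\|x\|_{\mathcal{G}}=\|(3,4)\|_{2}+\|(12)\|_{2}=5+12=17.$$
Applied row-wise in $\|\textbf{S}\|_{\mathcal{G},1}$, this penalty encourages entire groups of entries within each row of $\textbf{S}$ to vanish together, so that a latent direction is either shared by a group of tasks or not used by that group at all.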

We evaluate our model for two tasks:

  1. linear regression, where $\ell\big(y_{i}^{t},(\textbf{s}^{t})^{\prime}\textbf{L}^{\prime}x_{i}^{t}\big)=\big((\textbf{s}^{t})^{\prime}\textbf{L}^{\prime}x_{i}^{t}-y_{i}^{t}\big)^{2}$, and

  2. logistic regression, where $\ell\big(y_{i}^{t},(\textbf{s}^{t})^{\prime}\textbf{L}^{\prime}x_{i}^{t}\big)=\log\big(1+\exp(-y_{i}^{t}(\textbf{s}^{t})^{\prime}\textbf{L}^{\prime}x_{i}^{t})\big)$.
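For concreteness, the following minimal numpy sketch evaluates the squared-loss instance of objective (2) in the non-overlapping case (the helper names, data layout, and the restriction to disjoint groups are our own illustrative choices, not code from the paper):

```python
import numpy as np

def group_norm_rows(S, groups):
    """sum_i ||S_{i,:}||_G for non-overlapping groups of task indices."""
    return sum(np.linalg.norm(S[i, g]) for i in range(S.shape[0]) for g in groups)

def objective(L, S, X, Y, groups, mu, lam):
    """Squared-loss instance of (2). X[t]: (n_t, d) samples, Y[t]: (n_t,) labels."""
    loss = sum(np.sum((X[t] @ L @ S[:, t] - Y[t]) ** 2) for t in range(len(X)))
    return loss + mu * group_norm_rows(S, groups) + lam * np.sum(L ** 2)
```

Here `groups` is a list of index arrays over tasks (columns of S); overlapping groups require the variational definition of the norm above rather than this simple sum.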

III-D Generalization to basis functions

The success of each of these models is based on the assumption that a linear representation of the feature space is sufficient for modeling, e.g.

$$y_{i}^{t}\approx(\textbf{w}^{t})^{\prime}x_{i}^{t}\qquad(3)$$

Note that this representation can easily be made nonlinear through a set of nonlinear functions $\phi_{1},\ldots,\phi_{d}$, with

$$y_{i}^{t}\approx\sum_{j=1}^{d}\textbf{w}_{j}^{t}\,\phi_{j}(x_{i}^{t})\qquad(4)$$

which, for known $\phi_{j}$, does not increase the numerical difficulty. Therefore, although the model (4) is more general, for clarity we restrict our attention to the model (3).
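For instance, a minimal sketch of this idea (the quadratic expansion is merely one possible choice of basis, not one prescribed by the paper):

```python
import numpy as np

def phi(x):
    """Example nonlinear basis: the original features plus their elementwise squares."""
    return np.concatenate([x, x ** 2])

# Model (4) with known phi is model (3) applied to expanded features:
# y_i^t ~ (w^t)' phi(x_i^t), with w^t now living in the expanded dimension,
# so the same MTL machinery applies unchanged.
```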

IV Optimization Procedure

Although (2) is not convex, it is biconvex in L and S; we therefore solve (2) using an alternating minimization strategy:

$$\textbf{L}^{+}=\underset{\textbf{L}}{\text{argmin}}\;\sum_{t=1}^{T}\sum_{i=1}^{n_{t}}\ell\big(y_{i}^{t},(\textbf{s}^{t})^{\prime}\textbf{L}^{\prime}x_{i}^{t}\big)+\lambda\|\textbf{L}\|_{F}^{2}\qquad(5)$$
$$\textbf{S}^{+}=\underset{\textbf{S}}{\text{argmin}}\;\sum_{t=1}^{T}\sum_{i=1}^{n_{t}}\ell\big(y_{i}^{t},(\textbf{s}^{t})^{\prime}(\textbf{L}^{+})^{\prime}x_{i}^{t}\big)+\mu\|\textbf{S}\|_{\mathcal{G},1}\qquad(6)$$

where $\textbf{S}^{+},\textbf{L}^{+}$ are the new iterates. The objective in (5) is smooth and strongly convex in $\textbf{L}$, and can be efficiently minimized either via accelerated gradient descent or, when $\ell(y,x)=\|x-y\|_{2}^{2}$ (i.e., linear regression), directly via a backsolve.
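For squared loss, the L-step is a ridge regression in $\text{vec}(\textbf{L})$: since $(\textbf{s}^{t})^{\prime}\textbf{L}^{\prime}x_{i}^{t}=\langle\textbf{L},\,x_{i}^{t}(\textbf{s}^{t})^{\prime}\rangle$, each sample contributes the "feature matrix" $x_{i}^{t}(\textbf{s}^{t})^{\prime}$. A minimal numpy sketch of this direct solve (our own illustration, suitable when $dk$ is small; variable names are ours):

```python
import numpy as np

def solve_L(X, Y, S, lam):
    """Closed-form L-step of (5) for squared loss.
    X[t]: (n_t, d) samples, Y[t]: (n_t,) labels, S: (k, T)."""
    d, k = X[0].shape[1], S.shape[0]
    # Row for sample (t, i) is vec(x_i^t (s^t)'), so that row @ vec(L) = (s^t)' L' x_i^t.
    A = np.vstack([np.outer(x, S[:, t]).ravel() for t in range(len(X)) for x in X[t]])
    y = np.concatenate([Y[t] for t in range(len(X))])
    # Normal equations of the ridge problem: (A'A + lam I) vec(L) = A'y.
    vecL = np.linalg.solve(A.T @ A + lam * np.eye(d * k), A.T @ y)
    return vecL.reshape(d, k)
```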

The optimization for $\textbf{S}$ in (6) is less straightforward, since the group norm is nonsmooth; we use the fast iterative shrinkage-thresholding algorithm (FISTA), which minimizes $f(\textbf{S})+g(\textbf{S})$, where $f(\textbf{S})=\sum_{t}\sum_{i}\ell\big(y_{i}^{t},(\textbf{s}^{t})^{\prime}\textbf{L}^{\prime}x_{i}^{t}\big)$ is convex and differentiable, and $g(\textbf{S})=\|\textbf{S}\|_{\mathcal{G},1}$ is nonsmooth. The FISTA iterates for a step size $t>0$ are then

$$\textbf{S}_{i}^{+}=\textbf{prox}_{tg}\big(\textbf{S}_{i}-t\nabla f(\textbf{S}_{i})\big)$$

for each row $i=1,\ldots,k$ of $\textbf{S}$. The proximal operator [20] is defined as

$$\textbf{prox}_{tg}(x)=\underset{u}{\text{argmin}}\;g(u)+\frac{1}{2t}\|u-x\|_{2}^{2}.$$

For $g(x)=\|x\|_{\mathcal{G}}$, we compute $\textbf{prox}_{tg}(x)$ as described in [22]. Define the sets $\mathcal{B}_{k}$, where $x\in\mathcal{B}_{k}\iff\|x_{G_{k}}\|_{2}\leq 1$, for $k=1,\ldots,g$, and define $\mathcal{B}=\mathcal{B}_{1}\cap\mathcal{B}_{2}\cap\cdots\cap\mathcal{B}_{g}$. That is,

$$x\in t\mathcal{B}\iff\|x_{G_{k}}\|_{2}\leq t,\quad k=1,\ldots,g.$$

By Fenchel duality,

$$\textbf{prox}_{tg}(x)=x-\textbf{proj}_{t\mathcal{B}}(x).$$

Note that if the groups do not overlap, then this projection decomposes into $g$ smaller projections:

$$z_{G_{k}}:=\textbf{proj}_{t\mathcal{B}_{k}}(x_{G_{k}})=\begin{cases}x_{G_{k}}, & \|x_{G_{k}}\|_{2}\leq t,\\[2pt] \dfrac{t}{\|x_{G_{k}}\|_{2}}\,x_{G_{k}}, & \text{otherwise},\end{cases}$$

and can be computed in one step. When the groups overlap, we adopt the simple cyclic projection algorithm proposed in [22, 5]. These steps are outlined precisely in Alg. 2.
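In the non-overlapping case, combining the last two displays gives a block soft-thresholding formula, $\textbf{prox}_{tg}(x)_{G_{k}}=\max\big(0,\,1-t/\|x_{G_{k}}\|_{2}\big)\,x_{G_{k}}$. A small numpy sketch of this case (our own illustration, not the authors' implementation; it assumes the groups partition the index set):

```python
import numpy as np

def prox_group_norm(x, groups, t):
    """prox of t*||.||_G for non-overlapping groups (block soft-thresholding),
    i.e. x - proj_{tB}(x) computed group by group."""
    out = np.zeros_like(x)
    for g in groups:                     # groups: list of index arrays partitioning x
        nrm = np.linalg.norm(x[g])
        if nrm > t:                      # otherwise the whole block is set to zero
            out[g] = (1.0 - t / nrm) * x[g]
    return out

# Example: prox_group_norm(np.array([3., 4., 12.]), [np.array([0, 1]), np.array([2])], t=1.0)
```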

Algorithm 1 outlines the alternating minimization steps for solving Equation 2. We adopt the method from [18] to initialize $\textbf{L}$: the individual task weights $\textbf{w}^{t}$ are first learned independently, each from its own task's data, and stacked as the columns of $\textbf{W}$; the matrix $\textbf{L}$ is then initialized with the top-$k$ left singular vectors of $\textbf{W}$. The main algorithm then alternately minimizes over $\textbf{L}$ and $\textbf{S}$, and is terminated when the change in either $\textbf{L}$ or $\textbf{S}$ between consecutive iterations is small.

Data: $(x^{t}_{i},y^{t}_{i})$, $i=1,\ldots,n_{t}$ samples, $t=1,\ldots,T$ tasks; $k$: number of latent tasks.
Initialization: learn each task individually,
$$\textbf{w}^{t}=\underset{\textbf{w}}{\text{argmin}}\sum_{i=1}^{n_{t}}\ell(y_{i}^{t},\textbf{w}^{\prime}x_{i}^{t}),\qquad\textbf{W}=[\textbf{w}^{1},\ldots,\textbf{w}^{T}].$$
Compute the top-$k$ SVD $\textbf{W}\approx\textbf{U}_{k}\mathbf{\Sigma}_{k}\textbf{V}_{k}^{\prime}$ and initialize $\textbf{L}=\textbf{U}_{k}$.
while not converged do
      Fix $\textbf{L}$ and solve for $\textbf{S}$ via Alg. 2;
      Fix $\textbf{S}$ and solve for $\textbf{L}$,
$$\textbf{L}=\underset{\textbf{L}}{\text{argmin}}\sum_{t=1}^{T}\sum_{i=1}^{n_{t}}\ell\big(y_{i}^{t},(\textbf{s}^{t})^{\prime}\textbf{L}^{\prime}x_{i}^{t}\big)+\lambda\left\|\textbf{L}\right\|_{F}^{2},$$
      via gradient descent or a direct method;
end while
Result: Task predictor matrices $\textbf{L}$ and $\textbf{S}$.
Algorithm 1: Group structured subspace multi-task learning
Data: Current $\textbf{S}$, $\textbf{L}$, step size $t>0$.
while not converged do
      Take a gradient step $\tilde{\textbf{S}}=\textbf{S}-t\nabla f(\textbf{S})$;
      for $i=1,\ldots,k$ do
            % Compute $z=\textbf{proj}_{t\mathcal{B}}(\tilde{\textbf{S}}_{i,:})$
            if the groups do not overlap then
                  $z_{G_{j^{\prime}}}:=\textbf{proj}_{t\mathcal{B}_{j^{\prime}}}\big((\tilde{\textbf{S}}_{i})_{G_{j^{\prime}}}\big)$ for $j^{\prime}=1,\ldots,g$;
            else
                  $z^{(0)}=0$;
                  for $j=1,2,\ldots$ until convergence do
                        $j^{\prime}=j\bmod g$;
                        $z^{(j+1)}_{G_{j^{\prime}}}:=\frac{1}{j+1}(\tilde{\textbf{S}}_{i})_{G_{j^{\prime}}}+\frac{j}{j+1}\,\frac{t}{\|z^{(j)}_{G_{j^{\prime}}}\|_{2}}\,z^{(j)}_{G_{j^{\prime}}}$;
                  end for
                  Set $z=z^{(j)}$;
            end if
            Update $\textbf{S}_{i}:=\tilde{\textbf{S}}_{i}-z$, i.e., $\textbf{S}_{i}=\textbf{prox}_{tg}(\tilde{\textbf{S}}_{i})$;
      end for
end while
Result: $\textbf{S}$ with updated rows $\textbf{S}_{1},\ldots,\textbf{S}_{k}$.
Algorithm 2: Update of S
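When the groups overlap, the projection onto $t\mathcal{B}$ no longer decomposes across groups. The sketch below performs one proximal-gradient update of a single row of S; for the projection it uses Dykstra's alternating-projection algorithm as a standard substitute for the cyclic scheme of Alg. 2 (this is our own hedged illustration of the idea, not the authors' exact method, and all names are ours):

```python
import numpy as np

def project_ball_intersection(v, groups, t, n_iter=100):
    """Dykstra's algorithm: (approximate, for finite n_iter) Euclidean projection
    of v onto {x : ||x_G||_2 <= t for every group G}."""
    x = v.copy()
    incs = [np.zeros_like(v) for _ in groups]      # Dykstra correction terms
    for _ in range(n_iter):
        for j, g in enumerate(groups):
            y = x + incs[j]
            z = y.copy()
            nrm = np.linalg.norm(y[g])
            if nrm > t:                            # project onto the j-th group ball
                z[g] = (t / nrm) * y[g]
            incs[j] = y - z
            x = z
    return x

def prox_row(s_row, grad_row, groups, step, mu):
    """One proximal-gradient update of a row of S for (6)."""
    v = s_row - step * grad_row                    # gradient step on the smooth part
    # prox of (step*mu)*||.||_G via the Moreau identity prox = v - proj_{radius*B}(v)
    return v - project_ball_intersection(v, groups, step * mu)
```

The factor `step * mu` plays the role of the radius $t$ in the text; as in Alg. 2, the regularization weight can equivalently be absorbed into the step size.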

V Experiments

In this section we provide experimental results to show the effectiveness of the proposed formulation for both regression and classification problems. We compare against three baselines:

  • Single task learning (STL): Each task is learned separately, through logistic or ridge regression.

  • MTL-FEAT [3]: A latent space MTL model that regularizes $\textbf{S}$ via the (2,1)-norm, i.e., solves

    $$\min_{\textbf{S},\textbf{L}}\sum_{t=1}^{T}\left(\sum_{i=1}^{n_{t}}\ell\big(y_{i}^{t},(\textbf{s}^{t})^{\prime}\textbf{L}^{\prime}x_{i}^{t}\big)+\mu\|\textbf{s}^{t}\|_{1}^{2}\right)$$
  • GO-MTL [18]: A latent space model that regularizes $\textbf{L}$ for low rank and $\textbf{S}$ for sparsity, i.e., minimizes

    $$\min_{\textbf{S},\textbf{L}}\sum_{t=1}^{T}\sum_{i=1}^{n_{t}}\ell\big(y_{i}^{t},(\textbf{s}^{t})^{\prime}\textbf{L}^{\prime}x_{i}^{t}\big)+\mu\|\textbf{S}\|_{1}+\lambda\|\textbf{L}\|_{F}$$
  • GS-MTL: Our model.

Below, we test the three MTL models and the STL baseline on synthetic and real-world datasets. For each dataset, we apply a 60% / 20% / 20% train / validation / test split, with a grid search over powers of 10 to determine the best $\lambda$ and $\mu$.

V-A Synthetic data

We evaluate our model on two synthetic datasets: one that we generate ourselves, and one borrowed from related work for comparison. In both, the task is regression.

Synthetic 1. Let $m$ be the number of features and $\mathcal{G}=\{G_{1},G_{2},\ldots,G_{g}\}$ the set of groups, where each $G_{k}\subset\{1,\ldots,m\}$. We uniformly sample cluster centers $\mu_{1},\ldots,\mu_{g}$ with $\mu_{k}\sim\mathcal{U}(0,1)^{m}$. Then, we generate $n$ datapoints $x\in\mathbb{R}^{m}$ such that $x_{i}\sim\mathcal{N}((\mu_{k})_{i},\sigma)$ where $i\in G_{k}$. If $i$ belongs to more than one group, then $k$ is picked uniformly at random from the set $\{k:i\in G_{k}\}$; this is done independently for $i=1,\ldots,m$. Specifically, $m=20$ and $g=3$, with 10 tasks and 20 samples per task. The motivation behind this procedure is to model data vectors in which each feature is drawn from a different clustering scheme: for example, a person's age, socioeconomic class, and geographic location produce different communities to which the person may belong, and each community characteristic plays a role in predicting the person's task performance.
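A minimal numpy sketch of this feature generator (the noise level $\sigma$, the particular disjoint partition of features, and the omission of the regression targets are our own illustrative assumptions; the text fixes $m=20$, $g=3$, 10 tasks, and 20 samples per task):

```python
import numpy as np

def make_synthetic1(m=20, g=3, n_tasks=10, n_per_task=20, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    groups = np.array_split(np.arange(m), g)       # example: a disjoint partition of the features
    centers = rng.uniform(0.0, 1.0, size=(g, m))   # mu_k ~ U(0,1)^m
    owner = np.empty(m, dtype=int)                 # group that generates each feature
    for k, G in enumerate(groups):
        owner[G] = k
    # feature i is drawn around coordinate i of its group's center
    X = [rng.normal(centers[owner, np.arange(m)], sigma, size=(n_per_task, m))
         for _ in range(n_tasks)]
    # (the regression targets are not specified in the text and are omitted here)
    return X, groups
```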

Synthetic 2 [18]. We borrow the synthetic dataset from https://github.com/wOOL/GO_MTL to compare against other models. It consists of $m=20$ dimensional feature vectors and 10 tasks, with 65 samples per task.

V-B Real datasets

  • Human Activity Recognition (https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones): This dataset contains signal features from smartphone sensors carried by test subjects performing various actions. The task is to classify the action based on the signal. In total there are 30 volunteers and 6 activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying. Each datapoint contains 561 processed features derived from the raw signal. We pick the groups using a K-means clustering of these feature vectors (see the sketch after this list). We model each individual as a separate task and predict sitting versus the other activities.

  • Land Mine (http://www.ee.duke.edu/~lcarin/LandmineData.zip): This dataset consists of 29 binary classification tasks. Each instance consists of a 9-dimensional feature vector extracted from radar images taken at various locations. Each task is to predict whether landmines are present in a field, from which several images are taken. The data is also labeled by terrain type: the first 15 fields are highly foliated (forests), while the last 14 are barren (desert). This lends itself to a natural group assignment.
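As noted in the Human Activity bullet above, one plausible reading of the K-means grouping step (our interpretation; the text does not specify exactly which vectors are clustered) is to cluster each subject's mean feature vector and treat each resulting cluster as a task group:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_task_groups(task_features, n_groups):
    """task_features: list of (n_t, d) arrays, one per task (subject).
    Returns a list of task-index arrays, one per group."""
    profiles = np.vstack([X.mean(axis=0) for X in task_features])   # one summary vector per task
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(profiles)
    return [np.where(labels == k)[0] for k in range(n_groups)]
```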

The results are summarized in Table I. On nearly every dataset, the MTL approaches achieve lower prediction error than the single task learning baseline (the exception is MTL-FEAT on Landmine), which confirms that sharing task information is important for good performance. Moreover, the proposed method outperforms both MTL-FEAT and GO-MTL on every dataset, suggesting successful incorporation of side information.

Figure 1: Visualization of the final $\textbf{S}$ matrix solved over the Landmine dataset: (a) sparse pattern generated by GO_MTL; (b) sparse pattern generated by GS_MTL. The 29 tasks are along the $x$-axis, and the 2 latent tasks are along the $y$-axis. Although the GO_MTL model is designed to find group structure, it is difficult to discern. In our model, by specifically regularizing for the two groups, we see much clearer separation. However, the pattern is not completely block diagonal, showing some cross-group information sharing between tasks.
TABLE I: Average prediction error for regression and classification datasets.

Method      Synthetic1   Synthetic2   Landmine   Human Activity
STL         1.729        1.682        0.253      0.660
MTL-FEAT    1.553        1.099        0.292      0.641
GO-MTL      1.314        0.430        0.240      0.580
GS-MTL      1.253        0.385        0.2303     0.559

VI Conclusion

In this paper, we proposed a novel framework for learning task relationships in multi-task learning, where a prior group structure is either determined beforehand from problem metadata or inferred via an independent method such as K-means. We build upon models that enforce task interdependence through a latent subspace, regularized to capture group structure. We give algorithms for solving the resulting nonconvex problem for both overlapping and non-overlapping group structure, and demonstrate its performance on simulated and real-world datasets.

References

  • [1] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
  • [2] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Advances in neural information processing systems, pages 41–48, 2007.
  • [3] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
  • [4] Andreas Argyriou, Andreas Maurer, and Massimiliano Pontil. An algorithm for transfer learning in a heterogeneous environment. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 71–85. Springer, 2008.
  • [5] Heinz H Bauschke. The approximation of fixed points of compositions of nonexpansive mappings in hilbert space. Journal of Mathematical Analysis and Applications, 202(1):150–159, 1996.
  • [6] Rich Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.
  • [7] Richard A Caruana. Multitask connectionist learning. In Proceedings of the 1993 Connectionist Models Summer School. Citeseer, 1993.
  • [8] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.
  • [9] Hal Daumé III. Bayesian multitask learning with latent hierarchies. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 135–142. AUAI Press, 2009.
  • [10] Theodoros Evgeniou, Charles A Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615–637, 2005.
  • [11] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM, 2004.
  • [12] Le Gan, Junshi Xia, Peijun Du, and Jocelyn Chanussot. Multiple feature kernel sparse representation classifier for hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 2018.
  • [13] André R Gonçalves, Fernando J Von Zuben, and Arindam Banerjee. Multi-task sparse structure learning with gaussian copula models. The Journal of Machine Learning Research, 17(1):1205–1234, 2016.
  • [14] Lei Han and Yu Zhang. Learning multi-level task groups in multi-task learning. In AAAI, pages 2638–2644, 2015.
  • [15] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th annual international conference on machine learning, pages 433–440. ACM, 2009.
  • [16] Laurent Jacob, Jean-philippe Vert, and Francis R Bach. Clustered multi-task learning: A convex formulation. In Advances in neural information processing systems, pages 745–752, 2009.
  • [17] Zhuoliang Kang, Kristen Grauman, and Fei Sha. Learning with whom to share in multi-task feature learning. In ICML, pages 521–528, 2011.
  • [18] Abhishek Kumar and Hal Daumé III. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1723–1730. Omnipress, 2012.
  • [19] Su-In Lee, Vassil Chatalbashev, David Vickrey, and Daphne Koller. Learning a meta-level prior for feature relevance from multiple related tasks. In Proceedings of the 24th international conference on Machine learning, pages 489–496. ACM, 2007.
  • [20] Jean-Jacques Moreau. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France, 93(2):273–299, 1965.
  • [21] Paul Ruvolo and Eric Eaton. Online multi-task learning via sparse dictionary optimization. In AAAI, pages 2062–2068, 2014.
  • [22] Silvia Villa, Lorenzo Rosasco, Sofia Mosci, and Alessandro Verri. Proximal methods for the latent group lasso penalty. Computational Optimization and Applications, 58(2):381–407, 2014.
  • [23] Fei Wang, Xin Wang, and Tao Li. Semi-supervised multi-task learning with task regularizations. In Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on, pages 562–568. IEEE, 2009.
  • [24] Lu Wang, Dongxiao Zhu, Elizabeth Towner, and Ming Dong. Obesity risk factors ranking using multi-task learning. In Biomedical & Health Informatics (BHI), 2018 IEEE EMBS International Conference on, pages 385–388. IEEE, 2018.
  • [25] Weixin Wang, Qing He, Yu Cui, and Zhiguo Li. Joint prediction of remaining useful life and failure type of train wheelsets: Multitask learning approach. Journal of Transportation Engineering, Part A: Systems, 144(6):04018016, 2018.
  • [26] Zhizheng Wu, Cassia Valentini-Botinhao, Oliver Watts, and Simon King. Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 4460–4464, 2015.
  • [27] Ming Yang, Yingming Li, and Zhongfei Zhang. Multi-task learning with gaussian matrix generalized inverse gaussian model. In International Conference on Machine Learning, pages 423–431, 2013.
  • [28] Kai Yu, Volker Tresp, and Anton Schwaighofer. Learning gaussian processes from multiple tasks. In Proceedings of the 22nd international conference on Machine learning, pages 1012–1019. ACM, 2005.
  • [29] Weizhong Zhang, Tingjin Luo, Shuang Qiu, Jieping Ye, Deng Cai, Xiaofei He, and Jie Wang. Identifying genetic risk factors for alzheimer’s disease via shared tree-guided feature learning across multiple tasks. IEEE Transactions on Knowledge and Data Engineering, 2018.
  • [30] Yu Zhang and Dit-Yan Yeung. A convex formulation for learning task relationships in multi-task learning. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 733–742. AUAI Press, 2010.
  • [31] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision, pages 94–108, 2014.
  • [32] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Clustered multi-task learning via alternating structure optimization. In Advances in neural information processing systems, pages 702–710, 2011.