Latent Group Structured Multi-task Learning
Abstract
In multi-task learning (MTL), we improve the performance of machine learning models by training multiple related tasks jointly. When the number of tasks is large, modeling task structure can further refine the task-relationship model. For example, tasks can often be grouped based on metadata, or via simple preprocessing steps like K-means. In this paper, we present our group-structured latent-space multi-task learning model, which encourages group-structured tasks defined by prior information. We use an alternating minimization method to learn the model parameters. Experiments are conducted on both synthetic and real-world datasets, showing competitive performance over single-task learning (where each task is trained separately) and other MTL baselines.
I Introduction
Multi-task learning (MTL) [6, 2] seeks to improve the performance of a specific task by sharing information across multiple related tasks. Specifically, this is done by simultaneously training many tasks, and promoting relatedness across each task’s feature weights. MTL continues to be a promising tool in many applications including medicine [24, 29], imaging [12], and transportation [25], and has recently become repopularized in the field of deep learning (e.g. [31, 26, 8, 7]).
A common approach to multi-task learning [2, 10] is to provide task-specific weights on features. There are two main approaches toward enforcing task relatedness. One is by promoting similarity between task weights, either through regularization in convex models [3, 17, 11] or structured priors in statistical inference [28, 27, 19]. The other is through enforcing a low-dimensional latent space representation of the task weights [32, 18, 21, 1, 9].
Where the main strength of multi-task learning is shared learning across tasks, a main weakness is contamination from unrelated tasks. Many MTL models have been proposed to combat this contamination. Weighted task relatedness can be imposed via the Gram matrix of the features [23] or a kernel matrix [10], or probabilistically using a joint probability distribution with a given covariance matrix. Pairwise relatedness can also be learned by optimizing for a sparse covariance or inverse covariance matrix expressing the relatedness between weights on different tasks; this can be done either by alternating minimization [30, 13, 16] or variational inference [27]. To encode specifically for group or cluster structure, one approach is to pose a combinatorial problem, in which group identity is represented by an integer [17, 4]. Another approach is to provide multiple sets of weights for each task, regularized separately to encode for different hierarchies of relatedness [14].
For latent space models, task relatedness and grouping can be imposed more simply, without many added variables or discrete optimization. Concretely, each task-specific weight vector can be modeled as $w_t = L s_t$, where the columns of $L$ encode the latent space and the sparse vectors $s_t$ decide how the latent vectors are shared across tasks. In this context, [21] uses dictionary learning to learn a low-rank $L$ and a sparse $s_t$ for each task; this representation is then applied to online learning tasks. Similarly, [18] uses the same model, but learns $L$ and the $s_t$ by minimizing the task loss function. In both cases, the supports of the $s_t$ for different tasks may overlap, enabling this model to capture overlapping group structure in a completely unsupervised way.
In this paper, we take a step back and decompose this problem into two steps: first, learning the task-relatedness structure, and then, performing MTL. The main motivation for this two-step approach is that oftentimes, task relatedness is already given in the metadata. For example, in the task of interpreting geological photos, it may be helpful to use terrain labels (forest, desert, ocean) to group the types of images. Similarly, in regressing school test scores, where each task corresponds to a school, a task group may be a specific district, or designation (private/public/parochial), or a cluster discovered from ethnic or socioeconomic makeup; in this way, the groups are intentionally interpretable. Once the (possibly overlapping) groups are identified, we solve the latent-space model with an overlapping group norm regularization on the variables $s_t$. Although the first step can be made significantly more sophisticated, we find this simple approach can already give superior performance on several benchmark tasks.
II Notation
For a vector $x$, we denote by $x_i$ the $i$th element of $x$, and for some group $g \subseteq \{1, \dots, n\}$, we denote by $x_g$ the subvector of $x$ indexed by $g$. We denote the Euclidean projection of a vector $x$ on a set $K$ as $\mathrm{proj}_K(x)$. For a vector or matrix $A$, we use $A^T$ to represent the transpose of $A$.
III Multi-task Learning
III-A Linear Multi-Task Learning
Consider $T$ tasks, each with labeled training samples $\{(x_{ti}, y_{ti})\}_{i=1}^{m_t}$, $x_{ti} \in \mathbb{R}^d$. Following [2], we minimize the function

$$\sum_{t=1}^{T}\sum_{i=1}^{m_t} \mathcal{L}\big(y_{ti},\, w_t^T x_{ti}\big) \;+\; \lambda\, \Omega(W) \qquad (1)$$

over the weights $w_t \in \mathbb{R}^d$, differentiated per task and stacked as the columns of $W = [w_1, \dots, w_T]$. The loss function $\mathcal{L}$ is any smooth convex loss function, such as the squared loss for regression or the logistic loss for classification. The regularization $\Omega(W)$ enforces task relatedness.
III-B Latent Subspace Multi-task Learning
In latent subspace MTL [21, 18], task relatedness is expressed through a common low-dimensional subspace: $w_t = L s_t$, where $L \in \mathbb{R}^{d \times k}$ is common to all tasks, and a sparse $s_t \in \mathbb{R}^k$ weights the components of the subspace for each task. In [21, 18], parsimony is enforced via the regularization

$$\Omega(L, S) = \mu\, \|L\|_F^2 \;+\; \lambda \sum_{t=1}^{T} \|s_t\|_1.$$

Here, $\|L\|_F^2$ is the square of the Frobenius norm of $L$ and is frequently used to promote low rank (since $\|L\|_F^2 = \sum_i \sigma_i(L)^2$, the sum of squared singular values), and $\|s_t\|_1$ promotes sparsity on each $s_t$. In words, the weights $w_{t_1}$ and $w_{t_2}$ for different tasks $t_1$ and $t_2$ are related through the $i$th latent direction only if $(s_{t_1})_i$ and $(s_{t_2})_i$ are nonzero.
III-C Group Structured Latent Subspace Multi-task Learning
We now describe our group-regularized latent space MTL model. As before, we assume task parameters within a group lie in a low-dimensional subspace, and the penalty function promotes group structure. We first assign the set of tasks $\{1, \dots, T\}$ into groups $\mathcal{G}$, where each $g \in \mathcal{G}$ is a subset of tasks. The assignment may not be unique; i.e. we allow for groups with overlaps. The weight parameter of every task is a linear combination of latent tasks. Mathematically, stacking $W = [w_1, \dots, w_T]$ and $S = [s_1, \dots, s_T]$, we can describe $W = LS$ and penalize

$$\Omega(S) = \sum_{i=1}^{k} \Omega_{\mathcal{G}}\big(s^{(i)}\big),$$

where $s^{(i)}$ is the $i$th row of $S$, and

$$\Omega_{\mathcal{G}}(s) = \min_{\substack{v_g \in \mathbb{R}^{T},\; \mathrm{supp}(v_g) \subseteq g \\ \sum_{g \in \mathcal{G}} v_g = s}} \;\sum_{g \in \mathcal{G}} \|v_g\|_2$$

is the group norm (group lasso with overlap) proposed in [15]. The overall optimization problem can be expressed as

$$\min_{L,\,S}\; \sum_{t=1}^{T}\sum_{i=1}^{m_t} \mathcal{L}\big(y_{ti},\, (L s_t)^T x_{ti}\big) \;+\; \mu\,\|L\|_F^2 \;+\; \lambda \sum_{i=1}^{k} \Omega_{\mathcal{G}}\big(s^{(i)}\big). \qquad (2)$$
We evaluate our model for two tasks (a numerical sketch of the resulting objective follows this list):

1. linear regression, where $\mathcal{L}(y, \hat{y}) = \tfrac{1}{2}(y - \hat{y})^2$, and
2. logistic regression, where $\mathcal{L}(y, \hat{y}) = \log\big(1 + \exp(-y\hat{y})\big)$ for labels $y \in \{-1, +1\}$.
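For concreteness, the following NumPy sketch evaluates the objective (2) in the regression case. The variable names are ours, and for simplicity the group penalty is evaluated in its non-overlapping form (a plain sum of group norms), which is what the overlapping norm reduces to when the groups are disjoint.

```python
import numpy as np

def gs_mtl_objective(L, S, Xs, ys, groups, lam, mu):
    """Group-structured latent-space MTL objective (2), regression case.

    L      : (d, k) latent basis shared across tasks
    S      : (k, T) per-task coefficients, so w_t = L @ S[:, t]
    Xs, ys : per-task data; Xs[t] is (m_t, d), ys[t] is (m_t,)
    groups : list of arrays of task indices (assumed non-overlapping here)
    """
    W = L @ S                                        # (d, T) stacked task weights
    loss = sum(0.5 * np.sum((X @ W[:, t] - y) ** 2)  # squared loss per task
               for t, (X, y) in enumerate(zip(Xs, ys)))
    frob = mu * np.sum(L ** 2)                       # mu * ||L||_F^2
    group_pen = lam * sum(np.linalg.norm(S[i, g])    # group norm of each row of S
                          for i in range(S.shape[0])
                          for g in groups)
    return loss + frob + group_pen
```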
III-D Generalization to basis functions
The success of each of these models is based on the assumption that a linear representation of the feature space is sufficient for modeling, e.g.

$$\hat{y}_{ti} = w_t^T x_{ti}. \qquad (3)$$

Note that this representation can easily be made nonlinear through a set of nonlinear basis functions $\phi_1, \dots, \phi_p$, with

$$\hat{y}_{ti} = \sum_{j=1}^{p} w_{tj}\, \phi_j(x_{ti}), \qquad (4)$$

which, for known $\phi_j$, does not increase the numerical difficulty. Therefore, although the model (4) is more general, for clarity we restrict our attention to the model (3).
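As an illustration, one possible (hypothetical) basis map simply augments the raw features with their squares; the linear machinery above is then applied to the transformed features unchanged.

```python
import numpy as np

def poly_features(x):
    """Example basis map phi: raw features plus their squares.
    The paper leaves phi generic; this particular choice is only illustrative."""
    return np.concatenate([x, x ** 2])

# Precompute phi(x) for every sample and run the same group-structured
# model of Section III-C on the transformed features.
```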
IV Optimization Procedure
Although (2) is not convex, it is biconvex in L and S; we therefore solve (2) using an alternating minimization strategy:
$$L^{(j+1)} = \operatorname*{arg\,min}_{L}\; \sum_{t=1}^{T}\sum_{i=1}^{m_t} \mathcal{L}\big(y_{ti},\, (L s_t^{(j)})^T x_{ti}\big) \;+\; \mu\,\|L\|_F^2 \qquad (5)$$

$$S^{(j+1)} = \operatorname*{arg\,min}_{S}\; \sum_{t=1}^{T}\sum_{i=1}^{m_t} \mathcal{L}\big(y_{ti},\, (L^{(j+1)} s_t)^T x_{ti}\big) \;+\; \lambda \sum_{i=1}^{k} \Omega_{\mathcal{G}}\big(s^{(i)}\big) \qquad (6)$$

where $L^{(j+1)}$ and $S^{(j+1)}$ are the new iterates. The optimization in (5) is over a smooth, strongly convex function of L, and can be efficiently minimized either via accelerated gradient descent or directly via a backsolve when the loss is quadratic (e.g. linear regression).
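As a minimal sketch of the backsolve mentioned above, assuming the squared loss and the objective as written in (2): the L-subproblem is quadratic in vec(L), so it can be solved by one linear system. The Kronecker-product vectorization below is our implementation choice, not the paper's.

```python
import numpy as np

def update_L(S, Xs, ys, d, k, mu):
    """Closed-form L-update for the squared loss (step (5)).

    Minimizes 0.5 * sum_t ||X_t L s_t - y_t||^2 + mu * ||L||_F^2, using
    vec(X_t L s_t) = (s_t^T kron X_t) vec(L) with column-major vec."""
    A = 2.0 * mu * np.eye(d * k)                 # from the gradient of mu * ||L||_F^2
    b = np.zeros(d * k)
    for t, (X, y) in enumerate(zip(Xs, ys)):
        M = np.kron(S[:, t][None, :], X)         # (m_t, d*k) design for vec(L)
        A += M.T @ M
        b += M.T @ y
    return np.linalg.solve(A, b).reshape(d, k, order="F")
```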
The optimization for S in (6) is less straightforward, since the group norm is nonsmooth; this is done via the fast iterative shrinkage-thresholding algorithm (FISTA), which is designed to minimize composite functions $f(s) + h(s)$, where $f$ is convex and differentiable and $h$ is convex but nonsmooth. The FISTA iterates for a step size $\eta$ are then

$$s_{j+1} = \mathrm{prox}_{\eta h}\big(z_j - \eta \nabla f(z_j)\big), \qquad z_{j+1} = s_{j+1} + \frac{t_j - 1}{t_{j+1}}\,\big(s_{j+1} - s_j\big), \qquad t_{j+1} = \frac{1 + \sqrt{1 + 4 t_j^2}}{2},$$

for each row of S. The proximal operator [20] is defined as

$$\mathrm{prox}_{h}(x) = \operatorname*{arg\,min}_{u}\; \tfrac{1}{2}\,\|u - x\|_2^2 + h(u).$$
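A minimal FISTA sketch for the S-subproblem (6) under the squared loss is given below; the helper `prox_row` stands for the proximal operator of the group norm applied to one row of S, developed next. The names and fixed iteration count are ours.

```python
import numpy as np

def fista_update_S(L, S0, Xs, ys, lam, eta, prox_row, n_iters=100):
    """FISTA for step (6): smooth part f(S) = 0.5 * sum_t ||X_t L S[:,t] - y_t||^2,
    nonsmooth group penalty handled row-wise by prox_row(row, tau)."""
    S, Z, tk = S0.copy(), S0.copy(), 1.0
    for _ in range(n_iters):
        G = np.zeros_like(Z)                          # gradient of f at Z
        for t, (X, y) in enumerate(zip(Xs, ys)):
            G[:, t] = L.T @ (X.T @ (X @ (L @ Z[:, t]) - y))
        S_next = Z - eta * G
        for i in range(S_next.shape[0]):              # row-wise proximal step
            S_next[i, :] = prox_row(S_next[i, :], eta * lam)
        tk_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * tk ** 2))
        Z = S_next + ((tk - 1.0) / tk_next) * (S_next - S)
        S, tk = S_next, tk_next
    return S
```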
For $h = \lambda\,\Omega_{\mathcal{G}}$, we compute $\mathrm{prox}_h$ as described in [22]. Define the sets $K_g = \{u \in \mathbb{R}^{T} : \|u_g\|_2 \le \lambda\}$ for each $g \in \mathcal{G}$, and define $K = \bigcap_{g \in \mathcal{G}} K_g$. That is, $K$ contains the vectors whose restriction to every group has Euclidean norm at most $\lambda$. By Fenchel duality,

$$\mathrm{prox}_{\lambda\Omega_{\mathcal{G}}}(x) = x - \mathrm{proj}_{K}(x).$$

Note that if the groups are not overlapping, then this projection decomposes into smaller projections:

$$\big(\mathrm{proj}_{K}(x)\big)_g = \min\!\Big(1,\; \frac{\lambda}{\|x_g\|_2}\Big)\, x_g \quad \text{for each } g \in \mathcal{G},$$
and can be computed in one step. When the groups overlap, we adopt the simple cyclic projection algorithm proposed in [22, 5]. These steps are outlined precisely in Alg. 2.
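A minimal sketch of this proximal computation, following the description above: the prox is x minus an (approximate) projection onto K, computed by cycling projections onto the cylinders K_g. The cycle count is our simplification; a single pass is exact when the groups do not overlap.

```python
import numpy as np

def project_cylinder(u, g, lam):
    """Project u onto K_g = {u : ||u_g||_2 <= lam}; only coordinates in g change."""
    v = u.copy()
    ng = np.linalg.norm(v[g])
    if ng > lam:
        v[g] *= lam / ng
    return v

def prox_group_norm(x, lam, groups, n_cycles=50):
    """prox of lam * Omega_G at x, computed as x - proj_K(x), with proj_K
    approximated by cyclic projections onto the cylinders K_g."""
    u = x.copy()
    for _ in range(n_cycles):
        for g in groups:
            u = project_cylinder(u, g, lam)
    return x - u
```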
The algorithm listing in Algorithm 1 outlines the alternating minimization steps for solving Equation 2. We adopt the method from [18] to initialize L. The individual task weights $w_t$ are first learned independently, each using only its own data, and stacked as the columns of W. The matrix L is then initialized with the top-k left singular vectors of W. The main algorithm then alternately minimizes for L and S, and is terminated when the change in either L or S between consecutive iterations is small.
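Putting the pieces together, a sketch of the overall alternating scheme using the helpers above; the ridge parameter in the initialization and the update ordering (S before L, since only L is initialized) are our choices.

```python
import numpy as np

def gs_mtl(Xs, ys, k, groups, lam, mu, eta, n_outer=50, tol=1e-4):
    """Alternating minimization for (2), reusing update_L, fista_update_S
    and prox_group_norm sketched above."""
    d, T = Xs[0].shape[1], len(Xs)
    # Initialization: per-task ridge solutions stacked as columns of W,
    # then L is set to the top-k left singular vectors of W.
    W = np.column_stack([np.linalg.solve(X.T @ X + 1e-3 * np.eye(d), X.T @ y)
                         for X, y in zip(Xs, ys)])
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    L, S = U[:, :k], np.zeros((k, T))
    prox_row = lambda row, tau: prox_group_norm(row, tau, groups)
    for _ in range(n_outer):
        L_old, S_old = L.copy(), S.copy()
        S = fista_update_S(L, S, Xs, ys, lam, eta, prox_row)   # step (6)
        L = update_L(S, Xs, ys, d, k, mu)                      # step (5)
        if max(np.linalg.norm(L - L_old), np.linalg.norm(S - S_old)) < tol:
            break
    return L, S
```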
V Experiments
In this section we provide experimental results to show the effectiveness of the proposed formulation for both regression and classification problems. We compare our model against three baselines:

- Single task learning (STL): Each task is learned separately, through logistic or ridge regression.
- MTL-FEAT [3]: A latent space MTL model that regularizes S via the $\ell_{2,1}$ norm, e.g. solves

  $$\min_{L,\,S}\; \sum_{t=1}^{T}\sum_{i=1}^{m_t} \mathcal{L}\big(y_{ti},\, (L s_t)^T x_{ti}\big) \;+\; \lambda\, \|S\|_{2,1}, \qquad \|S\|_{2,1} = \sum_{i=1}^{k} \|s^{(i)}\|_2.$$

- GO-MTL [18]: A latent space model that regularizes L for low rank and S for sparsity, e.g. minimizing

  $$\sum_{t=1}^{T}\sum_{i=1}^{m_t} \mathcal{L}\big(y_{ti},\, (L s_t)^T x_{ti}\big) \;+\; \mu\,\|L\|_F^2 \;+\; \lambda\,\|S\|_1.$$

- GS-MTL: Our model, i.e. problem (2).
Below, we test these models on synthetic and real-world datasets. For each dataset, we apply a 60% / 20% / 20% train / validation / test split, with a grid search to determine the best $\lambda$ and $\mu$ over powers of 10.
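A hedged sketch of this model-selection loop, reusing the gs_mtl sketch above; the exponent range and the use of validation RMSE as the selection criterion are our assumptions.

```python
import itertools
import numpy as np

def grid_search(Xs_tr, ys_tr, Xs_val, ys_val, k, groups, eta, exponents=range(-3, 4)):
    """Pick (lambda, mu) over powers of 10 by validation RMSE (regression case)."""
    best_params, best_err = None, np.inf
    for lam, mu in itertools.product([10.0 ** p for p in exponents], repeat=2):
        L, S = gs_mtl(Xs_tr, ys_tr, k, groups, lam, mu, eta)
        errs = [np.sqrt(np.mean((X @ (L @ S[:, t]) - y) ** 2))
                for t, (X, y) in enumerate(zip(Xs_val, ys_val))]
        if np.mean(errs) < best_err:
            best_params, best_err = (lam, mu), float(np.mean(errs))
    return best_params
```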
V-A Synthetic data
We evaluate our model on two synthetic datasets: one we generate ourselves, and another borrowed from related work [18] for comparison. In both, the task is regression.
Synthetic 1. Consider $d$ features and a set of groups, where each group indexes a subset of the features. We uniformly sample a cluster center for each group, and then generate datapoints whose features are drawn around the corresponding centers; if a feature belongs to more than one group, its center is picked randomly (uniformly) from the groups containing it. This is done independently for each datapoint. We use 10 tasks and 20 samples per task. The motivation behind this procedure is to model data vectors in which each feature is drawn from a different clustering scheme; for example, a person's age, socioeconomic class, and geographic location place the person in different communities, and each community characteristic plays a role in predicting the person's task performance.
Synthetic 2 [18]. We borrow the synthetic dataset of [18] (https://github.com/wOOL/GO_MTL) to compare against other models. It consists of 10 regression tasks, with 65 samples per task.
V-B Real datasets
- Human Activity Recognition (https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones): This dataset contains signal features from smartphone sensors held by test subjects performing various actions. The task is to classify the action based on the signal. In total there are 30 volunteers and 6 activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying. Each datapoint contains 561 processed features from the raw signal. We pick the groups using a K-means clustering of these feature vectors (a sketch of this grouping step is given after the list). We model each individual as a separate task and classify sitting versus the other activities.
- Land Mine (http://www.ee.duke.edu/~lcarin/LandmineData.zip): This dataset consists of 29 binary classification tasks. Each instance consists of a 9-dimensional feature vector extracted from radar images taken at various locations. Each task is to predict whether landmines are present in a field, from which several images are taken. The data is also labeled by terrain type: the first 15 fields are highly foliated (forests), while the last 14 are barren (desert). This lends itself to a natural group assignment.
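A sketch of the K-means grouping step referenced in the Human Activity description above; the per-task summary statistic (the mean feature vector of each task's data) and the number of clusters are our assumptions, since the paper only states that the groups come from a K-means clustering of the feature vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_task_groups(Xs, n_groups, seed=0):
    """Cluster per-task mean feature vectors and return a list of
    task-index arrays, usable as the `groups` argument above."""
    task_profiles = np.vstack([X.mean(axis=0) for X in Xs])    # (T, d)
    labels = KMeans(n_clusters=n_groups, random_state=seed).fit_predict(task_profiles)
    return [np.where(labels == c)[0] for c in range(n_groups)]
```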
The results are summarized in Table I. With the single exception of MTL-FEAT on Landmine, all MTL approaches achieve lower error than the single-task learning baseline, which confirms that sharing task information is crucial for good performance. Moreover, the proposed method outperforms both MTL-FEAT and GO-MTL on every dataset, suggesting successful incorporation of the side information.
TABLE I: Performance of each method on the synthetic and real datasets (lower is better).

| Method | Synthetic 1 | Synthetic 2 | Landmine | Human Activity |
| --- | --- | --- | --- | --- |
| STL | 1.729 | 1.682 | 0.253 | 0.660 |
| MTL-FEAT | 1.553 | 1.099 | 0.292 | 0.641 |
| GO-MTL | 1.314 | 0.430 | 0.240 | 0.580 |
| GS-MTL | 1.253 | 0.385 | 0.2303 | 0.559 |
VI Conclusion
In this paper, we proposed a novel framework for learning task relationships in multi-task learning, where a prior group structure is either determined beforehand from problem metadata or inferred via an independent method such as K-means. We build upon models that enforce task interdependence through a latent subspace, regularized to capture group structure. We give algorithms for solving the resulting nonconvex problem for both overlapping and non-overlapping group structures, and demonstrate competitive performance on simulated and real-world datasets.
References
- [1] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
- [2] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Advances in neural information processing systems, pages 41–48, 2007.
- [3] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
- [4] Andreas Argyriou, Andreas Maurer, and Massimiliano Pontil. An algorithm for transfer learning in a heterogeneous environment. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 71–85. Springer, 2008.
- [5] Heinz H Bauschke. The approximation of fixed points of compositions of nonexpansive mappings in hilbert space. Journal of Mathematical Analysis and Applications, 202(1):150–159, 1996.
- [6] Rich Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.
- [7] Richard A Caruana. Multitask connectionist learning. In In Proceedings of the 1993 Connectionist Models Summer School. Citeseer, 1993.
- [8] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.
- [9] Hal Daumé III. Bayesian multitask learning with latent hierarchies. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 135–142. AUAI Press, 2009.
- [10] Theodoros Evgeniou, Charles A Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615–637, 2005.
- [11] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM, 2004.
- [12] Le Gan, Junshi Xia, Peijun Du, and Jocelyn Chanussot. Multiple feature kernel sparse representation classifier for hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 2018.
- [13] André R Gonçalves, Fernando J Von Zuben, and Arindam Banerjee. Multi-task sparse structure learning with gaussian copula models. The Journal of Machine Learning Research, 17(1):1205–1234, 2016.
- [14] Lei Han and Yu Zhang. Learning multi-level task groups in multi-task learning. In AAAI, pages 2638–2644, 2015.
- [15] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th annual international conference on machine learning, pages 433–440. ACM, 2009.
- [16] Laurent Jacob, Jean-philippe Vert, and Francis R Bach. Clustered multi-task learning: A convex formulation. In Advances in neural information processing systems, pages 745–752, 2009.
- [17] Zhuoliang Kang, Kristen Grauman, and Fei Sha. Learning with whom to share in multi-task feature learning. In ICML, pages 521–528, 2011.
- [18] Abhishek Kumar and Hal Daumé III. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1723–1730. Omnipress, 2012.
- [19] Su-In Lee, Vassil Chatalbashev, David Vickrey, and Daphne Koller. Learning a meta-level prior for feature relevance from multiple related tasks. In Proceedings of the 24th international conference on Machine learning, pages 489–496. ACM, 2007.
- [20] Jean-Jacques Moreau. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France, 93(2):273–299, 1965.
- [21] Paul Ruvolo and Eric Eaton. Online multi-task learning via sparse dictionary optimization. In AAAI, pages 2062–2068, 2014.
- [22] Silvia Villa, Lorenzo Rosasco, Sofia Mosci, and Alessandro Verri. Proximal methods for the latent group lasso penalty. Computational Optimization and Applications, 58(2):381–407, 2014.
- [23] Fei Wang, Xin Wang, and Tao Li. Semi-supervised multi-task learning with task regularizations. In Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on, pages 562–568. IEEE, 2009.
- [24] Lu Wang, Dongxiao Zhu, Elizabeth Towner, and Ming Dong. Obesity risk factors ranking using multi-task learning. In Biomedical & Health Informatics (BHI), 2018 IEEE EMBS International Conference on, pages 385–388. IEEE, 2018.
- [25] Weixin Wang, Qing He, Yu Cui, and Zhiguo Li. Joint prediction of remaining useful life and failure type of train wheelsets: Multitask learning approach. Journal of Transportation Engineering, Part A: Systems, 144(6):04018016, 2018.
- [26] Zhizheng Wu, Cassia Valentini-Botinhao, Oliver Watts, and Simon King. Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 4460–4464, 2015.
- [27] Ming Yang, Yingming Li, and Zhongfei Zhang. Multi-task learning with gaussian matrix generalized inverse gaussian model. In International Conference on Machine Learning, pages 423–431, 2013.
- [28] Kai Yu, Volker Tresp, and Anton Schwaighofer. Learning gaussian processes from multiple tasks. In Proceedings of the 22nd international conference on Machine learning, pages 1012–1019. ACM, 2005.
- [29] Weizhong Zhang, Tingjin Luo, Shuang Qiu, Jieping Ye, Deng Cai, Xiaofei He, and Jie Wang. Identifying genetic risk factors for alzheimer’s disease via shared tree-guided feature learning across multiple tasks. IEEE Transactions on Knowledge and Data Engineering, 2018.
- [30] Yu Zhang and Dit-Yan Yeung. A convex formulation for learning task relationships in multi-task learning. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 733–742. AUAI Press, 2010.
- [31] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision, pages 94–108, 2014.
- [32] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Clustered multi-task learning via alternating structure optimization. In Advances in neural information processing systems, pages 702–710, 2011.