
Multi-task manifold learning for small sample size datasets

Hideaki Ishibashi, Kazushi Higa, and Tetsuo Furukawa. Kyushu Institute of Technology, 2–4 Hibikino, Wakamatsu-ku, Kitakyushu 808-0196, Japan; Horiba Ltd., Kyoto, Japan
Abstract

In this study, we develop a method for multi-task manifold learning. The method aims to improve the performance of manifold learning for multiple tasks, particularly when each task has a small number of samples. Furthermore, the method aims to generate new samples both for existing tasks and for new tasks. In the proposed method, we use two different types of information transfer: instance transfer and model transfer. For instance transfer, datasets are merged among similar tasks, whereas for model transfer, the manifold models are averaged among similar tasks. For this purpose, the proposed method consists of a set of generative manifold models corresponding to the tasks, which are integrated into a general model of a fiber bundle. We applied the proposed method to artificial datasets and face image sets, and the results show that the method can estimate the manifolds even from a tiny number of samples.

keywords:
Multi-task unsupervised learning, Multi-level modeling, Small sample size problem, Manifold disentanglement, Meta-learning
journal: Neurocomputing

1 Introduction

To model a high-dimensional dataset, it is often assumed that the data points are distributed in a low-dimensional nonlinear subspace, that is, a manifold. Such manifold-based methods are useful for data visualization and unsupervised modeling, and have been applied in many fields [1, 2]. A common problem of the manifold-based approach is that it requires a sufficient number of samples to capture the overall shape of the manifold. If there are only a few samples in a high-dimensional space, the samples cover the manifold only sparsely, making it difficult to estimate the manifold shape. This is referred to as the small sample size problem in the literature [3].

As a typical example, we consider a face image dataset, because the face images of a person can be modeled by a manifold [4, 5]. To estimate the face manifold of a person, we need many photographs with various expressions or poses. However, it is typically difficult to obtain such an exhaustive image set for a single person. If we only have a limited number of facial expressions for the person, it seems almost impossible to estimate a face manifold that contains many unknown expressions. Thus, we use multi-task learning; if we can utilize the face images of many other people, an exhaustive dataset is no longer necessary.

In the single-task scenario, the aim of manifold learning is to map a dataset to a low-dimensional latent space by modeling the data distribution with a manifold. In the multi-task scenario, the aim is therefore to map a set of datasets to a common low-dimensional space. For this purpose, multiple manifold learning tasks are executed in parallel while information is transferred between tasks. Therefore, the success of multi-task learning depends on what information is transferred between tasks and how.

In this study, we use a generative manifold model approach, which is a class of manifold learning methods. In particular, we choose a method that models a manifold by kernel smoothing, that is, the kernel smoothing manifold model (KSMM). Our target is therefore to develop a multi-task KSMM (MT-KSMM) that works for small sample size datasets. Additionally, MT-KSMM aims to estimate a general model that can be applied to new tasks. Thus, our target method also covers meta-learning of manifold models [6].

In the proposed method, we introduce two different types of information transfer between tasks: instance transfer and model transfer. For instance transfer, the given datasets are merged among similar tasks, whereas for model transfer, the estimated manifold models are regularized so that they become closer among similar tasks. To execute these information transfers, we introduce an extra KSMM into MT-KSMM, whereby the manifold models of the tasks are integrated into a general model of a fiber bundle. In addition, the extra KSMM estimates the similarities between tasks, by which the amount of information transfer is controlled. Thus, the extra KSMM works as a command center that governs the KSMMs for the tasks. Hereafter, the extra KSMM is referred to as the higher-order KSMM (higher-KSMM, in short), whereas the KSMMs for the tasks are referred to as the lower-KSMMs. The key ideas of this work are thus (1) using two styles of information transfer alternately, and (2) using a hierarchical structure consisting of the lower-KSMMs and the higher-KSMM.

The remainder of this paper is structured as follows: We introduce related work in Section 2, and describe the theoretical framework in Section 3. We present the proposed method in Section 4 and experimental results in Section 5. In Section 6, we discuss the study from the viewpoint of information geometry, and in the final section, we present the conclusion.

2 Related work

2.1 Multi-task unsupervised learning

Multi-task learning is a paradigm of machine learning that aims to improve performance by learning similar tasks simultaneously [7, 8]. Many studies have been conducted on multi-task learning, particularly in supervised learning settings. By contrast, few studies have been conducted on multi-task unsupervised learning. Some studies on multi-task clustering have been reported [9, 10, 11, 12].

To date, few studies have addressed multi-task dimensionality reduction, subspace methods, or manifold learning. To the best of our knowledge, a study on multi-task principal component analysis (PCA) is the only research expressly aimed at multi-task learning of subspace methods [13]. In multi-task PCA, the given datasets are modeled by linear subspaces, which are regularized so that they are close to each other on the Grassmannian manifold.

Regarding nonlinear subspace methods, to the best of our knowledge, the most closely related work is the higher-order self-organizing map (SOM2, which is also called the ‘SOM of SOMs’) [14, 15]. Although SOM2 is designed for multi-level modeling rather than multi-task learning, the function of SOM2 can also be regarded as multi-task learning. SOM2 is described in detail below.

2.2 Multi-level modeling

Multi-level modeling (or hierarchical modeling) aims to obtain higher models of tasks in addition to modeling each task [16]. Although multi-level modeling does not aim to improve the performance of individual tasks, the research areas of multi-level modeling and multi-task learning overlap. In fact, multi-level modeling is sometimes adopted as an approach for multi-task learning [17, 18, 19].

As in the case of multi-task learning, most studies conducted on multi-level modeling have considered supervised learning, particularly linear models. Among the multi-level unsupervised modeling approaches, SOM2 aims to model a set of self-organizing maps (SOMs) using a higher-order SOM (higher-SOM) [14, 15]. In SOM2, the lower-order SOMs (lower-SOMs) model each task using a nonlinear manifold, whereas the higher-SOM models those manifolds by another nonlinear manifold in function space. As a result, the entire dataset is represented by a product manifold, that is, a fiber bundle. Thus, the manifolds represented by the lower-SOMs are the fibers, whereas the higher-SOM represents the base space of the fiber bundle. Additionally, the relations between tasks are visualized in the low-dimensional space of the tasks, in which similar tasks are arranged nearer to each other and different tasks are arranged further from each other.

The learning algorithm of SOM2 is not a simple cascade of SOMs. The higher and lower-SOMs learn in parallel and affect each other. Thus, whereas the higher-SOM learns the set of lower models, the lower-SOMs are regularized by the higher-SOM so that similar tasks are represented by similar manifolds. Thus, SOM2 has many aspects of multi-task learning. In fact, SOM2 has been applied to several areas of multi-task learning, such as unsupervised modeling of face images of various people [20], nonlinear dynamical systems with latent state variables [21, 22], shapes of various objects [23, 24], and people of various groups [25, 26]. In this sense, SOM2 is one of the earliest works on multi-task unsupervised learning for nonlinear subspace methods.

Although SOM2 functions like multi-task learning, it still has difficulty estimating manifolds when the sample size is small. Additionally, SOM2 has several limitations that it inherits from SOM, such as poor manifold representation caused by the discrete grid nodes. In this study, we try to overcome these limitations by replacing SOM with KSMM, and we then extend it to multi-task learning so that the manifold shape is estimated more accurately, even for a small sample size.

2.3 Meta-learning

Recently, the concept of meta-learning has been gaining importance [6]. Meta-learning aims to solve unseen future tasks efficiently by learning multiple existing tasks. This is in contrast to multi-task learning, which aims to improve the learning performance of existing tasks. Because MT-KSMM aims to obtain a general model that can represent future tasks in addition to existing tasks, our aim covers both multi-task learning and meta-learning. As with multi-task learning and multi-level modeling, meta-learning for unsupervised learning has rarely been studied. In particular, to the best of our knowledge, unsupervised meta-learning of unsupervised learning tasks has not been reported, except for SOM2. In this sense, SOM2 occupies a unique position. In this study, we aim to develop a novel method for multi-task, multi-level, and meta-learning of manifold models based on SOM2.

2.4 Generative manifold modeling

Nonlinear methods for dimensionality reduction are roughly categorized into two groups. The first group comprises methods that project data points from the high-dimensional space to a low-dimensional space; most dimensionality reduction methods, including many manifold learning approaches, belong to this group. By contrast, the second group explicitly estimates the mapping from a low-dimensional latent space to the manifold in the high-dimensional visible space [27]; that is, a nonlinear embedding is estimated. Because new samples on the manifold can be generated using the estimated embedding, we refer to the second group as generative manifold modeling in this study. An advantage of generative manifold modeling is that it does not suffer from the pre-image problem or the out-of-sample extension problem [28]. Thus, by generating new samples between the given data points, generative manifold modeling completes the entire manifold shape. Additionally, because the embedding is explicitly estimated, generative manifold modeling allows the direct measurement of the distance between two manifolds in function space. These are the reasons that we chose the generative manifold model approach.

Representative methods of generative manifold modeling are generative topographic mapping (GTM) [29], the Gaussian process latent variable model (GPLVM) [30, 27], and unsupervised kernel regression (UKR) [31], which originate from the SOM [32]. To estimate the embedding and the latent variables, GPLVM and GTM employ a Bayesian approach based on a Gaussian process, whereas UKR and SOM employ a non-Bayesian approach based on a kernel smoother. Additionally, GPLVM and UKR represent the manifold in a non-parametric manner, whereas GTM and SOM represent it in a parametric manner. In this study, we use a parametric kernel smoothing approach like the original SOM, for ease of information transfer between tasks.

2.5 Unsupervised manifold alignment

In nonlinear subspace methods, there is always arbitrariness in the determination of the coordinate system of the latent space. For example, any orthogonal transformation of the latent space yields an equivalent result. Additionally, any nonlinear distortion along the manifold is also allowed.

In single-task learning, such arbitrariness is not a problem, but it causes serious problems in multi-task learning. Because the coordinate system of the latent space is determined differently for each task, the learning results become incompatible between tasks. Therefore, it prevents knowledge transfer between tasks.

To solve this problem, we need to know the correspondences between two different manifolds, or we need to regularize them so that they share the same coordinate system. This is known as the manifold alignment problem, which is challenging to solve under the unsupervised condition [33], and it is one reason why multi-task subspace methods are difficult. In the case of multi-task PCA, this problem was avoided by mapping the subspaces to a Grassmannian manifold [13]; however, this approach is only possible in the linear subspace case. Therefore, manifold alignment is an unavoidable problem, which we address in this study.

3 Problem formulation

In this section, we first describe generative manifold modeling in the single-task case, and then we present the problem formulation of the multi-task case. The notation used in this paper is described in Appendix A, and the symbol list is presented in Table 2.

Figure 1: Generative model for multi-task manifold learning. (a) Single-task case. A datum $\mathbf{x}$ on the manifold $\mathcal{X}\subseteq\mathcal{V}$ is generated from the embedding $f$ and the latent variable $\mathbf{z}\in\mathcal{L}$. (b) Multi-task case. The latent space $\mathcal{L}$ is common to all tasks, whereas the embedding $f$ differs between tasks. (c) Manifold assumption for tasks. The embeddings $f$ are distributed on a manifold $\mathcal{Y}$ in function space $\mathcal{H}$. (d) Under the manifold assumption, the entire set of datasets is modeled by a fiber bundle $\mathcal{E}$.

3.1 Problem formulation for single-task learning

First, we describe the problem formulation for the single-task case (Figure 1 (a)). Let $\mathcal{V}=\mathbb{R}^{D_\mathcal{V}}$ and $\mathcal{L}\subseteq\mathbb{R}^{D_\mathcal{L}}$ be the high-dimensional visible space and the low-dimensional latent space, respectively, and let $X=\{\mathbf{x}_n\}_{n=1}^N$ be the observed sample set, where $\mathbf{x}_n\in\mathcal{V}$. In the case of a face image dataset, $X$ is the set of face images of a single person with various expressions or poses, and $\mathbf{z}_n\in\mathcal{L}$ is the sample latent variable corresponding to $\mathbf{x}_n$, which represents an intrinsic property of the image, such as the expression or pose. The main purpose of manifold learning is to map $X$ to $\mathcal{L}$ by assuming that the sample set $X$ is modeled by a manifold $\mathcal{X}\subseteq\mathcal{V}$, which is homeomorphic to $\mathcal{L}$. Thus, a homeomorphism $\pi\colon\mathcal{X}\xrightarrow{\sim}\mathcal{L}$ can be defined, by which $\mathbf{x}\in\mathcal{X}$ is projected to $\mathbf{z}=\pi(\mathbf{x})$. Note that actual samples are not distributed exactly on $\mathcal{X}$ because of observation noise. Thus, the observed sample $\mathbf{x}_n$ is represented as $\mathbf{x}_n=\pi^{-1}(\mathbf{z}_n)+\boldsymbol{\varepsilon}_n$, where $\boldsymbol{\varepsilon}_n\sim\mathcal{N}(\mathbf{0},\beta^{-1}\mathbf{I}_{D_\mathcal{V}})$ is observation noise. Generative manifold modeling aims not only to estimate the sample latent variables $Z=\{\mathbf{z}_n\}_{n=1}^N$, but also to estimate the embedding $f\coloneqq\pi^{-1}$ explicitly. Thus, the probabilistic generative model is estimated as $q(\mathbf{x},\mathbf{z}\mid f)=\mathcal{N}(\mathbf{x}\mid f(\mathbf{z}),\beta^{-1}\mathbf{I}_{D_\mathcal{V}})\,p(\mathbf{z})$. Because $f$ is expected to be a smooth continuous mapping, we assume that $f$ is a member of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}=\{f\mid f\colon\mathcal{L}\to\mathcal{V}\}$.

3.2 Problem formulation for multi-task learning

Now we formulate the problem in which we have $I$ tasks. Thus, we have $I$ sample sets $\{X_1,\dots,X_I\}$, where $X_i=\{\mathbf{x}_{ij}\}_{j=1}^{J_i}$. For example, in the case of face image datasets, we have $I$ image sets of $I$ people, with $J_i$ expressions or poses for each person. We denote the entire sample set by $X=\bigcup_i X_i=\{\mathbf{x}_n\}_{n=1}^N$, where $N=\sum_i J_i$. We also denote the entire sample set by the matrix $\mathbf{X}=(\mathbf{x}_n^\mathsf{T})\in\mathbb{R}^{N\times D_\mathcal{V}}$. In the same manner, $Z_i=\{\mathbf{z}_{ij}\}_{j=1}^{J_i}$ is the latent variable set of task $i$, which corresponds to $X_i$, whereas the entire latent variable set is represented as $Z=\{\mathbf{z}_n\}_{n=1}^N$. Additionally, let $i_n$ be the task index of sample $n$, and let $\mathcal{N}_i$ be the index set of samples that belong to task $i$. To simplify the explanation, we consider the case in which each dataset has the same number of samples per task (S/T), that is, $J_i\equiv J$.

Similar to the single-task case, we assume that $\{X_i\}$ is modeled by manifolds $\{\mathcal{X}_i\}$, which are all homeomorphic to the common latent space $\mathcal{L}\subseteq\mathbb{R}^{D_\mathcal{L}}$. Thus, we can define a smooth embedding $f_i\colon\mathcal{L}\to\mathcal{X}_i\subseteq\mathcal{V}$ for each task, where $f_i\coloneqq\pi_i^{-1}$ (Figure 1 (b)). Considering the observation noise, we represent the probability distribution of task $i$ by $q_i(\mathbf{x},\mathbf{z}\mid f_i)=\mathcal{N}(\mathbf{x}\mid f_i(\mathbf{z}),\beta^{-1}\mathbf{I}_{D_\mathcal{V}})\,p(\mathbf{z})$. In this study, the individual embedding $f_i$ is referred to as the task model.

To apply multi-task learning, the given tasks need to have some similarities or common properties. We make the following two assumptions about the similarities of tasks. First, the sample latent variable $\mathbf{z}$ represents the intrinsic property of a sample, which is independent of the task. Thus, if $\mathbf{x}_{ij}$ and $\mathbf{x}_{i'j'}$ are mapped to the same $\mathbf{z}$, they are regarded as having the same intrinsic property, even if $\mathbf{x}_{ij}\neq\mathbf{x}_{i'j'}$. For example, if both $\mathbf{x}_{ij}$ and $\mathbf{x}_{i'j'}$ are face images of different people with the same expression (e.g., smiling), they should be mapped to similar $\mathbf{z}$, even if the images look different. Using this assumption, the distance between two manifolds $\mathcal{X}_i$ and $\mathcal{X}_{i'}$ can be defined using the norm $\|f_i-f_{i'}\|_\mathcal{H}$ as

$$L^2(\mathcal{X}_i,\mathcal{X}_{i'}) \coloneqq \|f_i-f_{i'}\|^2_\mathcal{H} = \int_\mathcal{L} \|f_i(\mathbf{z})-f_{i'}(\mathbf{z})\|^2\,dP(\mathbf{z}), \qquad (1)$$

where $dP(\mathbf{z})$ is the measure of $\mathbf{z}$, which is common to all tasks. In this study, we define the measure as $dP(\mathbf{z})=p(\mathbf{z})\,d\mathbf{z}$, where $p(\mathbf{z})$ is the prior of $\mathbf{z}$.
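For concreteness, the following minimal sketch (not the authors' code) approximates the squared distance in (1) by Monte Carlo integration over the uniform prior $p(\mathbf{z})$ on $\mathcal{L}=[-1,+1]^2$; the two toy embeddings `f1` and `f2` are hypothetical examples that differ only by a constant offset in one visible dimension.

```python
import numpy as np

def manifold_distance_sq(f_i, f_j, n_mc=10_000, dim_L=2, seed=0):
    """Monte Carlo estimate of ||f_i - f_j||_H^2 = E_{z~p(z)} ||f_i(z) - f_j(z)||^2,
    with p(z) uniform on [-1, +1]^dim_L (Eq. (1))."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(-1.0, 1.0, size=(n_mc, dim_L))     # z ~ p(z)
    diff = f_i(Z) - f_j(Z)                             # (n_mc, D_V)
    return float(np.mean(np.sum(diff**2, axis=1)))

# Two toy embeddings into a 3-D visible space, differing by a vertical offset of 0.5:
f1 = lambda Z: np.c_[Z[:, 0], Z[:, 1], Z[:, 0]**2 - Z[:, 1]**2]
f2 = lambda Z: np.c_[Z[:, 0], Z[:, 1], Z[:, 0]**2 - Z[:, 1]**2 + 0.5]
print(manifold_distance_sq(f1, f2))                    # approximately 0.25
```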

Second, we assume that the set of task models $F=\{f_i\}_{i=1}^I$ is also distributed on a nonlinear manifold $\mathcal{Y}$ embedded in the RKHS $\mathcal{H}$ (Figure 1 (c)). Thus, $\mathcal{Y}$ is defined by a nonlinear smooth embedding $g\colon\mathcal{T}\to\mathcal{Y}\subseteq\mathcal{H}$, where $\mathcal{T}\subseteq\mathbb{R}^{D_\mathcal{T}}$ is the low-dimensional latent space for tasks. Under this assumption, all tasks are assigned task latent variables $U=\{\mathbf{u}_i\}_{i=1}^I$, $\mathbf{u}_i\in\mathcal{T}$, so that $f_i=g(\mathbf{u}_i)$. This also means that if tasks $i$ and $i'$ are similar, then $\mathbf{u}_i\sim\mathbf{u}_{i'}$ and $\mathcal{X}_i\sim\mathcal{X}_{i'}$. As a result, the entire model distribution is represented as

$$q(\mathbf{x},\mathbf{z},\mathbf{u}\mid G)=\mathcal{N}\bigl(\mathbf{x}\mid G(\mathbf{z},\mathbf{u}),\beta^{-1}\mathbf{I}_{D_\mathcal{V}}\bigr)\,p(\mathbf{z})\,p(\mathbf{u}), \qquad (2)$$

where $G\colon\mathcal{L}\times\mathcal{T}\to\mathcal{E}\subseteq\mathcal{V}$ is defined by $G(\cdot,\mathbf{u})=g(\mathbf{u})$, and $\mathcal{E}$ is the product manifold $\mathcal{L}\times\mathcal{T}$ embedded in $\mathcal{V}$ (Figure 1 (d)). In this study, $G$ is referred to as the general model. Using the general model, each task model is expected to satisfy $f_i(\mathbf{z})=G(\mathbf{z},\mathbf{u}_i)$.

We can describe this problem formulation in terms of a fiber bundle. Suppose that $\mathcal{E}\subseteq\mathcal{V}$ is the total space of the fiber bundle, where $\mathcal{E}\coloneqq G(\mathcal{L},\mathcal{T})$. Then we can define a projection $\eta\colon\mathcal{E}\to\mathcal{T}$ so that $\eta(\mathcal{X}_i)\equiv\mathbf{u}_i$, where $\mathcal{T}$ is now referred to as the base space. Thus, the samples that belong to the same task are all projected to the same $\mathbf{u}$. Under this scenario, $(\mathcal{E},\mathcal{T},\eta)$ becomes a fiber bundle, and each task manifold $\mathcal{X}_i$ is regarded as a fiber. Additionally, we can define another projection $\pi\colon\mathcal{E}\to\mathcal{L}$, which satisfies $\pi(\mathbf{x}_{ij})\equiv\pi_i(\mathbf{x}_{ij})$. Such a formulation using a fiber bundle has been proposed in several studies [14, 34].

Under this scenario, our aim is to determine the general model $G$ in addition to the task model set $F=\{f_i\}_{i=1}^I$, and to estimate the latent variables of samples $Z=\{\mathbf{z}_n\}_{n=1}^N$ and tasks $U=\{\mathbf{u}_i\}_{i=1}^I$. Considering compatibility with the SOM, we assume that the latent spaces are the square spaces $\mathcal{L}=[-1,+1]^{D_\mathcal{L}}$ and $\mathcal{T}=[-1,+1]^{D_\mathcal{T}}$, and that the priors $p(\mathbf{z})$ and $p(\mathbf{u})$ are the uniform distributions on $\mathcal{L}$ and $\mathcal{T}$. (Rigorously speaking, we should consider the border effect for a compact latent space such as $[-1,+1]^D$; in this study, we ignore this effect for ease of implementation, as is common in the SOM and its variants.) The difficulty of learning depends on the S/T (i.e., $J$) and the number of tasks $I$. In particular, we are interested in the case in which the S/T is too small to capture the manifold shape, whereas a sufficient number of tasks is available for multi-task learning.

4 Proposed method

In this section, we first describe single-task manifold modeling using KSMM. We then describe the information transfer between tasks. Finally, we present the multi-task KSMM, that is, MT-KSMM. In this section, we outline the derivation of the MT-KSMM algorithm; the details are given in Appendix B.

4.1 Kernel smoothing manifold modeling (KSMM)

We chose KSMM as the generative manifold modeling method; it estimates the embedding using a kernel smoother, and our aim is thus to develop its multi-task extension, MT-KSMM. KSMM is a theoretical generalization of SOM, which has been proposed in many studies [35, 36, 37, 38, 39]. The main difference between SOM and KSMM is that the former discretizes the latent space $\mathcal{L}$ into regular grid nodes, whereas the latter treats it as a continuous space. Following previous studies, the cost function of KSMM is given by

$$E[Z,f\mid X] = \frac{\beta}{2N}\sum_n \int_\mathcal{L} h_\mathcal{L}(\mathbf{z}\mid\mathbf{z}_n)\,\|f(\mathbf{z})-\mathbf{x}_n\|^2\,d\mathbf{z}, \qquad (3)$$

where $h_\mathcal{L}(\mathbf{z}\mid\mathbf{z}')$ is a non-negative smoothing kernel defined on $\mathcal{L}$, typically $h_\mathcal{L}(\mathbf{z}\mid\mathbf{z}')=\mathcal{N}(\mathbf{z}\mid\mathbf{z}',\lambda_\mathcal{L}^2\mathbf{I})$ [38]. If we regard $h_\mathcal{L}(\mathbf{z}\mid\mathbf{z}_n)$ as the probability density that represents the uncertainty of $\mathbf{z}_n$, the cost function (3) equals the cross-entropy between the data distribution $p(\mathbf{x},\mathbf{z})$ and the model distribution $q(\mathbf{x},\mathbf{z})$ (ignoring constants), which are given by

$$p(\mathbf{x},\mathbf{z}\mid X,Z) = \frac{1}{N}\sum_n \mathcal{N}(\mathbf{x}\mid\mathbf{x}_n,\beta^{-1}\mathbf{I})\,h_\mathcal{L}(\mathbf{z}\mid\mathbf{z}_n) \qquad (4)$$
$$q(\mathbf{x},\mathbf{z}\mid f) = \mathcal{N}\bigl(\mathbf{x}\mid f(\mathbf{z}),\beta^{-1}\mathbf{I}\bigr)\,p(\mathbf{z}). \qquad (5)$$

Thus, $E[Z,f\mid X]\stackrel{c}{=}H[p(\mathbf{x},\mathbf{z}\mid X,Z),\,q(\mathbf{x},\mathbf{z}\mid f)]$, where '$\stackrel{c}{=}$' means that both sides are equal up to a constant, and $H[\cdot,\cdot]$ denotes the cross-entropy.

The KSMM algorithm is the expectation–maximization (EM) algorithm in the broad sense [38, 39], in which the E and M steps are iterated alternately until the cost function (3) converges (Figure 2 (a)). In the E step, the cost function (3) is optimized with respect to $Z$ while $f$ is fixed, as follows:

$$\hat{\mathbf{z}}_n \coloneqq \operatorname*{arg\,min}_{\mathbf{z}_n} \int h_\mathcal{L}(\mathbf{z}\mid\mathbf{z}_n)\,\bigl\|\hat{f}(\mathbf{z})-\mathbf{x}_n\bigr\|^2\,d\mathbf{z} \qquad (6)$$
$$\simeq \operatorname*{arg\,min}_{\mathbf{z}_n} \bigl\|\hat{f}(\mathbf{z}_n)-\mathbf{x}_n\bigr\|^2, \qquad (7)$$

where $\hat{\mathbf{z}}_n$ and $\hat{f}$ denote the estimators of $\mathbf{z}_n$ and $f$, respectively. (In this study, tentative estimators are denoted by a hat $\hat{\ }$ when we need to indicate them explicitly, and the results of information transfer are denoted by a tilde $\tilde{\ }$.) The latent variable estimation obtained by (7) is used in the original SOM [32] and some similar methods [31]. Note that we can rewrite (7) as

$$\hat{\mathbf{z}}_n \simeq \operatorname*{arg\,max}_{\mathbf{z}_n} \log q(\mathbf{x}_n,\mathbf{z}_n\mid\hat{f}),$$

which is the maximum log-likelihood estimator. We optimize (7) using a SOM-like grid search in the early iterations to avoid local minima, and then update the latent variables by the gradient method in the later iterations.

By contrast, in the M step, (3) is optimized with respect to $f$ while $Z$ is fixed, as follows:

$$\hat{f} \coloneqq \operatorname*{arg\,min}_f \sum_n \int h_\mathcal{L}(\mathbf{z}\mid\hat{\mathbf{z}}_n)\,\|f(\mathbf{z})-\mathbf{x}_n\|^2\,d\mathbf{z} = \operatorname*{arg\,min}_f H\bigl[p(\mathbf{x},\mathbf{z}\mid X,\hat{Z}),\,q(\mathbf{x},\mathbf{z}\mid f)\bigr]. \qquad (8)$$

The solution of (8) is given by the kernel smoothing of the samples as

$$\hat{f}(\mathbf{z}) \coloneqq \frac{\sum_n h_\mathcal{L}(\mathbf{z}\mid\hat{\mathbf{z}}_n)\,\mathbf{x}_n}{\sum_{n'} h_\mathcal{L}(\mathbf{z}\mid\hat{\mathbf{z}}_{n'})}. \qquad (9)$$

These E and M steps are executed repeatedly until the cost function (3) converges. In the proposed method, the embedding $f$ is represented parametrically using orthonormal bases. The details are provided in Section 4.5.
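For illustration, the following minimal sketch (not the authors' implementation) runs the non-parametric form of the E and M steps, representing $\hat{f}$ by its values on a fixed grid of $\mathcal{L}$ and using grid search only; the paper instead uses the parametric representation of Section 4.5 and switches to the gradient method in later iterations.

```python
import numpy as np

def h_L(Z_eval, Z_hat, lam):
    """Gaussian smoothing kernel h_L(z | z_n), up to a normalizing constant."""
    d2 = np.sum((Z_eval[:, None, :] - Z_hat[None, :, :])**2, axis=2)
    return np.exp(-0.5 * d2 / lam**2)                        # (M, N)

def m_step(X, Z_hat, Z_grid, lam):
    """Eq. (9): f(z) on the grid is the kernel-weighted average of the samples."""
    H = h_L(Z_grid, Z_hat, lam)                              # (M, N)
    return (H @ X) / H.sum(axis=1, keepdims=True)            # (M, D_V)

def e_step(X, F_grid, Z_grid):
    """Eq. (7) by grid search: assign each x_n to its nearest point f(z)."""
    d2 = np.sum((X[:, None, :] - F_grid[None, :, :])**2, axis=2)
    return Z_grid[np.argmin(d2, axis=1)]                     # (N, D_L)

# A few EM sweeps on toy data (1-D latent space, 3-D visible space):
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Z_grid = np.linspace(-1, 1, 30)[:, None]
Z_hat = rng.uniform(-1, 1, size=(50, 1))                     # random initialization
for _ in range(20):
    F_grid = m_step(X, Z_hat, Z_grid, lam=0.3)
    Z_hat = e_step(X, F_grid, Z_grid)
```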

There are several reasons why we chose kernel smoothing-based manifold modeling rather than Gaussian process-based modeling. First, because KSMM is a generalization of SOM, it is easy to extend SOM2 to MT-KSMM naturally. Second, because the information transfers are made by non-negative mixtures of datasets or models, a kernel smoother can be used consistently throughout the method. Third, because the kernel smoother acts as an elastic net that connects the data points [40, 41], the kernel smoothing of task models is expected to solve the unsupervised manifold alignment problem by minimizing the distance between manifolds. Finally, as far as we have examined, kernel smoothing-based manifold modeling is more stable and less sensitive to sample variations than the Gaussian process. This property is desirable for multi-task learning, particularly when the sample size is small.

Figure 2: Block diagram of the learning process. (a) Single-task KSMM. In the M step, the estimator of the embedding $\hat{f}$ is updated from the dataset $X=\{\mathbf{x}_n\}$ and the latent variables $\hat{Z}=\{\hat{\mathbf{z}}_n\}$, whereas in the E step, $\hat{Z}$ is updated from $X$ and $\hat{f}$. (b) MT-KSMM. To execute $I$ tasks in parallel, MT-KSMM consists of $I$ lower-KSMMs and two information transfer blocks: the instance transfer block and the model transfer block (higher-KSMM). (c) The higher-KSMM receives the task models $\hat{F}=\{\hat{f}_i\}$ from the lower-KSMMs as input data and outputs the general model $\hat{G}$. The general model is fed back to the lower-KSMMs as the model transfer result, whereas the task latent variables $\hat{U}=\{\hat{\mathbf{u}}_i\}$ are fed back to the instance transfer block.

4.2 Information transfer between tasks

For multi-task learning, the information about each task needs to be transferred to other tasks. Generally, there are three major approaches: feature-based, parameter-based, and instance-based [8]. To summarize, the feature-based approach transfers or shares the intrinsic representation between tasks, the parameter-based approach transfers the model parameters between tasks, and the instance-based approach shares the datasets among the tasks. In the proposed method, we use two information transfer styles: model transfer, which corresponds to the parameter-based approach, and instance transfer, which corresponds to the instance-based approach. Although we do not use the feature-based approach explicitly in the proposed method, we can regard the latent space $\mathcal{L}$, which is common to all tasks, as the intrinsic representation space shared among the tasks. Therefore, the proposed method is related to all information transfer styles of multi-task learning.

In instance transfer, samples of a task are transferred to other tasks, and form a weighted mixture of samples. Thus, if task $i'$ is a neighbor of task $i$ in the task latent space $\mathcal{T}$, the sample set $X_{i'}$ is merged into the target set $X_i$ as an auxiliary sample set with a larger weight. By contrast, if task $i''$ is far from task $i$ in $\mathcal{T}$, then $X_{i''}$ is merged into $X_i$ with a small (or zero) weight. We denote the weight of sample $n$ in source task $i_n$ with respect to the target task $i$ by $\rho_{in}$ ($0\leq\rho_{in}\leq 1$). Typically, $\rho_{in}$ is defined as

$$\rho_{in} \equiv \rho(\hat{\mathbf{u}}_i\mid\hat{\mathbf{u}}_{i_n}) = \exp\left[-\frac{1}{2\lambda_\rho^2}\,\bigl\|\hat{\mathbf{u}}_i-\hat{\mathbf{u}}_{i_n}\bigr\|^2\right], \qquad (10)$$

where $\lambda_\rho$ determines the neighborhood size for data mixing, and $\hat{\mathbf{u}}_i$ is the estimator of the task latent variable $\mathbf{u}_i$. Then we have the merged dataset $\tilde{X}_i=\{(\mathbf{x}_n,\rho_{in})\}$, where $(\mathbf{x}_n,\rho_{in})\in\tilde{X}_i$ indicates that $\mathbf{x}_n$ is a member of $\tilde{X}_i$ with weight $\rho_{in}$. Similarly, the latent variable set $Z_i$ is merged with the same weights as $\tilde{Z}_i=\{(\hat{\mathbf{z}}_n,\rho_{in})\}$. The data distribution corresponding to the merged dataset $(\tilde{X}_i,\tilde{Z}_i)$ becomes

$$\tilde{p}_i(\mathbf{x},\mathbf{z}\mid\tilde{X}_i,\tilde{Z}_i) = \frac{1}{\tilde{J}_i}\sum_n \rho_{in}\,\mathcal{N}(\mathbf{x}\mid\mathbf{x}_n,\beta^{-1}\mathbf{I})\,h_\mathcal{L}(\mathbf{z}\mid\hat{\mathbf{z}}_n) = \frac{\sum_{i'}\rho(\hat{\mathbf{u}}_i\mid\hat{\mathbf{u}}_{i'})\,p_{i'}(\mathbf{x},\mathbf{z}\mid X_{i'},\hat{Z}_{i'})}{\sum_{i''}\rho(\hat{\mathbf{u}}_i\mid\hat{\mathbf{u}}_{i''})}, \qquad (11)$$

where $\tilde{J}_i=\sum_n\rho_{in}$. Eq. (11) means that the data distributions $P=\{p_i\}$ are smoothed by the kernel $\rho(\mathbf{u},\mathbf{u}')$ before the task models are estimated. By creating a weighted mixture of samples, we can avoid the small sample size problem.
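A minimal sketch of this step is given below (not the authors' code); `U_hat` holds the estimated task latent variables and `task_of_sample[n]` gives the task index $i_n$, both names being illustrative assumptions.

```python
import numpy as np

def instance_transfer_weights(U_hat, task_of_sample, lam_rho):
    """Eq. (10): rho[i, n] = exp(-||u_i - u_{i_n}||^2 / (2 lam_rho^2))."""
    U_src = U_hat[task_of_sample]                                 # (N, D_T): u of each sample's source task
    d2 = np.sum((U_hat[:, None, :] - U_src[None, :, :])**2, axis=2)
    return np.exp(-0.5 * d2 / lam_rho**2)                         # (I, N)

# Toy example: three tasks in a 1-D task latent space and five samples.
U_hat = np.array([[-0.8], [0.0], [0.9]])
task_of_sample = np.array([0, 0, 1, 2, 2])
rho = instance_transfer_weights(U_hat, task_of_sample, lam_rho=0.5)
print(rho.shape)   # (3, 5): the weight of every sample for every target task
```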

By contrast, in model transfer, each task model $f_i$ is modified so that it becomes similar to the models of the neighboring tasks, as follows:

$$\tilde{f}_i(\mathbf{z}) = \frac{\sum_{i'} h_\mathcal{T}(\hat{\mathbf{u}}_i\mid\hat{\mathbf{u}}_{i'})\,\hat{f}_{i'}(\mathbf{z})}{\sum_{i''} h_\mathcal{T}(\hat{\mathbf{u}}_i\mid\hat{\mathbf{u}}_{i''})}, \qquad (12)$$

where $h_\mathcal{T}(\mathbf{u}\mid\mathbf{u}')=\mathcal{N}(\mathbf{u}\mid\mathbf{u}',\lambda_\mathcal{T}^2\mathbf{I})$. Eq. (12) is the kernel smoothing of the task models $F=\{f_i\}$. Thus, the model distribution becomes

$$\tilde{q}_i(\mathbf{x},\mathbf{z}\mid\tilde{f}_i) = \mathcal{N}(\mathbf{x}\mid\tilde{f}_i(\mathbf{z}),\beta^{-1}\mathbf{I})\,p(\mathbf{z}). \qquad (13)$$

Note that (12) and (13) are integrated as

$$\log\tilde{q}_i(\mathbf{x},\mathbf{z}) \stackrel{c}{=} \frac{\sum_{i'} h_\mathcal{T}(\hat{\mathbf{u}}_i\mid\hat{\mathbf{u}}_{i'})\,\log q_{i'}(\mathbf{x},\mathbf{z})}{\sum_{i''} h_\mathcal{T}(\hat{\mathbf{u}}_i\mid\hat{\mathbf{u}}_{i''})}. \qquad (14)$$

Therefore, the logarithm of the model distributions $Q=\{q_i\}$ is smoothed by the kernel $h_\mathcal{T}(\mathbf{u}\mid\mathbf{u}')$ after the task models are estimated.
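The following minimal sketch illustrates (12) under the simplifying assumption that each task model is represented by its values on a fixed grid of $\mathcal{L}$ (so smoothing the function values is equivalent to smoothing the embeddings); the actual method smooths the parametric models via the higher-KSMM, as described in Sections 4.3–4.5.

```python
import numpy as np

def model_transfer(F_grid, U_hat, lam_T):
    """Eq. (12): kernel smoothing of the task models over the task latent space.
    F_grid[i] holds f_i evaluated at M fixed points of L, shape (I, M, D_V)."""
    d2 = np.sum((U_hat[:, None, :] - U_hat[None, :, :])**2, axis=2)   # (I, I)
    H_T = np.exp(-0.5 * d2 / lam_T**2)                                # h_T(u_i | u_i')
    W = H_T / H_T.sum(axis=1, keepdims=True)                          # row-normalized weights
    return np.einsum('ij,jmd->imd', W, F_grid)                        # smoothed models, (I, M, D_V)
```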

4.3 Architecture of multi-task KSMM (MT-KSMM)

When we have $I$ datasets (i.e., $I$ tasks), we need to execute $I$ KSMMs in parallel. To execute the multi-task learning of KSMMs, the proposed method MT-KSMM introduces two key ideas. The first idea is to use both instance transfer and model transfer, and the second idea is to introduce the higher-KSMM, which integrates the $I$ manifold models estimated by the KSMMs for the tasks (i.e., the lower-KSMMs).

As described in Section 4.1, KSMM estimates $\hat{f}$ from $X$ and $\hat{Z}$ in the M step, whereas $\hat{Z}$ is estimated from $X$ and $\hat{f}$ in the E step. The first key idea is to use the merged dataset $(\tilde{X}_i,\tilde{Z}_i)$ obtained by instance transfer instead of the original dataset $(X_i,\hat{Z}_i)$. Thus, $\hat{f}_i$ is estimated from $\tilde{X}_i$ and $\tilde{Z}_i$. Similarly, to estimate the latent variables $\hat{Z}_i$, the proposed method uses the embedding $\tilde{f}_i$ obtained by model transfer instead of using $\hat{f}_i$ directly (Figure 2 (b)).

To execute instance transfer and model transfer, we need to determine the similarities between tasks. For this purpose, the proposed method introduces the higher-KSMM (Figure 2 (c)). The higher-KSMM estimates the latent variables of the tasks $\hat{U}=\{\hat{\mathbf{u}}_i\}_{i=1}^I$, by which the similarities between tasks are determined. Another important role of the higher-KSMM is to perform the model transfer by estimating the general model $\hat{G}$ from the task models $\hat{F}=\{\hat{f}_i\}_{i=1}^I$.

After introducing these two key ideas, the entire architecture of MT-KSMM consists of three blocks: the instance transfer block, the lower-KSMM block, and the higher-KSMM block (i.e., the model transfer block) (Figure 2 (b)). The aim of the lower-KSMM is to estimate the task model $f_i$ for each task $i$ from the merged dataset $(\tilde{X}_i,\tilde{Z}_i)$. Without considering information transfer, the cost function of the lower-KSMM for task $i$ is given by

$$E_i^{\text{lower}}[\tilde{Z}_i,f_i\mid\tilde{X}_i] \coloneqq \frac{\beta}{2\tilde{J}_i}\sum_n \int h_\mathcal{L}(\mathbf{z}\mid\mathbf{z}_n)\,\rho_{in}\,\|f_i(\mathbf{z})-\mathbf{x}_n\|^2\,d\mathbf{z} \stackrel{c}{=} H\bigl[\tilde{p}_i(\mathbf{x},\mathbf{z}\mid\tilde{X}_i,\tilde{Z}_i),\,q_i(\mathbf{x},\mathbf{z}\mid f_i)\bigr]. \qquad (15)$$

However, the minimization of this cost function is affected by the information transfer at every iteration of the EM algorithm. Thus, before executing the M step, $(X_i,\hat{Z}_i)$ is merged by instance transfer into $(\tilde{X}_i,\tilde{Z}_i)$, whereas before executing the E step, $\hat{f}_i$ is replaced by $\tilde{f}_i$, the result of model transfer.

By contrast, the aim of the higher-KSMM is to estimate the general model $G$ by regarding the task models $\hat{F}=\{\hat{f}_i\}_{i=1}^I$ as its dataset. Thus, the cost function (without considering the influence of the other blocks) is given by

$$E^{\text{higher}}[U,G\mid\hat{F}] = \frac{\beta}{2I}\sum_i \int h_\mathcal{T}(\mathbf{u}\mid\mathbf{u}_i)\,\bigl\|G(\cdot,\mathbf{u})-\hat{f}_i(\cdot)\bigr\|^2_\mathcal{H}\,d\mathbf{u}. \qquad (16)$$

This cost function equals the cross-entropy between $p(\mathbf{x},\mathbf{z},\mathbf{u}\mid\hat{F},U)$ and $q(\mathbf{x},\mathbf{z},\mathbf{u}\mid G)$:

$$E^{\text{higher}}[U,G\mid\hat{F}] \stackrel{c}{=} H\bigl[p(\mathbf{x},\mathbf{z},\mathbf{u}\mid\hat{F},U),\,q(\mathbf{x},\mathbf{z},\mathbf{u}\mid G)\bigr],$$

where

$$p(\mathbf{x},\mathbf{z},\mathbf{u}\mid\hat{F},U) = \frac{1}{I}\sum_i \mathcal{N}(\mathbf{x}\mid f_i(\mathbf{z}),\beta^{-1}\mathbf{I})\,p(\mathbf{z})\,h_\mathcal{T}(\mathbf{u}\mid\mathbf{u}_i)$$
$$q(\mathbf{x},\mathbf{z},\mathbf{u}\mid G) = \mathcal{N}(\mathbf{x}\mid G(\mathbf{z},\mathbf{u}),\beta^{-1}\mathbf{I})\,p(\mathbf{z})\,p(\mathbf{u}).$$

Note that $p(\mathbf{x},\mathbf{z},\mathbf{u})$ is determined by $\hat{F}=\{\hat{f}_i\}$ and $U=\{\mathbf{u}_i\}$, regarding $\hat{F}$ as the input data. Therefore, $p(\mathbf{x},\mathbf{z},\mathbf{u})$ is not determined naively by the observed samples $X$; thus,

$$p(\mathbf{x},\mathbf{z},\mathbf{u}) \neq \frac{1}{N}\sum_n \mathcal{N}(\mathbf{x}\mid\mathbf{x}_n,\beta^{-1}\mathbf{I})\,h_\mathcal{L}(\mathbf{z}\mid\mathbf{z}_n)\,h_\mathcal{T}(\mathbf{u}\mid\mathbf{u}_{i_n}).$$

The lower- and higher-KSMMs do not simply optimize the cost functions (15) and (16) in parallel. The estimated task models $\hat{F}=\{\hat{f}_i\}_{i=1}^I$ are fed forward from the lower-KSMMs to the higher-KSMM as its dataset, whereas the estimated general model $\hat{G}$ is fed back from the higher-KSMM to the lower-KSMMs as the result of model transfer, so that $\tilde{f}_i(\cdot)=\hat{G}(\cdot,\hat{\mathbf{u}}_i)$. Additionally, the estimated task latent variables $\hat{U}=\{\hat{\mathbf{u}}_i\}_{i=1}^I$ are fed back from the higher-KSMM to the instance transfer block, whereby the merged datasets $\{(\tilde{X}_i,\tilde{Z}_i)\}_{i=1}^I$ are updated. Such a hierarchical optimization process is consistent with the concept of meta-learning [6].

Algorithm 1 MT-KSMM algorithm
  For all $n$, initialize $Z=\{\mathbf{z}_n\}$ randomly. (Step 0)
  For all $i$, initialize $U=\{\mathbf{u}_i\}$ randomly. (Step 0)
  repeat
     For all $(i,n)$, calculate $\rho_{in}$ using (10). (Step 1)
     For all $i$, calculate $\mathbf{V}_i$ using (26)–(30). (Step 2)
     Calculate $\underline{\mathbf{W}}$ using (32)–(36). (Step 3)
     For all $i$, update $\mathbf{u}_i$ using (19). (Step 4)
     For all $n$, update $\mathbf{z}_n$ using (20). (Step 5)
  until the calculation converges.

4.4 MT-KSMM algorithm

The MT-KSMM algorithm is described as follows (also see Algorithm 1).

Step 0: Initialization

For the first loop, the latent variables $\hat{Z}=\{\hat{\mathbf{z}}_n\}$ and $\hat{U}=\{\hat{\mathbf{u}}_i\}$ are initialized randomly.

Step 1: Instance transfer

To execute instance transfer, $\rho_{in}$ is calculated using (10). We then have the merged datasets $\tilde{X}_i=\{(\mathbf{x}_n,\rho_{in})\}$ and $\tilde{Z}_i=\{(\hat{\mathbf{z}}_n,\rho_{in})\}$ for all $i$. This step can be interpreted as the kernel smoothing of the data distribution set $P=\{p_i(\mathbf{x},\mathbf{z})\}$ using (11), yielding $\tilde{P}=\{\tilde{p}_i(\mathbf{x},\mathbf{z})\}$.

Step 2: M step of the lower-KSMM

The task models $\hat{F}=\{\hat{f}_i\}$ are updated so that the cost function of the lower-KSMM is minimized as

$$\hat{f}_i \coloneqq \operatorname*{arg\,min}_{f_i} E_i^{\text{lower}}[\tilde{Z}_i,f_i\mid\tilde{X}_i],$$

for each task $i$. The solution is given by the kernel smoothing of the merged sample set as follows:

$$\hat{f}_i(\mathbf{z}) \coloneqq \frac{\sum_n h_\mathcal{L}(\mathbf{z}\mid\hat{\mathbf{z}}_n)\,\rho_{in}\,\mathbf{x}_n}{\sum_{n'} h_\mathcal{L}(\mathbf{z}\mid\hat{\mathbf{z}}_{n'})\,\rho_{in'}}. \qquad (17)$$

Then we have the model distribution set $Q=\{\hat{q}_i(\mathbf{x},\mathbf{z})\}$ by (5).

Step 3: M step of the higher-KSMM

The general model $\hat{G}$ is updated so that the cost function of the higher-KSMM is minimized as

$$\hat{G} \coloneqq \operatorname*{arg\,min}_G E^{\text{higher}}[\hat{U},G\mid\hat{F}].$$

The solution is given by

$$\hat{G}(\mathbf{z},\mathbf{u}) \coloneqq \frac{\sum_i h_\mathcal{T}(\mathbf{u}\mid\hat{\mathbf{u}}_i)\,\hat{f}_i(\mathbf{z})}{\sum_{i'} h_\mathcal{T}(\mathbf{u}\mid\hat{\mathbf{u}}_{i'})}. \qquad (18)$$

Then we have the entire model distribution $q(\mathbf{x},\mathbf{z},\mathbf{u}\mid\hat{G})$ using (2). This process is regarded as the model transfer.

Step 4: E step of the higher-KSMM

The task latent variables $\hat{U}=\{\hat{\mathbf{u}}_i\}$ are updated so that the log-likelihood is maximized as

$$\hat{\mathbf{u}}_i \coloneqq \operatorname*{arg\,max}_{\mathbf{u}_i} \sum_{n\in\mathcal{N}_i} \log q(\mathbf{x}_n,\hat{\mathbf{z}}_n,\mathbf{u}_i\mid\hat{G}) = \operatorname*{arg\,min}_{\mathbf{u}_i} \sum_{n\in\mathcal{N}_i} \bigl\|\hat{G}(\hat{\mathbf{z}}_n,\mathbf{u}_i)-\mathbf{x}_n\bigr\|^2. \qquad (19)$$

Then we have the task models as

$$\tilde{f}_i(\mathbf{z}) \coloneqq \hat{G}(\mathbf{z},\hat{\mathbf{u}}_i)$$
$$\tilde{q}_i(\mathbf{x},\mathbf{z}) \coloneqq q(\mathbf{x},\mathbf{z}\mid\hat{\mathbf{u}}_i,\hat{G}),$$

which are the results of the model transfer given by (12) and (13). Then $\tilde{F}=\{\tilde{f}_i\}$ is fed back to the lower-KSMMs.

Step 5: E step of the lower-KSMM

Finally, the sample latent variables $\hat{Z}=\{\hat{\mathbf{z}}_n\}$ are updated so that the log-likelihood is maximized as

$$\hat{\mathbf{z}}_n \coloneqq \operatorname*{arg\,max}_{\mathbf{z}_n} \log\tilde{q}_{i_n}(\mathbf{x}_n,\mathbf{z}_n) = \operatorname*{arg\,min}_{\mathbf{z}_n} \bigl\|\tilde{f}_{i_n}(\mathbf{z}_n)-\mathbf{x}_n\bigr\|^2. \qquad (20)$$

These five steps are iterated until the calculation converges. During the iterations, the length constants of the smoothing kernels, $\lambda_\mathcal{L}$ and $\lambda_\mathcal{T}$, are gradually reduced, as in the ordinary SOM.
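The sketch below (not the authors' implementation) puts Steps 1–5 together under simplifying assumptions: one-dimensional sample and task latent spaces, task models represented non-parametrically on fixed grids, grid search only in the E steps, and linear annealing of the kernel widths. It is intended only to make the data flow of Algorithm 1 concrete; the actual method uses the parametric representation of Section 4.5.

```python
import numpy as np

def gauss(a, b, lam):
    """exp(-(a - b)^2 / (2 lam^2)) for 1-D arrays a (M,) and b (N,); returns (M, N)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lam**2)

def mt_ksmm(X, task_of_sample, n_tasks, n_iter=50, n_grid=30,
            lam_L=(1.0, 0.1), lam_T=(1.0, 0.1), lam_rho=0.3, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    z_grid = np.linspace(-1, 1, n_grid)                   # grid on L
    u_grid = np.linspace(-1, 1, n_grid)                   # grid on T
    Z = rng.uniform(-1, 1, N)                             # Step 0: sample latent variables
    U = rng.uniform(-1, 1, n_tasks)                       # Step 0: task latent variables
    for t in range(n_iter):
        r = t / max(n_iter - 1, 1)                        # anneal the kernel widths
        lL = lam_L[0] + r * (lam_L[1] - lam_L[0])
        lT = lam_T[0] + r * (lam_T[1] - lam_T[0])
        # Step 1: instance-transfer weights rho[i, n], Eq. (10)
        rho = gauss(U, U[task_of_sample], lam_rho)                        # (I, N)
        # Step 2: lower M step, Eq. (17): weighted kernel smoothing per task
        H_L = gauss(z_grid, Z, lL)                                        # (M, N)
        W = rho[:, None, :] * H_L[None, :, :]                             # (I, M, N)
        F = (W @ X) / W.sum(axis=2, keepdims=True)                        # task models on the grid
        # Step 3: higher M step, Eq. (18): kernel smoothing of the task models
        H_T = gauss(u_grid, U, lT)                                        # (M, I)
        G = np.einsum('mi,izd->mzd', H_T / H_T.sum(axis=1, keepdims=True), F)
        # Step 4: higher E step, Eq. (19): grid search over u for each task
        zi = np.argmin(np.abs(z_grid[None, :] - Z[:, None]), axis=1)      # nearest grid index of z_n
        for i in range(n_tasks):
            idx = np.flatnonzero(task_of_sample == i)
            err = np.sum((G[:, zi[idx], :] - X[idx][None, :, :])**2, axis=(1, 2))
            U[i] = u_grid[np.argmin(err)]
        # Step 5: lower E step, Eq. (20): grid search over z using the transferred
        # models f~_i(z) = G(z, u_i)
        ui = np.argmin(np.abs(u_grid[None, :] - U[:, None]), axis=1)
        F_tilde = G[ui]                                                   # (I, M, D)
        d2 = np.sum((F_tilde[task_of_sample] - X[:, None, :])**2, axis=2)
        Z = z_grid[np.argmin(d2, axis=1)]
    return Z, U, F, G
```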

4.5 Parametric representation of embeddings

To apply the gradient method, we represent the embedding $f$ parametrically using orthonormal basis functions of the RKHS (e.g., normalized Legendre polynomials), as

$$f(\mathbf{z}) = f(\mathbf{z}\mid\mathbf{V}) = \sum_{l=1}^L \varphi_l(\mathbf{z})\,\mathbf{v}_l = \mathbf{V}^\mathsf{T}\boldsymbol{\varphi}(\mathbf{z}),$$

where $\boldsymbol{\varphi}=(\varphi_1,\dots,\varphi_L)^\mathsf{T}$ is the basis set and $\mathbf{V}\in\mathbb{R}^{L\times D_\mathcal{V}}$ is the coefficient matrix. In the case of single-task KSMM, the estimator of the coefficient matrix $\hat{\mathbf{V}}$ is determined as

$$\hat{\mathbf{V}} = \mathbf{A}^{-1}\mathbf{B}\mathbf{X}, \qquad (21)$$

where

$$\mathbf{A} = \int \boldsymbol{\varphi}(\mathbf{z})\,\boldsymbol{\varphi}^\mathsf{T}(\mathbf{z})\,\overline{h}_\mathcal{L}(\mathbf{z})\,d\mathbf{z} \qquad (22)$$
$$\mathbf{B} = \int \boldsymbol{\varphi}(\mathbf{z})\,\mathbf{h}_\mathcal{L}(\mathbf{z})^\mathsf{T}\,d\mathbf{z}, \qquad (23)$$

and

$$\mathbf{h}_\mathcal{L}(\mathbf{z}) = \bigl(h_\mathcal{L}(\mathbf{z}\mid\hat{\mathbf{z}}_1),\dots,h_\mathcal{L}(\mathbf{z}\mid\hat{\mathbf{z}}_N)\bigr)^\mathsf{T} \qquad (24)$$
$$\overline{h}_\mathcal{L}(\mathbf{z}) = \sum_n h_\mathcal{L}(\mathbf{z}\mid\hat{\mathbf{z}}_n). \qquad (25)$$

Thus, (9) is replaced by (21)–(25).
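A minimal sketch of this parametric M step is given below, assuming a one-dimensional latent space, normalized Legendre polynomials as the orthonormal basis, and numerical integration on a uniform grid; the normalization constant of the Gaussian kernel is dropped because it cancels in (21).

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

def legendre_basis(z, L=5):
    """Orthonormal Legendre polynomials on [-1, 1], evaluated at z (shape (M,))."""
    return np.stack([Legendre.basis(l)(z) * np.sqrt((2 * l + 1) / 2)
                     for l in range(L)], axis=1)                          # (M, L)

def parametric_m_step(X, Z_hat, lam_L, L=5, n_grid=200):
    """Eqs. (21)-(25) for a 1-D latent space, with the integrals evaluated on a grid."""
    z_grid = np.linspace(-1, 1, n_grid)
    dz = z_grid[1] - z_grid[0]
    Phi = legendre_basis(z_grid, L)                                       # (M, L)
    H = np.exp(-0.5 * (z_grid[:, None] - Z_hat[None, :])**2 / lam_L**2)   # h_L(z | z_n), (M, N)
    h_bar = H.sum(axis=1)                                                 # Eq. (25)
    A = (Phi * h_bar[:, None]).T @ Phi * dz                               # Eq. (22)
    B = Phi.T @ H * dz                                                    # Eq. (23)
    return np.linalg.solve(A, B @ X)                                      # Eq. (21): V = A^{-1} B X

# Usage: X is (N, D_V) and Z_hat is (N,); f(z) is recovered as legendre_basis(z, L) @ V.
```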

Similarly, in MT-KSMM, the coefficient matrices of the task models are estimated as

$$\hat{\mathbf{V}}_i = \mathbf{A}_i^{-1}\mathbf{B}_i\mathbf{X}, \qquad (26)$$

where

$$\mathbf{A}_i = \int \boldsymbol{\varphi}(\mathbf{z})\,\boldsymbol{\varphi}^\mathsf{T}(\mathbf{z})\,\overline{h}_i(\mathbf{z})\,d\mathbf{z} \qquad (27)$$
$$\mathbf{B}_i = \int \boldsymbol{\varphi}(\mathbf{z})\,\mathbf{h}_i(\mathbf{z})^\mathsf{T}\,d\mathbf{z} \qquad (28)$$
$$\mathbf{h}_i(\mathbf{z}) = \bigl(\rho_{in}\,h_\mathcal{L}(\mathbf{z}\mid\hat{\mathbf{z}}_n)\bigr)_{n=1}^N \qquad (29)$$
$$\overline{h}_i(\mathbf{z}) = \sum_n \rho_{in}\,h_\mathcal{L}(\mathbf{z}\mid\hat{\mathbf{z}}_n). \qquad (30)$$

Thus, (17) is replaced by (26)–(30). Note that $\underline{\mathbf{V}}=(\mathbf{V}_i)_{i=1}^I\in\mathbb{R}^{I\times L\times D_\mathcal{V}}$ becomes a tensor of order 3. For the higher-KSMM, the general model $G$ is represented as

$$G(\mathbf{z},\mathbf{u}\mid\underline{\mathbf{W}}) = \sum_{k=1}^K \sum_{l=1}^L \mathbf{w}_{kl}\,\psi_k(\mathbf{u})\,\varphi_l(\mathbf{z}) = \underline{\mathbf{W}}\times_1\boldsymbol{\psi}(\mathbf{u})\times_2\boldsymbol{\varphi}(\mathbf{z}), \qquad (31)$$

where $\underline{\mathbf{W}}\in\mathbb{R}^{K\times L\times D_\mathcal{V}}$ is the coefficient tensor, $\times_m$ denotes the mode-$m$ tensor–matrix product, and $\boldsymbol{\psi}=(\psi_1,\dots,\psi_K)^\mathsf{T}$ is the basis set for the higher-KSMM. The coefficient tensor $\underline{\mathbf{W}}$ is estimated as

$$\hat{\underline{\mathbf{W}}} = \hat{\underline{\mathbf{V}}}\times_1\bigl(\mathbf{C}^{-1}\mathbf{D}\bigr) \qquad (32)$$
$$\mathbf{C} = \int \boldsymbol{\psi}(\mathbf{u})\,\boldsymbol{\psi}^\mathsf{T}(\mathbf{u})\,\overline{h}_\mathcal{T}(\mathbf{u})\,d\mathbf{u} \qquad (33)$$
$$\mathbf{D} = \int \boldsymbol{\psi}(\mathbf{u})\,\mathbf{h}_\mathcal{T}^\mathsf{T}(\mathbf{u})\,d\mathbf{u}, \qquad (34)$$

where

$$\mathbf{h}_\mathcal{T}(\mathbf{u}) = \bigl(h_\mathcal{T}(\mathbf{u}\mid\mathbf{u}_i)\bigr)_{i=1}^I \qquad (35)$$
$$\overline{h}_\mathcal{T}(\mathbf{u}) = \sum_i h_\mathcal{T}(\mathbf{u}\mid\mathbf{u}_i). \qquad (36)$$

Thus, (18) is replaced by (32)–(36). Note that the computational cost of the proposed method with respect to the data size is $\mathcal{O}(NID_\mathcal{V})$ for the lower-KSMMs and $\mathcal{O}(ID_\mathcal{V})$ for the higher-KSMM. Therefore, the computational cost is proportional to both the data size $N$ and the number of tasks $I$.
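For completeness, the following minimal sketch (same simplifying assumptions as above, with a one-dimensional task latent space) applies the smoother $\mathbf{C}^{-1}\mathbf{D}$ along the task mode of the stacked coefficient matrices, as in (32)–(36); `psi_basis` can be the hypothetical `legendre_basis` helper from the previous sketch.

```python
import numpy as np

def higher_m_step(V_stack, U_hat, psi_basis, lam_T, K=5, n_grid=200):
    """Eqs. (32)-(36): V_stack is (I, L, D_V), U_hat is (I,) in [-1, 1]."""
    u_grid = np.linspace(-1, 1, n_grid)
    du = u_grid[1] - u_grid[0]
    Psi = psi_basis(u_grid, K)                                              # (M, K)
    H_T = np.exp(-0.5 * (u_grid[:, None] - U_hat[None, :])**2 / lam_T**2)   # h_T(u | u_i), (M, I)
    h_bar = H_T.sum(axis=1)                                                 # Eq. (36)
    C = (Psi * h_bar[:, None]).T @ Psi * du                                 # Eq. (33)
    D = Psi.T @ H_T * du                                                    # Eq. (34)
    S = np.linalg.solve(C, D)                                               # (K, I): the smoother C^{-1} D
    return np.einsum('ki,ild->kld', S, V_stack)                             # Eq. (32): W = V x_1 (C^{-1} D)

# G(z, u) is then evaluated as in Eq. (31): G = W x_1 psi(u) x_2 phi(z).
```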

5 Experimental results

5.1 Experiment protocol

To demonstrate the performance of the proposed method, we applied the method to three datasets. The first was an artificial dataset generated from two-dimensional (2D) manifolds, and the second and third were face image datasets of multiple subjects with various poses and expressions, respectively.

To examine how the information transfer improved performance, we compared methods that used different transfer styles, with KSMM used as the common platform for all of them: (1) MT-KSMM (the proposed method) used both instance transfer and model transfer, (2) KSMM2 (the preceding method) used model transfer only, and (3) single-task KSMM did not transfer information between tasks. (The case of using instance transfer only cannot be examined, because the task latent variables $\hat{U}=\{\hat{\mathbf{u}}_i\}$ cannot be estimated without the higher-KSMM, which also performs the model transfer.) KSMM2 is a modification of the preceding method SOM2 that replaces the higher- and lower-SOMs with KSMMs. Note that this modification improves the performance of SOM2 by eliminating the quantization error. Similarly, single-task KSMM can be regarded as a SOM with a continuous latent space.

In the experiment, we trained the methods on a set of tasks by changing S/T. After training was complete, we evaluated the performance both qualitatively and quantitatively using the test data. We used two types of test data: test data of existing tasks and test data of new tasks. The former evaluates the generalization performance for new data of existing tasks that have been learned, and the latter evaluates the generalization performance for new unknown tasks.

We assessed the learning performance quantitatively using two measures: the root mean square error (RMSE) of the reconstructed samples and the mutual information (MI) between the true and estimated sample latent variables. RMSE (smaller is better) evaluates accuracy in the visible space $\mathcal{V}$ without considering the compatibility of the sample latent variables, whereas MI (larger is better) evaluates accuracy in the latent space $\mathcal{L}$ and considers how consistent the sample latent variables are between tasks. We evaluated both RMSE and MI for the test data of existing tasks and of new tasks. For existing tasks, we used the task latent variable $\hat{\mathbf{u}}_i$ estimated in the training phase, whereas for each new task, $\hat{\mathbf{u}}_{\text{new}}$ was estimated by minimizing (19).
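Since the text does not fix the exact estimators, the sketch below shows one plausible choice: the per-sample Euclidean reconstruction error for RMSE and a simple histogram-based estimate of MI for one-dimensional latent variables; both function names are illustrative.

```python
import numpy as np

def rmse(X_true, X_recon):
    """Root mean square reconstruction error in the visible space V."""
    return float(np.sqrt(np.mean(np.sum((X_true - X_recon)**2, axis=1))))

def mutual_information(z_true, z_est, bins=20):
    """Histogram-based MI estimate between true and estimated 1-D latent variables."""
    joint, _, _ = np.histogram2d(z_true, z_est, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])))
```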

Figure 3: Results for the saddle-shaped artificial dataset. For training, 200 manifolds with different biases and two samples per task (2 S/T) were used. (a) Ground truth. (b) MT-KSMM. (c) KSMM2. (d) KSMM. Five of the 200 manifolds are shown.
Figure 4: Quantitative evaluation when the number of training samples per task (S/T) was changed. The root mean square error (RMSE, smaller is better) and mutual information (MI, larger is better) were evaluated for the test data of existing tasks and of newly generated tasks. Error bars indicate standard deviations. (a) RMSE for existing tasks. (b) MI for existing tasks. (c) RMSE for new tasks. (d) MI for new tasks.

5.2 Artificial datasets

We examined the performance of the proposed method using an artificial dataset. We used 2D saddle-shaped manifolds embedded in a 10-dimensional (10D) visible space. The data generation model is given by

$$\mathbf{x} = F(\mathbf{z},u)+\boldsymbol{\varepsilon} = \begin{pmatrix} z_1 \\ z_2 \\ z_1^2-z_2^2+u \\ \mathbf{0}_7 \end{pmatrix} + \boldsymbol{\varepsilon}, \qquad (37)$$

where $\mathbf{z}=(z_1,z_2)\in[-1,+1]^2$ and $u\in[-1,+1]$ are latent variables drawn from uniform distributions, and $\mathbf{0}_7$ is a seven-dimensional zero vector. $\boldsymbol{\varepsilon}$ is 10D Gaussian noise, $\boldsymbol{\varepsilon}\sim\mathcal{N}(\mathbf{0}_{10},\sigma^2\mathbf{I}_{10})$, where $\sigma=0.1$. We generated 300 manifolds (i.e., tasks) by changing $u$, each of which had 100 samples. Of these, 200 tasks were used for training, each with $J$ S/T, and the remaining samples were used as test data. Thus, the samples of 100 tasks were used as test data for new tasks.
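To make the setup reproducible in spirit, the following minimal sketch (not the authors' code) draws datasets according to (37); the function name and return layout are chosen for illustration.

```python
import numpy as np

def generate_saddle_tasks(n_tasks=300, n_samples=100, sigma=0.1, seed=0):
    """Eq. (37): saddle-shaped 2-D manifolds embedded in a 10-D visible space."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(-1, 1, size=n_tasks)                  # task latent variables
    tasks = []
    for i in range(n_tasks):
        Z = rng.uniform(-1, 1, size=(n_samples, 2))       # sample latent variables
        X = np.zeros((n_samples, 10))
        X[:, 0], X[:, 1] = Z[:, 0], Z[:, 1]
        X[:, 2] = Z[:, 0]**2 - Z[:, 1]**2 + u[i]          # saddle height shifted by the task bias u
        X += sigma * rng.standard_normal(X.shape)         # 10-D Gaussian observation noise
        tasks.append((X, Z))
    return tasks, u
```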

A representative result is shown in Figure 3. In this case, each task had only two samples for training (2 S/T), so it was impossible to estimate the 2D manifold shape using single-task learning (Figure 3 (d)). Despite this, the proposed method estimated the manifold shapes successfully (Figure 3 (b)), although the border area of the manifolds shrank because of missing data. Compared with the proposed method, KSMM2 performed poorly in capturing the manifold shapes (Figure 3 (c)).

Figure 5: Results for three types of manifold datasets: 400 tasks were used for training, each consisting of 3 S/T.
Table 1: Generalization abilities evaluated using RMSE and MI for four types of manifold datasets: 400 training tasks were used, each of which consisted of 3 S/T. RMSE and MI were evaluated using the test samples of the existing tasks and new tasks.
| Measure | Test data | Method | Saddle shape | Convex | Triangle | Sine |
|---|---|---|---|---|---|---|
| RMSE | Existing tasks | MT-KSMM | 0.368 ± 0.004 | 0.346 ± 0.004 | 0.315 ± 0.001 | 0.327 ± 0.003 |
| | | KSMM2 | 0.679 ± 0.005 | 0.544 ± 0.004 | 0.419 ± 0.004 | 0.591 ± 0.005 |
| | | KSMM | 0.854 ± 0.010 | 0.750 ± 0.010 | 0.546 ± 0.005 | 0.742 ± 0.009 |
| | New tasks | MT-KSMM | 0.353 ± 0.003 | 0.339 ± 0.004 | 0.309 ± 0.002 | 0.319 ± 0.002 |
| | | KSMM2 | 0.633 ± 0.006 | 0.534 ± 0.003 | 0.409 ± 0.005 | 0.551 ± 0.007 |
| | | KSMM | 0.619 ± 0.009 | 0.492 ± 0.009 | 0.374 ± 0.004 | 0.432 ± 0.019 |
| MI | Existing tasks | MT-KSMM | 2.662 ± 0.012 | 2.670 ± 0.014 | 1.467 ± 0.066 | 2.706 ± 0.013 |
| | | KSMM2 | 1.895 ± 0.072 | 2.152 ± 0.042 | 1.458 ± 0.050 | 1.631 ± 0.056 |
| | | KSMM | 0.207 ± 0.004 | 0.207 ± 0.004 | 0.205 ± 0.005 | 0.210 ± 0.004 |
| | New tasks | MT-KSMM | 2.812 ± 0.020 | 2.724 ± 0.023 | 1.598 ± 0.049 | 2.768 ± 0.019 |
| | | KSMM2 | 2.025 ± 0.079 | 2.133 ± 0.053 | 1.529 ± 0.022 | 1.977 ± 0.163 |
| | | KSMM | 1.036 ± 0.019 | 1.253 ± 0.084 | 0.937 ± 0.054 | 1.717 ± 0.181 |

Figure 4 (a) and (b) show the RMSE and MI for the test data of existing tasks when S/T was changed. The results show that MT-KSMM performed better than the other methods, particularly when S/T was small. As S/T increased, all methods reduced the RMSE. By contrast, in the case of MI, the performance of single-task KSMM did not improve, even when a sufficient number of training samples was provided. Because no information was transferred between tasks in single-task KSMM, the manifolds were estimated independently for each task. Consequently, the sample latent variables 𝐳\mathbf{z} were estimated inconsistently between tasks.

Figure 4 (c) and (d) show the RMSE and MI for the test data of new tasks. We generated 100 new tasks using (37), each of which had 100 samples, and estimated the task latent variable u and the sample latent variable \mathbf{z} using the trained model. MT-KSMM also demonstrated excellent performance for new tasks, which suggests that it has high generalization ability.

We further evaluated the performance using the other artificial datasets shown in Figure 5. In this experiment, we used 400 tasks and 3 S/T for training. As shown in the figure, MT-KSMM succeeded in modeling the continuous change of the manifold shapes. In the case of the triangular manifolds (b), MT-KSMM captured the rotation of the triangular shapes. In the case of the sine manifolds, all manifolds intersected on the same line, so the data points on the intersection could not be distinguished between tasks; despite this, MT-KSMM succeeded in modeling the continuous change of the manifold shape. Table 1 summarizes the experiments on the artificial datasets. In all cases, MT-KSMM consistently demonstrated the best performance among the three methods.

Figure 6: Results for the face image dataset A, when 2 S/T was used for training. Original and reconstructed face images with various poses for (a) an existing task and (b) a new task. The images indicated by asterisks (*) were used for training and the remainder were used for testing.
Figure 7: Sample latent variables of face image dataset A estimated using MT-KSMM, trained with 2 S/T. Colors indicate the true pose angles (blue: -60^{\circ}, red: +60^{\circ}). Tasks 0–79 (below the dashed line) were used for training, and tasks 80–89 (above the dashed line) were used for testing as new tasks. The markers outlined in black represent the training data.
Figure 8: Face images generated by MT-KSMM by changing the task latent variable \mathbf{u} while the sample latent variable \mathbf{z} was fixed at 0.9. Two S/T were used for training.
Figure 9: Quantitative evaluation for face image set A when the number of training samples per task was changed. The RMSE of the reconstructed images and the MI between the pose angles and the sample latent variables were evaluated for new images of existing subjects that appeared in training, and for images of new subjects. (a) RMSE for existing subjects. (b) MI for existing subjects. (c) RMSE for new subjects. (d) MI for new subjects.
Figure 10: Reconstructed face images for dataset B, in which each face image is represented by a set of 64 landmark points. In this experiment, 2 S/T were used for training (indicated by an asterisk (*)). (From top to bottom) ground truth, and images reconstructed by MT-KSMM, KSMM2, and single-task KSMM. The color represents the type of emotion: pink: surprise, red: anger, purple: fear, green: disgust, orange: happiness, black: label missing.
Figure 11: Face images generated by MT-KSMM, which was trained using 2 S/T. (a) Various expressions of a subject generated by changing the sample latent variable \mathbf{z}. (b) Various styles of ‘surprise’ generated by changing the task latent variable \mathbf{u}.
Figure 12: Estimated sample latent variables. The markers outlined in black are the training data. (a) MT-KSMM. (b) KSMM2. (c) Single-task KSMM.

5.3 Face image dataset A: various poses

We applied the proposed method to a face image dataset with various poses (http://robotics.csie.ncku.edu.tw/Databases/FaceDetect_PoseEstimate.htm). The dataset consisted of face images of 90 subjects taken from various angles. We used 25 images per subject, taken every 5^{\circ} from -60^{\circ} to +60^{\circ}. The face parts were cut out from the original images as 64\times 64 pixel patches, and a Gaussian filter was applied to remove noise. In the experiment, we used the face images of 80 subjects for the training tasks and the images of the remaining 10 subjects for the test tasks. The dimensions of the sample latent space {\cal L} and the task latent space {\cal T} were 1 and 2, respectively.
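
The preprocessing described above can be sketched roughly as follows; the crop box, filter width, and image library are placeholders, not necessarily the pipeline used in the paper.

# Rough preprocessing sketch: crop the face region to 64x64 and apply a Gaussian filter.
import numpy as np
from scipy.ndimage import gaussian_filter
from PIL import Image

def preprocess_face(path, box=(60, 40, 188, 168), size=(64, 64), blur=1.0):
    img = Image.open(path).convert("L").crop(box).resize(size)        # grayscale 64x64 face patch
    return gaussian_filter(np.asarray(img, dtype=float), sigma=blur)  # denoise with a Gaussian filter

# Each subject (task) then becomes a set of flattened 4096-D vectors,
# one per pose angle from -60 to +60 degrees in 5-degree steps.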

Figure 6 shows the reconstructed images for an existing task (a) and a new task (b). In this case, only two images per subject were used for training (i.e., 2 S/T). In the case of the existing task (Figure 6 (a)), we used only face images taken from the left side for training, but MT-KSMM supplemented the right-side views by mixing the images of other subjects. As a result, MT-KSMM successfully modeled the continuous pose change. (A brief comment on image quality: because the manifolds were represented as a linear combination of training data, it was unavoidable that the reconstructed images became cross-dissolved versions of the training images, particularly when the number of training images was small. Reconstructing more realistic images would require more training images and/or prior knowledge built into the method, which is beyond the scope of this study.) By contrast, KSMM2 and single-task KSMM failed to model the pose change. This ability of MT-KSMM was also observed for new tasks, for which no image was used for training. As Figure 6 (b) shows, MT-KSMM reconstructed face images at various angles, whereas KSMM2 and single-task KSMM failed.

Figure 7 shows the sample latent variables \hat{Z}=\{\hat{\mathbf{z}}_{n}\} estimated by MT-KSMM. The estimated latent variables were consistent with the actual pose angles for both existing tasks (below the dashed line) and new test tasks (above the dashed line). Thus, the face manifolds were aligned across all subjects, so that the latent variable represented the same property regardless of the task difference.

By changing the task latent variable \mathbf{u} while fixing the sample latent variable \mathbf{z}, various faces in the same pose were expected to appear. Figure 8 shows the face images generated by MT-KSMM when \mathbf{u} was changed; the results indicate that MT-KSMM worked as expected. In terms of the fiber bundle, the manifolds of various poses (Figure 6) correspond to the fibers, whereas the manifold consisting of various subjects corresponds to the base space (Figure 8).

Figure 9 shows the RMSE and MI for the test data of existing tasks ((a) and (b)) and new tasks ((c) and (d)). In this evaluation, the RMSE was measured between the true and reconstructed images, and the MI was measured between the true pose angle and the sample latent variable \mathbf{z}. As in the case of the artificial datasets, MT-KSMM performed better than the other methods, particularly when the number of training data was small.

5.4 Face image dataset B: various expressions

Figure 13: RMSE of the reconstructed images when the training samples per task (S/T) were changed. (a) For test data of existing tasks. (b) For test data of new tasks.
Figure 14: Application to supervised learning cases. (a) Multi-task regression. (b) Multi-task regression with domain shift. Solid curves are training tasks (3 out of 400 tasks are indicated in the figure), and dashed curves are new tasks. Markers denote the training samples (2 S/T in this experiment).

We applied the proposed method to another face image dataset, in which subjects displayed various expressions [42, 43]. The dataset consisted of image sequences that ranged from a neutral expression to a distinct expression. We chose 96 subjects from the dataset, each with 30 images. Eighty subjects were used for training as existing tasks, and the remaining 16 subjects were used for testing as new tasks. In the dataset, the sequences were labeled according to the emotion type, but the label information was not used for training.

The face image data were pre-processed as follows: from each face image, 64 landmark points, such as eyes and lip edges, were extracted, and their 2D coordinates were obtained. Thus, each face image was transformed into a 128-dimensional vector. To separate the emotional expression from the face shape, the vector of the neutral expression was subtracted from each landmark vector, so that for all subjects the origin corresponded to the neutral expression. To learn the dataset, 2D latent spaces were used for both {\cal L} and {\cal T}.
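
A minimal sketch of this preprocessing, assuming each sequence is stored as an array of 64 landmark coordinates per frame and that the first frame is the neutral expression:

# Convert a landmark sequence into neutral-centered 128-D feature vectors.
import numpy as np

def to_feature_vectors(landmark_seq):
    """landmark_seq: array of shape (n_frames, 64, 2); frame 0 is assumed neutral."""
    neutral = landmark_seq[0]
    return (landmark_seq - neutral).reshape(len(landmark_seq), -1)   # (n_frames, 128)

# Example with synthetic landmarks for one subject (task).
seq = np.random.rand(30, 64, 2)
X_i = to_feature_vectors(seq)   # 30 samples, each a 128-dimensional vector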

Figure 10 shows the reconstructed images for an existing task, trained using 2 S/T. Although only two images were used for training (‘surprise’ and ‘anger’ for this subject), MT-KSMM reconstructed the other expressions successfully, whereas KSMM2 and single-task KSMM failed.

MT-KSMM was expected to generate images expressing various emotions according to the sample latent variable \mathbf{z}, and images of various expression styles of the same emotion type according to the task latent variable \mathbf{u}. Figure 11 shows the images generated by MT-KSMM, suggesting that the proposed method represented the faces as expected. In terms of the fiber bundle, Figure 11 (a) represents a fiber, whereas Figure 11 (b) shows the base space.

Figure 12 shows the estimated sample latent variables \hat{Z}=\{\hat{\mathbf{z}}_{n}\}. The results indicate that MT-KSMM (Figure 12 (a)) roughly grouped the face images according to the expressed emotion, whereas KSMM2 and single-task KSMM did not yield such an organized representation of emotion types in the latent space. Finally, we evaluated the RMSE of the reconstructed images (Figure 13). Again, MT-KSMM performed better than the other methods for both existing and new tasks.

Figure 15: Conceptual figures from the viewpoint of information geometry. (a) Single-task case. The learning process of manifold modeling is the iteration of m- and e-projections. The arbitrariness with respect to the coordinate system of the latent space {\cal L} is indicated in the depth direction (e.g., black and blue represent equivalent solutions). When the number of training samples is small, the data distribution becomes further from the true distribution (indicated in orange). (b) Multi-task case. Instance transfer is translated as kernel smoothing in the m-mixture manner (m-smoothing), which generates the manifold \tilde{\cal P}. By contrast, model transfer is kernel smoothing in the e-mixture manner (e-smoothing), which generates the manifold \tilde{\cal Q}. Additionally, e-smoothing solves the manifold alignment problem.

5.5 Application to multi-task supervised learning

Although the proposed method was designed for multi-task manifold learning, the concepts of instance transfer and model transfer can be applied to other learning paradigms. Figure 14 shows two toy examples of multi-task regression. In this experiment, 400 tasks were generated by slanted sinusoidal functions s=(au)\sin(t+bu)+(cu)t+du, where t and s are the input and output, respectively, u is the task latent variable, and a,b,c,d are fixed parameters. Each task had 2 S/T, generated at uniformly random inputs.
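
The toy tasks can be generated as in the following sketch; the constants a, b, c, d and the random seed are placeholders, not the values used in the paper.

# Toy multi-task regression data: s = (a*u)*sin(t + b*u) + (c*u)*t + d*u, 2 S/T per task.
import numpy as np

rng = np.random.default_rng(0)
a, b, c, d = 1.0, 2.0, 0.5, 0.3

def make_task(u, n_samples=2, t_range=(-1.0, 1.0)):
    t = rng.uniform(*t_range, size=n_samples)          # uniformly random inputs
    s = (a * u) * np.sin(t + b * u) + (c * u) * t + d * u
    return t, s

u_all = rng.uniform(-1.0, 1.0, size=400)                # 400 task latent variables
tasks = [make_task(u) for u in u_all]                   # each task: 2 (input, output) pairs
# For the domain-shift variant, the input range itself would also depend on u,
# e.g., t_range=(u - 0.5, u + 0.5), and each pair (t, s) is treated as one sample x.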

In the first case (Figure 14 (a)), the input was distributed uniformly in [-1,+1]. For this dataset, the latent variable z was simply replaced by the input t, and the corresponding output s was regarded as the observed data x; thus, Step 5 in Algorithm 1 was omitted. As shown in Figure 14 (a), the proposed method successfully estimated the continuous change of the sinusoidal functions from small sample sets.

The proposed method can also be applied to multi-task regression when domain shifts occur. In the second case (Figure 14 (b)), the input distribution also changed according to the task latent variable u. For this case, MT-KSMM was applied directly by regarding each input–output pair (t,s) as a training sample of MT-KSMM; thus, \mathbf{x}=(t,s). Although the border area shrank because of the absence of training samples, MT-KSMM successfully captured the continuous change of the function shapes, as well as the change of the input domain. Note that estimating the input distribution is also difficult in the small sample case.

6 Discussion

The key idea of the proposed method is to use two types of information transfer: instance transfer and model transfer. We discuss how these transfers work from the viewpoint of information geometry. We also discuss the problem formulation of multi-task manifold learning from the viewpoint of optimal transport.

6.1 From the information geometry viewpoint

A generative manifold model can be regarded as an infinite Gaussian mixture model (GMM) in which the Gaussian components are arranged continuously along the nonlinear manifold. As with an ordinary GMM, it can be estimated using the EM algorithm [38, 39]. We first depict the single-task KSMM case from the information geometry viewpoint (Figure 15 (a)) and then describe the multi-task case (Figure 15 (b)).

6.1.1 Single-task case

Information geometry is geometry in statistical space 𝒮{\cal S}, which consists of all possible probability distributions. In our case, 𝒮{\cal S} is the space that consists of all joint probabilities of 𝐱\mathbf{x} and 𝐳\mathbf{z}, that is, 𝒮={p(𝐱,𝐳)}{\cal S}=\{p(\mathbf{x},\mathbf{z})\}.

When both X={𝐱n}n=1NX=\{\mathbf{x}_{n}\}_{n=1}^{N} and Z={𝐳n}n=1NZ=\{\mathbf{z}_{n}\}_{n=1}^{N} are given, we assume that the data distribution is represented as

p(\mathbf{x},\mathbf{z}\mid X,Z)=\frac{1}{N}\sum_{n=1}^{N}{\cal N}(\mathbf{x}\mid\mathbf{x}_{n},\beta^{-1}\mathbf{I}_{D_{\cal V}})\,{\cal N}(\mathbf{z}\mid\mathbf{z}_{n},\beta^{-1}\mathbf{I}_{D_{\cal L}}).

Thus, the data distribution p(𝐱,𝐳X,Z)p(\mathbf{x},\mathbf{z}\mid X,Z) corresponds to a point in 𝒮{\cal S}. Because dataset XX is known, whereas the latent variables ZZ are unknown, the set of data distributions for all possible ZZ becomes a data manifold 𝒟X={p(𝐱,𝐳X,Z)Z=(𝐳n)N}{\cal D}_{X}=\{p(\mathbf{x},\mathbf{z}\mid X,Z)\mid Z=(\mathbf{z}_{n})\in{\cal L}^{N}\}. Therefore, estimating ZZ from XX means determining the optimal point in the data manifold 𝒟X{\cal D}_{X}.
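
As a concrete transcription of this data distribution, the following sketch evaluates p(\mathbf{x},\mathbf{z}\mid X,Z) at a query point as an equal-weight mixture of isotropic Gaussians, one component per sample; the value of beta is arbitrary here.

# Evaluate the joint data density p(x, z | X, Z) at a single query point (x, z).
import numpy as np
from scipy.stats import multivariate_normal

def data_density(x, z, X, Z, beta=100.0):
    N = len(X)
    px = np.array([multivariate_normal.pdf(x, mean=xn, cov=np.eye(len(x)) / beta) for xn in X])
    pz = np.array([multivariate_normal.pdf(z, mean=zn, cov=np.eye(len(z)) / beta) for zn in Z])
    return np.sum(px * pz) / N   # equal-weight mixture over the N samples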

It should be noted that there is arbitrariness with respect to the coordinate system of {\cal L}. For example, if the coordinate system of {\cal L} is rotated or nonlinearly distorted, we obtain another equivalent solution of ZZ. Therefore, there is an infinite number of equivalent data distributions with respect to ZZ (indicated for the depth direction in Figure 15 (a)). Additionally, when the number of samples is small, 𝒟X{\cal D}_{X} is far from the true distribution, and there is a large variation depending on the sample set XX.

Similarly, we can consider a manifold consisting of all possible model distributions. In this study, we represent a model distribution as

q(\mathbf{x},\mathbf{z}\mid f)={\cal N}(\mathbf{x}\mid f(\mathbf{z}),\beta^{-1}\mathbf{I})\,p(\mathbf{z}),

where ff\in{\cal H}. Thus, the set of all possible model distributions forms a model manifold {\cal M}, which is homeomorphic to {\cal H}. Therefore, estimating ff from XX means determining the optimal point in {\cal M}. Like the case of data distribution, there is also arbitrariness with respect to the coordinate system of {\cal L}.

The aim of KSMM is to minimize the cross-entropy between p(\mathbf{x},\mathbf{z}\mid X,Z) and q(\mathbf{x},\mathbf{z}\mid f), that is, to minimize the Kullback–Leibler (KL) divergence D_{\text{KL}}\left[p\,\middle\|\,q\right] (rigorously speaking, the self-entropy of p(\mathbf{x},\mathbf{z}\mid X,Z) also needs to be considered because it depends on Z). The KSMM algorithm is an EM algorithm in which the KL divergence is minimized with respect to f and Z alternately. From the information geometry viewpoint, the optimal solutions \hat{Z} and \hat{f} are given by the nearest points of the two manifolds {\cal D}_{X} and {\cal M}. To determine the optimal pair (\hat{Z},\hat{f}), projections between the two manifolds are executed iteratively: in the M step, p is projected to {\cal M}, whereas in the E step, q is projected to {\cal D}_{X}. These two projections are referred to as the m-projection and the e-projection (Figure 15 (a)), and they are repeated until p and q converge to the nearest points of {\cal D}_{X} and {\cal M} [44]. We note again that there are infinitely many solution pairs (\hat{Z},\hat{f}) because of the arbitrariness of the coordinate system of {\cal L}.
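
The alternation can be illustrated by the following simplified numerical sketch, in which f is represented by its values on a fixed grid of latent points and the E step is approximated by a nearest-point assignment; this is for intuition only and omits details of the algorithm in the paper.

# Simplified single-task KSMM loop: alternate kernel smoothing (m-projection)
# and latent-variable reassignment (e-projection) on a 1-D latent grid.
import numpy as np

def ksmm(X, n_grid=50, lam=0.2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.linspace(-1.0, 1.0, n_grid)[:, None]           # latent grid in L
    Z = rng.uniform(-1.0, 1.0, size=(len(X), 1))              # initial latent variables
    for _ in range(n_iter):
        # m-projection: update the model f on the grid by kernel smoothing of the data
        H = np.exp(-0.5 * ((grid - Z.T) / lam) ** 2)          # h_L(grid_k | z_n), shape (K, N)
        F = (H @ X) / H.sum(axis=1, keepdims=True)            # f evaluated on the grid
        # e-projection (approximate): assign each z_n to the nearest point of the manifold
        d = ((X[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
        Z = grid[d.argmin(axis=1)]
    return Z, grid, F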

6.1.2 Information transfers

As shown in (11), the data distribution set P=\{p_{i}\} is mixed by the instance transfer as

\tilde{p}_{i}(\mathbf{x},\mathbf{z})\coloneqq\sum_{i^{\prime}}\alpha^{m}_{ii^{\prime}}\,p_{i^{\prime}}(\mathbf{x},\mathbf{z}), (38)

where

\alpha^{m}_{ii^{\prime}}=\frac{\rho(\mathbf{u}_{i}\mid\mathbf{u}_{i^{\prime}})}{\sum_{i^{\prime\prime}}\rho(\mathbf{u}_{i}\mid\mathbf{u}_{i^{\prime\prime}})}.

Then we have \tilde{P}=\{\tilde{p}_{i}\}. In information geometry, Eq. (38) is called the m-mixture of the data distribution set P. Therefore, the instance transfer is translated as the transfer by the m-mixture in terms of information geometry.

By contrast, as shown in (14), the logs of the model distributions Q=\{q_{i}\} are mixed by the model transfer as

\log\tilde{q}_{i}(\mathbf{x},\mathbf{z})\coloneqq\sum_{i^{\prime}}\alpha^{e}_{ii^{\prime}}\log q_{i^{\prime}}(\mathbf{x},\mathbf{z})+\mathit{const}, (39)

where

\alpha^{e}_{ii^{\prime}}=\frac{h_{\cal T}(\mathbf{u}_{i}\mid\mathbf{u}_{i^{\prime}})}{\sum_{i^{\prime\prime}}h_{\cal T}(\mathbf{u}_{i}\mid\mathbf{u}_{i^{\prime\prime}})}.

Then we have \tilde{Q}=\{\tilde{q}_{i}\}. Eq. (39) is called the e-mixture of the model distributions Q in information geometry. Therefore, the model transfer is translated as the transfer by the e-mixture. It has been shown that the e-mixture transfer improves multi-task learning in estimating probability densities [45].
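
The two sets of mixture weights can be computed as in the following sketch, assuming Gaussian kernels for \rho and h_{\cal T} as in the symbol list; the length constants are placeholders.

# Row-normalized Gaussian kernels over the task latent variables give both weight sets.
import numpy as np

def mixture_weights(U, length_scale):
    """U: (I, D_T) task latent variables; returns an I x I row-stochastic weight matrix."""
    sq = ((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-0.5 * sq / length_scale ** 2)
    return K / K.sum(axis=1, keepdims=True)

U = np.random.rand(5, 2)                  # task latent variables (I = 5 tasks)
alpha_m = mixture_weights(U, 0.3)         # weights of the m-mixture (instance transfer)
alpha_e = mixture_weights(U, 0.5)         # weights of the e-mixture (model transfer)

# m-mixture: p_tilde_i = sum_i' alpha_m[i, i'] * p_i'          (mixture of densities)
# e-mixture: log q_tilde_i = sum_i' alpha_e[i, i'] * log q_i' + const
#            (for Gaussians, this averages the natural parameters of the q_i')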

6.1.3 Multi-task case

Based on the above discussion, the learning process of MT-KSMM can be interpreted as follows (Figure 15 (b)): Suppose that we have I data distributions P=\{p_{i}\}. In Step 1, P=\{p_{i}\} are mapped to \tilde{P}=\{\tilde{p}_{i}\} using instance transfer (i.e., m-mixture transfer). However, MT-KSMM not only models the given tasks, but can also model unknown tasks. Therefore, we have a data distribution \tilde{p}(\mathbf{x},\mathbf{z}\mid\mathbf{u}) for any \mathbf{u}:

\tilde{p}(\mathbf{x},\mathbf{z}\mid\mathbf{u})=\frac{\sum_{i}\rho(\mathbf{u}\mid\mathbf{u}_{i})\,p_{i}(\mathbf{x},\mathbf{z}\mid X_{i},Z_{i})}{\sum_{i^{\prime}}\rho(\mathbf{u}\mid\mathbf{u}_{i^{\prime}})}.

As a result, we have a smooth manifold \tilde{\cal P}=\{\tilde{p}(\mathbf{x},\mathbf{z}\mid\mathbf{u})\mid\mathbf{u}\in{\cal T}\} in {\cal S}, which is homeomorphic to {\cal T}. The manifold \tilde{\cal P} is obtained by kernel smoothing of the data distributions P=\{p_{i}\} in the m-mixture manner. We refer to this process as m-smoothing. Thus, the instance transfer can be translated as m-smoothing, which generates the manifold \tilde{\cal P}.

In Step 2, the manifold \tilde{\cal P} is projected to {\cal M} by the m-projection (26), and we obtain the model distribution set Q=\{q_{i}\} in {\cal M}. Then, in Step 3, we obtain a manifold \tilde{\cal Q} from Q:

\log\tilde{q}(\mathbf{x},\mathbf{z}\mid\mathbf{u})\stackrel{c}{=}\frac{\sum_{i}h_{\cal T}(\mathbf{u}\mid\mathbf{u}_{i})\,\log q_{i}(\mathbf{x},\mathbf{z}\mid f_{i})}{\sum_{i^{\prime}}h_{\cal T}(\mathbf{u}\mid\mathbf{u}_{i^{\prime}})}.

Because the manifold \tilde{\cal Q} is obtained by kernel smoothing of Q in the e-mixture manner, we refer to this process as e-smoothing. Thus, the model transfer is now regarded as e-smoothing, which generates the manifold \tilde{\cal Q} in {\cal M}. In the manifold \tilde{\cal Q}, the model distributions of unknown tasks are also represented.

After e-smoothing, in Steps 4 and 5, the latent variables Z=\{\mathbf{z}_{n}\} and U=\{\mathbf{u}_{i}\} are estimated by the e-projection, by which \tilde{\cal P} is updated. Thus, the MT-KSMM algorithm is depicted as the bidirectional m- and e-projections of \tilde{\cal P} and \tilde{\cal Q}, which are generated by m- and e-smoothing, respectively.

It is worth noting that m-smoothing tends to make the distributions broader, whereas e-smoothing tends to make them narrower. Thus, if either m- or e-smoothing is used alone, it will not work as expected, particularly when S/T is small. Therefore, the combination of m- and e-smoothing is essentially important for multi-task manifold modeling. Additionally, e-smoothing plays another important role in MT-KSMM. By performing e-smoothing, the neighboring models come closer to each other in {\cal M}, and their coordinate systems also come closer to each other. As a result, the manifolds are aligned between the tasks, and the latent variables acquire a consistent representation.

To summarize, the MT-KSMM algorithm consists of the following steps: (Step 1) By m-smoothing, we obtain manifold 𝒫~\tilde{\cal P} from the data distribution set PP. This process is the instance transfer. (Step 2) 𝒫~\tilde{\cal P} is projected to {\cal M} by m-projection, and we obtain the model distribution set QQ. (Step 3) By e-smoothing, we obtain manifold 𝒬~\tilde{\cal Q} from QQ. This process is the model transfer. (Steps 4 and 5) 𝒬~\tilde{\cal Q} is projected to 𝒟{\cal D} by e-projection, and latent variables UU and ZZ are updated.

In this discussion, we described the learning process of MT-KSMM from the viewpoint of information geometry. We discussed the depiction of two manifolds 𝒫~\tilde{\cal P} and 𝒬~\tilde{\cal Q} in 𝒮{\cal S}, which are generated by m- and e-smoothing, respectively, and are projected alternately by m- and e-projections, respectively. We believe that such geometrical operations are the essence of the two information transfer styles in multi-task manifold modeling.

6.2 From the optimal transport viewpoint

In this study, we assumed that the latent space {\cal L} was common to all tasks. For example, for the face image set, we expected the latent variable 𝐳n\mathbf{z}_{n} to represent the pose or expression of the face image 𝐱n\mathbf{x}_{n} regardless of which task it belonged to. Thus, if we had two manifolds 𝒳1{\cal X}_{1} and 𝒳2{\cal X}_{2}, and the corresponding embeddings f1f_{1} and f2f_{2}, we considered that 𝐱1=f1(𝐳)𝒳1\mathbf{x}_{1}=f_{1}(\mathbf{z})\in{\cal X}_{1} and 𝐱2=f2(𝐳)𝒳2\mathbf{x}_{2}=f_{2}(\mathbf{z})\in{\cal X}_{2} would represent the same intrinsic property, even if 𝐱1𝐱2\mathbf{x}_{1}\neq\mathbf{x}_{2}. Based on this assumption, information was transferred between tasks. However, some questions arise. The first question is how we can know whether latent variables that belong to different tasks represent the same intrinsic property. The second question is how we can regularize multi-task learning so that the latent variables represent the common intrinsic property. Note that the intrinsic property is unknown. Therefore, we need to redefine the objective of multi-task manifold learning without using such ambiguous terms.

In this study, we defined the distance between two manifolds 𝒳1{\cal X}_{1} and 𝒳2{\cal X}_{2} as (1). Thus, to measure the distance, the corresponding embeddings f1f_{1} and f2f_{2} need to be provided. However, because there is arbitrariness with respect to the coordinate system of {\cal L}, there are many equivalent embeddings that generate the same manifold. Therefore, if f1,f2f_{1},f_{2} are not specified, the distance between two manifolds cannot be determined uniquely. In such a case, it is natural to choose the pair (f1,f2)(f_{1},f_{2}) that minimizes (1). Thus, the distance D(𝒳1,𝒳2)D({\cal X}_{1},{\cal X}_{2}) is defined as

D^{2}[{\cal X}_{1},{\cal X}_{2}]=\min_{f_{1}\in{\cal F}_{{\cal X}_{1}},\,f_{2}\in{\cal F}_{{\cal X}_{2}}}\int_{\cal L}\left\|f_{1}(\mathbf{z})-f_{2}(\mathbf{z})\right\|^{2}\,dP(\mathbf{z}), (40)

where 𝒳i{\cal F}_{{\cal X}_{i}} is a set of embeddings, which yield the equivalent data distribution on 𝒳i{\cal X}_{i}; that is, for any fi,fi𝒳if_{i},f_{i}^{\prime}\in{\cal F}_{{\cal X}_{i}}, they satisfy q(𝐱fi)=q(𝐱fi)q(\mathbf{x}\mid f_{i})=q(\mathbf{x}\mid f^{\prime}_{i}), where q(𝐱fi)=𝒩(𝐱fi(𝐳),β1𝐈)𝑑P(𝐳)q(\mathbf{x}\mid f_{i})=\int{\cal N}(\mathbf{x}\mid f_{i}(\mathbf{z}),\beta^{-1}\mathbf{I})\,dP(\mathbf{z}). Such a distance between manifolds (40) becomes the optimal transport distance between 𝒳1{\cal X}_{1} and 𝒳2{\cal X}_{2}. Based on this, we can say that manifolds (𝒳1,𝒳2)({\cal X}_{1},{\cal X}_{2}) are aligned with respect to the coordinate system induced by (f1,f2)(f_{1},f_{2}), if and only if (f1,f2)(f_{1},f_{2}) minimizes (40). Additionally, if two manifolds are aligned, we assume that the latent variable represents the intrinsic property that is independent of the task.
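
The following hedged sketch approximates (40) by Monte Carlo sampling of \mathbf{z} and a search over a tiny, hand-picked family of latent re-parameterizations (identity and reflection). The true definition minimizes over all embeddings inducing the same manifold, so this is only an illustrative upper bound, not the exact optimal transport distance.

# Monte Carlo approximation of a transport-style distance between two manifolds.
import numpy as np

def manifold_distance(f1, f2, n_mc=1000, seed=0):
    """f1, f2: callables mapping (n, 1) latent points to (n, D) points on X1, X2."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(-1.0, 1.0, size=(n_mc, 1))          # samples from the prior P(z)
    reparams = [lambda z: z, lambda z: -z]              # candidate re-parameterizations of L
    costs = [np.mean(np.sum((f1(z) - f2(r(z))) ** 2, axis=1)) for r in reparams]
    return min(costs)

# Example: the same curve in 2-D traversed in opposite latent directions.
f_a = lambda z: np.hstack([z, z ** 2])
f_b = lambda z: np.hstack([-z, z ** 2])
print(manifold_distance(f_a, f_b))                      # close to 0 after the reflection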

Now we recall the problem formulation in terms of the fiber bundle (Figure 1 (d)), in which each task manifold is regarded as a fiber. We can then consider the integral of the transport distance between fibers along the base space {\cal T}, which equals the length of the manifold {\cal Y}\subseteq{\cal H} (Figure 1 (c)). When all task manifolds are aligned with their neighbors, the total length of the manifold {\cal Y} is expected to be minimized. As already described, SOM (as well as KSMM) works like an elastic net; roughly speaking, the higher-KSMM tends to shorten the length of the manifold {\cal Y}, whereby the task manifolds are gradually aligned. Therefore, in MT-KSMM, the transport distance along the base space is implicitly regularized toward its optimum. It would be worth implementing such optimal transport-based regularization explicitly in the objective function, and this is left for future work, as is the underlying learning theory from the optimal transport viewpoint.

7 Conclusion

In this paper, we proposed a method for multi-task manifold learning that introduces two information transfers: instance transfer and model transfer. Throughout the experiments, the proposed method, MT-KSMM, performed better than KSMM2 and single-task KSMM. Because KSMM was consistently used as the platform for manifold learning in the comparison, we can conclude that these differences were caused solely by differences in the manner of information transfer between tasks. Thus, the combination of instance transfer and model transfer is effective for multi-task manifold learning, particularly when the number of samples per task is small. It should be noted that KSMM2 (i.e., the preceding method SOM2) learns appropriately when sufficient samples are provided. Because KSMM2 uses model transfer only, we can also conclude that instance transfer is necessary for the small sample size case. By contrast, model transfer is necessary for manifold alignment, whereby the latent variables are estimated consistently across tasks, representing the same intrinsic feature of the data.

The proposed method projects the given data into the product space of the task latent space and the sample latent space, in which the intrinsic feature of the data content is represented by the sample latent variable, whereas the intrinsic feature of the data style is represented by the task latent variable. For example, in the case of the face image set, ‘content’ refers to the expressed emotions and ‘style’ refers to the individual differences in the manner of expressing emotions. Such a learning paradigm is referred to as content–style disentanglement in the recent literature [46]. Therefore, the proposed method not only performs multi-task manifold learning under the small sample size condition, but also performs content–style disentanglement under the condition that the data are observed only partially for each style. Because recent studies on content–style disentanglement are built on the autoencoder (AE) platform [47], which typically requires a larger number of samples, applying the proposed information transfers to an AE would be a challenging theme for future work.

Acknowledgements

We thank S. Akaho, PhD, from National Institute of Advanced Industrial Science and Technology of Japan, who gave us important advice for this study. This work was supported by JSPS KAKENHI [grant number 18K11472, 21K12061, 20K19865]; and ZOZO Technologies Inc. We thank Maxine Garcia, PhD, from Edanz (https://jp.edanz.com/ac) for editing a draft of this manuscript.

References

  • [1] X. Huo, X. Ni, A. K. Smith, A survey of manifold-based learning methods, in: T. W. Liao, E. Triantaphyllou (Eds.), In Recent Advances in Data Mining of Enterprise Data, World Scientific, 2007, Ch. 15, pp. 691–745.
  • [2] R. Pless, R. Souvenir, A survey of manifold learning for images, IPSJ Transactions on Computer Vision and Applications 1 (2009) 83–94. doi:10.2197/ipsjtcva.1.83.
  • [3] S.-J. Wang, H.-L. Chen, X.-J. Peng, C.-G. Zhou, Exponential locality preserving projections for small sample size problem, Neurocomputing 74 (17) (2011) 3654–3662. doi:10.1016/j.neucom.2011.07.007.
  • [4] C. Shan, S. Gong, P. McOwan, Appearance manifold of facial expression, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 3766 LNCS (2005) 221–230. doi:10.1007/11573425\_22.
  • [5] Y. Chang, C. Hu, R. Feris, M. Turk, Manifold based analysis of facial expression, Image and Vision Computing 24 (6) (2006) 605–614. doi:10.1016/j.imavis.2005.08.006.
  • [6] T. Hospedales, A. Antoniou, P. Micaelli, A. Storkey, Meta-learning in neural networks: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021). doi:10.1109/TPAMI.2021.3079209.
  • [7] R. Caruana, Multitask learning, Machine Learning 28 (1) (1997) 41–75. doi:10.1023/A:1007379606734.
  • [8] Y. Zhang, Q. Yang, An overview of multi-task learning, National Science Review 5 (1) (2018) 30–43. doi:10.1093/nsr/nwx105.
  • [9] Z. Zhang, J. Zhou, Multi-task clustering via domain adaptation, Pattern Recognition 45 (1) (2012) 465–473. doi:10.1016/j.patcog.2011.05.011.
  • [10] X.-L. Zhang, Convex discriminative multitask clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (1) (2015) 28–40. doi:10.1109/TPAMI.2014.2343221.
  • [11] X. Zhang, X. Zhang, H. Liu, X. Liu, Multi-task clustering through instances transfer, Neurocomputing 251 (2017) 145–155. doi:10.1016/j.neucom.2017.04.029.
  • [12] X. Zhang, X. Zhang, H. Liu, X. Liu, Partially related multi-task clustering, IEEE Transactions on Knowledge and Data Engineering 30 (12) (2018) 2367–2380. doi:10.1109/TKDE.2018.2818705.
  • [13] I. Yamane, F. Yger, M. Berar, M. Sugiyama, Multitask principal component analysis, Journal of Machine Learning Research 63 (2016) 302–317.
  • [14] T. Furukawa, SOM of SOMs, Neural Networks 22 (4) (2009) 463–478. doi:10.1016/j.neunet.2009.01.012.
  • [15] T. Furukawa, SOM of SOMs: Self-organizing map which maps a group of self-organizing maps, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 3696 LNCS (2005) 391–396. doi:10.1007/11550822\_61.
  • [16] R. Dedrick, J. Ferron, M. Hess, K. Hogarty, J. Kromrey, T. Lang, J. Niles, R. Lee, Multilevel modeling: A review of methodological issues and applications, Review of Educational Research 79 (1) (2009) 69–102. doi:10.3102/0034654308325581.
  • [17] A. Zweig, D. Weinshall, Hierarchical regularization cascade for joint learning, no. PART 2, 2013, pp. 1074–1082.
  • [18] L. Han, Y. Zhang, Learning multi-level task groups in multi-task learning, Vol. 4, 2015, pp. 2638–2644.
  • [19] L. Han, Y. Zhang, Learning tree structure in multi-task learning, Vol. 2015-August, 2015, pp. 397–406. doi:10.1145/2783258.2783393.
  • [20] J. Jiang, L. Zhang, T. Furukawa, A class density approximation neural network for improving the generalization of fisherface, Neurocomputing 71 (16-18) (2008) 3239–3246. doi:10.1016/j.neucom.2008.04.042.
  • [21] T. Ohkubo, K. Tokunaga, T. Furukawa, RBF×\timesSOM: An efficient algorithm for large-scale multi-system learning, IEICE Transactions on Information and Systems E92-D (7) (2009) 1388–1396. doi:10.1587/transinf.E92.D.1388.
  • [22] T. Ohkubo, T. Furukawa, K. Tokunaga, Requirements for the learning of multiple dynamics, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6731 LNCS (2011) 101–110. doi:10.1007/978-3-642-21566-7\_10.
  • [23] S. Yakushiji, T. Furukawa, Shape space estimation by SOM2, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7063 LNCS (PART 2) (2011) 618–627. doi:10.1007/978-3-642-24958-7\_72.
  • [24] S. Yakushiji, T. Furukawa, Shape space estimation by higher-rank of SOM, Neural Computing and Applications 22 (7-8) (2013) 1267–1277. doi:10.1007/s00521-012-1004-4.
  • [25] H. Ishibashi, R. Shinriki, H. Isogai, T. Furukawa, Multilevel–multigroup analysis using a hierarchical tensor SOM network, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9949 LNCS (2016) 459–466. doi:10.1007/978-3-319-46675-0\_50.
  • [26] H. Ishibashi, T. Furukawa, Hierarchical tensor SOM network for multilevel-multigroup analysis, Neural Processing Letters 47 (3) (2018) 1011–1025. doi:10.1007/s11063-017-9643-1.
  • [27] N. Lawrence, Probabilistic non-linear principal component analysis with gaussian process latent variable models, Journal of Machine Learning Research 6 (2005).
  • [28] K. Bunte, M. Biehl, B. Hammer, A general framework for dimensionality-reducing data visualization mapping, Neural Computation 24 (3) (2012) 771–804. doi:10.1162/NECO\_a\_00250.
  • [29] C. Bishop, M. Svensén, C. Williams, GTM: The generative topographic mapping, Neural Computation 10 (1) (1998) 215–234. doi:10.1162/089976698300017953.
  • [30] N. Lawrence, Gaussian process latent variable models for visualisation of high dimensional data, 2004.
  • [31] P. Meinicke, S. Klanke, R. Memisevic, H. Ritter, Principal surfaces from unsupervised kernel regression, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (9) (2005) 1379–1391. doi:10.1109/TPAMI.2005.183.
  • [32] T. Kohonen, Self-organized formation of topologically correct feature maps, Biological Cybernetics 43 (1) (1982) 59–69. doi:10.1007/BF00337288.
  • [33] C. Wang, S. Mahadevan, Manifold alignment without correspondence, 2009, pp. 1273–1278.
  • [34] T. Daouda, R. Chhaibi, P. Tossou, A.-C. Villani, Geodesics in fibered latent spaces: A geometric approach to learning correspondences between conditions (2020). arXiv:2005.07852.
  • [35] S. Luttrell, Self-organisation: A derivation from first principles of a class of learning algorithms, 1989, pp. 495–498. doi:10.1109/ijcnn.1989.118288.
  • [36] Y. Cheng, Convergence and ordering of kohonen’s batch map, Neural Computation 9 (8) (1997) 1667–1676. doi:10.1162/neco.1997.9.8.1667.
  • [37] T. Graepel, M. Burger, K. Obermayer, Self-organizing maps: Generalizations and new optimization techniques, Neurocomputing 21 (1-3) (1998) 173–190. doi:10.1016/S0925-2312(98)00035-6.
  • [38] T. Heskes, Self-organizing maps, vector quantization, and mixture modeling, IEEE Transactions on Neural Networks 12 (6) (2001) 1299–1305. doi:10.1109/72.963766.
  • [39] J. Verbeek, N. Vlassis, B. Kröse, Self-organizing mixture models, Neurocomputing 63 (SPEC. ISS.) (2005) 99–123. doi:10.1016/j.neucom.2004.04.008.
  • [40] R. Durbin, D. Willshaw, An analogue approach to the travelling salesman problem using an elastic net method, Nature 326 (6114) (1987) 689–691. doi:10.1038/326689a0.
  • [41] A. Utsugi, Hyperparameter selection for self-organizing maps, Neural Computation 9 (3) (1997) 623–635. doi:10.1162/neco.1997.9.3.623.
  • [42] T. Kanade, J. Cohn, Y. Tian, Comprehensive database for facial expression analysis, 2000, pp. 46–53. doi:10.1109/AFGR.2000.840611.
  • [43] P. Lucey, J. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression, 2010, pp. 94–101. doi:10.1109/CVPRW.2010.5543262.
  • [44] S.-i. Amari, Information geometry of the em and em algorithms for neural networks, Neural Networks 8 (9) (1995) 1379–1408. doi:10.1016/0893-6080(95)00003-8.
  • [45] K. Takano, H. Hino, S. Akaho, N. Murata, Nonparametric e-mixture estimation, Neural Computation 28 (12) (2016) 2687–2725. doi:10.1162/NECO\_a\_00888.
  • [46] H. Kazemi, S. Iranmanesh, N. Nasrabadi, Style and content disentanglement in generative adversarial networks, 2019, pp. 848–856. doi:10.1109/WACV.2019.00095.
  • [47] J. Na, W. Hwang, Representation learning for style and content disentanglement with autoencoders, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12047 LNCS (2020) 41–51. doi:10.1007/978-3-030-41299-9\_4.
Table 2: Symbol list used in this paper
For problem formulation (single task case)
{\cal V}\equiv\mathbb{R}^{D_{\cal V}}   High-dimensional visible data space with dimension D_{\cal V}
{\cal L}\subseteq\mathbb{R}^{D_{\cal L}}   Low-dimensional latent space with dimension D_{\cal L}
X=\{\mathbf{x}_{n}\}_{n=1}^{N}   Observed dataset containing N samples, where \mathbf{x}_{n}\in{\cal V}
Z=\{\mathbf{z}_{n}\}_{n=1}^{N}   Latent variable set of samples, where \mathbf{z}_{n}\in{\cal L}
{\cal X}\subseteq{\cal V}   Manifold representing the data distribution
f(\mathbf{z})\equiv\pi^{-1}(\mathbf{z})   Embedding from {\cal L} to {\cal X}, referred to as the model
p(\mathbf{z})   Prior of \mathbf{z}\in{\cal L}
q(\mathbf{x},\mathbf{z}\mid f)   Probabilistic generative model of \mathbf{x} and \mathbf{z}
\beta   Inverse variance (precision) of the observation noise
For problem formulation (multi-task case)
X_{i}=\{\mathbf{x}_{ij}\}_{j=1}^{J_{i}}   Dataset of the i-th task containing J_{i} samples
X=\{\mathbf{x}_{n}\}_{n=1}^{N}   Entire dataset of I tasks, where N=\sum_{i}J_{i}
\mathbf{X}=\big(\mathbf{x}_{n}^{T}\big)   N\times D_{\cal V} design matrix consisting of the entire dataset of I tasks
I   Number of tasks
i_{n}   Task to which sample \mathbf{x}_{n} belongs
{\cal N}_{i}=\{\mathbf{x}_{n}\mid i_{n}=i\}   Sample set belonging to task i
{\cal X}_{i}   Manifold of task i
f_{i}\in{\cal H}   Embedding from {\cal L} to {\cal X}_{i}, referred to as the task model
{\cal H}   Function space (RKHS) consisting of embeddings from {\cal L} to {\cal V}
{\cal T}\subseteq\mathbb{R}^{D_{\cal T}}   Low-dimensional latent space for tasks with dimension D_{\cal T}
{\cal Y}\subseteq{\cal H}   Task manifold embedded into {\cal H}
U=\{\mathbf{u}_{i}\}_{i=1}^{I}   Latent variable set for tasks, where \mathbf{u}_{i}\in{\cal T}
g(\mathbf{u})   Embedding from {\cal T} to {\cal Y}
G(\mathbf{z},\mathbf{u})\equiv\left[g(\mathbf{u})\right](\mathbf{z})   Embedding from {\cal L}\times{\cal T} to {\cal V}, referred to as the general model
p(\mathbf{u})   Prior of \mathbf{u}\in{\cal T}
q_{i}(\mathbf{x},\mathbf{z}\mid f_{i})   Probabilistic generative model of task i
q(\mathbf{x},\mathbf{z},\mathbf{u}\mid G)   Probabilistic generative model of \mathbf{x}, \mathbf{z}, and \mathbf{u}
For KSMM (single task case)
h_{\cal L}(\mathbf{z}\mid\mathbf{z}^{\prime})\equiv{\cal N}(\mathbf{z}\mid\mathbf{z}^{\prime},\lambda^{2}_{\cal L}\mathbf{I})   Non-negative smoothing kernel with length constant \lambda_{\cal L}
\hat{Z}=\{\hat{\mathbf{z}}_{n}\}_{n=1}^{N}   Tentative estimators of the latent variables Z=\{\mathbf{z}_{n}\}
\hat{f}   Tentative estimator of the mapping f
p(\mathbf{x},\mathbf{z}\mid X,Z)   Joint empirical distribution of \mathbf{x} and \mathbf{z} when X and Z are given
{\boldsymbol{\varphi}}(\mathbf{z})=(\varphi_{l}(\mathbf{z}))   Basis functions for a parametric representation
\mathbf{V}   Coefficient matrix to represent f parametrically
Multi-task case (MT-KSMM)
\rho(\mathbf{u}_{i},\mathbf{u}_{i^{\prime}})   Function that determines the weight of the instance transfer from task i^{\prime} to i
\lambda_{\rho}   Length constant of \rho
\tilde{X}_{i}=\{(\mathbf{x}_{n},\rho_{in})\}   Merged dataset of task i obtained by instance transfer; a weighted set in which \rho_{in} denotes the weight of \mathbf{x}_{n}
\tilde{Z}_{i}=\{(\mathbf{z}_{n},\rho_{in})\}   Merged latent variable set obtained by instance transfer
\tilde{J}_{i}\equiv\sum_{n}\rho_{in}   Number of merged samples of task i
\tilde{f}_{i}(\mathbf{z})   Task model f_{i} after model transfer is executed
\tilde{p}_{i}(\mathbf{x},\mathbf{z}\mid\tilde{X}_{i},\tilde{Z}_{i})   Joint empirical distribution after instance transfer is executed
\tilde{q}_{i}(\mathbf{x},\mathbf{z}\mid\tilde{f}_{i})   Generative model of the i-th task after model transfer is executed
h_{\cal T}(\mathbf{u},\mathbf{u}^{\prime})\equiv{\cal N}(\mathbf{u}\mid\mathbf{u}^{\prime},\lambda^{2}_{\cal T}\mathbf{I})   Smoothing kernel for the higher-KSMM with length constant \lambda_{\cal T}
{\boldsymbol{\psi}}(\mathbf{u})=(\psi_{k}(\mathbf{u}))   Basis functions for the higher-KSMM
\mathbf{V}_{i}   Coefficient matrix to represent the task model f_{i} parametrically
\underline{\mathbf{V}}=(\mathbf{V}_{i})   Coefficient tensor consisting of \{\mathbf{V}_{i}\}
\underline{\mathbf{W}}   Coefficient tensor to represent the general model G parametrically

Appendix A Symbols and notation

In this paper, scalars and functions are usually written in italics (e.g., \lambda, f, and G). Indices and their upper limits are written in lowercase and uppercase italics, respectively (e.g., n and N). Sets are also written in uppercase italics (e.g., X and Z). Vectors and matrices are written in lowercase and uppercase boldface, respectively (e.g., \mathbf{x} and \mathbf{X}), whereas tensors are written in uppercase boldface and underlined (e.g., \underline{\mathbf{W}}). Continuous spaces and manifolds are written in cursive script (e.g., {\cal V} and {\cal X}).

The list of symbols used in this paper is shown in Table 2.

Appendix B Details of the equation derivation

B.1 Derivation of the KSMM algorithm

The cost function of KSMM is given by (3). First, we show that this cost function equals the cross-entropy between (4) and (5) up to a constant. The cross-entropy is given by

H[p(\mathbf{x},\mathbf{z}),q(\mathbf{x},\mathbf{z})]
  = -\iint p(\mathbf{x},\mathbf{z})\log q(\mathbf{x},\mathbf{z})\,d\mathbf{x}\,d\mathbf{z}
  = -\frac{1}{N}\sum_{n}\iint{\cal N}\left(\mathbf{x}\,\middle|\,\mathbf{x}_{n},\beta^{-1}\mathbf{I}\right)h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n})\log\left[{\cal N}\left(\mathbf{x}\,\middle|\,f(\mathbf{z}),\beta^{-1}\mathbf{I}\right)p(\mathbf{z})\right]d\mathbf{x}\,d\mathbf{z}.

Note that, in this study, we assume that the prior p(𝐳)p(\mathbf{z}) is a uniform distribution. Thus,

H[p(\mathbf{x},\mathbf{z}),q(\mathbf{x},\mathbf{z})]
  = -\frac{1}{N}\sum_{n}\iint{\cal N}\left(\mathbf{x}\,\middle|\,\mathbf{x}_{n},\beta^{-1}\mathbf{I}\right)h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n})\left[-\frac{\beta}{2}\left\|\mathbf{x}-f(\mathbf{z})\right\|^{2}+\frac{D_{\cal V}}{2}\log\frac{\beta}{2\pi}\right]d\mathbf{x}\,d\mathbf{z}
  = \frac{\beta}{2N}\sum_{n}\int h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n})\left\|\mathbf{x}_{n}-f(\mathbf{z})\right\|^{2}\,d\mathbf{z}+\mathit{const},

and we obtain the cost function (3). Here we used the expectation of the squared error under the normal distribution:

\int_{\cal V}{\cal N}\left(\mathbf{x}\,\middle|\,{\boldsymbol{\mu}},\beta^{-1}\mathbf{I}\right)\left\|\mathbf{x}-f(\mathbf{z})\right\|^{2}\,d\mathbf{x}=\left\|{\boldsymbol{\mu}}-f(\mathbf{z})\right\|^{2}+\frac{D_{\cal V}}{\beta}.

To estimate the latent variables, we need to minimize the cost function (3) with respect to each \mathbf{z}_{n} as in (6). Because \hat{f}(\mathbf{z}) is obtained by kernel smoothing, it can be considered sufficiently smooth to be locally approximated as linear. By the first-order Taylor expansion, the squared error \left\|f(\mathbf{z})-\mathbf{x}_{n}\right\|^{2} is approximated as a quadratic function centered on its minimum point. Because h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n})\equiv{\cal N}\left(\mathbf{z}\,\middle|\,\mathbf{z}_{n},\lambda_{\cal L}^{2}\mathbf{I}\right) is symmetric around \mathbf{z}_{n}, the minimum point of (6) equals the minimum point of (7).

By contrast, to estimate the mapping f, we need to optimize (3) using the calculus of variations. Because the functional derivative of E is given by

\frac{\delta E}{\delta f(\mathbf{z})}=\frac{\beta}{N}\sum_{n}h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n})\left(f(\mathbf{z})-\mathbf{x}_{n}\right),

the stationary point of ff should satisfy

\sum_{n}h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n})\,f(\mathbf{z})=\sum_{n}h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n})\,\mathbf{x}_{n}.

Thus, the optimal ff is determined as

f(\mathbf{z})=\frac{\sum_{n}h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n})\,\mathbf{x}_{n}}{\sum_{n^{\prime}}h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n^{\prime}})},

and we obtain (9).

B.2 Derivation of the MT-KSMM algorithm

The extension from KSMM to MT-KSMM is straightforward: the cost function of the lower-KSMM (15) is an extension of (3) to a weighted dataset. Thus, we obtain (19) by modifying (7), where the data weights are given by \rho_{in}. Similarly, we obtain (17) by extending (9) to account for \rho_{in}.

The cost function of the higher-KSMM (16) looks the same as (3). The difference is that the squared norm in (3) is defined in the ordinary vector space {\cal V}, whereas the squared norm in (16) is defined in the RKHS {\cal H}. Similarly, we obtain (19) and (18).

B.3 Derivation of the parametric representation of KSMM

Using the orthonormal basis set {φ1,,φL}\{\varphi_{1},\dots,\varphi_{L}\}, we represent the mapping ff parametrically as

f(\mathbf{z}\mid\mathbf{V})=\sum_{l}\varphi_{l}(\mathbf{z})\,\mathbf{v}_{l}=\mathbf{V}^{\mathsf{T}}{\boldsymbol{\varphi}}(\mathbf{z}),

where 𝝋(𝐳)=(φ1(𝐳),,φL(𝐳))𝖳{\boldsymbol{\varphi}}(\mathbf{z})=\left(\varphi_{1}(\mathbf{z}),\dots,\varphi_{L}(\mathbf{z})\right)^{\mathsf{T}}. In this case, the cost function (3) becomes

E=\frac{\beta}{2N}\sum_{n}\int h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n})\left\|\mathbf{V}^{\mathsf{T}}{\boldsymbol{\varphi}}(\mathbf{z})-\mathbf{x}_{n}\right\|^{2}\,d\mathbf{z}.

If we differentiate EE with respect to 𝐕\mathbf{V}, then

\frac{\partial E}{\partial\mathbf{V}}=\frac{\beta}{N}\sum_{n}\int h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n})\,{\boldsymbol{\varphi}}(\mathbf{z})\left(\mathbf{V}^{\mathsf{T}}{\boldsymbol{\varphi}}(\mathbf{z})-\mathbf{x}_{n}\right)^{\mathsf{T}}d\mathbf{z}.

Because the stationary point of 𝐕\mathbf{V} should satisfy E/𝐕=0\partial E/\partial\mathbf{V}=0, we have the following equation:

\sum_{n}\int h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n})\,{\boldsymbol{\varphi}}(\mathbf{z}){\boldsymbol{\varphi}}^{\mathsf{T}}(\mathbf{z})\,\mathbf{V}\,d\mathbf{z} (41)
=\sum_{n}\int h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n})\,{\boldsymbol{\varphi}}(\mathbf{z})\,\mathbf{x}_{n}^{\mathsf{T}}\,d\mathbf{z}. (42)

The left-hand side (41) becomes

\left[\int\left(\sum_{n}h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n})\right){\boldsymbol{\varphi}}(\mathbf{z}){\boldsymbol{\varphi}}^{\mathsf{T}}(\mathbf{z})\,d\mathbf{z}\right]\mathbf{V}=\left[\int\overline{h}_{\cal L}(\mathbf{z})\,{\boldsymbol{\varphi}}(\mathbf{z}){\boldsymbol{\varphi}}^{\mathsf{T}}(\mathbf{z})\,d\mathbf{z}\right]\mathbf{V}\eqqcolon\mathbf{A}\mathbf{V},

where \overline{h}_{\cal L}(\mathbf{z})=\sum_{n}h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{n}) and \mathbf{A} is an L\times L matrix. By contrast, the right-hand side (42) becomes

\int{\boldsymbol{\varphi}}(\mathbf{z})\left(h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{1}),\dots,h_{\cal L}(\mathbf{z}\mid\mathbf{z}_{N})\right)\begin{pmatrix}\mathbf{x}_{1}^{\mathsf{T}}\\ \vdots\\ \mathbf{x}_{N}^{\mathsf{T}}\end{pmatrix}d\mathbf{z}=\left(\int{\boldsymbol{\varphi}}(\mathbf{z})\,\mathbf{h}_{\cal L}(\mathbf{z})^{\mathsf{T}}\,d\mathbf{z}\right)\mathbf{X}\eqqcolon\mathbf{B}\mathbf{X}.

Because \mathbf{A}\mathbf{V}=\mathbf{B}\mathbf{X} holds, the estimator of \mathbf{V} is given by

\hat{\mathbf{V}}=\mathbf{A}^{-1}\mathbf{B}\mathbf{X}. (43)

Thus, we have (26). In the KSMM algorithm, the tentative estimators of the latent variables \hat{Z}=\{\hat{\mathbf{z}}_{n}\} are used instead of Z=\{\mathbf{z}_{n}\}.
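
As an illustration of (43), the following sketch evaluates \mathbf{A} and \mathbf{B} by simple quadrature on a one-dimensional latent grid, using a Gaussian kernel for h_{\cal L} and a polynomial basis chosen purely for convenience (the stationary-point formula itself does not depend on this choice).

# Parametric KSMM coefficients: A and B via quadrature, then V_hat = A^{-1} B X.
import numpy as np

def fit_coefficients(X, Z, lam=0.2, n_grid=200, degree=5):
    grid = np.linspace(-1.0, 1.0, n_grid)                         # quadrature nodes in L
    dz = grid[1] - grid[0]
    Phi = np.vander(grid, degree + 1, increasing=True)            # phi_l(z), shape (G, L)
    H = np.exp(-0.5 * ((grid[:, None] - Z[None, :]) / lam) ** 2)  # h_L(z | z_n), shape (G, N)
    A = (Phi * H.sum(axis=1, keepdims=True)).T @ Phi * dz         # int h_bar(z) phi phi^T dz
    B = Phi.T @ H * dz                                            # int phi(z) h(z | z_n)^T dz
    V = np.linalg.solve(A, B @ X)                                 # V_hat = A^{-1} B X
    f = lambda z: np.vander(np.atleast_1d(z), degree + 1, increasing=True) @ V  # f(z) = V^T phi(z)
    return V, f

# Example with a 1-D latent space and 2-D observations on a parabola.
Z = np.random.uniform(-1, 1, size=50)
X = np.column_stack([Z, Z ** 2]) + 0.05 * np.random.randn(50, 2)
V, f = fit_coefficients(X, Z)
print(f(0.5))   # roughly follows x = (z, z^2); kernel smoothing adds a small bias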

B.4 Derivation of the parametric representation of MT-KSMM

Similar to the non-parametric case, we obtain the parametric representation of MT-KSMM by a straightforward extension, but a tensor–matrix product is required in the notation.

First, we introduce some tensor operations. Suppose that \mathbf{X}=(X_{mn}) is an M\times N matrix, and let \check{\mathbf{x}}\equiv\mathrm{vec}(\mathbf{X}) be the vector representation of \mathbf{X}; thus, \check{\mathbf{x}} is an (M\times N)-dimensional vector obtained by flattening \mathbf{X}. When we have L such matrices \{\mathbf{X}_{1},\dots,\mathbf{X}_{L}\}, the entire matrix set is represented as a matrix \check{\mathbf{X}}=(\check{\mathbf{x}}_{1},\dots,\check{\mathbf{x}}_{L}). Note that \check{\mathbf{X}} is an L\times(M\times N) matrix, which is also denoted by \underline{\mathbf{X}}=(X_{lmn}) in tensor notation. We further suppose that \check{\mathbf{Y}}=\mathbf{A}\check{\mathbf{X}} is a linear transformation, where \check{\mathbf{Y}} is an I\times(M\times N) matrix and \mathbf{A} is an I\times L matrix. Using the tensor–matrix product, this linear transformation is denoted by \underline{\mathbf{Y}}=\underline{\mathbf{X}}\times_{1}\mathbf{A}, which means Y_{imn}=\sum_{l}A_{il}X_{lmn} in element-wise notation, where \underline{\mathbf{Y}}=(Y_{imn}) is a tensor of order three and size I\times M\times N. Similarly, \underline{\mathbf{Y}}=\underline{\mathbf{X}}\times_{1}\mathbf{A}\times_{2}\mathbf{B} means Y_{ijn}=\sum_{l}\sum_{m}A_{il}B_{jm}X_{lmn}.
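
As a numerical illustration of these mode products (the sizes and tensors below are arbitrary):

# Mode products via einsum, matching the index notation in the text.
import numpy as np

L_, M, N = 4, 3, 5
X_ = np.random.rand(L_, M, N)      # tensor X of size L x M x N
A = np.random.rand(2, L_)          # I x L matrix
B = np.random.rand(6, M)           # J x M matrix

Y1 = np.einsum('il,lmn->imn', A, X_)         # Y = X x_1 A,  Y[i,m,n] = sum_l A[i,l] X[l,m,n]
Y2 = np.einsum('il,jm,lmn->ijn', A, B, X_)   # Y = X x_1 A x_2 B

# The same kind of contraction evaluates the general model at the end of this appendix:
# G_d(z, u) = sum_k sum_l W[k,l,d] * psi_k(u) * phi_l(z)
# G = np.einsum('kld,k,l->d', W, psi_u, phi_z)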

Using the above notation, the parametric representation of MT-KSMM becomes as follows: for the lower-KSMMs, we estimate the coefficient matrix \mathbf{V}_{i} as \mathbf{V}_{i}=\mathbf{A}_{i}^{-1}\mathbf{B}\mathbf{X}_{i} for each task i; using this, the task model is represented as f_{i}(\mathbf{z})=\mathbf{V}_{i}^{\mathsf{T}}{\boldsymbol{\varphi}}(\mathbf{z}). Consequently, the entire set of coefficient matrices becomes a tensor \underline{\mathbf{V}}=(\mathbf{V}_{i}) of order three and size I\times L\times D_{\cal V}. By flattening \mathbf{V}_{i} using \check{\mathbf{v}}_{i}=\mathrm{vec}(\mathbf{V}_{i}), \underline{\mathbf{V}} can also be represented by a matrix \check{\mathbf{V}}=(\check{\mathbf{v}}_{i}). Because we use an orthonormal basis set \{\varphi_{l}\}, the metric in the function space {\cal H} equals the Euclidean metric in the ordinary vector space of the coefficient matrices. Thus, we consider that \{\check{\mathbf{v}}_{i}\} are equivalent to \{f_{i}\}, including the metric.

The higher-KSMM estimates g(\mathbf{u}) by regarding the task model set \{f_{i}\} as a dataset. Thus, in the parametric representation, the aim of the higher-KSMM is to estimate a function \mathbf{V}(\mathbf{u})=g(\mathbf{u}), where \mathbf{u} is the latent variable of tasks. By regarding \check{\mathbf{V}} as a data matrix and applying (43), we have

\check{\mathbf{W}}=\mathbf{C}^{-1}\mathbf{D}\check{\mathbf{V}}, (44)

where \mathbf{C} and \mathbf{D} are defined in (33) and (34), respectively. Using tensor–matrix product notation, (44) is denoted by \underline{\mathbf{W}}=\underline{\mathbf{V}}\times_{1}(\mathbf{C}^{-1}\mathbf{D}).

Finally, the general model G(\mathbf{z},\mathbf{u}\mid\underline{\mathbf{W}}) is represented as follows. Note that

G(\mathbf{z},\mathbf{u})=\mathbf{V}(\mathbf{u})\,{\boldsymbol{\varphi}}(\mathbf{z}), (45)
\check{\mathbf{v}}(\mathbf{u})=\check{\mathbf{W}}\,{\boldsymbol{\psi}}(\mathbf{u}). (46)

Using tensor–vector product notation, (46) is denoted by \mathbf{V}(\mathbf{u})=\underline{\mathbf{W}}\times_{1}{\boldsymbol{\psi}}(\mathbf{u}), which means V_{ld}(\mathbf{u})=\sum_{k}W_{kld}\,\psi_{k}(\mathbf{u}). Then (45) becomes (31) as follows:

G(\mathbf{z},\mathbf{u})=\underline{\mathbf{W}}\times_{1}{\boldsymbol{\psi}}(\mathbf{u})\times_{2}{\boldsymbol{\varphi}}(\mathbf{z}),

which means G_{d}(\mathbf{z},\mathbf{u})=\sum_{k}\sum_{l}W_{kld}\,\psi_{k}(\mathbf{u})\,\varphi_{l}(\mathbf{z}), where G_{d} denotes the d-th entry of the D_{\cal V}-dimensional vector G.