A Meta-Learning Approach for Graph Representation Learning in Multi-Task Settings

Davide Buffelli
Department of Information Engineering
University of Padova
Padova, Italy
Fabio Vandin
Department of Information Engineering
University of Padova
Padova, Italy

Abstract

Graph Neural Networks (GNNs) are a framework for graph representation learning, where a model learns to generate low dimensional node embeddings that encapsulate structural and feature-related information. GNNs are usually trained in an end-to-end fashion, leading to highly specialized node embeddings. However, generating node embeddings that can be used to perform multiple tasks (with performance comparable to single-task models) is an open problem. We propose a novel meta-learning strategy capable of producing multi-task node embeddings. Our method avoids the difficulties arising when learning to perform multiple tasks concurrently by, instead, learning to quickly (i.e. with a few steps of gradient descent) adapt to multiple tasks singularly. We show that the embeddings produced by our method can be used to perform multiple tasks with comparable or higher performance than classically trained models. Our method is model-agnostic and task-agnostic, thus applicable to a wide variety of multi-task domains.

1 Introduction

Figure 1: Performance drop when transferring node embeddings between (a) Node Classification (NC), (b) Graph Classification (GC), and (c) Link Prediction (LP) on the ENZYMES dataset. “x ->y” indicates that the embeddings obtained from a model trained on task x are used for task y.

Graph Neural Networks (GNNs) are deep learning models that operate on graph-structured data with strong empirical performance, and are a very active area of research. Three tasks in particular have received the most attention: graph classification, node classification, and link prediction. GNNs are centered around the concept of node representation learning, and typically follow the same architectural pattern with an encoder-decoder structure [13, 5, 37]. The encoder produces node embeddings (low-dimensional vectors capturing structural and feature-related information about each node), while the decoder uses the embeddings to carry out the desired downstream task. The model is then trained in an end-to-end manner, giving rise to highly specialized node embeddings. In fact, taking the embeddings from a trained GNN and using them to train a decoder for a different task leads to substantial performance loss (see Figure 1).

The low transferability of node embeddings requires the use of task-specific encoders and decoders. However, many practical machine learning applications operate in resource-constrained environments where being able to share parameters between tasks is of great importance. Learning models that perform multiple tasks is known as Multi-Task Learning (MTL), and is an open area of research [35].

We show that training a multi-head model with the classical procedure, i.e. by performing multiple tasks concurrently on each graph and updating the parameters with some form of gradient descent to minimize the sum of the single-task losses, can lead to performance loss with respect to single-task models. We then propose a novel optimization-based meta-learning [10] procedure that can generate node embeddings that generalize across tasks. Our meta-learning procedure does not aim at a setting of the parameters that can perform multiple tasks concurrently (like a classical method would do), or at a setting that allows fast multi-task adaptation (like traditional meta-learning), but at a setting that can easily be adapted to perform each of the tasks singularly. In fact, our procedure aims at a setting of the parameters where a few steps of gradient descent on a given task can lead to good performance on that task, hence removing the burden of learning to solve multiple tasks concurrently.

We summarize our contributions as follows:

• We propose a novel meta-learning strategy for multi-task representation learning. We apply it on graph MTL, and show that a GNN trained with our method produces higher quality node embeddings with respect to classical training procedures. Our method is model-agnostic and task-agnostic, thus easily applicable to a wide range of multi-task domains.

• To the best of our knowledge, we are the first to propose a GNN model generating a single set of node embeddings that can be used to perform the three most common graph-related tasks. In fact, our embeddings lead to comparable or higher performance with respect to single-task models even when used as input to a simple linear classifier.

• We show that the episodic training strategy in our meta-learning procedure leads to better node embeddings even for single-task models. We believe this finding provides interesting directions for future work on connections between meta-learning and representation learning.

2 Related Work

GNNs, MTL, and meta-learning are very active areas of research. We highlight works that are at the intersections of these subjects, and point the interested reader to comprehensive reviews of each field.

Graph Neural Networks. GNNs have a long history [32], but in the past few years the field has grown exponentially. Seminal works include ChebNet [7], GCN [20], GAT [36], and GIN [38]. For a thorough review of the field we refer the reader to Chami et al. [5] and Wu et al. [37].

Multi-Task Learning. Works at the intersection of MTL and GNNs have focused on multi-head architectures for several applications [26, 15, 38, 2, 21], but no single model has been proposed for the three most common tasks on graphs. Other works use GNNs as a tool for MTL: Liu et al. [23] use GNNs to allow communication between tasks, while Zhang et al. [40] use GNNs to estimate the test error of a MTL model. For an exhaustive review of deep MTL we refer to Vandenhende et al. [35].

Meta-Learning. Meta-learning has attracted considerable attention (see the review by Hospedales et al. [16]), especially in the area of few-shot learning. Some works use GNNs directly for few-shot learning [11], others as a tool for enhancing meta-learning [22, 34], and others use meta-learning to train GNNs in few-shot learning scenarios for graph-related problems [41, 39, 18, 1, 4, 28]. Other works combining meta-learning and GNNs involve adversarial attacks [42] and active learning [24].

3 Preliminaries

3.1 Graph Neural Networks

Many GNNs follow the message-passing paradigm [12]. Let us represent a graph $\mathcal{G}=(\mathbf{A},\mathbf{X})$ with an adjacency matrix $\mathbf{A}\in\{0,1\}^{n\times n}$ , and a node feature matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$ , where the $v$ -th row $\mathbf{X}_{v}$ represents the $d$ dimensional feature vector of node $v$ . Let $\mathbf{H}^{(\ell)}\in\mathbb{R}^{n\times d^{\prime}}$ be the node representation matrix at layer $\ell$ . A message passing layer updates the representation of every node $v$ as follows:

\text{msg}^{(\ell)}_{v}=\text{AGGREGATE}(\{\mathbf{H}_{u}^{(\ell)}\;\forall u\in\mathcal{N}_{v}\}),\quad\mathbf{H}_{v}^{(\ell+1)}=\text{UPDATE}(\mathbf{H}_{v}^{(\ell)},\text{msg}^{(\ell)}_{v})

where $\mathbf{H}^{(0)}=\mathbf{X}$ , $\mathcal{N}_{v}$ is the set of neighbours of node $v$ , AGGREGATE is a permutation invariant function, and UPDATE is usually a neural network. After $L$ message-passing layers, the final node embeddings $\mathbf{H}^{(L)}$ are used to perform a given task, and the network is trained end-to-end.
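As a concrete illustration of the layer defined above, the following minimal sketch implements one message-passing step in plain Python. The choice of an element-wise mean for AGGREGATE and of a linear map followed by a ReLU for UPDATE, as well as the names `message_passing_layer`, `w_self`, and `w_msg`, are assumptions made for illustration; the text only requires AGGREGATE to be permutation invariant and UPDATE to be (usually) a neural network.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def aggregate(messages: List[Vector], dim: int) -> Vector:
    """Permutation-invariant AGGREGATE: element-wise mean (assumed choice)."""
    if not messages:
        return [0.0] * dim
    return [sum(m[k] for m in messages) / len(messages) for k in range(dim)]


def message_passing_layer(h: Matrix, neighbours: Dict[int, List[int]],
                          w_self: Matrix, w_msg: Matrix) -> Matrix:
    """One layer: H_v^(l+1) = ReLU(W_self H_v^(l) + W_msg msg_v^(l))."""
    dim = len(h[0])
    h_next = []
    for v in range(len(h)):
        msg = aggregate([h[u] for u in neighbours[v]], dim)
        pre = [a + b for a, b in zip(matvec(w_self, h[v]), matvec(w_msg, msg))]
        h_next.append([max(0.0, p) for p in pre])  # ReLU inside the assumed UPDATE
    return h_next


# Toy path graph 0 - 1 - 2 with 2-dimensional features; H^(0) = X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
NEIGHBOURS = {0: [1], 1: [0, 2], 2: [1]}
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
H1 = message_passing_layer(X, NEIGHBOURS, IDENTITY, IDENTITY)
```

Stacking $L$ calls of such a function yields $\mathbf{H}^{(L)}$; because the mean ignores the order of its inputs, listing the neighbours of a node in a different order leaves the result unchanged.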

3.2 Model-Agnostic Meta-Learning and ANIL

MAML (Model-Agnostic Meta-Learning) [10] is an optimization-based meta-learning strategy. Let $f_{\theta}$ be a deep learning model, where $\theta$ are its parameters. Let $p(\mathcal{E})$ be a distribution over episodes. (The meta-learning literature usually derives episodes from tasks, i.e. tuples containing a dataset and a loss function; we focus on episodes to avoid using the term task for both an MTL task and a meta-learning task.) An episode $\mathcal{E}_{i}\sim p(\mathcal{E})$ is defined as a tuple containing a loss function $\mathcal{L}_{\mathcal{E}_{i}}$ , a support set $\mathcal{S}_{\mathcal{E}_{i}}$ , and a target set $\mathcal{T}_{\mathcal{E}_{i}}$ : $\mathcal{E}_{i}=(\mathcal{L}_{\mathcal{E}_{i}}(\cdot),\mathcal{S}_{\mathcal{E}_{i}},\mathcal{T}_{\mathcal{E}_{i}})$ (support and target sets are sets of labelled examples). MAML’s goal is to find a value of $\theta$ that can quickly, i.e. in a few steps of gradient descent, be adapted to new episodes. This is done with a nested loop optimization procedure: an inner loop adapts the parameters to the support set of an episode by performing some steps of gradient descent, and an outer loop updates the initial parameters to allow fast adaptation. Formally, let $\theta^{\prime}_{i}(t)$ be the parameters after $t$ adaptation steps on the support set of episode $\mathcal{E}_{i}$ , then the computations in the inner loop are

\theta^{\prime}_{i}(t)=\theta^{\prime}_{i}(t-1)-\alpha\nabla_{\theta^{\prime}_{i}(t-1)}\mathcal{L}_{\mathcal{E}_{i}}(f_{\theta^{\prime}_{i}(t-1)},\mathcal{S}_{\mathcal{E}_{i}}),\text{ with }\theta^{\prime}_{i}(0)=\theta

where $\mathcal{L}_{\mathcal{E}_{i}}(f_{\theta^{\prime}_{i}(t-1)},\mathcal{S}_{\mathcal{E}_{i}})$ indicates the loss over the support set $\mathcal{S}_{\mathcal{E}_{i}}$ of the model with parameters $\theta^{\prime}_{i}(t-1)$ , and $\alpha$ is the learning rate. The meta-objective that the outer loop tries to minimize is defined as $\mathcal{L}_{\text{meta}}=\sum_{\mathcal{E}_{i}\sim p(\mathcal{E})}\mathcal{L}_{\mathcal{E}_{i}}(f_{\theta^{\prime}_{i}(t)},\mathcal{T}_{\mathcal{E}_{i}})$ , which leads to the following parameter update (we limit ourselves to one step of gradient descent for clarity, but any optimization strategy could be used):

\theta=\theta-\beta\nabla_{\theta}\mathcal{L}_{\text{meta}}=\theta-\beta\nabla_{\theta}\sum_{\mathcal{E}_{i}\sim p(\mathcal{E})}\mathcal{L}_{\mathcal{E}_{i}}(f_{\theta^{\prime}_{i}(t)},\mathcal{T}_{\mathcal{E}_{i}}).
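The nested optimization can be made concrete on a deliberately tiny problem. The sketch below is an illustration under stated assumptions, not the setup used in this paper: the model is the scalar map $f_{\theta}(x)=\theta x$, the loss is the mean squared error, the inner loop takes a single step ($t=1$), and the derivative of $\theta^{\prime}$ with respect to $\theta$ is written out by hand instead of being obtained by automatic differentiation. All function names and the toy episodes are hypothetical.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # labelled examples (x, y)
Episode = Tuple[Dataset, Dataset]    # (support set, target set)


def loss(theta: float, data: Dataset) -> float:
    """Mean squared error of the scalar model f_theta(x) = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def grad(theta: float, data: Dataset) -> float:
    """Derivative of the loss above with respect to theta."""
    return sum(2.0 * x * (theta * x - y) for x, y in data) / len(data)


def inner_step(theta: float, support: Dataset, alpha: float) -> float:
    """Inner loop with t = 1: theta' = theta - alpha * grad on the support set."""
    return theta - alpha * grad(theta, support)


def meta_grad(theta: float, support: Dataset, target: Dataset, alpha: float) -> float:
    """Gradient of the target loss at theta' with respect to the initial theta."""
    theta_prime = inner_step(theta, support, alpha)
    # Chain rule through the inner step: d theta' / d theta = 1 - alpha * mean(2 x^2).
    dprime_dtheta = 1.0 - alpha * sum(2.0 * x * x for x, _ in support) / len(support)
    return grad(theta_prime, target) * dprime_dtheta


def meta_loss(theta: float, episodes: List[Episode], alpha: float = 0.1) -> float:
    """Meta-objective: sum over episodes of the target loss after adaptation."""
    return sum(loss(inner_step(theta, s, alpha), t) for s, t in episodes)


def maml(theta: float, episodes: List[Episode], alpha: float = 0.1,
         beta: float = 0.05, epochs: int = 100) -> float:
    """Outer loop: plain gradient descent on the meta-objective."""
    for _ in range(epochs):
        theta -= beta * sum(meta_grad(theta, s, t, alpha) for s, t in episodes)
    return theta


def make_episode(slope: float) -> Episode:
    """Toy episode: noiseless samples of y = slope * x."""
    return [(1.0, slope), (2.0, 2.0 * slope)], [(3.0, 3.0 * slope)]


EPISODES = [make_episode(s) for s in (1.0, 2.0, 3.0)]
THETA0 = 0.0
THETA_META = maml(THETA0, EPISODES)
```

On these three episodes the outer loop settles on an initialization between the episode-specific slopes, from which a single inner step moves towards whichever episode is presented. For deep models the same chain-rule term involves second derivatives and is normally left to an automatic differentiation library.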

Raghu et al. [31] showed that feature reuse is the dominant factor in MAML: in the adaptation loop, only the last layer(s) in the network are updated, while the first layer(s) remain almost unchanged. The authors then propose ANIL (Almost No Inner Loop) where they split the parameters in two sets: one that is used for adaptation in the inner loop, and one that is only updated in the outer loop. This simplification leads to computational improvements while maintaining performance.

4 Our Method

Our novel representation learning technique, based on meta-learning, is built on three insights:

(i) optimization-based meta-learning is implicitly learning robust representations. The findings by Raghu et al. [31] suggest that in a model trained with MAML, the first layer(s) learn features that are reusable across episodes, while the last layer(s) are set up for fast adaptation. MAML is then implicitly focusing on learning reusable representations that generalize across episodes.

(ii) meta-learning episodes can be designed to encourage generalization. If we design support and target sets to mimic the training and validation sets of a classical training procedure, then the meta-learning procedure is effectively optimizing for generalization.

(iii) meta-learning can learn to quickly adapt to multiple tasks singularly, without having to learn to solve multiple tasks concurrently. We design the meta-learning procedure so that, for each considered task, the inner loop adapts the parameters to a task-specific support set, and tests the adaptation on a task-specific target set. The outer loop then updates the parameters to allow fast multiple single-task adaptation. This strategy is searching for a parameter setting that can be easily adapted for good single-task performance, without learning to solve multiple tasks concurrently. (See Appendix A for a comparison with classical training and meta-learning strategies.)

Based on (ii) and (iii), we develop a novel meta-learning procedure where the inner loop adapts to multiple tasks singularly, each time with the goal of single-task generalization. Using an encoder-decoder architecture, and episodes that involve adapting to multiple tasks, (i) suggests that this procedure leads to an encoder that learns features that are reusable across episodes (and hence tasks).

Intuition. Training multi-task models is challenging, as tasks may negatively interfere with each other [33]. We design a meta-learning procedure where the learner does not have to find a configuration of the parameters that concurrently performs all tasks, but a configuration that can easily be adapted to perform each of the tasks singularly. Finally, leveraging the robust representation learning that happens with MAML and ANIL, we can extract an encoder generating node representations that generalize across tasks.

We now formally present our novel meta-learning procedure in three steps: (1) Episode Design: how an episode is composed, (2) Model Architecture Design: what the architecture of our model is, (3) Meta-Training Design: how, and which, parameters are adapted/updated.

Figure 2: (a) Multi-task episode: for each task, support and target sets mimic training and validation sets. (b) iSAME: both backbone and task-specific output layers are adapted (one at a time) in the inner loop. (c) eSAME: only task-specific output layers are adapted (one at a time) in the inner loop.

4.1 Episode Design

In our case, an episode becomes a multi-task episode (Figure 2 (a)). Let us consider the case where the tasks are graph classification (GC), node classification (NC), and link prediction (LP). We define a multi-task episode $\mathcal{E}^{(m)}_{i}\sim p(\mathcal{E}^{(m)})$ as a tuple $\mathcal{E}^{(m)}_{i}=(\mathcal{L}^{(m)}_{\mathcal{E}_{i}},\mathcal{S}^{(m)}_{\mathcal{E}_{i}},\mathcal{T}^{(m)}_{\mathcal{E}_{i}})$ , with

\mathcal{L}^{(m)}_{\mathcal{E}_{i}}=\lambda^{(\text{GC})}\mathcal{L}^{(\text{GC})}_{\mathcal{E}_{i}}+\lambda^{(\text{NC})}\mathcal{L}^{(\text{NC})}_{\mathcal{E}_{i}}+\lambda^{(\text{LP})}\mathcal{L}^{(\text{LP})}_{\mathcal{E}_{i}}

\mathcal{S}^{(m)}_{\mathcal{E}_{i}}=\{\mathcal{S}^{(\text{GC})}_{\mathcal{E}_{i}},\mathcal{S}^{(\text{NC})}_{\mathcal{E}_{i}},\mathcal{S}^{(\text{LP})}_{\mathcal{E}_{i}}\},\quad\mathcal{T}^{(m)}_{\mathcal{E}_{i}}=\{\mathcal{T}^{(\text{GC})}_{\mathcal{E}_{i}},\mathcal{T}^{(\text{NC})}_{\mathcal{E}_{i}},\mathcal{T}^{(\text{LP})}_{\mathcal{E}_{i}}\}

where $\lambda^{(\cdot)}$ are balancing coefficients. The meta-objective of our method then becomes:

\mathcal{L}^{(m)}_{\text{meta}}=\sum_{\mathcal{E}^{(m)}_{i}\sim p(\mathcal{E}^{(m)})}\lambda^{(\text{GC})}\mathcal{L}^{(\text{GC})}_{\mathcal{E}_{i}}+\lambda^{(\text{NC})}\mathcal{L}^{(\text{NC})}_{\mathcal{E}_{i}}+\lambda^{(\text{LP})}\mathcal{L}^{(\text{LP})}_{\mathcal{E}_{i}}.

Support and target sets are set up to resemble a training and a validation set. Therefore the outer loop’s objective becomes to maximize the performance on a validation set, given a training set, hence pushing towards generalization (additional details are provided in Appendix B).
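One possible way to hold a multi-task episode in code is sketched below. It is only an illustration of the tuple defined above: the class name `MultiTaskEpisode`, its field names, and the placeholder loss `toy_loss` are hypothetical, and the sketch assumes that each task's loss is evaluated on that task's target set with a model adapted to that task.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

TASKS = ("GC", "NC", "LP")
LossFn = Callable[[Any, List[Any]], float]  # (model, labelled examples) -> loss


@dataclass
class MultiTaskEpisode:
    """A multi-task episode: per-task losses, support sets, and target sets."""
    losses: Dict[str, LossFn]
    support: Dict[str, List[Any]]   # plays the role of a training set
    target: Dict[str, List[Any]]    # plays the role of a validation set
    lambdas: Dict[str, float]       # balancing coefficients

    def meta_objective_term(self, adapted: Dict[str, Any]) -> float:
        """Weighted sum of target-set losses, using one adapted model per task."""
        return sum(
            self.lambdas[t] * self.losses[t](adapted[t], self.target[t])
            for t in TASKS
        )


def toy_loss(model: float, examples: List[Any]) -> float:
    """Placeholder loss for illustration only: model value times set size."""
    return model * len(examples)


EPISODE = MultiTaskEpisode(
    losses={t: toy_loss for t in TASKS},
    support={"GC": ["g1", "g2"], "NC": ["n1", "n2"], "LP": ["e1", "e2"]},
    target={"GC": ["g3"], "NC": ["n3", "n4"], "LP": ["e3", "e4", "e5"]},
    lambdas={"GC": 1.0, "NC": 0.5, "LP": 2.0},
)
TERM = EPISODE.meta_objective_term({t: 1.0 for t in TASKS})
```

Summing `meta_objective_term` over sampled episodes gives the meta-objective; keeping the three support and target sets separate is what lets the inner loop work on one task at a time.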

4.2 Model Architecture Design

We use an encoder-decoder model with a multi-head architecture. The backbone (which represents the encoder) is a 3-layer GCN [20], while the decoder is composed of three heads (one per task) with standard architectures. For additional information we refer the interested reader to Appendix C.

4.3 Meta-Training Design

To avoid the problems arising from training a model that performs multiple tasks concurrently, we design a meta-learning procedure where the inner loop adaptation and the meta-objective computation involve a single task at a time. Only the parameter update performed to minimize the meta-objective involves multiple tasks, but, crucially, it does not aim at a setting of parameters that can solve, or quickly adapt to, multiple tasks concurrently, but to a setting that allows multiple fast single-task adaptation.

Algorithm 1: Proposed Meta-Learning Procedure

Input: Model $f_{\theta}$; Episodes $\mathcal{E}=\{\mathcal{E}_{1},\dots,\mathcal{E}_{n}\}$
init($\theta$)
for $\mathcal{E}_{i}$ in $\mathcal{E}$ do
    o_loss $\leftarrow 0$
    for t in (GC, NC, LP) do
        $\theta^{\prime(\text{t})}\leftarrow\theta$
        $\theta^{\prime(\text{t})}\leftarrow\text{ADAPT}(f_{\theta},\mathcal{S}^{(\text{t})}_{\mathcal{E}_{i}},\mathcal{L}^{(\text{t})}_{\mathcal{E}_{i}})$
        o_loss $\leftarrow$ o_loss $+\,\text{TEST}(f_{\theta^{\prime(\text{t})}},\mathcal{T}^{(\text{t})}_{\mathcal{E}_{i}},\mathcal{L}^{(\text{t})}_{\mathcal{E}_{i}})$
    end for
    $\theta\leftarrow\text{UPDATE}(\theta,\text{o\_loss},\theta^{\prime(\text{GC})},\theta^{\prime(\text{NC})},\theta^{\prime(\text{LP})})$
end for

The pseudocode of our procedure is in Algorithm 1. ADAPT performs a few steps of gradient descent on a task specific loss function and support set, TEST computes the value of a meta-objective component on a task specific loss function and target set, and UPDATE optimizes the parameters by minimizing the meta-objective. Notice how the multiple heads of the decoder in our model are never used concurrently.

Let us partition the parameters $\theta$ of our model into four sets: one representing the backbone ( $\theta_{\text{GCN}}$ ), and one for each head ( $\theta_{\text{NC}},\theta_{\text{GC}},\theta_{\text{LP}}$ ). We name our meta-learning strategy SAME (Single-Task Adaptation for Multi-Task Embeddings), and present two variants (Figure 2 (b)-(c)): implicit SAME (iSAME), and explicit SAME (eSAME). In iSAME all the parameters $\theta$ are used for adaptation. iSAME makes use of the implicit feature-reuse factor of MAML, leading to parameters $\theta_{\text{GCN}}$ that are general across multi-task episodes. In eSAME only the head parameters $\theta_{\text{NC}},\theta_{\text{GC}},\theta_{\text{LP}}$ are used for adaptation. eSAME explicitly aims at parameters $\theta_{\text{GCN}}$ that are general across multi-task episodes by only updating them in the outer loop.
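The difference between the two variants can be sketched on a toy model with one scalar backbone parameter and one scalar head per task, $f(x)=\text{head}\cdot\text{backbone}\cdot x$. Everything in the sketch is an assumption made for illustration: the scalar model and squared loss, the function names, the balancing coefficients fixed to 1, and in particular the outer update, which uses a first-order approximation (the gradient of the target loss is taken at the adapted parameters and applied to the initial ones) rather than differentiating through the inner loop.

```python
from typing import Dict, List, Tuple

Dataset = List[Tuple[float, float]]            # labelled examples (x, y)
TASKS = ("GC", "NC", "LP")


def grads(backbone: float, head: float, data: Dataset) -> Tuple[float, float]:
    """Gradients of the MSE of f(x) = head * backbone * x w.r.t. (backbone, head)."""
    n = len(data)
    err = [head * backbone * x - y for x, y in data]
    g_backbone = sum(2.0 * e * head * x for e, (x, _) in zip(err, data)) / n
    g_head = sum(2.0 * e * backbone * x for e, (x, _) in zip(err, data)) / n
    return g_backbone, g_head


def adapt(backbone: float, head: float, support: Dataset, variant: str,
          alpha: float = 0.05, steps: int = 3) -> Tuple[float, float]:
    """ADAPT: a few gradient steps on the support set of a single task."""
    for _ in range(steps):
        g_b, g_h = grads(backbone, head, support)
        if variant == "iSAME":       # iSAME adapts the backbone as well
            backbone -= alpha * g_b
        head -= alpha * g_h          # both variants adapt the task head
    return backbone, head


def same_outer_step(backbone: float, heads: Dict[str, float],
                    episode: Dict[str, Tuple[Dataset, Dataset]], variant: str,
                    beta: float = 0.01) -> Tuple[float, Dict[str, float]]:
    """One outer update over a multi-task episode (first-order approximation)."""
    g_backbone_total = 0.0
    new_heads = dict(heads)
    for t in TASKS:                                  # one task at a time
        support, target = episode[t]
        b_t, h_t = adapt(backbone, heads[t], support, variant)
        g_b, g_h = grads(b_t, h_t, target)           # TEST on the task's target set
        g_backbone_total += g_b                      # backbone sees every task
        new_heads[t] = heads[t] - beta * g_h         # each head sees only its task
    return backbone - beta * g_backbone_total, new_heads


# Hypothetical toy episode: (support set, target set) per task.
EPISODE = {
    "GC": ([(1.0, 2.0)], [(2.0, 4.0)]),
    "NC": ([(1.0, -1.0)], [(2.0, -2.0)]),
    "LP": ([(1.0, 0.5)], [(2.0, 1.0)]),
}
HEADS = {t: 1.0 for t in TASKS}
B_ESAME, _ = adapt(1.0, 1.0, EPISODE["GC"][0], "eSAME")
B_ISAME, _ = adapt(1.0, 1.0, EPISODE["GC"][0], "iSAME")
NEW_BACKBONE, NEW_HEADS = same_outer_step(1.0, HEADS, EPISODE, "eSAME")
```

In this sketch the eSAME inner loop returns the backbone untouched, so the backbone changes only through the outer update, while the iSAME inner loop moves it; in neither variant are two heads evaluated together.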

5 Experiments

Our goal is to assess the quality of the representations learned by our proposed method by answering four questions (Q1-Q4). Furthermore, by examining the results of the two variants of SAME, we can observe if the explicit strategy applied by eSAME is necessary for obtaining useful features, or if the implicit mechanism of iSAME is enough. We use GC to refer to graph classification, NC for node classification, and LP for link prediction. Unless otherwise stated, accuracy (%) is used for NC and GC, while ROC AUC (%) is used for LP.

Table 1: Results for a single-task model trained in a classical supervised manner (Cl), and a linear classifier trained on the embeddings produced by our meta-learning strategies (iSAME, eSAME).

Task	Model	ENZYMES	PROTEINS	DHFR	COX2
NC	Cl	$87.5\pm 1.9$	$72.3\pm 4.4$	$97.3\pm 0.2$	$96.4\pm 0.3$
	iSAME	$87.3\pm 0.8$	$81.8\pm 1.6$	$96.6\pm 0.3$	$96.1\pm 0.4$
	eSAME	$87.8\pm 0.7$	$82.4\pm 1.6$	$96.8\pm 0.2$	$96.5\pm 0.6$
GC	Cl	$51.6\pm 4.2$	$73.3\pm 3.6$	$71.5\pm 2.3$	$76.7\pm 4.7$
	iSAME	$50.8\pm 2.9$	$73.5\pm 1.2$	$73.2\pm 3.2$	$76.3\pm 4.6$
	eSAME	$52.1\pm 5.0$	$72.6\pm 1.6$	$71.6\pm 2.4$	$75.6\pm 4.1$
LP	Cl	$75.5\pm 3.0$	$85.6\pm 0.8$	$98.8\pm 0.7$	$98.3\pm 0.8$
	iSAME	$81.7\pm 1.7$	$84.0\pm 1.1$	$99.2\pm 0.4$	$99.1\pm 0.5$
	eSAME	$80.1\pm 3.4$	$84.1\pm 0.9$	$99.2\pm 0.3$	$99.2\pm 0.7$

Experimental Setup. We consider datasets from the TUDataset library [27] that allow multi-task settings, and perform a 10-fold cross validation. To ensure a fair comparison, we use the same architecture for all training strategies. For more information we refer to Appendix D.

Q1: Do iSAME and eSAME lead to high quality node embeddings in the single-task setting? For every task, we train a linear classifier on top of the embeddings produced by a model trained using our proposed methods, and compare against a network with the same architecture trained in a classical manner. Results are shown in Table 1. For all tasks, the linear classifier achieves comparable, if not superior, performance to the end-to-end model. In fact, the linear classifier is never outperformed by more than 2%, and it can outperform the classical end-to-end model by up to 12%.
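The evaluation protocol in Q1, training a linear classifier on top of fixed embeddings, can be sketched as follows. This is a minimal illustration and not the experimental code: it assumes a binary task, uses logistic regression trained by full-batch gradient descent, and replaces the encoder output with four hand-written 2-dimensional vectors (`EMB`). The function names and hyperparameters are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def logit(w: Vector, b: float, z: Vector) -> float:
    """Linear score of an embedding z."""
    return sum(wk * zk for wk, zk in zip(w, z)) + b


def train_linear_probe(embeddings: List[Vector], labels: List[int],
                       lr: float = 0.5, steps: int = 200) -> Tuple[Vector, float]:
    """Fit logistic regression on fixed embeddings; the encoder is never updated."""
    dim, n = len(embeddings[0]), len(embeddings)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        g_w, g_b = [0.0] * dim, 0.0
        for z, y in zip(embeddings, labels):
            p = 1.0 / (1.0 + math.exp(-logit(w, b, z)))  # predicted P(y = 1)
            for k in range(dim):
                g_w[k] += (p - y) * z[k] / n             # cross-entropy gradient
            g_b += (p - y) / n
        w = [wk - lr * gk for wk, gk in zip(w, g_w)]
        b -= lr * g_b
    return w, b


def accuracy(w: Vector, b: float, embeddings: List[Vector], labels: List[int]) -> float:
    """Fraction of embeddings whose thresholded score matches the label."""
    preds = [1 if logit(w, b, z) > 0 else 0 for z in embeddings]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


# Stand-in for encoder output: four 2-dimensional embeddings, two per class.
EMB = [[-1.0, -1.0], [-2.0, -1.0], [1.0, 1.0], [2.0, 1.0]]
LABELS = [0, 0, 1, 1]
W, B = train_linear_probe(EMB, LABELS)
ACC = accuracy(W, B, EMB, LABELS)
```

Because only `w` and `b` are trained, the score obtained this way reflects how much task-relevant information is linearly accessible in the embeddings themselves.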

Figure 3: Results for a neural network, trained on the embeddings generated by a multi-task model, performing a task that was not seen by the multi-task model. “x,y -> z” indicates that x,y are the tasks for training the multi-task model, and z is the new task.

Q2: Do iSAME and eSAME lead to high quality node embeddings in the multi-task setting? We train a model with our proposed methods, on all multi-task combinations, and use the embeddings as input for a linear classifier. We compare against single-task models trained in the classical manner, and against a fine-tuning baseline. The latter is a model that has been trained on all three tasks, and then fine-tuned on two specific tasks. The idea is that the initial training on all tasks should lead the model towards the extraction of features that it would otherwise not consider (by only seeing 2 tasks), and the fine-tuning process should then allow the model to use these features to target the specific tasks of interest. Results are shown in Table 2. We notice that the linear classifier achieves comparable performance to the end-to-end models, as it is never outperformed by more than 3%, and in 50% of the cases it actually performs better, confirming the high quality of the node embeddings learned with iSAME and eSAME. We further notice that the fine-tuning baseline severely struggles, and is almost always outperformed by both single-task models and our proposed methods. These results indicate that the episodic meta-learning procedure adopted by SAME is extracting features that are otherwise not accessible with standard training techniques.

Table 2: Results for a single-task model trained in a classical supervised manner, a fine-tuned model (trained on all three tasks, and fine-tuned on the two shown tasks), and a linear classifier trained on node embeddings learned with our proposed strategies (iSAME, eSAME) in a multi-task setting. A ✓ marks the tasks a model is trained on; each dataset cell reports GC / NC / LP, and "-" marks a task that is not among them.

GC	NC	LP	ENZYMES	PROTEINS	DHFR	COX2
Classical End-to-End Training
✓			51.6 / - / -	73.3 / - / -	71.5 / - / -	76.7 / - / -
	✓		- / 87.5 / -	- / 72.3 / -	- / 97.3 / -	- / 96.4 / -
		✓	- / - / 75.5	- / - / 85.6	- / - / 98.8	- / - / 98.3
Fine-Tuning
✓	✓		48.3 / 85.3 / -	73.6 / 72.0 / -	66.4 / 92.4 / -	80.0 / 92.3 / -
✓		✓	49.3 / - / 71.6	69.6 / - / 80.7	65.3 / - / 58.9	80.2 / - / 50.9
	✓	✓	- / 87.7 / 73.9	- / 80.4 / 81.5	- / 80.7 / 56.6	- / 87.4 / 52.3
iSAME (ours)
✓	✓		50.1 / 86.1 / -	73.1 / 76.6 / -	71.6 / 94.8 / -	75.2 / 95.4 / -
✓		✓	50.7 / - / 83.1	73.4 / - / 85.2	71.6 / - / 99.2	77.5 / - / 98.9
	✓	✓	- / 86.3 / 83.4	- / 79.4 / 87.7	- / 96.5 / 99.3	- / 95.5 / 99.0
✓	✓	✓	50.0 / 86.5 / 82.3	71.4 / 76.6 / 87.3	71.2 / 95.5 / 99.5	75.4 / 95.2 / 99.2
eSAME (ours)
✓	✓		51.7 / 86.1 / -	71.5 / 79.2 / -	70.1 / 95.7 / -	75.6 / 95.5 / -
✓		✓	51.9 / - / 80.1	71.7 / - / 85.4	70.1 / - / 99.1	77.5 / - / 98.8
	✓	✓	- / 86.7 / 82.2	- / 80.7 / 86.3	- / 96.6 / 99.4	- / 95.6 / 99.1
✓	✓	✓	51.5 / 86.3 / 81.1	71.3 / 79.6 / 86.8	70.2 / 95.3 / 99.5	77.7 / 95.7 / 98.8

Q3: Do iSAME and eSAME extract information that is not captured by classically trained multi-task models? We train a network, which we refer to as classifier, on the embeddings generated by a multi-task model, to perform a task that was not seen during the training of the latter. We compare the performance of the classifier on the embeddings learned by a model trained in a classical manner, and with our proposed methods. This test allows us to quantify if our approaches lead to “more informative” node embeddings. Results on the ENZYMES dataset are shown in Figure 3. We notice that embeddings learned by our proposed approaches lead to at least 10% higher performance. We observe an analogous trend on the other datasets (as reported in Appendix E).

Table 3: $\Delta_{m}$ (%) results for a classical multi-task model (Cl), a fine-tuned model (FT; trained on all three tasks and fine-tuned on two) and a linear classifier trained on the node embeddings learned using our meta-learning strategies (iSAME, eSAME) in a multi-task setting.

Task			Model	Dataset
GC	NC	LP		ENZYMES	PROTEINS	DHFR	COX2
✓	✓		Cl	$-0.1\pm 0.5$	$4.0\pm 1.0$	$-0.3\pm 0.2$	$0.5\pm 0.1$
			FT	$-4.5\pm 1.2$	$0.1\pm 0.5$	$-7.4\pm 1.4$	$0.1\pm 0.4$
			iSAME	$-2.3\pm 0.9$	$2.7\pm 1.5$	$-1.2\pm 0.4$	$-1.6\pm 0.2$
			eSAME	$-0.8\pm 0.8$	$3.2\pm 1.4$	$-1.8\pm 0.3$	$-1.2\pm 0.3$
✓		✓	Cl	$-25.3\pm 3.2$	$-5.3\pm 1.2$	$-28.3\pm 4.3$	$-21.4\pm 3.4$
			FT	$-5.1\pm 1.9$	$-5.4\pm 1.5$	$-24.5\pm 3.7$	$-22.6\pm 3.8$
			iSAME	$4.1\pm 0.5$	$-0.2\pm 0.9$	$0.2\pm 3.2$	$0.2\pm 0.5$
			eSAME	$3.2\pm 0.4$	$-1.2\pm 1.1$	$-0.7\pm 3.4$	$-0.8\pm 0.7$
	✓	✓	Cl	$7.2\pm 2.7$	$6.8\pm 0.9$	$-29.1\pm 7.7$	$-28.2\pm 4.5$
			FT	$-1.0\pm 0.3$	$3.1\pm 1.2$	$-28.9\pm 6.4$	$-28.3\pm 4.2$
			iSAME	$4.4\pm 1.1$	$6.1\pm 1.0$	$-0.1\pm 6.2$	$-0.6\pm 2.5$
			eSAME	$3.9\pm 1.3$	$6.1\pm 1.1$	$0.1\pm 6.4$	$-0.6\pm 2.6$
✓	✓	✓	Cl	$1.6\pm 1.3$	$2.9\pm 0.3$	$-18.9\pm 2.3$	$-16.9\pm 3.1$
			iSAME	$1.5\pm 1.0$	$2.2\pm 0.2$	$-0.5\pm 1.4$	$-0.9\pm 1.3$
			eSAME	$1.8\pm 0.9$	$2.8\pm 0.2$	$-1.0\pm 1.7$	$-0.4\pm 1.2$

Q4: Can the node embeddings learned by iSAME and eSAME be used to perform multiple tasks with comparable or better performance than classical multi-task models? We train the same multi-task model, both in the classical supervised manner, and with our proposed approaches, on all multi-task combinations. For our approaches, we then train a linear classifier on top of the node embeddings. We further consider the fine-tuning baseline introduced in Q2. We use the $\Delta_{m}$ metric [25], defined as the average per-task drop with respect to the single-task baseline: $\Delta_{m}=\frac{1}{T}\sum_{i=1}^{T}\left(M_{m,i}-M_{b,i}\right)/M_{b,i},$ where $M_{m,i}$ is the value of a task’s metric for the multi-task model, and $M_{b,i}$ is the value for the baseline. Results are shown in Table 3. We first notice that usually multi-task models achieve lower performance than specialized single-task ones. We then highlight that linear classifiers trained on the embeddings produced by our procedures are comparable, and in many cases superior, to end-to-end models. In fact, the latter are highly sensitive to the tasks that are being learned (e.g. GC and LP), with a worst-case average drop in performance of 29%. Our methods seem much less sensitive, with a worst-case average drop of less than 3%. Finally, we also notice that the fine-tuning baseline generally performs worse than classically trained models, confirming that transferring knowledge in multi-task settings is not easy, and more advanced techniques, like our proposed method SAME, are needed.
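The $\Delta_{m}$ formula can be checked by hand on the reported numbers. The helper below is a small sketch (the name `delta_m` is hypothetical) that assumes a higher metric value is better for every task, which holds for accuracy and ROC AUC, and that reports the result as a percentage. The example uses the mean ENZYMES values from Table 2 for iSAME trained on GC and NC, compared with the single-task baselines.

```python
from typing import Sequence


def delta_m(multi_task: Sequence[float], single_task: Sequence[float]) -> float:
    """Average relative per-task difference (in %) w.r.t. single-task baselines."""
    if len(multi_task) != len(single_task) or not multi_task:
        raise ValueError("need one baseline value per task")
    rel = [(m - b) / b for m, b in zip(multi_task, single_task)]
    return 100.0 * sum(rel) / len(rel)


# iSAME on GC and NC (ENZYMES) against the single-task GC and NC baselines.
EXAMPLE = delta_m(multi_task=[50.1, 86.1], single_task=[51.6, 87.5])
```

Here `EXAMPLE` evaluates to roughly -2.25, in line with the -2.3 reported for this configuration in Table 3. The two numbers need not match exactly: the helper is applied to mean values, and the procedure by which the table values are aggregated over the cross-validation folds is not assumed here.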

Considerations on iSAME and eSAME. In all our experiments we notice that the two variants of SAME achieve comparable performance. This suggests that the representation learning capabilities are an intrinsic property of optimization-based meta-learning approaches like MAML [10], and that strategies like ANIL [31] can help us lower the computational burden, while maintaining the desired properties.

6 Conclusions

We propose a novel meta-learning strategy for representation learning in multi-task settings. Our method overcomes the problems that arise when learning to solve multiple tasks concurrently by optimizing for a parameter setting that can quickly, i.e. with few steps of gradient descent, be adapted for high single-task performance on multiple tasks. We apply our method to graph representation learning, and find that it leads to higher quality node embeddings, both in the multi-task and in the single-task setting. We believe this work draws new interesting connections between meta-learning, representation learning, and multi-task learning, providing many directions for future research.

Acknowledgments and Disclosure of Funding

Part of this work was supported by the MIUR, the Italian Ministry of Education, University and Research, under PRIN Project n. 20174LF3T8 AHeAD (Efficient Algorithms for HArnessing Networked Data) and the initiative “Departments of Excellence” (Law 232/2016), and by the University of Padova under project SEED 2020 RATED-X.

References

Alet et al. [2019] Ferran Alet, Erica Weng, Tomas Lozano-Perez, and L. Kaelbling. Neural relational inference with fast modular meta-learning. In NeurIPS, 2019.
Avelar et al. [2019] Pedro Avelar, Henrique Lemos, Marcelo Prates, and Luis Lamb. Multitask learning on graph neural networks: Learning multiple graph centrality measures with a unified network. In ICANN Workshop and Special Sessions, 2019.
Bakshy et al. [2018] Eytan Bakshy, Lili Dworkin, Brian Karrer, Konstantin Kashin, Benjamin Letham, Ashwin Murthy, and Shaun Singh. AE: A domain-agnostic platform for adaptive experimentation. In NeurIPS Systems for ML Workshop, 2018.
Bose et al. [2019] Avishek Joey Bose, Ankit Jain, Piero Molino, and William L Hamilton. Meta-graph: Few shot link prediction via meta learning. arXiv, 2019.
Chami et al. [2020] Ines Chami, Sami Abu-El-Haija, Bryan Perozzi, Christopher Ré, and K. Murphy. Machine learning on graphs: A model and comprehensive taxonomy. arXiv, 2020.
Chen et al. [2018] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, 2018.
Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, 2016.
Deleu et al. [2019] Tristan Deleu, Tobias Würfl, Mandana Samiei, Joseph Paul Cohen, and Yoshua Bengio. Torchmeta: A Meta-Learning library for PyTorch. arXiv, 2019.
Fey and Lenssen [2019] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
Garcia and Bruna [2018] Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. In ICLR, 2018.
Gilmer et al. [2017] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In ICML, 2017.
Hamilton et al. [2017] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 2017.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Holtz et al. [2019] Chester Holtz, Onur Atan, Ryan Carey, and Tushit Jain. Multi-task learning on graphs with node and graph level labels. In NeurIPS Workshop on Graph Representation Learning, 2019.
Hospedales et al. [2020] Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. arXiv, 2020.
Kendall et al. [2018] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, 2018.
Kim et al. [2019] Jongmin Kim, Taesup Kim, S. Kim, and C. Yoo. Edge-labeling graph neural network for few-shot learning. In CVPR, 2019.
Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
Li and Ji [2019] Diya Li and Heng Ji. Syntax-aware multi-task graph convolutional networks for biomedical relation extraction. In LOUHI, 2019.
Liu et al. [2019a] Lu Liu, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. Learning to propagate for graph meta-learning. In NeurIPS, 2019a.
Liu et al. [2019b] Pengfei Liu, J. Fu, Y. Dong, Xipeng Qiu, and J. Cheung. Learning multi-task communication with message passing for sequence learning. In AAAI, 2019b.
Madhawa and Murata [2020] Kaushalya Madhawa and Tsuyoshi Murata. Active learning on graphs via meta learning. In ICML Workshop on Graph Representation Learning and Beyond, 2020.
Maninis et al. [2019] Kevis-Kokitsi Maninis, Ilija Radosavovic, and Iasonas Kokkinos. Attentive single-tasking of multiple tasks. In CVPR, 2019.
Montanari et al. [2019] Floriane Montanari, Lara Kuhnke, Antonius Ter Laak, and Djork-Arné Clevert. Modeling physico-chemical ADMET endpoints with multitask graph convolutional networks. Molecules, 2019.
Morris et al. [2020] Christopher Morris, Nils M. Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. TUDataset: A collection of benchmark datasets for learning with graphs. In ICML Workshop on Graph Representation Learning and Beyond, 2020.
Nguyen et al. [2020] Cuong Q. Nguyen, Constantine Kreatsoulas, and Kim M. Branson. Meta-learning GNN initializations for low-resource molecular property prediction. In ICML Workshop on Graph Representation Learning and Beyond, 2020.
Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Pedregosa et al. [2011] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011.
Raghu et al. [2020] Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In ICLR, 2020.
Scarselli et al. [2009] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 2009.
Standley et al. [2020] Trevor Standley, Amir R. Zamir, Dawn Chen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. Which tasks should be learned together in multi-task learning? In ICML, 2020.
Suo et al. [2020] Qiuling Suo, Jingyuan Chou, Weida Zhong, and Aidong Zhang. TAdaNet: Task-adaptive network for graph-enriched meta-learning. In ACM SIGKDD, 2020.
Vandenhende et al. [2020] Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, Dengxin Dai, and Luc Van Gool. Revisiting multi-task learning in the deep learning era. arXiv, 2020.
Veličković et al. [2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. In ICLR, 2018.
Wu et al. [2020] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
Xu et al. [2019] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.
Yao et al. [2020] Huaxiu Yao, Chuxu Zhang, Ying Wei, Meng Jiang, Suhang Wang, Junzhou Huang, Nitesh V. Chawla, and Zhenhui Li. Graph few-shot learning via knowledge transfer. In AAAI, 2020.
Zhang et al. [2018] Yu Zhang, Ying Wei, and Qiang Yang. Learning to multitask. In NeurIPS, 2018.
Zhou et al. [2019] Fan Zhou, Chengtai Cao, Kunpeng Zhang, Goce Trajcevski, Ting Zhong, and Ji Geng. Meta-GNN: On few-shot node classification in graph meta-learning. In CIKM, 2019.
Zügner and Günnemann [2019] Daniel Zügner and Stephan Günnemann. Adversarial attacks on graph neural networks via meta learning. In ICLR, 2019.

Appendix A Comparison with Traditional Training Approaches

Our proposed meta-learning approach differs significantly from both the classical training strategy (Algorithm 2) and the traditional meta-learning approach (Algorithm 3).

The classical training approach for multi-task models takes as input a batch of graphs, which is simply a set of graphs, where on each graph the model has to execute all the tasks. Based on the cumulative loss on all tasks

\mathcal{L}=\lambda^{(GC)}\mathcal{L}^{(\text{GC})}+\lambda^{(NC)}\mathcal{L}^{(\text{NC})}+\lambda^{(LP)}\mathcal{L}^{(\text{LP})}

for all the graphs in the batch, the parameters are updated with some form of gradient descent, and the procedure is repeated for each batch.

The traditional meta-learning approach takes as input an episode, like our approach, but for every graph in the episode all the tasks are performed. The support set and target set are single sets of graphs, where every task can be performed on all graphs. The support set is used to obtain the adapted parameters $\theta^{\prime}$ , which have the goal of concurrently solving all tasks on all graphs in the target set. The loss functions, both for the inner loop and for the outer loop, are the same as the one used by the classical training approach. The outer loop then updates the parameters aiming at a setting that can easily, i.e. with a few steps of gradient descent, be adapted to perform multiple tasks concurrently given a support set.

Input: Model $f_{\theta}$; Batches $\mathcal{B}=\{\mathcal{B}_{1},\dots,\mathcal{B}_{n}\}$
init($\theta$)
for $\mathcal{B}_{i}$ in $\mathcal{B}$ do
    loss $\leftarrow$ concurrently perform all tasks on all graphs in $\mathcal{B}_{i}$
    $\theta\leftarrow$ UPDATE($\theta$, loss)
end for

Algorithm 2 Classical Training

Input: Model $f_{\theta}$; Episodes $\mathcal{E}=\{\mathcal{E}_{1},\dots,\mathcal{E}_{n}\}$
init($\theta$)
for $\mathcal{E}_{i}$ in $\mathcal{E}$ do
    i_loss $\leftarrow$ concurrently perform all tasks on all support set graphs
    $\theta^{\prime}\leftarrow$ ADAPT($\theta$, i_loss)
    o_loss $\leftarrow$ concurrently perform all tasks on all target set graphs using parameters $\theta^{\prime}$
    $\theta\leftarrow$ UPDATE($\theta$, $\theta^{\prime}$, o_loss)
end for

Algorithm 3 Traditional Meta-Learning
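To make the ADAPT and UPDATE steps of Algorithm 3 concrete, the following is a toy sketch rather than the setup used in the paper. It assumes a scalar model $y=\theta x$ with a mean squared error loss standing in for the GNN and the cumulative multi-task loss, a single inner gradient step, and an exact derivative through that step. All function names and learning rates are hypothetical.

```python
def grad(theta, data):
    """Gradient of the mean squared error of y = theta * x over (x, y) pairs."""
    return sum(2 * x * (theta * x - y) for x, y in data) / len(data)


def hessian(data):
    """Second derivative of the same loss (independent of theta for this model)."""
    return sum(2 * x * x for x, _ in data) / len(data)


def adapt(theta, support, inner_lr):
    """Inner loop (ADAPT): one gradient step on the support set."""
    return theta - inner_lr * grad(theta, support)


def meta_update(theta, support, target, inner_lr, outer_lr):
    """Outer loop (UPDATE): step on the target loss evaluated at theta'."""
    theta_prime = adapt(theta, support, inner_lr)
    # Chain rule through the inner step: d theta' / d theta = 1 - inner_lr * hessian.
    d_theta_prime = 1.0 - inner_lr * hessian(support)
    outer_grad = grad(theta_prime, target) * d_theta_prime
    return theta - outer_lr * outer_grad
```

Calling `meta_update` repeatedly over a stream of episodes plays the role of the `for` loop in Algorithm 3. In the traditional approach the same loss is used in both loops, which is what distinguishes it from the single-task inner loop of the proposed procedure.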

Appendix B Episode Design Algorithm

Algorithm 4 contains the procedure for the creation of the episodes for our meta-learning procedures. The algorithm takes as input a batch of graphs (with graph labels, node labels, and node features) and the loss function balancing weights, and outputs a multi-task episode. We assume that each graph has a set of attributes that can be accessed with a dot-notation (like in most object-oriented programming languages).

Notice how the episodes are created so that only one task is performed on each graph. This is important because, in the inner loop of our meta-learning procedure, the learner adapts and tests the adapted parameters on one task at a time. The outer loop then updates the parameters, optimizing for a representation that leads to fast single-task adaptation. This procedure bypasses the problem of learning parameters that directly solve multiple tasks, which can be very challenging.

Another important aspect to notice is that the support and target sets are designed as if they were the training and validation splits for training a single-task model with the classical procedure. This way the meta-objective becomes to train a model that can generalize well.

Input: Batch of $n$ randomly sampled graphs $\mathcal{B}=\{\mathcal{G}_{1},\dots,\mathcal{G}_{n}\}$; Loss weights $\lambda^{(GC)},\lambda^{(NC)},\lambda^{(LP)}\in[0,1]$
Output: Episode $\mathcal{E}_{i}=(\mathcal{L}^{(m)}_{\mathcal{E}_{i}},\mathcal{S}^{(m)}_{\mathcal{E}_{i}},\mathcal{T}^{(m)}_{\mathcal{E}_{i}})$
$\mathcal{B}^{(GC)},\mathcal{B}^{(NC)},\mathcal{B}^{(LP)}\leftarrow$ equally divide the graphs in $\mathcal{B}$ in three sets
/* Graph Classification */
$\mathcal{S}^{(\text{GC})}_{\mathcal{E}_{i}},\mathcal{T}^{(\text{GC})}_{\mathcal{E}_{i}}\leftarrow$ randomly divide $\mathcal{B}^{(GC)}$ with a 60/40 split
/* Node Classification */
for $\mathcal{G}_{i}$ in $\mathcal{B}^{(NC)}$ do
    num_labelled_nodes $\leftarrow\mathcal{G}_{i}$.num_nodes $\times\,0.3$
    $\mathcal{N}\leftarrow$ divide nodes per class, then iteratively randomly sample one node per class without replacement and add it to $\mathcal{N}$ until $\lvert\mathcal{N}\rvert=$ num_labelled_nodes
    $\mathcal{G}_{i}^{\prime}\leftarrow$ copy($\mathcal{G}_{i}$)
    $\mathcal{G}_{i}$.labelled_nodes $\leftarrow\mathcal{N}$; $\mathcal{G}_{i}^{\prime}$.labelled_nodes $\leftarrow\mathcal{G}_{i}$.nodes $\setminus\,\mathcal{N}$
    $\mathcal{S}^{(NC)}_{\mathcal{E}_{i}}$.add($\mathcal{G}_{i}$); $\mathcal{T}^{(NC)}_{\mathcal{E}_{i}}$.add($\mathcal{G}_{i}^{\prime}$)
end for
/* Link Prediction */
for $\mathcal{G}_{i}$ in $\mathcal{B}^{(LP)}$ do
    $E_{i}^{(N)}\leftarrow$ randomly pick negative samples (edges that are not in the graph; possibly in the same number as the number of edges in the graph)
    $E_{i}^{1,(N)},E_{i}^{2,(N)}\leftarrow$ divide $E_{i}^{(N)}$ with an 80/20 split
    $E_{i}^{(P)}\leftarrow$ randomly remove $20\%$ of the edges in $\mathcal{G}_{i}$
    $\mathcal{G}_{i}^{\prime(1)}\leftarrow\mathcal{G}_{i}$ with the edges in $E_{i}^{(P)}$ removed
    $\mathcal{G}_{i}^{\prime(2)}\leftarrow$ copy($\mathcal{G}_{i}^{\prime(1)}$)
    $\mathcal{G}_{i}^{\prime(1)}$.positive_edges $\leftarrow\mathcal{G}_{i}^{\prime(1)}$.edges; $\mathcal{G}_{i}^{\prime(2)}$.positive_edges $\leftarrow E_{i}^{(P)}$
    $\mathcal{G}_{i}^{\prime(1)}$.negative_edges $\leftarrow E_{i}^{1,(N)}$; $\mathcal{G}_{i}^{\prime(2)}$.negative_edges $\leftarrow E_{i}^{2,(N)}$
    $\mathcal{S}^{(LP)}_{\mathcal{E}_{i}}$.add($\mathcal{G}_{i}^{\prime(1)}$); $\mathcal{T}^{(LP)}_{\mathcal{E}_{i}}$.add($\mathcal{G}_{i}^{\prime(2)}$)
end for
$\mathcal{S}^{(m)}_{\mathcal{E}_{i}}\leftarrow\{\mathcal{S}^{(\text{GC})}_{\mathcal{E}_{i}},\mathcal{S}^{(\text{NC})}_{\mathcal{E}_{i}},\mathcal{S}^{(\text{LP})}_{\mathcal{E}_{i}}\}$
$\mathcal{T}^{(m)}_{\mathcal{E}_{i}}\leftarrow\{\mathcal{T}^{(\text{GC})}_{\mathcal{E}_{i}},\mathcal{T}^{(\text{NC})}_{\mathcal{E}_{i}},\mathcal{T}^{(\text{LP})}_{\mathcal{E}_{i}}\}$
$\mathcal{L}^{(\text{GC})}_{\mathcal{T}_{i}}\leftarrow$ Cross-Entropy($\cdot$); $\mathcal{L}^{(\text{NC})}_{\mathcal{T}_{i}}\leftarrow$ Cross-Entropy($\cdot$)
$\mathcal{L}^{(\text{LP})}_{\mathcal{T}_{i}}\leftarrow$ Binary Cross-Entropy($\cdot$)
$\mathcal{L}^{(m)}_{\mathcal{E}_{i}}=\lambda^{(GC)}\mathcal{L}^{(\text{GC})}_{\mathcal{T}_{i}}+\lambda^{(NC)}\mathcal{L}^{(\text{NC})}_{\mathcal{T}_{i}}+\lambda^{(LP)}\mathcal{L}^{(\text{LP})}_{\mathcal{T}_{i}}$
Return $\mathcal{E}_{i}=(\mathcal{L}^{(m)}_{\mathcal{E}_{i}},\mathcal{S}^{(m)}_{\mathcal{E}_{i}},\mathcal{T}^{(m)}_{\mathcal{E}_{i}})$

Algorithm 4 Episode Design Algorithm
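The support/target construction of Algorithm 4 can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions that are not part of the paper: graphs are dictionaries with hypothetical keys `nodes`, `edges`, and `node_labels`, graphs are undirected, graphs left over after the three-way division are dropped, and the loss functions are omitted. The function names are hypothetical as well.

```python
import copy
import random
from itertools import combinations


def sample_labelled_nodes(graph, fraction, rng):
    """Cycle over the classes, drawing one node per class without replacement."""
    budget = int(len(graph["nodes"]) * fraction)
    per_class = {}
    for node in graph["nodes"]:
        per_class.setdefault(graph["node_labels"][node], []).append(node)
    for nodes in per_class.values():
        rng.shuffle(nodes)
    chosen = []
    while len(chosen) < budget:
        for nodes in per_class.values():
            if nodes and len(chosen) < budget:
                chosen.append(nodes.pop())
    return set(chosen)


def make_episode(batch, rng):
    """Build support and target sets with exactly one task per graph."""
    graphs = list(batch)
    rng.shuffle(graphs)
    third = len(graphs) // 3  # assumption: leftover graphs are dropped
    gc, nc, lp = graphs[:third], graphs[third:2 * third], graphs[2 * third:3 * third]
    support = {"GC": [], "NC": [], "LP": []}
    target = {"GC": [], "NC": [], "LP": []}

    # Graph classification: 60/40 split of whole graphs.
    cut = round(0.6 * len(gc))
    support["GC"], target["GC"] = gc[:cut], gc[cut:]

    # Node classification: 30% of the nodes are labelled in the support copy,
    # the remaining nodes are labelled in the target copy.
    for g in nc:
        labelled = sample_labelled_nodes(g, 0.3, rng)
        s, t = copy.deepcopy(g), copy.deepcopy(g)
        s["labelled_nodes"] = labelled
        t["labelled_nodes"] = set(g["nodes"]) - labelled
        support["NC"].append(s)
        target["NC"].append(t)

    # Link prediction: hide 20% of the edges and split negatives 80/20.
    for g in lp:
        edges = list(g["edges"])
        existing = {frozenset(e) for e in edges}
        non_edges = [p for p in combinations(g["nodes"], 2)
                     if frozenset(p) not in existing]
        negatives = rng.sample(non_edges, min(len(edges), len(non_edges)))
        neg_cut = int(0.8 * len(negatives))
        rng.shuffle(edges)
        n_removed = int(0.2 * len(edges))
        removed, kept = edges[:n_removed], edges[n_removed:]
        s = copy.deepcopy(g)
        s["edges"] = kept  # both copies only see the remaining edges
        t = copy.deepcopy(s)
        s["positive_edges"], s["negative_edges"] = list(kept), negatives[:neg_cut]
        t["positive_edges"], t["negative_edges"] = removed, negatives[neg_cut:]
        support["LP"].append(s)
        target["LP"].append(t)

    return {"support": support, "target": target}
```

Passing a seeded `random.Random` instance as `rng` makes the episode reproducible. Every graph ends up in exactly one of the three task groups, which is what lets the inner loop adapt to one task at a time.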

Appendix C Model Architecture

We use an encoder-decoder model with a multi-head architecture. The backbone (which represents the encoder) is composed of 3 GCN [20] layers with ReLU non-linearities and residual connections [14]. The decoder is composed of three heads. The node classification head is a single layer neural network with a Softmax activation that is shared across nodes and maps node embeddings to class predictions. In the graph classification head, first a single layer neural network (shared across nodes) performs a linear transformation (followed by a ReLU activation) of the node embeddings. The transformed node embeddings are then averaged and a final single layer neural network with Softmax activation outputs the class predictions. The link prediction head is composed of a single layer neural network with a ReLU non-linearity that transforms node embeddings, and another single layer neural network that takes as input the concatenation of two node embeddings and outputs the probability of a link between them.
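As a compact restatement of this architecture, the encoder and the three heads can be written as below. This is a sketch under stated assumptions, not a specification taken from the paper: $\hat{A}$ denotes the normalized adjacency matrix used by GCN, the residual connection is placed after the non-linearity and assumes matching dimensions, bias terms of the GCN layers are omitted, and the sigmoid $\sigma$ in the link prediction head is assumed because the text only states that a probability is produced. All weight names are introduced here for illustration.

```latex
% Encoder: three GCN layers with ReLU and residual connections
% (residual placement is an assumption of this sketch).
H^{(l+1)} = \mathrm{ReLU}\big(\hat{A} H^{(l)} W^{(l)}\big) + H^{(l)}, \qquad l = 0, 1, 2

% h_v is the row of H^{(3)} for node v, i.e. its embedding.

% Node classification head (shared across nodes).
\hat{y}_v = \mathrm{softmax}\big(W_{NC}\, h_v + b_{NC}\big)

% Graph classification head: per-node transform, mean over nodes, classifier.
\hat{y}_{\mathcal{G}} = \mathrm{softmax}\Big(W_{GC}^{(2)} \Big[\frac{1}{|V|}\sum_{v \in V}
    \mathrm{ReLU}\big(W_{GC}^{(1)} h_v + b_{GC}^{(1)}\big)\Big] + b_{GC}^{(2)}\Big)

% Link prediction head: per-node transform, concatenation, link probability
% (the sigmoid is an assumption).
z_v = \mathrm{ReLU}\big(W_{LP}^{(1)} h_v + b_{LP}^{(1)}\big), \qquad
\hat{p}_{uv} = \sigma\big(W_{LP}^{(2)} [\, z_u \,\Vert\, z_v \,] + b_{LP}^{(2)}\big)
```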

Appendix D Additional Experimental Details

In this section we provide additional information on the implementation of the models used in our experimental section. We implement our models using PyTorch [29], PyTorch Geometric [9] and Torchmeta [8]. For all models the number and structure of the layers is as described in Appendix C, where we use 256-dimensional node embeddings at every layer.

To perform multiple tasks, we consider datasets with graph labels, node attributes, and node labels from the widely used TUDataset library [27]. At every cross-validation fold the datasets are split into $70\%$ for training, $10\%$ for validation, and $20\%$ for testing. For each model we perform 100 iterations of hyperparameter optimization over the same search space (for shared parameters) using Ax [3].

We tried more sophisticated methods for balancing the contribution of the loss functions during multi-task training, such as GradNorm [6] and Uncertainty Weights [17], but found that they usually do not improve performance. Furthermore, in the few cases where they do, they help both classically trained models and models trained with our proposed procedures. We therefore set the balancing weights to $\lambda^{(GC)}=\lambda^{(NC)}=\lambda^{(LP)}=1$ to provide a more direct comparison between the training strategies.

The multi-task performance $\Delta_{m}$ metric [25] is defined as the average per-task drop with respect to the single-task baseline: $\Delta_{m}=\frac{1}{T}\sum_{i=1}^{T}\left(M_{m,i}-M_{b,i}\right)/M_{b,i},$ where $M_{m,i}$ is the value for the multi-task model, and $M_{b,i}$ for the baseline.
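A minimal sketch of this computation is given below. It assumes that higher values are better for every metric (as with accuracy and ROC AUC), so a negative result means the multi-task model is worse than the single-task baselines on average. The function name is hypothetical and the numbers in the usage line are made up for illustration; they are not results from the paper.

```python
def delta_m(multi_task, single_task):
    """Average relative difference between multi-task and single-task scores.

    multi_task[i] and single_task[i] are the scores M_{m,i} and M_{b,i}
    on task i; higher is assumed to be better for every task.
    """
    if not multi_task or len(multi_task) != len(single_task):
        raise ValueError("need one baseline score per multi-task score")
    ratios = [(m - b) / b for m, b in zip(multi_task, single_task)]
    return sum(ratios) / len(ratios)


# Illustrative values only: one task unchanged, one 20% worse, one 20% better.
example = delta_m([50.0, 80.0, 90.0], [50.0, 100.0, 75.0])
```

Multiplying the result by 100 expresses it as a percentage.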

Linear Model.

The linear model trained on the embeddings produced by our proposed method is a standard linear SVM. In particular we use the implementation available in Scikit-learn [30] with default hyperparameters. For graph classification, we take the mean of the node embeddings as input. For link prediction we take the concatenation of the embeddings of two nodes. For node classification we keep the embeddings unaltered.
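The three input constructions described above can be sketched with plain Python lists. The function names are hypothetical, and embeddings are assumed to be given as one list of floats per node, indexed by node id.

```python
def graph_features(node_embeddings):
    """Graph classification input: element-wise mean of the node embeddings."""
    n = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(e[d] for e in node_embeddings) / n for d in range(dim)]


def link_features(node_embeddings, u, v):
    """Link prediction input: concatenation of the two node embeddings."""
    return list(node_embeddings[u]) + list(node_embeddings[v])


def node_features(node_embeddings):
    """Node classification input: the embeddings, unaltered."""
    return [list(e) for e in node_embeddings]
```

The resulting vectors can then be passed, together with the corresponding labels, to a linear SVM from Scikit-learn. `sklearn.svm.LinearSVC` with default arguments is one candidate; the text does not state which class was used.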

Deep Learning Baselines.

We train the single task models for 1000 epochs, and the multi-task models for 5000 epochs, with early stopping on the validation set (for multi-task models we use the sum of the task validation losses or accuracies as metrics for early-stopping). Optimization is done using Adam [19]. For node classification and link prediction we found that normalizing the node embeddings to unit norm in between GCN layers helps performance.

Our Meta-Learning Procedure.

We train the single task models for 5000 epochs, and the multi-task models for 15000 epochs, with early stopping on the validation set (for multi-task models we use the sum of the task validation losses or accuracies as metrics for early-stopping). Early stopping is very important in this case as it is the only way to check if the meta-learned model is overfitting the training data. The inner loop adaptation consists of 1 step of gradient descent. Optimization in the outer loop is done using Adam [19]. We found that normalizing the node embeddings to unit norm in between GCN layers helps performance.

Table 4: Results of a neural network trained on the embeddings generated by a multi-task model, to perform a task that was not seen during training by the multi-task model. “x,y ->z” indicates that the multi-task model was trained on tasks x and y, and the neural network is performing task z.

Task	Model	Dataset
		ENZYMES	PROTEINS	DHFR	COX2
GC,NC ->LP	Cl	$56.9\pm 3.9$	$54.4\pm 1.4$	$61.2\pm 2.2$	$59.8\pm 0.4$
	iSAME	$77.3\pm 4.5$	$88.5\pm 1.8$	$99.8\pm 1.8$	$97.1\pm 2.0$
	eSAME	$78.9\pm 2.8$	$89.1\pm 1.5$	$99.7\pm 2.2$	$95.8\pm 3.3$
GC,LP ->NC	Cl	$69.1\pm 1.2$	$57.3\pm 1.6$	$58.3\pm 9.3$	$68.9\pm 10.7$
	iSAME	$73.3\pm 2.1$	$59.2\pm 2.5$	$77.6\pm 1.6$	$78.1\pm 4.6$
	eSAME	$79.1\pm 1.7$	$64.7\pm 3.0$	$76.1\pm 2.7$	$76.9\pm 3.3$
NC,LP ->GC	Cl	$47.1\pm 2.4$	$75.3\pm 1.5$	$77.5\pm 3.1$	$79.9\pm 3.4$
	iSAME	$48.5\pm 5.5$	$76.1\pm 2.3$	$76.1\pm 3.7$	$79.7\pm 5.1$
	eSAME	$56.6\pm 3.1$	$74.6\pm 2.7$	$77.1\pm 3.6$	$79.3\pm 6.2$

Appendix E Full Results for Q3

Table 4 contains results for a neural network, trained on the embeddings generated by a multi-task model, to perform a task that was not seen during the training of the multi-task model. Accuracy (%) is used for node classification (NC) and graph classification (GC); ROC AUC (%) is used for link prediction (LP). The embeddings produced by our meta-learning methods lead to higher performance (up to 35%), showing that our procedures lead to the extraction of more informative node embeddings with respect to the classical end-to-end training procedure.