
Is Self-Supervised Pretraining Good for Extrapolation in Molecular Property Prediction?

Shun Takashige
The University of Tokyo
[email protected]
Masatoshi Hanai
The University of Tokyo
[email protected]
Toyotaro Suzumura
The University of Tokyo
[email protected]
Limin Wang
The University of Tokyo
[email protected]
Kenjiro Taura
The University of Tokyo
[email protected]
Abstract

The prediction of material properties plays a crucial role in the development and discovery of materials in diverse applications, such as batteries, semiconductors, catalysts, and pharmaceuticals. Recently, there has been growing interest in employing data-driven approaches using machine learning technologies, in combination with conventional theoretical calculations. In materials science, the prediction of unobserved values, commonly referred to as extrapolation, is particularly critical for property prediction as it enables researchers to gain insight into materials beyond the limits of available data. However, even with recent advancements in powerful machine learning models, accurate extrapolation is still widely recognized as a significantly challenging problem. On the other hand, self-supervised pretraining is a machine learning technique where a model is first trained on unlabeled data using relatively simple pretext tasks before being trained on labeled data for target tasks. As self-supervised pretraining can effectively utilize material data without observed property values, it has the potential to improve the model's extrapolation ability. In this paper, we clarify how such self-supervised pretraining can enhance extrapolation performance. We propose an experimental framework for this demonstration and empirically reveal that, while models are unable to accurately extrapolate absolute property values, self-supervised pretraining enables them to learn the relative tendencies of unobserved property values and thereby improves extrapolation performance.

1 Introduction

The prediction of material property values is essential for the development and discovery of materials across a broad range of applications such as batteries, semiconductors, catalysts, and pharmaceuticals. The prediction is typically conducted for a large set of candidate materials, aiming to identify those that meet the desired property requirements. As the computational cost of calculating material properties based on physical simulations, such as density functional theory (DFT) or coupled cluster (CC), can be excessively high, data-driven approaches that employ surrogate models trained on a subset of simulation results from the candidate materials have attracted much attention.

The existing work on the prediction of material property values has mainly focused on interpolation problems, where the training data and test data are assumed to be independent and identically distributed (i.i.d.). However, in material development and discovery, extrapolation problems, in which the training data and test data are assumed to have different distributions, are practically much more important. One of the ultimate goals in material development is to discover a material with a completely unseen physical property, which can be achieved only through extrapolation.

Figure 1: Two examples of extrapolation and distributions in the case of predicting an underlying function $g(x)=\sin(x)$. The red dotted line is the expected prediction line. Left: the training distribution and extrapolation region are determined based on input features. Right: the training distribution and extrapolation region are determined based on output labels.

The formation of the data distribution is significant in formulating the extrapolation problem, and existing research can be categorized into two distinct approaches: those based on input features and those based on output labels. Figure 1 shows one of the simplest cases of extrapolation and the difference between the two approaches when predicting $g(x)=\sin(x)$. In the former approach, the data forms a distribution based on input features, as in most studies [8, 17, 11, 18, 1, 7, 16, 20, 26, 27, 12, 23]. Thus, the training data and test data include different types of inputs, such as image data with different colors or sizes. In the latter approach, the data forms a distribution based on (continuous) labels [24]. Thus, the training data and test data include different types of labels, such as different ages in face image data. Typical classification tasks always use the input-feature-based data distribution for extrapolation, since their labels are discrete, making it impossible to predict unknown labels.

In this paper, we focus on the extrapolation problem formulated by the label-based data distribution, as material development has a strong need to predict unobserved physical property values, i.e., unseen labels. Our research differs from the existing label-based approach [24] in two key aspects.

First, we define extrapolation in a manner that is more appropriate for the application of predicting material property values. As [24] mainly addresses label imbalance issues in image recognition tasks, its extrapolation region is defined as the special case of the imbalanced region where label data is completely absent. In contrast, we define an extrapolation region as one that lies outside the range between the minimum and maximum label values. This definition is more relevant to material property value prediction, where extreme values (e.g., exceptionally large or small values) are crucial.

Second, our study emphasizes the utilization of self-supervised pretraining for extrapolation, whereas [24] proposes a training method to correct the imbalance of label data. Our research highlights the unique situation in material property prediction, where numerous unlabeled data points are available with known input structural data, yet the material properties themselves remain unknown. Therefore, to utilize such unlabeled data, we leverage self-supervised pretraining, which enables the model to learn useful features from the input structural data without relying on explicit label information.

In this paper, we first formulate extrapolation for material property prediction considering the distribution of labels. Next, we analyze the extrapolation ability of an existing model based on this formulation. Finally, we demonstrate the effectiveness of several self-supervised pretraining strategies in improving extrapolation ability.

To summarize the main contributions, this paper

  • re-defines an extrapolation problem in material property value prediction.

  • reveals that existing models could not predict values above a certain threshold in extrapolation.

  • reveals that self-supervised pretraining improves the extrapolation of relative tendencies to some extent, although the property values themselves could not be predicted exactly.

  • reveals that the utilization of validation data in pretraining contributes to the performance improvement.

2 Background

2.1 Extrapolation

Extrapolation generally means predicting data sampled from unseen distributions or domains. Following [22], the general definition of extrapolation is as follows. Let $\mathcal{X}$ be the domain of interest, such as the structural data of materials. A target function $g\colon\mathcal{X}\to\mathbb{R}$ generates labels from the inputs, so a label $y\in\mathbb{R}$ and an input $\boldsymbol{x}\in\mathcal{X}$ satisfy $y=g(\boldsymbol{x})$. Given that $\mathcal{D}$ is the support of a certain training distribution, the training data are expressed as $\{(\boldsymbol{x}_{i},y_{i})\}_{i=1}^{n}\subset\mathcal{D}$. Note that since an input $\boldsymbol{x}\in\mathcal{X}$ can be in an arbitrary format such as a graph, various definitions can be considered for the data region and distribution. Extrapolation aims to learn the target function over $\mathcal{X}\setminus\mathcal{D}$ by minimizing the extrapolation error $\mathbb{E}_{\boldsymbol{x}\sim\mathcal{X}\setminus\mathcal{D}}\left[\ell\left(f(\boldsymbol{x}),g(\boldsymbol{x})\right)\right]$, where $f\colon\mathcal{X}\to\mathbb{R}$ is a model and $\ell\colon\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ is a loss function.

Extrapolation is also recognized as a difficult task because models are poor at learning non-linearity outside the training distribution. This issue was examined by Xu et al. (2021) [22], who proved that an MLP with ReLU activations converges to a linear function outside the training distribution and therefore cannot fit a nonlinear target function there.
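The following is a minimal, self-contained sketch (not from the original paper) illustrating this behavior with PyTorch: a small ReLU MLP fitted to $g(x)=\sin(x)$ on a bounded interval tracks the function inside that interval but degenerates to an essentially linear prediction outside it, as in the setting of Figure 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Train a small ReLU MLP on g(x) = sin(x) restricted to [-pi, pi].
x_train = torch.linspace(-3.14, 3.14, 512).unsqueeze(1)
y_train = torch.sin(x_train)

mlp = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)
for _ in range(2000):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(mlp(x_train), y_train)
    loss.backward()
    optimizer.step()

# Inside the training range the fit tracks sin(x); outside it the output
# becomes essentially linear in x, so the nonlinear oscillation is lost.
x_test = torch.tensor([[0.5], [3.0], [6.0], [9.0]])
print(torch.cat([x_test, mlp(x_test).detach(), torch.sin(x_test)], dim=1))
```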

2.2 Molecular Property Prediction

In molecular property prediction, the input is a molecular graph and the output is a continuous label. Since a molecular graph can be represented as $\mathcal{G}=(V,E)$ with a set of nodes $V$ representing atoms and a set of edges $E$ representing bonds, the problem is to learn the relationship $g\colon\mathcal{G}\to\mathbb{R}$. To model it, GNNs and transformer-based models have been extensively developed [13, 14, 25].

Although these models mainly work under the i.i.d. assumption, some research has tackled extrapolation, or out-of-distribution generalization, in molecular property prediction. Such studies can be categorized according to how the data distributions are formed: based on input features or based on output labels.

Most previous studies [23, 12] defined extrapolation by dividing data distributions based on input features. For example, Yang et al. (2022) [23] use scaffolds and sizes of graphs to decide distributions. Although Chan et al. (2021) [2] formed distributions based on output labels, their aim is generating data rather than prediction. Whereas the definition based on input features has the advantage of making models robust to a wide variety of inputs, it does not meet the need of finding materials with extraordinary property values. To realize this, a new definition of extrapolation based on output labels in molecular property prediction is needed.

2.3 Self-Supervised Learning on Graph

Self-supervised learning is a method of training a model using only input data without labels. Since self-supervised learning does not use labels, it has been widely studied for utilizing unlabeled data. Considering the fact that there are plenty of molecules whose target property values are unknown, it is a powerful tool for our research. The following three types of pretext tasks are related to self-supervised learning on graphs.

Attribute Level Task. A popular method for node-level pretext tasks is the node and edge masking problem. This task is a classification problem in which the model predicts the true attributes of the masked parts of the input. These masking tasks were first developed in the field of natural language processing, such as BERT [3], and have been applied to graphs as well. Through this task, encoders can gain effective information related to graph attributes [19, 5].

Structure Level Task. This task handles structures in graphs and extracts important information regarding the connectivity between nodes without considering the attributes. Structures strongly influence the characteristics of molecules because the electrons involved in the bonds play an important role in determining the property values. For example, the number of rings in a graph can be a useful indicator for characterizing molecules.

Attribute + Structure Level Task. The task is mainly to predict the properties of subgraphs with attributes and structures. This often needs to prepare a set of specific graphs in advance for training objectives, e.g., whether the specific graphs are included in a given input graph. Most of them perform graph matching between subgraphs of inputs and specific graphs in a set of graphs called motif vocabulary [19]. The motif vocabularies are often created using domain knowledge in chemistry.

3 Experimental Framework

We have to define extrapolation to fit the aim of finding molecules with high or low property values. For this new definition, the distribution based on output labels should be described clearly, as we expect models to predict property values, i.e., output labels, outside those of the training distribution. This is possible because molecular property prediction is a regression task and its labels are continuous. In addition, self-supervised pretraining is considered a promising approach for improving extrapolation performance. Since the technique can utilize graph structures whose labels are unknown during training, and may alleviate the non-linearity problem according to [22], we need to demonstrate its impact on extrapolation.

To evaluate the performance considering the above points, we propose an experimental framework. Firstly, the formulation of label-based data distribution and extrapolation in material science is introduced. Then, the hypotheses regarding self-supervised pretraining and the utilization of unlabeled data are described. Finally, details of the experiments for verifying them, including the pretraining procedure, are explained.

3.1 Problem Definition

Instead of the traditional definition, we define extrapolation by dividing data distributions based on output labels. We formulate this definition using the same notation as in Section 2.1, and it can be described as follows.

Definition 1 (Extrapolation in our study)

We assume the model $f\colon\mathcal{X}\to\mathbb{R}$ is trained given a training set $\{(\boldsymbol{x}_{i},y_{i})\}_{i=1}^{n}\subset\mathcal{D}$ with a target function $g\colon\mathcal{X}\to\mathbb{R}$. Also, the training distribution $\mathcal{D}$ is defined by the following formulation.

$$\mathcal{D}=\{(\boldsymbol{x},y)\mid\min\{y_{i}\}_{i=1}^{n}\leq y\leq\max\{y_{i}\}_{i=1}^{n}\}$$

Let $\ell\colon\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ be a loss function and $\mathcal{P}$ be a distribution over $\mathcal{X}\setminus\mathcal{D}$. Then extrapolation is the task in which the model $f$ learns the distribution $\mathcal{P}$ by minimizing the extrapolation error $\mathbb{E}_{\boldsymbol{x}\sim\mathcal{P}}\left[\ell\left(f(\boldsymbol{x}),g(\boldsymbol{x})\right)\right]$.

3.2 Hypothesis

In this research, two hypotheses regarding the effects of self-supervised pretraining on extrapolation will be verified. We give an overview of the hypotheses and the technical reasons why they are made.

3.2.1 H1: Is self-supervised pretraining effective for extrapolation?

First of all, we hypothesized that self-supervised pretraining itself is effective for extrapolation. There are two reasons for this hypothesis. Firstly, previous research has shown that self-supervised learning and pretraining make models robust [8, 15, 7]. It would be of great merit to be able to acquire such representations from various types of data. Secondly, it is possible to incorporate non-linearity by pretraining. As explained in Section 2.1, an MLP with ReLU cannot predict non-linearity outside the training distribution. As a solution to this, Xu et al. (2021) [22] proposed to use representation learning with a different task before fine-tuning. Based on that paper, we assume that non-linearity can be given to the model by performing self-supervised pretraining.

3.2.2 H2: Does utilization of unlabeled data help extrapolation?

As a second hypothesis, we claim that extrapolation can be improved by utilizing unlabeled data in self-supervised pretraining. Significant performance degradation on the unlabeled region can be expected because training with only labeled data leads to overfitting to that data. This expectation arises from the assumption that, under our definition, the unlabeled data holds important information that the labeled data does not have.

Therefore, we hypothesized that it might be possible to reduce overfitting by treating both labeled and unlabeled data as inputs for self-supervised pretraining, which can handle unlabeled data since it requires graph structures rather than labels.

3.3 Experiment Design

Figure 2: An overview of our experiment when models are trained by self-supervised pretraining and fine-tuning. The flow is illustrated from the data split to the downstream task.

Firstly, Figure 2 gives an overview of the flow when using self-supervised pretraining in our experiment. Training the models in this experiment consists of two steps: self-supervised pretraining and fine-tuning. Also, there are three important elements: 1) the model for predicting a property from the molecular graph; 2) the way to train the model; 3) the method of data splitting. We introduce all of them in this section.

3.3.1 Model

In this research, Graphormer [25], a transformer-based model consisting of encoder layers and a linear layer, is adopted. We chose the model since it is currently near the state of the art in molecular property prediction.

3.3.2 Training

We first train the model by self-supervised pretraining with training and validation data. This step aims to train the encoder with one of three specific pretext tasks. After that, fine-tuning is conducted using labeled data by minimizing the Mean Absolute Error (MAE).
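As a rough illustration (not code from the paper), the two training stages can be sketched as follows, assuming the Graphormer encoder produces a 512-dimensional graph-level embedding (mocked here by a random tensor); the pretext head and its multi-label loss correspond to the graph-level pretext tasks described below, and fine-tuning uses an L1 (MAE) loss.

```python
import torch
import torch.nn as nn

# Hypothetical heads placed on top of a graph-level embedding.
pretext_head = nn.Linear(512, 167)        # e.g. a multi-label pretext task head
property_head = nn.Linear(512, 1)         # regression head for the property value

pretext_loss_fn = nn.BCEWithLogitsLoss()  # multi-label pretext tasks
finetune_loss_fn = nn.L1Loss()            # fine-tuning minimizes MAE

graph_emb = torch.randn(32, 512)                        # a batch of graph embeddings
pretext_label = torch.randint(0, 2, (32, 167)).float()  # pretext targets
homo_lumo_gap = torch.rand(32) * 10                     # property targets

# Stage 1: self-supervised pretraining of the encoder (+ pretext head).
pretrain_loss = pretext_loss_fn(pretext_head(graph_emb), pretext_label)
# Stage 2: discard the pretext head and fine-tune with the regression head.
finetune_loss = finetune_loss_fn(property_head(graph_emb).squeeze(-1), homo_lumo_gap)
```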

The three pretext tasks are 1) a node-level masking task (attribute prediction); 2) geometric structure prediction (structure prediction); 3) motif prediction (attribute + structure prediction). We chose them because they cover trends at various scales in molecular graphs. As node attributes are among the minimum units in a graph, the node-level masking task is expected to enable the encoder to learn micro tendencies. The geometric structure prediction can also provide connectivity information. Motif prediction is a combination of attribute and structure prediction. We provide detailed explanations for each of the three tasks.

Task1 : Node-level masking task. This task is a classification problem, predicting the node attributes of masked parts given an input with partially masked nodes. The nodes are randomly selected before training starts, and the same masks are consistently used during training. As for the masking rate, one node per molecular graph is masked. The model is trained to minimize the cross-entropy loss function.
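A minimal sketch of this masking objective is shown below (illustrative only, not the paper's implementation): one node per graph is replaced by a mask token, and a toy per-node encoder with a linear head is trained with cross-entropy to recover the original attribute. The encoder here is a placeholder for Graphormer.

```python
import torch
import torch.nn as nn

# Placeholder per-node encoder standing in for Graphormer; the real model also
# uses edge features, degree and spatial encodings.
class ToyNodeEncoder(nn.Module):
    def __init__(self, num_atom_types, dim=512):
        super().__init__()
        self.embed = nn.Embedding(num_atom_types + 1, dim)  # +1 for the [MASK] id
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, atom_ids):                 # (num_nodes,)
        return self.mlp(self.embed(atom_ids))    # (num_nodes, dim)

NUM_ATOM_TYPES, MASK_ID = 119, 119
encoder = ToyNodeEncoder(NUM_ATOM_TYPES)
head = nn.Linear(512, NUM_ATOM_TYPES)            # predicts the true atom type
loss_fn = nn.CrossEntropyLoss()

def masking_loss(atom_ids, masked_index):
    """Mask one fixed node per graph and predict its original attribute."""
    corrupted = atom_ids.clone()
    corrupted[masked_index] = MASK_ID            # hide the node attribute
    logits = head(encoder(corrupted))            # (num_nodes, NUM_ATOM_TYPES)
    return loss_fn(logits[masked_index].unsqueeze(0),
                   atom_ids[masked_index].unsqueeze(0))

# Example: a 5-atom molecule (atomic numbers as ids) with node 2 masked.
atoms = torch.tensor([6, 6, 8, 1, 1])
loss = masking_loss(atoms, masked_index=2)
loss.backward()
```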

Figure 3: An example of a structure vocabulary and a structure label. The vocabulary consists of 141 structures.
Figure 4: An example of a motif vocabulary and a motif label. The vocabulary consists of 167 motifs from MACCS keys [4].

Task2 : Geometric structure prediction. The task is to predict geometric structures only. In other words, the model judges the presence of connections between nodes without considering attributes. To achieve this, we enumerated the 141 connected graphs with 3 to 6 nodes. Then, we selected the 83 structures that at least one molecule includes and created the structure vocabulary illustrated in Figure 3. We encoded whether each of these structures is included in a molecule as an 83-dimensional label. The labels are made using the function subgraph_is_isomorphic in NetworkX [6]. Also, since this problem can be treated as a multi-label binary classification problem, we train the model to minimize the binary cross-entropy loss function.
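A minimal sketch of how such a structure label can be built with NetworkX is shown below, assuming structure_vocab is the list of template graphs described above (names are illustrative); attributes are ignored and only connectivity is matched via subgraph_is_isomorphic.

```python
import networkx as nx
from networkx.algorithms import isomorphism

def structure_label(mol_graph, structure_vocab):
    """Return a binary vector indicating which template structures appear
    in the molecular graph (connectivity only, attributes ignored)."""
    label = []
    for template in structure_vocab:
        matcher = isomorphism.GraphMatcher(mol_graph, template)
        label.append(int(matcher.subgraph_is_isomorphic()))
    return label

# Example: a triangle template is contained in a cyclopropane-like graph.
triangle = nx.cycle_graph(3)
molecule = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])
print(structure_label(molecule, [triangle]))   # -> [1]
```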

Task3 : Motif prediction. The last task is to predict the presence of motifs in a graph. Motifs are graph structures with important functional features based on domain knowledge. They contain not only structures but also attributes, and this task is a graph-level prediction. This research uses the molecular substructures determined by chemical domain knowledge called MACCS keys [4]. These motifs comprise 167 meaningful substructures, such as -COOH and -OH, as seen in Figure 4. As in Task 2, each molecule is checked for whether it contains these motifs, creating a 167-dimensional motif label. These operations are performed by the function GetMACCSKeysFingerprint provided by RDKit [10]. In addition, we use the binary cross-entropy loss as the loss function.
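A minimal sketch of constructing the 167-dimensional motif label with RDKit is shown below, using the GetMACCSKeysFingerprint function mentioned above; the example SMILES string is only illustrative.

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

def motif_label(smiles):
    """Return the 167-bit MACCS-keys vector as a list of 0/1 integers."""
    mol = Chem.MolFromSmiles(smiles)
    fp = rdMolDescriptors.GetMACCSKeysFingerprint(mol)   # ExplicitBitVect of length 167
    return [int(bit) for bit in fp.ToBitString()]

# Example: acetic acid sets the keys related to -COOH / -OH substructures.
label = motif_label("CC(=O)O")
print(len(label), sum(label))   # 167 bits in total, a handful set to 1
```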

3.3.3 Data Split

Cross-validation is the mainstream conventional data-splitting method. However, it is not suitable for generating validation data for evaluating extrapolation. Therefore, we use the forward holdout validation proposed by Xiong et al. (2020) [21] to prepare the validation data. The method splits the dataset into training data $\{(\boldsymbol{x}^{t}_{i},y^{t}_{i})\}_{i=1}^{n_{t}}$ and validation data $\{(\boldsymbol{x}^{v}_{i},y^{v}_{i})\}_{i=1}^{n_{v}}$ so that $\max\left(\{y^{t}_{i}\}_{i=1}^{n_{t}}\right)<\min\left(\{y^{v}_{i}\}_{i=1}^{n_{v}}\right)$.
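A minimal sketch of this split is shown below (illustrative, with an assumed validation ratio): sorting the samples by label and reserving the top fraction as validation data guarantees that every validation label exceeds every training label.

```python
import numpy as np

def forward_holdout_split(labels, val_ratio=2 / 92):
    """Split indices so that max(train labels) < min(validation labels)."""
    order = np.argsort(labels)
    n_val = max(1, int(len(labels) * val_ratio))
    train_idx, val_idx = order[:-n_val], order[-n_val:]
    return train_idx, val_idx

# Example with synthetic continuous labels.
labels = np.random.default_rng(0).normal(5.0, 1.5, size=1000)
train_idx, val_idx = forward_holdout_split(labels)
print(labels[train_idx].max(), labels[val_idx].min())   # the former is smaller
```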

3.4 Experiment Protocol

To validate the two hypotheses, we train models by the following 7 training methods.

No pretraining (baseline). As a baseline, models are trained by only fine-tuning with training data. Then their interpolation and extrapolation performance is evaluated by holdout validation and forward holdout validation.

Pretraining without validation data (task1, task2, task3). To check whether utilizing unlabeled data in self-supervised pretraining is effective or not, the models are first pretrained with only the training data, using task1, task2, or task3 as the self-supervised pretraining objective. After that, they are fine-tuned on the training data. Their extrapolation performance is evaluated by forward holdout validation.

Pretraining with validation data (task1, task2, task3). As opposed to the above methods, in this training method we use both training and validation data for pretraining. The rest of the procedure is the same as above.

4 Evaluation

Firstly, we explain the experimental settings used throughout the experiments. We use PCQM4Mv2, a molecular property prediction dataset provided by OGB-LSC [9]. It provides molecular graphs and HOMO-LUMO gaps, and the model has to predict the gap values. The model is Graphormer; we set the number of encoder layers to 12 and the dimension of the feature vectors to 512, and adopted AdamW provided by PyTorch as the optimizer. As for the experimental environment, all experiments were performed using 4 GPUs, and the model is trained for 30 epochs of pretraining and 80 epochs of fine-tuning. Evaluation metrics are averaged over three runs for each method.

As an overview of the experiments, there are mainly three important findings as follows.

  • Graphormer trained by only fine-tuning cannot extrapolate well. In particular, it cannot output values above a certain property value, which is evidence that it cannot learn nonlinearity.

  • Self-supervised pretraining slightly improves the ability to predict the ranks of property values, although the models cannot predict the property values exactly.

  • The impact of utilizing unlabeled data can be seen from the differences in rank correlations between pretraining with and without unlabeled data.

4.1 Analysis of MAE

We evaluated all of the performances by Mean Absolute Error (MAE). Firstly, we compare the results of interpolation and extrapolation in the baseline. The average of the minimum MAEs in interpolation reaches 0.091, whereas that in extrapolation is 0.692. Additionally, the minimums in extrapolation are recorded between the 1st and 10th epochs. This means that, in extrapolation, the model is considered to be poorly trained.

To investigate the issue in more detail, we examine Figure 5, which shows the relationship between the predicted values and the labels. First of all, in interpolation, the validation data is plotted along the black dotted line. On the other hand, in extrapolation, the validation data are far from the line, and the predicted values hardly exceed a certain value. One of the reasons for this result may be that there are no data with labels greater than about 8 in the training data. However, since a large number of predicted values exist around 8, the model seems to be able to predict the tendency that the validation data is relatively high within the dataset.

Figure 6 shows the mean absolute error in extrapolation for the 7 training methods. Comparing the baseline and pretraining with validation data, we find that the MAEs of task1, 2, and 3 are lower than that of the baseline. Compared to the baseline, the values decreased by 4.91% for task1, 3.47% for task2, and 9.68% for task3. However, these MAEs are much larger than the interpolation MAEs, and the extrapolation performances are far behind the interpolation.

When focusing on the result of pretraining with and without unlabeled data, tasks 1 and 2 with unlabeled data are almost 3% better than without them. In contrast, the MAE of task 3 increased by 0.8% when unlabeled data is used in pretraining.

However, all of the extrapolation MAEs are significantly worse than those in interpolation. This means that self-supervised pretraining cannot help models extrapolate exact values. Therefore, in terms of exact prediction, hypotheses H1 and H2 are false.

Figure 5: Relationships between the true and predicted labels when training by the baseline ((a) interpolation, (b) extrapolation). The dotted black line is ideal.

Figure 6: The result of Mean Absolute Error in extrapolation.

Figure 7: The relationships between the true and predicted labels in extrapolation ((a) baseline, (b) task1, (c) task2, (d) task3). Tasks 1, 2, and 3 are used in self-supervised learning with both training and validation data. The dotted black line is ideal.

We also illustrate in Figure 7 the relationship between labels and predicted values. In these four figures, the models could hardly predict values higher than 8. The difference between them is that the validation data in the baseline figure is slightly tilted to the lower right, while in tasks 1 and 2 it is almost parallel, and in task 3 it is slightly tilted to the upper right. This fact suggests that pretraining may help the model learn relative trends in the validation data. From this point of view, we considered that MAE alone is insufficient as an evaluation metric for extrapolation in this study, and decided to conduct further analysis.

4.2 Analysis of Rank

To analyze the relative tendency of the predictions and find out the difference between pretraining and non-pretraining methods more clearly, we introduce the idea of ranking.

This is an index that indicates whether the ranking in descending order of physical property value is correctly predicted in the validation data or not. In addition, in the field of materials chemistry, understanding this ranking also has a positive effect on improving the efficiency of material development, so we decided to adopt this index. Furthermore, ranking is a discrete value, and it is expected that the plot will be more sparse than Figure 7, making it easier to see the difference from the figure.

To evaluate it, the rankings of the labels and the predicted values are calculated in advance, and the correlation coefficient between them is used as the evaluation metric. We choose this metric because, since the number of validation data points is about 70,000, it is suitable for grasping macro trends related to rankings.
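Concretely, this corresponds to computing a rank correlation such as Spearman's coefficient between the true and predicted values on the validation set; a minimal sketch with synthetic predictions is shown below (the numbers are illustrative, not results from the paper).

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
y_true = rng.normal(8.5, 0.5, size=70_000)                    # synthetic validation labels
# Synthetic predictions: saturated near 8 but weakly ordered like the labels.
y_pred = 8.0 + 0.1 * (y_true - 8.5) + rng.normal(0, 0.2, size=70_000)

# spearmanr ranks both arrays and computes the correlation of the ranks.
rho, _ = spearmanr(y_true, y_pred)
print(f"rank correlation: {rho:.3f}")
```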

Figure 8: The result of rank correlation coefficients when the MAEs are the best in extrapolation.
Figure 9: Behaviors of ranking correlation coefficients in extrapolation. Task1, 2, and 3 are used in self-supervised learning with both training and validation data.

Figure 8 illustrates the ranking correlation coefficients in extrapolation for the 7 training methods. What we can see from this figure is that the correlation coefficients for task1, 2, and 3 when using validation data are much closer to 1.0 than the baseline. Specifically, the value for task3 is 0.516, which is the best result. On the other hand, the baseline is 0.255, which is the lowest value. Correlation coefficients in the range from 0.4 to 0.7 are considered to have some degree of positive correlation, while coefficients between -0.2 and 0.2 indicate almost no correlation. It can be said that self-supervised pretraining enhanced the ability to judge whether certain data is greater than other data in the validation set.

Then, to check hypothesis H2, we contrast pretraining with and without validation data. The result is that the metrics in all of the tasks decrease from almost 0.5 to 0.4 when removing validation data from self-supervised pretraining. Therefore, it can be concluded that validation data has a positive influence on extrapolation performance in terms of predicting the relative tendency.

Also, we calculated the correlation coefficients for each epoch and obtained Figure 9, which shows the change through training. From this figure, we can see that by 10 epochs the correlation coefficient values fall into negative territory for most methods. Especially in the first 5 epochs, the correlation coefficient is very high, and the methods using pretraining have values of 0.4 or higher, but it drops sharply after that. This phenomenon might be because, through training, a model that becomes too well-fitted to the training data fails to predict label values higher than those included in the training data. This issue is a significant limitation of our study and should be analyzed further.

5 Conclusion

In this paper, we have proposed an experimental framework for evaluating the impact of self-supervised pretraining on extrapolation by dividing distributions based on labels, and examined the hypotheses empirically using a molecular property prediction benchmark. Firstly, we organized existing definitions of extrapolation and pointed out the necessity of deciding distributions by label for material development. In the experiments, three methods with pretraining were demonstrated to be superior to the baseline method without pretraining by evaluating MAE and ranking. Although pretraining with both labeled and unlabeled data could not improve the performance of extrapolating exact values, it could help to learn the relative tendency of unseen distributions. This extrapolation ability could contribute to improving the efficiency of searching for new materials.

6 Limitation

So far, we have explained the contributions of this work, but there are some limitations that should be addressed in the future, as follows.

Impacts of pretraining As explained with Figure 9, extrapolation performance improved in the early epochs, after which the model seemed to overfit to the training distribution. Therefore, the impact of pretraining should be made to last as long as possible.

Imbalance of data split In our study, the dataset is split by forward holdout validation for extrapolation. Since this method does not consider the balance between the distributions of training and validation data, there are still issues related to how to divide the dataset for validation.

Lack of downstream tasks As we have conducted experiments on only one benchmark, OGB-LSC, it is possible that self-supervised pretraining is not effective on other benchmarks. Therefore, future work has to verify the hypotheses on various downstream tasks.

References

  • [1] Dyah Adila and Dongyeop Kang. Understanding out-of-distribution: A perspective of data dynamics. In Melanie F Pradier, Aaron Schein, Stephanie Hyland, Francisco J R Ruiz, and Jessica Z Forde, editors, Proceedings on “I (Still) Can’t Believe It’s Not Better!” at NeurIPS 2021 Workshops, volume 163 of Proceedings of Machine Learning Research, pages 1–8. PMLR, December 2022.
  • [2] Alvin Chan, Ali Madani, Ben Krause, and Nikhil Naik. Deep extrapolation for attribute-enhanced generation. Advances in Neural Information Processing Systems, 34:14084–14096, 2021.
  • [3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [4] Joseph L Durant, Burton A Leland, Douglas R Henry, and James G Nourse. Reoptimization of mdl keys for use in drug discovery. Journal of chemical information and computer sciences, 42(6):1273–1280, 2002.
  • [5] Jinjia Feng, Zhen Wang, Yaliang Li, Bolin Ding, Zhewei Wei, and Hongteng Xu. Mgmae: Molecular representation learning by reconstructing heterogeneous graphs with a high mask ratio. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 509–519, 2022.
  • [6] Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using networkx. Technical report, Los Alamos National Lab.(LANL), Los Alamos, NM (United States), 2008.
  • [7] Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. Pretrained transformers improve out-of-distribution robustness. arXiv preprint arXiv:2004.06100, 2020.
  • [8] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervised learning can improve model robustness and uncertainty. Advances in neural information processing systems, 32, 2019.
  • [9] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.
  • [10] Greg Landrum. Rdkit documentation, 2013.
  • [11] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5400–5409, 2018.
  • [12] Haoyang Li, Xin Wang, Ziwei Zhang, and Wenwu Zhu. Ood-gnn: Out-of-distribution generalized graph neural network. IEEE Transactions on Knowledge and Data Engineering, 2022.
  • [13] Lihang Liu, Donglong He, Xiaomin Fang, Shanzhuo Zhang, Fan Wang, Jingzhou He, and Hua Wu. Gem-2: Next generation molecular property prediction network with many-body and full-range interaction modeling. arXiv preprint arXiv:2208.05863, 2022.
  • [14] Dominic Masters, Josef Dean, Kerstin Klaser, Zhiyi Li, Sam Maddrell-Mander, Adam Sanders, Hatem Helal, Deniz Beker, Ladislav Rampášek, and Dominique Beaini. Gps++: An optimised hybrid mpnn/transformer for molecular property prediction. arXiv preprint arXiv:2212.02229, 2022.
  • [15] Sina Mohseni, Mandar Pitale, JBS Yadawa, and Zhangyang Wang. Self-supervised learning for generalizable out-of-distribution detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):5216–5223, 2020.
  • [16] Gyoung S Na and Chanyoung Park. Nonlinearity encoding for extrapolation of neural networks. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1284–1294, 2022.
  • [17] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1406–1415, 2019.
  • [18] Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
  • [19] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems, 33:12559–12571, 2020.
  • [20] Qitian Wu, Hengrui Zhang, Junchi Yan, and David Wipf. Handling distribution shifts on graphs: An invariance perspective. arXiv preprint arXiv:2202.02466, 2022.
  • [21] Zheng Xiong, Yuxin Cui, Zhonghao Liu, Yong Zhao, Ming Hu, and Jianjun Hu. Evaluating explorative prediction power of machine learning algorithms for materials discovery using k-fold forward cross-validation. Computational Materials Science, 171:109203, 2020.
  • [22] Keyulu Xu, Mozhi Zhang, Jingling Li, Simon S Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. How neural networks extrapolate: from feedforward to graph neural networks. In International Conference on Learning Representations (ICLR), 2021.
  • [23] Nianzu Yang, Kaipeng Zeng, Qitian Wu, Xiaosong Jia, and Junchi Yan. Learning substructure invariance for out-of-distribution molecular representations. In Advances in Neural Information Processing Systems, 2022.
  • [24] Yuzhe Yang, Kaiwen Zha, Yingcong Chen, Hao Wang, and Dina Katabi. Delving into deep imbalanced regression. In International Conference on Machine Learning, pages 11842–11851. PMLR, 2021.
  • [25] Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems, 34:28877–28888, 2021.
  • [26] Junchi Yu, Jian Liang, and Ran He. Finding diverse and predictable subgraphs for graph domain generalization. arXiv preprint arXiv:2206.09345, 2022.
  • [27] Zeyang Zhang, Xin Wang, Ziwei Zhang, Haoyang Li, Zhou Qin, and Wenwu Zhu. Dynamic graph neural networks under spatio-temporal distribution shift. In Advances in Neural Information Processing Systems, 2022.

Appendix A Detail of Dataset

In our study, we utilized the PCQM4Mv2 dataset, proposed as a benchmark for molecular property prediction in OGB-LSC [9]. This dataset incorporates the two-dimensional graph structure and three-dimensional coordinates of molecules as inputs; however, we exclusively focused on the former.

This task involves predicting the HOMO-LUMO gap, and the distribution of its values is illustrated in Figure 10. In this study, we exclusively utilized the training and validation data for our analysis. The dataset comprises a total of 3,746,619 molecules, which are divided into train/validation/test-dev/test-challenge with a ratio of 90/2/4/4. For our study, we selected forward holdout validation to split the data, with the left side of the distribution representing the training data and the right side representing the validation data, maintaining a ratio of 90/2.

Figure 10: Distribution of HOMO-LUMO gap in train and validation data. Although there are data with an energy gap over 15eV, the figure is limited between 0eV and 15eV.

Appendix B Implementation Detail

B.1 Model

This research uses Graphormer as the model and re-implements it using PyTorch for our experiment. It first embeds node and edge features from the graph structure, with dimensions set to 512 and 128, respectively. Additionally, the encoder, which includes an attention mechanism, is configured with 12 layers, and the dropout rate is set at 10%.

B.2 Training

As described in the experiment design section, we adopted a self-supervised pretraining approach as our training method. The details of this self-supervised pretraining are outlined in Section 3.3.2. Our training methodology consists of seven approaches, including one approach without pretraining, three approaches with pretraining not using validation data, and three approaches with pretraining using validation data. The pretraining phase runs for 30 epochs, followed by fine-tuning for 80 epochs. The initial learning rate is 2e-4, controlled by the get_polynomial_decay_schedule_with_warmup function provided by Hugging Face Transformers (https://github.com/huggingface/transformers/blob/main/src/transformers/optimization.py), and AdamW is used as the optimizer.
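A minimal sketch of this optimizer and schedule setup is shown below; the model, warmup step count, and total step count are placeholders, since they are not specified in the text.

```python
import torch
from transformers import get_polynomial_decay_schedule_with_warmup

model = torch.nn.Linear(512, 1)    # placeholder for the Graphormer model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,        # assumed; not specified in the text
    num_training_steps=100_000,    # assumed; not specified in the text
)

# Inside the training loop, the schedule is stepped after each optimizer update.
for step in range(10):
    optimizer.step()
    scheduler.step()
```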