
Towards understanding evolution of science through language model series

Junjie Dong, Zhuoqi Lyu, Qing Ke J. Dong, Z. Lyu, Q. Ke are with Department of Data Science, College of Computing, City University of Hong Kong, Hong Kong, China.
E-mail: [email protected]
Abstract

We introduce AnnualBERT, a series of language models designed specifically to capture the temporal evolution of scientific text. Deviating from the prevailing paradigms of subword tokenization and “one model to rule them all”, AnnualBERT adopts whole words as tokens and is composed of a base RoBERTa model pretrained from scratch on the full text of 1.7 million arXiv papers published until 2008 and a collection of models progressively trained on arXiv papers on an annual basis. We demonstrate the effectiveness of AnnualBERT models by showing that they not only achieve comparable performance on standard tasks but also achieve state-of-the-art performance on domain-specific NLP tasks as well as link prediction tasks in the arXiv citation network. We then utilize probing tasks to quantify the models’ behavior in terms of representation learning and forgetting as time progresses. Our approach enables the pretrained models not only to improve performance on scientific text processing tasks but also to provide insights into the development of scientific discourse over time. The series of models is available at https://huggingface.co/jd445/AnnualBERTs.

I Introduction

Natural language processing (NLP) has been experiencing a new paradigm in which Transformer-based language models pretrained on massive text corpora have significantly improved performance on downstream NLP tasks [1]. In the scientific domain, which is our focus, both the encoder and decoder components of Transformers have been extensively employed to develop scientific language models pretrained on the full text of papers [2].

When developing language models for a target domain, prior studies have suggested a number of pretraining methods. The first is continual training, which takes a model already pretrained on a general-domain corpus and continues the pretraining process on the target domain corpus (i.e., domain adaptation, mixed-domain pretraining) [3]. BioBERT is one example adapted from the original BERT to the biomedical domain using the PubMed corpus [4]. The second approach is training from scratch directly on the target domain corpus. Recent development of BERT-like models, including both multi-domain [5, 6] and domain-specific ones [7, 8, 9], adopts this strategy, each aiming to better capture the linguistic and semantic features of its respective field. Training from scratch has been shown to outperform continual training, in part because it constructs a vocabulary encompassing more in-domain tokens, leading to more effective learning from the domain corpus. Continual training, by contrast, uses the same vocabulary as the original model and consequently fragments complex, domain-relevant words into less meaningful pieces. For example, “chloramphenicol” is treated as one token by BiomedBERT (PubMedBERT) but six tokens by BioBERT (“ch-lor-amp-hen-ico-l”).

Here we propose a novel pretraining strategy that consists of two stages: domain-specific pretraining from scratch, followed by progressive continual training on corpora within the same domain organized by year (Fig. 1). In addition to recognizing the importance of domain-specific pretraining, our method is designed to capture the temporal evolution of language use. Temporal change is one of the most salient features of real-world language data: scientific papers are published in chronological order and collectively represent the evolving landscape of scientific knowledge during its development; similarly, news articles and social media posts are produced over time, and their collection shapes the global news agenda. However, existing efforts, regardless of the training methods, largely ignore the temporal aspect of language data and instead are directed towards developing “one model to rule them all”, that is, one model that can excel in various downstream scientific NLP tasks like named entity recognition.

Figure 1: Workflow for developing our AnnualBERT models pretrained on arXiv papers. We first train from scratch a RoBERTa model on papers published until 2008, denoted as $\mathcal{M}_{\text{base}}$, and then for each subsequent year $t$ from 2009, we use papers published in that year for continual training of $\mathcal{M}_{t-1}$ to obtain $\mathcal{M}_{t}$. During the training process, we use whole-word tokens, represented as $\text{WT}_{1}, \text{WT}_{2}, \ldots, \text{WT}_{n}$.

By comparison, our proposal enables developing a series of language models on scientific documents, each of which represents one historical period and the collection of which allows us to understand the evolution of science. At a broader level, our vision is that a language model itself can be regarded as a highly knowledgeable researcher with extensive reading capabilities across all the scientific fields, who updates its inner workings (i.e., model parameters) upon reading all the papers published in a year. By probing how different versions of the model perform on the same task and by mining how model parameters change, we may be able to uncover how science evolves temporally. From another perspective, language models serve to compress documents over a certain period, thereby reflecting various characteristics of language use during that time, such as vocabulary, semantics, and topics. In this regard, our work is related to some recent studies that identify the temporal misalignment phenomenon [10, 11] and the remedies to derive customized models for different periods [11, 12, 13]. However, the language models used in the first place were pretrained on temporally mixed corpora, which confounds the derivation.

To demonstrate the utility of our method, we present AnnualBERT as a running example. It is a series of language models developed by first pretraining RoBERTa on a large corpus of full-text of arXiv papers published until 2008 and then continually training on the corpora formed by arXiv papers in individual later years. We conduct extensive experimentation to demonstrate the validity and utility of the AnnualBERT models. Specifically, the main contributions of our study are summarized as follows.

  • We propose a novel training method that involves organizing scientific text by year and continually training our language model accordingly. We release the pretrained AnnualBERT, a series of language models specifically tailored for scientific text.

  • We evaluate AnnualBERT models on a suite of standard NLP tasks in the scientific domain, showing performances comparable to similar models. Moreover, we formulate several domain-specific tasks and demonstrate that AnnualBERT performs better on these in-domain tasks.

  • We showcase the utility of AnnualBERT models by performing systematic evaluations on link prediction tasks in the arXiv citation network. Under both static and temporal settings, AnnualBERT achieves uniquely better performances, indicating its effectiveness in generating more informative embeddings for the scientific domain.

  • Through model weight visualizations and a series of probe tasks, we quantify the extent of representation learning and forgetting of AnnualBERT models and find that the learning and forgetting processes are highly task-specific.

II Related work

II-A Scientific language models

The Transformers architecture, along with the growing accessibility of full-text corpora of scientific papers, has heightened the significance of employing NLP techniques for document understanding and knowledge extraction [2]. Similar to general domains, transfer learning through pretraining models can leverage information learned in a self-supervised fashion to aid in downstream scientific NLP tasks where datasets are scarce, leading to the introduction of numerous scientific language models (SciLMs) for a number of fields, including biomedicine [6, 4, 8, 14, 15, 16], chemistry [17], material science [9], and social science [18].

TABLE I: Different domain-specific language models for scientific text.
Model Domain Corpus Tokenization Vocab. Pretraining Corpus Size
BERT General Wiki + Books WordPiece 28996 Scratch 3.3B words / 16GB
BioBERT Bio PubMed abstract WordPiece 28996 Continual 4.5B words
SciBERT Multi Semantic Scholar full text WordPiece 31090 Scratch 3.2B words
BiomedBERT Bio PubMed abstract WordPiece 28895 Scratch 3.2B words / 21GB
ScholarBERT Multi Public.Resource.org full text WordPiece 50000 Scratch 221B words
AnnualBERT2008 Multi arXiv full text Whole word 53072 Scratch 1.2B words / 12GB
AnnualBERT2020 Multi arXiv full text Whole word 53072 Scratch 2.7B words / 41GB

Table I summarizes some notable SciLMs that employ the BERT architecture. BioBERT [4] resulted from continual pretraining of the original BERT model on PubMed abstracts and showed better performance on biomedical NLP tasks. SciBERT [6] was pretrained from scratch on the full text of 1.1M biomedical and computer science papers in the Semantic Scholar corpus [19]. Evaluations on multiple tasks showed that it outperformed the original BERT and performed comparably to BioBERT on biomedical tasks. BiomedBERT [7] was also pretrained from scratch, on 3.2B tokens in 14M PubMed abstracts. The work highlights the necessity of pretraining from scratch, as it yields a domain-specific vocabulary that is useful for downstream tasks. In their case, the vocabulary contains more biomedical terms like “acetyltransferase”, in contrast to the vocabulary of BioBERT, which consists of terms common to the general domain. OAG-BERT [5] was trained on the AMiner and PubMed corpora. It is based on the BERT architecture but fine-tuned for academic-graph-related tasks such as author name disambiguation, paper recommendation, and citation prediction. ScholarBERT [20] was pretrained from scratch on perhaps the largest corpus (221B tokens in 75M journal articles provided by public.resource.org). One observation from this work is that performance on downstream scientific tasks saturates when increasing training data, model size, or training time.

In addition to scientific text, other modalities of scientific data have been used for pretraining language models, including math equations [21], chemicals, proteins, etc. Notably, SMILES strings of chemicals have been utilized to pretrain BERT-like models, yielding several chemical language models [22, 23]. Similarly, protein language models are trained using protein sequences [24].

II-B Temporality of language models

As we shall pretrain SciLMs on temporally organized corpora, our work is broadly related to training models adaptive to diverse domains (e.g., [25], [26]) and specifically related to the temporal aspect of language models. Some studies found that language models implicitly encode various kinds of external knowledge, including space and time [27]. However, an increasing number of studies have reported the temporal misalignment phenomenon: models trained on text from one period can suffer performance degradation when tested on text from another. For example, [10] showed the effect of temporal drift on NER tasks in temporally diverse tweets. Likewise, [11] quantified performance degradation due to temporal misalignment across domains and tasks. These studies corroborate the importance of customizing models to different time periods, which is the practice we adopt in this work. To mitigate the effect of temporal misalignment, existing studies have suggested a few methods, including continuous pretraining [11, 12], adding year flags to training instances [28], and discarding outdated facts [29]. These methods were proposed to improve performance on downstream tasks, but the resulting improvement turned out to be limited [11].

Another strategy aims at achieving a desired outcome by directly editing a model itself, which involves altering a pretrained model in the weight space to enhance its performance on specific tasks [30, 13]. Current approaches to model editing focus on task arithmetic [30] and weight interpolation [31, 32]. In the context of addressing temporal misalignment, model editing through time vectors, a special case of task vectors, was used to obtain an updated pretrained model for a specific period [13]. However, a critical limitation of time-vector-based model editing is that the language model used in the first place was pretrained on a temporally mixed corpus, which confounds the derivation of time vectors. Moreover, using such a language model in the scientific domain may be especially problematic, as the corpus size grows drastically due to the exponential growth of science publishing, which means that later corpora would overshadow earlier ones in determining model weights. By contrast, instead of using model editing, we obtain an updated model through continual training on a series of temporally organized corpora.

In this regard, our pretraining strategy can be considered a special case of domain adaptation [3], where the corpus of each period is treated as a “domain”. A well-studied consequence of domain adaptation is representation forgetting: as a language model adapts to new domains, it forgets some knowledge learned from previous ones, leading to performance degradation [33]. However, in our work, we find that representation forgetting is highly task dependent: we identify two rather similar tasks, for one of which representation forgetting does manifest, while for the other the opposite is observed.

II-C Computational analysis of evolution of science

Unlike the majority of existing SciLMs, which are concerned with downstream NLP tasks, our motivation for pretraining a series of SciLMs is to use them to understand quantitatively how science evolves over time through the lens of scientific text. Thus, our work is related to inquiries into the evolution of science using diverse types of data, including full text and metadata like authorship and citations. [34] proposed dynamic topic models and applied them to analyze the temporal evolution of topics in papers published in the journal Science. Analyzing authors’ transitions between different topics over time, [35] identified distinctive stages in the history of computational linguistics. [36] developed a classifier to assign rhetorical functions (e.g., background, method, conclusion, etc.) to sentences in abstracts and demonstrated the ability to predict the rise and fall of scientific topics by tracking the trajectories of rhetorical functions appearing in different topics over time. [37] developed a classifier to label citation functions (e.g., background, motivation, etc.) and applied it to NLP papers to reveal the evolution of this field.

III Developing AnnualBERT

III-A Preparing corpus

As the number of scientific papers has surged, researchers have increasingly opted to post their manuscripts on preprint platforms to share them with the scientific community. arXiv is perhaps the first such platform; it was initially for physics and over time expanded to many other fields. It has gained significant popularity in recent years and has archived nearly 2.4 million papers.

(a) Yearly number of papers by category
(b) Yearly number of sentences and tokens after cleaning
Figure 2: Summary statistics of our arXiv corpus.

We download from the arXiv website the source files of all the 1,752,637 papers posted between 1990 and 2020. Fig. 2a shows the field decomposition of papers by year, indicating that physics constantly dominates the corpus. In total, it accounts for 61.3% of the articles, followed by mathematics (20.2%). Computer science publications, although they represent only 14.3%, have increased exponentially in recent years. Papers from the other fields—Quantitative Biology, Quantitative Finance, Statistics, Electrical Engineering and Systems Science, and Economics—all together constitute less than 5% of the dataset.

To pretrain our models, we first need to prepare a full-text corpus extracted from the downloaded files, which are overwhelmingly LaTeX and PDF files. For LaTeX files, we use the LaTeXML tool [38] to convert them into HTML files, which facilitates our follow-up processing, as there are myriad author-defined commands and symbols in the LaTeX files. We then remove contents like figures, tables, and equations as well as metadata like author information, acknowledgments, and pagination, by stripping away the HTML elements corresponding to these contents. Next, we replace certain markups with special tokens, e.g., [EQU] for inline equations, [CITE] for citations, etc., which improves the clarity and utility of the text for training. As for PDF documents, we first use PyMuPDF to extract plain text and employ neattext to further clean elements like tables, emojis, and URLs, considering that PDF files often lack a clear document structure that is conducive to cleaning. We then filter out sentences with fewer than 20 characters and sentences that contain more than 40% non-essential elements like punctuation, special characters, stopwords, emojis, and URLs. The code for data processing and training can be found at https://github.com/jd445/TrainAnuualBERT.
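As a rough illustration of the PDF branch of this pipeline, the sketch below extracts plain text with PyMuPDF and applies the two sentence filters described above; the stopword list, the period-based sentence splitting, and the file name are illustrative stand-ins rather than the exact rules in our released code.

```python
import re
import string

import fitz  # PyMuPDF

URL_RE = re.compile(r"https?://\S+")
# Illustrative stopword list; the actual pipeline uses a fuller list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is", "are"}

def extract_plain_text(pdf_path):
    """Concatenate the plain text of every page in a PDF."""
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    doc.close()
    return text

def non_essential_ratio(sentence):
    """Fraction of tokens that are stopwords, pure punctuation, or URLs."""
    tokens = sentence.split()
    if not tokens:
        return 1.0
    noisy = sum(
        1 for tok in tokens
        if tok.lower() in STOPWORDS
        or all(ch in string.punctuation for ch in tok)
        or URL_RE.match(tok)
    )
    return noisy / len(tokens)

def keep_sentence(sentence):
    """Apply the length and noise filters described above."""
    return len(sentence) >= 20 and non_essential_ratio(sentence) <= 0.4

text = extract_plain_text("example_paper.pdf")  # hypothetical file name
sentences = [s.strip() for s in text.split(".") if keep_sentence(s.strip())]
```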

After extracting text from the LaTeX and PDF files, we use NLTK [39] for sentence segmentation and filter out sentences with fewer than three words. This completes our corpus preparation, and Fig. 2b presents the total number of sentences over time.

Next, we build our customized vocabulary. Existing language models use different tokenization methods to build vocabularies. The original BERT, for example, uses WordPiece, whereas RoBERTa uses BPE. Both tokenization methods treat subwords as tokens, which has the advantage of representing unseen words by combining existing tokens in the vocabulary. Subword tokenization, however, is not suitable for our purpose of examining the temporal evolution of scientific discourse at the word level. We thus utilize whole words as tokens, an approach occasionally taken in previous studies [40]. Particularly, we employ spaCy to tokenize all the abstracts from 1990 to 2020 into words and keep those with frequency $\geq 50$, resulting in a vocabulary of about 53,000 words. We then add special tokens including [CITE], [EQU], [FIG], [REF], and [SEC] to the vocabulary for ease of markup representation. This tailored approach ensures that our vocabulary is appropriate for interpreting terminology pertaining to distinct fields of study.
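A minimal sketch of this vocabulary construction step is given below, assuming a blank spaCy English pipeline is sufficient for tokenization; the exact spaCy model and preprocessing details in our pipeline may differ.

```python
from collections import Counter

import spacy

nlp = spacy.blank("en")  # a bare tokenizer; no tagger or parser needed here

def build_vocabulary(abstracts, min_freq=50):
    """Count lowercased whole-word tokens and keep the frequent ones."""
    counts = Counter()
    for doc in nlp.pipe(abstracts, batch_size=256):
        counts.update(tok.text.lower() for tok in doc if not tok.is_space)
    vocab = sorted(w for w, c in counts.items() if c >= min_freq)
    # Special tokens standing in for stripped markup.
    return ["[CITE]", "[EQU]", "[FIG]", "[REF]", "[SEC]"] + vocab

# vocab = build_vocabulary(all_abstracts_1990_2020)  # roughly 53,000 words expected
```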

Fig. 2b presents the total number of tokens over time. Fig. 3 presents the Jaccard similarity matrix between the vocabularies of different BERT models (see § II-A for their details). We note that the similarities between the BERT models and the other models (except BioBERT) are generally around 20%. BioBERT and BERT_case have a similarity of one, since BioBERT adopts the vocabulary of BERT_case. The similarities among SciBERT, BiomedBERT, and ScholarBERT are approximately 30%, higher than their similarities with BERT. This reflects differing vocabulary distributions between general and scientific domains. Our AnnualBERT exhibits lower similarities with the other models, which is attributed to its unique whole-word tokenization strategy.

Figure 3: Jaccard similarity matrix between vocabularies.

III-B Pretraining AnnualBERT

The original BERT model is based on a bidirectional Transformer architecture and pretrained through two objectives: Masked Language Model (MLM) and Next Sentence Prediction (NSP). In MLM, 15% of input tokens are masked, among which 80% are replaced with the [MASK] token, 10% are replaced with randomly selected tokens from the vocabulary, and the remaining 10% stay unchanged. The objective is to predict masked tokens using cross-entropy loss. NSP predicts whether or not two masked sentences follow each other in the original text. Built on BERT, the RoBERTa model [41] improves the pretraining by eliminating the NSP objective, which has been found less useful, and by using dynamic masking, where masked tokens are changed for each epoch, as opposed to the fixed masking used in BERT, where masked tokens are fixed across epochs.

Here we adopt the RoBERTa implementation to pretrain from scratch our base AnnualBERT model, $\mathcal{M}_{2008}$, on the corpus formed by all the papers published until 2008, which has about 1.2 billion words. The architecture of our AnnualBERT matches the BERT$_{\text{base}}$ model, namely 12 hidden layers, 12 attention heads, and 768 hidden dimensions, totaling 110M parameters. During pretraining, we use the standard BERT tokenization approach, but incorporating our own customized arXiv vocabulary. We set the vocabulary size to 53,100, the max_len parameter to 512, and lowercase the corpus and the vocabulary. The pretraining process took around 108 hours to finish two epochs on a computing platform with 8 Nvidia A40 GPUs.

With $\mathcal{M}_{2008}$, we perform, for each year $t>2008$, continual training of $\mathcal{M}_{t-1}$ on the corpus formed by papers published in $t$ for one epoch, resulting in $\mathcal{M}_{t}$. For instance, continual training of $\mathcal{M}_{2008}$ on the corpus of papers published in 2009 yields $\mathcal{M}_{2009}$, which is then used to form $\mathcal{M}_{2010}$ using papers in 2010, in general following:

\mathcal{M}_{t}=\text{Continual-train}\left(\mathcal{M}_{t-1},\ \text{corpus}_{t}\right)\,. (1)

Through this approach, we allow each instance of our AnnualBERT to capture evolving language and terminology specific to that year’s publications, so that we can represent the knowledge of a specific year in a more sophisticated way. The corpora used for pretraining our AnnualBERT grow to 2.7 billion words by the time we extend the model to AnnualBERT2020.
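The yearly update of Eq. (1) can be sketched with the Hugging Face Trainer as follows; checkpoint names and the batch size are illustrative, and year_dataset is assumed to be an already tokenized dataset built with our whole-word tokenizer.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

def continual_train(prev_model_dir, year_dataset, tokenizer, out_dir):
    """One step of Eq. (1): adapt M_{t-1} to the corpus of year t for one epoch."""
    model = RobertaForMaskedLM.from_pretrained(prev_model_dir)
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )
    args = TrainingArguments(
        output_dir=out_dir,
        num_train_epochs=1,                # one epoch per year
        per_device_train_batch_size=32,    # illustrative value
    )
    Trainer(model=model, args=args, data_collator=collator,
            train_dataset=year_dataset).train()
    model.save_pretrained(out_dir)
    return out_dir

# model_dir = "annualbert-2008"  # hypothetical checkpoint name
# for year in range(2009, 2021):
#     model_dir = continual_train(model_dir, corpus_by_year[year],
#                                 tokenizer, f"annualbert-{year}")
```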

IV Performance on downstream NLP tasks

IV-A Task and dataset descriptions

To validate our AnnualBERT models, we experiment on 3 downstream NLP tasks: named entity recognition (NER), text classification (CLS), and relation classification (REL). In designing those experiments, we consider tasks using benchmark datasets and tasks designed specifically for the arXiv domain. For benchmark datasets, many existing ones for scientific text are from biomedical domains. However, texts from those fields are largely absent from our training corpus. Therefore, in our experiment we do not include biomedical related benchmark datasets. Instead, we collect and categorize datasets from computer science and multi-subject domains, given that the arXiv corpus used in our work tends to come from science and engineering disciplines. Particularly, for NER tasks, we use the following 3 datasets.

  1. The SciERC dataset [42] annotates entities, relations, and coreference clusters in 500 abstracts from 12 AI conference and workshop proceedings indexed in the Semantic Scholar Corpus. It includes a total of 8,089 distinct named entities.

  2. The ScienceExam dataset [43] comprises entities from science exam questions aimed at students from 3rd to 9th grade, totaling 133,000 entities.

  3. The ScienceIE dataset [44] provides 9,946 entities appearing in 500 scientific paragraphs.

For CLS tasks, we use the following 3 standard benchmark datasets.

  1. The SciERC dataset [42] includes 4,716 annotated sentences categorized for classification according to relations such as USED-FOR, FEATURE-OF, PART-OF, etc.

  2. The ACL-ARC dataset [45] comprises 1,941 citation instances from 186 papers in the ACL Anthology Reference Corpus, with each instance annotated by domain experts with categories such as Future, Background, Motivation, etc.

  3. The SciCite dataset [46] encompasses 11,020 citation instances from 6,627 scholarly papers, with each citation classified into categories such as Background, Method, and Result Comparison.

While the 3 benchmark datasets are popular, they all involve sentence-level classifications. We therefore formulate another 3 CLS tasks that use entire abstracts of arXiv papers as input. They are predictions of:

  4. major category of a paper (8 categories including physics, computer science, etc.);

  5. sub-category of a paper (172 labels including Astrophysics of Galaxies, Artificial Intelligence, etc.);

  6. whether a paper spans more than one major category, which can be considered an interdisciplinary paper, as assessed by the submitting author.

The last task is motivated by the increasing prevalence of interdisciplinary science [47] and its role in tackling important societal problems [48].

Finally, for REL tasks, we use the abstracts of two papers to predict whether they are in the same broad category.

TABLE II: Experimental results of NLP tasks using different (scientific) BERT variants. Reported are F1 scores.
Task Field Dataset BERT BioBERT SciBERT BioMedBERT AnnualBERT
CLS CS SciERC 85.96 87.18 87.21 86.96 87.60
CS ACL-ARC 76.22 75.97 80.32 79.53 79.88
Multi SciCite 85.20 86.63 86.33 86.33 85.77
Multi arXiv (major) 88.83 88.96 90.69 89.95 91.41
Multi arXiv (sub) 44.48 45.31 57.03 51.60 58.24
Cross-field identification arXiv 83.58 85.42 87.85 85.82 92.13
NER CS SciERC 59.69 63.61 62.55 61.84 60.96
Multi ScienceExam 80.53 82.88 82.13 80.69 77.69
Multi ScienceIE 33.56 34.86 33.97 34.40 33.95
REL Multi ArXiv 86.79 87.64 89.18 87.44 89.52

IV-B Experimental setup

We compare our AnnualBERT models with other scientific BERT models, including BioBERT [4], SciBERT [6], BioMedBERT [7], as well as the original BERT [49]. For our evaluations, we utilize the AnnualBERT model corresponding to the dataset’s year of release.

To ensure consistent comparisons across the various BERT variants, we standardize the training configuration for each model. For the NER tasks, we adopt the NER model structure from SimpleTransformers, which incorporates a direct linear layer, and set the training parameters as follows: maximum input length of 512, training duration of 10 epochs, and batch size of 32. The best model parameters after each epoch are saved. For the CLS tasks, the vectors obtained using the [CLS] token pooling strategy are used as input to a multilayer perceptron (MLP) for classification during the fine-tuning phase. The primary training parameters are set as follows: a training duration of 4 epochs, a batch size of 16, and weight decay of 0.01, and the best-performing model on the development set is saved every 100 steps. For the REL task, we use the dual-tower structure [50]. We feed the abstracts of two arXiv papers into the same BERT-like model and employ the [CLS] tag pooling strategy to obtain their respective feature vectors, $u$ and $v$, which effectively represent each abstract. We then concatenate $u$ and $v$ and their element-wise absolute difference $\left|u-v\right|$. This concatenated vector is then fed into an MLP layer for classification.
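A minimal sketch of this dual-tower setup is shown below; the model identifier is illustrative, and training details (optimizer, scheduler, batching) are omitted.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class DualTowerREL(nn.Module):
    """Shared encoder, [CLS] pooling, then an MLP over [u; v; |u - v|]."""

    def __init__(self, model_name, hidden=768, n_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )

    def encode(self, inputs):
        # [CLS] pooling: the first token's final hidden state.
        return self.encoder(**inputs).last_hidden_state[:, 0]

    def forward(self, inputs_a, inputs_b):
        u, v = self.encode(inputs_a), self.encode(inputs_b)
        return self.classifier(torch.cat([u, v, torch.abs(u - v)], dim=-1))

# model = DualTowerREL("jd445/AnnualBERTs")  # illustrative model identifier
```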

We conduct all the experiments on a workstation with two Nvidia A6000 GPUs.

IV-C Experimental results

Table II presents the experimental results. Firstly, our AnnualBERT does not perform well on the NER tasks, and interestingly, neither do the other two models pretrained from scratch (SciBERT and BioMedBERT). This might be due to the tokenization strategy we use during the pretraining process. It may also be related to the fact that BioBERT, the best-performing model, results from continual pretraining of a model originally trained on the general Wikipedia and books corpus, which may align better with benchmark datasets like ScienceExam.

On the other hand, our AnnualBERT has the best performance in the CLS task on the SciERC dataset and comparable performances to the other specialized BERT models on the ACL-ARC and SciCite datasets. In general, the performance differences among these models on the three tasks are rather small. However, significant performance gaps become evident for the three domain-specific tasks, where our AnnualBERT performs the best. Particularly, for the sub-category prediction task, which involves classifying papers into one of the 172 sub-fields based on their abstracts, our AnnualBERT significantly leads the other models. Notably, SciBERT also has a comparable performance in this task, likely due to its training corpus containing a substantial number of computer science papers. Finally, AnnualBERT has a dominant performance in the cross-field paper identification task, indicating its superior knowledge representation ability for interdisciplinary papers that span multiple domains, while the other models, despite being fine-tuned, struggle with this task. Taken together, these results demonstrate the validity of our AnnualBERT models and the effectiveness of domain-specific pretraining for domain-specific tasks, consistent with prior studies [7].

V Citation link prediction

Another perspective for assessing the capabilities of our AnnualBERT models in learning and encoding the semantic information of papers is through link prediction tasks in the citation network. The rationale behind using this task lies in the nature of scientific discourse, where citations are formed not only by topical alignment but also by the nuanced interplay of numerous other factors, like ideas and methodologies shared between two papers. Therefore, the ability of a language model to encode abstracts into feature vectors that are highly predictive of whether two papers are related through citations reflects the model’s proficiency in grasping those complex semantic relationships, which in turn is instrumental for generating meaningful insights into domain-specific research [51].

As our setup requires a citation network between arXiv papers, we derive such a network from the Microsoft Academic Graph (MAG) dataset [52]. Doing so avoids the otherwise extensive effort of matching each referenced entry within the arXiv corpus, which may involve identifying the published version of an arXiv paper. To build our citation network, we first identify the corresponding MAG version of each arXiv paper, by extracting arXiv IDs from paper URLs provided by MAG, and then use the citation relationships between MAG papers to construct the network. Through this method, we identify nearly 1.5 million arXiv papers and 14.8 million citation relationships between them.

Given the constructed citation network, we study link prediction in both static and temporal settings.

V-A Link prediction in static network

Let $G=(V,E)$ denote the static citation network observed in 2020, where $V$ is the set of $N=\left|V\right|$ nodes and $E=\{(u,v)\}$ is the set of links indicating citation relationships between two arXiv papers $u,v\in V$. We study link prediction under the supervised learning setting. That is, we learn from data a model mapping node-pair features to link existence. Here we train classifiers using 5-fold cross-validation by forming datasets $\mathcal{D}=(E^{+},E^{-})$ composed of positive training examples as a subset of observed links $E^{+}\subset E$ and negative examples generated by sampling the same number of non-existent links $E^{-}\subset V\times V-E$.

V-A1 Methods

Link prediction in networks has been extensively studied in the past [53, 54]. Here we consider two families of approaches to predicting missing links: topological predictors and graph neural networks (GNN).

Topological predictors

Classical methods consider structural information in a network and assign a score $s(u,v)$ to a pair of nodes $u$ and $v$. We use the following 6 topological predictors.

  1. Common neighbors is the number of shared neighbors between $u$ and $v$, i.e., $s=\left|\Gamma(u)\cap\Gamma(v)\right|$, where $\Gamma(\cdot)$ denotes the neighbors of a node.

  2. Jaccard coefficient (JC) is the normalized overlap between the neighbors: $s=\frac{\left|\Gamma(u)\cap\Gamma(v)\right|}{\left|\Gamma(u)\cup\Gamma(v)\right|}$.

  3. Preferential attachment (PA) index captures the tendency that more connected nodes are more likely to attract new links: $s=\left|\Gamma(u)\right|\times\left|\Gamma(v)\right|$.

  4. The Adamic/Adar (AA) index [53] defines the score as $s=\sum_{z\in\Gamma(u)\cap\Gamma(v)}\frac{1}{\log\left|\Gamma(z)\right|}$. It considers the importance of a common neighbor $z$ by giving more weight if $z$ has fewer connections, capturing the intuition that common neighbors in a less-populated field can be more significant in predicting a link between $u$ and $v$.

  5. The resource allocation (RA) index [55] defines the score as $s=\sum_{z\in\Gamma(u)\cap\Gamma(v)}\frac{1}{\left|\Gamma(z)\right|}$. It measures the likelihood of a link between two nodes based on the principle of resource allocation, where nodes share resources through their common neighbors.

  6. Personalized PageRank (PPR) corresponds to the $v$-th entry in the stationary distribution of a random walk with restart from $u$. Compared with the above predictors, PPR utilizes more global structural information in the network.

For all these methods, we use logistic regression.
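For illustration, the sketch below computes these neighborhood-based scores with NetworkX on a toy undirected graph and feeds them jointly into a logistic regression classifier; PPR is omitted for brevity, and the toy graph stands in for the (much larger) arXiv citation network.

```python
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

def topological_features(G, pairs):
    """Stack the neighborhood-based scores for a list of node pairs."""
    feats = []
    for u, v in pairs:
        cn = len(list(nx.common_neighbors(G, u, v)))
        jc = next(nx.jaccard_coefficient(G, [(u, v)]))[2]
        pa = next(nx.preferential_attachment(G, [(u, v)]))[2]
        aa = next(nx.adamic_adar_index(G, [(u, v)]))[2]
        ra = next(nx.resource_allocation_index(G, [(u, v)]))[2]
        feats.append([cn, jc, pa, aa, ra])
    return np.array(feats)

# Toy example: positives are observed edges, negatives are sampled non-edges.
G = nx.karate_club_graph()
pos = list(G.edges())[:20]
neg = list(nx.non_edges(G))[:20]
X = topological_features(G, pos + neg)
y = np.array([1] * len(pos) + [0] * len(neg))
clf = LogisticRegression(max_iter=1000).fit(X, y)
```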

GNN

One limitation of topological predictors is that they are unable to exploit node features, which in our context is the abstract of a paper. The rapid development of GNNs has made it possible to integrate graph topology and node features for downstream tasks like link prediction. Here we leverage GraphSAGE [56] to generate embeddings $H^{(L)}$ of the nodes in our arXiv citation network, which are then used for link prediction. Note that our purpose is to evaluate different scientific BERT models rather than GNNs, and we choose GraphSAGE due to its popularity. Different from earlier graph embedding methods that are transductive, GraphSAGE is an inductive graph representation learning algorithm that learns aggregation functions so that it can inductively generate embeddings of unseen nodes.

Like other GNNs, GraphSAGE generates the embedding of a node by sampling and aggregating the embeddings of its local neighbors. In our experiment, given an initial node embedding matrix $H^{(0)}$, we use the graph convolutional network (GCN) aggregator:

H^{(l)}=\sigma\left(\widetilde{D}^{-\frac{1}{2}}\widetilde{A}\widetilde{D}^{-\frac{1}{2}}H^{(l-1)}W^{(l-1)}\right)\,, (2)

where $H^{(l)}$ represents the updated node representations after the $l$-th layer of GraphSAGE, $\sigma$ is the activation function, $\widetilde{D}$ is the degree matrix of the network $G$, $\widetilde{A}$ is the adjacency matrix of the network with added self-loops, and $W^{(l-1)}$ is a trainable weight matrix at layer $l-1$. We set the number of layers $L$ to 2 and use 2 MLP layers for modeling interactions between the embeddings of nodes $u$ and $v$:

\begin{split}s(u,v)&=\text{MLP}\left(h_{u}^{(L)},h_{v}^{(L)}\right)\\ &=w_{2}\,\sigma\left(w_{1}\left[h_{u}^{(L)};h_{v}^{(L)}\right]+b_{1}\right)+b_{2}\,.\end{split} (3)

The loss function is the binary cross-entropy loss:

\begin{split}\mathcal{L}=-\sum_{u,v\in V}\Big(&y_{u\sim v}\log(s(u,v))+\\ &(1-y_{u\sim v})\log(1-s(u,v))\Big)\,, \end{split} (4)

where $y_{u\sim v}$ is 1 if there is a link between $u$ and $v$, and 0 otherwise.
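A compact sketch of this architecture is given below using PyTorch Geometric, whose GCNConv layer implements the normalized propagation of Eq. (2) (the nonlinearity is applied outside the layer); hidden sizes and other details are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class LinkPredictor(nn.Module):
    """Two message-passing layers implementing Eq. (2), then the MLP scorer of Eq. (3)."""

    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)   # GCN-style aggregation with self-loops
        self.conv2 = GCNConv(hidden, hidden)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x, edge_index, pairs):
        h = F.relu(self.conv1(x, edge_index))
        h = self.conv2(h, edge_index)                      # H^(L) with L = 2
        z = torch.cat([h[pairs[0]], h[pairs[1]]], dim=-1)  # [h_u ; h_v]
        return torch.sigmoid(self.mlp(z)).squeeze(-1)      # s(u, v)

# loss = F.binary_cross_entropy(model(x, edge_index, pairs), labels.float())  # Eq. (4)
```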

We consider the following 8 ways to obtain $H^{(0)}$.

  1. We associate each node with a random $768$-dimensional feature vector, thus $H^{(0)}\in\mathbb{R}^{N\times 768}$. The purpose is to, by feeding no useful node features to GraphSAGE, make it learn only topological information, so that we can compare with the above topological predictors.

  2. We generate the one-hot encoding matrix for the major category of papers as $H^{(0)}\in\mathbb{R}^{N\times 8}$ (8 categories). This is based on the fact that citations tend to occur within the same field, so injecting field label information into GraphSAGE should improve its citation link prediction performance, thereby serving as a strong baseline to check to what extent paper abstracts provide additional information beyond paper categories.

  3. Similarly, we use the one-hot encoding matrix for paper sub-category labels as $H^{(0)}\in\mathbb{R}^{N\times 172}$ (172 subcategories).

  4. The remaining 5 feature matrices $H^{(0)}\in\mathbb{R}^{N\times 768}$ are encoded by each of the 5 scientific BERT models using the [CLS] tag pooling strategy on paper abstracts (a minimal encoding sketch follows this list). For our models, we use AnnualBERT2020. By using different feature matrices encoded by the surveyed models, we can ascertain whether our AnnualBERT is the most useful encoder in predicting citation relationships between papers.
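The [CLS]-pooling encoding used to build these feature matrices can be sketched as follows; the model identifier is illustrative, and batching and device placement are simplified.

```python
import torch
from transformers import AutoModel, AutoTokenizer

@torch.no_grad()
def encode_abstracts(abstracts, model_name, batch_size=16, max_len=512):
    """Build H^(0) by encoding each abstract and keeping its [CLS] vector."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    rows = []
    for i in range(0, len(abstracts), batch_size):
        batch = tok(abstracts[i:i + batch_size], padding=True, truncation=True,
                    max_length=max_len, return_tensors="pt")
        rows.append(model(**batch).last_hidden_state[:, 0])  # [CLS] pooling
    return torch.cat(rows)  # shape: (N, 768)

# H0 = encode_abstracts(arxiv_abstracts, "jd445/AnnualBERTs")  # illustrative id
```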

TABLE III: Experimental results for link prediction in the static arXiv citation network.
Model AUC-ROC Accuracy Precision Recall F1 Score
Common Neighbors 0.739 (0.000) 0.670 (0.000) 0.949 (0.000) 0.266 (0.000) 0.506 (0.000)
Jaccard Coefficient 0.739 (0.000) 0.739 (0.000) 0.997 (0.000) 0.333 (0.001) 0.648 (0.000)
Adamic/Adar 0.708 (0.000) 0.708 (0.000) 1.000 (0.000) 0.148 (0.001) 0.589 (0.000)
Preferential Attachment 0.895 (0.000) 0.820 (0.000) 0.978 (0.000) 0.235 (0.000) 0.813 (0.000)
Resource Allocation 0.740 (0.000) 0.738 (0.000) 1.000 (0.000) 0.266 (0.007) 0.646 (0.000)
Personalized PageRank 0.870 (0.000) 0.776 (0.000) 0.814 (0.015) 0.651 (0.033) 0.767 (0.000)
GraphSAGE + Random 0.827 (0.013) 0.749 (0.013) 0.772 (0.012) 0.706 (0.021) 0.737 (0.016)
GraphSAGE + Major category 0.858 (0.014) 0.782 (0.003) 0.722 (0.002) 0.914 (0.004) 0.807 (0.003)
GraphSAGE + Subcategory 0.975 (0.002) 0.928 (0.003) 0.940 (0.003) 0.917 (0.002) 0.928 (0.003)
GraphSAGE + BERT 0.984 (0.002) 0.936 (0.003) 0.954 (0.006) 0.915 (0.002) 0.934 (0.002)
GraphSAGE + BioBERT 0.978 (0.001) 0.923 (0.002) 0.941 (0.003) 0.902 (0.003) 0.921 (0.002)
GraphSAGE + SciBERT 0.984 (0.002) 0.936 (0.003) 0.954 (0.006) 0.915 (0.002) 0.934 (0.002)
GraphSAGE + BioMedBERT 0.980 (0.005) 0.928 (0.008) 0.945 (0.010) 0.910 (0.006) 0.927 (0.008)
GraphSAGE + AnnualBERT 0.991 (0.002) 0.955 (0.004) 0.967 (0.006) 0.942 (0.003) 0.955 (0.004)

V-A2 Experimental results

Table III reports the experimental results, providing a comprehensive comparison of the various link prediction methods applied to our arXiv citation network. Firstly, the results indicate that some topological predictors—AA and RA—obtain the highest (possible) precision, highlighting their effectiveness in predicting connections between academic papers. However, these methods have rather low recall, suggesting that they miss many true positives. PPR appears to strike the best balance between precision and recall, especially given its much higher recall than the other topological predictors.

Secondly, turning to GraphSAGE-based methods with different initial node feature matrices $H^{(0)}$, we find that feeding no useful node features (random $H^{(0)}$) results in performance similar to PPR. Providing paper major category information to GraphSAGE significantly improves recall (0.914), while overall performance metrics (AUC-ROC and F1) remain largely similar to PPR. This meets our expectation that major category is a highly useful discriminative predictor in distinguishing pairs of papers with or without citations. Supplementing sub-category information to GraphSAGE further improves performance to a great extent, with the AUC-ROC increasing from 0.858 to 0.975. Such an improvement is mainly attributed to the boost in precision from 0.722 to 0.940, while the recall stays almost unchanged around 0.91. These results make this simple representation method a highly competitive alternative to sophisticated language models.

Thirdly, using each of the 5 different BERT models further improves AUC-ROC, implying that the features they encode indeed provide additional information beyond what the subcategory does. Our AnnualBERT model stands out distinctly, achieving the highest AUC-ROC (0.991). Again, as in the subcategory case, the increases in AUC-ROC stem from higher precision for all the BERT models except our AnnualBERT, as their recall remains constant around 0.91. Our AnnualBERT is the only model that improves both precision (0.967) and recall (0.942). These results demonstrate the effectiveness of our model in learning and encoding the semantic information of papers among all the evaluated methods.

V-B Link prediction in temporal network

V-B1 Experimental setup

The arXiv citation network is actually a temporal (growing) network, as we know the publication year of each paper. This raises the question of how different models perform in a more challenging setting of prospective link prediction, where we train a classifier using links and non-links observed during an earlier period and then apply it to predict the appearance of future links. Formally, let $G_{t_{0}}=(V_{t_{0}},E_{t_{0}})$ denote the citation network observed at year $t_{0}$, where the node set $V_{t_{0}}=\{v\,|\,t_{v}\leq t_{0}\}$ represents all the arXiv papers published until $t_{0}$ and $E_{t_{0}}=\{(u,v)\,|\,u,v\in V_{t_{0}}\}$ is the set of citation links between them. With $G_{t_{0}}$, we use GraphSAGE and an MLP to model interactions between the embeddings of two arXiv papers in $V_{t_{0}}$, following the same process as in the static network case. We then apply the resulting architecture to predict links appearing in $t_{0}+1$. Specifically, we denote the test dataset as $\mathcal{D}_{\text{test}}=(B_{t_{0}+1}^{+},B_{t_{0}+1}^{-})$, where $B_{t_{0}+1}^{+}$ contains all the observed citations (positive examples) from papers published in $t_{0}+1$ to papers published until $t_{0}+1$, i.e., $B_{t_{0}+1}^{+}=\{(u,v)\,|\,t_{u}=t_{0}+1,\,t_{v}\leq t_{0}+1\}$, and $B_{t_{0}+1}^{-}$ is the set of non-existent links (negative examples) generated by, for each observed link $(u,v)\in B_{t_{0}+1}^{+}$, randomly choosing a different endpoint $v^{\prime}$ published until $t_{0}$, i.e., $B_{t_{0}+1}^{-}=\{(u,v^{\prime})\,|\,t_{u}=t_{0}+1,\,t_{v^{\prime}}\leq t_{0},\,(u,v^{\prime})\not\in B_{t_{0}+1}^{+}\}$.

In our experiment, we consider $t_{0}=2014$. The network $G_{t_{0}}$ has $\left|V_{t_{0}}\right|=637,740$ nodes and $\left|E_{t_{0}}\right|=3,999,781$ links, and there are $\left|V_{t_{0}+1}-V_{t_{0}}\right|=73,356$ “future” papers published in year $t_{0}+1$, which in total emanate $\left|B_{t_{0}+1}^{+}\right|=617,134$ citations. We evaluate the performances of the above-mentioned 8 ways of obtaining initial node embeddings $H^{(0)}$ by testing on the $\mathcal{D}_{\text{test}}$ dataset. For our method, we take AnnualBERT$_{t_{0}}$ for feature encoding.

V-B2 Experimental results

TABLE IV: Experimental results for link prediction in the temporal arXiv citation network.
Model AUC-ROC Accuracy Precision Recall F1 Score
GraphSAGE + Random 0.660 0.582 0.651 0.352 0.457
GraphSAGE + Major category 0.756 0.576 0.809 0.198 0.318
GraphSAGE + Subcategory 0.965 0.900 0.942 0.853 0.896
GraphSAGE + BERT 0.951 0.846 0.931 0.747 0.829
GraphSAGE + BioBERT 0.960 0.866 0.931 0.790 0.855
GraphSAGE + SciBERT 0.972 0.880 0.956 0.796 0.869
GraphSAGE + BioMedBERT 0.962 0.871 0.940 0.792 0.860
GraphSAGE + AnnualBERT 0.988 0.939 0.968 0.907 0.937

Table IV presents the experimental results, comparing the performances of the 8 different methods for generating the $H^{(0)}$ representations. Firstly, the two methods that provide insufficient information (i.e., random and major category) perform poorly across all the metrics on the temporal network, as there is insufficient topological information to be leveraged during the prediction process. By contrast, the subcategory-based method yields surprisingly strong results, even surpassing BERT and BioBERT in terms of AUC-ROC. This impressive performance may be attributed to the specificity of the subcategories, with a total of 172 distinct subcategories providing highly targeted information. Our AnnualBERT model still stands out distinctly when compared with the other BERT-based models, achieving the highest scores in AUC-ROC (0.988), accuracy (0.939), recall (0.907), and F1 score (0.937), underscoring the effectiveness of domain-specific pretraining in capturing richer semantic representations.

VI Mining AnnualBERT

In the previous two sections, we have demonstrated the validity of our AnnualBERT series of models $\{\mathcal{M}_{t}\}$ and their superior performance compared with the other scientific BERT models on tasks specific to the arXiv domain. In this section, we are interested in mining the weights of these models, as we reason that each of the models serves as a condensed representation of the scientific literature over a specific period, the mining of which would enable us to understand the changing landscape of science as captured in the arXiv corpus.

Figure 4: PCA visualizations of model weights in different layers.

To motivate this analysis, we examine if and how the model weights change as the model keeps adapting to new corpora published over time. To this end, we select various layers and visualize their weights. For instance, in the word embedding layer, we flatten each model’s $53100\times 768$ matrix associated with that layer into a one-dimensional vector, resulting in a $13\times 40780800$ matrix across the 13 models. Subsequently, we employ principal component analysis (PCA) to reduce the dimensionality of the flattened word embedding weights to two dimensions. The same procedure is applied to other layers to obtain the visualizations shown in Fig. 4, from which we draw two observations. Firstly, for the majority of layers, two components account for over 60% of the variance, while in the case of the word embedding layer, they explain almost 90% of the variance, indicating that the model weights are highly redundant and possess distillability [57]. Secondly, Fig. 4 also shows that, across layers, the model’s weight space appears to have a distinct pattern, with all the weight points lying on a quadratic curve, implying that the model parameters exhibit observable changes when the training corpora are organized by year. These results support the hypothesis that continual pretraining on corpora organized by time leads to constant adaptations in the model’s knowledge representation. This finding is important for understanding how changes in the scientific literature are encapsulated within the evolving parameters of domain-specific language models like AnnualBERT.
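The weight-flattening and PCA step can be sketched as follows; checkpoint directory names are hypothetical, and loading all 13 flattened embedding matrices at once assumes sufficient memory.

```python
import numpy as np
from sklearn.decomposition import PCA
from transformers import AutoModel

def flatten_word_embeddings(model_dirs):
    """One row per yearly checkpoint: the flattened word-embedding matrix."""
    rows = []
    for d in model_dirs:
        model = AutoModel.from_pretrained(d)
        weights = model.get_input_embeddings().weight.detach().numpy()  # 53100 x 768
        rows.append(weights.ravel())
    return np.stack(rows)  # 13 x 40780800 for the 2008-2020 series

# dirs = [f"annualbert-{y}" for y in range(2008, 2021)]  # hypothetical names
# coords = PCA(n_components=2).fit_transform(flatten_word_embeddings(dirs))
```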

VI-A Temporal adaptation

VI-A1 Experimental setup

Figure 5: Summary F1 matrices $\widehat{P}_{t,\tau}$ showing how AnnualBERT models $\mathcal{M}_{t}$ perform in prediction tasks on papers from different data years $\tau$. (a–b) Major category prediction based on (a) entire abstracts and (b) only the second half of abstracts. (c) Subcategory classification; (d) cross-field identification.

In continual learning, learning new knowledge often leads to forgetting old knowledge, which is manifested as improved performance on new tasks but decreased performance on old ones. Different from the previous focus, we concentrate on temporal differences. Particularly, the observation that model weights change when adapting to timestamped corpora within the same domain naturally leads to the questions of how both learning and forgetting manifest and how we can quantify the extent of knowledge forgetting. We answer them in the context of prediction tasks.

Given a specific prediction task, for each data year $\tau$, we randomly sample 1,600 and 200 abstracts published in $\tau$ respectively as the training and test set, extract their feature vectors from the [CLS] tokens generated by each model $\mathcal{M}_{t}$, and train a Random Forests model using the training set and test it on the test set, resulting in a performance matrix $P_{t,\tau}^{(r)}$. To mitigate effects due to random sampling, we repeat this process for 50 runs, using a different random seed for each run $r$. We then summarize the performance matrices from individual runs into one summary matrix, by (1) averaging the matrices, $\overline{P}_{t,\tau}=\sum_{r}P_{t,\tau}^{(r)}/50$; (2) performing column-wise (data year $\tau$) min-max normalization of $\overline{P}_{t,\tau}$ into $\widetilde{P}_{t,\tau}$, to account for the inherent variability in classification difficulty of individual data years; and (3) performing column-wise subtraction of the diagonal elements of $\widetilde{P}_{t,\tau}$ from $\widetilde{P}_{t,\tau}$, to obtain relative performances compared to the $t=\tau$ cases. We denote the final matrix as $\widehat{P}_{t,\tau}$.

Below we consider three tasks: predicting the major category and sub-category of a paper and whether it is interdisciplinary; the raw performance measure is the F1 score.
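The construction of the summary matrix $\widehat{P}_{t,\tau}$ can be sketched as follows; the lookup of pre-extracted [CLS] features by (model year, data year) and the macro-averaged F1 are assumptions of this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def performance_matrix(features, years, seed):
    """One run P^(r): rows are model years t, columns are data years tau."""
    P = np.zeros((len(years), len(years)))
    for i, t in enumerate(years):          # encoder M_t
        for j, tau in enumerate(years):    # abstracts published in year tau
            # features[(t, tau)] is assumed to hold the 1,600/200 split of
            # [CLS] vectors and labels for this (model year, data year) pair.
            X_tr, y_tr, X_te, y_te = features[(t, tau)]
            clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
            P[i, j] = f1_score(y_te, clf.predict(X_te), average="macro")
    return P

def summarize(runs):
    """Average the 50 runs, min-max normalize each column, subtract the diagonal."""
    P = np.mean(runs, axis=0)
    P = (P - P.min(axis=0)) / (P.max(axis=0) - P.min(axis=0))
    return P - np.diag(P)[None, :]
```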

VI-A2 Experimental results

Fig. 5a presents $\widehat{P}_{t,\tau}$ for the major category prediction task. The lower triangular part corresponds to cases where a model $\mathcal{M}_{t}$ predicts the major category of a “past” abstract ($t>\tau$). We observe under-performance, indicating that AnnualBERT does forget: as the model continues adapting to newly published corpora, it captures semantic representation changes specific to that period while forgetting some knowledge from the past, such that its performance on previous datasets decreases. This representation forgetting becomes more pronounced when the models make predictions on more distant past papers (darker blues moving away from the diagonal dashed line).

The upper triangular part in Fig. 5a corresponds to cases where a model $\mathcal{M}_{t}$ makes predictions on “future” abstracts ($t<\tau$). It demonstrates declining performance as well, due to the models’ inability to encode abstracts not yet published. However, we notice that for the 2014–2016 test sets, the $\mathcal{M}_{2010}$–$\mathcal{M}_{2013}$ models perform better than the others. We hypothesize that this is because the cumulative nature of scientific advances is particularly evident here, to the extent that there is a significant flow of scientific discourse from the full text of past papers to the background information in abstracts of future papers. To test this hypothesis, we re-run the entire prediction task, but using “ablated” training and test sets where we retain only the second half of the sentences in an abstract, through which we effectively, albeit roughly, reduce the impact of this discourse flow. The results, shown in Fig. 5b, reveal a notable decline in predictive performance for those years (the lighter red block), partially validating our hypothesis.

Fig. 5c presents $\widehat{P}_{t,\tau}$ for the subcategory prediction task. Similar to the previous task, predicting the subcategory of future abstracts remains challenging. However, different from Fig. 5a, the phenomenon of representation forgetting is not observed for subcategory classification. On the contrary, there are noticeable performance increases when the models predict past abstracts (the lower triangular part). Such over-performance is not due to a smaller number of subcategory labels in earlier years, as the observation is actually reinforced if we focus on the subcategories with the most papers. Instead, we speculate that it might be attributed to the need for sustained training data for the model to establish robust representations of specialized, fine-grained knowledge: as the model progressively sees more relevant training data from subsequent years, its performance for these recent years improves.

Fig. 5d shows $\widehat{P}_{t,\tau}$ for the task of identifying cross-field papers. We note that the results differ from the previous two tasks. AnnualBERT under-performs in identifying past cross-field papers, indicating representation forgetting. Yet it over-performs for future papers, suggesting that interdisciplinary papers may be published ahead of their time and potentially shape the development of later papers.

In summary, our results above paint a more complex picture of the representation forgetting phenomenon studied previously. We find that it is highly task dependent: for tasks like major category prediction, AnnualBERT does forget, but for the slightly different task of predicting subcategory, representation forgetting is no longer evident.

VI-B Interpolation of models

While we have found that AnnualBERT adapts to new corpora, Fig. 4 in the meantime suggests that the model weights are located close to each other in the weight space, raising the question of whether we can interpolate models so that the interpolated ones $\mathcal{M}_{t_{0}}^{\Delta t}$ have performances similar to the “real” model $\mathcal{M}_{t_{0}}$ in prediction tasks. We derive the interpolated model $\mathcal{M}_{t_{0}}^{\Delta t}$ ($\Delta t\geq 1$) by “averaging” $\mathcal{M}_{t_{0}-\Delta t}$ and $\mathcal{M}_{t_{0}+\Delta t}$, treating the weights as a linear system:

\mathcal{M}_{t_{0}}^{\Delta t}=\frac{1}{2}\left(\mathcal{M}_{t_{0}-\Delta t}\oplus\mathcal{M}_{t_{0}+\Delta t}\right)\,. (5)

We then compare performances on the 3 prediction tasks with the same settings as described before.
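Eq. (5) amounts to element-wise averaging of two checkpoints’ parameters, as sketched below; checkpoint names are hypothetical, and non-floating-point buffers are simply copied from one model.

```python
from transformers import RobertaForMaskedLM

def interpolate(dir_early, dir_late, out_dir):
    """Element-wise average of two yearly checkpoints (Eq. 5)."""
    model_a = RobertaForMaskedLM.from_pretrained(dir_early)
    model_b = RobertaForMaskedLM.from_pretrained(dir_late)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    mixed = {
        k: (state_a[k] + state_b[k]) / 2 if state_a[k].is_floating_point() else state_a[k]
        for k in state_a
    }
    model_a.load_state_dict(mixed)
    model_a.save_pretrained(out_dir)
    return out_dir

# For t0 = 2014 and Delta t = 2 (hypothetical checkpoint names):
# interpolate("annualbert-2012", "annualbert-2016", "mix-2012-2016")
```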

Focusing on $t_{0}=2014$, we obtain 6 interpolated models by varying $\Delta t$, and Fig. 6 presents the average F1 scores of these models together with $\mathcal{M}_{2014}$ for the 3 tasks. We also display the $p$-values of the Mann-Whitney U tests comparing the F1 scores of each interpolated model with $\mathcal{M}_{2014}$. Firstly, we observe that across the 3 tasks, all the interpolated models perform well and exhibit comparable performances with the real model. To further demonstrate that both $\mathcal{M}_{t_{0}-\Delta t}$ and $\mathcal{M}_{t_{0}+\Delta t}$ are necessary in creating the interpolated model, we note that using either of them alone would not generate performance similar to $\mathcal{M}_{t_{0}}$ (see Figs. 5a, c, d). Furthermore, we conduct additional experiments where we replace $\mathcal{M}_{2008}$ with a purely random model $\mathcal{M}_{2008}^{\text{random}}$ that has the same mean and variance as $\mathcal{M}_{2008}$; interpolating between $\mathcal{M}_{2020}$ and $\mathcal{M}_{2008}^{\text{random}}$ yields significantly worse performances than $\mathcal{M}_{2014}$ for the 3 tasks (F1 scores are 0.66, 0.07, and 0.79 respectively, with all $p$-values below 0.05).

The second observation from Fig. 6 is that as $\Delta t$ increases (interpolating between two models farther apart temporally), the performance gap between $\mathcal{M}_{2014}^{\Delta t}$ and $\mathcal{M}_{2014}$ tends to increase, although without statistical significance. This may imply that the model weights encode temporal information via continual training, and that it is feasible to edit the model to fulfil a desired task within a specific time frame.

Figure 6: Average F1 scores of $\mathcal{M}_{t_{0}}$ and interpolated models (denoted as $\text{Mix}_{t_{1},t_{2}}$) between $\mathcal{M}_{t_{1}}$ and $\mathcal{M}_{t_{2}}$. (a) Major category classification; (b) subcategory classification; (c) cross-field identification.

VI-C Random shuffle experiment

Finally, having examined the temporal adaptability and interpolation of AnnualBERT, we explore the root cause of these behaviors, the training corpus, and study how it affects the learning and forgetting behaviors by designing experiments on synthetic datasets. In particular, we first create a new, shuffled corpus, where we randomly shuffle the sequence of words in each sentence of the abstracts published in 2020. We then rerun continual training of $\mathcal{M}_{2020}$ on this shuffled corpus for 5 epochs, resulting in a new model $\mathcal{M}_{2020}^{(\text{s})}$. For comparison, we also repeat the same procedure on the original, unaltered abstracts to obtain $\mathcal{M}_{2020}^{(\text{o})}$. In the shuffled case, we force the model to adapt to a semantically meaningless corpus and consequently to forget the learned semantics, whereas in the unaltered case, the model reinforces the learning process. Next, we evaluate the extent of learning and forgetting in the same manner as in § VI-A, by sampling from the shuffled abstracts a training set and a test set, denoted as $\mathcal{D}^{(\text{s})}$, and similarly from the unshuffled abstracts, $\mathcal{D}^{(\text{o})}$. We then test the classification performances of the two models on the two datasets and get the following accuracy matrix:

                                  \mathcal{D}^{(\text{o})}    \mathcal{D}^{(\text{s})}
\mathcal{M}_{2020}^{(\text{o})}   A_{o,o}                     A_{o,s}
\mathcal{M}_{2020}^{(\text{s})}   A_{s,o}                     A_{s,s}
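A minimal sketch of the shuffled-corpus construction described above, assuming NLTK sentence and word tokenizers (the exact tokenization used to build the corpus is an assumption of this sketch):

```python
import random
from nltk.tokenize import sent_tokenize, word_tokenize  # requires nltk.download("punkt")

def shuffle_abstract(abstract, seed=0):
    """Shuffle the word order within each sentence of an abstract,
    keeping sentence boundaries intact."""
    rng = random.Random(seed)
    shuffled_sentences = []
    for sentence in sent_tokenize(abstract):
        words = word_tokenize(sentence)
        rng.shuffle(words)
        shuffled_sentences.append(" ".join(words))
    return " ".join(shuffled_sentences)

# M_2020^(s) is then obtained by continuing masked-language-model training of
# M_2020 on the shuffled 2020 abstracts for 5 epochs, and M_2020^(o) by the
# same procedure on the unaltered abstracts.
```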

For dataset \mathcal{D}^{(\text{o})}, we quantify performance deterioration by calculating the relative accuracy with respect to the baseline:

\widetilde{A}^{\text{o}}=\frac{A_{s,o}-A_{o,o}}{\text{baseline}}\,. (6)

Here, the baseline is the accuracy obtained by randomly guessing the label according to the class distribution:

\text{baseline}=\sum_{i}\left(\frac{N_{i}}{N}\right)^{2}\,, (7)

where N_{i} is the number of samples of class i and N is the total number of samples. For dataset \mathcal{D}^{(\text{s})}, the relative accuracy \widetilde{A}^{\text{s}} is defined analogously.
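A minimal sketch of the baseline in Eq. (7) and the relative accuracy in Eq. (6). Reading \widetilde{A}^{\text{s}} as the same difference (shuffled-trained minus original-trained model) evaluated on \mathcal{D}^{(\text{s})} is an assumption, and the example numbers are hypothetical.

```python
from collections import Counter

def random_guess_baseline(labels):
    """Accuracy of guessing labels according to the class distribution:
    sum_i (N_i / N)^2."""
    counts = Counter(labels)
    n = len(labels)
    return sum((c / n) ** 2 for c in counts.values())

def relative_accuracy(acc_shuffled_model, acc_original_model, baseline):
    """Eq. (6) on D^(o): acc_shuffled_model = A_{s,o}, acc_original_model = A_{o,o}."""
    return (acc_shuffled_model - acc_original_model) / baseline

# Hypothetical example:
# labels = ["cs", "math", "physics", "cs", "cs"]
# base = random_guess_baseline(labels)        # (3/5)^2 + (1/5)^2 + (1/5)^2 = 0.44
# a_tilde_o = relative_accuracy(0.61, 0.78, base)
```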

Table V presents \widetilde{A}^{\text{o}} and \widetilde{A}^{\text{s}} for the 3 tasks. It indicates a consistent performance shift across tasks: the model \mathcal{M}_{2020}^{(\text{s})} trained on the shuffled corpus under-performs on real data but over-performs on shuffled data, showcasing the model's adaptation to the disordered text and its forgetting of semantics. This phenomenon is most pronounced for subcategory prediction, where the model critically relies on semantically meaningful text to learn the subtleties needed to identify fine-grained categories; destroying this structure causes the model to under-perform severely. This result is also consistent with Fig. 5c, which shows that the model needs sustained training instances to maintain a reliable representation of a particular subcategory.

TABLE V: Relative accuracy of 3 prediction tasks on the original and shuffled abstracts.
Task                          \widetilde{A}^{\text{o}}    \widetilde{A}^{\text{s}}
Major category prediction -0.002 0.030
Subcategory prediction -0.779 0.392
Cross-field identification -0.014 0.007

VII Conclusion

In this work, we presented a new method for pretraining a series of models by organizing the corpora chronologically and introduced AnnualBERT, a series of language models for academic text, each associated with a different year. This series of language models allows scientific documents from various time periods to be tracked, facilitating the representation of knowledge over time. We validated our models by fine-tuning them on several standard benchmark datasets, demonstrating that our method achieves performance comparable to similar models. Moreover, we employed a domain-specific link prediction task on the arXiv dataset to evaluate our models, and the results confirmed that they achieve state-of-the-art performance in this domain-specific context. To understand how the models adjust to the evolution of scientific discourse over extended periods, we visualized the model weights and employed probing tasks classifying the arXiv corpus to quantify the models' learning and forgetting behaviors as time progresses. Our findings reveal that our models not only perform on par with other domain-specific BERT models in standard natural language processing tasks but also capture and reflect the linguistic characteristics unique to the corresponding publication year.

Acknowledgment

This work is supported by the National Natural Science Foundation of China (72204206), City University of Hong Kong (Project No. 9610552, 7005968), and the Hong Kong Institute for Data Science.
