
Recommending Multiple Positive Citations for Manuscript via Content-Dependent Modeling and Multi-Positive Triplet

Yang Zhang ([email protected]), Graduate School of Informatics, Kyoto University, Kyoto, Japan  and  Qiang Ma ([email protected]), Graduate School of Informatics, Kyoto University, Kyoto, Japan
(2021)
Abstract.

Given the rapidly increasing number of academic papers, searching for and citing appropriate references has become a non-trivial task during the writing of papers. Recommending a handful of candidate papers for a manuscript before publication could ease the burden on the authors and help the reviewers to check the completeness of the cited resources. Conventional approaches to citation recommendation generally consider recommending one ground-truth citation for a query context from an input manuscript, but pay little attention to co-citation recommendations. However, a piece of context often needs to be supported by two or more co-cited references. Here, we propose a novel scientific paper model for citation recommendation, the Multi-Positive BERT Model for Citation Recommendation (MP-BERT4CR), coupled with a series of Multi-Positive Triplet objectives to recommend multiple positive citations for a query context. The proposed approach has the following advantages: First, the multi-positive objectives are effective for recommending multiple positive candidates. Second, we adopt noise distributions built on historical co-citation frequencies, so that MP-BERT4CR is not only effective at recommending high-frequency co-citation pairs but also significantly improves the retrieval of low-frequency ones. Third, we propose a dynamic context sampling strategy that captures the “macro-scoped” citing intents from a manuscript and makes the citation embeddings content-dependent, which further improves performance. Experiments on single and multiple positive recommendation show that MP-BERT4CR delivers significant improvements over the baselines. In addition, MP-BERT4CR is also effective at retrieving the full list of co-citations and historically low-frequency co-citation pairs compared with prior works.

Citation Recommendation, Text Recommendation, Document Embedding
journalyear: 2021; copyright: acmcopyright; conference: IEEE/WIC/ACM International Conference on Web Intelligence, December 14–17, 2021, ESSENDON, VIC, Australia; booktitle: IEEE/WIC/ACM International Conference on Web Intelligence (WI-IAT ’21), December 14–17, 2021, ESSENDON, VIC, Australia; price: 15.00; doi: 10.1145/3486622.3494002; isbn: 978-1-4503-9115-3/21/12; ccs: Information systems, Recommender systems; ccs: Information systems, Content analysis and feature selection

1. Introduction

The number of academic papers is massive, accounting for over 300 million as of 2018 and growing at 5% per year according to Johnson et al. (2018). It has become challenging for researchers to search for and find appropriate papers to cite when writing their own papers, and for reviewers to check the completeness of the cited resources.

Currently, researchers generally rely on keyword-based search engines (such as Google Scholar) to find relevant papers via input keywords. However, it has been argued that input keywords may be too simplified to reflect the users’ search needs (Jia and Saule, 2017, 2018), and hence they often lead to unsatisfactory recommendations. To better detect the citing intent of users, a line of studies adopts a collection of seed papers (e.g., papers previously cited or read by the user) to find topically relevant articles via collaborative filtering (McNee et al., 2002), random-walk methods (Gori and Pucci, 2006; Küçüktunç et al., 2013; Jia and Saule, 2018), or matrix decomposition methods (Caragea et al., 2013). Seed papers can potentially reflect richer information than keywords. However, these methods still give little consideration to extracting the citing intent from the user’s manuscript. Later research utilized the “local context” (the words surrounding a target citation) to extract the semantics of the citing intent and then rank the most similar papers via citation embeddings (Han et al., 2018; Zhang and Ma, 2020b).

Context-based approaches (Han et al., 2018; Zhang and Ma, 2020b) detect citing intents from the sentences that need support and, based on them, find the best-matched candidate papers. These approaches could potentially be applied in the real world to assist researchers in finding citations while writing papers. Nevertheless, context-based methods are still constrained in four aspects. First, they generally consider recommending one ground-truth citation for the query context. However, it is common that a piece of context cites more than one reference, forming co-citations. Second, prior works ((Small, 1973; Gipp and Beel, 2009)) have generally been shown to be effective (Kobayashi et al., 2018; Han et al., 2018) at retrieving the most frequently occurring co-citation pairs in history; however, co-citation pairs published in later years, or by non-scientific elites (Parker et al., 2013), may have lower historical frequencies. We consider that low-frequency co-citation pairs can also exhibit high topical relatedness, as studied by Small (1973), and therefore aim to improve the retrieval of both high- and low-frequency pairs by adopting noise distributions built on historical co-citation frequencies. Third, the “local context” is helpful for determining the citing intent in a micro-scope (Han et al., 2018; Zhang and Ma, 2020a). However, we argue that ignoring the rest of the body content may lead to inaccurate judgments when extracting the topic semantics of the manuscript. Fourth, prior works may be limited by representing content semantics with label-based embeddings (Han et al., 2018; Zhang and Ma, 2020a) that contain no content knowledge, or with content embeddings generated solely from abstracts (Cohan et al., 2020).

To this end, we propose the Multi-Positive BERT Model for Citation Recommendation (MP-BERT4CR), a content model for citation recommendation coupled with multi-positive triplet objectives. It has the following advantages: First, the proposed multi-positive triplet objective functions allow the algorithm to be optimized for recommending both single and multiple positive candidates, and they find multiple positive candidates more effectively than the conventional triplet objective. Second, we leverage historical co-citation frequencies to generate positive samples. The underlying logic is: if two (or more) citations historically appeared as co-citations, and one of them is cited by a manuscript at present, then the other co-cited papers are likely to be cited as well, as illustrated in Figure 1. We build noise distributions based on historical frequencies to draw co-citation samples, which is effective for retrieving both historically high- and low-frequency co-citations. Third, a dynamic context sampling strategy is proposed to extract the citing intents in a macro-scope by extracting the topic semantics from the remaining body content of a manuscript, and to produce citation embeddings carrying the content semantics of candidate papers. Recommendations made by matching the content semantics of candidate papers with the “macro-scoped” citing intents produced superior performances compared with the baselines.

Figure 1. Recommending multiple positive citations by adapting historical co-citation frequencies

We conduct experiments on single and multiple citation recommendation to evaluate MP-BERT4CR against the conventional baseline models in the field on four datasets. The results reveal that MP-BERT4CR provides significant improvements for both single and multiple positive recommendations compared to the baselines.

2. Related Work

2.1. Citation Recommendation

Citation recommendation denotes the task of finding relevant papers based on an input query. A line of studies uses a collection of seed papers as the input query to find topically relevant papers via collaborative filtering (McNee et al., 2002), random-walk methods (Gori and Pucci, 2006; Küçüktunç et al., 2013; Jia and Saule, 2018), or matrix decomposition methods (Caragea et al., 2013). These methods can be helpful when surveying a topic at an early stage. However, when applied to assist the writing of a paper, they generally give little consideration to detecting the citing intents of users. Context-based methods (Han et al., 2018; Zhang and Ma, 2020b) aim to extract the citing intent from an input local context (the words surrounding a target citation) and then find the most relevant papers based on the detected citing intent. Nevertheless, these methods still leave room for improvement. First, they only consider recommending one positive citation for a context, which may constrain usability; second, the local context may only reveal the “micro-scoped” citing intent, which cannot infer the topic of the manuscript in a macro-scope and may therefore lead to inaccurate judgments on the true citing intent of users; third, these methods adopt label-based embeddings without content semantics, which may further constrain the performances.

2.2. Document Embedding

Document embedding refers to the line of studies on representing documents as continuous vectors. The early approaches Word2Vec (Mikolov et al., 2013) and Doc2Vec (Le and Mikolov, 2014) were proposed to learn word and document embeddings by preserving contextual information via DNN-like networks. However, Word2Vec and Doc2Vec generally treat the input documents as “plain text”, which may lead to information loss. Later approaches fine-tune with information specific to academic papers, such as hyperlinks (Han et al., 2018), and section headers and word-wise relations in the local context (Zhang and Ma, 2020b). Recent language models such as BERT (Devlin et al., 2019) and SciBERT (Beltagy et al., 2019) are effective on multiple NLP tasks. In this study, we further extend language modeling for citation recommendation, considering multiple positive candidates.

3. Problem Definition

We adopt terminologies and definitions for academic papers, citation relationships, and citation recommendation partly following past studies (Han et al., 2018; Zhang and Ma, 2020b).

3.1. Terminologies and definitions

Let $w \in W$ represent a word from a vocabulary set $W$, and $\mathcal{H}_i$ represent the $i$-th paper from a text collection $\mathbb{H}$.

Definition 3.1 (Academic Paper).

The textual information of paper $\mathcal{H}_i$ is represented as its content words and cited papers, i.e., $\mathcal{H}_i := \hat{W}_i \cup \hat{\mathbb{H}}_i^c$, where $\hat{W}_i \subseteq W$ and $\hat{\mathbb{H}}_i^c \subset \mathbb{H}$.

Definition 3.2 (Citation Relationship).

Given a source paper $\mathcal{H}_s := \hat{W}_s \cup \hat{\mathbb{H}}_s^c$ that cites a target paper $\mathcal{H}_t := \hat{W}_t \cup \hat{\mathbb{H}}_t^c$, where $\mathcal{H}_t \in \hat{\mathbb{H}}_s^c$, a citation relationship is constructed and expressed as a tuple composed of the content words of the source paper and of the target paper, i.e., $\mathcal{R} := \langle \hat{W}_s, \hat{W}_t \rangle$.

3.2. Task Definition on Citation Recommendation

Given a query paper $\mathcal{H}_s$, a collection of papers $\mathbb{H}$, and an embedding model $\mathbb{E}$, the task is to find the top-$k$ ranked papers $\{\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_k\}$ based on the geometric distances between the embeddings of $\mathbb{H}$, i.e., $\mathbf{D} = \mathbb{E}(\mathbb{H})$, and the embedding of $\mathcal{H}_s$, i.e., $\mathbf{d}_s = \mathbb{E}(\mathcal{H}_s)$.
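To make the task definition concrete, the sketch below ranks candidate papers by the Euclidean distance between their embeddings and the query embedding (the experiments in Section 5.3 rank by cosine similarity instead); the function name and the NumPy layout are illustrative assumptions, not part of the original model.

```python
import numpy as np

def recommend_top_k(query_vec, candidate_vecs, k=10):
    """Rank candidate papers by Euclidean distance to the query embedding.

    query_vec:      d_s = E(H_s), shape (H,)
    candidate_vecs: D = E(collection), one row per paper, shape (N, H)
    Returns the indices of the k closest candidate papers."""
    dists = np.linalg.norm(candidate_vecs - query_vec, axis=1)
    return np.argsort(dists)[:k]
```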

4. MP-BERT4CR

The algorithm first composes the content words of the input papers into a hierarchical structure for encoding. Given a paper defined in Definition 3.1, $\mathcal{H}_i := \hat{W}_i \cup \hat{\mathbb{H}}_i^c$, the content words $\hat{W}_i$ are grouped into sentences, i.e., $\hat{W}_i^{categorized} := \{\mathcal{S}_1^i, \mathcal{S}_2^i, \ldots, \mathcal{S}_{|\mathcal{D}|}^i\}$, where $\mathcal{S}_j^i := \{w_1^j, w_2^j, \ldots, w_{|\mathcal{S}_j|}^j\}$. At the end of each sentence, a special end-of-sentence (EOS) token is added.

We adopt the hierarchical transformer (Zhang et al., 2019) to pre-train the model. However, we replace the EOS pooling of the original model with MEAN pooling at the sentence-level encoder.

(a) Pre-training model
(b) Fine-tuning model
Figure 2. Overview of MP-BERT4CR

4.1. Fine-tuning for Recommendation Modelling

Fine-tuning is conducted after pre-training to maximize the similarity between the citing intent of an input manuscript and the content semantics of its ground-truth citations.

We first conduct dynamic context sampling for two purposes: 1) to extract the citing intent from the input manuscript from both a “micro-scoped” and a “macro-scoped” view; 2) to generate content-dependent embeddings carrying content semantics for the citations.

We then adopt a triplet-encoder neural architecture (Reimers and Gurevych, 2019; Schroff et al., 2015) coupled with multi-positive triplet objectives to encode the sampled contexts, optimizing the algorithm for both single and multiple recommendations.

4.1.1. Dynamic Context Sampling

Dynamic context sampling aims to extract essential contexts from the manuscript around a predicting location to detect the micro- and macro-scoped citing intent, as well as to capture the content semantics of the cited papers and other positive samples. The sampled contexts comprise two components: a fundamental context, which functions as the backbone for inferring the citing intent or content semantics, and a supplemental context, which provides additional knowledge on the topic semantics of the paper. The sampling strategy is illustrated in Figure 2(b).

As illustrated in Figure 2(b), for a manuscript, the fundamental context is defined to include three sentences: the sentence that includes the target citation, the sentence before it, and the sentence after it. This is functionally the same as the local context defined in prior works (Han et al., 2018; Zhang and Ma, 2020b) to infer the citing intent in a micro-scope. The supplemental context is defined as a pre-set number of sentences randomly selected from the finished content (the sentences appearing before the citation, excluding the fundamental context) to infer the overall topic of the manuscript, and thus to help determine the citing intent in a macro-scope.

For a cited paper, the fundamental context is defined to be the sentences from the abstract, which works as the backbone for inferring the content semantics. The supplemental context is defined as a pre-set number of randomly selected sentences from the body content to provide additional information on the semantics.

The default settings of the dynamic context sampling strategy are as follows. For the manuscript, we set the total number of sentences to 30, which includes the fundamental context of three sentences (the sentence including the target citation, the sentence before it, and the sentence after it) and the supplemental context of 27 sentences, randomly selected from the content before the fundamental context. For a cited paper, we set the total number of sampled sentences to 40, which includes 10 sentences for the fundamental context (as abstracts contain 10 sentences on average in our datasets) and 30 for the supplemental context (randomly selected from the body content). The number of selected sentences was chosen according to the maximum memory of our GPUs. In addition, 40 sentences cover about 10% of the content in our datasets (papers contain 317 sentences on average). We presume that the abstract and 10% (at most 20% over 2 iterations of fine-tuning) of the body content are sufficient to capture the content semantics.
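A minimal sketch of the manuscript-side sampling described above is given below, assuming the manuscript has already been split into sentences; the helper name and data layout are hypothetical and only illustrate the fundamental/supplemental split with the default sizes (3 + 27 sentences).

```python
import random

def sample_manuscript_context(sentences, cite_idx, n_supplemental=27):
    """Build the sampled context for a manuscript (illustrative helper).

    sentences: all sentences of the manuscript written so far
    cite_idx:  index of the sentence containing the target citation"""
    # Fundamental context: the citing sentence plus its two neighbours.
    fundamental = sentences[max(cite_idx - 1, 0): cite_idx + 2]
    # Supplemental context: random sentences from the finished content
    # that precedes the fundamental context.
    finished = sentences[: max(cite_idx - 1, 0)]
    supplemental = random.sample(finished, min(n_supplemental, len(finished)))
    return fundamental + supplemental
```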

4.1.2. Neural Architecture

It comprises a manuscript encoder for encoding the sampled context from the manuscript regarding a ground-truth citation, a citation encoder for encoding the sampled context from the ground-truth citation and other positively sampled citations, and a pre-defined number of negative-citation encoders for encoding the negatively sampled citations (Figure 2(b)). The architecture was inspired by the prior works (Reimers and Gurevych, 2019; Schroff et al., 2015), where the three encoders are identical to the encoders proposed in the pre-training model of MP-BERT4CR. The parameters of the three encoders are initialized from the pre-trained encoder. During training, the parameters of the citation encoder and negative-citation encoders are completely shared, whereas the manuscript encoder is individually updated. Additionally, a sum pooling layer and a normalization layer (Ba et al., 2016) are added: the former summarizes the sentence vectors into a document vector, and the latter facilitates the convergence of the objective functions.
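The pooling head mentioned above can be sketched as follows; the module name and tensor shapes are assumptions, and only the sum pooling over sentence vectors followed by layer normalization (Ba et al., 2016) reflects the description in the text.

```python
import torch
import torch.nn as nn

class DocumentPooler(nn.Module):
    """Sum-pool sentence vectors into one document vector, then layer-normalize."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, sentence_vecs):
        # sentence_vecs: (num_sentences, hidden_size)
        doc_vec = sentence_vecs.sum(dim=0)  # sum pooling over sentences
        return self.norm(doc_vec)           # layer normalization aids convergence
```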

4.1.3. Multi-Positive Sampling

For a query context with multiple ground-truth citations, in addition to the target citation, we sample a pre-defined number of citations as additional positive samples according to their historical co-citation frequencies with the target citation. A noise distribution is built from the historical frequencies of co-citation pairs, so that both high- and low-frequency pairs are assigned a probability of being drawn. High-frequency pairs receive relatively higher probabilities, since a pair that has frequently been co-cited before is likely to be co-cited again. The probability for low-frequency pairs can be adjusted by the power value, which is set to $\frac{3}{4}$ by default in Equation 1.

Specifically, given $\mathcal{H}_{t*}$ as the target ground-truth citation for prediction, and $\mathbb{H}_p$ as the full list of co-citations, the algorithm samples a pre-defined number $n$ of positive citations from the paper collection $\mathbb{H}_d$ via the noise probability distribution (Mikolov et al., 2013):

(1) $\mathbf{p}(\mathcal{H}_{ti} \in \mathbb{H}_p) = \dfrac{\mathrm{frequency}(\mathcal{H}_{t*}, \mathcal{H}_{ti})^{\frac{3}{4}}}{\sum_{\mathcal{H}_{tj} \in \mathbb{H}_d} \mathrm{frequency}(\mathcal{H}_{t*}, \mathcal{H}_{tj})^{\frac{3}{4}}},$
Figure 3. Illustration of Optimization Strategies

where $\mathrm{frequency}$ denotes the number of times the two input papers appear as co-citations in the dataset. The number of positive samples is set to 3, as the number of co-citations with more than 3 items is negligibly small in our datasets.
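A possible implementation of the sampling step in Equation 1 is sketched below, assuming a hypothetical `cocitation_counts` mapping from a target paper id to the co-citation counts of candidate papers; the function name is also an assumption.

```python
import numpy as np

def sample_positives(target_id, cocitation_counts, n=3, power=0.75):
    """Draw up to n positive co-citations for `target_id` from the smoothed
    noise distribution of Equation 1."""
    counts = cocitation_counts[target_id]               # {candidate_id: frequency}
    candidates = list(counts.keys())
    freqs = np.array([counts[c] for c in candidates], dtype=np.float64)
    probs = freqs ** power                               # frequency^(3/4)
    probs /= probs.sum()                                  # normalize over candidates
    n = min(n, len(candidates))
    return list(np.random.choice(candidates, size=n, replace=False, p=probs))
```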

4.1.4. Negative Sampling

A negative sampling strategy is adopted to pick papers that are not cited by the input context as targets for similarity minimization, so that irrelevant papers do not appear in the top recommendation list.

The negative citations are sampled based on how often they occur as citations. The underlying intuition is that if a paper is frequently cited by other papers but is not cited by the input context, it should be drawn more often for similarity minimization against the input context. Similar to positive sampling, a noise distribution is constructed based on citation occurrences:

(2) $\mathbf{p}(\mathcal{H}_i \in \mathbb{H}) = \dfrac{\mathrm{count}(\mathcal{H}_i)^{\frac{3}{4}}}{\sum_{\mathcal{H}_j \in \mathbb{H}} \mathrm{count}(\mathcal{H}_j)^{\frac{3}{4}}},$

where $\mathbb{H}$ denotes all the papers in the dataset except the positive citations, and $\mathrm{count}$ denotes the number of times a paper occurs as a citation. A pre-defined number $m$ of papers are picked from the distribution as negative samples, denoted $\mathbb{H}_{ns}$. We set $m$ to 4 in this study.
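Analogously, negative samples can be drawn from Equation 2 as in the sketch below; `citation_counts` (paper id to citation count) and the function name are illustrative assumptions.

```python
import numpy as np

def sample_negatives(citation_counts, positive_ids, m=4, power=0.75):
    """Draw m negative samples from the citation-count noise distribution of
    Equation 2, excluding the positive citations."""
    candidates = [p for p in citation_counts if p not in positive_ids]
    counts = np.array([citation_counts[p] for p in candidates], dtype=np.float64)
    probs = counts ** power    # count^(3/4)
    probs /= probs.sum()
    return list(np.random.choice(candidates, size=m, replace=False, p=probs))
```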

4.1.5. Multi-Positive Triplet Objectives

The multi-positive triplet objectives are designed to optimize the algorithm for recommending multiple positive citations.

Suppose a piece of context in the manuscript $\mathcal{H}_s$ contains the ground-truth citation $\mathcal{H}_t$, along with $N$ co-citations of $\mathcal{H}_t$, i.e., $\mathcal{H}_c^1, \ldots, \mathcal{H}_c^N$. The algorithm retrieves $n$ positive samples $\{\mathcal{H}_c^i \mid 1 \leq i \leq n\}$ from Equation 1, and $m$ negative samples $\{\mathcal{H}_{ns}^j \mid 1 \leq j \leq m\}$ from Equation 2.

We propose multi-positive objectives that account for the multiple positive samples by modifying the original triplet objective (Reimers and Gurevych, 2019; Schroff et al., 2015), which considers one target and one negative sample:

(3) $\max(\|\mathbf{d}_s - \mathbf{d}_t\| - \|\mathbf{d}_s - \mathbf{d}_{ns}\| + \varepsilon, 0),$

where $\|\cdot\|$ denotes the Euclidean distance, and $\varepsilon$ is the margin, normally set to 1. The original triplet loss only guarantees that the embedding of the manuscript is geometrically close to one target citation. When applied to recommending multiple positive citations, this training objective may be limited.
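For reference, the conventional triplet objective in Equation 3 can be written in a few lines of PyTorch; this is a generic sketch, not the authors' exact implementation.

```python
import torch

def triplet_loss(d_s, d_t, d_ns, margin=1.0):
    """Conventional triplet objective (Equation 3): pull the manuscript
    embedding d_s toward the target citation d_t and push it away from a
    single negative sample d_ns by at least `margin`."""
    pos = torch.norm(d_s - d_t)    # ||d_s - d_t||
    neg = torch.norm(d_s - d_ns)   # ||d_s - d_ns||
    return torch.clamp(pos - neg + margin, min=0.0)
```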

Hence, three multi-positive triplet objectives are proposed, so that the embedding of the manuscript $\mathbf{d}_s$ is not only geometrically close to the target citation $\mathbf{d}_t$, but also close to the other positive citations $\{\mathbf{d}_c^i \mid 1 \leq i \leq n\}$. Meanwhile, the manuscript embedding should be distant from the $m$ negative embeddings $\{\mathbf{d}_{ns}^j \mid 1 \leq j \leq m\}$.

Three strategies were designed to achieve the objective, which are illustrated in Figure 3:

  • “Target-based” optimization strategy: Minimize the distance between $\mathbf{d}_s$ and $\mathbf{d}_t$, and the distances between $\mathbf{d}_t$ and $\{\mathbf{d}_c^i \mid 1 \leq i \leq n\}$. Hence, when $\mathbf{d}_t$ is found to be similar, the embeddings $\{\mathbf{d}_c^i \mid 1 \leq i \leq n\}$ can also be recommended.

  • “Source-based” optimization strategy: Minimize the distance between $\mathbf{d}_s$ and $\mathbf{d}_t$, and the distances between $\mathbf{d}_s$ and $\{\mathbf{d}_c^i \mid 1 \leq i \leq n\}$, so that given $\mathbf{d}_s$, both the target and the positive citations can be retrieved.

  • “Source-target-based” optimization strategy: Combining the “target-based” and “source-based” strategies, it minimizes the distances between $\mathbf{d}_s$ and both the target and positive embeddings, as well as the distances between the target and positive embeddings.

Based on the N-tuplet loss function (Sohn, 2016), we propose three designs of multi-positive triplet objectives following the aforementioned strategies:

  • Multi-positive target-based triplet (mpt-tgt):

    (4) $\mathcal{L}_{mpt\text{-}tgt}=\log\big(1+\sum^{n}_{i=1}\sum^{m}_{j=1}\exp(\|\mathbf{d}_{s}-\mathbf{d}_{t}\|-\|\mathbf{d}_{s}-\mathbf{d}_{ns}^{j}\|)+\exp(\|\mathbf{d}_{c}^{i}-\mathbf{d}_{t}\|-\|\mathbf{d}_{c}^{i}-\mathbf{d}_{ns}^{j}\|)\big)$
  • Multi-positive manuscript-based triplet (mpt-src):

    (5) $\mathcal{L}_{mpt\text{-}src}=\log\big(1+\sum^{m}_{j=1}\exp(\|\mathbf{d}_{s}-\mathbf{d}_{t}\|-\|\mathbf{d}_{s}-\mathbf{d}_{ns}^{j}\|)+\sum^{m}_{j=1}\sum^{n}_{i=1}\exp(\|\mathbf{d}_{s}-\mathbf{d}_{c}^{i}\|-\|\mathbf{d}_{s}-\mathbf{d}_{ns}^{j}\|)\big)$
  • Multi-positive source-target-based triplet (mpt-src-tgt):

    (6) $\mathcal{L}_{mpt\text{-}src\text{-}tgt}=\log\big(1+\sum^{m}_{j=1}\exp(\|\mathbf{d}_{s}-\mathbf{d}_{t}\|-\|\mathbf{d}_{s}-\mathbf{d}_{ns}^{j}\|)+\sum^{m}_{j=1}\sum^{n}_{i=1}\exp(\|\mathbf{d}_{s}-\mathbf{d}_{c}^{i}\|-\|\mathbf{d}_{s}-\mathbf{d}_{ns}^{j}\|)+\sum^{m}_{j=1}\sum^{n}_{i=1}\exp(\|\mathbf{d}_{t}-\mathbf{d}_{c}^{i}\|-\|\mathbf{d}_{t}-\mathbf{d}_{ns}^{j}\|)\big)$

We set the maximum number of positive samples to 3, since the number of co-citations with more than 3 papers is negligibly small in our datasets. Mpt-src-tgt is set as the default objective for the experiments in Section 5.3, since it proves to be the most effective objective according to Section 5.5. A sketch of this default objective is given below.
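The following PyTorch sketch expresses the mpt-src-tgt objective of Equation 6 under assumed tensor shapes; it is an illustrative reading of the equation rather than the authors' implementation.

```python
import torch

def mpt_src_tgt_loss(d_s, d_t, d_pos, d_neg):
    """mpt-src-tgt objective (Equation 6).

    d_s, d_t: (H,) manuscript and target-citation embeddings
    d_pos:    (n, H) additional positive co-citation embeddings
    d_neg:    (m, H) negative-sample embeddings"""
    def dist(a, b):
        return torch.norm(a - b, dim=-1)  # Euclidean distance

    # manuscript-target pair against each negative sample
    t1 = torch.exp(dist(d_s, d_t) - dist(d_s.unsqueeze(0), d_neg)).sum()
    # manuscript-positive pairs against each negative sample
    t2 = torch.exp(dist(d_s.unsqueeze(0), d_pos).unsqueeze(1)
                   - dist(d_s.unsqueeze(0), d_neg).unsqueeze(0)).sum()
    # target-positive pairs against each negative sample
    t3 = torch.exp(dist(d_t.unsqueeze(0), d_pos).unsqueeze(1)
                   - dist(d_t.unsqueeze(0), d_neg).unsqueeze(0)).sum()
    return torch.log(1 + t1 + t2 + t3)
```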

5. Experiments

5.1. Dataset

Four datasets were adopted for the experiments: the ACL Anthology (2013 release) and three datasets generated from the DBLP corpus. The ACL Anthology corpus includes 20,405 papers with 108,729 citations, whereas the DBLP corpus contains 649,114 papers with 2,874,303 citations. Three datasets were produced from the DBLP corpus, i.e., DBLP-1, DBLP-2, and DBLP-3, each of which includes 50,000 papers. A “biased-individuality” dataset-generating strategy was adopted to produce the three DBLP datasets, by which DBLP-2 and DBLP-3 share 20% of their papers (10,000) in common to evaluate the stability of the model’s performance, whereas DBLP-1 contains papers completely different from DBLP-2 and DBLP-3. The complete ACL corpus was adopted as the fourth dataset. We split the datasets into train and test sets as listed in Table 1.

Table 1. Statistics of Datasets
Paper No. (Total / Train / Test) Train Cit No. Test Cit for P\geq1 Test Cit for P=1 Test Cit for P\geq2
DBLP-1 50,000 / 40,000 / 10,000 74,153 21,688 20,537 1,151
DBLP-2 50,000 / 40,000 / 10,000 128,380 26,300 24,643 1,657
DBLP-3 50,000 / 40,000 / 10,000 103,467 28,382 27,066 1,316
ACL 20,406 / 16,325 / 4,081 31,017 21,420 15,783 188

5.2. Implementation Details

MP-BERT4CR was developed based on Fairseq 0.4.0 (Ott et al., 2019), Gensim 2.3.0 (Řehůřek and Sojka, 2010), and PyTorch 1.6.0 (Paszke et al., 2019). For the baseline models, Word2Vec and Doc2Vec were implemented using Gensim 2.3.0; HyperDoc2Vec was developed based on Gensim 2.3.0; DACR was developed based on PyTorch 1.6.0 and Gensim 2.3.0; SciBERT and Specter were implemented using Huggingface 4.2.0 (Wolf et al., 2020).

The Adam optimizer (Kingma and Ba, 2015) was adopted to optimize MP-BERT4CR with the parameters listed in Table 2. The batch sizes are set to 7 for pre-training and 1 for fine-tuning. We run 10 iterations of pre-training and 2 iterations of fine-tuning on the DBLP-1, DBLP-2, and DBLP-3 datasets, and 50 iterations of pre-training and 5 iterations of fine-tuning on the ACL dataset. The number of negative samples is set to 4, and the maximum number of positive samples is set to 3.

We adopted six baseline models: Word2Vec (W2V) (Mikolov et al., 2013), Doc2Vec (D2V) (Le and Mikolov, 2014), HyperDoc2Vec (HD2V) (Han et al., 2018), DACR (Zhang and Ma, 2020b), SciBERT (Beltagy et al., 2019), and Specter (Cohan et al., 2020). We train the models with default parameters. SciBERT and Specter were fine-tuned with the abstract combined with the fundamental contexts (i.e., augmented abstracts), so that the learned vectors carry the semantics of the abstract and the citing intents.

Table 2. Parameters of MP-BERT4CR
Transformer Params (Sent, Doc Encoder, and Decoder) Block No. (L) Hidden Size (H) Attention No. (A)
6 768 12
Optimizer Params Update Schedule Warmup Updates Update Frequency
Inverse Square Root 1,000 (ACL) / 2,000 (DBLP) 4 (pretrain) / 8 (finetune)
Warmup Learning Rate Learning Rate Weight Decay
1e-7 (pretrain) / 1e-9 (finetune) 1e-4 (pretrain) / 2e-5 (finetune) 0.01
β1\beta_{1} β2\beta_{2} Dropout
0.9 0.999 (pretrain) / 0.98 (finetune) 0.1
Table 3. Recommendation Scores for Single and Multiple Positive Citations (* p<0.05p<0.05 for paired t test against best baselines)
Datasets DBLP-1 DBLP-2 DBLP-3 ACL
No. of Pst. Cits positive == 1 positive \geq 1 positive \geq 2 positive == 1 positive \geq 1 positive \geq 2 positive == 1 positive \geq 1 positive \geq 2 positive == 1 positive \geq 1 positive \geq 2
Metrics Recall MAP Recall MAP Recall MAP Recall MAP Recall MAP Recall MAP Recall MAP Recall MAP Recall MAP Recall MAP Recall MAP Recall MAP
W2V 13.26 6.29 13.26 6.29 13.48 7.10 19.25 10.25 19.35 10.38 20.81 12.41 17.76 9.09 17.76 9.09 17.77 9.15 17.16 8.2 17.13 8.25 14.37 10.37
D2V 2.07 0.48 2.07 0.48 2.17 0.68 1.27 0.29 1.27 0.30 1.26 0.38 2.44 0.78 2.45 0.79 2.48 0.92 2.22 5.73 2.23 5.75 2.69 0.74
HD2V 34.15 17.50 34.15 17.50 41.01 22.49 40.12 21.56 40.53 21.86 46.59 26.22 40.64 21.64 40.67 21.64 41.35 21.55 30.91 16.40 30.83 16.34 23.63 11.97
DACR 22.42 10.34 22.42 10.34 25.06 11.94 27.42 14.07 27.79 14.36 33.21 18.51 27.15 13.48 27.14 13.45 26.89 12.92 24.31 12.21 24.25 12.19 19.49 10.93
SciBERT 8.95 4.03 8.95 4.03 10.23 5.23 7.79 3.40 7.87 3.46 9.06 4.39 10.42 4.64 10.51 4.70 12.45 6.01 0.16 0.03 0.16 0.03 0.21 0.06
Specter 4.51 2.26 4.51 2.26 5.42 3.72 3.42 1.67 3.45 1.74 3.89 2.71 4.51 2.26 4.51 2.26 5.42 3.72 3.42 1.67 3.45 1.74 3.89 2.71
MB4Rno_dynamic 38.97 18.98 39.42 19.33 47.29 25.48 39.51 18.71 40.00 19.03 47.21 23.67 43.35 21.69 43.77 21.96 52.30 27.54 30.47 14.31 30.50 14.34 32.78 17.09
MB4Rtriplet 40.07 18.08 40.34 18.33 45.04 22.79 40.44 19.51 40.79 20.01 48.24 27.73 47.52 22.61 48.05 22.98 59.02 30.24 30.82 13.23 30.92 13.29 39.55 18.30
MB4Rmpt-src-tgt 44.81* 21.45* 45.21* 21.77* 52.49* 27.50* 47.25* 23.41* 47.66* 23.76* 53.69* 29.10* 48.86* 23.42* 49.38* 23.73* 60.05* 30.60* 36.13* 16.90* 36.18* 16.93* 40.61* 19.89*

5.3. Recommendation Methodology

We present the methods for generating recommendations for MP-BERT4CR (hereafter MB4R) and the baselines. Recall and MAP at top 10 recommendations are reported for the analyses.

For MB4R, we first conduct dynamic sampling for the papers in the test and train sets. The sampled manuscript and citation contexts are then encoded via the fine-tuned manuscript encoder or citation encoder to obtain query vectors and citation vectors. Recommendations are selected as the top 10 citations by cosine similarity.

For W2V, D2V, HD2V, and DACR, we use the same method as the original studies (Han et al., 2018; Zhang and Ma, 2020b). The query vectors are computed by averaging the word vectors of the local context. The citation vectors are computed via different methods: for W2V, we use the input embeddings of the citation ids as citation vectors; for D2V, we use the trained model to infer citation vectors from the content words; and for HD2V and DACR, we use the output embeddings of the citation ids as citation vectors. For SciBERT and Specter, we generate query vectors from the sampled fundamental context. The citation vectors are produced by encoding and sum-pooling the augmented abstracts. Recommendations are made by cosine similarity.

Table 4. Comparison on Multi-Positive Triplet Objectives with Conventional Triplet on DBLP-1
No. of Positive Cits positive = 1 positive \mathbf{\geq} 1 positive \geq 2
Metrics Recall MAP Recall MAP Recall MAP
Best Baseline 34.15 17.50 34.15 17.50 41.01 22.49
MB4Rtriplet 40.07 18.08 40.34 18.33 45.04 22.79
MB4RMpt-src 41.86 19.99 42.23 20.22 48.82 24.26
MB4RMpt-tgt 44.85 21.69 45.10 21.94 49.68 26.53
MB4RMpt-src-tgt(default) 44.81 21.45 45.21 21.77 52.49 27.50

5.4. Analysis on Citation Recommendations

5.4.1. Single Positive

Two points can be drawn from Table 3 ($positive = 1$ cases). First, MB4R outperformed all the baseline models across all the datasets and metrics by significant margins: the superiority of MB4Rtriplet over the baselines confirms the effectiveness of the proposed neural architecture, and MB4Rmpt-src-tgt further confirms the effectiveness of the proposed multi-positive objectives. Second, the multi-positive objective functions are not only helpful for identifying multiple ground-truth citations; the single-positive performances also improve when comparing MB4Rmpt-src-tgt with MB4Rtriplet. In addition, owing to the hierarchical transformer and the dynamic sampling strategy, MB4R outperformed SciBERT and Specter.

5.4.2. Multiple Positive

According to the scores with multiple ground-truth citations (cases of $positive \geq 1$ and $positive \geq 2$) in Table 3, MB4Rmpt-src-tgt remained effective for multiple positive recommendations compared with the baselines and MB4Rtriplet. However, since there are far fewer $positive \geq 2$ test samples than $positive \geq 1$ test samples, the ability of the multi-positive objectives might not be fully demonstrated, especially on the DBLP-3 and ACL datasets. We will produce datasets with a higher number of positive samples for further tests in future work.

5.5. Comparisons on Different Designs of Multi-Positive Objectives

We compare the three designs of multi-positive objectives in Table 4. First, all three proposed multi-positive objectives (MB4Rmpt-src, MB4Rmpt-tgt, and MB4Rmpt-src-tgt) outperformed the best baseline (HD2V) and the original triplet objective (MB4Rtriplet). Second, the two target-based strategies, MB4Rmpt-tgt and MB4Rmpt-src-tgt, are superior to the source-based strategy (MB4Rmpt-src), which indicates that the distances between the ground-truth candidate and the other positive candidates play the central role in multiple retrieval. Third, MB4Rmpt-src-tgt produced superior performance compared with MB4Rmpt-tgt for the $positive \geq 2$ case, but very close performance for the $positive = 1$ and $positive \geq 1$ cases, and hence it was set as the default. We also tested MB4R with and without dynamic sampling for an ablation analysis; comparing the results of MB4Rmpt-src-tgt and MB4Rno_dynamic shows that the dynamic sampling mechanism significantly improves the performance.

6. Conclusion

In this study, we proposed MP-BERT4CR, a content model for multi-positive citation recommendation. It has the following advantages: first, it comes with a series of multi-positive objectives to optimize the model for multi-positive recommendation; second, it can effectively recommend co-citations by leveraging historical co-citation patterns; third, the proposed dynamic context sampling strategy empowers the citation embeddings to carry content semantics, further improving the performance.

Acknowledgements.
This research has been supported in part by JSPS KAKENHI under Grant Number 19H04116.

References

  • (1)
  • Ba et al. (2016) Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. CoRR abs/1607.06450 (2016). arXiv:1607.06450
  • Beltagy et al. (2019) Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In EMNLP.
  • Caragea et al. (2013) Cornelia Caragea, Adrian Silvescu, Prasenjit Mitra, and C. Lee Giles. 2013. Can’t see the forest for the trees?: a citation recommendation system. In JCDL.
  • Cohan et al. (2020) Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S. Weld. 2020. SPECTER: Document-level Representation Learning using Citation-informed Transformers. In ACL.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL2019.
  • Gipp and Beel (2009) Bela Gipp and Jöran Beel. 2009. Citation Proximity Analysis (CPA) : A New Approach for Identifying Related Work Based on Co-Citation Analysis. In ISSI. São Paulo.
  • Gori and Pucci (2006) Marco Gori and Augusto Pucci. 2006. Research Paper Recommender Systems: A Random-Walk Based Approach. In WI. 778–781.
  • Han et al. (2018) Jialong Han, Yan Song, Wayne Xin Zhao, Shuming Shi, and Haisong Zhang. 2018. hyperdoc2vec: Distributed Representations of Hypertext Documents. In ACL.
  • Jia and Saule (2017) Haofeng Jia and Erik Saule. 2017. An Analysis of Citation Recommender Systems: Beyond the Obvious. In ASONAM.
  • Jia and Saule (2018) Haofeng Jia and Erik Saule. 2018. Local Is Good: A Fast Citation Recommendation Approach. In ECIR, Vol. 10772.
  • Johnson et al. (2018) Rob Johnson, Anthony Watkinson, and Michael Mabe. 2018. The STM Report: An overview of scientific and scholarly publishing. International Association of Scientific, Technical and Medical Publishers (2018), 1–214.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
  • Kobayashi et al. (2018) Yuta Kobayashi, Masashi Shimbo, and Yuji Matsumoto. 2018. Citation Recommendation Using Distributed Representation of Discourse Facets in Scientific Articles. In JCDL.
  • Küçüktunç et al. (2013) Onur Küçüktunç, Erik Saule, Kamer Kaya, and Ümit V. Çatalyürek. 2013. Towards a personalized, scalable, and exploratory academic recommendation service. In ASONAM.
  • Le and Mikolov (2014) Quoc V. Le and Tomás Mikolov. 2014. Distributed Representations of Sentences and Documents. In ICML, Vol. 32.
  • McNee et al. (2002) Sean M. McNee, István Albert, Dan Cosley, Prateep Gopalkrishnan, Shyong K. Lam, Al Mamunur Rashid, Joseph A. Konstan, and John Riedl. 2002. On the recommending of citations for research papers. In CSCW. 116–125.
  • Mikolov et al. (2013) Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In ICLR.
  • Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
  • Parker et al. (2013) John N. Parker, Stefano Allesina, and Christopher J. Lortie. 2013. Characterizing a scientific elite (B): publication and citation patterns of the most highly cited scientists in environmental science and ecology. Scientometrics 94, 2 (2013), 469–480.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NIPS.
  • Řehůřek and Sojka (2010) Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In LREC Workshop.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP.
  • Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In CVPR.
  • Small (1973) Henry Small. 1973. Co-citation in the scientific literature: A new measure of the relationship between two documents. J. Am. Soc. Inf. Sci. 24, 4 (1973), 265–269.
  • Sohn (2016) Kihyuk Sohn. 2016. Improved Deep Metric Learning with Multi-class N-pair Loss Objective. In NIPS.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In EMNLP.
  • Zhang et al. (2019) Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization. In ACL.
  • Zhang and Ma (2020a) Yang Zhang and Qiang Ma. 2020a. DocCit2Vec: Citation Recommendation via Embedding of Content and Structural Contexts. IEEE Access 8 (2020), 115865–115875.
  • Zhang and Ma (2020b) Yang Zhang and Qiang Ma. 2020b. Dual Attention Model for Citation Recommendation. In COLING.