

Link Prediction for Wikipedia Articles as a Natural Language Inference Task

Chau-Thang Phan1,2,§, Quoc-Nam Nguyen1,2,§, and Kiet Van Nguyen1,2,⋆
1Faculty of Information Science and Engineering, University of Information Technology, Ho Chi Minh City, Vietnam
2Vietnam National University, Ho Chi Minh City, Vietnam
{20520929, 20520644}@gm.uit.edu.vn, [email protected]
Abstract

The link prediction task is vital for automatically understanding the structure of large knowledge bases. In this paper, we present our system for this task at the Data Science and Advanced Analytics 2023 Competition "Efficient and Effective Link Prediction" (DSAA-2023 Competition) [1], whose corpus contains 948,233 training pairs and 238,265 public test pairs. We introduce an approach to link prediction for Wikipedia articles that formulates it as a natural language inference (NLI) task. Drawing inspiration from recent advancements in natural language processing and understanding, we cast link prediction as an NLI task: the presence of a link between two articles is treated as a premise, and the task is to determine whether this premise holds based on the information presented in the articles. We implemented our system as sentence pair classification for the link prediction for Wikipedia articles task. Our system achieved a 0.99996 macro F1-score and a 1.00000 macro F1-score on the public and private test sets, respectively. Our team, UIT-NLP, ranked 3rd on the private test set, equal to the scores of the first and second places. Our code is publicly available for research purposes at https://github.com/phanchauthang/dsaa-2023-kaggle/.

Index Terms:
Link prediction, Natural Language Inference, DSAA2023 Competition
§ Equal contribution. ⋆ Corresponding author.

I Introduction

Wikipedia (https://www.wikipedia.org/), the world’s largest collaborative encyclopedia, has become an invaluable resource for obtaining knowledge on a wide range of topics. With millions of articles covering diverse subjects, Wikipedia offers a vast repository of information that is constantly expanding. However, despite its impressive size, the interlinking between articles within Wikipedia is not always comprehensive, leading to gaps in information connectivity.

Link prediction, a fundamental task in network analysis, aims to predict missing links in a given network based on existing connections. In the context of Wikipedia, link prediction becomes particularly relevant as it can help enhance the navigability of the encyclopedia, improve information accessibility, and promote a deeper understanding of related topics. The DSAA 2023 challenge [1] focuses on the link prediction task applied to Wikipedia articles: it calls for methods for link prediction in network-like data structures, such as network reconstruction and network development, using Wikipedia articles as the primary data source.

Natural Language Inference (NLI) is the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise" [2]. The link prediction for Wikipedia articles task is defined as follows: given a sparsified sub-graph of the Wikipedia network, predict whether a link exists between two Wikipedia pages. In this paper, we leverage the inherent similarities between NLI and link prediction for Wikipedia articles. Drawing upon this connection, we apply the widely used Sentence Pair Classification (SPC) technique, commonly employed in NLI, to the link prediction task.

This paper addresses the link prediction challenge for Wikipedia articles by combining SPC based on NLI and pre-processing techniques. Traditional methods for link prediction in Wikipedia often rely on graph-based algorithms or textual similarity measures [3], which may overlook the nuanced relationships embedded within the text of articles. By integrating sentence pair classification and pre-processing techniques, we aim to capture the semantic and contextual information in Wikipedia articles more effectively, improving link prediction accuracy.

Our contributions are summarized as follows:

  • Firstly, we adopted efficient data pre-processing techniques to cleanse the textual content obtained from Wikipedia. These techniques elevate the overall quality of the data and yield significant improvements in the extraction of relevant information before training the model for the link prediction task.

  • Secondly, we leverage the similarities between NLI and link prediction for Wikipedia articles by applying the widely used technique of SPC from NLI to link prediction.

  • Finally, our system achieved a macro F1-score of 0.99996 on the public test set and a perfect 1.00000 on the private test set, ranking 8th and 3rd, respectively.

II Related Works

II-A Link Prediction

Link prediction, a critical issue in network-structured data [3], was initially introduced as a supervised learning task by Al Hasan et al. [4]. They also identified key performance features within the supervised learning framework. Building upon the concept of analyzing user-item interactions as graphs, Huang et al. [5] utilized link prediction techniques from the recent network modeling literature to enhance collaborative filtering recommendations. In contrast, Liu and Lü [6] proposed a method based on local random walk, which achieves competitive or even superior predictions compared to other random-walk-based approaches while maintaining lower computational complexity. Another approach suggested by Trouillon et al. [7] involves solving the link prediction task through latent factorization, enabling scalability to large datasets with linear space and time complexity. Additionally, the heuristic learning paradigm for link prediction was explored by Zhang and Chen [8]. Furthermore, Negi and Chaudhury [9] framed the link prediction task in heterogeneous networks as a multi-task metric learning task.

II-B Natural Language Inference

In the field of artificial intelligence, inference has always been a prominent topic. While significant advancements have been made in automatic methods for formal deduction, the Natural Language Inference (NLI) task has seen comparatively slower progress. NLI refers to the task of determining the validity of inferring a natural language hypothesis (h) from a natural language premise (p) [10]. Bowman et al. [11] introduced the Stanford Natural Language Inference corpus, which consists of labeled sentence pairs created by humans engaged in a novel task based on image captioning. This corpus is freely accessible and serves as a valuable resource. Wang and Jiang [12] proposed a specialized architecture based on long short-term memory (LSTM) networks for NLI. Their approach utilizes a match-LSTM that conducts word-by-word comparisons between the hypothesis and the premise, which enhances the performance of NLI. Chen et al. [13] demonstrated that carefully designing sequential inference models using chain LSTMs can surpass previous models in terms of performance.

III Algorithmic Techniques

This section presents the definition of the link prediction for Wikipedia articles task, the dataset, our pre-processing techniques, and our proposed approach for the DSAA 2023 challenge [1].

III-A Task Definition

The task involves predicting the presence or absence of an edge between a pair of nodes (u, v). In this competition, our focus is on link prediction for Wikipedia articles. To be more precise, given a sparsified subgraph of the Wikipedia network, the objective is to determine whether a link exists between two pages, u and v, in the context of Wikipedia.
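In our own notation (not the organizers’), this amounts to binary classification over node pairs:

    ŷ(u, v) = f(text(u), text(v)) ∈ {0, 1},

where text(·) denotes the textual description attached to a node, f is the trained classifier, and ŷ(u, v) = 1 predicts that an edge between u and v exists.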

III-B DSAA-2023 Dataset

The dataset used in the contest is extracted from Wikipedia, where graph nodes are annotated with text. The data of the DSAA-2023 Competition include the following:

  • train.csv file: It contains pairs of nodes along with an indication of whether there is an edge between them. The file has four columns: id, id1, id2, and label. The id column is the pair identifier, id1 and id2 are the node IDs, and the label column declares the presence or absence of an edge (0 or 1).

  • nodes.zip file: This file is a compressed version of nodes.tsv and contains information about each node. It consists of two columns: id (node identifier) and text (textual description of the node).

  • sample_submission.csv file: This demo submission file has two columns: the pair ID and the corresponding label (0 or 1).

The statistics of the training set labels after combining train.csv and nodes.tsv files together are presented in Table I.

            Non-related (0)   Related (1)
Frequency   512,389           435,843
Percentage  54.03%            45.97%

TABLE I: Statistics of the Training Set Labels.
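To make the data layout concrete, the following is a minimal pandas sketch of this merge; the file paths and variable names are our own, not part of the competition kit:

import pandas as pd

# Load the edge pairs and the node texts described above.
train = pd.read_csv("train.csv")             # columns: id, id1, id2, label
nodes = pd.read_csv("nodes.tsv", sep="\t")   # columns: id, text

# Attach the textual description of each endpoint to every pair.
train = train.merge(nodes.rename(columns={"id": "id1", "text": "text1"}), on="id1")
train = train.merge(nodes.rename(columns={"id": "id2", "text": "text2"}), on="id2")

# Label distribution should match Table I: 512,389 zeros and 435,843 ones.
print(train["label"].value_counts())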

III-C Pre-Processing Techniques

The dataset consists of the Wikipedia objects’ main content together with CSS code embedded in the Wikipedia markup. These CSS codes are noise and are not required for training, so it is crucial to remove them. To accomplish this, we employ two algorithms, Balancing Curly Braces (Algorithm 1) and Removing Double Curly Braces (Algorithm 2), to effectively eliminate this unnecessary code. Punctuation marks, such as periods, commas, question marks, and exclamation marks, as well as redundant spaces, can introduce noise and disrupt the natural flow of text during training; therefore, we apply the re library (https://docs.python.org/3/library/re.html) to remove all of them. In summary, our pre-processing for this task of the DSAA-2023 Competition consists of: balancing curly braces, removing double curly braces, removing all punctuation, and removing redundant spaces.

Algorithm 1 Balancing Curly Braces
1:  procedure BalanceCurlyBraces(text)
2:      opening_count ← text.count(‘{’)
3:      closing_count ← text.count(‘}’)
4:      if opening_count > closing_count then
5:          while opening_count > closing_count do
6:              index ← text.find(‘{’)
7:              if index ≠ −1 then
8:                  text ← text[:index] + text[index + 1:]
9:                  opening_count ← opening_count − 1
10:     else if closing_count > opening_count then
11:         while closing_count > opening_count do
12:             index ← text.rfind(‘}’)
13:             if index ≠ −1 then
14:                 text ← text[:index] + text[index + 1:]
15:                 closing_count ← closing_count − 1
16:     return text.strip()
Algorithm 2 Removing Double Curly Braces
1:  procedure RemoveDoubleCurlyBraces(text)
2:      stack ← []
3:      clean_text ← ‘’
4:      for char in text do
5:          if char = ‘{’ then
6:              stack.push(char)
7:          else if char = ‘}’ then
8:              if not isEmpty(stack) and top(stack) = ‘{’ then
9:                  stack.pop()
10:         else
11:             if isEmpty(stack) then
12:                 clean_text ← clean_text + char
13:     return clean_text
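For reference, below is a direct Python transcription of Algorithms 1 and 2 combined with the punctuation and whitespace removal described above; the exact regular expressions are our assumption of what "removing all punctuation and redundant spaces" entails:

import re

def balance_curly_braces(text):
    """Algorithm 1: delete unmatched '{' from the left or '}' from the right."""
    opening = text.count("{")
    closing = text.count("}")
    while opening > closing:
        index = text.find("{")
        if index == -1:
            break
        text = text[:index] + text[index + 1:]
        opening -= 1
    while closing > opening:
        index = text.rfind("}")
        if index == -1:
            break
        text = text[:index] + text[index + 1:]
        closing -= 1
    return text.strip()

def remove_double_curly_braces(text):
    """Algorithm 2: keep only characters outside any {...} region."""
    depth = 0
    kept = []
    for char in text:
        if char == "{":
            depth += 1
        elif char == "}":
            if depth > 0:
                depth -= 1
        elif depth == 0:
            kept.append(char)
    return "".join(kept)

def preprocess(text):
    """Full pipeline: balance braces, strip templates, punctuation, spaces."""
    text = balance_curly_braces(text)
    text = remove_double_curly_braces(text)
    text = re.sub(r"[^\w\s]", " ", text)       # remove punctuation (our reading)
    return re.sub(r"\s+", " ", text).strip()   # collapse redundant spaces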

III-D Our Proposed Approach

Our approach draws inspiration from recent successes in natural language understanding tasks, particularly in pre-trained language models. We posit that a link between two Wikipedia articles can be treated as a premise, and predicting the link’s existence is akin to determining the logical relationship between the premise and a hypothesis. We encode the content of the two articles as the premise and the hypothesis, respectively, and utilize this NLI setup for link prediction.

In sentence pair classification, the model takes two sentences as input and aims to determine their relationship, learning to classify the pairs into categories according to the task at hand. For link prediction on Wikipedia articles, the task is to predict whether there is an edge between the two given nodes.

In this paper, we implement a sentence pair classification model based on XLM-RoBERTa, a well-known architecture introduced by Conneau et al. [14] that is widely used for Natural Language Inference. We tailor this setup to the Wikipedia article link prediction task and add a linear prediction layer on top of the XLM-RoBERTa encoder to compute the final output. The architecture of our proposed approach is presented in Figure 1.
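As an illustration of this setup, the sketch below encodes two article texts jointly (premise/hypothesis style) and scores them with a linear classification head on top of XLM-RoBERTa via the Hugging Face transformers library; the checkpoint name and placeholder texts are assumptions, not the exact competition configuration:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2
)

premise = "Cleaned text of article u ..."     # placeholder content
hypothesis = "Cleaned text of article v ..."  # placeholder content

# Encode both texts as one sequence pair, as in NLI.
inputs = tokenizer(premise, hypothesis, truncation=True,
                   max_length=128, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1)  # [P(no link), P(link)]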

Figure 1: Our Approach Architecture for Link Prediction for Wikipedia Articles.

IV Experimental Results

IV-A Experimental Configurations

We followed fairly standard fine-tuning practices, most of which are described in [15]: a batch size of 128, a maximum token length of 128, a learning rate of 2e-5, and the AdamW optimizer with an epsilon of 1e-8.

We implemented the sentence pair classification model for our system using simpletransformers (https://simpletransformers.ai/, ver. 0.63.11). We trained our model on one RTX 4090 GPU on the Vast.ai (https://vast.ai) platform for four hours; predicting the test set takes about ten seconds.
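A minimal sketch of this configuration with simpletransformers is shown below; train_df and test_df are assumed to be DataFrames with columns text_a, text_b, and labels (built, for instance, from the merge in Section III-B), and the epoch count is not specified in this paper:

from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Hyperparameters as reported above; other settings keep library defaults.
args = ClassificationArgs(
    train_batch_size=128,
    max_seq_length=128,
    learning_rate=2e-5,
    adam_epsilon=1e-8,
)

# XLM-RoBERTa with a binary classification head for sentence pairs.
model = ClassificationModel("xlmroberta", "xlm-roberta-base",
                            num_labels=2, args=args)
model.train_model(train_df)

# Predict edge / no edge for each test pair of article texts.
predictions, raw_outputs = model.predict(
    [[a, b] for a, b in zip(test_df["text_a"], test_df["text_b"])]
)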

IV-B Experimental Results

In this section, we describe our experiments and results for the link prediction task. Only the macro F1-score is used for evaluation. Experimental results are presented in Table II.

By implementing robust and efficient data pre-processing techniques tailored explicitly for cleansing the textual content obtained from Wikipedia, we aimed to enhance overall data quality and substantially improve the extraction of relevant information before training the model for the link prediction task. Combining these pre-processing techniques with the sentence pair classification model resulted in a notable performance improvement compared to scenarios where such techniques were not applied.

Our proposed approach achieved remarkable results, with a macro F1-score of 0.99996 for the public test and a perfect score of 1.00000 for the private test. These exceptional scores indicate the effectiveness of our methodology in accurately predicting links within the given context. In terms of ranking, we secured the 8th position on the public test and an impressive 3rd position on the private test.

                                        Public test   Private test
Our approach (with pre-processing)      0.99996       1.00000
Our approach (without pre-processing)   0.97680       0.97663

TABLE II: Experimental Results in terms of Macro F1-score.

V Conclusion

The significance of the link prediction task lies in its ability to automatically comprehend the structure of extensive knowledge bases. In this paper, we provide a detailed account of our proposed approach to this task within the context of the Data Science and Advanced Analytics 2023 Competition "Efficient and Effective Link Prediction". To tackle link prediction for Wikipedia articles, we implemented a sentence pair classification model based on natural language inference together with an efficient pre-processing pipeline. Notably, our system achieved exceptional performance, with a macro F1-score of 0.99996 on the public test set and a perfect score of 1.00000 on the private test set. As a result, our system ranked 3rd on the private test set, equal to the scores of the first and second places.

Acknowledgement

This research was supported by The VNUHCM-University of Information Technology’s Scientific Research Support Fund.

References

  • Papadopoulos [2023] A. N. Papadopoulos, “DSAA 2023 competition,” 2023. [Online]. Available: https://kaggle.com/competitions/dsaa-2023-competition
  • Storks et al. [2019] S. Storks, Q. Gao, and J. Y. Chai, “Recent advances in natural language inference: A survey of benchmarks, resources, and approaches,” arXiv preprint arXiv:1904.01172, 2019. [Online]. Available: https://arxiv.org/abs/1904.01172
  • Kumar et al. [2020] A. Kumar, S. S. Singh, K. Singh, and B. Biswas, “Link prediction techniques, applications, and performance: A survey,” Physica A: Statistical Mechanics and its Applications, vol. 553, p. 124289, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0378437120300856
  • Al Hasan et al. [2006] M. Al Hasan, V. Chaoji, S. Salem, and M. Zaki, “Link prediction using supervised learning,” in SDM06: workshop on link analysis, counter-terrorism and security, vol. 30, 2006, pp. 798–805.
  • Huang et al. [2005] Z. Huang, X. Li, and H. Chen, “Link prediction approach to collaborative filtering,” in Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, 2005, pp. 141–142.
  • Liu and Lü [2010] W. Liu and L. Lü, “Link prediction based on local random walk,” Europhysics Letters, vol. 89, no. 5, p. 58007, 2010. [Online]. Available: https://doi.org/10.48550/arXiv.1001.2467
  • Trouillon et al. [2016] T. Trouillon, J. Welbl, S. Riedel, E. Gaussier, and G. Bouchard, “Complex embeddings for simple link prediction,” in Proceedings of The 33rd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. F. Balcan and K. Q. Weinberger, Eds., vol. 48.   New York, New York, USA: PMLR, 20–22 Jun 2016, pp. 2071–2080. [Online]. Available: https://proceedings.mlr.press/v48/trouillon16.html
  • Zhang and Chen [2018] M. Zhang and Y. Chen, “Link prediction based on graph neural networks,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31.   Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2018/file/53f0d7c537d99b3824f0f99d62ea2428-Paper.pdf
  • Negi and Chaudhury [2016] S. Negi and S. Chaudhury, “Link prediction in heterogeneous social networks,” ser. CIKM ’16.   New York, NY, USA: Association for Computing Machinery, 2016, p. 609–617. [Online]. Available: https://doi.org/10.1145/2983323.2983722
  • MacCartney [2009] B. MacCartney, Natural language inference.   Stanford University, 2009. [Online]. Available: https://www.proquest.com/openview/1f496dd128e01b6c0b2a030d2a2447f8/1?pq-origsite=gscholar&cbl=18750
  • Bowman et al. [2015] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.   Lisbon, Portugal: Association for Computational Linguistics, Sep. 2015, pp. 632–642. [Online]. Available: https://aclanthology.org/D15-1075
  • Wang and Jiang [2016] S. Wang and J. Jiang, “Learning natural language inference with LSTM,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.   San Diego, California: Association for Computational Linguistics, Jun. 2016, pp. 1442–1451. [Online]. Available: https://aclanthology.org/N16-1170
  • Chen et al. [2017] Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen, “Enhanced LSTM for natural language inference,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017).   Vancouver: ACL, July 2017.
  • Conneau et al. [2020] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.   Online: Association for Computational Linguistics, Jul. 2020, pp. 8440–8451. [Online]. Available: https://aclanthology.org/2020.acl-main.747
  • Devlin et al. [2019] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186. [Online]. Available: https://aclanthology.org/N19-1423/