Large Language Models as Event Forecasters
Abstract
Key elements of human events can be extracted as quadruples consisting of a subject, a relation, an object, and a timestamp. This representation can be extended to a quintuple by adding a fifth element: a textual summary that briefly describes the event. These quadruples or quintuples, when organized within a specific domain, form a temporal knowledge graph (TKG). Current learning frameworks focus on a few TKG-related tasks, such as predicting an object given a subject and a relation, or forecasting the occurrences of multiple types of events (i.e., relations) in the next time window. They typically rely on complex structural and sequential models like graph neural networks (GNNs) and recurrent neural networks (RNNs) to update intermediate embeddings. However, these methods often neglect the contextual information inherent in each quintuple, which can be effectively captured through concise textual descriptions. In this paper, we investigate how large language models (LLMs) can streamline the design of TKG learning frameworks while maintaining competitive accuracy in prediction and forecasting tasks. We propose LEAF, a unified framework that leverages large language models as event forecasters. Specifically, we develop multiple prompt templates to frame the object prediction (OP) task as a standard question-answering (QA) task, suitable for instruction fine-tuning with an encoder-decoder generative LLM. For multi-event forecasting (MEF), we design simple yet effective prompt templates for each TKG quintuple. This novel approach removes the need for GNNs and RNNs, instead utilizing an encoder-only LLM to generate fixed intermediate embeddings, which are subsequently processed by a prediction head with a self-attention mechanism to forecast potential future relations. Extensive experiments on multiple real-world datasets using various evaluation metrics validate the effectiveness and robustness of our approach.
Introduction
Graph neural networks (GNNs) and recurrent neural networks (RNNs) are popular modules in temporal knowledge graph (TKG) learning framework design: GNNs are good at capturing neighboring knowledge among subjects, relations, and objects, while RNNs grasp sequential information along timestamps (Pan et al. 2024; Ma et al. 2023). However, to update intermediate embeddings within GNNs and RNNs, these frameworks (Ma et al. 2023; Deng, Rangwala, and Ning 2020; Jin et al. 2019) have become so complicated that the contextual knowledge within each TKG quintuple itself, which can be represented simply by five separate strings, has not been thoroughly studied (Liao et al. 2023). Considering the applications of large language models (LLMs), we identify several key challenges for typical TKG-related tasks such as object prediction (OP) and multi-event forecasting (MEF) (Shang and Huang 2024; Ma et al. 2023; Deng, Rangwala, and Ning 2020).
The underestimation of individual TKG quintuple contextualization. Existing state-of-the-art (SOTA) methods for TKG-related tasks underestimate the contextual knowledge and the structural information embedded within each TKG quintuple itself (Pan et al. 2024; Liao et al. 2023). For example, during TKG data pre-processing, a structural framework counts the total number of unique subject and object entities to initialize the entity embedding matrix for its graph neural network (Ma et al. 2023), but ignores the fact that each subject or object is a short phrase with its own contextual meaning, which could be informative and valuable for downstream prediction optimization.
The unfamiliarity of domain-specific knowledge for closed-source LLMs. For object prediction, we are asked to predict the missing object entity in a query quintuple, given the other four elements of the query as well as historical TKG information tracing back over a certain sequence length (Ma et al. 2023). Although predicting one or several words to fill in the missing object seems easy for generative LLMs (Ouyang et al. 2022), we observe that prompting SOTA commercial closed-source LLMs, such as GPT-3.5-Turbo-Instruct (OpenAI 2023), does not yield satisfactory performance, possibly due to their unfamiliarity with domain-specific knowledge given only 4,096 maximum input tokens for in-context learning and instruction (Dong et al. 2022).
The rather limited maximum input context length for open-source LLMs. For multi-event forecasting, we are asked to predict possible relation occurrences in the future, given only historical TKG information tracing back over a certain length (Deng, Rangwala, and Ning 2020). Compared with object prediction, there are two major differences. First, there is no query to provide auxiliary information about the future. Second, multiple unique relations can constitute the correct prediction, whose combined textual length is much longer than that of a single object. Typically, the maximum number of input tokens for open-source LLMs is rather limited, such as 512 tokens for RoBERTa (Liu et al. 2019) and 1,024 tokens for FLAN-T5 (Chung et al. 2024). Therefore, to handle a large number of historical quintuples, we must perform significant sub-sampling during prompt engineering, which ultimately results in very poor multi-relation prediction performance even after instruction fine-tuning (Wei et al. 2021).
To tackle these challenges, we propose LEAF, a unified framework that leverages large language models as event forecasters. We fine-tune both encoder-only LLMs, such as RoBERTa (Liu et al. 2019), and encoder-decoder LLMs, such as FLAN-T5 (Chung et al. 2024), along with optimizing various customized downstream prediction heads, to achieve competitive accuracy in both object prediction and multi-event forecasting. Our major contributions can be summarized as follows.
• First, we make two attempts at object prediction: 1) As a ranking task, we leverage a fine-tuned encoder-only LLM, RoBERTa-base (RoBERTa for short), to encode the fifth element, the textual summary, of a given query quintuple, and feed the output embeddings back into a structural decoder through linear projection to complete the ranking prediction. 2) As a generative task in a standard question-answering (QA) format, we design various prompt templates as questions and set the correct objects as answers. We introduce an encoder-decoder LLM, FLAN-T5-base (FLAN-T5 for short), for generation, and leverage instruction fine-tuning following the QA format to improve object prediction accuracy.
• Second, we formulate multi-event forecasting as a multi-label binary classification task. We design a straightforward prompt template at the quintuple level and utilize a pre-trained LLM, RoBERTa-large, to encode each quintuple prompt. We then design a simple prediction head with self-attention (Vaswani et al. 2017) to process the LLM's output embeddings and predict possible event (i.e., relation) occurrences in the future.
We conduct comprehensive experiments on multiple sociopolitical ICEWS datasets (Boschee et al. 2015) covering different countries, using various evaluation metrics, to analyze and demonstrate the validity and effectiveness of the approaches formulated in LEAF.
Related Work
TKG learning has been comprehensively studied, and various kinds of frameworks have been proposed to address specific challenges including, but not limited to, prediction accuracy, optimization efficiency, and reasoning interpretability (Pan et al. 2024). Overall, these TKG learning frameworks can be divided into the following three categories (Liao et al. 2023).
• Rule-based frameworks: The key components of rule-based frameworks are manually pre-defined strategies that locate, identify, and extract parts of TKG quadruples or quintuples believed to be helpful for downstream inference (Liu et al. 2022). Usually, the strategies are numbered, and strategies with larger numbers tend to be more relaxed than those with smaller numbers to ensure the inclusion of all possible scenarios (Pan et al. 2023). Without any trainable parameters, rule-based frameworks tend to be computationally efficient and easily interpretable, since they strictly follow a list of rules. However, implementing these manually defined rules is a laborious and exhaustive mining process within the TKG dataset itself.
• In-context learning frameworks: The key components of in-context learning frameworks are generative LLMs with frozen parameters (Dong et al. 2022). Various prompts can be designed to include examples for few-shot learning and instructions for completing TKG-related tasks; these prompts are fed into generative LLMs, whose outputs are then collected as results. Although this saves significant effort for downstream inference, there are two explicit shortcomings. First, free open-source generative LLMs, such as FLAN-T5 (Chung et al. 2024) and BART (Lewis et al. 2019), have a rather limited maximum input token length, which makes it difficult to achieve satisfactory prediction accuracy. Second, commercial LLMs, such as GPT-3.5-Turbo-Instruct (OpenAI 2023) and ChatGPT (OpenAI 2024), are strong generators that support much longer input contexts, but given abundant TKG datasets, the cost of making frequent API calls cannot be ignored.
• Embedding-based frameworks: The key components of embedding-based frameworks are entity and relation embeddings updated by GNNs, as well as temporally sequential embeddings captured by RNNs (Ma et al. 2023). When extending quadruples to quintuples, handling the additional text summaries can be tricky. Specifically, for object prediction, SeCoGD (Ma et al. 2023) introduces the latent Dirichlet allocation (LDA) clustering algorithm (Blei, Ng, and Jordan 2003), pre-trained on collected and filtered text summaries, to separate quintuples into different context groups, and then leverages hypergraphs to coordinate intermediate embeddings across groups. For multi-event forecasting, Glean (Deng, Rangwala, and Ning 2020) builds a word graph by calculating the point-wise mutual information (Church and Hanks 1990) between words tokenized and filtered from the collected text summaries in a TKG dataset. Both SeCoGD and Glean are important competitors for our approaches; by making appropriate use of open-source LLMs as event forecasters, we aim to outperform these carefully engineered frameworks in terms of prediction accuracy.
LLMs for Object Prediction
For object prediction, we aim to predict the missing object, given a query TKG quintuple containing a known subject, relation, timestamp, and textual summary, as well as some historical knowledge (Ma et al. 2023). The historical knowledge can be traced back several days when framing object prediction as a ranking task, or back to just a few quintuples when framing it as a generative task.
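To make the task statement concrete, the following notation sketch states object prediction under the quintuple representation; the symbols (including the history window) are our own illustrative choices rather than the paper's original notation.

```latex
% A TKG quintuple: subject s, relation r, object o, timestamp t, text summary d.
% Entities E, relations R, and history H_{<t} are illustrative symbols.
q = (s, r, o, t, d), \qquad s, o \in \mathcal{E},\; r \in \mathcal{R}

\hat{o} \;=\; \arg\max_{o' \in \mathcal{E}} \; P\!\left(o' \mid s, r, t, d,\; \mathcal{H}_{<t}\right)
```

Here, the history window covers either several preceding days (ranking task) or a handful of preceding quintuples (generative task), as described above.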

Table 1: ICEWS dataset statistics for object prediction and test perplexities of RoBERTa-base before and after masked language modeling fine-tuning.

| ICEWS dataset (Boschee et al. 2015) | Afghanistan | India | Russia |
|---|---|---|---|
| Number of training quintuples | 212,540 | 318,471 | 275,477 |
| Number of validation quintuples | 32,734 | 75,439 | 46,516 |
| Number of test quintuples | 34,585 | 85,739 | 51,371 |
| Test perplexity before fine-tuning | 3.53 | 3.65 | 2.85 |
| Test perplexity after fine-tuning | 2.03 | 2.32 | 2.01 |
Table 2: Structural framework components for object prediction as a ranking task, where "×k" denotes the number of module instances.

| Model | GNN | RNN | Decoder | How to handle text? |
|---|---|---|---|---|
| ConvTransE (Shang et al. 2019) | N/A | N/A | ConvTransE ×1 | N/A |
| SeCoGD, LDA, 5 (Ma et al. 2023) | RGCN ×5 | GRU ×5 | ConvTransE ×5 | Context clusters |
| Baseline w/o LLM | RGCN ×1 | GRU ×1 | ConvTransE ×1 | N/A |
| LEAF-OP (ours) | RGCN ×1 | GRU ×1 | ConvTransE ×1 | Query embeddings |
Table 3: Object prediction as a ranking task, evaluated with Hits@1/3/10 (higher is better).

| Model | Afghanistan | | | India | | | Russia | | |
|---|---|---|---|---|---|---|---|---|---|
| Metric: Hits@ | 1 | 3 | 10 | 1 | 3 | 10 | 1 | 3 | 10 |
| Historical sequence length 3 (back to 3 days), fine-tuned RoBERTa-base as an encoder-only LLM | | | | | | | | | |
| Baseline w/o LLM | 0.1538 | 0.3137 | 0.5408 | 0.1704 | 0.3125 | 0.4952 | 0.1332 | 0.2345 | 0.3767 |
| SeCoGD, LDA, 5 | 0.1878 | 0.3570 | 0.5740 | 0.2064 | 0.3554 | 0.5357 | 0.1768 | 0.2909 | 0.4351 |
| ConvTransE | 0.1235 | 0.2704 | 0.4916 | 0.1521 | 0.2821 | 0.4600 | 0.1009 | 0.1791 | 0.3078 |
| LEAF-OP (ours) | 0.3691 | 0.5630 | 0.7317 | 0.3675 | 0.5507 | 0.7233 | 0.3751 | 0.5390 | 0.6831 |
| Historical sequence length 7 (back to 7 days), fine-tuned RoBERTa-base as an encoder-only LLM | | | | | | | | | |
| Baseline w/o LLM | 0.1551 | 0.3298 | 0.5544 | 0.1724 | 0.3175 | 0.5004 | 0.1335 | 0.2349 | 0.3781 |
| SeCoGD, LDA, 5 | 0.1833 | 0.3652 | 0.5862 | 0.2056 | 0.3516 | 0.5352 | 0.1661 | 0.2823 | 0.4433 |
| ConvTransE | 0.1243 | 0.2759 | 0.4909 | 0.1544 | 0.2911 | 0.4740 | 0.0973 | 0.1804 | 0.2971 |
| LEAF-OP (ours) | 0.3861 | 0.5884 | 0.7664 | 0.3935 | 0.5831 | 0.7454 | 0.3861 | 0.5590 | 0.7077 |
Table 4: Prompt templates for framing object prediction as a generative question-answering task.

| Name | Prompt Template |
|---|---|
| Original | I ask you to perform an object prediction task after I provide you with five examples. Each example is a knowledge quintuple containing two entities, a relation, a timestamp, and a brief text summary. Each knowledge quintuple is strictly formatted as (subject entity, relation, object entity, timestamp, text summary). For the object prediction task, you should predict the missing object entity based on the other four available elements. Now I give you five examples. ## Example 1 (subject 1, relation 1, MISSING OBJECT ENTITY, timestamp 1, text summary 1) \n The MISSING OBJECT ENTITY is: object 1 \n ⋮ ## Example 5 (subject 5, relation 5, MISSING OBJECT ENTITY, timestamp 5, text summary 5) \n The MISSING OBJECT ENTITY is: object 5 \n Now I give you a query: (subject 6, relation 6, MISSING OBJECT ENTITY, timestamp 6, text summary 6) \n Please predict the missing object entity. You are allowed to predict new object entity which you have never seen in examples. The correct object entity is: |
| Zero-shot | Remove all five in-context learning examples in the original prompt. |
| No-text | Remove all text summaries of the five in-context learning examples and the query in the original prompt. |
Table 5: Object prediction as a generative task, evaluated with ROUGE-1/2/L (higher is better).

| ICEWS dataset | Afghanistan | | | India | | | Russia | | |
|---|---|---|---|---|---|---|---|---|---|
| Metric: ROUGE | 1 | 2 | L | 1 | 2 | L | 1 | 2 | L |
| Historical sequence length 3 (back to 3 days), fine-tuned RoBERTa-base as an encoder-only LLM | | | | | | | | | |
| SeCoGD, LDA, 5 | 0.4271 | 0.1658 | 0.4271 | 0.4812 | 0.2902 | 0.4813 | 0.3698 | 0.1414 | 0.3700 |
| LEAF-OP (ours) | 0.5287 | 0.2670 | 0.5286 | 0.5500 | 0.3821 | 0.5501 | 0.4974 | 0.2629 | 0.4973 |
| Historical sequence length 7 (back to 7 days), fine-tuned RoBERTa-base as an encoder-only LLM | | | | | | | | | |
| SeCoGD, LDA, 5 | 0.4295 | 0.1723 | 0.4295 | 0.4831 | 0.2905 | 0.4833 | 0.3831 | 0.1404 | 0.3831 |
| LEAF-OP (ours) | 0.5675 | 0.3200 | 0.5674 | 0.5892 | 0.4034 | 0.5893 | 0.5242 | 0.2636 | 0.5238 |
| Prompt engineering: back to 5 samples or no sample, fine-tuned FLAN-T5-base as a generative LLM | | | | | | | | | |
| Original prompt (ours) | 0.8638 | 0.5656 | 0.8638 | 0.8594 | 0.6962 | 0.8594 | 0.8415 | 0.4544 | 0.8414 |
| Zero-shot prompt (ours) | 0.8601 | 0.5601 | 0.8600 | 0.8530 | 0.6885 | 0.8531 | 0.8318 | 0.4480 | 0.8317 |
| No-text prompt (ours) | 0.4216 | 0.1322 | 0.4215 | 0.4779 | 0.2517 | 0.4780 | 0.3612 | 0.1007 | 0.3614 |
Table 6: Object prediction as a generative task, evaluated with ROUGE-1/2/L, including closed-source GPT-3.5-Turbo-Instruct prompted on the first 5,000 test samples with the original prompt.

| ICEWS dataset | Afghanistan | | | India | | | Russia | | |
|---|---|---|---|---|---|---|---|---|---|
| Metric: ROUGE | 1 | 2 | L | 1 | 2 | L | 1 | 2 | L |
| Historical sequence length 3 (back to 3 days), fine-tuned RoBERTa-base as an encoder-only LLM | | | | | | | | | |
| SeCoGD, LDA, 5 | 0.4137 | 0.1676 | 0.4134 | 0.4916 | 0.2957 | 0.4918 | 0.3326 | 0.1189 | 0.3331 |
| LEAF-OP (ours) | 0.5320 | 0.2848 | 0.5322 | 0.5709 | 0.3888 | 0.5709 | 0.4671 | 0.2455 | 0.4674 |
| Historical sequence length 7 (back to 7 days), fine-tuned RoBERTa-base as an encoder-only LLM | | | | | | | | | |
| SeCoGD, LDA, 5 | 0.4222 | 0.1770 | 0.4221 | 0.5016 | 0.2977 | 0.5022 | 0.3385 | 0.1175 | 0.3385 |
| LEAF-OP (ours) | 0.5732 | 0.3502 | 0.5728 | 0.6167 | 0.4251 | 0.6168 | 0.5024 | 0.2554 | 0.5015 |
| Prompt engineering: back to 5 samples or no sample, fine-tuned FLAN-T5-base as a generative LLM | | | | | | | | | |
| Original prompt (ours) | 0.8789 | 0.5887 | 0.8786 | 0.8666 | 0.7191 | 0.8670 | 0.8154 | 0.4942 | 0.8150 |
| Zero-shot prompt (ours) | 0.8742 | 0.5805 | 0.8741 | 0.8622 | 0.7128 | 0.8624 | 0.8053 | 0.4874 | 0.8047 |
| No-text prompt (ours) | 0.3940 | 0.1307 | 0.3941 | 0.4810 | 0.2572 | 0.4817 | 0.3522 | 0.1151 | 0.3525 |
| Prompt engineering: back to 5 samples, closed-source GPT-3.5-Turbo-Instruct as a generative LLM | | | | | | | | | |
| Original prompt | 0.4097 | 0.1302 | 0.4092 | 0.3644 | 0.1601 | 0.3640 | 0.3480 | 0.1255 | 0.3485 |
Object Prediction as a Ranking Task
To make good use of the contextual information embedded in the fifth element, the text summary that briefly describes a TKG quadruple, we collect and concatenate the summaries from all quintuples to form a large textual corpus for the training/validation/test splits, respectively. Subsequently, we leverage this corpus to fine-tune an encoder-only LLM, RoBERTa-base, with the standard masked language modeling loss (Liu et al. 2019). The RoBERTa-base encoder takes a textual summary as input and outputs an embedding vector for each token; we then apply mean pooling to obtain the sentence embedding for the textual summary of each quintuple.
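The following sketch illustrates this step under stated assumptions: it is not the authors' exact pipeline, and the file name, hyperparameters, and maximum length are illustrative placeholders.

```python
# Minimal sketch: fine-tune RoBERTa-base with masked language modeling on the
# concatenated text summaries, then mean-pool token embeddings into one
# sentence embedding per quintuple summary.
import torch
from transformers import (AutoTokenizer, AutoModel, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# summaries.txt: one text summary per line (hypothetical file built from the TKG quintuples).
with open("summaries.txt") as f:
    summaries = [line.strip() for line in f if line.strip()]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = Dataset.from_dict({"text": summaries}).map(
    tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

mlm_model = AutoModelForMaskedLM.from_pretrained("roberta-base")
trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments("roberta-mlm-icews", per_device_train_batch_size=32,
                           num_train_epochs=3, learning_rate=5e-5),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("roberta-mlm-icews")

# Mean-pooled sentence embedding for one summary with the fine-tuned encoder.
encoder = AutoModel.from_pretrained("roberta-mlm-icews")
inputs = tokenizer(summaries[0], return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    token_states = encoder(**inputs).last_hidden_state          # (1, seq_len, hidden)
mask = inputs["attention_mask"].unsqueeze(-1)                    # ignore padding tokens
sentence_embedding = (token_states * mask).sum(1) / mask.sum(1)  # (1, hidden)
```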
Before utilizing the sentence embeddings, we evaluate the effect of masked language modeling fine-tuning (Liu et al. 2019), using perplexity (Jelinek et al. 1977) as the evaluation metric. Based on Table 1, we observe a significant decrease in test perplexity on all three datasets after fine-tuning. We therefore expect better semantic meaning to be encoded into the sentence embeddings obtained from the fine-tuned RoBERTa-base, which has a stronger contextualized understanding of the corpus.
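As a reference point, perplexity here is simply the exponential of the average masked-language-modeling loss over the held-out summaries; the helper below is our own illustrative sketch, assuming the tokenizer and fine-tuned model from the previous snippet and a hypothetical `test_summaries` list.

```python
# Sketch of the perplexity evaluation: exp of the mean MLM loss on the test corpus.
import math
import torch
from transformers import DataCollatorForLanguageModeling

def mlm_perplexity(model, tokenizer, texts, batch_size=32):
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
    model.eval()
    losses = []
    for start in range(0, len(texts), batch_size):
        enc = tokenizer(texts[start:start + batch_size], truncation=True,
                        max_length=128, padding=True)
        batch = collator([{"input_ids": ids} for ids in enc["input_ids"]])
        with torch.no_grad():
            losses.append(model(**batch).loss.item())   # loss over masked tokens only
    return math.exp(sum(losses) / len(losses))

# print(mlm_perplexity(mlm_model, tokenizer, test_summaries))
```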
Returning to structural frameworks, the architectures of our approach and the competitors are briefly summarized in Figure 1 and Table 2, where the GNN module is a relational graph convolutional network (RGCN) (Schlichtkrull et al. 2018) and the RNN module is a gated recurrent unit (GRU) (Cho et al. 2014). For the structural decoder, which combines queries with embeddings from the GNN and RNN and eventually ranks all candidate objects after Softmax activation, we adopt ConvTransE (Shang et al. 2019). Compared with SeCoGD (Ma et al. 2023), where context IDs are acquired from a pre-trained clustering algorithm (Blei, Ng, and Jordan 2003), we believe that the semantic information encoded in the queries' sentence embeddings, injected through a linear projection and a non-linear ReLU activation, can significantly enrich the encoder optimization and improve prediction accuracy.
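A minimal sketch of this injection idea is shown below. It is not the authors' implementation: the module names, dimensions, and the simple linear stand-in for the ConvTransE-style decoder are all assumptions made for illustration.

```python
# Sketch: project the query's frozen sentence embedding (linear + ReLU) and fuse
# it with the structural subject/relation embeddings before the decoder scores
# all candidate objects.
import torch
import torch.nn as nn

class QueryTextFusion(nn.Module):
    def __init__(self, text_dim=768, struct_dim=200, num_entities=10000):
        super().__init__()
        self.text_proj = nn.Sequential(nn.Linear(text_dim, struct_dim), nn.ReLU())
        # Stand-in for a ConvTransE-style decoder: maps the fused query
        # representation to a score for every candidate object entity.
        self.decoder = nn.Linear(3 * struct_dim, num_entities)

    def forward(self, subj_emb, rel_emb, sent_emb):
        # subj_emb, rel_emb: (batch, struct_dim) from RGCN/GRU;
        # sent_emb: (batch, text_dim) from the fine-tuned RoBERTa encoder.
        text_feat = self.text_proj(sent_emb)
        fused = torch.cat([subj_emb, rel_emb, text_feat], dim=-1)
        return torch.softmax(self.decoder(fused), dim=-1)   # probabilities over objects

fusion = QueryTextFusion()
scores = fusion(torch.randn(4, 200), torch.randn(4, 200), torch.randn(4, 768))
print(scores.shape)   # torch.Size([4, 10000])
```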
When framing object prediction as a ranking task, we use Hits@1/3/10 as evaluation metrics, ranking the correct object among all candidates based on the output probabilities after Softmax activation. For this ranking task, we consider both the original triples and the reversed triples obtained by swapping the subject and the object, which follows the same setting as SeCoGD and can be regarded as a data augmentation strategy. Accordingly, the number of candidates equals the number of unique entities in each dataset.
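For clarity, a small sketch of the Hits@k computation is given below; it assumes a score matrix over all candidate entities and the index of the true object per query, with random data standing in for real model outputs.

```python
# Sketch of Hits@1/3/10 for the ranking task.
import numpy as np

def hits_at_k(scores, true_idx, ks=(1, 3, 10)):
    # scores: (num_queries, num_entities); true_idx: (num_queries,)
    # Rank = number of candidates scored strictly higher than the true object, plus one.
    true_scores = scores[np.arange(len(true_idx)), true_idx][:, None]
    ranks = (scores > true_scores).sum(axis=1) + 1
    return {k: float((ranks <= k).mean()) for k in ks}

rng = np.random.default_rng(0)
scores = rng.random((5, 100))                 # placeholder decoder probabilities
true_idx = rng.integers(0, 100, size=5)       # placeholder ground-truth objects
print(hits_at_k(scores, true_idx))            # {1: ..., 3: ..., 10: ...}
```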
Based on Table 3, we observe that our approach with sentence embeddings encoded by the fine-tuned RoBERTa achieves much better Hits@1/3/10 scores than the other competitors. In addition, the overall scores increase slightly when we trace back more historical days. However, the historical sequence length, as a hyperparameter, cannot grow without bound due to the increasing computational cost.
Object Prediction as a Generative Task

We have demonstrated the impressive capability of a fine-tuned encoder-only LLM to support a structural framework in the embedding space. Our next question is: can an LLM predict objects directly as a generative predictor? In other words, if we stack historical knowledge, queries, and instructions into a textual prompt and feed it into a generative LLM, such as FLAN-T5 (Chung et al. 2024) or GPT-3.5 (OpenAI 2023), how likely is it to generate the missing object correctly? To answer this question, based on Figure 2, we frame object prediction as a generative task and introduce ROUGE scores (Lin 2004) to directly compare the true object and the predicted object as two strings, instead of ranking all candidate objects. To transfer from the ranking task to the generative task, we directly read out the first candidate with the highest output probability as the predicted object, iteratively across all test samples, and then compute the ROUGE scores for the selected structural frameworks. Based on Table 5, we observe that our structural approach with sentence embeddings still outperforms SeCoGD (Ma et al. 2023) significantly, following trends similar to those in Table 3.
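A small sketch of this string-level evaluation is shown below, using the `rouge-score` package; the entity strings and scores are illustrative placeholders rather than real model outputs.

```python
# Sketch: read out the top-1 candidate as the predicted object and compare it
# against the true object with ROUGE-1/2/L.
import numpy as np
from rouge_score import rouge_scorer

entity_names = ["Taliban", "Citizen (India)", "Government (Russia)"]  # candidate objects
scores = np.array([[0.1, 0.7, 0.2]])               # decoder probabilities for one query
predicted = entity_names[int(scores[0].argmax())]  # top-1 readout
true_object = "Citizen (India)"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(true_object, predicted)        # (target, prediction)
print({name: round(score.fmeasure, 4) for name, score in rouge.items()})
```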
We construct the generative predictor from two perspectives. On the one hand, we can prompt commercial LLMs, such as GPT-3.5-Turbo-Instruct (GPT-3.5 for short) (OpenAI 2023), to generate predictions directly. However, this approach has several obvious shortcomings. First, prompting GPT-3.5 is not free of charge, and GPT-4 is much more expensive than GPT-3.5. Second, Chain-of-Thought prompting hinders post-processing and automatic evaluation when there are many queries. Third, because GPT-3.5 is closed-source, we can do nothing beyond few-shot learning. On the other hand, we can fine-tune an open-source encoder-decoder LLM, such as FLAN-T5 (Chung et al. 2024), to generate predictions. Since FLAN-T5 has far fewer parameters and a shorter input/output token length than GPT-3.5, its commonsense generative capability may be much weaker. Therefore, we fine-tune a FLAN-T5 model for each dataset following a standard question-answering format, where the "question" is a 5-shot learning prompt with one query along with explanations and instructions, and the "answer" is the object entity.
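The sketch below shows one way such instruction fine-tuning could be set up; it is not the authors' exact recipe, and the `qa_pairs` placeholder, hyperparameters, and output directory are illustrative assumptions.

```python
# Sketch: instruction fine-tuning of FLAN-T5-base in a QA format, where the
# input is the 5-shot prompt ending with "The correct object entity is:" and
# the target is the object entity string.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Hypothetical (prompt, object) pairs built from the templates in Table 4.
qa_pairs = [("... The correct object entity is:", "Citizen (India)")]

def preprocess(batch):
    enc = tokenizer(batch["question"], truncation=True, max_length=1024)
    labels = tokenizer(text_target=batch["answer"], truncation=True, max_length=32)
    enc["labels"] = labels["input_ids"]
    return enc

dataset = Dataset.from_dict({
    "question": [q for q, _ in qa_pairs],
    "answer": [a for _, a in qa_pairs],
}).map(preprocess, batched=True, remove_columns=["question", "answer"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments("flan-t5-op", per_device_train_batch_size=8,
                                  num_train_epochs=3, learning_rate=3e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Inference: generate the missing object for a new query prompt.
inputs = tokenizer(qa_pairs[0][0], return_tensors="pt", truncation=True, max_length=1024)
pred = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(pred[0], skip_special_tokens=True))
```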
For this generative task, we consider only the original triples, without swapping the subject and object, which differs from the data augmentation setting in the previous ranking task. This is due to the rather limited maximum input context length of FLAN-T5, which makes it challenging to stack reversed triples into the prompt as additional examples and queries. To reconcile the two settings, when reading out the first candidate for the structural frameworks, we only read out from the original triples and do not consider the reversed triples. Therefore, after transferring from the ranking task to the generative task, all methods share the same setting before computing ROUGE scores, ensuring a fair comparison.
As shown in Table 4, we fine-tune FLAN-T5 with a 5-shot learning prompt (the original prompt). To evaluate generalizability, we remove the five learning examples to build a zero-shot prompt. To further evaluate the contribution of the text summaries, we design another 5-shot learning prompt (the no-text prompt) that removes all text summaries from the five learning examples and the query, as an ablation study.
Based on Table 5, we observe that fine-tuning FLAN-T5 as a generative predictor yields ROUGE scores that are much better even than those of our best structural approach with sentence embeddings. In addition, the zero-shot prompt achieves scores comparable to the original prompt, demonstrating good generalization. However, the no-text prompt shows significantly degraded scores across all three datasets, indicating that the fifth element of a query quintuple, the textual summary, is very important for enhancing the contextualized understanding of a generative LLM during instruction fine-tuning.
Overall, the fine-tuned FLAN-T5 is a strong generative predictor. But what about directly prompting GPT-3.5: can it make better use of the text summaries? To answer this question, and to save the cost of making frequent OpenAI API (OpenAI 2023) calls to predict a large number of missing objects, we select the first 5,000 test samples and prompt GPT-3.5 with the original prompt only. Similar to Table 5, we collect the outputs iteratively and compute the ROUGE scores against the true objects without any additional post-processing.
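For reference, a minimal sketch of this prompting step through the legacy completions endpoint of the OpenAI Python client (v1.x) is shown below; the decoding parameters and the `test_prompts` list are illustrative assumptions.

```python
# Sketch: prompt GPT-3.5-Turbo-Instruct with the "original" 5-shot template from
# Table 4 and read the generated object entity.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_object(prompt: str) -> str:
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,             # ends with "The correct object entity is:"
        max_tokens=16,
        temperature=0.0,           # deterministic readout for evaluation
    )
    return response.choices[0].text.strip()

# predictions = [predict_object(p) for p in test_prompts[:5000]]
```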
Based on Table 6, we observe that, given the original prompt, GPT-3.5 performs much worse as a generator than the fine-tuned FLAN-T5. GPT-3.5 also has lower ROUGE scores than our structural approach with sentence embeddings. Therefore, simply prompting commercial closed-source LLMs is not always the best option, especially for specifically formatted problems where relatively small open-source LLMs can be fine-tuned to obtain better performance. For object prediction, the reasoning is rather intuitive: there can be thousands of unique objects in a single dataset (Boschee et al. 2015), and the open-source FLAN-T5 can learn what most objects should be during fine-tuning, while the closed-source GPT-3.5 can only derive its output from pre-trained commonsense knowledge and the provided prompt, which cannot fit all object candidates because the maximum number of input tokens for GPT-3.5-Turbo-Instruct is 4,096 (OpenAI 2023).
LLMs for Multi-event Forecasting

Table 7: The simple quintuple-level prompt template for multi-event forecasting.

| Name | Prompt Template |
|---|---|
| Simple | Subject: subject; \n Relation: relation; \n Object: object; \n Timestamp: timestamp; \n Text Summary: text summary |
Table 8: ICEWS dataset statistics for multi-event forecasting and test perplexities before and after masked language modeling fine-tuning.

| ICEWS dataset | Afghanistan | | India | | Russia | |
|---|---|---|---|---|---|---|
| Number of | Quintuples | Days | Quintuples | Days | Quintuples | Days |
| Training set | 212,540 | 1931 | 318,471 | 1931 | 275,477 | 1931 |
| Validation set | 32,734 | 224 | 75,439 | 224 | 46,516 | 224 |
| Test set | 34,585 | 224 | 85,739 | 224 | 51,371 | 224 |
| Metric | Perplexity | | Perplexity | | Perplexity | |
| Before fine-tuning | 12.27 | | 11.76 | | 11.79 | |
| After fine-tuning | 1.21 | | 1.25 | | 1.22 | |
Table 9: Framework components for multi-event forecasting.

| Model | GNN | RNN | How to handle text? |
|---|---|---|---|
| DNN | N/A | N/A | N/A |
| DynGCN | DynGCN | N/A | Word graph only |
| T-GCN | GCN | GRU | Word graph only |
| RENET | RGCN | RNN | Event graph only |
| Glean | CompGCN | GRU | Word graph + event graph |
| LEAF-MEF (ours) | N/A | N/A | TKG quintuple prompt encoding |
Table 10: Multi-event forecasting results, evaluated with F1 score, recall, and precision.

| ICEWS dataset | Afghanistan | | | India | | | Russia | | |
|---|---|---|---|---|---|---|---|---|---|
| Metric | F1 | Recall | Precision | F1 | Recall | Precision | F1 | Recall | Precision |
| DNN | 55.77 | 68.14 | 47.20 | 52.49 | 56.38 | 49.10 | 53.81 | 62.61 | 47.18 |
| Dynamic GCN | 50.05 | 57.75 | 44.16 | 41.80 | 43.19 | 40.50 | 52.81 | 60.14 | 47.07 |
| Temporal GCN | 60.04 | 76.93 | 49.23 | 60.73 | 67.20 | 55.40 | 56.36 | 67.66 | 48.29 |
| RENET | 60.58 | 77.75 | 49.62 | 58.44 | 64.18 | 53.64 | 55.85 | 65.66 | 48.59 |
| Glean | 62.48 | 82.84 | 50.15 | 66.69 | 77.31 | 58.64 | 58.92 | 73.57 | 49.14 |
| LEAF-MEF w/o SA (ours) | 60.93 | 78.32 | 49.86 | 59.21 | 64.76 | 54.54 | 56.67 | 68.38 | 48.38 |
| LEAF-MEF w/ SA (ours) | 63.63 | 88.69 | 49.61 | 70.99 | 87.31 | 59.81 | 62.80 | 86.81 | 49.19 |
For multi-event forecasting, we aim to predict possible relation occurrences in the future (specifically, on the next day) among all unique relations, given only partial historical knowledge (Deng, Rangwala, and Ning 2020). Intuitively, multi-event forecasting is more challenging than object prediction. First, object prediction provides four elements of a query quintuple, which can be valuable for identifying the missing object, whereas multi-event forecasting only has historical information (e.g., past events). Second, object prediction operates on the quintuple level, while multi-event forecasting operates on the daily level, which leaves far fewer samples for training our approaches (from hundreds of thousands down to only thousands) (Boschee et al. 2015). Therefore, we aim to design a prompt template that does not exceed the limited maximum input tokens of open-source LLMs while including as much important historical knowledge as possible.
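Concretely, the multi-label formulation can be sketched as follows, using illustrative notation of our own (with k denoting the number of historical days used):

```latex
% Multi-event forecasting as multi-label binary classification over the set of
% unique relations R; symbols are illustrative, not the paper's original notation.
\mathbf{y}_{t+1} \in \{0,1\}^{|\mathcal{R}|}, \qquad
y_{t+1}^{(r)} = 1 \;\Longleftrightarrow\; \text{relation } r \text{ occurs on day } t+1

\hat{\mathbf{y}}_{t+1} \;=\; \mathbb{1}\!\left[ \sigma\!\big( f_{\theta}(\mathcal{H}_{t-k+1:t}) \big) > 0.5 \right]
```

Here, the history covers the quintuple-prompt embeddings of the previous k days, f with parameters θ is the downstream prediction head, and σ is the element-wise Sigmoid with the 0.5 decision threshold described later.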
Prompt Engineering: One Day vs One Quintuple
Since multi-event forecasting operates on the daily level and each day can include hundreds of quintuples, the concatenated strings far exceed thousands of tokens, so substantial TKG sampling would be required. Specifically, we could collect only a small subset of quintuples from each day, typically a few dozen, to use as historical knowledge in the prompt. However, after trying various sampling strategies, we find that even with Longformer (Beltagy, Peters, and Cohan 2020), which is based on RoBERTa (Liu et al. 2019) and fine-tuned for a much longer maximum input length (4,096 tokens instead of RoBERTa's 512), daily prompts constructed by sampling TKG quintuples within the same day yield unsatisfactory results compared with Glean's random initialization without optimization (Deng, Rangwala, and Ning 2020). Therefore, we aim to develop an approach that avoids the cost of sampling while handling TKG structures and historical sequences without introducing GNNs and RNNs.
We start with a straightforward simple prompt template, shown in Table 7, to incorporate the structure of a quintuple. The total number of prompts equals the total number of quintuples for each ICEWS dataset (Boschee et al. 2015). We then leverage the encoder-only LLM, RoBERTa-large, along with mean pooling, to generate an embedding vector for each quintuple prompt. Since each prompt contains only one quintuple, the maximum input length is no longer a challenge, and we do not need to perform significant sampling. Given this simple prompt template, we expect the RoBERTa-large encoder to readily capture both the structural and the contextual meaning embedded in each quintuple.
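The sketch below illustrates the simple prompt from Table 7 and its frozen RoBERTa-large encoding with mean pooling; the example quintuple values are made-up placeholders.

```python
# Sketch: build the quintuple-level prompt and encode it with frozen
# RoBERTa-large, producing one mean-pooled embedding per quintuple.
import torch
from transformers import AutoTokenizer, AutoModel

def quintuple_prompt(subject, relation, obj, timestamp, summary):
    return (f"Subject: {subject}; \n"
            f"Relation: {relation}; \n"
            f"Object: {obj}; \n"
            f"Timestamp: {timestamp}; \n"
            f"Text Summary: {summary}")

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large").eval()   # parameters stay frozen

prompt = quintuple_prompt("Taliban", "Make statement", "Citizen (Afghanistan)",
                          "2018-05-02", "A statement was issued to local residents ...")
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    token_states = encoder(**inputs).last_hidden_state          # (1, seq_len, 1024)
mask = inputs["attention_mask"].unsqueeze(-1)
quintuple_embedding = (token_states * mask).sum(1) / mask.sum(1)  # (1, 1024)
```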
Downstream Prediction Head Design
Subsequently, as shown in Figure 3, the quintuple embeddings tracing back over several historical days are fed into a downstream prediction head, which contains two modules for completing the multi-event forecasting task.
The first module is a single-head self-attention layer (Vaswani et al. 2017). The self-attention mechanism is applied across the quintuples within the same historical day; our motivation is to capture potentially useful information by identifying and weighting the relative importance of the quintuples within each day. Without it, the per-day quintuple dimension would simply be collapsed by mean aggregation.
The second module is a fully-connected layer whose output dimension equals the number of unique relations, followed by an element-wise Sigmoid activation with a threshold of 0.5 to decide occurrences.
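A minimal sketch of such a head is given below. The tensor shapes, dimensions, and the exact aggregation order (mean over quintuples, then mean over days) are our own assumptions for illustration, not the authors' exact implementation.

```python
# Sketch: single-head self-attention over the quintuples within each historical
# day, mean aggregation, then a fully-connected layer with Sigmoid outputs,
# one per unique relation.
import torch
import torch.nn as nn

class MEFHead(nn.Module):
    def __init__(self, emb_dim=1024, num_relations=256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(emb_dim, num_heads=1, batch_first=True)
        self.classifier = nn.Linear(emb_dim, num_relations)

    def forward(self, x):
        # x: (batch, days, quintuples_per_day, emb_dim) frozen RoBERTa-large embeddings.
        b, d, q, e = x.shape
        flat = x.reshape(b * d, q, e)                    # attend within each day
        attended, _ = self.self_attn(flat, flat, flat)
        daily = attended.mean(dim=1).reshape(b, d, e)    # collapse quintuple dimension
        pooled = daily.mean(dim=1)                       # aggregate over historical days
        return torch.sigmoid(self.classifier(pooled))    # (batch, num_relations)

head = MEFHead()
probs = head(torch.randn(2, 7, 20, 1024))
forecast = (probs > 0.5).int()    # predicted relation occurrences for the next day
print(probs.shape, forecast.sum().item())
```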
The Multi-label Binary Classification Task
During experiments, we only optimize the downstream prediction head while freezing the LLM encoder. Table 8 shows statistics of the three ICEWS country datasets (Boschee et al. 2015); each country has 1,931 days for training, 224 days for validation, and 224 days for testing. Table 9 briefly summarizes our approach and the competitors, including the deep neural network (DNN) (Bengio et al. 2009), the dynamic GCN (Deng, Rangwala, and Ning 2019), the temporal GCN (Zhao et al. 2019), the recurrent event network (RENET) (Jin et al. 2019), and Glean (Deng, Rangwala, and Ning 2020) with a composition-based multi-relational GCN (CompGCN) (Vashishth et al. 2019). Compared with other SOTA methods, our approach involves neither GNNs nor RNNs, which significantly reduces the design effort spent on updating intermediate embeddings.
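The following sketch shows how such head-only optimization could look, reusing the MEFHead sketch above; the tensors, learning rate, and epoch count are illustrative placeholders rather than the authors' settings.

```python
# Sketch: optimize only the prediction head (the LLM encoder stays frozen),
# using binary cross-entropy over the next-day relation occurrence labels.
import torch
import torch.nn as nn

head = MEFHead(emb_dim=1024, num_relations=256)            # from the previous sketch
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)   # small learning rate
criterion = nn.BCELoss()                                   # head already applies Sigmoid

# Placeholder batch: pre-computed frozen embeddings and next-day relation labels.
embeddings = torch.randn(8, 7, 20, 1024)         # (batch, days, quintuples, emb_dim)
labels = torch.randint(0, 2, (8, 256)).float()   # multi-label targets

for epoch in range(100):
    optimizer.zero_grad()
    probs = head(embeddings)
    loss = criterion(probs, labels)
    loss.backward()
    optimizer.step()
    # In practice, checkpoints would be kept based on the best validation recall.
```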
Based on Table 10, with good prompt engineering and an appropriate downstream prediction head, we achieve better multi-event forecasting accuracy than carefully structured frameworks such as Glean (Deng, Rangwala, and Ning 2020), which is equipped with dedicated GNNs for TKGs and RNNs for historical sequences. We also conduct ablation studies by removing the self-attention (SA) mechanism (Vaswani et al. 2017). According to Table 10, introducing self-attention before daily aggregation, which collapses the dimension of multiple historical days, is important. Intuitively, we believe there are more complicated relationships among daily quintuples that continuously affect one another over time, which calls for more detailed research as future work.
Last but not least, before optimizing the prediction head, we also tried fine-tuning RoBERTa-large with the masked language modeling loss (Liu et al. 2019), by stacking all quintuple prompts into a large textual corpus. Our motivation was to make the RoBERTa encoder more compatible with such a knowledge-intensive task. However, comparing the fine-tuned LLM with the pre-trained version, we find that the accuracy improvement is rather marginal. We offer two perspectives to explain this:
• Optimization: We train the downstream prediction head thoroughly, using a small learning rate and many epochs while saving model checkpoints based on the best validation recall. Although the prompt embeddings fed as inputs differ before and after LLM fine-tuning, the layers in the downstream prediction head could converge to similar local optima after thorough optimization.
• Problem Formulation: The pre-trained RoBERTa-large encoder is already very powerful, especially given relatively short prompts compared to its maximum input length. Furthermore, the multi-event forecasting task predicts possible relation occurrences in the future given only historical knowledge, a setting that carries much inherent uncertainty. Recall that perplexity evaluates how well a language model has learned the distribution of the text on which it has been trained (Jelinek et al. 1977). Therefore, the higher perplexity of the originally pre-trained RoBERTa-large may not accurately reflect its usefulness for multi-event forecasting.
Discussion and Conclusion
In this section, we discuss the limitations of our approaches and outline several ideas that could be explored as future work.
Limitations
All of our approaches for object prediction and multi-event forecasting aim to simplify the design of existing TKG learning frameworks by introducing either encoder-only or encoder-decoder LLMs. As time evolves, the capability of pre-trained LLMs to reason about and infer the development of human societies can be explored further.
Potential Future Work
The first future direction is to transfer from quintuples back to quadruples for better generalization. Many existing TKG learning frameworks, whether rule-based or embedding-based, assume by default that TKGs are built from quadruples rather than quintuples (Liao et al. 2023), so we may not start with the abundant text summaries that are important for LLMs' contextualized understanding. If the fifth element of helpful text is still needed, we aim to develop our own retrieval augmentation methods to extract informative sentences from an open-source knowledge base such as Wikidata (Vrandečić and Krötzsch 2014).
Second, we aim to propose a new problem formulation for TKG learning. While object prediction is relatively simple and straightforward, multi-event forecasting is highly challenging due to the need to predict the occurrences of hundreds of unique relations simultaneously for a future day. Therefore, we aim to find a balance between these tasks. One potential future direction is to closely examine each unique relation on each future day and dynamically adjust the prompt context. Although this approach may be computationally expensive, it promises to provide more detailed historical knowledge during optimization.
References
- Beltagy, Peters, and Cohan (2020) Beltagy, I.; Peters, M. E.; and Cohan, A. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
- Bengio et al. (2009) Bengio, Y.; et al. 2009. Learning deep architectures for AI. Foundations and trends® in Machine Learning, 2(1): 1–127.
- Blei, Ng, and Jordan (2003) Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan): 993–1022.
- Boschee et al. (2015) Boschee, E.; Lautenschlager, J.; O’Brien, S.; Shellman, S.; Starz, J.; and Ward, M. 2015. ICEWS Coded Event Data.
- Cai et al. (2022) Cai, B.; Xiang, Y.; Gao, L.; Zhang, H.; Li, Y.; and Li, J. 2022. Temporal knowledge graph completion: A survey. arXiv preprint arXiv:2201.08236.
- Cho et al. (2014) Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Chung et al. (2024) Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70): 1–53.
- Church and Hanks (1990) Church, K.; and Hanks, P. 1990. Word association norms, mutual information, and lexicography. Computational linguistics, 16(1): 22–29.
- Deng, Rangwala, and Ning (2019) Deng, S.; Rangwala, H.; and Ning, Y. 2019. Learning dynamic context graphs for predicting social events. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1007–1016.
- Deng, Rangwala, and Ning (2020) Deng, S.; Rangwala, H.; and Ning, Y. 2020. Dynamic knowledge graph based multi-event forecasting. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1585–1595.
- Dong et al. (2022) Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Wu, Z.; Chang, B.; Sun, X.; Xu, J.; and Sui, Z. 2022. A survey on in-context learning. arXiv preprint arXiv:2301.00234.
- Jelinek et al. (1977) Jelinek, F.; Mercer, R. L.; Bahl, L. R.; and Baker, J. K. 1977. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1): S63–S63.
- Jin et al. (2019) Jin, W.; Qu, M.; Jin, X.; and Ren, X. 2019. Recurrent event network: Autoregressive structure inference over temporal knowledge graphs. arXiv preprint arXiv:1904.05530.
- Leetaru and Schrodt (2013) Leetaru, K.; and Schrodt, P. A. 2013. GDELT: Global data on events, location, and tone, 1979–2012. In ISA annual convention, volume 2, 1–49. Citeseer.
- Lewis et al. (2019) Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
- Lewis et al. (2020) Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33: 9459–9474.
- Liao et al. (2023) Liao, R.; Jia, X.; Ma, Y.; and Tresp, V. 2023. GenTKG: Generative Forecasting on Temporal Knowledge Graph. arXiv preprint arXiv:2310.07793.
- Lin (2004) Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 74–81.
- Liu et al. (2022) Liu, Y.; Ma, Y.; Hildebrandt, M.; Joblin, M.; and Tresp, V. 2022. TLogic: Temporal logical rules for explainable link forecasting on temporal knowledge graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 4120–4127.
- Liu et al. (2019) Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Ma et al. (2023) Ma, Y.; Ye, C.; Wu, Z.; Wang, X.; Cao, Y.; and Chua, T.-S. 2023. Context-aware event forecasting via graph disentanglement. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1643–1652.
- OpenAI (2023) OpenAI. 2023. GPT-3.5 Turbo. https://platform.openai.com/docs/models/gpt-3-5-turbo. Accessed: 2024-03-30.
- OpenAI (2024) OpenAI. 2024. ChatGPT. https://openai.com/chatgpt/. Accessed: 2024-05-01.
- Ouyang et al. (2022) Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 27730–27744.
- Pan et al. (2023) Pan, J.; Nayyeri, M.; Li, Y.; and Staab, S. 2023. Do Temporal Knowledge Graph Embedding Models Learn or Memorize Shortcuts? In Temporal Graph Learning Workshop@ NeurIPS 2023.
- Pan et al. (2024) Pan, S.; Luo, L.; Wang, Y.; Chen, C.; Wang, J.; and Wu, X. 2024. Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering.
- Pellissier Tanon, Weikum, and Suchanek (2020) Pellissier Tanon, T.; Weikum, G.; and Suchanek, F. 2020. YAGO 4: A reason-able knowledge base. In The Semantic Web: 17th International Conference, ESWC 2020, Heraklion, Crete, Greece, May 31–June 4, 2020, Proceedings 17, 583–596. Springer.
- Schlichtkrull et al. (2018) Schlichtkrull, M.; Kipf, T. N.; Bloem, P.; Van Den Berg, R.; Titov, I.; and Welling, M. 2018. Modeling relational data with graph convolutional networks. In The semantic web: 15th international conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, proceedings 15, 593–607. Springer.
- Shang et al. (2019) Shang, C.; Tang, Y.; Huang, J.; Bi, J.; He, X.; and Zhou, B. 2019. End-to-end structure-aware convolutional networks for knowledge base completion. In Proceedings of the AAAI conference on artificial intelligence, volume 33, 3060–3067.
- Shang and Huang (2024) Shang, W.; and Huang, X. 2024. A Survey of Large Language Models on Generative Graph Analytics: Query, Learning, and Applications. arXiv preprint arXiv:2404.14809.
- Vashishth et al. (2019) Vashishth, S.; Sanyal, S.; Nitin, V.; and Talukdar, P. 2019. Composition-based multi-relational graph convolutional networks. arXiv preprint arXiv:1911.03082.
- Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30.
- Vrandečić and Krötzsch (2014) Vrandečić, D.; and Krötzsch, M. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10): 78–85.
- Wei et al. (2021) Wei, J.; Bosma, M.; Zhao, V. Y.; Guu, K.; Yu, A. W.; Lester, B.; Du, N.; Dai, A. M.; and Le, Q. V. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
- Zhao et al. (2019) Zhao, L.; Song, Y.; Zhang, C.; Liu, Y.; Wang, P.; Lin, T.; Deng, M.; and Li, H. 2019. T-GCN: A temporal graph convolutional network for traffic prediction. IEEE Transactions on Intelligent Transportation Systems, 21(9): 3848–3858.