Global Prediction of COVID-19 Variant Emergence Using Dynamics-Informed Graph Neural Networks
Abstract
During the COVID-19 pandemic, a major driver of new surges has been the emergence of new variants. When a new variant emerges in one or more countries, other nations monitor its spread in preparation for its potential arrival. The impact of the new variant and the timings of epidemic peaks in a country highly depend on when the variant arrives. The current methods for predicting the spread of new variants rely on statistical modeling, however, these methods work only when the new variant has already arrived in the region of interest and has a significant prevalence. Can we predict when a variant existing elsewhere will arrive in a given region? To address this question, we propose a variant-dynamics-informed Graph Neural Network (GNN) approach. First, we derive the dynamics of variant prevalence across pairs of regions (countries) that apply to a large class of epidemic models. The dynamics motivate the introduction of certain features in the GNN. We demonstrate that our proposed dynamics-informed GNN outperforms all the baselines, including the currently pervasive framework of Physics-Informed Neural Networks (PINNs). To advance research in this area, we introduce a benchmarking tool to assess a user-defined model’s prediction performance across 87 countries and 36 variants.
Index Terms:
Epidemiology, Graph Neural Networks, Public Health Planning, ForecastingI Introduction
The COVID-19 pandemic presented an unprecedented global health crisis, severely affecting millions of people worldwide and demanding swift and effective responses from governments and healthcare providers [1]. As the pandemic progressed, multiple variants of COVID-19 emerged, each possessing genetic mutations that can significantly impact transmissibility, virulence, and even vaccine efficacy. During much of the COVID-19 epidemic, the surge in cases and severe outcomes (hospitalizations and deaths) have been driven by the emergence of new variants [2]. Consequently, monitoring the emergence and spread of these variants is crucial for devising appropriate public health measures and optimizing containment strategies [3]. A critical factor driving the surge in a given region is the arrival time of the new variant [4]. We can observe how a new variant, that has not appeared in region yet, spreads in region . We can study the spread in the region to understand the properties of this new variant. If it is a highly transmissible or immune-evading variant [5], we can expect it to spread in the region eventually. However, precisely when it would happen in region remains unknown. Making a good prediction of arrival time would lead to more effective preparation and resource management.
We focus on the problem of predicting the arrival of a new variant in a given region provided that it has appeared somewhere else. Due to testing delays and the fact that not all cases are genomically analyzed, it is difficult to pin down when a variant arrives. Therefore, we consider measuring the delay to reach a certain proportion of prevalence.
Figure 1 shows the proportions of different variants over time in the United Kingdom and Sweden. We can see that there is often a clear delay (see 20.Alpha.V1 and 21.J.Delta) in arrival of the variants in Sweden. Furthermore, a variant may start to spread, but quickly get dominated by another variant before it reaches a significant proportion of the circulating cases (e.g., 21.A.Delta in Figure 1). Therefore, we reformulate the problem as the following:

Problem 1 (Emergence Delay Prediction).
Given the prevalence (proportion) of a variant in regions , predict when the variant will reach a proportion of in region , provided that this variant has not yet reached region .
Here, the term “proportion” refers to the fraction of new cases created by the variant under consideration out of total new cases. Note that this is a regression problem – we seek to predict a quantity, which is the delay between the current date and the date of reaching the desired threshold. Further, we wish to perform this prediction for a given region before any sample of the variant of interest is observed in that region.
Prior works have focused on the problem of predicting the prevalence of a variant when it has already arrived in the region of interest [6, 7]. Specifically, logistic regression turns out to provide a good estimate of the prevalence over time [8, 9]. However, these techniques require the variant to have a non-zero proportion already in the region of interest. To the best of our knowledge, no prior work exists predicting when the delay reaches a certain prevalence even when there is zero prevalence in the region.
When a new variant emerges with evolutionarily favorable properties, it can only be transferred to a different region through a host (in the case of COVID-19 – a human). This encourages the idea of using an underlying network of mobility to address the proposed problem. Since reaching a certain prevalence may depend on other currently circulating variants, their dynamics (how fast one variant can spread over others) also play a role. Therefore, we propose a variant dynamics-informed Graph Neural Network (GNN) that utilizes a network of mobility and features inspired by variant dynamics to solve the proposed problem. This is different than typical Physics-Informed Neural Networks (PINNs) where dynamics act as a regularization for the loss [10]. Here, we show that our approach of constructing appropriate features results in lower errors compared to the PINNs approach.
Our key contributions represent a novel effort in addressing the challenge of predicting the emergence of COVID-19 variants at a global level, and in doing so establishing a new benchmark for evaluating Delay Prediction on the “CoVariants” dataset [11]. We expect that this benchmark will help build the capacity to predict arrival times of the future variants of COVID-19 and other outbreaks. More specifically, our contributions are as follows:
-
1.
We develop novel adaptations of GNNs that account for complex inter-dependencies between countries using GNNs while incorporating variant delay dynamics at the node level. To the best of our knowledge, we are the first to derive these delay dynamics for variants. Experiments demonstrate that our approach leads to superior results compared to several Machine Learning (ML) methods, including the currently pervasive framework of PINNs that integrate dynamics in the loss function.
-
2.
We make our evaluation pipeline publicly available so that it can be used to evaluate any user-defined PyTorch model111https://github.com/itssmutnuri/gnnvariants. This will advance public health research and enable public health experts and policymakers to leverage this benchmark pipeline for improvement of their models and consequently, for enhancing global health outcomes.
II Related Work and Background
II-A Variants and Variant Dynamics
In the context of infectious diseases, the term “variant” refers to a version of a virus with some changes in its genetic material, known as genome [12]. These changes happen through genetic mutations and can affect the virus’s characteristics, like how easily it spreads, how severe the illness it causes is, and whether it can evade the immune system or not. Variants can be categorized in multiple ways based on genetic differences, and the significance of these differences can vary. Some categorizations focus on a common ancestor, while others may specifically highlight mutations in key regions of the virus’s genome [12]. In our study, we rely on the categorizations provided by the CoVariants dataset [11].

Furthermore, the proportions of different variants can be analyzed over time. Some studies have observed that these proportions follow a straight line when plotted on a semi-logarithmic scale. They leverage this fact to understand the dynamics of variant prevalence in populations [13]. We also observe this in our data as shown in Figure 2. However, to the best of our knowledge, existing studies do not derive any dynamics capable of predicting delays in the emergence of variants between multiple regions/countries.
II-B Graph Neural Networks and Epidemics
Many efforts were made to develop forecasting models capable of predicting the progression and future trends of the COVID-19 pandemic [14, 15, 16]. However, these forecasting approaches have primarily focused on predicting the spread of the virus rather than predicting delays in variant emergence. Predicting the timing of variant-specific outbreaks can significantly enhance public health preparedness and response strategies, allowing for timely interventions such as targeted vaccination campaigns and tailored public health measures [3].
Many Deep Learning (DL) techniques have been explored for this purpose, including GNNs which help capture the graph effects of the spread of infectious diseases over time across regions [17, 18, 19, 20, 21]. Other methods like PINNs [22, 23, 24], agent-based models [25, 26], and a combination of spatio-temporal networks with location-aware features [27], including mobility data [28] have also been used.
Instead of only predicting virus spread, as others have done, we aim to predict when a new variant will emerge in a region given that it has emerged elsewhere. This involves deriving dynamics of variant spread and integrating these insights into a GNN. Notably, our strategy revolves around crafting relevant features rather than altering the loss function, providing a distinctive and effective approach.
III Methodology
III-A Variant Dynamics
First, we develop an understanding of the dynamics that play a role in the spread of a new variant and how we can utilize them to make informed predictions. We start by extracting each variant’s global growth rate from observed data. This value provides an indication of the dominance of the variants over other competing ones. Subsequently, we construct linear models based on the dynamics of infectious diseases and the derived growth rates to predict delays between regions, establishing this Dynamics-based Model as our baseline for Delay Prediction. Finally, we consider different approaches in which we can incorporate some disease dynamics using GNNs that are chosen for their ability to intuitively model spatial dynamics, with the goal of enhancing the prediction accuracy of our model.
Throughout the text, we use the terms “prevalence” and “prevalence ratio”, where the “prevalence”, , is the proportion of cases of a given variant, and the “prevalence ratio” is defined as , where is the prevalence of the variant , and is the prevalence of the most prevalent variant, , that is not . In our experiments, a variant is considered dominant if it reaches an value of at least . We define this value as the dominance threshold, .
III-B Variant Growth Rates
We first derive the dynamics of two competing variants in one region. While logistic regression has been widely used to fit these dynamics, for completeness, we present a derivation. We start by using a model of infectious diseases that generalizes a large class of models [29]:
(1) |
where is the new infections at time , are transmission parameters, is the fraction of susceptible population, and is some small number. The number of contacts at time is given by . The approximation follows from Taylor expansion, and using the fact that is a smooth function. The constant happens to be the mean serial interval – the average time between two successive infections. Now, given two variants and , assuming full cross-immunity between all variants, i.e. , the prevalence ratio is approximated as:
(2) |
The above further assumes that the mean serial intervals of the two variants are close, i.e., . Note that we assumed full cross-immunity which is true for a number of variants before the Omicron sub-variants [30]. If there is only a partial cross immunity, . However, in a short interval, the new immunity (reduction is susceptibility) created by the variants is small. As a result, it can be shown that for some constant which can be absorbed in the parameter above. Solving the above recurrence relation, we get:
Taking the logarithm reduces the equation to:
(3) |
Here, constant is the mean serial interval – the average time between two successive infections, and is the ratio of transmission rates between the two competing variants. Parameter can be referred to as the relative growth rate of variant over variant [29].
Now, consider a scenario in two regions and — has the variant emerging over the previously dominant variant ; but variant never reached region where variant is dominant. This makes it difficult to assess the potential impact of variant on region when we only know . To deal with this challenge, we note that the parameter should ideally be independent of the region (this may not hold due to the simplifying assumptions above, so we actually have a region-specific relative growth rate for region ). Therefore, we can attempt to find the growth-rate of any variant with a fixed variant , specifically, the original COVID-19 variant. For ease of notation, we represent the growth advantage of variant with respect variant as , with , and refer to these values as global growth rates. Ideally, , irrespective of the region where the is estimated.
To estimate these the global growth rates for all the variants that exist by the time over all regions , we can set up the following system of linear equations:
(4) |
where is the growth advantage of variant over variant in region and are error terms. The solution is given by minimizing the sum of squares of .
We, then, define an objective function and solve for the value as
(5) |
Finally, based on the global growth rates, we define global relative growth rates as .
These growth rates are used in our proposed methods for Delay Prediction.
III-C Dynamics-based Model
Consider three variants , , and where is an emerging variant. We assume that either of the following holds – (i) Variants and , respectively in regions and , are the variants that constitute most of the other infections; or (ii) All variants other than in the respective regions and have similar enough global growth rates that can be grouped into variants and . Suppose variant appears in region first, followed by region . Then we have:
(6) |
(7) |
For a new variant, initially, the number of infections increase because of the new infections coming from a different region (importations). After some point, when transmission starts to happen within region (community transmission), the imported infections become negligible in comparison, and this is the point after which above equations are valid.
Suppose we select as the time at which is large enough such that the community transmission dictates the dynamics in the region rather than importations and so, both Equations 6 & 7 hold for . Suppose that the threshold for prevalence in region for community transmission to dominate is and that for region is . We assume that and are independent of the variant. Then, by setting in equations 6 & 7, we get:
(8) |
Therefore, both & are variant independent. Let and be the times at which variant reach target prevalence of in regions and , respectively. We want to find . From equations 6 and 7:
(9) | |||
Also from equation 7:
(10) |
Plugging this value into equation 9, we get:
(11) |
Here is a given prevalence, , , and are global relative growth rates estimated as shown in Section III-B. and are unknowns. Therefore for each pair of regions and , to identify the delay in reaching a given prevalence , we can build a linear model
(12) |
We then use the calculated weights from our linear models in equation 12 and plug them into equation 11 to find the value between two regions. This is a pairwise linear model that estimates the delay between two regions rather than providing a delay from the current date. Algorithm 1 details how we compute the date at which a variant arrives in a region . We calculate the median of the outputs from all pairwise models from regions to region . We evaluate the performance of all the models until the current date and select the top three performing source regions among . Notably, best performance here refers to the lowest error in predicting the delay. We identify three source regions that produced the lowest error in predicting the delays at . Then, we take the median of the predictions produced by these selected models at time which represents the delay at time for the new variant to appear in region .
III-D Dynamics-Informed GNN
Key Idea: We observe that global growth rates play a crucial role in the dynamics. Furthermore, based on Equation 3, under some assumptions, the logarithm of the prevalence ratio grows linearly with time. We hypothesize that using and as features simplify the underlying patterns to be learned by an ML algorithm. An ML algorithm may also complement any violations of the assumptions made in the dynamics-based model.
We propose a GNN-based network for this problem as GNNs are adept at capturing spatial relationships, making them well-suited for problems where geographical proximity between regions, such as countries, plays a crucial role. Additionally, the problem induces a natural graph structure, where countries can be represented as nodes with edges denoting some relationship between the countries. Furthermore, we explore different techniques to incorporate disease dynamics into the GNN in an effort to capture the complex interactions and patterns associated with the spread of infectious diseases.

III-D1 Graph Construction
The graph construction is illustrated in Figure 3. First, we represent each country as a node in the graph, and edges are established to represent relationships between the countries. There are 87 countries in which at least one variant appears and becomes dominant in the CoVariants dataset [11]. The dataset provides the state of variants and mutations of interest for COVID-19 from August 2020 through October 2023. It has a bi-weekly resolution for 36 variants starting from 20A.EU1 to 23F.Omicron. While not shown in Figure 3, self-loops are included to account for internal transmissions.
Next, we consider the temporal aspect of the problem. The graph may evolve over time to capture changing relationships or influences among countries. This is introduced into our graph as dynamic edge weights based on border control data in the “OxCGRT” dataset [31]. This dataset is from January 2020 to January 2023 measuring the variations in government responses using their COVID-19 Government Response Stringency Index, which is a simple additive score of nine indicators ranging from school closures to vaccination policies. Of these indicators, we picked international travel closure controls since they best capture the emergence and cross-border transmission of variants. These values are encoded in an ordinal scale as follows: 0 - no measures, 1 - screening, 2 - quarantine on high-risk regions, 3 - ban on high-risk regions, 4 - total border closure. The recorded values were then scaled down to a range between 0.1 and 1, with 1 indicating no border restrictions and 0.1 representing complete border closure. Note that 0.1 was chosen over 0, acknowledging the possibility of some mobility between connecting countries due to necessary trade or border crossings via open land routes. Given that our CoVariants data extends beyond January 2023, we extend OxCGRT by assuming that all border restrictions have remained unchanged since Jan 8, 2023.
To obtain the full graph, we create multiple copies of the above graph representing regions and inter-connectivity – one per variant at a given timestamp . This approach allows us to leverage variant-specific dynamics as our node features. At , each node has features consisting of a time series of the logarithm of prevalence ratios of a present variant and the most prevalent variant , spanning time steps. This results in a vector . The corresponding value of the variant is then concatenated to form each node’s final feature vector .
III-D2 Feature Augmented Graph Convolutional Network (FA-GCN)

To extract both temporal and geographical information of the data, we propose a hybrid architecture summarized in Figure 4. In this model, we employ a simple Gated Recurrent Unit (GRU) [32] to extract temporal information of the variant transmissions in time steps. We temporarily remove the growth rate (which is not time-dependent) from node features and embed the 1-dimensional (1D) time-dependent features (prevalence ratios) into a 2D latent feature vector. Note that in this configuration, we embed the features separately for each country, treating the nodes as distinct input samples grouped into a single batch, resulting in a 3D latent feature vector. The 3D latent feature vector is then flattened to obtain a final 2D embedding, which serves as input for a one-layer Graph Convolutional Network (GCN). The Leaky ReLU activation function introduces non-linearity after the GCN layer. Additionally, two GraphNorm[33] layers are applied after the GRU and GCN layers for normalization, and dropout is incorporated after the GCN layer for regularization. Finally, the embedded features are fed into a Fully Connected (FC) layer to predict delay until emergence for each country.
III-D3 Physics Informed GCN (PI-GCN)
For comparison, we use a PINN-inspired [10] approach to make predictions based on dynamics for the Delay Prediction task. This model utilizes a 2-layer GCN architecture, incorporating dropout and two GraphNorm layers which are applied to the output of each GCN layer, along with Leaky ReLU activation functions. Typically this GCN would be trained using a Mean Squared Error (MSE) loss, but we modify this loss function to steer our GCN towards dynamic predictions made by the dynamics-based model.
The adjusted loss function is given by:
(13) |
where N is the number of countries, is our model’s output, is the ground truth, and is a hyperparameter ranging from 0 to 1, signifying the influence of , the output from the dynamics-based model. This modification ensures that, during training, the model is penalized more when its prediction deviates from the linear model’s prediction, providing a form of dynamics-informed regularization. The best value was found to be 0.1 by grid search.
III-E Training Procedure
All training and validation of the models were performed retrospectively, as outlined in Algorithm 2. This means that the model is trained and validated at each time step using only the ”observed” data available up to that time step, denoted as . For each variant, the retrospective algorithm iterates through all the biweekly data associated with that variant, training the model only on the weeks that have already passed (i.e., the observed data). Note that the pre-processing is also done retrospectively, accounting for any smoothing, calculation of the values, or interpolation in the case of the dynamics-based model. This ensures that the approach is applicable in a prospective setting, where no data from the future is available.
The train/validation split is also done temporally to ensure that validation is performed only on the most recently observed data. This aims to achieve a better fit to the latest data, enhancing the model’s predictive performance. An Early Stopper monitors the validation loss and halts the training when overfitting is detected, signified by a constant and substantial increase in the validation loss. The Early Stopper saves the best-performing weights, which are reused in the next iteration. This practice eliminates the need for the model to start training from scratch, leveraging knowledge gained from previous iterations.
We use MSE, or its adjusted version, as the loss function. Another crucial aspect is how to handle data related to variants that either do not appear in a given country or appear but fail to become dominant, i.e. rapidly diminish. In such cases, where regression targets would be undefined (potentially infinite), we address this challenge by creating a mask. This mask is employed to conceal nodes corresponding to these specific variants/country pairs, ensuring they do not contribute to our training loss. Essentially, our graph models are not trained on these nodes.
IV Experiments
IV-A Dataset
To validate our methodology, we’ve used a combination of multiple datasets which are detailed below.
IV-A1 Covariant Data
CoVariants [11] is our primary dataset which provides the current state of variants and mutations of interest for SARS-CoV-2. This data is enabled by GISAID [34] and ranges from August 2020 through October 2023, with a bi-weekly resolution for 36 variants starting from 20A.EU1 to 23F.Omicron, which is current as of this work. CoVariants follows the NextStrain Clade schema [35], in which variants can descend from other variants.
IV-A2 Border Restrictions Data
We considered the border restrictions data provided in the OxCGRT [31] dataset, with details on what level of border controls were enacted by governments in real-time from Jan 1, 2020, to Jan 8, 2023. OxCGRT measured the variations in government responses using their COVID-19 Government Response Stringency Index, which is a simple additive score of nine indicators ranging from school closures to vaccination policies. Of these indicators, we picked international travel closure controls since they best represent the goal of this paper in capturing the emergence and cross-border transmission of variants. These values are encoded in an ordinal scale as follows: 0 - no measures, 1 - screening, 2 - quarantine on high risk region, 3 - ban on high risk regions, 4 - total border closure. The recorded values were then scaled down to a range between 0.1 and 1, with 1 indicating no border restrictions and 0.1 representing complete border closure. Note that 0.1 was chosen over 0, acknowledging the possibility of some mobility between connecting countries due to necessary trade or border crossings via open land routes. Furthermore, given that our CoVariants data extends beyond January 2023, we extend OxCGRT by making the assumption that all border restrictions have remained unchanged since Jan 8, 2023.
IV-A3 Country Graphs
We considered two graphs representing the interconnection between countries. This dataset depicts country neighborhoods according to the Correlates of War (CoW) database. Countries were also deemed neighbors if they share a land or river border or are located within 24 miles of each other across bodies of water. Adjacency: We used the country adjacency dataset [36], which offers a list of countries and their neighboring countries. OpenFlights: This is a collection of regular flights along with their routes mapped from airports around the world [37]. Given the scope of the current work, we primarily focus on airports and the flight route data, and was mainly used during ablation studies. Airports data is a UTF-8 encoded collection of over 14,000 airports, train stations and ferry terminals across the world as of January 2017. It consists of information like city/country, IATA codes, geographic coordinates and the type of the airport (air terminal, ferry station, etc.). Routes data provides information about 67663 directional routes between 3321 airports on 548 airlines. This dataset was used in an ablation study to validate the effectiveness of utilizing these flight routes as edges in our graph creation process.
A specific adjustment in preprocessing was necessary for the OpenFlights dataset, involving the mapping of city routes to their respective country routes. Leveraging the provided airport data, which contains comprehensive city and country details, facilitated this mapping process. The transformation ensured that subsequent analyses operated consistently at the country level.
IV-B Benchmark Pipeline

Since we are the first to attempt solving this problem, we provide a benchmarking tool for the community. Figure 5 presents the comprehensive pipeline employed for retrospective training, validation, and testing of diverse models. Its design prioritizes user-friendliness, allowing seamless integration of new PyTorch models or direct utilization of scikit-learn models through a configuration file. Users only need to specify whether the model requires graph data, leaving the pipeline to handle the rest. Additionally, the data pre-processing is encapsulated in a separate function, providing users with the flexibility to make modifications.
As shown in Figure 5, the evaluation is also being done retrospectively. It means that for each date that the model is trained on, it is used to predict for the subsequent date representing a 1-timestep ahead forecast. This aligns with the biweekly nature of the data, resulting in a two-week ahead forecast. Furthermore, model performance is being evaluated per variant, meaning that for each given variant we retrospectively assess the trained models over time.
In the context of delay prediction, considering the biweekly data, the target is consistently a multiple of 14 days. Therefore, the output of our models is always rounded up to the nearest multiple of 14. Furthermore, evaluations exclude countries where a variant has already become dominant or never achieves dominance.
IV-C Pre-processing
The initial pre-processing stage aligned the country names across diverse datasets, with an emphasis on inclusion based on data availability. Only countries present in all datasets were retained, ensuring a standardized set for subsequent analyses. For each country, the prevalence of each variant was calculated at each timestep. Variants with a prevalence of less than 5% were disregarded in our analysis. To calculate the prevalence ratios, the two variants with the highest prevalence were identified for each timestep. For each variant on that particular day, the prevalence ratio was calculated. Finally, variants that did not reach a prevalence ratio of 0.2 or those that were present in less than three countries were filtered out. These steps ensured that only variants with a significant presence and wide geographic distribution were included in the subsequent analysis. All subsequent pre-processing steps were conducted retrospectively. This implies that as time progresses, more data is used for computing the global growth rates () and for training.
IV-D Benchmark Evaluation Metrics
To measure performance, we define Mean/Median Absolute Errors (MAE/MedAE) as follows:
(14) | ||||
(15) |
where ranges from 1 to and represents the number of valid countries to predict for. The , subscripts indicate that these errors are calculated for each variant at each timestep. Since this problem is retrospective, we also evaluate performance over time by finding the mean and median of the errors:
where ranges from 1 to and represents the total number of timesteps for which the variant circulates before reaching total dominance, i.e. it globally reached all its targets and begins to disappear. Finally, we conduct a mean across all the variants to get our four error metrics: MMAE, MMedAE, MedMAE, and MedMedAE. While all these metrics offer valuable insights, our main focus lies on MedMAE and MedMedAE, emphasizing median performance over time. This choice is made due to certain data inconsistencies, such as variants appearing in a single country for an extended period before spreading to other countries, resulting in large errors that may skew results.
IV-E Benchmark Models
We implemented a number of benchmark models that act as baselines for comparisons.
Mean Model – Delay Prediction: This model predicts delays based on average delays on the variants between the countries under consideration. Specifically, at time , for a new variant on a target country , suppose is the set of all countries where the variant has appeared. Then, the predicted time of arrival is given by
(16) |
Here is the delay between the appearance of variant from country to country . We take the mean of over all prior variants that appeared in both countries and . Finally, we take the median of the delays over all countries in where variant has already appeared.
Dynamics-based Model – Delay Prediction: This is the derived dynamics-based linear model discussed in the previous section. This model serves as a baseline mechanistic model, which, under the right assumptions, offers an explainable and efficient method for calculating delays.
Decision Tree: We use Decision Tree regression models to address Delay Prediction. This involves transforming the dataset into an matrix. Here, represents the total number of samples, where each sample corresponds to the 5 features of a given variant at a specific point in time in a specific country. Meaning , where is the total number of timesteps, is the total number of variants, and is the total number of countries. These trees use the Gini impurity and mean squared error for their criterion functions. No maximum depth was set for either.
Multi Layer Perceptron: We employ a Multi-Layer Perceptron (MLP) regressor for Delay Prediction. They were configured with two hidden layers, consisting of 16 and 8 neurons, respectively. The selection of hyperparameters, including the number of hidden layers, activation function, and solver, was determined through comprehensive experimentation and optimization tailored to our specific problem. The training procedure closely followed the approach outlined in Section 3.3, with the only distinction being the structuring of our dataset into an matrix, as described for the decision tree. A batch size equivalent to the number of countries was used, where each batch represented the list of countries and their features at each timestep for each variant, ensuring that the gradient updates were performed collectively for each group of variants at a given timestep.
Gated Recurrent Unit Network We utilize a simple GRU network to encode the temporal information of variant transmissions and predict theirdelay. To train this model, first, we feed temporal features of the variant to the GRU architecture which embeds them for each country distinctly. Then, we flatten the embedded features and append them with and directly give them to an FC layer for predicting the time it takes for that variant to appear in a specific country. We employ GraphNorm layer here after the concatenation to accelerate training by smoothing graph aggregation distributions.
Graph Convolutional Network (GCN) We consider a simple GCN, with a basic architecture in an attempt to capture local relationships and features within the graph structure. It is comprised of two GCN layers, which output 32 and 16 channels respectively. Dropout is strategically employed between these two layers for regularization. The final graph embedding is flattened and fed into an FC layer which outputs the delay for each country.
Encoded Dynamics GCN Each country has a variable number of circulating variants, each with different growth rates (as derived earlier). These growth rates can be concatenated and represented as a feature vector , which correspond to the variants present in a region at time . We can learn a latent vector representation of this data through the use of encoders. We implement this in two ways:
Using an Autoencoder (AE-GCN): We employ a simple autoencoder with two linear layers and ReLU activation in the encoder and decoder blocks. The model takes the feature vector as input. We train the autoencoder using MSE loss to reconstruct the feature vector . After training, we take the intermediate latent encoding output from the encoder and append it to the feature matrix before constructing a spatio-temporal snapshot graph data for the GCN model.
Embedded Encoder (EE-GCN): Using an autoencoder implies that the model is guided towards learning an embedding using the decoder output, and these two are inherently separate models. The encoder weights are also unaffected during graph training. An alternative approach is to embed this into the larger GCN model. In this method, as before, the variable-sized is fed to an encoder block (consisting of two linear layers with ReLU activation). Instead of using a decoder, we directly concatenate the resulting latent vector representation to the node features fed into the GCN. This means the encoder weights are updated in the graph training loop, allowing the model to learn a better latent representation.
IV-F Results
In this section, we present our key results for Emergence Delay Prediction. We previously noted that prior research has concentrated on predicting the prevalence of a variant after its arrival in a specific region, with logistic regression demonstrating effectiveness in estimating its prevalence over time. However, as logistic regression can manage this task relatively easily, our evaluation focused solely on forecasting a variant’s prevalence before its arrival in a specific region. In other words, during evaluation, we excluded cases where a variant is present in a region but has not yet reached .
Numerous models were tested and configured as benchmarks to test against the performance of our proposed method. Other implementations include a trivial Mean Model, the baseline Dynamics-based Model, a Decision Tree, a Multilayer Perceptron (MLP), a GRU, and a GCN without an adjusted loss function. In brief, the Mean Model predicts delays based on the average delays of all prior variants between all the countries in which the variant appeared and our intended country. The Decision Tree, MLP, and GRU regression models make use of extra features which consist of the average of neighboring countries transforming the dataset into a matrix of size . Here, is the total number of samples, is the total number of retrospective dates, is the total number of variants, and is the total number of countries. The full details on these models can be found in the Appendix.
To measure the performance of the models, we use the defined Mean/Median Absolute Errors (MAE/MedAE) in the Appendix and measure the performance over time by finding the mean and median of these errors as:
where ranges from 1 to , being the total number of timesteps for which variant circulates before reaching total dominance, i.e. it globally reaches all its targets and begins to disappear. Finally, the mean across all the variants gets our two error metrics: MedMAE, and MedMedAE.
Model | MedMedAE | MedMAE |
---|---|---|
Mean Model | 3.2 | 3.87 |
Dynamics-based Model | 1.86 | 2.9 |
Decision Tree | 3.48 | 3.43 |
MLP | 2.38 | 2.48 |
GRU | 1.5 | 1.37 |
GCN | 1.32 | 1.43 |
FA-GCN | 1.27 | 1.29 |
PI-GCN | 2.24 | 2.01 |
AE-GCN | 1.82 | 2.25 |
EE-GCN | 2.00 | 2.49 |
Table I presents the models’ performance metrics for Delay Prediction. The results indicate that our proposed FA-GCN model outperforms all other models in terms of MedMedAE and MedMAE, showcasing its effectiveness in capturing temporal dependencies for Delay Prediction. The dynamics-based model also has a significantly worse MedMAE than MedMedAE. This suggests that the dynamics-based model is susceptible to outlier predictions for some country pairs, which skews its results. Additionally, we observe that the PI-GCN does not surpass its GCN counterpart, highlighting the unnecessary use of dynamics as a regularization of the loss when employing the correct features.

V Discussion
V-A Results Analysis
Figure 6 provides a detailed representation of the results for the baseline dynamics-based model and the FA-GCN model. The heatmaps illustrate the number of countries eligible for prediction at each timestamp. The numbers indicate the errors over time for each variant throughout their existence. Notably, several variants in the dynamics-based model are denoted by , signifying instances of model or evaluation failure. These failures stem from three primary factors. Firstly, the linear model requires a minimum of two common variants between two countries to build an effective model. Secondly, the linear model can encounter significant challenges when variant is identical in growth rate to variant (i.e., ). In such cases, the linear model fits a plane solely along a fixed axis and will thus always predict some constant , as seen in Equation 11. But as time passes and variant is no longer identical to variant for another variant, this model fails drastically since it would still fit on a fixed axis. These scenarios were more prevalent in the early stages of the pandemic when only a few variants were circulating, leading to their exclusion from the dynamics-based model’s training process. To ensure fair comparisons, all final errors across all models were calculated excluding these variants. Finally, there are cases where a variant appears in a country but fails to reach the threshold . These instances were disregarded during the evaluation of all models as they were considered trivial.
The observations from Figure 6 give us two insights. Firstly, the primary source of error is associated with variants that persist for an extended duration, such as 21J.Delta or 23E.Omicron. Examining the color gradients, these variants appear to linger in only a few countries before spreading to others. In reality, variants typically do not endure for such prolonged periods without either being superseded by another variant or transmitting to additional countries [30, 38]. This suggests a potential issue in the data capturing these lingering variants. Secondly, despite this being a major source of error for FA-GCN, it exhibits overall robustness to outlier predictions, unlike the dynamics-based model as seen in Figure 6. This, in turn, severely affects the results of the PI-GCN model.




While the dynamics-based model does output good predictions, it fails often and dramatically in some cases. Figure 7 presents examples of 3D points in space utilized for fitting linear models. In Figure 7(a), the 3D points form a space where a plane can be fitted exceptionally well. However, as newer COVID variants emerge, the dynamics may evolve, causing some initial assumptions, such as full cross-immunity [30], to become invalid. This evolution makes it more challenging to perfectly fit a linear model, as depicted in Figure 7(c).
V-B Ablation Study
V-B1 Dynamics
Model | |||||
---|---|---|---|---|---|
Decision Tree | 3.48 | 3.0 | 2.81 | 3.0 | 2.86 |
MLP | 2.38 | 2.67 | 2.69 | 2.55 | 2.76 |
GRU | 1.50 | 1.50 | 1.95 | 2.32 | n/a |
GCN | 1.32 | 1.50 | 1.78 | 2.32 | 2.38 |
FA-GCN | 1.27 | 1.55 | 1.73 | 1.73 | n/a |
We examine the usefulness of integrating the extracted dynamics as inputs for our models. In this analysis, we train the same models using different sets of features. Specifically, we investigate the effectiveness of using as a feature and the impact of using a time series of . Where excludes as a feature, relying solely on the time series of prevalence ratios rather than the prevalence . Additionally, we investigate the effectiveness of by comparing to and utilizing as the only feature.
From the results in Table II, we observe that the combination outperformed all other configurations except for the case of the Decision Tree. There is also an improvement when utilizing over for the temporal-based models, with the GCN performing similarly and the others exhibiting poorer performance. This suggests that the temporal models effectively encode the latent information contained in the time series of , while the simpler models struggle to do so. Additionally, inclusion of as a feature enhances results across all cases. Combining and significantly improves performance in most cases, indicating a synergistic effect likely stemming from the linear relationship between the two, as previously discussed. It is important to note that the GRU and FA-GCN were not trained using just since excluding the time series encoding reduces these models to an MLP and GCN, respectively.
V-B2 Graphs
Model | MedMedAE | MedMAE |
---|---|---|
FA-GCN | 1.27 | 1.29 |
FA-GCN_30 | 1.46 | 1.45 |
FA-GCN_60 | 1.57 | 1.46 |
FA-GCN_flights_adj | 1.95 | 1.46 |
FA-GCN_flights | 1.88 | 1.53 |
FA-GCN_fullcon | 2.10 | 1.67 |
FA-GCN_WoEW | 1.46 | 1.46 |
When designing the structure of our graph, several considerations were taken. Firstly, we acknowledged that certain nodes (countries) might be more susceptible to noise due to reporting inconsistencies or a lack of reporting, leading to unreliable data. To address this issue, we explored the option of using only countries with consistently reported data, focusing on the top 30 (FA-GCN_30) and top 60 (FA-GCN_60) countries out of the total 87 (FA-GCN) countries in our dataset.
Additionally, we examined different approaches for creating edges in our graph. While country adjacency served as the base (FA-GCN), we also explored the option of connecting countries through flight routes using the OpenFlights dataset. This exploration included a combination of country adjacency and flight routes (FA-GCN_flights_adj), as well as utilizing only flight routes (FA-GCN_flights).
To ensure a thorough comparison, we also tested a fully connected graph (FA-GCN_fullcon), where every country was connected to every other country. This configuration was expected to test the limits of connectivity by removing the specificity of relationships, allowing us to see if more generalized connections might yield better results. Additionally, we included a model without edge weights (FA-GCN_WoEW) to serve as a baseline, assessing the importance of dynamic weights in the performance of the graph model.
From Table III, it is evident that our selected graph design, based on country adjacency, outperforms alternative configurations. The number of nodes (countries) in the graph does not significantly impact the model performance, showcasing its robustness. In contrast, a fully connected graph performs the worst, indicating that simply increasing the number of connections without considering the relevance of those connections can introduce noise and degrade the model’s accuracy. It emphasizes the importance of meaningful relationships between nodes, rather than indiscriminately connecting all nodes.
Interestingly, the use of flight routes generally results in increased errors when compared to the adjacency-based model. While this might seem counter-intuitive, considering that flights could connect distant countries and continents. It’s crucial to note that, based on the definition of country adjacency used by the dataset, countries like the USA and Russia are considered adjacent. As a result, they act as a link between continents. In such cases, the adjacency-based connections might already encapsulate the critical pathways for information flow, making the additional flight route data redundant or even misleading. Moreover, the incorporation of dynamic weights appears to improve model performance, as seen when comparing FA-GCN with FA-GCN_WoEW.
Overall, these results underline the importance of carefully considering the graph structure and the nature of the connections when designing GCN models. While flight routes might seem like a logical addition, their effectiveness depends on how well they complement the existing adjacency information. In this case, it appears that the adjacency-based graph design captures the essential relationships better than the other configurations.
V-C Implementation
The pipeline depicted in Figure 5, was implemented in Python. The implementation was carried out on a machine equipped with a 32-core Intel(R) Xeon(R) Gold 5218 CPU running at 2.3 GHz and 64 GB of RAM. Focusing on the example of variant 23F.Omicron at a single timestep, which contains the majority of the data, and the heaviest model (FA-GCN), the average runtimes for different components were approximately 32 seconds for data pre-processing, 20 seconds for training, and 0.01 seconds for inference. The entire retrospective pipeline takes approximately 2 hours to complete. Notably, the most time-consuming aspect of the pipeline is the data pre-processing step, which could potentially be optimized further through memorization.
V-D Limitations
One limitation of our approach in addressing the problem of Delay Prediction lies in handling situations where a variant might not even become dominant in a given country. To tackle this issue, we define the following problem:
Problem 2 (Dominance Prediction).
Given the prevalence (proportion) of a variant in regions , predict if the variant will ever reach a proportion in region .
In future work, we plan to address this issue by adopting a two-step approach. First, we solve Dominance Prediction, which aims to predict whether a variant will become dominant in a given region. If the variant is predicted to become dominant, we can then proceed to solve Delay Prediction.
VI Conclusion
We addressed the challenges of predicting variant delay across countries. The derivation of variant dynamics provided a theoretical foundation, which was used to engineer relevant features as well as a novel baseline dynamics-based model. We demonstrated that our dynamics feature-augmented GNN approach outperformed all other methods. Through comprehensive experiments and analysis, we demonstrated the effectiveness of our design choices, providing valuable tools for understanding and predicting the intricate relationships and connectivity patterns between nations. Furthermore, as we are the first to address the proposed problem, we provided a comprehensive benchmark and made our full pipeline available in an effort to facilitate research in the field.
Acknowledgements
This work was supported by the Centers for Disease Control and Prevention and the National Science Foundation under the awards no. 2135784, 2223933, and 2333494.
References
- [1] “World health organization covid-19 dashboard.” https://covid19.who.int/, 2023.
- [2] T. L. Wiemken, F. Khan, L. Puzniak, W. Yang, J. Simmering, P. Polgreen, J. L. Nguyen, L. Jodar, and J. M. McLaughlin, “Seasonal trends in covid-19 cases, hospitalizations, and mortality in the united states and europe,” Scientific Reports, vol. 13, Mar. 2023.
- [3] “Tracking sars-cov-2 variants.” https://www.who.int/activities/tracking-SARS-CoV-2-variants, 2023.
- [4] P. V. Markov, M. Ghafari, M. Beer, K. Lythgoe, P. Simmonds, N. I. Stilianakis, and A. Katzourakis, “The evolution of sars-cov-2,” Nature Reviews Microbiology, vol. 21, p. 361–379, Apr. 2023.
- [5] A. S. Lambrou, P. Shirk, M. K. Steele, P. Paul, C. R. Paden, B. Cadwell, H. E. Reese, Y. Aoki, N. Hassell, X.-Y. Zheng, et al., “Genomic surveillance for sars-cov-2 variants: predominance of the delta (b. 1.617. 2) and omicron (b. 1.1. 529) variants—united states, june 2021–january 2022,” Morbidity and Mortality Weekly Report, vol. 71, no. 6, p. 206, 2022.
- [6] J. Sun, X. Chen, Z. Zhang, S. Lai, B. Zhao, H. Liu, S. Wang, W. Huan, R. Zhao, M. T. A. Ng, et al., “Forecasting the long-term trend of covid-19 epidemic using a dynamic model,” Scientific reports, vol. 10, no. 1, p. 21122, 2020.
- [7] H. Hu, H. Du, J. Li, Y. Wang, X. Wu, C. Wang, Y. Zhang, G. Zhang, Y. Zhao, W. Kang, et al., “Early prediction and identification for severe patients during the pandemic of covid-19: a severe covid-19 risk model constructed by multivariate logistic regression analysis,” Journal of Global Health, vol. 10, no. 2, 2020.
- [8] D. A. Shah, E. D. De Wolf, P. A. Paul, and L. V. Madden, “Accuracy in the prediction of disease epidemics when ensembling simple but highly correlated models,” PLOS Computational Biology, vol. 17, pp. 1–23, 03 2021.
- [9] S. Palaniappan, R. V, and B. David, “Prediction of epidemic disease dynamics on the infection risk using machine learning algorithms,” SN computer science, vol. 3, no. 1, p. 47, 2022.
- [10] M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” Journal of Computational Physics, vol. 378, pp. 686–707, 2019.
- [11] E. B. Hodcroft, “Covariants: Sars-cov-2 mutations and variants of interest.” https://covariants.org/, 2021.
- [12] “Cdc variant classifications.” https://www.cdc.gov/coronavirus/2019-ncov/variants/variant-classifications.html, 2023. Centers for Disease Control and Prevention.
- [13] L. J. Beesley, K. R. Moran, K. Wagh, L. A. Castro, J. Theiler, H. Yoon, W. Fischer, N. W. Hengartner, B. Korber, and S. Y. Del Valle, “Sars-cov-2 variant transition dynamics are associated with vaccination rates, number of co-circulating variants, and convalescent immunity,” eBioMedicine, vol. 91, p. 104534, May 2023.
- [14] “Covid 19 forecast hub us.” https://covid19forecasthub.org/doc/.
- [15] K. Sherratt, H. Gruson, R. Grah, H. Johnson, R. Niehus, B. Prasse, F. Sandmann, J. Deuschel, D. Wolffram, S. Abbott, and et al., “Predictive performance of multi-model ensemble forecasts of covid-19 across european nations,” Apr 2023.
- [16] “The german and polish covid-19 forecasthub.” https://kitmetricslab.github.io/forecasthub/.
- [17] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE transactions on neural networks, vol. 20, no. 1, pp. 61–80, 2008.
- [18] M. R. Davahli, K. Fiok, W. Karwowski, A. M. Aljuaid, and R. Taiar, “Predicting the dynamics of the covid-19 pandemic in the united states using graph theory-based neural networks,” International Journal of Environmental Research and Public Health, vol. 18, p. 3834, Apr 2021.
- [19] J. Gao, R. Sharma, C. Qian, L. M. Glass, J. Spaeder, J. Romberg, J. Sun, and C. Xiao, “STAN: spatio-temporal attention network for pandemic prediction using real-world evidence,” Journal of the American Medical Informatics Association, vol. 28, pp. 733–743, 01 2021.
- [20] G. Panagopoulos, G. Nikolentzos, and M. Vazirgiannis, “Transfer graph neural networks for pandemic forecasting,” 2021.
- [21] S. Ganesan and D. Subramani, “Spatio-temporal predictive modeling framework for infectious disease spread,” Scientific Reports, vol. 11, no. 1, p. 6741, 2021.
- [22] S. Seo, C. Meng, and Y. Liu, “Physics-aware difference graph networks for sparsely-observed dynamics,” in International Conference on Learning Representations, 2019.
- [23] S. Seo and Y. Liu, “Differentiable physics-informed graph networks,” arXiv preprint arXiv:1902.02950, 2019.
- [24] M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” Journal of Computational physics, vol. 378, pp. 686–707, 2019.
- [25] A. Chopra, A. Rodríguez, J. Subramanian, A. Quera-Bofarull, B. Krishnamurthy, B. A. Prakash, and R. Raskar, “Differentiable agent-based epidemiology,” arXiv preprint arXiv:2207.09714, 2022.
- [26] A. Chopra, E. Gel, J. Subramanian, B. Krishnamurthy, S. Romero-Brufau, K. S. Pasupathy, T. C. Kingsley, and R. Raskar, “Deepabm: scalable, efficient and differentiable agent-based simulations via graph neural networks,” arXiv preprint arXiv:2110.04421, 2021.
- [27] S. Deng, S. Wang, H. Rangwala, L. Wang, and Y. Ning, “Graph message passing with cross-location attentions for long-term ili prediction,” arXiv preprint arXiv:1912.10202, 2019.
- [28] A. Kapoor, X. Ben, L. Liu, B. Perozzi, M. Barnes, M. Blais, and S. O’Banion, “Examining covid-19 forecasting using spatio-temporal graph neural networks,” arXiv preprint arXiv:2007.03113, 2020.
- [29] A. Srivastava, “The variations of sikjalpha model for covid-19 forecasting and scenario projections,” Epidemics, vol. 45, p. 100729, 2023.
- [30] J. Chen, C. Gu, Z. Ruan, and M. Tang, “Competition of sars-cov-2 variants on the pandemic transmission dynamics,” Chaos, Solitons; Fractals, vol. 169, p. 113193, Apr. 2023.
- [31] T. Hale, N. Angrist, R. Goldszmidt, B. Kira, A. Petherick, T. Phillips, S. Webster, E. Cameron-Blake, L. Hallas, S. Majumdar, et al., “A global panel database of pandemic policies (oxford covid-19 government response tracker),” Nature human behaviour, vol. 5, no. 4, pp. 529–538, 2021.
- [32] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
- [33] T. Cai, S. Luo, K. Xu, D. He, T.-Y. Liu, and L. Wang, “Graphnorm: A principled approach to accelerating graph neural network training,” 2021.
- [34] S. Khare, C. Gurry, L. Freitas, M. B. Schultz, G. Bach, A. Diallo, N. Akite, J. Ho, R. T. Lee, W. Yeo, G. C. C. Team, and S. Maurer-Stroh, “Gisaid’s role in pandemic response,” China CDC Weekly, vol. 3, p. 1049, 2021.
- [35] I. Aksamentov, C. Roemer, E. B. Hodcroft, and R. A. Neher, “Nextclade: clade assignment, mutation calling and quality control for viral genomes,” Journal of Open Source Software, vol. 6, no. 67, p. 3773, 2021.
- [36] O. Awile, “Country adjacency dataset.” https://github.com/P1sec/country_adjacency, 2013.
- [37] OpenFlights, “OpenFlights dataset.” https://openflights.org/, 2023. Accessed: August 10, 2023.
- [38] “A timeline of covid-19 variants.” https://www.verywellhealth.com/covid-variants-timeline-6741198, 2023.