Travel Demand Forecasting: A Fair AI Approach
Abstract
Artificial Intelligence (AI) and machine learning have been increasingly adopted for travel demand forecasting. Although AI-based travel demand forecasting models generate accurate predictions, they may produce biased predictions and raise fairness issues. Using such biased models for decision-making may lead to transportation policies that exacerbate social inequalities. However, few studies have focused on addressing the fairness issues of these models. Therefore, in this study, we propose a novel methodology to develop fairness-aware, highly accurate travel demand forecasting models. Particularly, the proposed methodology can enhance the fairness of AI models for multiple protected attributes (such as race and income) simultaneously. Specifically, we introduce a new fairness regularization term, which is explicitly designed to measure the correlation between prediction accuracy and multiple protected attributes, into the loss function of the travel demand forecasting model. We conduct two case studies to evaluate the performance of the proposed methodology using real-world ridesourcing-trip data in Chicago, IL and Austin, TX, respectively. Results highlight that our proposed methodology can effectively enhance fairness for multiple protected attributes while preserving prediction accuracy. Additionally, we compare our methodology with three state-of-the-art methods that adopt the regularization-term approach, and the results demonstrate that our approach significantly outperforms them in both preserving prediction accuracy and enhancing fairness. This study provides transportation professionals with a new tool to achieve fair and accurate travel demand forecasting.
keywords:
Fairness, AI, Forecasting, Machine learning, Regularization, Travel demand

1 Introduction
In recent years, Artificial Intelligence (AI) has been increasingly used in travel demand forecasting due to its powerful prediction capability [49, 22]. However, a growing number of studies have reported that AI has evident fairness issues [9, 5, 3, 6, 42, 15, 40], making worse predictions for disadvantaged population groups (e.g., racial and ethnic minorities, low-income individuals, and women) than for advantaged groups. For example, facial recognition systems have higher error rates when classifying darker-skinned individuals and females [15]. Studies in the transportation domain report similar findings. For example, recent research has shown that AI algorithms could underestimate the actual travel demand of disadvantaged groups [51] and deliver much lower prediction accuracy for disadvantaged groups than for advantaged groups [60]. These unfair predictions may negatively impact transportation policies and decision-making (e.g., vehicle rebalancing and traffic control), leading to unintended consequences for transportation equity. Therefore, AI-based travel demand forecasting models should account for both prediction accuracy and fairness [50].
Recently, some researchers have started to develop fairness-aware AI methods in travel behavior modeling, e.g., travel mode choice modeling [60] and travel demand forecasting [51]. However, research on this important topic, especially for travel demand forecasting, is still lacking. For instance, although various methods have been developed to mitigate unfairness issues, very few can be flexibly adopted by different types of models (e.g., linear models, deep learning models with different architectures, etc.). In other words, a systematic framework that addresses a model's fairness issues in a model-agnostic manner (i.e., the method is independent of the model) is still lacking. Also, it remains largely unsolved how to prioritize model fairness while preserving prediction accuracy, both of which are critical to ensuring the trustworthiness of AI [31, 37]. Additionally, previous studies have primarily focused on correcting the unfairness of a single protected attribute. In real-world datasets, however, the debiased model and results could vary across different protected attributes, potentially causing confusion and hindering adoption by end-users. For example, one study found that mitigating the unfairness of one protected attribute (i.e., race) could increase the prediction disparities of another protected attribute (i.e., income) [60]. This suggests that a model that is fair for one protected attribute could still be unfair for other attributes [46]. However, few prior studies have been devoted to simultaneously tackling fairness issues across multiple protected attributes [10, 46].
To address these research gaps, we aim to develop a new methodology to enhance fairness in AI-based travel demand forecasting models. More specifically, we first define Fairness as the Equality of Prediction Accuracy, i.e., the prediction accuracy is equal for advantaged and disadvantaged population groups. Next, we examine the potential unfairness (i.e., prediction accuracy disparity) existing among several state-of-the-art deep learning and statistical models for travel demand forecasting, using real-world ridesourcing-trip data in Chicago, IL and Austin, TX. We propose a novel absolute correlation regularization method to simultaneously correct the detected unfairness across multiple protected attributes (e.g., race, education, etc.). We further compare the proposed methodology with other existing state-of-the-art regularization terms to show its effectiveness in both preserving accuracy and correcting unfairness. The unique contributions of this study are as follows:
1. This study is one of the first to examine the fairness issues of travel demand forecasting models from the algorithmic view. We extend the literature on this topic by detecting the unfairness issues of several commonly-used deep learning and statistical models and proposing a methodology to correct the unfairness.

2. We introduce a novel absolute correlation regularization term to address a model's unfairness arising from multiple protected attributes. This regularization term is explicitly designed to penalize models that produce unfair predictions, which offers notable transparency. Moreover, the proposed regularization term is model-agnostic and can be flexibly incorporated into the loss function of any type of model architecture.

3. We propose to use an interactive weight coefficient for both the accuracy loss and the fairness regularization term. This weight coefficient is tuned simultaneously with other key hyperparameters of an AI model (e.g., the number of hidden layers, the number of hidden neurons, and the learning rate of a multi-layer perceptron model). Therefore, the fairness-aware travel demand forecasting models can optimally improve fairness while preserving prediction accuracy.
The remaining paper is structured as follows: Section 2 reviews the related studies. Section 3 introduces the fairness definitions, metrics and unfairness correction method. We introduce the empirical case studies in Section 4. The modeling results are presented in Section 5. Section 6 discusses the merits of the proposed methodology, echoes the critical findings, proposes some policy implications and lists several future research directions. Finally, Section 7 concludes our study.
2 Literature Review
2.1 AI fairness issues
In recent years, AI methods have been deployed in a broad array of real-world applications due to their outstanding strength in producing highly accurate predictions. However, there has been a growing recognition that, despite their predictive superiority, AI and machine learning techniques have been accompanied by increasing fairness concerns [3]. Studies from multiple fields have reported that AI algorithms can be discriminatory toward disadvantaged population groups in various applications, including healthcare, criminal justice, credit assessment, and translation, among many others [3, 5, 6, 40, 23, 42]. For example, healthcare systems could underestimate the health needs of Black patients relative to White patients, even when both are assigned the same health risk score [40]. If these inherent biases are not addressed, using these AI systems to assist decision-making will worsen existing social disparities [39].
2.1.1 Taxonomy of fairness notions
Numerous fairness notions and corresponding mathematical formulations have been proposed for different downstream learning tasks [39]. These fairness notions span various dimensions, including classification vs. regression, group vs. individual, and disparate treatment [8]. In classification, multiple fairness notions were created to mitigate "disparate impact," i.e., practices or policies that have disproportionately adverse effects on different groups [7]; examples include statistical parity [25], equality of odds, and equality of opportunity [27]. In regression, notions such as the individual/region-based fairness gap [51], cross-pair loss [8], and equal means [16] were introduced to address real-world regression applications with fairness concerns. Fairness notions also branch along the individual versus group axis. Individual fairness requires similar individuals to be treated similarly, while group fairness equalizes the outcome among all groups [25]. Another way to classify fairness notions is by whether disparate treatment is allowed. Disparate treatment measures fairness through treatment rather than outcomes. It addresses both formal classification and intentional discrimination [7], and includes notions like counterfactual fairness [36] and fairness through unawareness [25]. These fairness notions have laid a solid foundation for defining and measuring fairness in real-world problems.
2.1.2 Correcting unfairness for multiple protected attributes
There are three possible ways to achieve the aforementioned fairness, i.e., to correct unfairness. First, pre-processing: transforming the data (e.g., resampling or reweighting) to remove bias before training the models (e.g., [29, 17]). Second, in-processing: modifying the algorithms, such as including a fairness penalty in the loss function [8, 51] or incorporating constraints [1]. Third, post-processing: correcting unfairness by adjusting the learned algorithms [28, 27]. In this study, we selected in-processing techniques due to their transparency (i.e., fairness is directly incorporated into model optimization), their strong capability to achieve fairness even when confronted with biased data [18], and their effectiveness in mitigating bias amplification (i.e., trained models amplifying the biases in the training data) [47].
In-processing methods fall into two categories: implicit methods and explicit methods [46]. Implicit methods debias models by implicitly removing bias from the latent representations. They usually hypothesize that if the latent representations are less biased, the predictions produced from those representations will also be less biased. Implicit methods are commonly used in adversarial learning [56, 48, 52], contrastive learning [19], etc. However, these methods (1) are usually less transparent, since we can hardly interpret how the produced latent representations mitigate (or even remove) the unfairness [43, 24], and (2) usually come with specific model architectures [52]. Explicit methods focus on explicitly modifying the objective function while keeping the model structure intact, for example, by adding fairness-related regularization terms or constraints. Therefore, explicit methods usually afford greater flexibility and can be applied to a wide range of models. Existing explicit methods include the absolute correlation regularization term [9], pairwise fairness loss [8], equal means [16], etc. This study adopts the explicit approach by integrating a fairness-related regularization term into the loss function to jointly account for accuracy and fairness.
Achieving multi-attribute fairness has long been an enduring challenge when using in-processing techniques to mitigate unfairness [46]. To date, most of the existing literature has focused purely on correcting the unfairness of a single protected attribute [8, 56, 30, 1]. However, mitigating the unfairness of one attribute may increase the unfairness of another attribute [60]. This unexpected outcome may confuse end-users (e.g., travel demand modelers) and thus hinder the adoption of fairness-aware models. To tackle this issue, Yan and Howe [51] proposed to explicitly correct the unfairness of multiple attributes by simply adding multiple regularization terms (one for each attribute, with a corresponding weight) into the loss function. However, when the protected attributes are correlated with each other (which is the case for most travel demand forecasting problems), it can be challenging to determine the appropriate weight for each protected attribute in order to achieve the optimal solution that minimizes the unfairness for the combination of the selected protected attributes. Other related methods include learning fair graph embeddings via adversarial learning [10], disentangled representation learning [34], and adding fairness constraints for each protected attribute to achieve fairness via constrained optimization [32, 33]. However, as discussed, these methods are often less transparent and come with specific model architectures, which hinders their adaptability. There is thus a pressing need to develop transparent, effective and flexible methods that can simultaneously account for fairness across multiple protected attributes and can be applied to any model class.
2.2 Addressing AI fairness issues in travel demand forecasting
Recently, transportation researchers have also started to examine and address the fairness concerns of travel demand forecasting models, e.g., Yan and Howe [51] and Yan and Howe [52]. Specifically, Yan and Howe [51] treated fairness as equal mean per capita travel demand across groups over a period of time and evaluated the fairness issues of several AI methods on demand prediction for ridesourcing services and bike-share systems. Results showed that machine learning spontaneously underestimated the travel demand of disadvantaged people. They also proposed two fairness regularization terms and a corresponding fairness-aware demand prediction model to correct the unfairness. Yan and Howe [52] proposed to use an implicit method, which contains fair representations (i.e., EquiTensors) learned by adversarial learning, to forecast the bike-share demand. These fairness-aware models offer transportation professionals new insights on transportation resource allocations and a novel instrument for designing a fairer transportation ecosystem.
However, there are still two critical knowledge gaps that have yet to be addressed. First, prior research has primarily concentrated on equalizing per capita travel demand among different population groups, but travel demand disparities may have already been introduced during the data creation process, which is often beyond our control [60, 21]. For example, multiple studies found that rich people are more likely to use ridesourcing services than the poor [53, 58]. That is, this behavioral bias among different population groups may naturally exist [41]. However, to date, no study has investigated how to appropriately account for this type of bias, especially for travel demand forecasting models. Second, the existing fairness-aware travel demand forecasting methods require particular model structures, which greatly limits their adaptability. Thus, developing a model-agnostic (i.e., independent of the model structure) method that can be flexibly adopted by different types of AI models is promising. To date, however, a systematic, model-agnostic method to address fairness issues, especially for travel demand forecasting problems, is still lacking.
3 Methodology
The methodological framework is outlined as follows. The travel demand forecasting problem will be mathematically defined in Section 3.1. In Section 3.2, we will introduce the fairness metrics used in the proposed methodology, followed by the unfairness correction approach for multiple attributes (in Section 3.3). The notations are summarized in Table 1.
Table 1: Notation.

| Notation | Description |
|---|---|
| **Indices and sets** | |
| $\mathcal{G}$ | graph |
| $\mathcal{V}$ | the set of nodes |
| $\mathcal{T}$ | the set of time |
| $\mathcal{K}$ | the index set of attributes |
| $\mathcal{I}$ | the index set of nodes |
| $\mathcal{A}_k$ | the set of advantaged node indices |
| $\mathcal{D}_k$ | the set of disadvantaged node indices |
| $t$ | timestamp |
| $\lvert\mathcal{A}_k\rvert$ | the size of the set of advantaged node indices |
| $\lvert\mathcal{D}_k\rvert$ | the size of the set of disadvantaged node indices |
| **Parameters** | |
| $T$ | length of input historical sequence |
| $T'$ | length of output sequence |
| $N$ | the number of nodes |
| $\lambda$ | interactive weight coefficient |
| $K$ | the total number of protected attributes |
| **Functions** | |
| $f$ | function of the travel demand forecasting problem |
| $\mathcal{L}$ | overall loss function |
| $\mathcal{L}_0$ | primary loss function for the forecasting model |
| **Variables** | |
| $d_{ij}$ | the distance between node $i$ and node $j$ |
| $w_{ij}$ | the element in the weighted adjacency matrix |
| $W$ | weighted adjacency matrix |
| $s_k^i$ | the value of protected attribute $k$ at node $i$ |
| $s_k$ | the protected attribute $k$ |
| $S$ | the matrix of protected attributes |
| $x_t^i$ | travel demand at node $i$ at time $t$ |
| $\hat{x}_t^i$ | estimated travel demand of node $i$ for time $t$ |
| $X_t$ | ground truth travel demand at time $t$ |
| $\hat{X}_t$ | estimated travel demand at time $t$ |
| $P_t$ | the prediction accuracy at time $t$ |
| $p_t^i$ | the prediction accuracy of node $i$ at time $t$ |
| $\bar{P}_t$ | the expectation of prediction accuracy |
| $\bar{s}_k$ | the expectation of protected attribute $k$ |
3.1 Travel demand forecasting problem
The goal of travel demand forecasting is to predict the future travel demand for each area (or other spatial unit such as a traffic segment) given previously observed time-series data. We consider the transportation network as a weighted directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, W)$, where $\mathcal{V}$ is a set of nodes (i.e., areas or traffic segments) with $\lvert\mathcal{V}\rvert = N$; $\mathcal{E}$ is a set of edges representing the connectivity between two nodes; and $W \in \mathbb{R}^{N \times N}$ is a weighted adjacency matrix representing the nodes' proximity (e.g., distance or functional similarity). Given the weighted directed graph $\mathcal{G}$ with $N$ nodes, we assume time is a discrete variable where $\mathcal{T}$ is a set containing all possible timestamps. Let $X_t = \{x_t^i\}_{i \in \mathcal{I}}$ represent the travel demand at time $t$, where $\mathcal{I}$ is the index set of nodes and $x_t^i$ is the travel demand corresponding to node $i$ at time $t$, and let $[X_{t-T+1}, \ldots, X_t]$ be the historical travel demand before $t+1$. The travel demand forecasting problem can be formulated as learning a function $f$ which maps the historical travel demand to the travel demand of the next time intervals for all nodes in the given graph $\mathcal{G}$. Let $[\hat{X}_{t+1}, \ldots, \hat{X}_{t+T'}]$ denote the predicted travel demand for the next $T'$ time intervals starting from timestamp $t$, where $\hat{X}_{\tau}$ refers to the predicted travel demand at timestamp $\tau$ for all nodes; then we can mathematically write:

$$[\hat{X}_{t+1}, \ldots, \hat{X}_{t+T'}] = f\left(\mathcal{G};\, [X_{t-T+1}, \ldots, X_t]\right) \tag{1}$$
3.2 Fairness in travel demand forecasting models
This study defines Fairness as the equality of prediction accuracy. Intuitively, we assume that the travel demand prediction accuracy should be independent of the protected attributes. Taking racial composition as an example, equality of prediction accuracy suggests that the prediction accuracy for any racial group should be equal. In this study, we use the Absolute Percentage Error (APE) to measure predictive accuracy instead of the Mean Absolute Error (MAE) or Root Mean Square Error (RMSE). We believe the magnitude of the travel demand (especially for emerging mobility) in an advantaged community (e.g., a high-income community) is naturally greater than that in a disadvantaged community [14]. This type of behavioral bias is largely introduced during the data creation process rather than by the algorithm [41, 21]. If we quantify the equality of prediction accuracy with MAE or RMSE, which are sensitive to the magnitude of the forecasting outcome, machine learning may replicate, or even reinforce and potentially exacerbate, existing biases. Instead, APE scales out the magnitude and cancels the behavioral bias that has already been embedded in the data.
Recall from the previous section that a travel demand forecasting model learns a function $f$ which takes historical travel demand as input and predicts the travel demand of the next time intervals starting from time $t$, i.e., Eq. (1). We define $P_t = \{p_t^i\}_{i \in \mathcal{I}}$ to indicate the prediction accuracy (i.e., APE) at time $t$, where $p_t^i$ is the prediction accuracy of node $i$ at time $t$. Specifically,

$$p_t^i = \frac{\left\lvert x_t^i - \hat{x}_t^i \right\rvert}{x_t^i},$$

where $x_t^i$ and $\hat{x}_t^i$ are the ground truth and predicted value of node $i$ at time $t$, respectively, and $p_t^i$ is the absolute percentage error for node $i$ at time $t$. The lower the value of $p_t^i$, the better the predictive performance.
Suppose $S = [s_1, \ldots, s_K]$ is the matrix of protected attributes of interest, where $\mathcal{K}$ is the index set of attributes and $K$ is the total number of protected attributes; $s_k$ represents the protected attribute $k$, and $s_k^i$ denotes the value of protected attribute $k$ at node $i$, with $\mathcal{I}$ being the index set of nodes. Denote $a_k^i$ as a binary indicator of whether node $i$ belongs to the advantaged (i.e., $a_k^i = 1$) or disadvantaged (i.e., $a_k^i = 0$) group for protected attribute $k$, and accordingly let $\mathcal{A}_k$ and $\mathcal{D}_k$ represent the sets of advantaged and disadvantaged node indices for demographic attribute $k$, with sizes $\lvert\mathcal{A}_k\rvert$ and $\lvert\mathcal{D}_k\rvert$, respectively. We note that assigning a value to $a_k^i$, i.e., determining whether each node should be labeled as advantaged or disadvantaged, is context-specific. This determination could be guided by criteria or statistics defined by the local government [51]. Subsequently, Equality of Prediction Accuracy is defined as:

$$\mathbb{E}\left[p^i \mid a_k^i = 1\right] = \mathbb{E}\left[p^i \mid a_k^i = 0\right], \quad \forall k \in \mathcal{K},$$

where $\mathbb{E}[p^i \mid a_k^i = 1]$ and $\mathbb{E}[p^i \mid a_k^i = 0]$ are the conditional expectations of prediction accuracy given the advantaged and disadvantaged labels, and represent the mean APE of the advantaged group and the disadvantaged group, respectively. That is, for any protected attribute $k$, a fair model should have equal prediction accuracy for different groups. Moreover, once a forecasting model is built, we can measure the model's fairness by quantifying prediction accuracy disparities, especially between nodes with different labels, for instance, low-income communities and high-income communities.
In this study, we introduce the Prediction Accuracy Gap (PAG) as a fairness metric to measure the prediction accuracy disparity and to determine whether fairness is achieved or unfairness occurs. Define:

$$\mathrm{PAG}_k = \frac{1}{\lvert\mathcal{D}_k\rvert} \sum_{i \in \mathcal{D}_k} p^i \;-\; \frac{1}{\lvert\mathcal{A}_k\rvert} \sum_{i \in \mathcal{A}_k} p^i \tag{2}$$
Intuitively speaking, PAG directly measures the prediction accuracy disparity between these two types of nodes. A high value of PAG indicates that the machine learning model delivers inconsistent predictive performance among nodes; in most cases, the performance is worse in disadvantaged nodes.
In this study, we also use Correlation Coefficient as another fairness metric. The correlation coefficient can naturally measure the extent to which the predictions are biased on specific protected groups. Intuitively, if fairness is achieved, correlation between prediction accuracy and any protected attribute should be zero. By using correlation coefficient as a measure of fairness, we assume that the target variable (i.e., prediction accuracy) is linearly correlated with the independent variable (i.e., protected attribute).
Recall from the discussion above that $P_t$ is the prediction accuracy (APE) at time $t$, and $s_k$ refers to the protected attribute $k$ for all nodes. Then, the correlation between prediction accuracy and protected attribute $k$ across all nodes is denoted by $\mathrm{Corr}(P_t, s_k)$. Define:

$$\mathrm{Corr}(P_t, s_k) = \frac{\sum_{i \in \mathcal{I}} \left(p_t^i - \bar{P}_t\right)\left(s_k^i - \bar{s}_k\right)}{\sqrt{\sum_{i \in \mathcal{I}} \left(p_t^i - \bar{P}_t\right)^2}\,\sqrt{\sum_{i \in \mathcal{I}} \left(s_k^i - \bar{s}_k\right)^2} + \epsilon} \tag{3}$$

where $\bar{P}_t = \frac{1}{N}\sum_{i \in \mathcal{I}} p_t^i$ and $\bar{s}_k = \frac{1}{N}\sum_{i \in \mathcal{I}} s_k^i$. In our experiments, we add a small $\epsilon$ to the denominator to keep it always positive. Although the correlation coefficient does not require a label for each region, we cannot directly read the prediction accuracy disparity from it.
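To make these definitions concrete, the following minimal NumPy sketch shows how node-level APE, PAG (Eq. 2), and the correlation metric (Eq. 3) could be computed for a single time step; the function names, the `advantaged_mask` labeling input, and the small `eps` guard against zero demand are illustrative assumptions rather than the authors' reference implementation.

```python
import numpy as np

def ape(y_true, y_pred, eps=1e-8):
    """Node-level absolute percentage error; eps guards against zero demand."""
    return np.abs(y_true - y_pred) / (y_true + eps)

def prediction_accuracy_gap(p, advantaged_mask):
    """PAG (Eq. 2): mean APE of disadvantaged nodes minus mean APE of advantaged nodes."""
    return p[~advantaged_mask].mean() - p[advantaged_mask].mean()

def correlation(p, s, eps=1e-8):
    """Pearson correlation between node-level APE and a protected attribute (Eq. 3)."""
    p_c, s_c = p - p.mean(), s - s.mean()
    return (p_c * s_c).sum() / (np.sqrt((p_c ** 2).sum()) * np.sqrt((s_c ** 2).sum()) + eps)
```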
3.3 Unfairness correction method for travel demand forecasting models
In this study, we introduce an absolute correlation regularization approach, which adapts the efforts from Beutel et al. [9], to mitigate the prediction accuracy disparities existing among groups. In Beutel et al. [9], the authors applied this approach to a classification problem by minimizing the false positive rate (FPR) gap between groups. We generalize this approach to a regression setting (i.e., travel demand forecasting problem) by minimizing the prediction accuracy disparities among different communities.
More importantly, most previous studies, including Beutel et al. [9], have primarily focused on correcting the unfairness of one single attribute. In real-world datasets, however, the debiased model and results could differ across protected attributes. Also, a model that is fair for one protected attribute could still be unfair for other attributes [46, 60]. One feasible solution to this issue is to consider multiple attributes at the same time when correcting the unfairness of the models. We expect that a fair model should produce fair predictions for all types of attributes instead of focusing solely on one.
Therefore, we propose a methodology that can correct the unfairness for multiple protected attributes. More specifically, we propose to use the Multiple Correlation Coefficient [4], denoted as $R$, to measure the correlation between the target variable, i.e., prediction accuracy, and a set of protected attributes (including race, education, age and income). A larger $R$ suggests that a stronger dependence may exist between the target variable and the explanatory variables. We expect that a fair prediction should lead to $R = 0$, or at least a small value. Accordingly, we will use $R$ as the regularization term in the loss function to account for the fairness loss. We should note that the underlying linear model may encounter potential multicollinearity concerns. However, there is no need to address them, since the goal of the linear model is forecasting rather than estimating the coefficients [45].
Recall from the previous subsections that we use the prediction accuracy $P$ as the target variable and $S$ to represent the matrix of the multiple protected attributes of interest, and that $\mathrm{Corr}(P, s_k)$ indicates the correlation between prediction accuracy and protected attribute $k$ across all nodes. Given these notations, we can naturally write the vector of correlations between each protected attribute and the prediction accuracy, i.e.,

$$\mathbf{c} = \left[\mathrm{Corr}(P, s_1),\, \mathrm{Corr}(P, s_2),\, \ldots,\, \mathrm{Corr}(P, s_K)\right]^\top,$$

and the correlation matrix calculated by the correlation coefficient among each pair of protected attributes, denoted as $R_{ss}$, i.e.,

$$R_{ss} = \begin{bmatrix} \mathrm{Corr}(s_1, s_1) & \cdots & \mathrm{Corr}(s_1, s_K) \\ \vdots & \ddots & \vdots \\ \mathrm{Corr}(s_K, s_1) & \cdots & \mathrm{Corr}(s_K, s_K) \end{bmatrix}.$$

Consequently, the multiple correlation coefficient between $P$ and $S$, i.e., $R$, which is the square root of the coefficient of determination (i.e., $R^2$) of the linear model [2], can be written as:

$$R = \sqrt{\mathbf{c}^\top R_{ss}^{-1} \mathbf{c}} \tag{4}$$

where $\mathbf{c}^\top$ is the transpose of $\mathbf{c}$ and $R_{ss}^{-1}$ is the inverse matrix of $R_{ss}$.
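A differentiable version of Eq. (4) can be dropped into a training loop. Below is a minimal PyTorch sketch under the assumption that node-level APE and the protected-attribute matrix arrive as tensors; the small ridge term used to keep the matrix inversion numerically stable is an implementation choice for illustration, not part of the paper's formulation.

```python
import torch

def pearson_corr(x, y, eps=1e-8):
    # Pearson correlation between two 1-D tensors.
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / (xc.norm() * yc.norm() + eps)

def multiple_correlation(ape, S, ridge=1e-6):
    """Multiple correlation coefficient R (Eq. 4) between node-level APE and
    the protected-attribute matrix S of shape (num_nodes, num_attributes)."""
    K = S.shape[1]
    # Vector of correlations between APE and each protected attribute.
    c = torch.stack([pearson_corr(ape, S[:, k]) for k in range(K)])
    # Correlation matrix among the protected attributes themselves.
    R_ss = torch.stack([torch.stack([pearson_corr(S[:, j], S[:, k]) for k in range(K)])
                        for j in range(K)])
    R_ss = R_ss + ridge * torch.eye(K, dtype=R_ss.dtype)  # stabilize the inversion
    r_squared = c @ torch.linalg.solve(R_ss, c)            # c^T R_ss^{-1} c
    return torch.sqrt(torch.clamp(r_squared, min=0.0))
```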
Accordingly, given the graph $\mathcal{G}$ and a forecasting model $f$, we add the multiple correlation coefficient $R$ into the loss function, denoted as $\mathcal{L}$, as shown in Eq. (5). In this way, the model will simultaneously account for the unfairness issues arising from multiple protected attributes. Let $[X_{t+1}, \ldots, X_{t+T'}]$ denote the ground truth travel demand of the next $T'$ time intervals starting from $t$; mathematically, the loss function of the forecasting model to be minimized, i.e., $\mathcal{L}$, is written as:

$$\mathcal{L} = (1 - \lambda)\, \mathcal{L}_0\!\left([\hat{X}_{t+1}, \ldots, \hat{X}_{t+T'}],\, [X_{t+1}, \ldots, X_{t+T'}]\right) + \lambda R \tag{5}$$

and,

$$\mathcal{L}_0 = \frac{1}{N T'} \sum_{\tau = t+1}^{t+T'} \sum_{i \in \mathcal{I}} \left(x_\tau^i - \hat{x}_\tau^i\right)^2 \tag{6}$$

In the above equations, $x_\tau^i$ and $\hat{x}_\tau^i$ refer to the ground truth and predicted travel demand for node $i$ at time $\tau$, respectively; $\mathcal{L}_0$ is the primary loss function for the forecasting model, and in this study we use the mean squared error (MSE) for $\mathcal{L}_0$; $\lambda$ is the interactive weight coefficient that controls the weight between the prediction loss and the fairness regularization term. When $\lambda = 0$, the model is unaware of fairness; when $\lambda = 1$, the model focuses completely on correcting the unfairness. We can directly treat $\lambda$ as a hyperparameter to find the optimal model that effectively addresses fairness while preserving accuracy. The prediction accuracy disparity is captured and mitigated by the correlation regularization term $R$ in Eq. (4). The regularization term is dedicated to shrinking the potential prediction accuracy disparity that exists among groups toward zero. Incorporating it into the loss function enables the machine learning model to automatically keep track of fairness during training.
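Putting Eqs. (4)-(6) together, one possible shape of the training objective is sketched below in PyTorch. The function name `fairness_aware_loss`, the default weight `lam=0.05`, the single-step tensors, and the reuse of `multiple_correlation` from the previous sketch are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def fairness_aware_loss(pred, target, S, lam=0.05, eps=1e-8):
    """Eq. (5): (1 - lambda) * MSE + lambda * multiple correlation coefficient R.
    pred, target: (num_nodes,) demand tensors for one forecast step;
    S: (num_nodes, num_attributes) protected-attribute matrix."""
    mse = F.mse_loss(pred, target)                    # primary accuracy loss (Eq. 6)
    ape = (pred - target).abs() / (target + eps)      # node-level APE used by the regularizer
    R = multiple_correlation(ape, S)                  # regularizer from the previous sketch (Eq. 4)
    return (1.0 - lam) * mse + lam * R
```

Because every step is differentiable, the regularizer receives gradients during backpropagation and the weight `lam` can be tuned like any other hyperparameter.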
4 Case Study
In this section, we describe the two real-world ridesourcing-trip datasets and the seven commonly-used travel demand forecasting models used for the case studies. Section 4.1 and Section 4.2 present the data collection and processing. Table 2 presents the descriptive statistics of all input variables. Fig. 1 in Appendix A displays the spatial distribution of the average ridesourcing demand per hour. We briefly introduce the selected deep learning and statistical models for unfairness detection and correction in Section 4.3.
Table 2: Descriptive statistics of the input variables.

| Variable | Chicago Min | Chicago Max | Chicago Mean | Chicago St. Dev. | Austin Min | Austin Max | Austin Mean | Austin St. Dev. |
|---|---|---|---|---|---|---|---|---|
| **Target variable** | | | | | | | | |
| Ridesourcing hourly demand | 0.00 | 2150.00 | 12.01 | 41.26 | 0.00 | 820.00 | 1.41 | 8.81 |
| **Demographic characteristics** | | | | | | | | |
| Race: percentage of white population | 0.00 | 0.97 | 0.46 | 0.33 | 0.44 | 0.99 | 0.76 | 0.13 |
| Edu: percentage of population with a bachelor's degree or above | 0.01 | 0.95 | 0.35 | 0.26 | 0.06 | 0.92 | 0.49 | 0.22 |
| Age: percentage of young population (aged 18-44) | 0.21 | 0.90 | 0.44 | 0.12 | 0.14 | 0.98 | 0.48 | 0.14 |
| Income: percentage of low-income households | 0.02 | 0.84 | 0.29 | 0.16 | 0.01 | 0.84 | 0.18 | 0.12 |
4.1 Chicago ridesourcing-trip data
In this study, we collected the publicly available ridesourcing-trip data from the Chicago Data Portal (https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Trips-2018-2022-/m6dm-c72p/explore) for the case study. The data span November 1, 2018 to March 31, 2019 and contain 45,338,599 trips. The dataset includes many attributes, but only pick-up locations and timestamps are considered in this research. Since we focus on trip generation (i.e., origin demand) forecasting, all trips are aggregated at the census-tract level and counted hourly. We prepared the data for modeling in the same way as previous studies [58] to account for missing-data issues and outliers. The data preparation process produced trip generation data for 711 census tracts. We used the first 70% of the data for training, the following 10% for validation and the remainder for testing. The census-tract-level demographic data (i.e., protected attributes) were collected from the American Community Survey 2013-2017 5-year estimates, including the percentage of white population, the percentage of low-income households, the percentage of population with a bachelor's degree or above, and the percentage of young population (aged 18-44).
4.2 Austin ridesourcing-trip dataset
This study also collected ridesourcing-trip data from RideAustin (https://data.world/ride-austin/ride-austin-june-6-april-13) for the second case study. The data range from October 1, 2016 to April 13, 2017, including 1,259,574 trips in total. Similar to the Chicago case study, we only retained pick-up locations and the corresponding timestamps for the empirical analysis. All ridesourcing trips were aggregated at the census-tract level on an hourly basis. The prepared dataset includes 191 census tracts. The first 70% of the data was used for model training, followed by 10% for validation and 20% for testing. Four protected attributes, including the percentage of white population, the percentage of low-income households, the percentage of population with a bachelor's degree or above, and the percentage of young population (aged 18-44), were also collected from the American Community Survey 2013-2017 5-year estimates.
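For both datasets, the hourly census-tract aggregation and the chronological 70/10/20 split described above could be prepared along the lines of the pandas sketch below; the file name and column names (`pickup_time`, `pickup_census_tract`) are hypothetical placeholders, since the two portals use their own field names.

```python
import pandas as pd

# Hypothetical file and column names; the raw portals use different field names.
trips = pd.read_csv("rideshare_trips.csv", parse_dates=["pickup_time"])
hourly = (trips
          .assign(hour=trips["pickup_time"].dt.floor("h"))
          .groupby(["pickup_census_tract", "hour"])
          .size()
          .unstack(level=0, fill_value=0))          # rows: hours, columns: census tracts

n = len(hourly)
train = hourly.iloc[: int(0.7 * n)]                 # first 70% for training
val = hourly.iloc[int(0.7 * n): int(0.8 * n)]       # next 10% for validation
test = hourly.iloc[int(0.8 * n):]                   # remaining 20% for testing
```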
4.3 Models
In this study, we applied seven models as the major baseline models to measure the fairness metrics and perform the bias mitigation. We also compared their performance with the historical average method. All models are detailed as follows:
1. Historical Average (HA): We calculate the historical average travel demand as the mean of all observations in the input sequence.

2. Multivariate Linear Regression (MLR): MLR is frequently used in machine learning studies as a benchmark model. This study treats the observation at each timestamp as a covariate.

3. Autoregressive Integrated Moving Average (ARIMA): ARIMA is one of the most fundamental statistical models for forecasting time-series data [38]. ARIMA consists of three basic parts: an auto-regressive part, a first-differencing part and a moving-average part. The order of the auto-regressive part (p), the order of the moving-average part (q) and the degree of first-differencing (d) must be prespecified before building the model. In this study, we built an ARIMA model to predict the travel demand for all areas at once.

4. Multi-Layer Perceptron (MLP): MLP is a commonly-used deep neural network model. In this study, the architecture is set to 1 hidden layer with 300 hidden linear neurons. A dropout layer with rate 0.01 is placed after the hidden layer to avoid overfitting.

5. Gated Recurrent Unit (GRU): GRU is a widely-adopted Recurrent Neural Network (RNN) model with gated hidden neurons [20]. GRU generates the predicted travel demand from the hidden state at the previous timestamp and the travel demand at the current timestamp. In this way, GRU can dynamically capture the travel demand information at the current timestamp while maintaining the historical demand trend. We use the GRU model to forecast the travel demand for all nodes at once.

6. Temporal Graph Convolutional Network (T-GCN): T-GCN can capture spatial dependency and temporal information at the same time [59]. Specifically, the spatial dependency is calibrated by a spatial adjacency matrix, where 1 indicates that two nodes are spatially adjacent and 0 otherwise. T-GCN takes the hidden state at the previous timestamp and the graph-convolution-processed travel demand information at the current timestamp as input. Therefore, T-GCN can effectively deal with data that have strong spatial dependency, such as traffic speed data.

7. Convolutional Long Short-Term Memory (ConvLSTM): ConvLSTM is one of the most novel approaches for spatio-temporal forecasting problems [44]. ConvLSTM has a convolutional structure in both the input-to-state and state-to-state transitions; it determines a cell's future state by considering the inputs and past states of its local neighbors. This characteristic gives it greater strength in handling spatio-temporal correlations. In this study, the convolutional kernel size of the ConvLSTM is set to 5.

8. Spatio-Temporal Graph Convolutional Network (STGCN): STGCN is an effective approach for spatio-temporal traffic flow forecasting [57]. STGCN consists of several spatio-temporal convolution (ST-Conv) blocks. Each block has a "sandwich"-like structure: two gated sequential convolution layers with one spatial graph convolution layer in between. This allows STGCN to distill the most useful spatial features and capture the most essential temporal features collectively. In this study, we set the number of ST-Conv blocks to 2. Let $d_{ij}$ denote the distance between node $i$ and node $j$; the element of the weighted adjacency matrix, i.e., $w_{ij}$, is given by (a sketch of this construction follows the list):

$$w_{ij} = \begin{cases} \exp\!\left(-\dfrac{d_{ij}^2}{\sigma^2}\right), & \text{if } i \neq j \text{ and } \exp\!\left(-\dfrac{d_{ij}^2}{\sigma^2}\right) \geq \epsilon \\[4pt] 0, & \text{otherwise,} \end{cases} \tag{7}$$

where $\sigma^2$ and $\epsilon$ are thresholds that control the sparsity of $W$.
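As referenced above, a small NumPy sketch of the thresholded Gaussian kernel in Eq. (7) is given below; the example values of `sigma2` and `eps_thresh` are placeholders for illustration only and are not the settings used in the paper.

```python
import numpy as np

def weighted_adjacency(dist, sigma2=10.0, eps_thresh=0.5):
    """Build W from pairwise distances `dist` (N x N) via Eq. (7):
    a Gaussian kernel, zeroed on the diagonal and below the sparsity threshold.
    sigma2 and eps_thresh are placeholder values, not the paper's settings."""
    w = np.exp(-(dist ** 2) / sigma2)
    w[w < eps_thresh] = 0.0          # enforce sparsity
    np.fill_diagonal(w, 0.0)         # no self-loops (i != j)
    return w
```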
5 Results
This section sequentially reports the modeling results of all benchmark models, the evaluation of their underlying fairness issues, and the results after applying our proposed unfairness correction approach. We conducted empirical experiments using real-world ridesourcing-trip data in Chicago, IL and Austin, TX. The analytical spatial unit is the census tract. We incorporate the regularization term into the loss function of every model. All experiments were completed in a PyTorch environment using an Ampere A100 GPU. We tuned hyperparameters such as batch size and sequence length under each fairness weight using grid search. We built our models with the Adam optimizer [35]. Early stopping is also adopted to avoid overfitting. In this study, we use percentile statistics of the protected attributes as thresholds to determine the label (i.e., advantaged or disadvantaged) of each node (e.g., census tract). For instance, for the white population percentage attribute, nodes with a white population percentage above the chosen percentile are labeled as advantaged.
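As one concrete (and assumed) way to operationalize this labeling, the sketch below marks a census tract as advantaged when its attribute value exceeds a chosen percentile of that attribute across all tracts; the `percentile` argument is a placeholder, since the exact threshold statistic is context-specific.

```python
import numpy as np

def label_advantaged(attr_values, percentile=50):
    """Return a boolean mask: True where the node's attribute value exceeds
    the given percentile of that attribute across all nodes (placeholder rule)."""
    threshold = np.percentile(attr_values, percentile)
    return attr_values > threshold
```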
5.1 Unfairness detection
The predictive performance and two fairness metrics (i.e., correlation [Corr] and prediction accuracy gap [PAG]) of all models with respect to four protected attributes are presented in Table 3 and Table 4.
Table 3: Predictive performance and fairness metrics of all models on the Chicago ridesourcing-trip data.

| Models | MAE | RMSE | Race Corr | Race PAG (%) | Edu Corr | Edu PAG (%) | Age Corr | Age PAG (%) | Income Corr | Income PAG (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| HA | 7.703 | 27.630 | 0.058 | -32.452 | 0.072 | -54.277 | 0.048 | -50.220 | -0.031 | -26.303 |
| MLR | 5.535 | 12.973 | -0.048 | 8.040 | -0.084 | 4.261 | -0.092 | 3.862 | 0.049 | 6.667 |
| ARIMA | 4.541 | 12.259 | -0.055 | 4.908 | -0.122 | 9.890 | -0.128 | 11.079 | 0.057 | 3.828 |
| MLP | 3.918 | 10.147 | -0.054 | 4.453 | -0.128 | 9.487 | -0.133 | 10.611 | 0.060 | 3.671 |
| GRU | 3.715 | 9.069 | -0.058 | 6.293 | -0.143 | 15.161 | -0.128 | 13.432 | 0.071 | 6.367 |
| T-GCN | 4.705 | 9.993 | -0.093 | 23.672 | -0.148 | 31.644 | -0.153 | 34.748 | 0.095 | 19.175 |
| STGCN | 3.012 | 8.539 | -0.122 | 9.679 | -0.279 | 21.363 | -0.285 | 24.484 | 0.129 | 8.182 |
| ConvLSTM | 3.246 | 8.176 | -0.075 | 8.958 | -0.144 | 13.063 | -0.150 | 14.696 | 0.104 | 9.392 |

Notes: Corr represents correlation. All correlations are statistically significant at the 1% level.
Table 4: Predictive performance and fairness metrics of all models on the Austin ridesourcing-trip data.

| Models | MAE | RMSE | Race Corr | Race PAG (%) | Edu Corr | Edu PAG (%) | Age Corr | Age PAG (%) | Income Corr | Income PAG (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| HA | 1.655 | 8.538 | -0.030 | 5.116 | -0.106 | 6.073 | -0.008 | -8.881 | 0.026 | 6.170 |
| MLR | 1.324 | 4.280 | -0.008 | 0.572 | -0.041 | 4.118 | -0.056 | 6.561 | -0.029 | -0.784 |
| ARIMA | 1.335 | 4.695 | -0.008 | 0.316 | -0.048 | 4.940 | -0.078 | 8.578 | -0.047 | -1.758 |
| MLP | 1.297 | 4.163 | -0.008 | 0.467 | -0.048 | 5.030 | -0.074 | 8.156 | -0.042 | -1.073 |
| GRU | 1.064 | 3.654 | -0.044 | 2.166 | -0.136 | 10.639 | -0.110 | 10.648 | -0.026 | -0.307 |
| T-GCN | 1.357 | 3.911 | 0.010 | -1.983 | -0.049 | 5.627 | -0.098 | 12.801 | -0.087 | -6.796 |
| STGCN | 1.042 | 4.064 | -0.034 | 1.234 | -0.178 | 10.933 | -0.179 | 11.614 | -0.081 | -1.013 |
| ConvLSTM | 1.057 | 3.162 | 0.000 | -0.391 | -0.088 | 5.289 | -0.080 | 4.642 | -0.024 | 0.888 |

Notes: Corr represents correlation. All correlations are statistically significant at the 1% level.
We show the predictive performance of each benchmark on the Chicago ridesourcing-trip data (Table 3) and the Austin ridesourcing-trip data (Table 4).

Regarding prediction accuracy, all benchmark models show a similar trend across the two case studies. The overall performance ranking is ConvLSTM, STGCN, GRU, T-GCN, MLP, ARIMA, and HA (from best to worst), indicating that prediction accuracy gradually increases as the model becomes more complex. The two convolution models, i.e., STGCN and ConvLSTM, are the best-performing among all models. Both STGCN and ConvLSTM can incorporate spatial and temporal information through the convolution blocks, which enhances their prediction power. Among the two RNN-based models, GRU outperformed T-GCN in both MAE and RMSE. MLP, due to its simple model architecture, underperformed the other neural network-based models. Compared with deep neural networks, the traditional statistical models, i.e., MLR and ARIMA, have relatively low prediction accuracy, although they still significantly outperformed HA. MLR and ARIMA both have a prespecified (linear) model structure and cannot capture the nonlinearity between the inputs and the target variable, which restricts their predictive capability.
Regarding fairness issues, for the Chicago ridesourcing-trip data, Table 3 shows that HA exhibits completely inverse relationships in correlation and gap compared with the other models. Since HA has the worst predictive performance, the corresponding fairness metrics could be unreliable. The results illustrate that both statistical and deep learning models have evident fairness issues. The protected attributes of race, education and age are negatively correlated with the prediction error, which means that communities with a high proportion of white population, a high education-attainment rate and more young people have higher prediction accuracy. Income level is positively related to the prediction error, indicating that communities with more low-income households may have higher prediction error. In terms of magnitude, we found that education and age have the largest correlations with prediction accuracy, followed by income and race. Although there are variations in the magnitude of the correlations, the signs for all protected attributes are consistent across all models except HA. In addition to the correlations, we also explored the PAG between the advantaged and disadvantaged groups. Table 3 shows that all gaps have a positive value (except for HA), indicating that the prediction error for minority groups is higher than for advantaged groups. Additionally, the prediction accuracy disparity is more pronounced for education and age than for race and income.
For the Austin ridesourcing-trip data, all benchmark models demonstrate similar performance (in both trend and direction of association) compared with the Chicago dataset. However, the results show that the fairness issues are relatively subdued in the Austin dataset. In other words, the extent of unfairness (as shown by the correlation coefficient and PAG) is notably diminished in comparison to the Chicago dataset. Notably, Table 4 shows that prediction accuracy is less biased regarding race and income. The two best-performing models (i.e., STGCN and ConvLSTM) produce reasonably fair predictions. For example, the correlation between prediction accuracy and race delivered by ConvLSTM is 0.000 and the PAG regarding race is only -0.391%. This evidence indicates that the unfairness in prediction accuracy should be of little concern for this protected attribute.
5.2 Unfairness correction
We tuned a set of values of λ (i.e., the weight for the fairness loss) by grid search to validate the effectiveness of the proposed unfairness correction method. Table 5 and Table 6 present the results of simultaneously mitigating the unfairness issues for multiple protected attributes across the two case studies. We only present the best λ (i.e., the one that significantly improves fairness while largely preserving prediction accuracy) from the empirical experiments. We also add the experimental results of correcting the unfairness of only a single attribute at the bottom of each table for comparison. For the sensitivity analysis of λ, please refer to Section 5.3. As discussed in the previous section, only very limited prediction accuracy disparities are detected for race (percentage of white population) and income (percentage of low-income households) in the Austin case study (as shown in Table 4). Thus, we decided to correct the unfairness of prediction accuracy only for education (percentage of bachelor holders) and age (percentage of young population) in this case.
Table 5: Unfairness correction results on the Chicago dataset. For each model, the table reports the RMSE and, for each protected attribute (race, education, age and income), the correlation (Corr) and the prediction accuracy gap (PAG, %) after debiasing; each metric is accompanied, in brackets, by its percentage change in absolute value relative to the fairness-unaware model (a positive value indicates an improvement, a negative value a reduction). In the multi-attribute setting (debiasing the four selected attributes), the best fairness weights λ are 0.025 (MLR), 0.025 (ARIMA), 0.025 (MLP), 0.05 (GRU), 0.075 (T-GCN), 0.025 (STGCN) and 0.025 (ConvLSTM). In the single-attribute setting (debiasing income only), the best λ values are 0.1 (MLR), 0.025 (ARIMA), 0.025 (MLP), 0.025 (GRU), 0.025 (T-GCN), 0.2 (STGCN) and 0.025 (ConvLSTM).

Notes: Corr represents correlation. PAG refers to the prediction accuracy gap. The bracketed value is the percentage change of the metric in absolute value, computed from the initial value obtained from the fairness-unaware model and the final value from the fairness-aware model.
Table 6: Unfairness correction results on the Austin dataset. For each model, the table reports the RMSE and, for education and age, the correlation (Corr) and the prediction accuracy gap (PAG, %) after debiasing; each metric is accompanied, in brackets, by its percentage change in absolute value relative to the fairness-unaware model (a positive value indicates an improvement, a negative value a reduction). In the multi-attribute setting (debiasing the two selected attributes), the best fairness weights λ are 0.025 (MLR), 0.025 (ARIMA), 0.5 (MLP), 0.05 (GRU), 0.4 (T-GCN), 0.05 (STGCN) and 0.025 (ConvLSTM). In the single-attribute setting (debiasing age only), the best λ values are 0.075 (MLR), 0.05 (ARIMA), 0.1 (MLP), 0.025 (GRU), 0.025 (T-GCN), 0.05 (STGCN) and 0.025 (ConvLSTM).

Notes: Corr represents correlation. PAG refers to the prediction accuracy gap. The bracketed value is the percentage change of the metric in absolute value, computed from the initial value obtained from the fairness-unaware model and the final value from the fairness-aware model.
There are several key findings to highlight. First, the results of the multi-attribute scenario show great consistency across the two datasets. Table 5 and Table 6 show that in almost all trials, incorporating a small fairness weight can significantly reduce the absolute values of the correlation and PAG across all protected attributes. For example, in the Chicago dataset, incorporating only a 0.050 fairness weight for GRU leads to 85.142%, 86.386%, 92.004% and 94.927% reductions in the absolute values of the PAG for race, education, age and income, respectively. In the meantime, the correlations between prediction accuracy and the protected attributes also improve by more than 60%, while RMSE increases by only 2.062%. In the Austin dataset, setting λ to 0.025 for ConvLSTM yields 68.973% and 88.496% PAG shrinkage for education and age at the cost of only a 5.661% increase in RMSE (from 3.162 to 3.341).
Second, the effects of the proposed unfairness correction method vary across models and protected attributes. For example, Table 5 shows that when mitigating the income bias, setting λ to 0.025 reduces only 54.923% of the PAG in absolute value for STGCN, while for ConvLSTM the same setting leads to a 95.081% reduction. In addition, the case study on the Chicago ridesourcing-trip data reveals that, compared with education and age, the absolute values of PAG for race and income are more likely to be largely reduced (i.e., to less than 1%).
Third, by choosing an appropriate λ, both fairness and accuracy can be improved at the same time. Taking the Chicago dataset as an example, adding a 0.025 fairness weight to STGCN simultaneously reduces the absolute values of the PAG and correlations for all protected attributes while even reducing RMSE by 1.347%.
Moreover, we found that MLR and ARIMA showed limited capabilities in mitigating unfairness. In the Chicago dataset, the prediction accuracy disparities for education and age (as shown by the change in PAG) for MLR and ARIMA even increased after debiasing multiple protected attributes. Also, our examination of the Austin dataset indicated that, after incorporating the proposed fairness regularization term, although the PAG for MLR and ARIMA decreased, the magnitude of this reduction was modest compared with the other models. In fact, these two models are less flexible than the deep learning models since they have a pre-specified model structure. We believe this inherent limitation could hinder their effectiveness in addressing fairness concerns.
Lastly, in most cases, our proposed multi-attribute unfairness correction method performs better at reducing prediction disparities and improving fairness than debiasing only a single attribute, especially for the more complex deep learning models (e.g., GRU, T-GCN, STGCN and ConvLSTM). For example, Table 5 shows that when considering multiple attributes together, GRU and ConvLSTM can close more than 94% of the PAG for income in absolute value, while in the single-attribute scenario the PAG is only reduced by around 60%. However, we also observed that in certain cases, single-attribute unfairness correction could produce fairer performance. For example, GRU is more effective in reducing the PAG when only debiasing age in the Austin dataset.
Table 7: Comparison of single-attribute (age only) and multi-attribute unfairness correction for ConvLSTM.

| Model | λ | RMSE | Race PAG (%) | Edu PAG (%) | Age PAG (%) | Income PAG (%) |
|---|---|---|---|---|---|---|
| **Chicago** | | | | | | |
| ConvLSTM (original) | 0.000 | 8.176 | 8.958 | 13.063 | 14.696 | 9.392 |
| ConvLSTM (debiasing only age) | 0.050 | 8.200 | 10.520 | -2.643 | -7.673 | 12.159 |
| ConvLSTM (debiasing multi-attribute) | 0.025 | 8.595 | 0.649 | -6.612 | -6.441 | 0.462 |
| **Austin** | | | | | | |
| ConvLSTM (original) | 0.000 | 3.162 | -0.391 | 5.289 | 4.642 | 0.888 |
| ConvLSTM (debiasing only age) | 0.050 | 3.354 | 2.027 | 3.993 | 0.771 | 8.238 |
| ConvLSTM (debiasing multi-attribute) | 0.025 | 3.341 | 1.558 | -1.641 | -0.534 | 0.295 |
To provide a more comprehensive demonstration of the efficacy of the proposed multi-attribute unfairness correction approach and to pinpoint potential shortcomings in the single-attribute bias correction method, we conduct a comparative analysis of unfairness correction outcomes achieved through debiasing the age variable alone versus debiasing multiple attributes simultaneously. We have chosen the top-performing model, i.e., ConvLSTM, for demonstration. The resulting findings can be found in Table 7.
We found that correcting unfairness with respect to one attribute might even create more bias for other protected attributes, which aligns with a previous study [60]. This finding highlights the importance of considering multiple protected attributes at once. Specifically, the results show that, compared with the original model that purely focused on prediction accuracy, solely correcting the unfairness of the age variable could indeed reduce the absolute value of its PAG. However, by only considering age, the PAG for other variables, especially race and income, even increases. For example, in the Austin dataset, debiasing only age shrank the PAG of age from 4.642% to 0.771% while significantly worsening the PAG of income from 0.888% to 8.238%. This unexpected outcome further illustrates that transportation resource allocations intended to be fair across age groups could nonetheless remain unfair across communities with different income levels. Notably, the results show that the proposed multi-attribute unfairness correction method can effectively debias multiple protected attributes, and in almost all cases the absolute value of PAG drops significantly compared with the original model without sacrificing much prediction accuracy.
5.3 Sensitivity analysis of fairness weight
We also explored the influence of the fairness weight λ in shaping the trade-off between accuracy and fairness based on the predictive performance of the seven models with four protected attributes. Fig. 1 in Appendix C illustrates the sensitivity analysis of λ in determining accuracy and fairness. The x-axis is the value of λ while the y-axis is the performance metric (RMSE, correlation coefficient or PAG). Generally, the accuracy of all models decreases as λ gradually increases. In terms of RMSE, the marginal effect of λ on more complex models is relatively small. The figures show that as λ grows, the correlation first drastically increases/decreases and then remains flat. Notably, setting a small weight can bring the correlation down to around 0. The PAG shows a decreasing trend as λ gradually increases, but in most cases the gap becomes over-corrected when λ is greater than 0.1. According to the tables shown in Section 5.2, a suitable fairness weight likely lies in the range between 0 and 0.1. This finding further reinforces the effectiveness of our proposed unfairness correction approach: incorporating only a small weight for fairness can lead to a significant improvement in producing fair predictions. We also found that increasing the fairness weight may not monotonically reduce the PAG. This finding echoes the results in Zheng et al. [60], who showed that increasing the fairness weight might even widen the PAG. Our computational experiments show that this scenario frequently occurs for the traditional statistical models. This finding also suggests the need for a more fine-grained search range of λ when conducting hyperparameter tuning. Overall, the sensitivity of the effects of λ shows great consistency across the two case studies. Finally, we noticed that in the Austin case, setting the fairness weight to 0.4 for GRU led to a substantial increase in RMSE and PAG. One possible reason could be that this combination of hyperparameters causes exploding gradients and thus leads to numerical instability.
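The grid search over λ described above could be organized along the lines of the sketch below; the candidate values, the `train_and_eval` callback, and the tolerated-RMSE-increase selection rule are assumptions meant to illustrate the accuracy-fairness trade-off, not the exact criterion used in the paper.

```python
# Candidate fairness weights spanning the 0-0.5 range reported in the tables (illustrative).
lambdas = [0.0, 0.025, 0.05, 0.075, 0.1, 0.2, 0.3, 0.4, 0.5]

def select_best_lambda(train_and_eval, lambdas, max_rmse_increase=0.05):
    """Pick the lambda that most reduces the mean absolute PAG while keeping the
    RMSE within a tolerated increase over the fairness-unaware baseline (lambda = 0).
    `train_and_eval(lam)` is assumed to return (rmse, mean_abs_pag)."""
    base_rmse, _ = train_and_eval(0.0)
    best_lam, best_pag = 0.0, float("inf")
    for lam in lambdas:
        rmse, pag = train_and_eval(lam)
        if rmse <= base_rmse * (1 + max_rmse_increase) and pag < best_pag:
            best_lam, best_pag = lam, pag
    return best_lam
```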
5.4 Comparison with benchmark fairness regularization methods
Table 8: Comparison of the proposed absolute correlation regularizer ("Proposed") with three benchmark regularizers (EM, RFG, IFG). Chicago models debias race; Austin models debias education.

| Model | Regularizer | Chicago λ | Chicago RMSE | Chicago Corr | Chicago PAG (%) | Austin λ | Austin RMSE | Austin Corr | Austin PAG (%) |
|---|---|---|---|---|---|---|---|---|---|
| MLR | Proposed | 0.025 | 12.975 | -0.001 | -1.980 | 0.025 | 4.281 | -0.015 | -0.318 |
| MLR | EM | 0.025 | 15.854 | -0.124 | 60.016 | 0.025 | 4.270 | -0.030 | 2.829 |
| MLR | RFG | 0.025 | 13.100 | -0.002 | -3.355 | 0.025 | 4.274 | -0.055 | 5.609 |
| MLR | IFG | 0.025 | 13.048 | 0.003 | -4.619 | 0.025 | 4.279 | -0.005 | 0.621 |
| ARIMA | Proposed | 0.025 | 12.263 | -0.021 | 0.424 | 0.025 | 4.701 | 0.005 | 0.071 |
| ARIMA | EM | 0.025 | 15.679 | -0.136 | 62.380 | 0.025 | 4.702 | -0.047 | 4.730 |
| ARIMA | RFG | 0.025 | 12.433 | 0.000 | -4.053 | 0.025 | 4.698 | -0.071 | 7.268 |
| ARIMA | IFG | 0.025 | 12.375 | 0.009 | -7.000 | 0.025 | 4.696 | -0.013 | 1.776 |
| MLP | Proposed | 0.050 | 10.142 | -0.022 | 0.059 | 0.050 | 4.166 | -0.012 | 0.774 |
| MLP | EM | 0.050 | 13.119 | -0.014 | 0.316 | 0.050 | 4.120 | -0.050 | 4.898 |
| MLP | RFG | 0.050 | 10.259 | -0.026 | 0.483 | 0.050 | 4.110 | -0.045 | 4.261 |
| MLP | IFG | 0.050 | 10.221 | -0.034 | 2.027 | 0.050 | 4.147 | -0.039 | 3.727 |
| GRU | Proposed | 0.025 | 9.263 | -0.002 | -2.940 | 0.025 | 3.517 | -0.003 | 0.923 |
| GRU | EM | 0.025 | 14.455 | -0.141 | 120.286 | 0.025 | 4.329 | -0.111 | 9.201 |
| GRU | RFG | 0.025 | 10.109 | -0.017 | 0.714 | 0.025 | 3.545 | -0.051 | 3.926 |
| GRU | IFG | 0.025 | 9.325 | -0.072 | 10.440 | 0.025 | 3.564 | -0.060 | 4.744 |
| T-GCN | Proposed | 0.075 | 10.103 | 0.003 | -3.958 | 0.200 | 3.767 | -0.001 | 0.898 |
| T-GCN | EM | 0.075 | 16.722 | -0.108 | 102.303 | 0.200 | 4.050 | -0.048 | 7.186 |
| T-GCN | RFG | 0.075 | 12.215 | -0.034 | 17.875 | 0.200 | 3.990 | -0.018 | 3.570 |
| T-GCN | IFG | 0.075 | 12.621 | -0.074 | 31.545 | 0.200 | 3.982 | 0.005 | 2.119 |
| STGCN | Proposed | 0.075 | 8.790 | 0.005 | -1.486 | 0.300 | 4.062 | 0.002 | 0.835 |
| STGCN | EM | 0.075 | 12.891 | -0.009 | 45.571 | 0.300 | 4.576 | -0.059 | 6.618 |
| STGCN | RFG | 0.075 | 10.216 | 0.009 | 15.865 | 0.300 | 4.092 | 0.000 | -2.710 |
| STGCN | IFG | 0.075 | 9.904 | -0.002 | 5.433 | 0.300 | 4.403 | -0.018 | 3.307 |
| ConvLSTM | Proposed | 0.025 | 8.474 | -0.001 | -0.389 | 0.025 | 3.325 | -0.001 | 0.497 |
| ConvLSTM | EM | 0.025 | 12.634 | -0.153 | 92.676 | 0.025 | 3.474 | -0.097 | 6.164 |
| ConvLSTM | RFG | 0.025 | 8.711 | -0.035 | 7.152 | 0.025 | 3.437 | -0.146 | 9.645 |
| ConvLSTM | IFG | 0.025 | 8.226 | -0.083 | 13.887 | 0.025 | 3.382 | -0.069 | 4.705 |
This study compares the performance of the proposed unfairness correction approach (i.e., the absolute correlation regularization term) with three state-of-the-art benchmark regularizers, including Equal Mean (EM) [16], Region-based Fairness Gap (RFG) and Individual-based Fairness Gap (IFG) [51]. For the experiments, we only consider the single-attribute scenario, since these three benchmark regularizers are explicitly designed to address unfairness for a single protected attribute. For the Chicago ridesourcing dataset, we select race (percentage of white population) for model debiasing, while for the RideAustin dataset, the education variable (percentage of bachelor holders) is chosen for comparison. All benchmark regularizers are set with the best-performing λ yielded by our proposed method for comparison.
Table 8 presents the comparative analysis between our proposed method (i.e., the absolute correlation regularizer) and the three state-of-the-art benchmark regularizers. The results show that the proposed method clearly outperforms the other methods in terms of both preserving prediction accuracy and improving fairness. Among all regularizers, EM delivers the worst performance. This is expected, since EM focuses on balancing the target variable (i.e., ridesourcing demand) of disadvantaged and advantaged groups instead of the prediction accuracy. However, this approach could be questionable, since variations in ridesourcing usage between different population groups may naturally exist due to socioeconomic and demographic disparities [14]. RFG and IFG tend to yield better outcomes in terms of both accuracy and fairness than EM. Moreover, in certain scenarios, their performance (especially for correlation and RMSE) surpasses that of the proposed method. We attribute this to their capability to effectively reduce variations in per capita travel demand within each individual population group, as indicated in Yan and Howe [51]. However, these two metrics may still not fully account for the inherent disparities of different population groups in generating travel demand [58]. In most cases, especially for deep learning models with more complex architectures, the proposed method significantly reduces the PAG between disadvantaged and advantaged groups while keeping the prediction error the lowest. Although in some cases the proposed method may not be the best-performing one regarding accuracy, the resulting RMSE still remains within an acceptable range.
6 Discussion
The preceding sections demonstrate the modeling results of our proposed unfairness correction method. In this section, we discuss the merits of the unfairness correction method, its policy implications, and the limitations of this work along with future research directions.
6.1 Merit of the unfairness correction method
The merit of the proposed unfairness correction method is threefold.
First, a new regularizer to simultaneously debias multiple protected attributes. The current literature rarely discusses how to effectively address fairness issues for multiple protected attributes, yet a method that can accommodate various fairness needs is necessary for real-world applications [46]. This study addresses this gap by using the multiple correlation coefficient of a linear model as a regularization term and incorporating it into the loss function. The multiple correlation coefficient directly measures the correlation between the target variable of that linear model (i.e., the prediction accuracy) and a set of protected demographic variables (i.e., race, age, education, and income). By minimizing this coefficient, AI models can debias multiple protected attributes simultaneously. Unlike adding multiple regularization terms (one for each attribute) [51], this approach is straightforward and easy to implement, and there is no need to fine-tune a separate fairness weight for each attribute (a single weight is enough). Also, this approach raises little concern about multicollinearity among the protected attributes (as shown in Appendix B), since the goal of the linear model is to use the set of protected attributes to forecast the prediction errors rather than to estimate and interpret the beta coefficients [45]. Overall, our proposed unfairness correction method enables future studies to flexibly debias either a single protected attribute or multiple protected attributes of interest.
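For concreteness, below is a minimal PyTorch sketch of how such a multiple-correlation regularizer could be added to a standard prediction loss. This is our own simplified rendering rather than the exact implementation used in this study: the use of absolute errors as the accuracy signal, the small ridge and epsilon terms for numerical stability, and all function and argument names (multiple_correlation, fairness_aware_loss, fairness_weight) are assumptions.

```python
import torch

def multiple_correlation(errors, protected):
    """Differentiable multiple correlation coefficient R between per-region
    prediction errors and a set of protected attributes.

    errors:    (n,) tensor of prediction errors (e.g., absolute errors per census tract)
    protected: (n, k) tensor of protected attributes (e.g., race, age, education, income)
    """
    e = errors - errors.mean()
    X = protected - protected.mean(dim=0, keepdim=True)
    # Fit a linear model e ~ X by least squares; a tiny ridge term keeps the
    # normal equations well-conditioned even when attributes are correlated.
    XtX = X.T @ X + 1e-6 * torch.eye(X.shape[1], dtype=X.dtype, device=X.device)
    beta = torch.linalg.solve(XtX, X.T @ e)
    fitted = X @ beta
    # R^2 = explained variance / total variance of the errors; R = sqrt(R^2).
    r2 = (fitted @ e) / (e @ e + 1e-12)
    return torch.sqrt(torch.clamp(r2, min=0.0) + 1e-12)

def fairness_aware_loss(y_pred, y_true, protected, fairness_weight=0.1):
    """Prediction loss plus the multiple-correlation fairness regularizer,
    controlled by a single fairness weight for all protected attributes."""
    mse = torch.mean((y_pred - y_true) ** 2)
    errors = torch.abs(y_pred - y_true)
    return mse + fairness_weight * multiple_correlation(errors, protected)
```

Because a single scalar R captures the joint linear association between the errors and all protected attributes, minimizing it debiases all attributes at once with only one fairness weight to tune.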
Second, flexibility and transparency. The proposed unfairness correction method is model-agnostic and may be generalizable to different applications and data modalities. We implemented the unfairness correction method on both statistical and deep learning models, and the results jointly demonstrate that this approach can generally mitigate unfairness while only slightly reducing overall accuracy. Specifically, we correct the unfairness by incorporating an explicitly designed absolute correlation regularization term into the loss function without modifying the model structure. This gives the unfairness correction method great flexibility, as it is independent of the underlying model, so scholars can flexibly adopt any model they want when addressing fairness issues. The proposed method also enjoys great transparency, since end-users (e.g., stakeholders) can easily understand how fairness is being taken into account and improved (through the fairness regularization term). Moreover, the method is transferable to other forecasting applications. Besides travel demand forecasting, other important applications, including traffic count forecasting, pedestrian activity forecasting, and crash frequency forecasting, may also harbor latent fairness problems; researchers can apply our proposed method to address these fairness issues and support fair decision-making. This study only examined the proposed method using time-series (panel) data, but we believe it can be readily generalized to applications with different data modalities. For example, transportation-planning models, which usually use cross-sectional data, should also be examined with fairness analysis, and our unfairness correction method can be flexibly adopted by planning models [e.g., 58] to inform the fair design of transportation ecosystems. Flexibility is also reflected in that, once the models are trained, access to the protected attributes is no longer required. Unlike post-processing techniques, which always require access to the protected attributes [27, 1], our approach lifts this restriction and can be flexibly adapted to future forecasting tasks.
Third, effectiveness in achieving fairness while preserving prediction accuracy. Multiple studies have reported that machine learning faces a trade-off between accuracy and fairness (e.g., [8, 1]), i.e., reducing unfairness inevitably triggers a drop in accuracy. Our scheme addresses this trade-off by incorporating an adjustable weight coefficient (i.e., the fairness weight) into the loss function. We treat this weight as a hyperparameter of the learning task and tune it together with the other hyperparameters. In this way, the model automatically finds the hyperparameter combination that performs best in improving fairness while maintaining prediction accuracy. Most of our experiments revealed that this approach can significantly reduce unfairness at only a small cost in accuracy, and in some cases it even significantly improves fairness while slightly improving prediction accuracy.
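The sketch below illustrates how the fairness weight could be treated as one more hyperparameter in a grid search. The selection rule (smallest accuracy gap among configurations whose RMSE stays within a small slack of the best RMSE), the function name tune_fairness_weight, and the train_fn/eval_fn interface are illustrative assumptions, not the exact procedure used in this study.

```python
import itertools

def tune_fairness_weight(train_fn, eval_fn, fairness_weights, other_grids, rmse_slack=1.05):
    """Grid search that treats the fairness weight as one more hyperparameter.

    train_fn(fairness_weight=..., **hp) -> trained model
    eval_fn(model)                      -> (rmse, pag) on validation data
    other_grids: dict mapping hyperparameter names to candidate value lists
    """
    results = []
    keys = list(other_grids)
    for lam in fairness_weights:
        for values in itertools.product(*(other_grids[k] for k in keys)):
            hp = dict(zip(keys, values))
            model = train_fn(fairness_weight=lam, **hp)
            rmse, pag = eval_fn(model)
            results.append({"fairness_weight": lam, **hp, "rmse": rmse, "pag": pag})
    # One possible selection rule (an assumption, not necessarily the paper's):
    # among configurations whose RMSE is within a small slack of the best RMSE,
    # keep the one with the smallest absolute prediction accuracy gap.
    best_rmse = min(r["rmse"] for r in results)
    feasible = [r for r in results if r["rmse"] <= rmse_slack * best_rmse]
    return min(feasible, key=lambda r: abs(r["pag"]))
```

Other selection criteria (e.g., a weighted combination of RMSE and PAG) could be substituted without changing the overall procedure.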
6.2 Policy implications
Dynamically balancing the supply and demand of transportation systems is important for improving cost-effectiveness and efficiency, and this balance relies heavily on accurate predictions [22]. Although machine learning substantially improves predictions, it may simultaneously introduce bias: overall satisfactory predictions may hide a large prediction accuracy gap across areas of the city or across underrepresented groups of residents [51, 60]. Our study also confirms this finding. Specifically, Table 3 shows that both machine learning and statistical models can produce lower prediction accuracy for disadvantaged communities (i.e., non-white-majority, lower-educational-attainment, elderly, and low-income communities) than for advantaged communities. This predictive disparity implies that if transportation planners naively use such travel demand forecasting models without accounting for fairness issues, the modeling results will lead to ineffective transportation resource allocation, impede the mobility of disadvantaged communities, and possibly further exacerbate the existing operational biases of ridesourcing services, e.g., higher trip-cancellation rates, longer waiting times, and higher per-mile fees for disadvantaged communities [12, 55, 11, 13].
Our proposed method can help mitigate the unfairness issues of current ridesourcing operations to better serve disadvantaged communities. We believe that ridesourcing policymakers should consider incorporating our proposed method into the travel demand modeling framework to inform fairer ridesourcing resource allocation and operations. Additionally, the two fairness metrics (i.e., the prediction accuracy gap and the correlation coefficient) can be used by city governments to evaluate and regulate ridesourcing operations. Moreover, the fairness measurements and the unfairness correction method can be adopted to facilitate the effective operation of other travel modes such as public transit and shared micromobility. For example, an accurate and fair demand forecasting model will enable transit authorities to provide more personalized transit services that balance operational efficiency and effectiveness [26]. Also, a fairness-aware travel demand forecasting model will help micromobility (e.g., bikeshare and e-scooter) operators better rebalance their vehicles and ensure a fair distribution of service availability throughout the day [54].
6.3 Limitations and future research directions
This study has some limitations that warrant follow-up investigations. First, we only evaluated the proposed methodology using two fairness metrics (i.e., the prediction accuracy gap and the correlation coefficient); future work may consider a wider range of fairness metrics for a more comprehensive evaluation. Moreover, by using correlation techniques, we assume that prediction accuracy is linearly correlated with the protected attributes; future studies may explore whether this association is nonlinear and develop corresponding methods. Another widely debated research topic is the connection between accuracy and fairness. Several previous studies have shown that the accuracy-fairness trade-off exists across datasets and applications [8, 21], while others have shown that improvements in accuracy and fairness can co-occur [50]. Hence, forthcoming investigations may shed further light on this relationship, such as identifying scenarios in which fairness and accuracy can both be enhanced or in which the accuracy-fairness trade-off is prominent. Finally, this study only examined one travel mode (i.e., ridesourcing). A more comprehensive analysis that includes various travel modes (e.g., transit, car-sharing, and shared micromobility) and diverse contexts (e.g., different locations) should be conducted to test the generalizability and robustness of the unfairness correction method.
7 Conclusion
This study examines the fairness issues in travel demand forecasting models and develops a new methodology to enhance their fairness while preserving prediction accuracy. Leveraging two real-world ridesourcing-trip datasets from Chicago, IL and Austin, TX, we evaluate the unfairness issues of seven state-of-the-art AI-based travel demand forecasting models. A novel and transparent in-processing method, based on an absolute correlation regularization term, is proposed to simultaneously address the unfairness arising from multiple protected attributes. We also compare the performance (in both fairness and accuracy) of our proposed unfairness correction method with three state-of-the-art unfairness correction methods to show its effectiveness.
The results highlight that both statistical and machine learning models have pronounced fairness issues, i.e., the prediction accuracy for advantaged groups is notably higher than that for disadvantaged groups. Our proposed unfairness correction method can effectively enhance fairness for multiple protected attributes while preserving prediction accuracy. The comparative study reveals that our proposed method significantly outperforms other methods in both fairness and accuracy. Beyond performance, our proposed method has remarkable flexibility: it is model-agnostic and can be adapted to different applications and data modalities. In summary, this study advances our understanding of fairness issues in travel demand forecasting and equips transportation researchers with a powerful tool to foster fairness within the transportation ecosystem.
Authorship Contribution Statement
The authors confirm contributions to the paper as follows: Zhang: Conceptualization, Data Curation, Methodology, Software, Formal Analysis, and Draft Preparation. Ke: Conceptualization, Methodology, Formal Analysis, Draft Preparation. Zhao: Conceptualization, Methodology, Draft Preparation, Supervision and Grant Acquisition.
Acknowledgment
This research was partially supported by the U.S. Department of Transportation through the Southeastern Transportation Research, Innovation, Development and Education (STRIDE) Region 4 University Transportation Center (Grant No. 69A3551747104) and through the Tier 1 University Transportation Center for Equitable Transit-Oriented Communities (CETOC) (Grant No. 69A3552348337). During the preparation of this work, the authors used ChatGPT to check for grammar errors and improve the language. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
Appendix A
The following two figures show the spatial distribution of the protected attributes and the average ridesourcing demand per hour for the two case studies.
Appendix B
The correlation matrix of the four selected protected attributes for the two case studies is shown in Table B.1; a minimal sketch of how such a matrix can be computed follows the table. The results show that the protected attributes are notably correlated with each other.
| Chicago | | | | Austin | | | |
---|---|---|---|---|---|---|---|---|
| Race | Edu | Age | Income | Race | Edu | Age | Income |
Race | 1.000 | 0.620 | 0.504 | -0.748 | 1.000 | 0.605 | -0.178 | -0.401 |
Edu | 0.620 | 1.000 | 0.682 | -0.610 | 0.605 | 1.000 | -0.062 | -0.424 |
Age | 0.504 | 0.682 | 1.000 | -0.403 | -0.178 | -0.062 | 1.000 | 0.620 |
Income | -0.748 | -0.610 | -0.403 | 1.000 | -0.401 | -0.424 | 0.620 | 1.000 |
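For reference, a correlation matrix of this kind can be computed directly from tract-level census data. The snippet below is a minimal sketch; the DataFrame values and column names (race, edu, age, income) are hypothetical placeholders for the actual census variables.

```python
import pandas as pd

# Hypothetical tract-level data: one row per census tract, one column per protected attribute.
tracts = pd.DataFrame({
    "race":   [0.82, 0.35, 0.61, 0.12],   # e.g., % white population
    "edu":    [0.54, 0.22, 0.47, 0.18],   # e.g., % bachelor's degree holders
    "age":    [0.21, 0.14, 0.19, 0.11],   # e.g., % elderly population
    "income": [0.65, 0.30, 0.58, 0.25],   # e.g., median income (rescaled)
})

# Pairwise Pearson correlations among the protected attributes, as reported in Table B.1.
corr_matrix = tracts.corr()
print(corr_matrix.round(3))
```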
Appendix C
The results of the sensitivity analysis for the fairness weight are presented in Fig. 1. We specifically investigated the effects of the fairness weight on the model's prediction accuracy (measured by RMSE) and on fairness (measured by both PAG and correlation). Note: in the Austin case, setting the fairness weight to 0.4 for GRU led to a substantial increase in RMSE and PAG. One possible reason is that this combination of hyperparameters may have caused exploding gradients and thus numerical instability.
References
- Agarwal et al. (2019) Agarwal, A., Dudík, M., Wu, Z.S., 2019. Fair regression: Quantitative definitions and reduction-based algorithms, in: International Conference on Machine Learning, PMLR. pp. 120–129.
- Allison (1999) Allison, P.D., 1999. Multiple regression: A primer. Pine Forge Press.
- Angwin et al. (2016) Angwin, J., Larson, J., Mattu, S., Kirchner, L., 2016. Machine bias, in: Ethics of Data and Analytics. Auerbach Publications, pp. 254–264.
- Bai and Krishnaiah (2003) Bai, Z., Krishnaiah, P., 2003. Reduction of dimensionality, in: Meyers, R.A. (Ed.), Encyclopedia of Physical Science and Technology (Third Edition). Academic Press, New York, pp. 55–73. URL: https://www.sciencedirect.com/science/article/pii/B012227410500466X, doi:10.1016/B0-12-227410-5/00466-X.
- Baker and Hawn (2021) Baker, R.S., Hawn, A., 2021. Algorithmic bias in education. International Journal of Artificial Intelligence in Education , 1–41.
- Barabas et al. (2018) Barabas, C., Virza, M., Dinakar, K., Ito, J., Zittrain, J., 2018. Interventions over predictions: Reframing the ethical debate for actuarial risk assessment, in: Conference on fairness, accountability and transparency, PMLR. pp. 62–76.
- Barocas and Selbst (2016) Barocas, S., Selbst, A.D., 2016. Big data’s disparate impact. Calif. L. Rev. 104, 671.
- Berk et al. (2017) Berk, R., Heidari, H., Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., Neel, S., Roth, A., 2017. A convex framework for fair regression. arXiv preprint arXiv:1706.02409 .
- Beutel et al. (2019) Beutel, A., Chen, J., Doshi, T., Qian, H., Woodruff, A., Luu, C., Kreitmann, P., Bischof, J., Chi, E.H., 2019. Putting fairness principles into practice: Challenges, metrics, and improvements, in: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 453–459.
- Bose and Hamilton (2019) Bose, A., Hamilton, W., 2019. Compositional fairness constraints for graph embeddings, in: International Conference on Machine Learning, PMLR. pp. 715–724.
- Brown (2022) Brown, A., 2022. Not all fees are created equal: Equity implications of ride-hail fee structures and revenues submitted to transport policy. Transport Policy .
- Brown et al. (2019) Brown, A., Polzin, S.E., Sperling, D., Hampshire, R., Shoup, D., Saphores, J.D., Mitra, S.K., Willson, R., 2019. The equalizer: Could ride-hailing extend equitable car access? , 2.
- Brown and Williams (2021) Brown, A., Williams, R., 2021. Equity implications of ride-hail travel during covid-19 in california. Transportation Research Record , 03611981211037246.
- Brown (2019) Brown, A.E., 2019. Prevalence and mechanisms of discrimination: Evidence from the ride-hail and taxi industries. Journal of Planning Education and Research , 0739456X19871687.
- Buolamwini and Gebru (2018) Buolamwini, J., Gebru, T., 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification, in: Conference on fairness, accountability and transparency, PMLR. pp. 77–91.
- Calders et al. (2013) Calders, T., Karim, A., Kamiran, F., Ali, W., Zhang, X., 2013. Controlling attribute effect in linear regression, in: 2013 IEEE 13th international conference on data mining, IEEE. pp. 71–80.
- Calmon et al. (2017) Calmon, F., Wei, D., Vinzamuri, B., Natesan Ramamurthy, K., Varshney, K.R., 2017. Optimized pre-processing for discrimination prevention. Advances in neural information processing systems 30.
- Caton and Haas (2020) Caton, S., Haas, C., 2020. Fairness in machine learning: A survey. arXiv preprint arXiv:2010.04053 .
- Cheng et al. (2021) Cheng, P., Hao, W., Yuan, S., Si, S., Carin, L., 2021. Fairfil: Contrastive neural debiasing method for pretrained text encoders. arXiv preprint arXiv:2103.06413 .
- Cho et al. (2014) Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y., 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 .
- Chouldechova and Roth (2018) Chouldechova, A., Roth, A., 2018. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810 .
- Chu et al. (2019) Chu, K.F., Lam, A.Y., Li, V.O., 2019. Deep multi-scale convolutional lstm network for travel demand and origin-destination predictions. IEEE Transactions on Intelligent Transportation Systems 21, 3219–3232.
- Dressel and Farid (2018) Dressel, J., Farid, H., 2018. The accuracy, fairness, and limits of predicting recidivism. Science advances 4, eaao5580.
- Du et al. (2020) Du, M., Yang, F., Zou, N., Hu, X., 2020. Fairness in deep learning: A computational perspective. IEEE Intelligent Systems 36, 25–34.
- Dwork et al. (2012) Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R., 2012. Fairness through awareness, in: Proceedings of the 3rd innovations in theoretical computer science conference, pp. 214–226.
- Ermagun and Tilahun (2020) Ermagun, A., Tilahun, N., 2020. Equity of transit accessibility across chicago. Transportation Research Part D: Transport and Environment 86, 102461.
- Hardt et al. (2016) Hardt, M., Price, E., Srebro, N., 2016. Equality of opportunity in supervised learning. Advances in neural information processing systems 29.
- Johnson et al. (2016) Johnson, K.D., Foster, D.P., Stine, R.A., 2016. Impartial predictive modeling: Ensuring fairness in arbitrary models. Statistical Science , 1.
- Kamiran and Calders (2012) Kamiran, F., Calders, T., 2012. Data preprocessing techniques for classification without discrimination. Knowledge and information systems 33, 1–33.
- Kamishima et al. (2011) Kamishima, T., Akaho, S., Sakuma, J., 2011. Fairness-aware learning through regularization approach, in: 2011 IEEE 11th International Conference on Data Mining Workshops, IEEE. pp. 643–650.
- Kaur et al. (2022) Kaur, D., Uslu, S., Rittichier, K.J., Durresi, A., 2022. Trustworthy artificial intelligence: a review. ACM Computing Surveys (CSUR) 55, 1–38.
- Kearns et al. (2018) Kearns, M., Neel, S., Roth, A., Wu, Z.S., 2018. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness, in: International conference on machine learning, PMLR. pp. 2564–2572.
- Kearns et al. (2019) Kearns, M., Neel, S., Roth, A., Wu, Z.S., 2019. An empirical study of rich subgroup fairness for machine learning, in: Proceedings of the conference on fairness, accountability, and transparency, pp. 100–109.
- Kim et al. (2021) Kim, H., Shin, S., Jang, J., Song, K., Joo, W., Kang, W., Moon, I.C., 2021. Counterfactual fairness with disentangled causal effect variational autoencoder, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8128–8136.
- Kingma and Ba (2014) Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
- Kusner et al. (2017) Kusner, M.J., Loftus, J., Russell, C., Silva, R., 2017. Counterfactual fairness. Advances in neural information processing systems 30.
- Li et al. (2023) Li, B., Qi, P., Liu, B., Di, S., Liu, J., Pei, J., Yi, J., Zhou, B., 2023. Trustworthy ai: From principles to practices. ACM Computing Surveys 55, 1–46.
- Makridakis and Hibon (1997) Makridakis, S., Hibon, M., 1997. Arma models and the box–jenkins methodology. Journal of forecasting 16, 147–163.
- Mehrabi et al. (2021) Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A., 2021. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR) 54, 1–35.
- Obermeyer et al. (2019) Obermeyer, Z., Powers, B., Vogeli, C., Mullainathan, S., 2019. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453.
- Olteanu et al. (2019) Olteanu, A., Castillo, C., Diaz, F., Kıcıman, E., 2019. Social data: Biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data 2, 13.
- Prates et al. (2020) Prates, M.O., Avelar, P.H., Lamb, L.C., 2020. Assessing gender bias in machine translation: a case study with google translate. Neural Computing and Applications 32, 6363–6381.
- Quadrianto et al. (2019) Quadrianto, N., Sharmanska, V., Thomas, O., 2019. Discovering fair representations in the data domain, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8227–8236.
- Shi et al. (2015) Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c., 2015. Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in neural information processing systems 28.
- Shmueli (2010) Shmueli, G., 2010. To explain or to predict?
- Wan et al. (2023) Wan, M., Zha, D., Liu, N., Zou, N., 2023. In-processing modeling techniques for machine learning fairness: A survey. ACM Transactions on Knowledge Discovery from Data 17, 1–27.
- Wang and Russakovsky (2021) Wang, A., Russakovsky, O., 2021. Directional bias amplification, in: International Conference on Machine Learning, PMLR. pp. 10882–10893.
- Xu et al. (2019) Xu, D., Wu, Y., Yuan, S., Zhang, L., Wu, X., 2019. Achieving causal fairness through generative adversarial networks, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence.
- Xu et al. (2022) Xu, H., Zou, T., Liu, M., Qiao, Y., Wang, J., Li, X., 2022. Adaptive spatiotemporal dependence learning for multi-mode transportation demand prediction. IEEE Transactions on Intelligent Transportation Systems 23, 18632–18642.
- Yan (2021) Yan, A., 2021. Fairness-Aware Spatio-Temporal Prediction for Cities. University of Washington.
- Yan and Howe (2020) Yan, A., Howe, B., 2020. Fairness-aware demand prediction for new mobility, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1079–1087.
- Yan and Howe (2021) Yan, A., Howe, B., 2021. Equitensors: Learning fair integrations of heterogeneous urban data, in: Proceedings of the 2021 International Conference on Management of Data, pp. 2338–2347.
- Yan et al. (2020) Yan, X., Liu, X., Zhao, X., 2020. Using machine learning for direct demand modeling of ridesourcing services in chicago. Journal of Transport Geography 83, 102661.
- Yan et al. (2021) Yan, X., Yang, W., Zhang, X., Xu, Y., Bejleri, I., Zhao, X., 2021. A spatiotemporal analysis of e-scooters’ relationships with transit and station-based bikeshare. Transportation research part D: transport and environment 101, 103088.
- Yang et al. (2021) Yang, H., Liang, Y., Yang, L., 2021. Equitable? exploring ridesourcing waiting time and its determinants. Transportation Research Part D: Transport and Environment 93, 102774.
- Yang et al. (2023) Yang, J., Soltan, A.A., Eyre, D.W., Yang, Y., Clifton, D.A., 2023. An adversarial training framework for mitigating algorithmic biases in clinical machine learning. NPJ Digital Medicine 6, 55.
- Yu et al. (2017) Yu, B., Yin, H., Zhu, Z., 2017. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875 .
- Zhang and Zhao (2022) Zhang, X., Zhao, X., 2022. Machine learning approach for spatial modeling of ridesourcing demand. Journal of Transport Geography 100, 103310.
- Zhao et al. (2019) Zhao, L., Song, Y., Zhang, C., Liu, Y., Wang, P., Lin, T., Deng, M., Li, H., 2019. T-gcn: A temporal graph convolutional network for traffic prediction. IEEE Transactions on Intelligent Transportation Systems 21, 3848–3858.
- Zheng et al. (2021) Zheng, Y., Wang, S., Zhao, J., 2021. Equality of opportunity in travel behavior prediction with deep neural networks and discrete choice models. Transportation Research Part C: Emerging Technologies 132, 103410.