
Analyzing sequential activity and travel decisions with interpretable deep inverse reinforcement learning

Yuebing Liang Shenhao Wang Jiangbo Yu Zhan Zhao Jinhua Zhao Sandy Pentland The Singapore-MIT Alliance for Research and Technology Department of Urban and Regional Planning, University of Florida Department of Civil Engineering, McGill University Department of Urban Planning and Design, The University of Hong Kong Department of Urban Studies and Planning, Massachusetts Institute of Technology Media Lab, Institute for Data, Systems, and Society, Massachusetts Institute of Technology
Abstract

Travel demand modeling has shifted from aggregated trip-based models to behavior-oriented activity-based models because daily trips are essentially driven by human activities. To analyze the sequential activity-travel decisions, deep inverse reinforcement learning (DIRL) has proven effective in learning the decision mechanisms by approximating a reward function to represent preferences and a policy function to replicate observed behavior using deep neural networks (DNNs). However, most existing research has focused on using DIRL to enhance only prediction accuracy, with limited exploration into interpreting the underlying decision mechanisms guiding sequential decision-making. To address this gap, we introduce an interpretable DIRL framework for analyzing activity-travel decision processes, bridging the gap between data-driven machine learning and theory-driven behavioral models. Our proposed framework adapts an adversarial IRL approach to infer the reward and policy functions of activity-travel behavior. The policy function is interpreted through a surrogate interpretable model based on choice probabilities from the policy function, while the reward function is interpreted by deriving both short-term rewards and long-term returns for various activity-travel patterns. Our analysis of real-world travel survey data reveals promising results in two key areas: (i) behavioral pattern insights from the policy function, highlighting critical factors in decision-making and variations among socio-demographic groups, and (ii) behavioral preference insights from the reward function, indicating the utility individuals gain from specific activity sequences.

keywords:
Inverse Reinforcement Learning , Explainable Artificial Intelligence , Deep Learning , Activity-Based Travel Demand Model

1 Introduction

In recent decades, there has been increasing interest in behaviorally-oriented activity-based travel demand models. Unlike simple aggregated trip-based models [22], which focus solely on individual trips, activity-based models consider the entire sequence of activities individuals undertake throughout the day, such as work, school, shopping, recreation, and other activities [7, 23]. By recognizing the interactions between activities and travel, these models provide deeper insights into how people organize their daily lives and make travel decisions. Understanding these patterns is essential for urban planners and transportation engineers, as it allows them to evaluate the impacts of policy interventions and transportation investments on human behavior [24, 4].

Early studies on modeling activity and travel decisions utilized rule-based computational process models [24, 4], which relied on predefined decision rules to simulate human decision-making and often suffered from subjective biases. Subsequent research introduced utility-based econometric models, which estimate utility functions for different alternatives and assume that individuals select the alternative with the highest utility. Recent advancements have further enhanced this approach by developing dynamic utility-based models to capture sequential decision-making processes [46, 37, 34]. These models incorporate forward-looking behavior, enabling individuals to consider the future consequences of their decisions and optimize their overall utility. Despite these improvements, existing dynamic utility-based models often depend on discrete choice modeling frameworks that assume simple linear-in-parameter utility functions. Although there have been attempts to enhance the flexibility of utility functions using multiplicative interactions among attributes or mixtures of distributions [36], these approaches may still fall short in accurately reflecting behavioral realism.

Recently, machine learning has emerged as a powerful tool for capturing complex patterns in human behavior [12, 29, 17]. Among these techniques, inverse reinforcement learning (IRL) stands out as a specialized approach for modeling sequential decision-making tasks [26, 1]. Assuming observed human behavior is near-optimal with respect to some unknown reward function, the goal of IRL is to infer this underlying reward function, representing the agent’s preferences or objectives, and to derive a policy function that replicates the observed behavior. Similar to the utility in microeconometric models, the reward in IRL is a concise and portable representation of a task [35]. Recent research has theoretically demonstrated that IRL and dynamic discrete choice models (DDCMs), though originating from different research fields, belong to a class of optimization problems characterized by a common objective form [31]. Consequently, IRL serves as a bridge between data-driven machine learning and theory-driven behavioral modeling, offering significant potential for interpreting the underlying mechanisms that guide human sequential decision-making [2, 20, 42]. Recent advancements have integrated deep neural networks (DNNs) to approximate the reward and policy functions in the IRL framework, enhancing IRL’s ability to capture complex behavioral patterns. This approach, known as deep IRL (DIRL), has been applied to model various sequential decision-making processes in transportation, such as vehicle trajectory generation [11], route choice modeling [44], and activity-travel planning [35]. However, existing studies primarily focus on leveraging DIRL to improve prediction accuracy, often overlooking its potential for interpreting why humans make particular decisions.

This study addresses the research gap by introducing an interpretable DIRL framework for analyzing sequential activity-travel decisions. We model the decision-making process as a Markov Decision Process (MDP) and adapt an adversarial IRL framework [13] to infer reward and policy functions. The adversarial IRL framework combines IRL with generative adversarial networks (GANs), consisting of a generator to create realistic trajectories and a discriminator to distinguish between real and generated trajectories. The generator estimates a policy function representing human behavioral patterns, while the discriminator estimates a reward function reflecting human behavioral preferences. Utilizing the well-trained adversarial IRL model, we derive insights into behavioral patterns from the policy function and behavioral preferences from the reward function. Experiments with travel survey data demonstrate DIRL’s effectiveness in both improving prediction accuracy and enhancing the understanding of activity-travel behavior patterns and preferences. In summary, this study’s contributions lie in the following aspects:

  • Unlike previous studies that focus on prediction accuracy, this research introduces an interpretable DIRL framework for analyzing sequential activity and travel decisions. We first train an adversarial IRL model to learn activity-travel behavioral patterns and then introduce a post-hoc interpretation framework to explain the knowledge learned by the policy and reward networks. This interpretation framework is general and can be applied to other sequential human behaviors.

  • To extract behavioral patterns from the policy function, we adapt a knowledge distillation method, training a surrogate, interpretable model based on the soft labels predicted by the policy network. To explain behavioral preferences from the reward network, we map discrete human behavior sequences to continuous reward values using the learned reward function. These reward values are then used to identify different types of decision-makers and to compare the preferences associated with different activity sequences.

  • We conducted an analysis using travel survey data from Singapore. The results demonstrate DIRL’s interpretability in two key aspects: (i) insights into behavioral patterns from the policy function, highlighting critical factors in decision-making and variations among socio-demographic groups, and (ii) insights into behavioral preferences from the reward function, providing a direct quantifiable measure of utility when making sequential decisions.

2 Literature Review

2.1 Activity and travel decision modeling

Existing methods for modeling activity and travel decisions can be grouped into three categories: rule-based, utility-based, and machine learning-based approaches. Rule-based models use pre-defined sets of condition-action (if-then) rules to outline how individuals make decisions. Examples include ALBATROSS [4], TASHA [24], and ADAPTS [5]. These models often require detailed knowledge of travel patterns and can be biased by subjective interpretations, which may affect their general applicability.

Utility-based models operate on the principle that individuals select activity patterns that maximize their personal utility. These models often break down activity-travel patterns into hierarchical structures to represent various aspects of activity choices. For example, a nested logit structure developed by [8] represents activity patterns at a higher level and details regarding primary and secondary tours at a lower level. [28] decomposed activity-travel simulation into models for activity type and activity duration, using nested logit and split-population survival models respectively. However, these models are criticized for their inability to comprehensively capture the large space of possible activity-travel patterns, often missing out on the temporal aspects of trips and the interconnected nature of different decision components [37]. Recent advancements have explored dynamic discrete choice theories to improve the modeling of activity-travel decisions. These models assume that individuals make activity-travel decisions sequentially and consider not only the immediate utility of choices but also the anticipated future utility at each decision step. [46] developed a mixed recursive logit model that treats activity-travel scheduling decisions as choices of paths. [37] introduced a dynamic discrete choice model (DDCM) for daily activity-travel planning, which allows agents to be forward-looking by maximizing the anticipated utilities of all travel and activity episodes for the remainder of the day. [34] further introduced a mixed dynamic discrete choice model with finite mixture distributions to account for persistent unobserved differences among individuals. An important advantage of utility-based models is that they provide a quantifiable measure of preference or benefit through utility functions, which allows for straightforward interpretations of how different attributes impact human decisions. However, designing utility functions is challenging due to the complex nature of activity-travel patterns, and simple linear approaches often fail to capture their intricacies [35].

Another group of studies has introduced machine learning models to the problem of activity-travel pattern modeling, including decision trees [12], multilayer perceptron networks [29], and random forests [25]. Compared with rule-based and utility-based models, machine learning models are better at capturing complex behavior patterns from large-scale human mobility data and thus can improve prediction accuracy. On the other hand, their interpretability regarding behavioral mechanisms has been criticized due to their data-driven nature. Recently, inverse reinforcement learning (IRL) has emerged as a powerful paradigm for modeling sequential decision processes that bridges the gap between data-driven machine learning models and theory-driven behavioral modeling [26, 1]. It assumes that individuals make action decisions to maximize cumulative future reward and extracts a reward function from demonstration data that explains observed human behavior [1, 26]. The reward function is similar to the utility function in dynamic utility-based models. In fact, recent research has demonstrated deeper connections between dynamic DCMs and IRL: they belong to a class of optimization problems with the same objective and policy definitions, with the main difference being how they approximate the value function for solving the optimization problem [31]. Therefore, compared with other machine learning models, IRL allows for better interpretation of the underlying behavioral mechanisms.

A growing number of studies have introduced IRL to model sequential decisions in transportation tasks. [45] proposed maximum entropy IRL (MaxEntIRL) for route choice modeling, which assumes linear reward/utility functions and uses the principle of maximum entropy to resolve the ambiguity in choosing a distribution over decisions. Subsequent research leveraged learned reward function weights to interpret decision-making mechanisms in travel behavior, including mixed traffic interactions [2], departure time choices [18], and pedestrian crossing behaviors [42]. To extend IRL to more complex and large-scale problems, researchers have incorporated deep learning techniques, leading to the development of deep IRL (DIRL). For instance, [19] used deep learning architectures for reward function approximation in the MaxEntIRL approach in the context of delivery route planning. More recent work has introduced adversarial IRL (AIRL), which uses a generative adversarial network (GAN) to solve IRL problems [13]. [44] adapted the AIRL framework for route choice modeling with context-dependent reward functions to explicitly consider destination-related factors. [35] refined the AIRL framework to minimize f-divergences between expert and agent state marginal distributions, which were then used to model activity and travel choices. Despite these advancements, most research has predominantly focused on the prediction capabilities of DIRL to improve modeling accuracy, with limited exploration of its interpretability for providing insights into human decision-making strategies.

2.2 Interpretable Deep Learning

Although deep learning models often achieve better prediction performance than their conventional theory-driven statistical or econometric counterparts, their data-driven nature makes it challenging to explain the reasons behind their predictions. This lack of interpretability can raise concerns about the trustworthiness of deep learning models [3]. Additionally, in applications such as travel behavior modeling, understanding the underlying mechanisms of human decision-making processes is crucial. These insights are valuable for urban designers and policymakers when creating policy interventions or transportation investments. Consequently, there has been increasing attention on enhancing the interpretability of deep learning models [9]. Notable examples include [15], who introduced a global surrogate method that distills knowledge from DNNs by re-training interpretable models to fit the predicted soft labels of DNNs. [30] proposed a local surrogate model called LIME, which uses an interpretable model to fit machine learning model predictions locally. Building on this, [21] introduced SHapley Additive exPlanations (SHAP), a game theory-based method that connects LIME with Shapley values. Recent research efforts have also focused on interpreting specific types of DNNs, such as Grad-CAM, which provides visual explanations for CNN decisions [30], and GNNExplainer, which interprets graph-based machine learning models [43].

In the field of transportation, existing studies have linked DNNs with traditional economic models like DCMs and compared their interpretability, although most efforts have focused on mode choice. [40] used DNNs for choice analysis and demonstrated that DNNs can extract complete economic information for interpretation, similar to DCMs, including choice predictions, social welfare, elasticities, and more. [33] augmented the utility specification of DCMs with a new nonlinear representation from DNNs, allowing for more flexible utility representations. Researchers also designed new deep learning architectures to integrate the behavioral knowledge into standard feed-forward neural networks, achieving higher predictive performance [38, 39]. [3] introduced a heat map-based method to explain DNN predictions for mode choice using Layer-wise Relevance Propagation. These studies highlight the close relationship between multi-layer perceptrons (MLPs) and DCMs and provide examples of bridging the gap between data-driven machine learning models and theory-based econometric models using statistical learning theory [39]. However, unlike mode choice, the activity-travel decision process is more complex because it is sequential, with individuals making decisions based not only on immediate rewards but also on future expected utilities. Despite the great potential of DIRL in interpreting such sequential decision processes, as discussed in the previous section, existing DIRL-based studies have focused more on prediction accuracy, with limited exploration of its interpretability [44, 35].

3 Methodology

In this section, we introduce an interpretable DIRL framework to learn and analyze the sequential activity-travel choice process. As shown in Figure 1, the framework consists of three main components: (1) Data Preparation, which extracts activity-travel sequences and socio-demographic features from observed data. This study constructs these features from travel survey data, as detailed in Section 4.1. (2) DIRL Modeling, which formulates activity-travel decisions as a Markov Decision Process (MDP) and trains an AIRL model to mimic observed behavior. (3) DIRL Interpretation, which consists of a policy interpreter to understand behavioral patterns and a reward interpreter to understand behavioral preferences. More details regarding DIRL modeling and DIRL interpretation are provided in Section 3.1 and Section 3.2, respectively.

Figure 1: The framework of interpretable DIRL for activity-travel choice modeling

3.1 DIRL modeling

In this subsection, we first formulate the activity-travel decision-making process as a Markov Decision Process (see Section 3.1.1) and then introduce an adapted AIRL model to learn the behavioral mechanisms (see Section 3.1.2).

3.1.1 Sequential activity-travel choice as MDPs

This study aims to model daily activity-travel decisions of individuals, focusing on what activities people choose throughout the day and when they decide to start and end these activities or travel behaviors. To represent time choices discretely, we divide the day into fixed-length intervals (e.g., 15 minutes). In this way, an individual’s daily activity-travel sequence $Y_m$ can be represented as $Y_m=\{y_1,y_2,...,y_N\}$, where $y_i$ denotes the activity conducted at time step $i$, and $N$ is the total number of time steps in a day.

Markov decision processes (MDPs) provide a mathematical framework for modeling decision-making processes. An MDP is generally defined as $M=\{S,A,T,R,\gamma\}$, where $S$ is the state space, $A$ is the action space, $T(s,a,s^{\prime})$ is the transition model determining the next state $s^{\prime}$ given the current state $s$ and action $a$, $R$ is the reward function for each state-action pair, indicating the instantaneous reward similar to the utility function in DDCM models, and $\gamma$ is the discount factor. An agent in state $s\in S$ decides the action $a\in A$ based on a policy $\pi$, which specifies a probability distribution over actions. In the context of activity and travel behavior modeling, each MDP component is defined as follows:

  • State $s\in S$: The state consists of five travel-related features at the current time step ($Card(S)=5$): (1) the current time step $i\in\{1,2,...,N\}$, (2) the current activity $y_i$, (3) the duration of stay at the current activity $d_i\in\{1,2,...,N\}$, (4) the number of activities conducted since the start of the day $p_i\in\{1,2,...,N\}$, and (5) the duration since last leaving home $q_i$, which is 0 if currently at home and otherwise $q_i\in\{1,2,...,N\}$.

  • Action $a\in A$: An action represents the activity the agent plans to conduct next, which can include home, work, education, shopping, etc. To explicitly account for travel behavior, travel is defined as an additional activity category. Unlike other activities, travel serves as a transition action when people decide to start a new activity. To represent this, we add action masks in our model to restrict the feasible action space based on the current activity. For example, when the user is engaged in an activity, the action space allows for continuing the same activity or transitioning to travel. When the user is currently traveling, the action space includes travel or any other activity.

  • Transition Model $T(s,a,s^{\prime})$: For activity-travel behavior analysis, the transition model is deterministic, meaning a given state-action pair will always lead to the same next state. Specifically, at the next time step $i+1$, the activity is determined by the action ($y_{i+1}=a_i$). If the agent continues the same activity ($y_{i+1}=y_i$), the stay duration at the activity increments by one ($d_{i+1}=d_i+1$), and the number of activities since the start of the day remains unchanged ($p_{i+1}=p_i$). If the agent transitions to a different activity, the stay duration resets to one ($d_{i+1}=1$), and the number of activities increases by one ($p_{i+1}=p_i+1$). The duration since last leaving home increments by one ($q_{i+1}=q_i+1$), unless the next activity is home, in which case it resets to zero ($q_{i+1}=0$). A minimal code sketch of this transition rule and the action mask is given after this list.

  • Reward Function $R(s,a|m)$: For an agent $m$, the reward function $R(s,a|m)$ defines the expected reward of choosing an action $a$ given the current state $s$. To account for individual differences in behavioral preferences, we include socio-demographic features such as income, age, gender, etc., as part of the input to the reward function.

  • Policy Function $\pi(a|s,m)$: The policy function $\pi(a|s,m)$ describes the probability distribution over actions $a$ that an agent $m$ might take given the current state $s$. While the reward function assigns a value to each state-action pair indicating human preferences, the policy function captures the behavioral patterns of humans.
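To make these definitions concrete, the sketch below implements the deterministic transition rule and the action mask in plain Python. The five-field state layout and the `HOME`/`TRAVEL` labels follow the description above, but the dataclass structure and naming are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass

HOME, TRAVEL = "home", "travel"  # illustrative activity labels

@dataclass
class State:
    t: int         # current time step (1..N)
    activity: str  # current activity y_i
    d: int         # duration of stay at the current activity
    p: int         # number of activities since the start of the day
    q: int         # duration since last leaving home (0 if at home)

def transition(s: State, a: str) -> State:
    """Deterministic transition T(s, a, s'): the chosen action becomes the next activity."""
    if a == s.activity:              # continue the same activity
        d, p = s.d + 1, s.p
    else:                            # switch to a new activity
        d, p = 1, s.p + 1
    q = 0 if a == HOME else s.q + 1  # reset the out-of-home clock when returning home
    return State(t=s.t + 1, activity=a, d=d, p=p, q=q)

def feasible_actions(s: State, activities: list) -> list:
    """Action mask: from an activity one may continue it or start travel;
    from travel one may keep traveling or start any activity."""
    if s.activity == TRAVEL:
        return activities            # travel (contained in `activities`) or any activity
    return [s.activity, TRAVEL]
```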

3.1.2 Adversarial IRL for activity-travel behavior modeling

IRL is a learning paradigm that aims to infer a reward function to explain human preferences and subsequently learn a policy function to replicate human choice behavioral patterns. Like DDCMs, it operates under the assumption that people make decisions to maximize some reward functions, though they are unknown. However, the original IRL is ambiguous and ill-defined, as many optimal policies can explain a set of demonstrations, and many rewards can explain an optimal policy [35]. To address the first ambiguity, [45] developed a probabilistic approach based on the principle of maximum entropy. Building on this, [13] further addressed the second ambiguity with adversarial IRL (AIRL), which introduced a specialized adversarial network to recover disentangled rewards that are invariant to changing dynamics. The original AIRL was proposed for robotic control tasks, and here we adapted it for modeling sequential activity-travel behavior.

AIRL operates within an adversarial framework, comprising a generator and a discriminator. The generator creates realistic trajectories and learns a policy function that maps each state to a probability distribution over actions. The discriminator distinguishes between generated and observed behavior by inferring a reward function, which assigns a value representing the preference of each state-action pair. An overview of the adapted AIRL framework in this study is shown in Figure 2.

Figure 2: The network structure of AIRL for activity-travel choice modeling

The policy network incorporates two types of input features: those related to the current state and those related to the socio-demographic characteristics of an agent. For the current-state features, which are all categorical variables, we learn an embedding matrix for each feature within the model. The matrix maps each category to a latent vector, with one row per possible category and a number of columns equal to the dimension of the latent vector. Continuous socio-demographic features such as income and age are normalized to a 0-1 scale before being input into the model, while categorical socio-demographic features such as employment type are encoded into a latent space using the same embedding technique. All the encoded features are then concatenated and fed into a three-layer feedforward network, followed by a Softmax function that outputs the choice probability of the next activity. As mentioned in Section 3.1.1, we add an action mask before the Softmax function to ensure that only feasible state-action transitions are considered.
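A minimal PyTorch sketch of such a policy network is shown below. The feature encodings and masking follow the description above, but the embedding dimension, hidden-layer width, and other sizes are hypothetical assumptions.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Sketch of the policy network: embeddings for categorical features,
    normalized continuous features, a three-layer MLP, an action mask, and a softmax."""
    def __init__(self, cat_cardinalities, n_actions, n_continuous=2, emb_dim=16, hidden=128):
        super().__init__()
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, emb_dim) for card in cat_cardinalities]
        )
        in_dim = emb_dim * len(cat_cardinalities) + n_continuous
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, cat_feats, cont_feats, action_mask):
        # cat_feats: (batch, n_categorical) integer codes for state/demographic categories
        # cont_feats: (batch, n_continuous) values already scaled to [0, 1]
        # action_mask: (batch, n_actions) booleans marking feasible next activities
        embs = [emb(cat_feats[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat(embs + [cont_feats], dim=-1)
        logits = self.mlp(x)
        logits = logits.masked_fill(~action_mask, float("-inf"))  # block infeasible transitions
        return torch.softmax(logits, dim=-1)                      # choice probabilities
```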

The reward network incorporates the current state and individual socio-demographic features as inputs, encoded using the same methods as in the generator. Additionally, to capture the immediate reward of transitioning to the next state, the reward network takes the next-state features as inputs, which are encoded similarly to the current-state features. To recover an unshaped reward that is invariant to the environment dynamics, the reward function is formulated as follows:

$R_{\theta,\phi}(s_{t},a_{t},s_{t+1}|m)=g_{\theta}(s_{t},a_{t}|m)+\gamma h_{\phi}(s_{t+1}|m)-h_{\phi}(s_{t}|m),$  (1)

where $g_{\theta}(s_{t},a_{t}|m)$ represents the immediate reward approximator. It captures the intrinsic reward the agent $m$ receives for taking action $a_{t}$ in state $s_{t}$. $h_{\phi}(s_{t}|m)$ and $h_{\phi}(s_{t+1}|m)$ are reward-shaping components designed to make the reward function invariant to changing dynamics in the environment. The term $\gamma h_{\phi}(s_{t+1}|m)$ adjusts the reward based on the expected future state $s_{t+1}$, while $h_{\phi}(s_{t}|m)$ adjusts it based on the current state $s_{t}$. This difference helps smooth the reward landscape, making the learned reward function more robust. We use two separate DNNs to approximate $g_{\theta}(\cdot)$ and $h_{\phi}(\cdot)$, each a three-layer feed-forward network operating on the encoded state and socio-demographic features.
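The decomposition in Eq. (1) can be sketched as follows, with $g_\theta$ and $h_\phi$ implemented as separate three-layer networks. The input encodings are assumed to be pre-computed feature vectors, and the hidden size and discount value are illustrative assumptions.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden=128, out_dim=1):
    """Three-layer feed-forward network used for both g and h."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class RewardNetwork(nn.Module):
    """Sketch of Eq. (1): R = g(s, a | m) + gamma * h(s' | m) - h(s | m)."""
    def __init__(self, state_dim, action_dim, demo_dim, gamma=0.99):
        super().__init__()
        self.gamma = gamma
        self.g = mlp(state_dim + action_dim + demo_dim)  # immediate reward approximator
        self.h = mlp(state_dim + demo_dim)               # reward-shaping potential

    def forward(self, s, a, s_next, demo):
        # s, s_next: encoded state features; a: encoded action; demo: socio-demographic features
        g = self.g(torch.cat([s, a, demo], dim=-1))
        h_s = self.h(torch.cat([s, demo], dim=-1))
        h_next = self.h(torch.cat([s_next, demo], dim=-1))
        return g + self.gamma * h_next - h_s
```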

Based on the reward and policy functions, the discriminator in AIRL is defined as follows:

$D_{\theta,\phi}(s_{t},a_{t},s_{t+1}|m)=\dfrac{\exp\big(R_{\theta,\phi}(s_{t},a_{t},s_{t+1}|m)\big)}{\exp\big(R_{\theta,\phi}(s_{t},a_{t},s_{t+1}|m)\big)+\pi(a_{t}\mid s_{t},m)}.$  (2)

The discriminator is trained to differentiate between generated and real behavior by minimizing the cross-entropy loss between state-action pairs produced by the generator and those from observed trajectories. The policy function aims to identify an optimal policy that maximizes the expected reward as estimated by the discriminator. To achieve this, we employ a state-of-the-art policy gradient algorithm called Proximal Policy Optimization (PPO) [32] to train the policy network.
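Under these definitions, the discriminator output of Eq. (2) and the alternating training procedure can be sketched as below. The log-space formulation and the PPO surrogate reward shown in the comments are common AIRL implementation choices under the stated assumptions, not details taken from the original model code.

```python
import torch

def discriminator_prob(reward, log_pi):
    """Sketch of Eq. (2): D = exp(R) / (exp(R) + pi(a|s, m)), computed from the reward R
    and the policy's log-probability of the taken action. Written in log space for stability."""
    # log D = R - logsumexp([R, log pi]), since pi = exp(log pi)
    log_d = reward - torch.logsumexp(torch.stack([reward, log_pi], dim=0), dim=0)
    return torch.exp(log_d)

# Training alternates two updates (a minimal outline, not the full PPO loop):
# 1. Discriminator step: binary cross-entropy, labelling expert state-action pairs 1
#    and generator-produced pairs 0.
# 2. Generator (policy) step: PPO on the surrogate reward log D - log(1 - D),
#    which for the discriminator above equals R - log pi.
```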

3.2 DIRL Interpretation

The trained AIRL model infers people’s activity-travel behavior patterns through the policy function and their activity-travel preferences through the reward function. By using DNNs to approximate these functions, the model can capture more complex patterns and preferences compared to traditional utility-based econometric models with linear restrictions. However, due to the black-box nature of DNNs, it is challenging to directly explain why and how these networks produce specific outcomes, leaving the “knowledge” embedded in the policy and reward functions unclear. To address this issue, we introduce an interpretation framework to extract human-understandable knowledge from DIRL. This framework consists of two components: one for extracting behavioral patterns from the policy network, and the other for extracting preference insights from the reward network. The details of these two components are elaborated below.

3.2.1 Policy interpretation

The original policy function is represented using DNNs, making it challenging to identify the learned knowledge from individual parameters. To address this issue, we adapt a knowledge distillation method for policy interpretation, initially introduced by [15]. A key insight from their work is that for complex models that learn to discriminate between classes, the knowledge learned is reflected in the choice probabilities assigned to each class. To transfer the knowledge from large, cumbersome models to smaller models, [15] proposed training a smaller model using the class probabilities produced by the large model as “soft targets.” We adapt this idea for interpretation purposes, using a surrogate interpretable model to distill knowledge from the policy network. Specifically, we use a multinomial logit (MNL) model as the surrogate, a common choice for choice analysis that provides comprehensive econometric interpretations [40, 33]. To extract knowledge from the policy network, the surrogate MNL model is trained to maximize the log probability of both ground-truth labels from observed behavior and “soft labels” from the policy network:

$L_{\theta}=-\sum_{m=1}^{M}\sum_{t=1}^{N}\Big[(1-\alpha)\,\bm{y}_{t,m}\log(\hat{\bm{y}}^{mnl}_{t,m})+\alpha\,\hat{\bm{y}}^{irl}_{t,m}\log(\hat{\bm{y}}^{mnl}_{t,m})\Big],$  (3)

where $M$ is the number of users in the training set and $N$ is the number of timesteps. $\bm{y}_{t,m}$ is a one-hot vector denoting the ground-truth activity of user $m$ at time $t$, $\hat{\bm{y}}^{irl}_{t,m}$ is the choice probability vector generated by the pre-trained policy network, and $\hat{\bm{y}}^{mnl}_{t,m}$ is the estimated choice probability of the MNL model. $\alpha\in[0,1]$ is a trade-off weight to balance the different losses. When $\alpha=0$, the MNL model is trained completely based on ground-truth labels, while when $\alpha=1$, the model is trained completely based on soft labels generated by the policy network. The MNL model can then be directly interpreted through its learned parameters.
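A minimal sketch of the distillation loss in Eq. (3) is shown below, assuming the MNL choice probabilities have already been computed for a batch of observations; the tensor shapes and the epsilon guard are illustrative.

```python
import torch

def distillation_loss(y_true, y_irl, y_mnl_prob, alpha=0.9, eps=1e-12):
    """Sketch of Eq. (3): cross-entropy of the surrogate MNL against a mixture of
    ground-truth one-hot labels and soft labels from the policy network.
    y_true: (batch, n_actions) one-hot labels; y_irl: (batch, n_actions) policy probabilities;
    y_mnl_prob: (batch, n_actions) MNL choice probabilities; alpha trades off the two terms."""
    log_p = torch.log(y_mnl_prob + eps)
    hard_term = -(y_true * log_p).sum(dim=-1)  # cross-entropy with ground-truth labels
    soft_term = -(y_irl * log_p).sum(dim=-1)   # cross-entropy with policy-network soft labels
    return ((1 - alpha) * hard_term + alpha * soft_term).sum()
```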

3.2.2 Reward interpretation

The goal of DIRL is to infer an unknown reward function that explains how people make decisions based on a set of demonstration data. A natural approach for interpreting rewards is to map trajectories in the demonstration data to their corresponding reward values using the pre-trained reward network. By doing so, we establish a direct link between human behavioral patterns and underlying preferences. The reward function assigns a value to each state-action pair, indicating the immediate utility an agent can gain. By applying the pre-trained reward network to each timestep, we can map an activity-travel sequence $Y_m$ to a reward sequence $R_m=\{r_{1,m},r_{2,m},...,r_{N,m}\}$, where $r_{i,m}$ is the inferred reward at the $i$-th timestep for agent $m$. The reward sequence provides a dynamic view of underlying human preferences in sequential behavioral processes. It also enables us to convert discrete behavioral sequences into continuous values, facilitating further analysis.

People can have varying preferences when making sequential activity-travel decisions. For instance, some may have regular, repetitive behavioral patterns, while others explore new and diverse activities with less predictable patterns [27]. To identify different types of decision-makers, we use a clustering technique to group individuals based on their corresponding reward sequences. People within the same cluster share similar preferences for activity-travel planning. In this case, without loss of generality, we employ the k-means algorithm [16], a popular method for partitioning data into $k$ distinct clusters.
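A minimal sketch of this clustering step, assuming the reward sequences have already been extracted as a fixed-length array, is shown below; the scikit-learn k-means call and its settings are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_reward_sequences(reward_sequences, k=4, seed=0):
    """Group individuals by their daily reward sequences with k-means.
    reward_sequences: array of shape (n_users, N) produced by the pre-trained reward network;
    the cluster count k can be chosen with the elbow method (k=4 in Section 4.4)."""
    X = np.asarray(reward_sequences)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = km.fit_predict(X)               # cluster assignment per individual
    return labels, km.cluster_centers_       # centers give the typical reward profile per cluster
```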

The reward values estimated from the reward function represent the immediate utility an individual can gain. In sequential decision-making, it is important to consider not only the immediate reward but also the long-term expected utility. The reward function allows us to quantify the long-term expected return $U^{(i)}_{m}$ for an agent $m$ starting from timestep $i$ as follows:

$U^{(i)}_{m}=\sum_{t=i}^{N}\gamma^{t-i}r_{t,m}.$  (4)

In our case, without loss of generality, $i$ is set to 1, representing the accumulated return starting from midnight. This long-term expected utility can compress the reward sequence into a single value, representing the overall utility an individual gains from the entire sequential planning process. Using this quantified long-term utility, we can compare the “values” of different activity-travel patterns. This comparison enables us to understand which decision-making process is optimal, at least from the perspective of the DIRL model.
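The discounted return of Eq. (4) can be computed directly from a reward sequence, as in the short sketch below; the discount value and zero-based indexing are illustrative assumptions.

```python
import numpy as np

def long_term_return(rewards, gamma=0.99, start=0):
    """Sketch of Eq. (4): U_m^(i) = sum_t gamma^(t-i) * r_{t,m}, starting from timestep
    `start` (0-indexed here; i = 1, i.e. midnight, in the paper's notation)."""
    r = np.asarray(rewards, dtype=float)[start:]
    discounts = gamma ** np.arange(len(r))   # gamma^0, gamma^1, ...
    return float(np.dot(discounts, r))
```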

4 Experiments

This section demonstrates the feasibility of extracting behavioral patterns and preference knowledge from a well-trained DIRL model. We start by introducing the dataset used for analysis (see Section 4.1) and then compare the prediction performance of different methods, showcasing the effectiveness of DIRL in accurately mimicking observed activity-travel patterns (see Section 4.2). Next, we present the extracted behavioral knowledge based on the policy network in Section 4.3 and the preference knowledge derived from the reward network in Section 4.4.

4.1 Dataset

We use the Household Travel Survey data from Singapore as a case study. This dataset, provided by the Singapore Land Transport Authority, contains detailed trip information for 35,714 users over a self-reported day between June 25, 2012, and May 30, 2013. Each user reports the purpose of their trips along with start and end times, allowing us to construct complete daily activity-travel sequences for each individual. We categorize activities into four types: home, work, school, and others. The survey also includes socio-demographic features such as age, income, car ownership, gender, and employment type. After filtering out users with incomplete or invalid trip or socio-demographic information, we are left with 50,807 trips from 21,936 users. Because the stay duration for most “other” activities is quite short (typically within 1 hour), we divide a day into 96 15-minute intervals to formulate the activity-travel decision process as an MDP. This results in each individual having 96 decision steps per day. Figure 3 provides an intuitive illustration of the data, showing the distribution of the number of trips per day, as well as the start times and duration of various activities.

Figure 3: Singapore travel survey data

4.2 Prediction accuracy of different models

To demonstrate the effectiveness of DIRL in modeling sequential activity-travel behavior, we compare its prediction performance with several baseline methods:

  • Mobility Markov Chain (MMC) [14]: MMC treats sequential behavior as a discrete stochastic process, assuming that the probability of moving to the next state depends only on the current state. For activity-travel behavior modeling, each state is defined by the current time step and current activity. The model learns the probability of the next activity given the current state from the training data. Due to the inherent model restrictions of MMC, it is challenging to incorporate socio-demographic contexts as input.

  • Multinomial Logit Model (MNL): MNL assumes that individuals choose behaviors to maximize utility values and estimates a utility function to model choice probabilities. In this context, each state-action pair is treated as an independent data sample, and MNL is estimated based on the activity labels at the next time step. The utility function in MNL takes current state features and individual socio-demographic context as input, similar to the policy function in DIRL.

  • Behavior Cloning (BC): BC models sequential behavior as a supervised learning problem. It trains the model to map states to actions using ground-truth activity labels from expert demonstrations. For a fair comparison, BC adopts the same structure as the policy network in the DIRL model. The key difference is that the policy network of BC is directly trained to maximize the log-probability of the action labels, whereas that of DIRL is trained to represent the optimal policy that maximizes the learned reward function.

We randomly sampled 80% of users as the training set and the remaining 20% as the test set. This resulted in 17,549 users (or trajectories) for model training and 4,387 users (or trajectories) for model testing. During model testing, each user starts from a virtual start state and recursively generates predicted activities at each time step by sampling from the learned choice probabilities until the end of the day. To quantify model performance in terms of recovering real individual activity-travel patterns, we use the following metrics (a minimal implementation sketch follows the list):

  • Accuracy: This metric evaluates the ratio of correctly predicted activity labels to the total number of timesteps in the dataset:

    $Accuracy=\frac{1}{M}\sum_{m=1}^{M}\sum_{i=1}^{N}\frac{1(y_{i,m},\hat{y}_{i,m})}{N},$  (5)

    where $M$ is the number of users in the test set, $N$ is the number of timesteps in a day, $y_{i,m}$ and $\hat{y}_{i,m}$ are the ground-truth and predicted activity labels for user $m$ at timestep $i$, and $1(\cdot)$ is a binary function that equals 1 when $y_{i,m}=\hat{y}_{i,m}$ and 0 otherwise.

  • Edit Distance (ED). Edit distance quantifies how dissimilar two sequences are and is computed as:

    $ED=\frac{1}{M}\sum_{m=1}^{M}\min\left(\frac{Edit(\hat{Y}_{m},Y_{m})}{N},1\right),$  (6)

    where $\hat{Y}_{m}$ and $Y_{m}$ are the predicted and ground-truth activity-travel sequences for user $m$, and $Edit(\cdot)$ denotes a function counting the minimum number of operations required to transform one sequence into the other.

  • BiLingual Evaluation Understudy score (BLEU). Originally used in machine translation, the BLEU score measures the similarity of a candidate text to reference texts. In our context, it is used to measure trajectory similarities. Higher BLEU scores indicate closer matches to the reference trajectories. BLEU compares $n$-gram matches between each predicted and ground-truth activity-travel sequence in the test set:

    ${BLEU}_{n}=\frac{1}{M}\sum_{m=1}^{M}\left(\prod_{j=1}^{n}P_{j}\right)^{\frac{1}{n}},$  (7)

    where $P_{j}$ is the precision of $j$-gram matches, defined as $P_{j}=\frac{\sum_{c\in C_{j}}\min(w_{c},w_{c,max})}{W}$, $C_{j}$ is the set of unique $j$-gram chunks found in the predicted sequence, $w_{c}$ is the number of occurrences of chunk $c$ in the predicted sequence, $w_{c,max}$ is the maximum number of occurrences of chunk $c$ in the ground-truth sequence, and $W$ is the total number of $j$-gram chunks in the predicted sequence.
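The three metrics can be computed per sequence as sketched below and then averaged over the test set to obtain Eqs. (5)-(7); the edit distance is normalized by $N$ and capped at 1 outside this sketch, and the helper names are illustrative.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Per-user share of timesteps with correctly predicted activity labels (inner part of Eq. (5))."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning sequence a into b."""
    dp = list(range(len(b) + 1))                          # distances for the empty prefix of a
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,           # deletion
                                     dp[j - 1] + 1,       # insertion
                                     prev + (x != y))     # substitution / match
    return dp[-1]

def bleu_n(y_true, y_pred, n=2):
    """Per-user BLEU-n: geometric mean of modified j-gram precisions, j = 1..n (inner part of Eq. (7))."""
    prod = 1.0
    for j in range(1, n + 1):
        pred_grams = Counter(tuple(y_pred[k:k + j]) for k in range(len(y_pred) - j + 1))
        true_grams = Counter(tuple(y_true[k:k + j]) for k in range(len(y_true) - j + 1))
        overlap = sum(min(c, true_grams[g]) for g, c in pred_grams.items())
        prod *= overlap / max(sum(pred_grams.values()), 1)
    return prod ** (1.0 / n)

# Test-set scores average the per-user values, e.g.
# ACC = mean(accuracy(Y, Y_hat)); ED = mean(min(edit_distance(Y_hat, Y) / N, 1)); BLEU = mean(bleu_n(Y, Y_hat)).
```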

Table 1 presents a performance comparison between DIRL and the baseline models, clearly showing that DIRL significantly outperforms all baselines in modeling activity-travel behavioral patterns. Among the baselines, the Markov Chain performs relatively poorly, which is expected since it relies solely on state-action transition probabilities from the training data to simulate possible trajectories in the test data, neglecting individual differences arising from socio-demographic features. MNL performs better by capturing context-specific factors and individual variations through the estimation of a utility function. However, its linear function assumption limits its ability to capture more complex behavioral patterns. BC outperforms MNL, likely because it utilizes deep learning architectures to capture more intricate relationships between input contextual features and choice probabilities. However, BC treats each state-action pair as an independent data sample, which restricts its capacity to capture the sequential linkage between time steps in decision-making. In contrast, AIRL excels at capturing long-term rewards by assuming that individuals are forward-looking when making sequential decisions. This makes AIRL a highly effective framework for modeling individual activity-travel behavior, enabling further interpretation of the behavioral patterns and preferences learned from the model.

To intuitively illustrate the effectiveness of DIRL in mimicking real-world activity-travel behaviors, Figure 4 presents several examples of ground-truth and generated activity-travel sequences. These examples demonstrate AIRL’s ability to largely replicate activity type choice and scheduling patterns. Although there are some inconsistencies regarding activity transition times, these discrepancies are acceptable given the challenging nature of the task. This complexity arises from generating the entire activity sequence recursively by the model when only individual socio-demographic features are known.

Table 1: Comparison of prediction performance using different models
Models    ACC      ED       BLEU
MMC       0.603    0.394    0.652
MNL       0.749    0.243    0.838
BC        0.773    0.217    0.863
DIRL      0.804    0.189    0.876
Figure 4: Examples of ground-truth and generated activity-travel sequences

4.3 Policy interpretation

As described in Section 3.2.1, we utilize a knowledge distillation method to uncover the behavioral patterns learned by the policy network. Specifically, we train a surrogate interpretable Multinomial Logit (MNL) model using the soft labels generated by the policy network. In this section, we first demonstrate the effectiveness of this knowledge distillation method by discussing the impact of incorporating these soft labels on the modeling performance of the MNL (see Section 4.3.1). We then analyze the uncovered behavioral patterns by examining the coefficients learned by the MNL (see Section 4.3.2).

4.3.1 Prediction accuracy of MNL with soft labels from the policy network

Table 2 presents the performance of the MNL model trained with and without soft labels generated by the policy network. The results indicate that incorporating soft labels in the loss function enhances the MNL’s performance across all evaluation metrics. This is because compared to using ground-truth activities as labels, the soft labels generated by the policy network provide richer information about the relative choice probabilities among different activities. Surprisingly, when α\alpha is set to 1, meaning the MNL is trained entirely based on the soft labels predicted by the policy network, it achieves the highest performance in terms of accuracy and edit distance, and similarly strong performance in BLEU score compared to other α\alpha settings. This suggests that even without the ground-truth activities, the choice probabilities learned by the policy network can effectively capture real activity-travel behaviors.

Table 2: Performance comparison of MNL trained with and without soft labels from the policy network
Models                          ACC      ED       BLEU
Without soft labels   α=0       0.750    0.241    0.837
With soft labels      α=0.5     0.760    0.231    0.844
                      α=0.9     0.771    0.220    0.846
                      α=0.99    0.772    0.220    0.842
                      α=1       0.775    0.217    0.844

4.3.2 Interpretation of the surrogate MNL model

This section analyzes the behavioral patterns derived from the policy network by examining the coefficients of the surrogate MNL model. To ensure the robustness of our interpretations, we employ a bootstrapping strategy. Specifically, we randomly sample 80% of the observations with replacement to train the surrogate MNL model, repeating this process 100 times. Table 3 presents the average coefficients and the [0.025, 0.975] confidence intervals for each input variable based on the 100 bootstrap samples. Variables with inconsistent directional impacts within these intervals are highlighted in grey. The results show that most variables maintain consistent directional impacts over 95% of the time, validating the robustness of our interpretations. Features with consistent directional impacts in most experiments are likely to significantly influence human activity and travel behavior patterns, while those with inconsistent impacts may be less significant in explaining these patterns.
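A minimal sketch of this bootstrapping procedure is given below; `fit_mnl` is a hypothetical fitting routine standing in for estimating the surrogate MNL (e.g., by minimizing the loss in Eq. (3)), and the resampling settings mirror those described above.

```python
import numpy as np

def bootstrap_intervals(fit_mnl, X, y_soft, n_boot=100, frac=0.8, seed=0):
    """Bootstrap the surrogate MNL: resample 80% of observations with replacement,
    refit, and report mean coefficients with [0.025, 0.975] percentile intervals.
    X, y_soft: numpy arrays of inputs and soft labels; fit_mnl(X, y_soft) -> coefficient array."""
    rng = np.random.default_rng(seed)
    n = len(X)
    coefs = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=int(frac * n), replace=True)   # resample observations
        coefs.append(fit_mnl(X[idx], y_soft[idx]))
    coefs = np.stack(coefs)
    return coefs.mean(axis=0), np.percentile(coefs, [2.5, 97.5], axis=0)
```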

The current activity an agent is engaged in significantly influences their next activity choice. When the next activity is home, work, or education, there are positive coefficients for the same activity in the current time step. This means individuals tend to continue with the same activity in the next time step. Conversely, when the current activity falls under the category of others, choosing travel in the next time step is associated with positive coefficients, indicating that individuals are more likely to transition to travel. This pattern is reasonable, as the duration spent at home, work, and education is typically longer, whereas other activities generally have shorter durations, as illustrated in Figure 3.

By examining the time slots with positive coefficients, we can identify the most common periods for different activities. Home activities are most common early in the morning (0-6h) and in the afternoon and evening (14-24h). The most common work hours are between 6-16h, while school activities peak from 6-12h. For other activities, all time intervals show negative coefficients, likely because these activities occupy a smaller portion of individuals’ daily lives. Travel is more frequent during 6-8h and 16-20h, corresponding to the morning and evening peak hours for commuting.

The duration of staying at the current activity positively correlates with choosing travel as the next activity. This is reasonable as people are more likely to move on to another activity after spending a significant amount of time in one place. The number of activities since the start of the day shows positive coefficients for home and negative coefficients for other activities. This indicates that as the number of activities increases, people are more likely to return home and less likely to engage in out-of-home activities such as work, school, and others. Additionally, the duration of out-of-home activities is positively correlated with both travel and home, while negatively correlated with work, school, and other activities. This further supports the idea that as out-of-home duration increases, people tend to travel back home.

Behavioral patterns vary across different age groups. Increased age is positively correlated with work and other activities, indicating that older individuals are more likely to engage in these activities. This is reasonable, as older people often have more social and family responsibilities. Gender differences also emerge in activity-travel planning. Females are positively correlated with staying at home and engaging in other activities, while negatively correlated with work, school, and travel. This suggests that females are more likely to stay at home or participate in other activities and less likely to go to work or school, possibly due to a greater share of household duties. Income level also influences activity choices. High-income individuals are more likely to go to work and less likely to go to school, which is expected as schoolgoers typically do not have an income.

Different employment types clearly correspond to distinct activity planning patterns. Homemakers are more likely to stay at home or engage in other activities rather than work, school, or travel. This is reasonable as they often take on more household responsibilities, such as shopping and picking up children. Full-time students tend to go to school, while full-time, part-time, and self-employed individuals are more likely to go to work. Notably, the coefficient for travel among full-time employees is higher than those for part-time and self-employed individuals, indicating that full-time employees are more likely to travel. This difference suggests that work modes influence travel tendencies: full-time employees have fixed schedules, resulting in regular commuting trips, whereas part-time and self-employed individuals have more flexibility in their time and location choices. Retired and unemployed individuals, as well as domestic workers, show positive coefficients for staying at home and negative coefficients for work, school, or travel. This indicates that these individuals tend to stay at home, which is reasonable since they do not need to go to the office or school and have fewer travel requirements.

4.4 Reward interpretation

DIRL uncovers a reward function that maps state-action pairs to human-perceived preferences. Using the pre-trained reward network, we can estimate a reward value for each step in an activity sequence. This allows us to analyze both short-term rewards and long-term returns of different activity-travel patterns as described in Section 3.2.2.

4.4.1 Short-term reward

Figure 5 provides an intuitive illustration of the learned reward values at different steps in example activity-travel trajectories. Notably, the steps where individuals transition from an activity to travel are typically associated with the lowest instant rewards in the trajectory. This aligns with the intuition that most people do not enjoy traveling and only do so to fulfill necessary activities. Another interesting observation is that the instant reward for the same activity can change as the stay duration increases. Generally, the reward first increases but starts to decrease once the stay duration exceeds a certain threshold, indicating that people have a preferred stay duration for different activities.

Figure 5: Examples of activity-travel sequence and corresponding reward sequence

Figure 6 depicts the distribution of reward values across various activity types. The mean reward values for home, work, and education activities are higher compared to other activities. This is likely because home, work, and education are regular, expected parts of daily life, whereas activities like shopping or medical visits arise from irregular personal or household needs that are less predictable. On the other hand, travel has the lowest average reward, which aligns with previous findings that it is generally perceived as a necessary but often unpleasant activity. The variance in reward values is also noteworthy. The distribution of rewards for other activities is more dispersed compared to home, work, and education. This is reasonable since “other” activities encompass a wide range of possible categories, such as shopping, pick-up drop-off, and medical visits, each with different reward values. The reward values for travel show two peaks. This can be explained by the extremely low rewards associated with the initial transition into travel, while subsequent travel steps are associated with higher reward values, as illustrated in Figure 5.

Figure 6: Reward distribution of different activities

Reward values at different timesteps are correlated with each other. To understand these sequence-level patterns, we use the k-means clustering technique described in Section 3.2.2 to group reward sequences into different types. The cluster size is set to 4, the optimal group size determined using the elbow method. Figure 7 shows the clustering results of different reward sequences and their corresponding activity patterns. Clusters 1 and 2 correspond to users with regular travel patterns at similar times, while clusters 3 and 4 correspond to users with less regular and predictable travel patterns. This broadly matches the well-established explorer and returner theory [27], which distinguishes travelers as either having regular, repetitive movements or exploring new and diverse activities with less predictable movement patterns.

Figure 7: Clustering results of reward sequences and their corresponding activity patterns

To understand the individual variance between different groups, we analyze the employment type composition as shown in Figure 8. Interestingly, each group is dominated by different employment types. Cluster 1 is primarily composed of full-time students, exhibiting representative school patterns. Cluster 2 is dominated by full-time employees, which aligns with typical commuting patterns. Cluster 3 includes a mix of full-time students and full-time employees, corresponding to individuals with regular work or school schedules who also engage in additional activities in the afternoon or evening. Cluster 4 is dominated by homemakers and retired individuals, explaining why they perform other activities during the morning hours. This demonstrates the capability of reward values to reflect and differentiate between various activity patterns and individual behaviors.

Figure 8: Employment type composition of different reward clusters

4.4.2 Long-term return

Based on the reward values at each step, we can estimate the long-term return of the activity sequence in a day, which compresses the impact of different activities at different time steps into a single value. The long-term return in a day can, to some extent, indicate the overall utility an individual gets from his or her daily activity planning. By sorting activity sequences by their corresponding return values, we can get a sense of which types of activity planning are more optimal than others, at least according to the DIRL model. Figure 9 shows 500 activity sequences sorted by their associated return values, with sequences on the left side associated with higher returns. The results clearly demonstrate a shift from regular to irregular activity patterns, indicating that the model assigns higher return values to individuals with regular activity patterns. Activity sequences with the top return values are school patterns, followed by regular commuting patterns. This corresponds to the results in Figure 6, where the DIRL model assigns a higher reward value to education than to work. This may be because people derive a stronger sense of accomplishment and a better long-term investment from education.

Figure 9: Activity, reward, and return sequences sorted by daily return values (high to low from left to right)

To explain the individual variance in return values associated with daily activity sequences, we divided trajectories into four groups based on return quantiles. Figure 10 shows the socio-demographic features of these four groups. Interestingly, the employment type compositions associated with user groups having different return levels closely align with the types of activity planners identified through the clustering technique discussed in the previous section. The top return group consists mostly of students. The mid-high return group is a mix of full-time employees and students. The mid-low return group is mainly full-time employees, while the low return group is predominantly homemakers. This distribution indicates that return values can reflect differences in employment types among individuals. Regarding age, groups with lower returns tend to include older individuals. This is likely because students, who receive higher returns, are typically younger, while homemakers and retired individuals, who receive lower returns, are generally older. There are also gender differences in return values: in the top return group, only 20% of individuals are female, whereas in the bottom return group, more than 70% are female. This suggests that females are typically associated with lower daily returns, possibly due to taking on more home duties that do not directly generate socially recognized value. The relationship between income and return groups forms an inverted-U (“∩”) shape: the top return group has the lowest average income, which is reasonable as it mainly consists of full-time students with no income. Conversely, from the mid-high to low return groups, as the average income decreases, the return on their daily activities also decreases. This indicates that the return values derived from activity sequences have the potential to reflect individual income levels.

Figure 10: Socio-demographic features of users with different daily return levels

5 Conclusion and Discussion

This study presents an interpretable deep inverse reinforcement learning (DIRL) framework designed to uncover sequential activity-travel behavior patterns. Unlike previous research, which emphasizes the predictive ability of DIRL, our focus is on explaining the underlying behavioral mechanisms through enhanced model interpretability. We start by adapting an adversarial IRL framework to model sequential behavior patterns, learning a policy function to replicate human behaviors and a reward function to infer human preferences. To extract explainable insights from the policy function, we introduce a knowledge distillation method: we train a surrogate Multinomial Logit (MNL) model on the soft labels predicted by the trained policy network, allowing behavioral knowledge to be read directly from the MNL parameters. The learned reward function generates a reward value for each state-action pair, providing a quantifiable measure of immediate human preferences at each step. We apply clustering techniques to these reward values to distinguish different types of activity planners. Additionally, we calculate a long-term return value for each behavioral sequence, demonstrating the potential to infer the overall utility an individual gains through a sequential decision-making process.
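To make the distillation step concrete, the sketch below trains a surrogate linear-in-parameters (MNL-style) choice model against the soft action probabilities produced by a trained policy network, using a cross-entropy loss with soft labels. The tensors, dimensions, and training settings are placeholders for illustration and are not taken from our implementation.

```python
import torch
import torch.nn as nn

# Placeholder inputs standing in for the trained policy network's outputs:
# state feature vectors and the policy's soft action probabilities.
n_states, n_features, n_actions = 1024, 12, 5
state_features = torch.randn(n_states, n_features)
policy_probs = torch.softmax(torch.randn(n_states, n_actions), dim=1)  # soft labels

# Surrogate MNL-style model: linear-in-parameter utilities V = XW^T + b,
# with choice probabilities given by a softmax over alternatives.
surrogate = nn.Linear(n_features, n_actions)
optimizer = torch.optim.Adam(surrogate.parameters(), lr=1e-2)

for _ in range(500):
    optimizer.zero_grad()
    log_probs = torch.log_softmax(surrogate(state_features), dim=1)
    # Cross-entropy against soft labels: -sum_a pi(a|s) log P_surrogate(a|s)
    loss = -(policy_probs * log_probs).sum(dim=1).mean()
    loss.backward()
    optimizer.step()

# The learned weights play the role of MNL coefficients and can be inspected
# for behavioral interpretation.
coefficients = surrogate.weight.detach()
```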

We analyze travel survey data from Singapore using our proposed framework. The results reveal differences in activity and travel behavior across employment types. Part-time and self-employed workers travel less frequently than full-time workers, indicating that work arrangements significantly influence travel tendencies. Homemakers and retirees engage in more activities beyond school and work, while unemployed individuals, despite having flexible schedules, tend to stay home rather than move around. These patterns suggest that human movement can be inferred from employment types, highlighting the link between mobility and socio-economic status. Policymakers could potentially influence mobility through economic adjustments that alter employment compositions. Behavioral differences are also evident among socio-demographic groups. Older individuals have higher travel probabilities, likely due to social and household duties. Males are more likely to travel to school and work, while females tend to stay home or engage in other activities, raising concerns about gender inequality in mobility. By clustering reward sequences, we identify four types of activity planners: (1) regular workers commuting between home and work, (2) regular schoolers commuting between home and school, (3) less regular workers or schoolers who engage in other activities before returning home, and (4) homemakers or retirees who conduct other activities during the morning. This distinction among activity planners can potentially serve as a foundation for urban and transport planners to anticipate how different user groups may react to policy interventions. We further summarize the preferences underlying each sequential decision-making process with a single return value. Regular travel patterns, typical of students and workers, yield higher returns, whereas irregular patterns, typical of homemakers, yield lower returns. Individuals with lower activity returns are often female, older, and have lower incomes, indicating the need for greater attention to groups whose activities may lack social recognition.

This study can be further improved in several ways. Firstly, although our current focus is on analyzing activity and travel behavior, the proposed interpretable DIRL framework is general and can be adapted to various sequential decision-making processes. These include route choice modeling in transportation [44], trading strategies and risk preferences for investment [41], and patient treatment strategies in medical domains [6]. Future studies can extend this framework to other scenarios as relevant data becomes available. Secondly, we currently model daily activity and travel patterns due to data limitations, which may not fully capture behavioral variations across different days of the week that matter for travel demand management. Future research could explore weekly or monthly activity planning strategies using data with longer time coverage, such as mobile phone data. Thirdly, our current modeling process focuses on activity type choice and scheduling behavior. Future research could incorporate more dimensions, such as destination choice and mode choice. A significant challenge in modeling all these elements within a DIRL framework is the large space of candidate actions. To manage this complexity, a potential solution is to develop a hierarchical DIRL framework [10], which breaks down the decision-making process into a hierarchy of simpler sub-tasks. Lastly, the extent to which the utility values derived from IRL align with actual human perceptions remains largely unexplored. In this study, we offer some evidence that activity sequences with lower return values tend to be associated with individuals of lower income, indicating potential correlations. However, the real-world implications of the utility values inferred by the model still require further exploration.

References

  • Abbeel and Ng [2004] Abbeel, P., and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine learning ICML ’04 (p. 1). New York, NY, USA: Association for Computing Machinery. URL: https://doi.org/10.1145/1015330.1015430. doi:10.1145/1015330.1015430.
  • Alsaleh and Sayed [2020] Alsaleh, R., and Sayed, T. (2020). Modeling pedestrian-cyclist interactions in shared space using inverse reinforcement learning. Transportation research part F: traffic psychology and behaviour, 70, 37–57.
  • Alwosheel et al. [2021] Alwosheel, A., van Cranenburgh, S., and Chorus, C. G. (2021). Why did you predict that? towards explainable artificial neural networks for travel demand analysis. Transportation Research Part C: Emerging Technologies, 128, 103143.
  • Arentze et al. [2000] Arentze, T., Hofman, F., van Mourik, H., and Timmermans, H. (2000). Albatross: multiagent, rule-based model of activity pattern decisions. Transportation Research Record, 1706, 136–144.
  • Auld and Mohammadian [2009] Auld, J., and Mohammadian, A. (2009). Framework for the development of the agent-based dynamic activity planning and travel scheduling (adapts) model. Transportation Letters, 1, 245–255.
  • Babes et al. [2011] Babes, M., Marivate, V., Subramanian, K., and Littman, M. L. (2011). Apprenticeship learning about multiple intentions. In Proceedings of the 28th international conference on machine learning (ICML-11) (pp. 897–904).
  • Bhat and Koppelman [1999] Bhat, C. R., and Koppelman, F. S. (1999). Activity-based modeling of travel demand. In Handbook of transportation Science (pp. 35–61). Springer.
  • Bowman and Ben-Akiva [2001] Bowman, J. L., and Ben-Akiva, M. E. (2001). Activity-based disaggregate travel demand model system with activity schedules. Transportation research part a: policy and practice, 35, 1–28.
  • Cantarella and de Luca [2005] Cantarella, G. E., and de Luca, S. (2005). Multilayer feedforward networks for transportation mode choice analysis: An analysis and a comparison with random utility models. Transportation Research Part C: Emerging Technologies, 13, 121–155.
  • Chen et al. [2023] Chen, J., Lan, T., and Aggarwal, V. (2023). Hierarchical adversarial inverse reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems.
  • Choi et al. [2021] Choi, S., Kim, J., and Yeo, H. (2021). Trajgail: Generating urban vehicle trajectories using generative adversarial imitation learning. Transportation Research Part C: Emerging Technologies, 128, 103091.
  • Drchal et al. [2019] Drchal, J., Čertickỳ, M., and Jakob, M. (2019). Data-driven activity scheduler for agent-based mobility models. Transportation Research Part C: Emerging Technologies, 98, 370–390.
  • Fu et al. [2018] Fu, J., Luo, K., and Levine, S. (2018). Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248. URL: http://arxiv.org/abs/1710.11248.
  • Gambs et al. [2012] Gambs, S., Killijian, M.-O., and del Prado Cortez, M. N. (2012). Next place prediction using mobility markov chains. In Proceedings of the first workshop on measurement, privacy, and mobility (pp. 1–6).
  • Hinton et al. [2015] Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Krishna and Murty [1999] Krishna, K., and Murty, M. N. (1999). Genetic k-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 29, 433–439.
  • Liang and Zhao [2021] Liang, Y., and Zhao, Z. (2021). Nettraj: A network-based vehicle trajectory prediction model with directional representation and spatiotemporal attention mechanisms. IEEE Transactions on Intelligent Transportation Systems, 23, 14470–14481.
  • Liu and Jiang [2022] Liu, S., and Jiang, H. (2022). Personalized route recommendation for ride-hailing with deep inverse reinforcement learning and real-time traffic conditions. Transportation Research Part E: Logistics and Transportation Review, 164, 102780. URL: https://www.sciencedirect.com/science/article/pii/S1366554522001715. doi:10.1016/j.tre.2022.102780.
  • Liu et al. [2020] Liu, S., Jiang, H., Chen, S., Ye, J., He, R., and Sun, Z. (2020). Integrating Dijkstra’s algorithm into deep inverse reinforcement learning for food delivery route planning. Transportation Research Part E: Logistics and Transportation Review, 142, 102070. URL: https://www.sciencedirect.com/science/article/pii/S1366554520307213. doi:10.1016/j.tre.2020.102070.
  • Liu et al. [2022] Liu, Y., Li, Y., Qin, G., Tian, Y., and Sun, J. (2022). Understanding the behavioral effect of incentives on departure time choice using inverse reinforcement learning. Travel Behaviour and Society, 29, 113–124.
  • Lundberg and Lee [2017] Lundberg, S. M., and Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in neural information processing systems, 30.
  • McNally [2007] McNally, M. G. (2007). The four-step model. In Handbook of transport modelling (pp. 35–53). Emerald Group Publishing Limited volume 1.
  • McNally and Rindt [2007] McNally, M. G., and Rindt, C. R. (2007). The activity-based approach. In Handbook of transport modelling (pp. 55–73). Emerald Group Publishing Limited volume 1.
  • Miller and Roorda [2003] Miller, E. J., and Roorda, M. J. (2003). Prototype model of household activity-travel scheduling. Transportation Research Record, 1831, 114–121.
  • Nayak and Pandit [2023] Nayak, S., and Pandit, D. (2023). A joint and simultaneous prediction framework of weekday and weekend daily-activity travel pattern using conditional dependency networks. Travel Behaviour and Society, 32, 100595.
  • Ng and Russell [2000] Ng, A. Y., and Russell, S. J. (2000). Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning ICML ’00 (pp. 663–670). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
  • Pappalardo et al. [2015] Pappalardo, L., Simini, F., Rinzivillo, S., Pedreschi, D., Giannotti, F., and Barabási, A.-L. (2015). Returners and explorers dichotomy in human mobility. Nature communications, 6, 8166.
  • Pendyala et al. [2005] Pendyala, R. M., Kitamura, R., Kikuchi, A., Yamamoto, T., and Fujii, S. (2005). Florida activity mobility simulator: overview and preliminary validation results. Transportation Research Record, 1921, 123–130.
  • Phan and Vu [2021] Phan, D. T., and Vu, H. L. (2021). A novel activity pattern generation incorporating deep learning for transport demand models. arXiv preprint arXiv:2104.02278.
  • Ribeiro et al. [2016] Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1135–1144).
  • Sanghvi et al. [2021] Sanghvi, N., Usami, S., Sharma, M., Groeger, J., and Kitani, K. (2021). Inverse reinforcement learning with explicit policy estimates. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 9472–9480). volume 35.
  • Schulman et al. [2017] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Sifringer et al. [2020] Sifringer, B., Lurkin, V., and Alahi, A. (2020). Enhancing discrete choice models with representation learning. Transportation Research Part B: Methodological, 140, 236–261.
  • Song et al. [2022] Song, Y., Li, D., Liu, D., Cao, Q., Chen, J., Ren, G., and Tang, X. (2022). Modeling activity-travel behavior under a dynamic discrete choice framework with unobserved heterogeneity. Transportation research part E: logistics and transportation review, 167, 102914.
  • Song et al. [2024] Song, Y., Li, D., Ma, Z., Liu, D., and Zhang, T. (2024). A state-based inverse reinforcement learning approach to model activity-travel choices behavior with reward function recovery. Transportation Research Part C: Emerging Technologies, 158, 104454.
  • Tinessa [2021] Tinessa, F. (2021). Closed-form random utility models with mixture distributions of random utilities: Exploring finite mixtures of qgev models. Transportation Research Part B: Methodological, 146, 262–288.
  • Västberg et al. [2020] Västberg, O. B., Karlström, A., Jonsson, D., and Sundberg, M. (2020). A dynamic discrete choice activity-based travel demand model. Transportation science, 54, 21–41.
  • Wang et al. [2020a] Wang, S., Mo, B., and Zhao, J. (2020a). Deep neural networks for choice analysis: Architecture design with alternative-specific utility functions. Transportation Research Part C: Emerging Technologies, 112, 234–251.
  • Wang et al. [2021] Wang, S., Wang, Q., Bailey, N., and Zhao, J. (2021). Deep neural networks for choice analysis: A statistical learning theory perspective. Transportation Research Part B: Methodological, 148, 60–81.
  • Wang et al. [2020b] Wang, S., Wang, Q., and Zhao, J. (2020b). Deep neural networks for choice analysis: Extracting complete economic information for interpretation. Transportation Research Part C: Emerging Technologies, 118, 102701.
  • Yang et al. [2018] Yang, S. Y., Yu, Y., and Almahdi, S. (2018). An investor sentiment reward-based trading system using gaussian inverse reinforcement learning algorithm. Expert Systems with Applications, 114, 388–401.
  • Ye et al. [2024] Ye, Y., Zheng, P., Liang, H., Chen, X., Wong, S., and Xu, P. (2024). Safety or efficiency? estimating crossing motivations of intoxicated pedestrians by leveraging the inverse reinforcement learning. Travel behaviour and society, 35, 100760.
  • Ying et al. [2019] Ying, Z., Bourgeois, D., You, J., Zitnik, M., and Leskovec, J. (2019). Gnnexplainer: Generating explanations for graph neural networks. Advances in neural information processing systems, 32.
  • Zhao and Liang [2023] Zhao, Z., and Liang, Y. (2023). A deep inverse reinforcement learning approach to route choice modeling with context-dependent rewards. Transportation Research Part C: Emerging Technologies, 149, 104079.
  • Ziebart et al. [2008] Ziebart, B. D., Maas, A. L., Bagnell, J. A., Dey, A. K. et al. (2008). Maximum entropy inverse reinforcement learning. In AAAI (pp. 1433–1438). Chicago, IL, USA, volume 8.
  • Zimmermann et al. [2018] Zimmermann, M., Västberg, O. B., Frejinger, E., and Karlström, A. (2018). Capturing correlation with a mixed recursive logit model for activity-travel scheduling. Transportation Research Part C: Emerging Technologies, 93, 273–291.