To Optimize Human-in-the-loop Learning in Repeated Routing Games
Abstract
Today's navigation applications (e.g., Waze and Google Maps) enable human users to learn and share the latest traffic observations, yet such information sharing simply helps selfish users predict and choose the shortest paths and jam each other. Prior routing game studies focus on myopic users in oversimplified one-shot scenarios and regulate selfish routing via information hiding or pricing mechanisms. For practical human-in-the-loop learning (HILL) in repeated routing games, we face non-myopic users with differential past observations and need new mechanisms (preferably non-monetary) to persuade users to adhere to the optimal path recommendations. We model the repeated routing game in a typical parallel transportation network, which generally contains one deterministic path and multiple stochastic paths. We first prove that under either the information-sharing mechanism in use or the latest information-hiding mechanism in the routing literature, the resultant price of anarchy (PoA), which measures the efficiency loss from the social optimum, can approach infinity, indicating an arbitrarily poor exploration-exploitation tradeoff over time. Then we propose a novel user-differential probabilistic recommendation (UPR) mechanism to differentiate and randomize path recommendations for users with differential learning histories. We prove that our UPR mechanism ensures interim individual rationality for all users and significantly reduces the PoA to be close-to-optimal, which cannot be further reduced by any other non-monetary mechanism. In addition to theoretical analysis, we conduct extensive experiments using real-world datasets to generalize our routing graphs and validate the close-to-optimal performance of our UPR mechanism.
Index Terms:
human-in-the-loop learning, repeated routing games, price of anarchy, mechanism design

I Introduction
With the ever-increasing number of smart devices owned by human users in transportation networks, mobile crowdsourcing applications (e.g., Waze and Google Maps) have emerged as vital platforms for users to learn and share their observed time-varying traffic information for guiding their daily or repeated routing ([1, 2, 3]). These applications simply publicize all useful traffic information, which encourages selfish users to consistently choose the current shortest routes in each round to minimize their own travel costs ([4, 5]). As a consequence, users miss critical exploration of non-shortest paths that may turn out to be better routes and reduce travel costs in the future.
In systems with dynamic and diverse information, multi-armed bandit (MAB) problems emerge as a framework for investigating optimal exploration-exploitation trade-offs among stochastic arms or paths (e.g., [6, 7, 8, 9, 10, 11]). For instance, [9] utilizes MAB techniques to forecast congestion levels in rapidly changing transportation environments. Similarly, [10] employs MAB exploration-exploitation strategies to manage the recruitment of previously reliable users alongside new, unproven users in completing crowdsourcing tasks. More recently, research efforts have extended traditional MAB paradigms to distributed settings involving multiple non-myopic agents aiming to minimize overall social costs (e.g., [12, 13, 14, 15]). For instance, [12] explores the use of distributed agents to learn heterogeneous local models associated with each arm, contributing to the social planner's decision-making process. Additionally, [13] investigates how a fixed flow of agents collaborates in a distributed MAB scenario over repeated interactions, where each agent autonomously makes optimal decisions based on privately acquired arm information. Subsequently, [14] and [15] allow agents to learn and share arm information with their neighbors for making locally optimal decisions. Overall, these new distributed MAB solutions require all agents to be fully cooperative and adhere to the optimal policy set by the social planner, without any selfish deviation to favor individual rewards.
To facilitate human-in-the-loop learning (HILL) in repeated routing games, it is crucial to incentivize selfish users to follow the social planner’s optimal recommendations, enabling both information learning on diverse paths and the long-term reduction of traffic congestion. Prior routing game studies mainly focus on an oversimplified one-shot routing scenario without any information variation over time. They develop information hiding (e.g., [16, 17, 18]) and side-payment or pricing mechanisms (e.g., [19, 20, 21]) to regulate myopic users’ selfish routing choices. However, these mechanisms strongly assume that the social planner possesses all the traffic information to design incentives. Moreover, they only consider persuading myopic users to follow, whereas myopic users do not have past traffic observations as side information to reverse-engineer and deviate. In contrast, our HILL problem deals with a dynamic system with time-varying traffic information, necessitating the social planner to not only reduce immediate congestion caused by users but also incentivize their ongoing information learning on diverse paths.
As compared to pricing or side-payment, informational mechanisms (e.g., information hiding) are non-monetary and easy to implement ([22, 23, 24]). In this paper, we aim to design the best informational mechanism to regulate HILL in repeated routing games to persuade non-myopic users with differential side information in routing choices. As such, we need to address the following two technical challenges.
• The first question is how to evaluate existing routing policies' performance for non-myopic users as compared to the social optimum in a dynamic system with time-varying traffic information. Prior routing game literature (e.g., [17, 18]) focuses on myopic users in oversimplified one-shot scenarios, assuming that the social planner knows all traffic conditions to decide routing policies. In practice, the social planner lacks access to live traffic conditions but relies on non-myopic users to explore different paths over time. Although non-myopic, users remain selfish, and their interests are not aligned with the social planner's goal of balancing long-term exploitation and exploration to minimize the overall social cost. Moreover, theoretical analysis of the existing routing policies is lacking, particularly in understanding their performance gaps from the social optimum.
• The second challenge is how to persuade non-myopic users to follow the optimal path recommendations despite their selfishness and differential learning histories for reverse-engineering. In repeated routing games, each user only cares about its own travel cost. It may learn from its own past traffic observations to infer the current system's status and deviate from the optimal recommendation. Existing studies of incentivized exploration ([25, 26, 27, 28]) and Bayesian persuasion ([29, 30, 32, 33]) largely assume that all users are myopic to participate and do not have private side information to play strategically against the system's recommendations.
It is worth noting that there is a recent study related to this paper ([34]), which introduces an informational mechanism aimed at regulating myopic users' distributed information learning in congestion games. However, for practical human-in-the-loop learning in repeated routing games, we face non-myopic users, who focus on their own long-term rewards rather than one-shot immediate rewards. In this case, the mechanism in [34] cannot work, as non-myopic users can reverse-engineer the system status to play against the mechanism. Different from that work, our paper addresses the two challenges above by proposing new mechanism designs and analysis.
We summarize our key novelty and main contributions as follows.
• To regulate human-in-the-loop learning (HILL) for repeated routing games: To the best of our knowledge, this work is the first to regulate non-myopic users with differential side information in repeated routing games. We model the repeated routing game in a typical parallel transportation network, which generally contains one deterministic path and multiple stochastic paths. Here each stochastic path has time-varying unknown traffic conditions to be learned by repeatedly arriving non-atomic users. Furthermore, users' past routing choices create differential side information for selfish routing, making it challenging to design any informational mechanism.
• Theoretical comparison among the information-sharing mechanism, the information-hiding mechanism, and the social optimum via PoA analysis: We analyze the information-sharing and information-hiding mechanisms as well as the social optimum to derive their corresponding routing recommendations in closed form, respectively. Then we prove that under either today's information-sharing mechanism (in Waze and Google Maps) or the latest information-hiding mechanism in the routing game literature, the resultant price of anarchy (PoA), which measures the efficiency loss from the optimum in total travel cost, can be arbitrarily large.
• Best informational mechanism to remedy huge efficiency loss: We propose a novel user-differential probabilistic recommendation (UPR) mechanism to randomly differentiate path recommendations for users with differential learning histories. We prove that our UPR mechanism ensures interim individual rationality for all users and significantly reduces the PoA to be close-to-optimal, which cannot be further reduced by any other informational mechanism. In addition to theoretical analysis, we conduct extensive experiments using real-world datasets to generalize our routing graphs and validate the close-to-optimal average performance of our UPR mechanism.
The rest of the paper is organized as follows. In Section II, we propose the repeated routing game model, the public human-in-the-loop learning (HILL) model under the information-sharing mechanism, and the differential HILL model under the information-hiding mechanism. Then in Section III, we first formulate the optimization problems for the sharing and hiding mechanisms, as well as the socially optimal policy, and then analyze the solutions for these three policies. In Section IV, we analytically compare the two mechanisms with the social optimum via PoA analysis. Based on this analysis, we propose our UPR mechanism in Section V. We verify the average performance of our UPR mechanism in Section VI. Finally, Section VII concludes the paper. To facilitate readability, we summarize all the key notations in Table I.
II System model


In this section, as in the routing game literature (e.g., [17, 18, 28]), we examine a standard parallel transportation network, which is simple but fundamental, to introduce the repeated routing game model. Then, we proceed to present the public HILL model under the information sharing mechanism and the differential HILL model under the information hiding mechanism for the crowdsourcing platform, respectively.
II-A Repeated Routing Game Model
We consider an infinite discrete time horizon . In Figure 1(a), at the beginning of each , there is a unit flow of users arriving at origin O. Then this unit flow of users individually choose paths from all the available paths to travel to their common destination D. We generally consider a deterministic path 0 with a fixed internal travel cost, in addition to stochastic paths, as in [17, 18, 28].
Following the existing routing game literature (e.g., [30, 31, 24]), for any stochastic path , the internal travel cost is random and dynamically alternates between a high-cost state and a low-cost state over time, according to the commonly used Markov chain in Figure 1(b) (besides these supporting references, we also validate the suitability of this Markov chain model in Section VI's experiments). The internal cost may refer to factors like travel time or fuel consumption, which depend on the condition of each path (e.g., poor visibility, 'black ice' segments [28, 24]). Upon reaching destination D, users repeat the same process in the subsequent time slot . The length of a time slot can represent a day, for instance, when people commute from home to the office each morning within a similar time range. Therefore, we generally assume that users are non-myopic, allowing them to retain their prior traffic observations to guide future routing decisions.
Given the unit flow of user arrivals at current , we denote by the flow of users choosing any stochastic path . Following the existing literature (e.g., [17, 30, 18]), in addition to the internal travel cost , each user on path will face another (external) congestion cost caused by the other users on the same path. Therefore, the individual cost of each user traveling on stochastic path is . Similarly, each of the other flow of users on deterministic path 0 has individual cost .
To learn the time-varying internal cost of stochastic path for future use, the platform (e.g., Waze) expects users to frequently travel on diverse paths and share their observations. It then observes the user flow and observation via GPS and travel times [35].
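To make the cost structure above concrete, the per-user costs can be sketched as follows. All constants (the transition probabilities `P_LH`/`P_HH`, the state costs, the deterministic cost `C_DET`) and the linear form of the congestion term are illustrative assumptions, since the paper's own symbols are not preserved in this text.

```python
import random

# Illustrative constants (assumed; not the paper's exact values).
P_LH = 0.3   # transition probability: low-cost state -> high-cost state
P_HH = 0.6   # transition probability: high-cost state -> high-cost state
C_HIGH, C_LOW = 5.0, 1.0   # internal costs of the high-/low-cost states
C_DET = 2.0                # fixed internal cost of deterministic path 0

def next_state(is_high):
    """One step of the two-state Markov chain in Fig. 1(b)."""
    p_to_high = P_HH if is_high else P_LH
    return random.random() < p_to_high

def cost_stochastic(internal_cost, flow):
    """Individual cost on a stochastic path: internal cost plus a
    congestion term, assumed linear in the flow on that path."""
    return internal_cost + flow

def cost_deterministic(flow_on_path0):
    """Individual cost on deterministic path 0."""
    return C_DET + flow_on_path0
```

For instance, a user on a low-cost stochastic path carrying flow 0.4 pays `cost_stochastic(C_LOW, 0.4)`, while the remaining 0.6 flow on path 0 each pays `cost_deterministic(0.6)`.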
| Notation | Meaning |
|---|---|
| | The number of stochastic paths. |
| | The internal travel cost of stochastic path . |
| | The fixed internal travel cost of deterministic path 0. |
| | The high- and low-cost states on stochastic paths. |
| | The transition probabilities from the low- and high-cost states to the high-cost state in the Markov chain. |
| | The flow of users traveling on any path at time . |
| | The prior and posterior public hazard beliefs on stochastic path under the sharing mechanism. |
| | The set of public hazard beliefs of all stochastic paths. |
| | The information age of path under the hiding mechanism. |
| | The private hazard belief with information age on stochastic path under the hiding mechanism. |
| | The set of private hazard beliefs of all stochastic paths. |
| | The immediate social cost at time . |
| | The long-term cost function for all users. |
| | The discount factor. |
| | The expected arrival rate of users. |
| | The private recommendation for each user under our UPR mechanism at time . |
II-B Public HILL under Information Sharing
The HILL model depends on the implemented mechanism, and we first consider the existing information-sharing mechanism used by Google Maps and Waze, which shares all useful traffic information with the public. After all users make their routing decisions, a flow of users travels on stochastic path to observe the actual state there. We define the public hazard belief to be the expected probability of seeing the bad-traffic or high-cost condition at time . Using Bayesian inference, we obtain:
(1)
where summarizes prior public hazard beliefs before . Then we introduce the information learning process of this public HILL model below.
• At the beginning of time , the platform broadcasts the public hazard belief of any stochastic path in (1) about traffic state .
• During time slot , users traveling on stochastic path will observe the actual state and report it back to the platform. Then the platform will further update this prior belief to a posterior belief below:
– If the flow of users observe on path , then the posterior belief is updated to .
– If these users observe on path , then the posterior belief is updated to .
– If no user travels to learn this path , the posterior belief remains as .
• At the end of each time slot, the platform updates the public hazard belief from to below, based on the Markov chain in Fig. 1(b):
(2)
The above process repeats as the new time slot begins. Note that the Markov chain-based HILL process described above is equivalent to the existing restless multi-armed bandit (MAB) problem (e.g., [36, 37]). Our subsequent analysis and key results (e.g., Proposition 1, Proposition 2, and our UPR mechanism) can also be generalized if follows a memoryless stochastic process (e.g., a Poisson process or exponential distribution), as the worst-case scenario remains unchanged.
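The two-step belief recursion above (a Bayesian posterior within the slot, then the Markov prediction step of (2)) can be sketched as below. The transition probabilities `P_LH` and `P_HH` into the high-cost state are illustrative stand-ins for the paper's stripped symbols.

```python
# Assumed transition probabilities into the high-cost state (cf. Table I).
P_LH, P_HH = 0.3, 0.6

def posterior(prior, observed_high):
    """Posterior belief after the slot: 1 or 0 if some users observed the
    state, unchanged if no user traveled the path (observed_high is None)."""
    if observed_high is None:
        return prior
    return 1.0 if observed_high else 0.0

def next_prior(post):
    """Markov prediction step as in (2): propagate the posterior one step
    through the two-state chain of Fig. 1(b)."""
    return post * P_HH + (1.0 - post) * P_LH
```

For instance, after users observe the high-cost state (posterior 1), next slot's prior hazard belief is `next_prior(1.0) = P_HH`; with no travelers, the belief is simply propagated from its current value.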
II-C Differential HILL under Information Hiding
If the platform chooses to hide this public information from users (e.g., by applying the latest information hiding mechanism [17, 32, 24]), users have to independently update private beliefs based on their own routing choices and observations of any path in the past. In this subsection, we introduce how users update their differential private beliefs without public information sharing by the platform. To aid users’ routing decisions, we allow them to observe the traffic flow in the last time slot (i.e., on any path ) at the beginning of each , as widely assumed in existing transportation literature (e.g., [38, 39, 40]).
At the initial time , the hazard belief of any stochastic path is given to all users. Without knowing public belief but only the past flow since , users will experience different information ages since their last observations of this path. Denote by the information age of path for an individual user at time , and let denote its corresponding private hazard belief. Then, we proceed to introduce the update of and under differential HILL in the following two cases.
• If this user just explored stochastic path at , its information age of this path is set to the minimum . Therefore, its current private hazard belief is exactly the same as the public belief of the platform:
(3)
•
If this user did not explore stochastic path at time , its information age should increase by from . As it is able to observe the last flow at the beginning of , it will rectify its private belief . For example, if this user observes more users traveling on path with in the last time slot, it will believe that this path had a good traffic condition with weak hazard belief according to Fig. 1(b). Otherwise, it infers with strong hazard belief there. By rectifying the last private belief to the exact public belief from the prior flow distribution, its information age keeps at and does not grow over time. Thus, its private belief for time is updated to:
(4) due to (2) and without new observation.
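The two private-belief cases above can be sketched as follows. The transition probabilities and the flow threshold used to infer last slot's state are illustrative assumptions, a simplified stand-in for the exact flow-based rectification in the paper.

```python
# Sketch of the private-belief update under information hiding (eqs. (3)-(4)).
P_LH, P_HH = 0.3, 0.6          # assumed transition probabilities
FLOW_THRESHOLD = 0.5           # hypothetical cutoff for "many users chose the path"

def private_belief(explored_last_slot, public_belief, last_flow):
    """A user who explored last slot copies the public belief (eq. (3));
    otherwise it infers last slot's state from the observed flow and
    propagates one Markov step (eq. (4)), so its information age stays at 1."""
    if explored_last_slot:
        return public_belief
    last_state_good = last_flow > FLOW_THRESHOLD   # heavy flow => good state
    post = 0.0 if last_state_good else 1.0
    return post * P_HH + (1.0 - post) * P_LH
```

A user who saw heavy flow last slot thus infers the weak hazard belief `P_LH`, while light flow leads it to the strong hazard belief `P_HH`.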
III Problem Formulations and Analysis for Sharing, Hiding, and Social Optimum
In this section, we first formulate optimization problems for guiding non-myopic users’ repeated routing under sharing and hiding mechanisms, as well as the social optimum. Then, we solve these optimization problems to determine the routing flow in closed form per time for each mechanism/policy.
We first define non-myopic users’ selfish routing decisions in the following.
Definition 1 (Selfish routing decision).
Under the selfish routing decision, each non-myopic user aims to maximize its own long-term -discounted expected reward, where is the discount factor.
Based on Definition 1, each user will make selfish routing decisions under both sharing and hiding mechanisms.
III-A Formulation and Equilibrium for Information Sharing
In this subsection, we consider the information-sharing mechanism (used by Waze) to cater to users’ selfish interests such that users will not deviate from its recommendations.
We first summarize the dynamics of public hazard belief for each stochastic path at any time in a vector below:
(6)
each of which can be obtained by (2). For the unit user flow at time , the platform first shares the latest public hazard belief set with them. Then each non-myopic user will selfishly choose the path that minimizes its own long-run cost.
Let denote the minimal long-run expected cost of a single user since time , given the flow of non-myopic users choosing path under the sharing mechanism with symbol in the superscript. For ease of explanation, we take the special case of stochastic path alongside the deterministic path 0 in a typical two-path network example to solve and intuitively explain below. For an arbitrary number , we can similarly compute by balancing the expected costs of all the paths as in (7).
Lemma 1.
Under information sharing, given the unit flow of non-myopic user arrivals at time , the recommended flow of users choosing stochastic path 1 at equilibrium is (to realize the equilibrium flow in the second case, each user will be fairly chosen with probability to explore path 1, and it will not deviate because of the reduced long-run cost):
(7)
and to path 1, where infinitesimal , if is true and otherwise.
The proof of Lemma 1 is given in Appendix A of the supplemental material. From the first case of (7), even if the stochastic path 1’s internal travel cost is larger than path 0’s internal cost plus the maximum congestion there, non-myopic users may still choose path 1 with minimum flow to learn possible state for future use given . If the internal cost of path 0 is larger than the internal cost of path 1 plus the maximum congestion, all user flow will choose path 1 in the first case of (7). Otherwise, the two paths’ conditions are comparable and the user flow will partition to both paths to balance congestion there.
In the last two cases of (7), as there is always a positive flow of users traveling on path 1 to learn fresh information there, non-myopic users will myopically make routing decisions to minimize their immediate individual costs in the current time slot.
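The balanced case of (7) can be illustrated with the linear congestion costs assumed earlier: the equilibrium flow equalizes the two paths' expected individual costs, clipped to keep the minimal exploration flow. The closed form below is a simplified sketch under those assumptions, not the paper's exact expression.

```python
EPS = 1e-3   # stand-in for the infinitesimal exploration flow in (7)

def sharing_flow(expected_cost_path1, c0):
    """Flow f on stochastic path 1 that equalizes expected_cost_path1 + f
    and c0 + (1 - f), clipped to [EPS, 1] (illustrative linear congestion)."""
    f = (c0 + 1.0 - expected_cost_path1) / 2.0
    return max(EPS, min(1.0, f))
```

With equal internal costs the unit flow splits evenly; a much costlier path 1 only receives the minimal exploration flow `EPS`, and a much costlier path 0 pushes the whole unit flow onto path 1.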
Then we further analyze the long-term social cost under the sharing mechanism (with specific (7) for case). Define and to be the long-term and immediate social cost functions. At each time slot , the immediate social cost is obtained as:
(8)
which includes the current travel cost of flow of users on deterministic path 0 and flow of users on any stochastic path .
After obtaining the immediate social cost (8), we further formulate the long-term cost function as a Markov decision process (MDP), where the update of hazard belief depends on current users' routing choices and observations on path 's. Consequently, two cases arise for updating at each time slot. If no user travels on stochastic path , remains as and will be updated to according to (2). If some users travel on path , then will be updated to or , depending on whether users' observation there is or in Fig. 1(b).
Based on the analysis above, we formulate the long-term -discounted cost function under information sharing below:
(9)
where is given in (8) and is the discount factor. Selfish users will not choose any stochastic path with higher long-run cost, and thus of this path may not be updated on time to be useful for the next time slot.
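The discounted objective in (9) can be sketched by Monte Carlo simulation for a single stochastic path alongside path 0. All constants and the linear congestion form are illustrative assumptions; the routing rule here is an arbitrary stand-in, not an equilibrium policy.

```python
import random

# Assumed constants for one stochastic path plus deterministic path 0.
P_LH, P_HH = 0.3, 0.6
C_HIGH, C_LOW, C_DET = 5.0, 1.0, 2.0
DISCOUNT = 0.9   # the discount factor (its symbol is stripped in the text)

def immediate_social_cost(flow1, internal_cost1):
    """Eq. (8) specialized to one stochastic path with linear congestion."""
    flow0 = 1.0 - flow1
    return flow0 * (C_DET + flow0) + flow1 * (internal_cost1 + flow1)

def discounted_social_cost(flow_rule, horizon=200, seed=0):
    """Simulate the two-state chain and accumulate the discounted cost of a
    routing rule mapping the current state to a flow on path 1."""
    rng = random.Random(seed)
    high, total, weight = True, 0.0, 1.0
    for _ in range(horizon):
        internal = C_HIGH if high else C_LOW
        total += weight * immediate_social_cost(flow_rule(high), internal)
        weight *= DISCOUNT
        high = rng.random() < (P_HH if high else P_LH)
    return total
```

For example, `discounted_social_cost(lambda high: 0.0)` gives the discounted cost of routing everyone on path 0 forever, which approaches the geometric-series value of the constant per-slot cost.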
III-B Formulation and Equilibrium for Information Hiding
In this subsection, we formulate the long-term cost function under the latest hiding mechanism in the routing game literature (e.g., [17, 32, 24]). Note that their mechanisms cannot be directly applied to our repeated routing as users are non-myopic and no longer hold the same private belief. Thus we need to revise it to fit our differential HILL model as described in Section II-C. Note that the hiding mechanism still caters to users’ selfish interests to hide all information including and lets users themselves choose routing paths that minimize their own long-run costs under Definition 1.
Recall in (5) that users under the hiding mechanism are different in their private information ages or from their own last observations, resulting in differential private belief in (3) or in (4) about path , respectively. We summarize the dynamics of users’ private beliefs and among stochastic paths at time as:
(10)
Let denote the flow of non-myopic users choosing path at time under the hiding mechanism with subscript , which needs to specify users’ path decisions under different information age ’s. Denote by the flow of users choosing path with the same information age , which satisfies
(11)
Given the flow of users exploring path in the last time slot, the current flow of users with is thus . As they have the latest information on this path with , they will infer other users' private belief with according to (4). Thus, they can estimate the exact flow of these users on path , and then decide to minimize their own long-run costs.
While for the other flow of users with , they only hold delayed or aged belief to decide to minimize their expected long-run costs.
To clearly tell non-myopic users’ decision-making under the differential HILL model, we derive in the next lemma the closed-form expressions for how users decide and in the basic two-path network at the equilibrium as an example.
Lemma 2.
Under the information-hiding mechanism, the flow of users with minimum information age since their last observations decides its routing flow on stochastic path 1 at time as:
(12)
where is given by (7). For the other flow of users with age , their routing flow is
(13)
where is given in (7). Then the total flow of users on path 1 is .
The proof of Lemma 2 is given in Appendix B of the supplemental material. As compared to under information sharing in (7), users with under information hiding have to use their private belief to decide in (13). Then users with decide in (12) to make the total flow on path approach that under information sharing as much as possible.
III-C Formulation and Solution for Social Optimum
Recall that the sharing and hiding mechanisms cater to users' selfish interests to persuade them to follow the path recommendations in Lemmas 1 and 2. Differently, the socially optimal policy aims to centrally solve for the optimal routing flow to minimize the long-run expected social cost for all users, by assuming users are cooperative all the time. Let denote the long-term -discounted cost function under the socially optimal policy, given as:
(15)
where is similarly defined as (8). If the platform asks all the unit flow of arriving users to choose path 0, there would be no observation on stochastic path . As a result, the posterior belief remains as . However, if the platform instructs some users to travel on path , can be updated in time to or and will be further updated to by (2).
In (15), even if the expected travel cost on path is much higher than others, the socially optimal policy may still recommend a small flow of users to learn on path . As , this infinitesimal flow of users’ immediate total cost is negligible. At the same time, their learned information on path can help reduce future costs for all users. Therefore, the socially optimal policy always decides for any stochastic path .
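The argument that infinitesimal exploration is always worthwhile can be checked numerically: routing flow ε to a costlier path each slot adds at most ε·Δc per slot, i.e. at most ε·Δc/(1-ρ) in discounted total, which vanishes as ε → 0. The constants below are illustrative assumptions.

```python
DISCOUNT = 0.9   # assumed discount factor
DELTA_C = 4.0    # assumed worst-case extra cost per unit of misrouted flow

def exploration_overhead(eps):
    """Upper bound on the total discounted cost of always routing an
    eps-flow to a stochastic path: geometric series eps * DELTA_C * sum(rho^t),
    which equals eps * DELTA_C / (1 - rho)."""
    return eps * DELTA_C / (1.0 - DISCOUNT)
```

The bound is linear in ε, so the exploration overhead becomes negligible for infinitesimal flows while the learned path information keeps benefiting all future users.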
Next, we derive the optimal recommended flow in the typical two-path network example.
Lemma 3.
Under the socially optimal policy, the recommended flow on stochastic path 1 is
(16)
where is infinitesimal.
The proof of Lemma 3 is given in Appendix C of the supplemental material. By comparing (16) to (7) under the sharing mechanism, we observe that the socially optimal policy always explores stochastic path 1 with , even if is much larger than . If , in (16) is smaller than in (7) to reduce the overall congestion on this stochastic path.
By comparing (16) to (12)-(13) under the hiding mechanism, the socially optimal policy adaptively uses actual to guide users’ routing in different cases. However, the hiding mechanism uses and for partially blind routing decisions, which can be very different from the optimum. Based on (7), (12)-(13) and (16) of the three mechanisms/policies, we are ready to compare their system performances in Section IV to inspire our mechanism design.
IV Analytical comparison between sharing, hiding, and social optimum via PoA
In this section, we first conduct a comparative PoA analysis between the sharing mechanism and the socially optimal policy to prove the huge performance loss of the sharing mechanism. Subsequently, we also prove the huge PoA of the hiding mechanism as compared to the social optimum. We are then well motivated to propose our new mechanism in Section V. In fact, this section's analytical results provide important guidance for our later mechanism design, which is carefully designed to operate between the information-sharing and information-hiding mechanisms.
Before analyzing the two mechanisms’ performance loss, we first obtain the bounds of public hazard belief and private belief for future use.
Lemma 4.
Given any initial belief for stochastic path , both the public belief and private belief satisfy and for any time .
The proof of Lemma 4 is given in Appendix D of the supplemental material. According to (2), the lower bound and upper bound are derived when and maximum , respectively. If the current users observe , leading to , the hazard belief will be updated to its minimum value . Conversely, if with , will be updated to its maximum value . Consequently, the updated hazard belief will always remain bounded within . Similarly, based on the updates of the private belief in (3) or (4), it is also constrained by and .
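Lemma 4's bounds can be checked by simulation: after one update, any belief produced by the recursion of Section II-B lies between the two transition probabilities, whatever the observation sequence. The transition values below are illustrative.

```python
import random

P_LH, P_HH = 0.3, 0.6   # assumed transition probabilities into the high state

def update(prior, observed_high):
    """One belief update: Bayesian posterior, then the Markov step of (2)."""
    post = prior if observed_high is None else (1.0 if observed_high else 0.0)
    return post * P_HH + (1.0 - post) * P_LH

def beliefs_stay_bounded(trials=1000, seed=1):
    """Run random observation sequences and check P_LH <= belief <= P_HH."""
    rng = random.Random(seed)
    belief = rng.random()   # arbitrary initial belief in [0, 1]
    for _ in range(trials):
        belief = update(belief, rng.choice([True, False, None]))
        if not (P_LH <= belief <= P_HH):
            return False
    return True
```

The bounds hold because every updated belief is a convex combination of `P_HH` and `P_LH`, attained exactly at the all-high and all-low observations, matching the extremes described above.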
Based on Lemma 4, if the minimal expected travel cost under the good traffic condition of stochastic path is larger than , selfish users will never explore stochastic path at any time. On the other hand, if the maximal expected travel cost is smaller than , selfish users will always choose path to exploit the low travel cost there. Therefore, to avoid the trivial cases of always exploring or always exploiting, we focus on , , and below.
IV-A PoA Analysis of Information Sharing Mechanism
We compare under the sharing mechanism in Section III-A to the optimal recommended flow under the social optimum in Section III-C to examine the sharing mechanism’s efficiency loss. For example, by comparing in (7) and in (16) for the two-path network example, we find that the sharing mechanism misses both exploitation and exploration on stochastic path 1, as compared to the social optimum. Below, we examine their performance gap for an arbitrary number of stochastic paths.
We use the standard definition of the price of anarchy (PoA) in game theory [41], for measuring the maximum ratio between the long-term social cost under the sharing mechanism in (9) and the minimal social cost in (15):
(17)
which is always larger than 1 by considering all system parameters.
Through involved worst-case analysis, we prove the lower bound of in (17) caused by the sharing mechanism’s lack of exploration in the following proposition.
Proposition 1.
The proof of Proposition 1 is given in Appendix E of the supplemental material. In Proposition 1's proof, we consider the worst case of minimum exploration under the sharing mechanism. Given up to on path , there will be an flow of users traveling on stochastic path at time . Once they find there at any , no user will explore this path again, as . However, the socially optimal policy (16) always recommends a flow of users to explore any path to learn the possible there. As , it takes them at most time slots to find at least one path with . After that, all users' long-term costs will be significantly reduced due to . Then we compare the two long-term costs to derive the PoA lower bound in (18).
Remark 1.
In (18), once to make , then , telling infinitely large efficiency loss under the sharing mechanism.
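The blow-up in Remark 1 can be seen in a stylized two-path instance: if never-exploring selfish users keep paying the deterministic cost c0 every slot while the optimum settles on a path stuck in its low-cost state, the discounted-cost ratio reduces to c0/c_low, which grows without bound as c_low shrinks. This is a toy bound under assumed constants, not the paper's exact worst-case construction.

```python
DISCOUNT = 0.9   # assumed discount factor

def discounted_constant_cost(per_slot_cost):
    """Discounted sum of a constant per-slot cost: c * sum(rho^t) = c/(1-rho)."""
    return per_slot_cost / (1.0 - DISCOUNT)

def poa_toy(c0, c_low):
    """Stylized cost ratio: never-exploring users pay c0 forever, while the
    optimum pays c_low forever once exploration finds the cheap path."""
    return discounted_constant_cost(c0) / discounted_constant_cost(c_low)
```

The ratio is independent of the discount factor and diverges as `c_low` tends to zero, mirroring how the PoA lower bound in (18) becomes unbounded as the cost gap widens.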
IV-B PoA Analysis of Information Hiding Mechanism
In this subsection, we further analyze the PoA lower bound caused by the hiding mechanism. Let denote the PoA caused by the hiding mechanism, which is similarly defined as (17) to compare the social cost in (14) with in (15). In the next proposition, we successfully provide the worst-case analysis to prove ’s lower bound.
Proposition 2.
The proof of Proposition 2 is given in Appendix F of the supplemental material. This PoA bound (19) can be similarly proved by considering the minimum exploration case as in Proposition 1. However, different from the sharing mechanism that always knows the latest system information, users with information delay under the hiding mechanism have a time slot's latency to decide exploration or exploitation, which makes in (19) different from in (18).
Remark 2.
In (19), once to make , then , telling infinitely large efficiency loss under the information-hiding mechanism.
In (19), the lower bound of  also decreases with path number . As  with infinite paths,  approaches , due to users’ late exploitation of stochastic paths, which increases the total cost under information delay .
Given neither the sharing mechanism nor the hiding mechanism performs well for non-myopic users with finite in practice, it is critical to design a new mechanism other than simply sharing or hiding information.
V Design and Analysis of Our User-differential Informational Mechanism
To motivate non-myopic users to follow the socially optimal policy in (16) as much as possible, we aim to design the best mechanism to obtain the minimum possible PoA in this section. As pricing or side-payment mechanisms directly charge users, they are difficult to implement in the existing traffic navigation platforms ([19, 20, 42]). Instead, we aim to design a non-monetary informational mechanism that indirectly restricts information channels to regulate users. Following the mechanism design literature (e.g., [22] and [23]), we provide its general definition below.
Definition 2 (Informational mechanism).
An informational mechanism defines a Bayesian game, where users send costless information and the social planner decides the disclosing level of observable variables based on received information.
By Definition 2, we notice that the information hiding mechanism in Section III-B belongs to informational mechanisms, by simply hiding all traffic information from selfish users. Yet its achieved PoA is arbitrarily large in Proposition 2. Additionally, we analyze existing deterministic recommendation mechanisms (e.g., [17, 18, 24]), which offer state-dependent deterministic recommendations to users while concealing other information, and demonstrate that these mechanisms still result in an infinite PoA, motivating our UPR mechanism design.
Lemma 5.
The existing deterministic-recommendation mechanisms make in this system.
The proof of Lemma 5 is given in Appendix G of the supplemental material. Intuitively, when a user receives recommendation , it still infers the expected travel cost for each path based on its private information . As a result, the recommendation does not influence the user’s selfish path choice, ultimately leading to an infinite PoA.
We next examine the efficiency limit of any informational mechanism for our HILL system below.
Proposition 3.
In our dynamic HILL system, under any informational mechanism in Definition 2, the achieved PoA satisfies
The proof of Proposition 3 is given in Appendix H of the supplemental material. To obtain this minimal possible PoA, we first show that no informational mechanism in Definition 2 can prevent users’ maximum exploration of stochastic paths. Let  for any stochastic path . Given , there will be  flow of users traveling on each stochastic path  at initial , according to (7). Since initial hazard belief , users will observe  with good traffic condition on path . Then each user is aware of the public belief  on its chosen stochastic path with minimum information age  for the next time . Consequently, they will behave the same as under the sharing mechanism by continuing to travel on their chosen stochastic paths since , making any informational mechanism fail to regulate them.
On the other hand, according to in (16), the socially optimal policy ensures flow of users to choose deterministic path 0 and avoid congestion cost on stochastic paths. In this case, the resulting minimal PoA is .
Building upon Proposition 3, we want to design the best informational mechanism to achieve . Based on the worst-case analysis in the proof above, when , no informational mechanism is able to prevent users with minimum information age from exploiting stochastic path at the current time. When a number of stochastic paths reach their minimum public hazard belief (i.e., for path according to Lemma 4), selfish users have the maximum tendency to exploit these paths and the resultant routing flow on path reaches its maximum:
(20)
Given the number of stochastic paths in good condition at time , is known to all users. Alternatively, when , no informational mechanism can persuade users with minimum age to choose path at . Hence, we can only focus on incentivizing the remaining users with non-trivial information age to align with the platform’s optimal path recommendations.
On one hand, unlike the hiding mechanism that blinds all users, this user-differential approach inspires us to differentiate our path recommendations to users of differential . On the other hand, unlike the sharing mechanism, our mechanism design shares with users not the public belief history  but the previous flow distribution  to improve efficiency. As users (with information age ) can reverse-engineer from the system’s path recommendations to infer the actual hazard belief, we propose probabilistic recommendations below.
In the following mechanism, let represent the recommended flow of path at time , which is always positive and keeps learning from all stochastic paths over time.
Definition 3 (User-differential Probabilistic Recommendation (UPR) mechanism).
At any given time , the platform operates in the following two steps:
Step 1: Selective information hiding. First, the platform hides public belief history from all users but discloses previous routing flow (i.e., for any path ).
Step 2: Probabilistic Recommendation. Then the platform performs the following user-differential random signaling strategies to disclose selected information.
•
For the flow of users observing  on path  with minimum age , the platform just recommends them to continue and make their own selfish routing decisions in (12).
•
For the other users with  of any stochastic path , the platform independently recommends each of them to travel on path  with probability
(21)
Besides, the platform independently recommends each of them to travel to deterministic path with probability
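Since the exact recommendation probabilities in (21) depend on system parameters omitted here, the following is a minimal Python sketch of the two-step signaling in Definition 3. The function name, the user representation, and the placeholder probability `p_explore` are illustrative assumptions, not the paper's exact rule in (21).

```python
import random

def upr_recommend(users, good_paths, p_explore=0.05):
    """Sketch of UPR's user-differential signaling (Definition 3).

    users: list of dicts with 'age' (information age) and 'last_path'.
    good_paths: set of stochastic paths last observed in good condition.
    p_explore: placeholder for the state-dependent probability in (21).
    Returns one recommendation per user (None = decide selfishly).
    """
    recs = []
    for u in users:
        if u["age"] == 1 and u["last_path"] in good_paths:
            # Case 1: minimum-age users who just observed a good state
            # are left to make their own selfish routing decisions.
            recs.append(None)
        else:
            # Case 2: every other user independently receives a
            # randomized recommendation: explore a stochastic path
            # with small probability, else take deterministic path 0.
            if good_paths and random.random() < p_explore:
                recs.append(random.choice(sorted(good_paths)))
            else:
                recs.append(0)
    return recs
```

Because the recommendation is randomized and user-differential, a recommended user cannot fully reverse-engineer the hidden belief state, which is the persuasive property the mechanism relies on.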
Here, our UPR mechanism does not prevent users, who just observed with minimum age , from pursuing selfish interests by exploiting stochastic path at time . However, for those users who just observed on another path and other users who just traveled on deterministic path 0 with obvious age , they receive probabilistic path recommendations in (21) to deviate from this strong hazard path and try other paths.
Our mechanism proactively designs (21) in a persuasive way: a user with , upon hearing the realized recommendation , will reverse-engineer that path  is in a good condition , by comparing the corresponding recommendation probabilities in the two cases of (21). On the other hand, if a user is recommended to path , it will similarly believe that any stochastic path  has a bad traffic condition with . As a result, it will follow this recommendation to choose path 0.
In summary, our UPR mechanism enables users to exploit stochastic path  when the public hazard belief on this path is low. On the other hand, when stochastic path  has a high hazard belief , our UPR mechanism still ensures an  flow of users exploring this path to learn  for the future. Thus, our UPR mechanism successfully avoids the worst-case scenarios with minimum exploration under both the sharing and hiding mechanisms. Note that the main idea of our UPR mechanism can also be generalized to regulate selfish players in other restless MAB games (e.g., [36, 37]). In these problems, the social planner provides differential recommendations to users based on their private information about each arm (path), which influences their posterior beliefs and consequently regulates their selfish arm choices.
Following [43] and [44], we define users’ Interim Individual Rationality (IIR) to ensure their long-term participation in our UPR mechanism.
Definition 4 (Interim Individual Rationality).
A mechanism ensures Interim Individual Rationality for users with private information if, prior to the realization of the cost incurred by following the mechanism, each user anticipates achieving a cost level that is at least as low as their reservation cost.
Based on the above analysis and Definition 4, we prove that our UPR mechanism ensures IIR for all users and achieves close-to-optimal performance in the following theorem.
Theorem 1.
The proof of Theorem 1 is given in Appendix I of the supplemental material. Recall from Proposition 3 that no informational mechanism can achieve a PoA less than ; this demonstrates that our UPR mechanism is the best at reducing the PoA. The worst case under our UPR mechanism is the same as in Proposition 3, which occurs at users’ maximum exploration when all stochastic paths are in good traffic conditions with .
In addition to the worst-case PoA analysis presented in Theorem 1, we further conduct experiments using real datasets to examine the average performance of our UPR mechanism.

VI Experiment Validation Using Real Datasets
In this section, we conduct extensive experiments to evaluate our UPR mechanism’s average performance versus the information sharing mechanism (used by Google Maps and Waze), the latest hiding mechanism in routing game literature (e.g., [17, 32, 24]), and the social optimum in terms of the average long-term social cost.
Without much loss of generality, we focus on populous cities exhibiting time-varying road conditions and experiencing heavy traffic congestion. As such, we use live traffic datasets from Baidu Maps [48] and real-time traffic speed data during similar peak hours to train our traffic model practically. Building on our parallel transportation network in Fig. 1(a), we extend it to a hybrid network by introducing two origin-destination pairs, Shanghai Station (origin 1) to the Bund and World Expo Museum (origin 2) to the Bund, with overlapping paths (i.e., Yan’An Elevated Road and East Yan’An Road) in Fig. 2. This setup captures the typical flow of visitors traveling from origin 1 and origin 2 to the Bund.
For the route from origin 1 to the Bund, the options are:
•
Path 1-1: via First Haining Road, Henan Road, and then East Yan’An Road.
•
Path 1-2: via North Chengdu Road, Yan’An Elevated Road, and East Yan’An Road.
For the route from origin 2 to the Bund, the options are:
•
Path 2-1: via North-South Elevated Road, Yan’An Elevated Road, and East Yan’An Road.
•
Path 2-2: via South Zhongshan Road and East Zhongshan No.2 Road.
The mean arrival rates are  and  cars every  minutes at origin 1 and origin 2, respectively, with a standard deviation of . To develop a practical congestion model with travel costs, we mine and analyze extensive data on the SpeedBand values across the eight roads, which fluctuate every  minutes in the peak-hour periods. Our data analysis verifies that the traffic conditions on North Chengdu Road, Yan’An Elevated Road, and East Yan’An Road in path 1-2, as well as North-South Elevated Road in path 2-1, can be well approximated as a Markov chain with two discretized states (high and low traffic states) for learning, as depicted in Fig. 1(b). In contrast, Haining Road and Henan Road in path 1-1, as well as South Zhongshan Road and East Zhongshan No.2 Road in path 2-2, display more deterministic conditions. Following similar methodologies from [45, 46, 47], we train the hidden Markov model (HMM) and obtain the transition probability matrices for North Chengdu Road, Yan’An Elevated Road, East Yan’An Road, and North-South Elevated Road, respectively:
In this hybrid two-origin network, users assess the travel latency as the travel cost. Initially, the four stochastic roads’ hazard beliefs satisfy , , and before the peak hours.
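As a simplified illustration of the transition-matrix training above (the full pipeline follows the HMM methods of [45, 46, 47]), one can obtain a maximum-likelihood estimate of a road's two-state chain by counting transitions in the discretized SpeedBand sequence. The function below is an assumed sketch, not the exact training procedure used in the experiments.

```python
def fit_two_state_chain(states):
    """Maximum-likelihood transition matrix for a binary state sequence.

    states: sequence of 0/1 observations (e.g., low/high SpeedBand),
    one per time slot. Returns a 2x2 row-stochastic matrix P, where
    P[i][j] estimates Pr(next state = j | current state = i).
    """
    counts = [[0, 0], [0, 0]]
    for s, s_next in zip(states, states[1:]):
        counts[s][s_next] += 1
    P = []
    for row in counts:
        total = sum(row)
        # Fall back to a uniform row if a state was never visited.
        P.append([c / total if total else 0.5 for c in row])
    return P
```

For example, `fit_two_state_chain([0, 0, 1, 1, 0, 1])` counts the five observed transitions and returns the empirical transition frequencies row by row.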
Under both sharing and hiding mechanisms, users at origin 1 estimate the combined travel costs of North Chengdu Road and Yan’An Elevated Road as  for path 1-2, incorporating the congestion cost from the expected flow of users on path 2-1. Since paths 1-1 and 1-2 share East Yan’An Road, users at origin 1 compare  with the fixed cost  (covering the combined travel costs of Haining Road and Henan Road) to make routing decisions. Similarly, users at origin 2 estimate the combined travel costs of North-South Elevated Road, Yan’An Elevated Road, and East Yan’An Road for path 2-1 as . Then they compare  and the fixed cost  to determine, based on self-interest, which path minimizes their personal travel costs. In contrast, the socially optimal policy continuously searches for the path that minimizes the long-term social cost for all users. To apply our UPR mechanism in Definition 3, we combine the three roads on paths 1-2 and 2-1 and update the parameters in (21) to provide probabilistic recommendations. These recommendations direct users with differential information ages toward specific paths from origin 1 and origin 2, based on the public hazard beliefs of the four roads as in (21). In addition, we set the discount factor  for the long-term social costs and  as the minimum recommended flow in (7), (16), and (21). After training a practical traffic model and applying the mechanisms, we are ready to conduct experiments to show our UPR mechanism’s average performance versus the other mechanisms. We run  experiments (each with  time slots or equivalently  minutes) for averaging the long-term social costs during peak hours.
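The metric compared in these experiments is the long-term social cost, a discounted sum of per-slot social costs averaged over independent runs. A minimal sketch of this averaging, with an illustrative discount factor and hypothetical per-slot cost traces, is:

```python
def discounted_social_cost(costs, rho=0.9):
    """Discounted long-term social cost over a finite horizon:
    sum over t of rho**t * c_t, for a per-slot cost trace `costs`."""
    return sum((rho ** t) * c for t, c in enumerate(costs))

def average_over_runs(runs, rho=0.9):
    """Average the discounted long-term cost over independent
    experiment runs, each given as its per-slot cost trace."""
    return sum(discounted_social_cost(r, rho) for r in runs) / len(runs)
```

For instance, `discounted_social_cost([1, 1], rho=0.5)` returns 1.5, and averaging the traces `[1]` and `[3]` over two runs returns 2.0.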

Fig. 3 compares the long-term social cost performances of the information sharing, hiding, the optimum, and our UPR mechanisms versus time horizon . It demonstrates that our UPR mechanism has less than efficiency loss from the social optimum for any time horizon , while the sharing and hiding mechanisms have around and efficiency losses, respectively.
VII Conclusions
In this paper, we study how to regulate human-in-the-loop learning (HILL) in repeated routing games, where we face non-myopic users of differential past observations and need new mechanisms (preferably non-monetary) to persuade users to adhere to the optimal path recommendations. We first prove that no matter under the information sharing mechanism in use or the latest routing literature’s hiding mechanism, the resultant price of anarchy (PoA) for measuring the efficiency loss from social optimum can approach infinity, telling arbitrarily poor exploration-exploitation tradeoff over time. Then we propose a novel user-differential probabilistic recommendation (UPR) mechanism to differentiate and randomize path recommendations for users with differential learning histories. We prove that our UPR mechanism ensures interim individual rationality for all users and significantly reduces  to close-to-optimal , which cannot be further reduced by any other non-monetary mechanism. In addition to theoretical analysis, we conduct extensive experiments using real-world datasets to generalize our routing graphs and validate the close-to-optimal performance of our UPR mechanism.
References
- [1] T. Haselton, “How to use Google Waze,” https://www.cnbc.com/2018/11/13/how-to-use-google-waze-for-directions-and-avoiding-traffic.html, 2018.
- [2] S. Olariu, “Vehicular crowdsourcing for congestion support in smart cities,” Smart Cities, vol. 4, no. 2, pp. 662–685, 2021.
- [3] J. Kang, J. Zhang, H. Yang, D. Ye, and M. S. Hossain, “When metaverses meet vehicle road cooperation: Multi-agent drl-based stackelberg game for vehicular twins migration,” IEEE Internet of Things Journal, 2024.
- [4] S. Vasserman, M. Feldman, and A. Hassidim, “Implementing the wisdom of Waze,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
- [5] M. R. Hasan, A. L. Bazzan, E. Friedman, and A. Raja, “A multiagent solution to overcome selfish routing in transportation networks,” in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2016, pp. 1850–1855.
- [6] A. Slivkins et al., “Introduction to multi-armed bandits,” Foundations and Trends® in Machine Learning, vol. 12, no. 1-2, pp. 1–286, 2019.
- [7] F. Li, D. Yu, H. Yang, J. Yu, H. Karl, and X. Cheng, “Multi-armed-bandit-based spectrum scheduling algorithms in wireless networks: A survey,” IEEE Wireless Communications, vol. 27, no. 1, pp. 24–30, 2020.
- [8] S. Gupta, S. Chaudhari, G. Joshi, and O. Yağan, “Multi-armed bandits with correlated arms,” IEEE Transactions on Information Theory, vol. 67, no. 10, pp. 6711–6732, 2021.
- [9] A. Bozorgchenani, S. Maghsudi, D. Tarchi, and E. Hossain, “Computation offloading in heterogeneous vehicular edge networks: On-line and off-policy bandit solutions,” IEEE Transactions on Mobile Computing, vol. 21, no. 12, pp. 4233–4248, 2021.
- [10] H. Wang, Y. Yang, E. Wang, W. Liu, Y. Xu, and J. Wu, “Truthful user recruitment for cooperative crowdsensing task: A combinatorial multi-armed bandit approach,” IEEE Transactions on Mobile Computing, vol. 22, no. 7, pp. 4314–4331, 2022.
- [11] F. Li, X. Yuan, L. Wang, H. Yang, D. Yu, W. Lyu, and X. Cheng, “Collaborative learning in general graphs with limited memorization: Complexity, learnability, and reliability,” IEEE/ACM Transactions on Networking, vol. 31, no. 6, pp. 3222–3237, 2023.
- [12] C. Shi and C. Shen, “Federated multi-armed bandits,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 11, 2021, pp. 9603–9611.
- [13] L. Yang, Y.-Z. J. Chen, S. Pasteris, M. Hajiesmaili, J. Lui, and D. Towsley, “Cooperative stochastic bandits with asynchronous agents and constrained feedback,” Advances in Neural Information Processing Systems, vol. 34, pp. 8885–8897, 2021.
- [14] L. Yang, Y.-Z. J. Chen, M. H. Hajiemaili, J. C. Lui, and D. Towsley, “Distributed bandits with heterogeneous agents,” in IEEE INFOCOM 2022-IEEE Conference on Computer Communications. IEEE, 2022, pp. 200–209.
- [15] J. Zhu and J. Liu, “Distributed multi-armed bandits,” IEEE Transactions on Automatic Control, 2023.
- [16] W. Mazurczyk, S. Wendzel, S. Zander, A. Houmansadr, and K. Szczypiorski, Information hiding in communication networks: fundamentals, mechanisms, applications, and countermeasures. John Wiley & Sons, 2016.
- [17] H. Tavafoghi and D. Teneketzis, “Informational incentives for congestion games,” in 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2017, pp. 1285–1292.
- [18] M. Wu and S. Amin, “Information design for regulating traffic flows under uncertain network state,” in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2019, pp. 671–678.
- [19] B. L. Ferguson, P. N. Brown, and J. R. Marden, “The effectiveness of subsidies and tolls in congestion games,” IEEE Transactions on Automatic Control, vol. 67, no. 6, pp. 2729–2742, 2021.
- [20] Y. Yue, B. L. Ferguson, and J. R. Marden, “Incentive design for congestion games with unincentivizable users,” in 2021 60th IEEE Conference on Decision and Control (CDC). IEEE, 2021, pp. 4515–4520.
- [21] H. Li and L. Duan, “Online pricing incentive to sample fresh information,” IEEE Transactions on Network Science and Engineering, vol. 10, no. 1, pp. 514–526, 2022.
- [22] D. Fudenberg and J. Tirole, Game theory. MIT press, 1991.
- [23] T. Börgers, An introduction to the theory of mechanism design. Oxford University Press, USA, 2015.
- [24] Y. Li, C. Courcoubetis, and L. Duan, “Recommending paths: Follow or not follow?” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications. IEEE, 2019, pp. 928–936.
- [25] I. Kremer, Y. Mansour, and M. Perry, “Implementing the “wisdom of the crowd”,” Journal of Political Economy, vol. 122, no. 5, pp. 988–1012, 2014.
- [26] Y. Mansour, A. Slivkins, V. Syrgkanis, and Z. S. Wu, “Bayesian exploration: Incentivizing exploration in Bayesian games,” Operations Research, vol. 70, no. 2, pp. 1105–1127, 2022.
- [27] X. Hu, D. Ngo, A. Slivkins, and S. Z. Wu, “Incentivizing combinatorial bandit exploration,” Advances in Neural Information Processing Systems, vol. 35, pp. 37 173–37 183, 2022.
- [28] H. Li and L. Duan, “When congestion games meet mobile crowdsourcing: Selective information disclosure,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 5, 2023, pp. 5739–5746.
- [29] E. Kamenica and M. Gentzkow, “Bayesian persuasion,” American Economic Review, vol. 101, no. 6, pp. 2590–2615, 2011.
- [30] S. Das, E. Kamenica, and R. Mirka, “Reducing congestion through information design,” in 2017 55th annual allerton conference on communication, control, and computing (allerton). IEEE, 2017, pp. 1279–1284.
- [31] J. Kang, Y. Zhong, M. Xu, J. Nie, J. Wen, H. Du, D. Ye, X. Huang, D. Niyato, and S. Xie, “Tiny multi-agent drl for twins migration in uav metaverses: A multi-leader multi-follower stackelberg game approach,” IEEE Internet of Things Journal, 2024.
- [32] F. Farhadi and D. Teneketzis, “Dynamic information design: a simple problem on optimal sequential information disclosure,” Dynamic Games and Applications, vol. 12, no. 2, pp. 443–484, 2022.
- [33] M. Bernasconi, M. Castiglioni, A. Celli, A. Marchesi, F. Trovò, and N. Gatti, “Optimal rates and efficient algorithms for online bayesian persuasion,” in International Conference on Machine Learning. PMLR, 2023, pp. 2164–2183.
- [34] H. Li and L. Duan, “Human-in-the-loop learning for dynamic congestion games,” IEEE Transactions on Mobile Computing, 2024.
- [35] E. Meigs, F. Parise, A. Ozdaglar, and D. Acemoglu, “Optimal dynamic information provision in traffic routing,” arXiv preprint arXiv:2001.03232, 2020.
- [36] C. Tekin and M. Liu, “Online learning of rested and restless bandits,” IEEE Transactions on Information Theory, vol. 58, no. 8, pp. 5588–5611, 2012.
- [37] S. Wang, L. Huang, and J. Lui, “Restless-UCB, an efficient and low-complexity algorithm for online restless bandits,” Advances in Neural Information Processing Systems, vol. 33, pp. 11 878–11 889, 2020.
- [38] E. Castillo, P. Jimenez, J. M. Menendez, and A. J. Conejo, “The observability problem in traffic models: algebraic and topological methods,” IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 2, pp. 275–287, 2008.
- [39] M. Salari, L. Kattan, W. H. Lam, H. Lo, and M. A. Esfeh, “Optimization of traffic sensor location for complete link flow observability in traffic network considering sensor failure,” Transportation Research Part B: Methodological, vol. 121, pp. 216–251, 2019.
- [40] L. He, G. Sun, D. Niyato, H. Du, F. Mei, J. Kang, M. Debbah et al., “Generative ai for game theory-based mobile networking,” arXiv preprint arXiv:2404.09699, 2024.
- [41] E. Koutsoupias and C. Papadimitriou, “Worst-case equilibria,” in Annual symposium on theoretical aspects of computer science. Springer, 1999, pp. 404–413.
- [42] F. Li, Y. Chai, H. Yang, P. Hu, and L. Duan, “Incentivizing Massive Unknown Workers for Budget-Limited Crowdsensing: From Off-Line and On-Line Perspectives,” IEEE/ACM Transactions on Networking, 2024.
- [43] P. Cramton, R. Gibbons, and P. Klemperer, “Dissolving a partnership efficiently,” Econometrica: Journal of the Econometric Society, pp. 615–632, 1987.
- [44] G. Kosenok and S. Severinov, “Individually rational, budget-balanced mechanisms and allocation of surplus,” Journal of Economic Theory, vol. 140, no. 1, pp. 126–161, 2008.
- [45] S. R. Eddy, “Profile hidden markov models.” Bioinformatics (Oxford, England), vol. 14, no. 9, pp. 755–763, 1998.
- [46] Z. Chen, J. Wen, and Y. Geng, “Predicting future traffic using hidden markov models,” in 2016 IEEE 24th international conference on network protocols (ICNP). IEEE, 2016, pp. 1–6.
- [47] A. Saad, H. F. Schepker, B. Staehle, and R. Knorr, “Whitespace prediction using hidden markov model based maximum likelihood classification,” in 2019 IEEE 89th Vehicular Technology Conference (VTC2019-Spring). IEEE, 2019, pp. 1–7.
- [48] Baidu, “Baidu Maps Open Platform,” https://lbsyun.baidu.com/index.php?title=webapi/traffic, 2023.
Hongbo Li (S’24-M’24) received the Ph.D. degree from Singapore University of Technology and Design (SUTD) in 2024. He is currently a Postdoctoral Research Fellow with the Pillar of Engineering Systems and Design, SUTD. In 2024, he was a Visiting Scholar at The Ohio State University (OSU), Columbus, OH, USA. His research interests include networked AI, game theory and mechanism design, and machine learning theory. |
Lingjie Duan (S’09-M’12-SM’17) received the Ph.D. degree from The Chinese University of Hong Kong in 2012. He is an Associate Professor at the Singapore University of Technology and Design (SUTD) and is an Associate Head of Pillar (AHOP) of Engineering Systems and Design. In 2011, he was a Visiting Scholar at University of California at Berkeley, Berkeley, CA, USA. His research interests include network economics and game theory, network security and privacy, energy harvesting wireless communications, and mobile crowdsourcing. He is an Associate Editor of IEEE/ACM Transactions on Networking and IEEE Transactions on Mobile Computing. He was an Editor of IEEE Transactions on Wireless Communications and IEEE Communications Surveys and Tutorials. He also served as a Guest Editor of the IEEE Journal on Selected Areas in Communications Special Issue on Human-in-the-Loop Mobile Networks, as well as IEEE Wireless Communications Magazine. He is a General Chair of WiOpt 2023 Conference and is a regular TPC member of some other top conferences (e.g., INFOCOM, MobiHoc, SECON). He received the SUTD Excellence in Research Award in 2016 and the 10th IEEE ComSoc Asia-Pacific Outstanding Young Researcher Award in 2015. |
-A Proof of Lemma 1
We consider a typical two-path network with one deterministic path (path 0) and one stochastic path (path 1) to derive the three cases of (7) in Lemma 1, depending on the relationship between the cost  of deterministic path 0 and  of stochastic path 1.
Under the information-sharing mechanism, a selfish user will always choose the path that minimizes his own long-term cost at any . Define to be the minimal long-run expected cost of a single user, given flow of users choosing path 1 except for himself. Denote this user’s selfish decision under the sharing mechanism by . We obtain:
(23)
by estimating under , where and are cost-to-go functions under and , respectively.
From (23), if is obviously greater than under , then . This is because some other users will travel on stochastic path 1 and share the same observation there, and this user can always exploit the shared information on path 1 in the next time slot, no matter whether he chooses path or not. In this case, he will only compare immediate costs of the two paths in (23) to make his decision. Therefore, at the Nash equilibrium, the immediate travel costs on the two paths satisfy
Then the solution of has the following three cases:
•
If ,  with the maximum flow on stochastic path 1, which is the second case of (7).
•
If , we obtain , which is the third case of (7).
•
On the other hand, if ,  may be zero without other users choosing path 1. In this case, , as there will be no information update if this user also chooses deterministic path 0. Then if , this user will choose stochastic path 1 for a lower long-term cost, and thus . Otherwise, no user will choose stochastic path 1, and . In summary, , which is the first case of (7).
This finishes the proof of decision-making for the two-path network.
For an arbitrary number , if the travel cost of a stochastic path  is larger than that of the other paths, a user also compares his long-term costs  and . In the other cases, he will balance the immediate cost on path  with the other paths to reach the Nash equilibrium.
-B Proof of Lemma 2
We still consider the typical two-path network to derive and for users with information age and , respectively.
Given  flow of users exploring path  in the last time slot, the current flow of users with  is thus . As they have the latest information on this path with , they will infer other users’ private belief  with  according to (4) and the exact flow  of these users on path . Then they decide  to minimize their own costs. Thus, we first derive  for users with .
For the flow of users with , they only hold rough belief to decide to minimize their expected long-run costs. To realize the expected flow at the Nash equilibrium, the probability for each of them choosing stochastic path 1 is thus , which is derived by (7). Therefore, the routing flow of users with on path 1 is:
We now go back to users with minimum information age to decide . As these users know exact of path 1, they aim to realize routing flow on path 1 at the Nash Equilibrium. Given flow of users choosing stochastic path 1 in (13) already, they will compare with to make decisions:
•
If , none of the users with  will choose path 1.
•
Otherwise, there will be  flow of users with  choosing path 1. This flow is upper bounded by the total flow of users with minimum age , i.e., .
In summary, we finally obtain  in (10). Given  in (12) and  in (13), the total flow of users finally choosing path 1 is .
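The mixed strategy in the proof above can be sketched as follows: to realize an expected flow on path 1 among the users with information age 2, each such user independently chooses path 1 with the derived probability (expected flow divided by the user mass). The function name and the numbers in the example are illustrative assumptions.

```python
import random

def realize_flow(num_users, target_flow):
    """Each of num_users independently picks path 1 with probability
    target_flow / num_users, so the expected number of users routed
    to path 1 equals target_flow (the Nash-equilibrium expected flow).
    Returns the realized number of users on path 1."""
    p = target_flow / num_users
    return sum(1 for _ in range(num_users) if random.random() < p)
```

Averaged over many rounds, the realized flow concentrates around the target, which is why the equilibrium condition is stated in expectation.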
-C Proof of Lemma 3
In (15), even if the expected travel cost on path is much higher than others, the socially optimal policy may still recommend a small flow of users to learn on path . As , this infinitesimal flow of users’ immediate total cost is negligible. At the same time, their learned information on path can help reduce future costs for all users. Therefore, the socially optimal policy always decides for any stochastic path .
As there are always some users traveling on stochastic path  to update , according to (15), the socially optimal policy only needs to minimize the immediate social cost , which is defined in (8). To derive the optimal flow , we calculate the first-order derivative of :
Then there are also three cases for deciding the optimal routing flow :
•
If ,  is always true. In this case, , which is the second case of (16).
•
If ,  is always true. In this case, , which is the first case of (16).
•
Otherwise, by solving , we obtain , which is the third case of (16).
This completes the proof of Lemma 3.
-D Proof of Lemma 4
We first prove that increases monotonically with in (2). Then we prove is lower- and upper-bounded by and at and , respectively. Finally, according to the definitions of in (3) and in (4), we can obtain the same bounds for .
According to (2), we obtain the first-order derivative of  w.r.t.  below:
which is always greater than  based on the Markov chain in Fig. 1(b). If  with a good observation, then , which is the minimum. If instead , we obtain the maximum hazard belief .
Therefore, public belief belongs to the range for any . And private belief holds the same bounds.
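Under the two-state chain of Fig. 1(b), the no-observation belief update is linear in the current belief, which is what makes the monotonicity and bounds above immediate. A minimal sketch, with assumed transition-probability notation `q01` = Pr(good to bad) and `q11` = Pr(bad to bad), is:

```python
def belief_update(x, q01, q11):
    """One-step prior update of the hazard belief for a two-state
    Markov chain: Pr(bad at t+1) = x*q11 + (1-x)*q01, where x is the
    current hazard belief. The slope w.r.t. x is q11 - q01, which is
    positive for a positively correlated chain, so the update is
    monotone in x and every iterate stays within [q01, q11]."""
    return x * q11 + (1 - x) * q01
```

Starting from x = 0 (a fresh good observation) the belief jumps to q01, the lower bound; starting from x = 1 (a fresh bad observation) it moves to q11, the upper bound, matching the range derived in the proof.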
-E Proof of Proposition 1
We will prove this lower bound of PoA in (18) by showing the worst-case scenario.
At the initial time  with , according to (7), a  flow of selfish users will choose to explore each stochastic path . As there are  stochastic paths, there is a  probability for them to find at least one stochastic path with . In this case, the expected social cost caused by selfish users satisfies
(24)
Here the first inequality is due to
which is because selfish users will never explore any path with probability after finding there, and is lower bounded by the best case that all users keep exploiting all stochastic paths with probability . Note that according to (2) for any time .
Under the selfish social cost in (24), we next calculate the optimal social cost. Different from the minimum exploration of the selfish policy, the socially optimal policy always lets an  flow of users explore any stochastic path . Even if they find bad traffic conditions  on path , there will still be an  flow exploring this path until they find good conditions. If users observe  on path , let  denote the maximum number of time slots before observing good condition  again on path , which satisfies , where  is a small infinitesimal. Here  becomes  after observing  on path , according to Lemma 4. Then we obtain . After observing  again, all users can exploit this path for near-zero cost there. As there are  paths, after at most  time slots, users will observe at least one path with a good condition. Given  and , we can finally obtain the minimum social cost based on the above analysis:
(25)
Combining the social cost caused by the selfish players in (24) above and the minimum social cost in (25), we obtain
as to make .
This PoA lower bound approaches infinity as under .
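The divergence of this lower bound can be illustrated with a toy numeric sketch. The cost parameters below are hypothetical, not the paper's exact model: the selfish policy pays the safe-path cost every round, while the socially optimal policy pays it only during a short exploration phase and a vanishing cost afterwards, so the ratio grows without bound.

```python
# Toy numeric sketch (hypothetical costs, not the paper's exact model) of why
# the PoA lower bound diverges: the selfish policy pays the safe-path cost c0
# every round, while the socially optimal policy pays c0 only during a short
# exploration phase and a near-zero cost eps per round afterwards.

def selfish_cost(c0, T):
    # all selfish users stuck on the deterministic path for all T rounds
    return c0 * T

def optimal_cost(c0, T, t_explore, eps):
    # pay roughly c0 during t_explore exploration rounds, then eps per round
    return c0 * t_explore + eps * (T - t_explore)

c0, T, t_explore = 1.0, 1000, 5
ratios = [selfish_cost(c0, T) / optimal_cost(c0, T, t_explore, eps)
          for eps in (1e-1, 1e-2, 1e-3, 1e-4)]

# the cost ratio (a PoA-style lower bound) grows without bound as eps -> 0
assert all(r1 < r2 for r1, r2 in zip(ratios, ratios[1:]))
```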
F Proof of Proposition 2
To prove Proposition 2, we consider the same worst-case scenario with minimum exploration as in the proof of Proposition 1 in Appendix E.
Under the information-hiding mechanism, according to (12) and (13), there will be a small flow of users choosing each stochastic path to travel. For this flow of users, if they observe on path , they still choose this path at the next time slot . If this flow of users observes there, none of them will go to this path again, since based on in (12). For the other flow of users choosing path 0 at time , there is no observation to update their posterior belief for the next time slot . Thus, at time , these users with information age still choose path 0. However, at the end of time , these users can observe to update their beliefs :
• If they observe , they infer that , and they keep choosing path 0 from the next time slot .
• If they observe , they infer that , and they will choose a stochastic path with good conditions, since .
Based on the analysis above and our proof of Proposition 1 in Appendix E, we calculate the social cost under the hiding mechanism as:
where is chosen such that the immediate cost for all users is at time .
For the socially optimal policy, it still always lets some users explore each stochastic path, and the resulting minimum social cost is the same as (25) in Appendix E. Therefore, we calculate the PoA under the hiding mechanism as
which approaches infinity under the same conditions as in Proposition 1.
G Proof of Lemma 5
We demonstrate that the deterministic recommendation mechanism results in an infinite PoA by showing that users continue to make routing decisions as in the information-hiding mechanism.
Under the deterministic recommendation mechanism, when a user is advised to select a stochastic path, say , the only information it receives is . This limited information does not alter the user's posterior belief about path , leading it to estimate its expected travel cost as it would without any recommendation. Consequently, the user disregards this recommendation. Similarly, if a user is advised to take the safe path , its posterior beliefs regarding all other stochastic paths remain unaffected. Thus, each user continues to follow when deciding on a path, resulting in , as established in Proposition 2.
H Proof of Proposition 3
To prove Proposition 3, we consider the worst case with maximum exploration on stochastic paths.
Since users with minimum information age have complete information of path , any informational mechanism cannot change their path decisions. For example, consider and . Users with know the exact hazard belief on path . As the travel cost on path satisfies , these users will always choose this path to exploit its low cost. Motivated by this example, we look for the worst case with the maximum exploration that every user has, in order to prove Proposition 3.
Let , , , and for any stochastic path . According to our analysis of in (7), there will be flow of users traveling on each stochastic path at the initial time . Then each user is aware of the public belief on his chosen stochastic path with minimum information age for the next time . Consequently, under , they will behave the same as under the sharing mechanism by continuing to choose their formerly chosen paths since , making any informational mechanism fail to regulate them. The social cost caused by selfish users is
On the other hand, according to (16) in Lemma 3, the socially optimal policy ensures that the flow of users choosing deterministic path 0 satisfies:
to reduce the congestion cost on stochastic paths. Then we obtain the minimum social cost as:
which is derived by substituting into the right-hand side of the inequality. Finally, the resulting minimum PoA is
Therefore, any informational mechanism cannot reduce PoA to less than .
I Proof of Theorem 1
We first analyze the received recommendations of users with different private information under our UPR mechanism. Then, we prove that our UPR mechanism ensures IIR for all non-myopic users, prompting them to follow the system’s random path recommendations. Finally, we demonstrate that under our UPR mechanism’s random recommendations, users will only over-explore stochastic paths, thereby reducing the PoA to the minimum ratio (22), as stated in Proposition 3.
At any given time , our UPR provides differential recommendations to users, who have different past observations, to influence their path decisions:
1. For the flow of users who observed on path with minimum age , any informational mechanism cannot stop their exploration of this path, based on our analysis in Proposition 3. Therefore, the platform simply recommends that they follow their own selfish routing decisions in (7), and these users will stick to path at time .
2. For users with on stochastic path , the platform will provide probabilistic recommendations to let them choose any stochastic path , where is the maximum flow of users on any path . The condition in set means that the platform will recommend some users with to choose this path in the current time slot only if the flow on path in the last time slot satisfies . If , our UPR mechanism simply lets the former flow of users with follow their selfish decisions to determine , as in case 1 above.
In the second case, given , if path is in a good condition, our UPR aims to realize flow of users exploiting this path. Conversely, if path is in a bad traffic condition, our UPR aims to ensure at least an flow of users choosing this path, in order to achieve the social optimum in (16). Note that if under , then in addition to the recommended flow , users with will make their selfish decisions according to (12) to choose this path.
Based on the above analysis, we next prove that no matter which private recommendation a user receives under our UPR mechanism, it is IIR for the user to follow it.
Suppose a user with receives a private path recommendation from our UPR mechanism. According to (20), this user will infer the probability of a good condition on stochastic path as:
(26)
which approaches as . Here the first equality is derived by Bayesian inference and the second equality is due to the law of total probability. Based on (26), it is almost certain for this player that .
Then this user infers the expected travel cost on path as:
which is the minimum expected cost. Based on our analysis that , following the recommendation to choose stochastic path is IIR for this player.
Similarly, if the user receives recommendation , it infers that any stochastic path is almost certainly in a bad condition . Suppose that the other users follow our UPR recommendations. Then, for any stochastic path with good condition , the expected flow of users on it will be . Therefore, following the recommendation to choose path 0 is individually rational.
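The Bayesian inference behind such a recommendation can be sketched as follows. The recommendation probabilities here are hypothetical, not the paper's exact design in (20)-(21): a user told to take a stochastic path applies Bayes' rule with the law of total probability, and when the platform almost never recommends a bad path, the posterior probability of a good condition approaches one.

```python
# Sketch of the Bayesian inference behind a probabilistic recommendation.
# The recommendation probabilities below are hypothetical, not the paper's
# exact design: a user told to take stochastic path i applies Bayes' rule
# with the law of total probability, and if the platform almost never
# recommends a bad path, the posterior P(good | recommended) -> 1.

def posterior_good(prior_good, p_rec_given_good, p_rec_given_bad):
    """P(good | recommended) via Bayes' rule and total probability."""
    num = prior_good * p_rec_given_good
    den = num + (1 - prior_good) * p_rec_given_bad
    return num / den

prior = 0.5  # hypothetical prior belief that the path is good
# shrink the probability of recommending a bad path and watch the posterior
posteriors = [posterior_good(prior, 0.9, pb) for pb in (0.5, 0.1, 0.01, 0.001)]

assert all(a < b for a, b in zip(posteriors, posteriors[1:]))
assert posteriors[-1] > 0.99  # near-certainty that the path is good
```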
Finally, as users always follow our UPR mechanism's recommendations, we prove that our UPR realizes the minimum PoA in (22), which is the lower bound in Proposition 3.
According to the recommendation probability in (21), if stochastic path has a bad traffic condition , our UPR recommends flow of users with information age on path to choose this path. The other flow of users with , who follow (12) to make their selfish routing decisions, reverse-engineer the recommended flow to make their routing decision . In this case, the total flow of users choosing path under our UPR mechanism satisfies
(27)
where the second equality is derived by (12).
Conversely, if path has a good traffic condition , our UPR mechanism recommends an expected flow of users with information age on path to choose this path. As the other flow of users with will stick to path , our UPR mechanism achieves
(28)
Based on our analysis of (27) and (28), under our UPR mechanism always satisfies
where is derived by (16) in Lemma 3. This means that our UPR mechanism always avoids users' under-exploration of any stochastic path . Therefore, the infinite PoA values in Propositions 1 and 2 are successfully avoided.
As users will only over-explore stochastic paths, the worst-case scenario is the maximum over-exploration in Proposition 3, where all users choose stochastic paths such that . Since always holds, our UPR mechanism always recommends flow of users on path , based on (28). Therefore, our UPR mechanism leads to the same PoA as in Proposition 3, which cannot be further reduced.
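The flow accounting in (27)-(28) can be sketched with hypothetical numbers: the platform's recommended flow fills the gap between the selfish flow and a target level, so a bad path is never under-explored and a good path receives the full exploitation flow. The flow values below are illustrative assumptions, not the paper's parameters.

```python
# Hypothetical flow-accounting sketch of the top-up logic: the platform's
# recommended flow fills the gap between the selfish flow and a target level,
# so a bad path is never under-explored and a good path gets the full
# exploitation flow. The flow values are illustrative assumptions.

f_opt = 0.05   # hypothetical socially optimal exploration flow on a bad path
f_max = 0.30   # hypothetical target exploitation flow on a good path

def total_flow(selfish_flow, path_is_good):
    """Selfish flow plus a recommended top-up to the target level."""
    target = f_max if path_is_good else f_opt
    recommended = max(target - selfish_flow, 0.0)
    return selfish_flow + recommended

# bad path: the total flow never drops below the optimal exploration level
for selfish in (0.0, 0.02, 0.1):
    assert total_flow(selfish, path_is_good=False) >= f_opt
# good path: the total flow reaches the exploitation target
assert abs(total_flow(0.1, path_is_good=True) - f_max) < 1e-12
```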