
To Optimize Human-in-the-loop Learning in Repeated Routing Games

Hongbo Li and Lingjie Duan. This work is also supported in part by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 Grant with Award no. MOE-T2EP20121-0001; in part by SUTD Kickstarter Initiative (SKI) Grant with no. SKI 2021_04_07; and in part by the Joint SMU-SUTD Grant with no. 22-LKCSB-SMU-053. Hongbo Li and Lingjie Duan are with the Pillar of Engineering Systems and Design, Singapore University of Technology and Design, Singapore 487372 (e-mail: [email protected]; [email protected]).
Abstract

Today's navigation applications (e.g., Waze and Google Maps) enable human users to learn and share the latest traffic observations, yet such information sharing simply aids selfish users to predict and choose the shortest paths to jam each other. Prior routing game studies focus on myopic users in oversimplified one-shot scenarios to regulate selfish routing via information hiding or pricing mechanisms. For practical human-in-the-loop learning (HILL) in repeated routing games, we face non-myopic users with differential past observations and need new mechanisms (preferably non-monetary) to persuade users to adhere to the optimal path recommendations. We model the repeated routing game in a typical parallel transportation network, which generally contains one deterministic path and $N$ stochastic paths. We first prove that under either the information sharing mechanism in use or the latest routing literature's hiding mechanism, the resultant price of anarchy (PoA) for measuring the efficiency loss from the social optimum can approach infinity, indicating an arbitrarily poor exploration-exploitation tradeoff over time. Then we propose a novel user-differential probabilistic recommendation (UPR) mechanism to differentiate and randomize path recommendations for users with differential learning histories. We prove that our UPR mechanism ensures interim individual rationality for all users and significantly reduces $\text{PoA}=\infty$ to close-to-optimal $\text{PoA}=1+\frac{1}{4N+3}$, which cannot be further reduced by any other non-monetary mechanism. In addition to theoretical analysis, we conduct extensive experiments using real-world datasets to generalize our routing graphs and validate the close-to-optimal performance of the UPR mechanism.

Index Terms:
human-in-the-loop learning, repeated routing games, price of anarchy, mechanism design

I Introduction

With the ever-increasing number of smart devices owned by human users in transportation networks, mobile crowdsourcing applications (e.g., Waze and Google Maps) have emerged as vital platforms for users to learn and share their observed time-varying traffic information for guiding their daily or repeated routing ([1, 2, 3]). These applications simply publicize all useful traffic information, which encourages selfish users to consistently choose the current shortest routes in each round to minimize their own travel costs ([4, 5]). As a consequence, users miss critical exploration of non-shortest paths that may turn out to be better routes and reduce travel costs in the future.

In systems characterized by dynamic and diverse information dynamics, Multi-Armed Bandit (MAB) problems emerge as a framework for investigating optimal exploration-exploitation trade-offs among stochastic arms or paths (e.g., [6, 7, 8, 9, 10, 11]). For instance, [9] utilizes MAB techniques to forecast congestion levels in rapidly changing transportation environments. Similarly, [10] employs MAB exploration-exploitation strategies to manage the recruitment of previously reliable users alongside new, unproven users in completing crowdsourcing tasks. More recently, research efforts have extended traditional MAB paradigms to distributed settings involving multiple non-myopic agents aiming to minimize overall social costs (e.g., [12, 13, 14, 15]). For instance, [12] explores the use of distributed agents to learn heterogeneous local models associated with each arm, contributing to the social planner’s decision-making process. Additionally, [13] investigates how a fixed flow of agents collaborates in a distributed MAB scenario over repeated interactions, where each agent autonomously makes optimal decisions based on privately acquired arm information. Subsequently, [14] and [15] allow agents to learn and share the arm information with their neighbors for making locally optimal decisions. Overall, these new distributed MAB solutions require all agents to be fully cooperative to adhere to the optimal policy set by the social planner, without any possible selfish deviation to favor individuals’ rewards.

To facilitate human-in-the-loop learning (HILL) in repeated routing games, it is crucial to incentivize selfish users to follow the social planner’s optimal recommendations, enabling both information learning on diverse paths and the long-term reduction of traffic congestion. Prior routing game studies mainly focus on an oversimplified one-shot routing scenario without any information variation over time. They develop information hiding (e.g., [16, 17, 18]) and side-payment or pricing mechanisms (e.g., [19, 20, 21]) to regulate myopic users’ selfish routing choices. However, these mechanisms strongly assume that the social planner possesses all the traffic information to design incentives. Moreover, they only consider persuading myopic users to follow, whereas myopic users do not have past traffic observations as side information to reverse-engineer and deviate. In contrast, our HILL problem deals with a dynamic system with time-varying traffic information, necessitating the social planner to not only reduce immediate congestion caused by users but also incentivize their ongoing information learning on diverse paths.

As compared to pricing or side-payment, informational mechanisms (e.g., information hiding) are non-monetary and easy to implement ([22, 23, 24]). In this paper, we aim to design the best informational mechanism to regulate HILL in repeated routing games, persuading non-myopic users with differential side information in their routing choices. As such, we need to address the following two technical challenges.

  • The first question is how to evaluate existing routing policies' performances for non-myopic users as compared to the social optimum in the dynamic system with time-varying traffic information. Prior routing game literature (e.g., [17, 18]) focuses on myopic users in oversimplified one-shot scenarios, assuming that the social planner knows all traffic conditions to decide routing policies. In practice, the social planner lacks access to live traffic conditions but relies on non-myopic users to explore different paths over time. Although non-myopic, users remain selfish, and their interests are not aligned with the social planner's goal of balancing long-term exploitation and exploration to minimize the overall social cost. Moreover, theoretical analysis of the existing routing policies is lacking, particularly in understanding their performance gaps from the social optimum.

  • The second challenge is how to persuade non-myopic users to follow the optimal path recommendations despite their selfishness and their differential learning histories, which they can use to reverse-engineer the system. In repeated routing games, each user only cares about its own travel cost. It may learn from its own past traffic observations to infer the current system's status and deviate from the optimal recommendation. Existing studies of incentivized exploration ([25, 26, 27, 28]) and Bayesian persuasion ([29, 30, 32, 33]) largely assume that all users are myopic to participate and do not have private side information to play strategically against the system's recommendations.

It is worth noting that there is a recent study related to this paper ([34]), which introduces an informational mechanism aimed at regulating myopic users' distributed information learning in congestion games. However, for practical human-in-the-loop learning in repeated routing games, we face non-myopic users, who focus on their own long-term rewards rather than one-shot immediate rewards. In this case, the mechanism in [34] cannot work, as non-myopic users can reverse-engineer the system status to play against the mechanism. Unlike this work, our paper addresses the two challenges above by proposing new mechanism designs and analysis.

We summarize our key novelty and main contributions as follows.

  • To regulate human-in-the-loop learning (HILL) for repeated routing games: To the best of our knowledge, this work is the first to regulate non-myopic users with differential side information in repeated routing games. We model the repeated routing game in a typical parallel transportation network, which generally contains one deterministic path and $N$ stochastic paths. Here each stochastic path has time-varying unknown traffic conditions to be learned by repeatedly arriving non-atomic users. Furthermore, users' past routing choices create differential side information for selfish routing, making it challenging to design any informational mechanism.

  • Theoretical comparison among information sharing, hiding mechanisms, and social optimum via PoA analysis: We analyze the information sharing and hiding mechanisms as well as the social optimum to derive their corresponding routing recommendations in closed form, respectively. Then we prove that under either today's information sharing mechanism (in Waze and Google Maps) or the latest information hiding mechanism in the routing game literature, the resultant price of anarchy (PoA) for measuring the efficiency loss from the optimum in total travel cost can be arbitrarily large.

  • Best informational mechanism to remedy huge efficiency loss: We propose a novel user-differential probabilistic recommendation (UPR) mechanism to randomly differentiate path recommendations for users with differential learning histories. We prove that our UPR mechanism ensures interim individual rationality for all users and significantly reduces $\text{PoA}=\infty$ to close-to-optimal $\text{PoA}=1+\frac{1}{4N+3}$, which cannot be further reduced by any other informational mechanism. In addition to theoretical analysis, we conduct extensive experiments using real-world datasets to generalize our routing graphs and validate the close-to-optimal average performance of our UPR mechanism.

The rest of the paper is organized as follows. In Section II, we propose the repeated routing game model, the public human-in-the-loop learning (HILL) model under information sharing mechanism, and the differential HILL model under information hiding mechanism. Then in Section III, we first formulate the optimization problems for the sharing and hiding mechanisms, as well as the socially optimal policy, and then analyze the solutions for these three policies. In Section IV, we analytically compare the two mechanisms with the social optimum via PoA analysis. Based on the analysis, we propose our UPR mechanism in Section V. We verify the average performance of our UPR mechanism in Section VI. Finally, Section VII concludes the paper. To facilitate readability, we have summarized all the key notations in Table I.

II System model

(a) A typical parallel transportation network with one deterministic path and $N$ stochastic paths.
(b) The Markov chain for modelling the internal travel cost $c_i(t)$ dynamics of stochastic path $i\in\mathbb{N}=\{1,\cdots,N\}$, alternating between low cost $c_L$ and high cost $c_H$, where $q_{LH}<q_{LL}$ and $q_{HL}<q_{HH}$.
Figure 1: At the beginning of each $t\in\{1,2,\cdots\}$, a unit flow of users arrives at origin O to select a path among all the $N+1$ paths of the network in Fig. 1(a). Here path 0 has a fixed internal travel cost $c_0$. Any other path $i\in\mathbb{N}$ is stochastic and its internal cost $c_i(t)$ varies based on the Markov chain illustrated in Fig. 1(b).

In this section, as in the routing game literature (e.g., [17, 18, 28]), we examine a standard parallel transportation network, which is simple but fundamental, to introduce the repeated routing game model. Then, we proceed to present the public HILL model under the information sharing mechanism and the differential HILL model under the information hiding mechanism for the crowdsourcing platform, respectively.

II-A Repeated Routing Game Model

We consider an infinite discrete time horizon $t\in\{0,1,\cdots\}$. In Figure 1(a), at the beginning of each $t$, there is a unit flow of users arriving at origin O. Then this unit flow of users individually choose paths from all the $N+1$ available paths to travel to their common destination D. As in [17, 18, 28], we generally consider a deterministic path 0 with a fixed internal travel cost $c_0$ alongside $N$ stochastic paths.

Following the existing routing game literature (e.g., [30, 31, 24]), for any stochastic path $i\in\mathbb{N}=\{1,\cdots,N\}$, the internal travel cost $c_i(t)\in\{c_L,c_H\}$ is random and dynamically alternates between a high-cost state $c_H$ and a low-cost state $c_L$ over time, according to the commonly used Markov chain in Figure 1(b). (Besides the supporting references, we also validate the suitability of this Markov chain model in Section VI's experiments.) The internal cost may refer to factors like travel time or fuel consumption, which depend on the condition of each path (e.g., poor visibility, 'black ice' segments [28, 24]). Upon reaching destination D, users repeat the same process in the subsequent time slot $t+1$. The length of a time slot can represent a day, for instance, when people commute from home to the office each morning within a similar time range. Therefore, we generally assume that users are non-myopic, allowing them to retain their prior traffic observations to guide future routing decisions.
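To make the cost dynamics concrete, below is a minimal simulation sketch of the two-state Markov chain in Fig. 1(b). The parameter values are illustrative assumptions only (chosen to satisfy $q_{LH}<q_{LL}$ and $q_{HL}<q_{HH}$), not values from the paper.

```python
import random

def simulate_path_costs(q_LH, q_HH, c_L, c_H, T, start_high=False):
    """Simulate the two-state cost chain of Fig. 1(b) for one stochastic path.

    q_LH: probability of switching from the low-cost to the high-cost state.
    q_HH: probability of remaining in the high-cost state.
    """
    high = start_high
    costs = []
    for _ in range(T):
        costs.append(c_H if high else c_L)
        p_next_high = q_HH if high else q_LH
        high = random.random() < p_next_high  # one Markov transition
    return costs

# Illustrative 'sticky' states: low tends to stay low, high tends to stay high.
print(simulate_path_costs(q_LH=0.1, q_HH=0.8, c_L=1.0, c_H=5.0, T=10))
```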

Given the unit flow of user arrivals at the current $t$, we denote by $f_i(t)\in[0,1]$ the flow of users choosing any stochastic path $i$. Following the existing literature (e.g., [17, 30, 18]), in addition to the internal travel cost $c_i(t)$, each user on path $i$ faces another (external) congestion cost $f_i(t)$ caused by the other users on the same path. Therefore, the individual cost of each user traveling on stochastic path $i\in\mathbb{N}$ is $c_i(t)+f_i(t)$. Similarly, each of the other $f_0(t)$ flow of users on deterministic path 0 has individual cost $c_0+f_0(t)$, where

$$f_0(t) = 1 - \sum_{i=1}^{N} f_i(t).$$

To learn the time-varying internal cost $c_i(t)$ of stochastic path $i\in\mathbb{N}$ for future use, the platform (e.g., Waze) expects users to frequently travel on diverse paths and share their observations. It then observes the user flow $f_i(t)$ and observation $c_i(t)$ via GPS and travel times [35].

TABLE I: Key notations and their meanings in this paper
$N$: The number of stochastic paths.
$c_i(t)$: The internal travel cost of stochastic path $i$.
$c_0$: The fixed internal travel cost of deterministic path 0.
$c_L, c_H$: The low- and high-cost states on stochastic paths.
$q_{LH}, q_{HH}$: The transition probabilities from the low- and high-cost states to the high-cost state in the Markov chain.
$f_i(t)$: The flow of users traveling on any path $i$ at time $t$.
$x_i(t), x_i'(t)$: The prior and posterior public hazard beliefs on stochastic path $i$ under the sharing mechanism.
$\mathbf{x}(t)$: The set of public hazard beliefs of all stochastic paths.
$d_i(t)$: The information age of path $i$ under the hiding mechanism.
$y_{i,d_i(t)}(t)$: The private hazard belief with information age $d_i(t)$ on stochastic path $i$ under the hiding mechanism.
$\mathbf{y}(t)$: The set of private hazard beliefs of all stochastic paths.
$c(\cdot)$: The immediate social cost at time $t$.
$C(\cdot)$: The long-term cost function for all users.
$\rho$: The discount factor.
$\alpha$: The expected arrival rate of users.
$\pi(t)$: The private recommendation for each user under our UPR mechanism at time $t$.

II-B Public HILL under Information Sharing

The HILL model depends on the implemented mechanism, and we first consider the existing information-sharing mechanism used by Google Maps and Waze to share all useful traffic information with the public. After all users make their routing decisions, a flow $f_i(t)$ of users travel on stochastic path $i$ and observe the actual state $c_i(t)$ there. We define a public hazard belief $x_i(t)\in[0,1]$ to be the expected probability of seeing the bad traffic or high-cost condition $c_i(t)=c_H$ at time $t$. Using Bayesian inference, we obtain:

$$x_i(t) = \mathbf{Pr}\big(c_i(t)=c_H \,\big|\, c_i(t-1), x_i(t-1)\big), \quad (1)$$

where $x_i(t-1)$ summarizes the prior public hazard beliefs before $t-1$. Then we introduce the information learning process of this public HILL model below.

  • At the beginning of time $t$, the platform broadcasts the public hazard belief $x_i(t)$ of any stochastic path $i$ in (1) about traffic state $c_i(t)$.

  • During time slot $t$, users traveling on stochastic path $i$ will observe the actual state $c_i(t)$ and report it back to the platform. Then the platform will further update this prior belief to a posterior belief $x_i'(t)$ as follows:

    • If the $f_i(t)$ flow of users observe $c_i(t)=c_H$ on path $i$, then the posterior belief is updated to $x_i'(t)=1$.

    • If these users observe $c_i(t)=c_L$ on path $i$, then the posterior belief is updated to $x_i'(t)=0$.

    • If $f_i(t)=0$ with no user traveling on and learning this path $i$, the posterior belief $x_i'(t)$ remains as $x_i(t)$.

  • At the end of each time slot, the platform updates the public hazard belief from $x_i'(t)$ to $x_i(t+1)$ below, based on the Markov chain in Fig. 1(b):

    $$x_i(t+1) = x_i'(t) q_{HH} + \big(1 - x_i'(t)\big) q_{LH}. \quad (2)$$

The above process repeats as the new time slot $t+1$ begins. Note that the Markov chain-based HILL process described above is equivalent to the existing restless multi-armed bandit (MAB) problem (e.g., [36, 37]). Our subsequent analysis and key results (e.g., Proposition 1, Proposition 2, and our UPR mechanism) can also be generalized if $c_i(t)$ follows a memoryless stochastic process (e.g., a Poisson process or exponential distribution), as the worst-case scenario remains unchanged.
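The belief recursion in (1)-(2) is simple enough to state in a few lines of code. Below is a minimal sketch of one round of the public belief update; the function name and parameter values are our own illustrative choices.

```python
def update_public_belief(x_prior, observation, q_LH, q_HH):
    """One round of the public HILL belief update under information sharing.

    observation: 'H' if travelers on the path saw c_H, 'L' if they saw c_L,
                 or None if no user traveled on the path (f_i(t) = 0).
    Returns the next prior hazard belief x_i(t+1) via the transition in (2).
    """
    if observation == 'H':
        x_post = 1.0        # posterior x'_i(t) = 1
    elif observation == 'L':
        x_post = 0.0        # posterior x'_i(t) = 0
    else:
        x_post = x_prior    # no observation: x'_i(t) stays at x_i(t)
    return x_post * q_HH + (1.0 - x_post) * q_LH  # Markov transition (2)

# An 'L' observation drives the belief to its lower bound q_LH (cf. Lemma 4).
print(update_public_belief(0.5, 'L', q_LH=0.1, q_HH=0.8))  # -> 0.1
```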

II-C Differential HILL under Information Hiding

If the platform chooses to hide this public information $x_i(t)$ from users (e.g., by applying the latest information hiding mechanism [17, 32, 24]), users have to independently update private beliefs based on their own past routing choices and observations of any path $i$. In this subsection, we introduce how users update their differential private beliefs without public information sharing by the platform. To aid users' routing decisions, we allow them to observe the traffic flow of the last time slot (i.e., $f_i(t-1)$ on any path $i$) at the beginning of each $t$, as widely assumed in the existing transportation literature (e.g., [38, 39, 40]).

At the initial time $t=0$, the hazard belief $x_i(0)$ of any stochastic path $i$ is given to all users. Without knowing the public belief $x_i(t)$ but only the past flow $f_i(t-1)$ since $t=1$, users will experience different information ages since their last observations of this path. Denote by $d_i(t)$ the information age of path $i$ for an individual user at time $t\geq 1$, and let $y_{i,d_i(t)}(t)$ denote its corresponding private hazard belief. Then, we proceed to introduce the updates of $d_i(t)$ and $y_{i,d_i(t)}(t)$ under differential HILL in the following two cases.

  • If this user just explored stochastic path $i$ at $t-1$, its information age of this path is set to the minimum $d_i(t)=1$. Therefore, its current private hazard belief is exactly the same as the public belief of the platform:

    $$y_{i,1}(t) = x_i(t). \quad (3)$$
  • If this user did not explore stochastic path $i$ at time $t-1$, its information age $d_i(t)$ would increase by 1 from $d_i(t-1)$. However, as it is able to observe the last flow $f_i(t-1)$ at the beginning of $t$, it can rectify its private belief $y_{i,d_i(t-1)}(t-1)$. For example, if this user observes more users traveling on path $i$ with $f_i(t-1)\geq f_i(t-2)$ in the last time slot, it will believe that this path had a good traffic condition with weak hazard belief $x_i(t-1)=q_{LH}$ according to Fig. 1(b). Otherwise, it infers $x_i(t-1)=q_{HH}$ with a strong hazard belief there. By rectifying the last private belief to the exact public belief $x_i(t-1)$ from the prior flow distribution, its information age stays at $d_i(t)=2$ and does not grow over time. Thus, its private belief for time $t$ is updated to:

    $$y_{i,2}(t) = \mathbb{E}[x_i(t)|x_i(t-1)] = x_i(t-1) q_{HH} + (1 - x_i(t-1)) q_{LH}, \quad (4)$$

    due to (2) and $x_i'(t-1)=x_i(t-1)$ without a new observation.

According to the above analysis, we summarize the dynamics of a user's information age $d_i(t)$ of path $i$ below:

$$d_i(t) = \begin{cases} 1, & \text{if this user explored path } i \text{ at } t-1,\\ 2, & \text{otherwise,} \end{cases} \quad (5)$$

with the self-learning updates of the private hazard beliefs $y_{i,1}(t)$ in (3) and $y_{i,2}(t)$ in (4), respectively.
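A short sketch of the private belief update in (3)-(5) follows; the flow-trend inference for age-2 users is the rectification step described above. Function and variable names are our own.

```python
def private_belief(explored_last_slot, x_public, f_prev, f_prev2, q_LH, q_HH):
    """Private hazard belief y_{i,d_i(t)}(t) under information hiding, per (3)-(5).

    explored_last_slot: whether this user traveled on path i at t-1.
    x_public: the current public belief x_i(t), known only to age-1 users via (3).
    f_prev, f_prev2: observed flows f_i(t-1) and f_i(t-2), used by age-2 users
                     to rectify x_i(t-1) before applying the transition (4).
    Returns (information age d_i(t), private belief).
    """
    if explored_last_slot:              # d_i(t) = 1: fresh observation
        return 1, x_public              # y_{i,1}(t) = x_i(t)
    # d_i(t) = 2: infer x_i(t-1) from the flow trend, then update via (4).
    x_last = q_LH if f_prev >= f_prev2 else q_HH
    return 2, x_last * q_HH + (1.0 - x_last) * q_LH
```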

III Problem Formulations and Analysis for Sharing, Hiding, and Social Optimum

In this section, we first formulate the optimization problems for guiding non-myopic users' repeated routing under the sharing and hiding mechanisms, as well as the social optimum. Then, we solve these optimization problems to determine the routing flow in closed form per time slot for each mechanism/policy.

We first define non-myopic users’ selfish routing decisions in the following.

Definition 1 (Selfish routing decision).

Under the selfish routing decision, each non-myopic user aims to maximize its own long-term $\rho$-discounted expected reward, where $\rho\in(0,1)$ is the discount factor.

Based on Definition 1, each user will make selfish routing decisions under both sharing and hiding mechanisms.

III-A Formulation and Equilibrium for Information Sharing

In this subsection, we consider the information-sharing mechanism (used by Waze) to cater to users’ selfish interests such that users will not deviate from its recommendations.

We first summarize the dynamics of the public hazard belief $x_i(t)$ for each stochastic path $i$ at any time $t$ in a vector below:

$$\mathbf{x}(t) = \{x_1(t),\cdots,x_N(t)\}, \quad (6)$$

each entry of which can be obtained by (2). For the unit user flow at time $t$, the platform first shares the latest public hazard belief set $\mathbf{x}(t)$ with them. Then each non-myopic user will selfishly choose the path that minimizes its own long-run cost.

Let $Q^{(s)}\big(\mathbf{x}(t)\,|\,f_i^{(s)}(\mathbf{x}(t))\big)$ denote the minimal long-run expected cost of a single user since time $t$, given the $f_i^{(s)}(\mathbf{x}(t))\in[0,1]$ flow of non-myopic users choosing path $i$ under the sharing mechanism with superscript $(s)$. For ease of explanation, we take the special case of $N=1$ stochastic path alongside the deterministic path 0 in a typical two-path network example to solve and intuitively explain $f_1^{(s)}(x_1(t))$ below. For an arbitrary number $N\geq 2$, we can similarly compute $f_i^{(s)}(\mathbf{x}(t))$ by balancing the expected costs of all the $N+1$ paths as in (7).

Lemma 1.

Under information sharing, given the unit flow of non-myopic user arrivals at time $t$, the recommended flow of users choosing stochastic path 1 at equilibrium is: (Footnote 2: To realize the equilibrium flow $f_1^{(s)}(x_1(t))=\epsilon$ in the first case, each user is fairly chosen with probability $\epsilon$ to explore path 1, and it will not deviate because $Q^{(s)}(x_1(t)|\epsilon)\leq Q^{(s)}(x_1(t)|0)$ ensures a reduced long-run cost.)

$$f_1^{(s)}(x_1(t)) = \begin{cases} \epsilon\cdot\mathds{1}\{Q^{(s)}(x_1(t)|\epsilon)\leq Q^{(s)}(x_1(t)|0)\}, & \text{if } \mathbb{E}[c_1(t)|x_1(t)]\geq c_0+1,\\ 1, & \text{if } \mathbb{E}[c_1(t)|x_1(t)]\leq c_0-1,\\ \frac{1}{2}+\frac{c_0-\mathbb{E}[c_1(t)|x_1(t)]}{2}, & \text{otherwise}, \end{cases} \quad (7)$$

and $f_0^{(s)}(x_1(t)) = 1 - f_1^{(s)}(x_1(t))$ to path 0, where $\epsilon>0$ is infinitesimal, $\mathds{1}\{(\cdot)\}=1$ if $(\cdot)$ is true, and $\mathds{1}\{(\cdot)\}=0$ otherwise.

The proof of Lemma 1 is given in Appendix A of the supplemental material. From the first case of (7), even if stochastic path 1's internal travel cost $\mathbb{E}[c_1(t)|x_1(t)]$ is larger than path 0's internal cost $c_0$ plus the maximum congestion 1 there, non-myopic users may still choose path 1 with minimum flow $\epsilon$ to learn the possible $c_L$ state for future use, given $Q^{(s)}(x_1(t)|\epsilon)\leq Q^{(s)}(x_1(t)|0)$. If the internal cost of path 0 is larger than the internal cost of path 1 plus the maximum congestion, all user flow will choose path 1 in the second case of (7). Otherwise, the two paths' conditions are comparable and the user flow partitions across both paths to balance the congestion there.

In the last two cases of (7), as there is always a positive $f_1^{(s)}(x_1(t))>0$ flow of users traveling on path 1 to learn fresh information $c_1(t)$ there, non-myopic users will myopically make routing decisions to minimize their immediate individual costs in the current time slot.
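As a quick sanity check on (7), the following sketch evaluates the equilibrium flow for $N=1$. The long-run test $Q^{(s)}(x_1(t)|\epsilon)\leq Q^{(s)}(x_1(t)|0)$ in the first case is abstracted into a boolean flag, since computing $Q^{(s)}$ requires solving the full MDP; all names here are our own.

```python
def sharing_flow_path1(exp_c1, c0, eps=1e-3, exploration_helps=False):
    """Equilibrium flow f_1^{(s)}(x_1(t)) on stochastic path 1 under sharing, per (7).

    exp_c1: expected internal cost E[c_1(t) | x_1(t)].
    exploration_helps: stands in for the long-run comparison
        Q^{(s)}(x_1(t) | eps) <= Q^{(s)}(x_1(t) | 0) in the first case of (7).
    """
    if exp_c1 >= c0 + 1:                 # path 1 looks bad even without congestion
        return eps if exploration_helps else 0.0
    if exp_c1 <= c0 - 1:                 # path 1 dominates even at full congestion
        return 1.0
    return 0.5 + (c0 - exp_c1) / 2.0     # interior case: per-user costs equalize

# Interior case: f_1 = 0.4 and f_0 = 0.6 equalize c_0 + f_0 = E[c_1] + f_1 = 1.6.
print(sharing_flow_path1(exp_c1=1.2, c0=1.0))  # -> 0.4
```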

Then we further analyze the long-term social cost under the sharing mechanism (with the specific (7) for the $N=1$ case). Define $C^{(s)}(\mathbf{x}(t))$ and $c^{(s)}(\mathbf{x}(t))$ to be the long-term and immediate social cost functions, respectively. At each time slot $t$, the immediate social cost is obtained as:

$$c^{(s)}(\mathbf{x}(t)) = f_0^{(s)}(\mathbf{x}(t))\Big(c_0 + f_0^{(s)}(\mathbf{x}(t))\Big) + \sum_{i=1}^{N} f_i^{(s)}(\mathbf{x}(t))\Big(\mathbb{E}[c_i(t)|x_i(t)] + f_i^{(s)}(\mathbf{x}(t))\Big), \quad (8)$$

which includes the current travel cost of the $f_0^{(s)}(\mathbf{x}(t))$ flow of users on deterministic path 0 and the $f_i^{(s)}(\mathbf{x}(t))$ flow of users on any stochastic path $i\in\mathbb{N}$.

After obtaining the immediate social cost (8), we further formulate the long-term cost function $C^{(s)}(\mathbf{x}(t))$ as a Markov decision process (MDP), where the update of the hazard belief $x_i(t+1)$ depends on current users' routing choices and observations of $c_i(t)$ on each path $i$. Consequently, two cases arise for updating $x_i(t+1)$ at each time slot. If $f_i^{(s)}(\mathbf{x}(t))=0$ without users traveling on stochastic path $i$, $x_i'(t)$ remains as $x_i(t)$ and will be updated to $x_i(t+1)$ according to (2). If $f_i^{(s)}(\mathbf{x}(t))>0$ with users traveling on path $i$, then $x_i(t+1)$ will be updated to $q_{HH}$ or $q_{LH}$, depending on whether users' observation there is $c_i(t)=c_H$ or $c_i(t)=c_L$ in Fig. 1(b).

Based on the analysis above, we formulate the long-term $\rho$-discounted cost function under information sharing below:

$$C^{(s)}\big(\mathbf{x}(t)\big) = c^{(s)}(\mathbf{x}(t)) + \rho C^{(s)}\big(\mathbf{x}(t+1)\big), \quad (9)$$

where $c^{(s)}(\mathbf{x}(t))$ is given in (8) and $\rho\in(0,1)$ is the discount factor. Selfish users will not choose any stochastic path $i$ with a higher long-run cost, and thus the $x_i(t)$ of this path may not be updated in time to be useful for the next time slot.
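Since the belief $\mathbf{x}(t+1)$ in (9) is random whenever users observe path $i$, evaluating $C^{(s)}$ means taking the expectation over the transitions described above. Below is a fixed-point evaluation sketch for $N=1$ on a belief grid over $[q_{LH}, q_{HH}]$ (Lemma 4's range). It is a simplification, not the paper's exact procedure: the first case of (7) is approximated by always exploring an $\epsilon$ flow, so every slot yields an observation.

```python
import numpy as np

def sharing_long_run_cost(c0, cL, cH, q_LH, q_HH, rho,
                          eps=1e-3, iters=2000, grid=101):
    """Iterate the expected version of (9) for N = 1 until (near) convergence."""
    xs = np.linspace(q_LH, q_HH, grid)     # belief grid on [q_LH, q_HH]
    C = np.zeros(grid)
    for _ in range(iters):
        C_new = np.empty(grid)
        for k, x in enumerate(xs):
            ec1 = cL + x * (cH - cL)       # E[c_1(t) | x_1(t) = x]
            # equilibrium flow per (7), with the first case approximated by eps
            f1 = (eps if ec1 >= c0 + 1 else
                  1.0 if ec1 <= c0 - 1 else
                  0.5 + (c0 - ec1) / 2.0)
            cost = (1 - f1) * (c0 + 1 - f1) + f1 * (ec1 + f1)   # immediate (8)
            # with f1 > 0, the next belief is q_HH w.p. x and q_LH w.p. 1 - x
            nxt = x * np.interp(q_HH, xs, C) + (1 - x) * np.interp(q_LH, xs, C)
            C_new[k] = cost + rho * nxt
        C = C_new
    return xs, C
```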

III-B Formulation and Equilibrium for Information Hiding

In this subsection, we formulate the long-term cost function under the latest hiding mechanism in the routing game literature (e.g., [17, 32, 24]). Note that their mechanisms cannot be directly applied to our repeated routing, as our users are non-myopic and no longer hold the same private belief. Thus we need to revise the hiding mechanism to fit our differential HILL model as described in Section II-C. Note that the hiding mechanism still caters to users' selfish interests: it hides all information (including $\alpha(t)$) and lets users themselves choose the routing paths that minimize their own long-run costs under Definition 1.

Recall in (5) that users under the hiding mechanism differ in their private information ages $d_i(t)=1$ or $2$ since their own last observations, resulting in the differential private beliefs $y_{i,1}(t)$ in (3) or $y_{i,2}(t)$ in (4) about path $i$, respectively. We summarize the dynamics of users' private beliefs $y_{i,1}(t)$ and $y_{i,2}(t)$ among the $N$ stochastic paths at time $t$ as:

$$\mathbf{y}(t) = \big\{ y_{1,1}(t),\cdots,y_{N,1}(t),\ y_{1,2}(t),\cdots,y_{N,2}(t) \big\}. \quad (10)$$

Let $f_i^{\emptyset}(\mathbf{y}(t))\in[0,1]$ denote the flow of non-myopic users choosing path $i$ at time $t$ under the hiding mechanism with superscript $\emptyset$, which needs to specify users' path decisions under different information ages $d_i(t)$. Denote by $f_{i,d_i(t)}^{\emptyset}(y_{i,d_i(t)}(t))$ the flow of users choosing path $i$ with the same information age $d_i(t)$, which satisfies

$$f_i^{\emptyset}(\mathbf{y}(t)) = \sum_{j=1}^{2} f_{i,j}^{\emptyset}(y_{i,j}(t)). \quad (11)$$

Given the $f_i^{\emptyset}(\mathbf{y}(t-1))$ flow of users exploring path $i$ in the last time slot, the current flow of users with $d_i(t)=1$ is thus $f_i^{\emptyset}(\mathbf{y}(t-1))$. As they have the latest information on this path with $d_i(t)=1$, they will infer the other users' private belief $y_{i,2}(t)$ with $d_i(t)=2$ according to (4). Thus, they can estimate the exact flow $f_{i,2}^{\emptyset}(y_{i,2}(t))$ of these users on path $i$, and then decide $f_{i,1}^{\emptyset}(y_{i,1}(t))$ to minimize their own long-run costs.

Meanwhile, the other $1-f_i^{\emptyset}(\mathbf{y}(t-1))$ flow of users with $d_i(t)=2$ only hold the delayed or aged belief $y_{i,2}(t)$ to decide $f_{i,2}^{\emptyset}(y_{i,2}(t))$ to minimize their expected long-run costs.

To clearly characterize non-myopic users' decision-making under the differential HILL model, we derive in the next lemma the closed-form expressions for how users decide $f_{1,1}^{\emptyset}(y_{1,1}(t))$ and $f_{1,2}^{\emptyset}(y_{1,2}(t))$ at equilibrium in the basic two-path network as an example.

Lemma 2.

Under the information-hiding mechanism, for the $f_1^{\emptyset}(\mathbf{y}(t-1))$ flow of users with minimum information age $d_1(t)=1$ since their last observations, at time $t$ they decide their routing flow $f_{1,1}^{\emptyset}(y_{1,1}(t))$ on stochastic path 1 as:

$$f_{1,1}^{\emptyset}(y_{1,1}(t)) = \min\Big\{ \max\{f_1^{(s)}(y_{1,1}(t)) - f_{1,2}^{\emptyset}(y_{1,2}(t)),\, 0\},\ f_1^{\emptyset}(\mathbf{y}(t-1)) \Big\}, \quad (12)$$

where $f_1^{(s)}(y_{1,1}(t))$ is given by (7). For the other $1-f_1^{\emptyset}(\mathbf{y}(t-1))$ flow of users with age $d_1(t)=2$, their routing flow is

$$f_{1,2}^{\emptyset}(y_{1,2}(t)) = \big(1 - f_1^{\emptyset}(\mathbf{y}(t-1))\big) f_1^{(s)}(y_{1,2}(t)), \quad (13)$$

where $f_1^{(s)}(y_{1,2}(t))$ is given in (7). Then the total flow of users on path 1 is $f_{1,1}^{\emptyset}(y_{1,1}(t)) + f_{1,2}^{\emptyset}(y_{1,2}(t))$.

The proof of Lemma 2 is given in Appendix B of the supplemental material. As compared to $f_1^{(s)}(x_1(t))$ under information sharing in (7), users with $d_1(t)=2$ under information hiding have to use their private belief $y_{1,2}(t)$ to decide $f_{1,2}^{\emptyset}(y_{1,2}(t))$ in (13). Then users with $d_1(t)=1$ decide $f_{1,1}^{\emptyset}(y_{1,1}(t))$ in (12) to make the total flow on path 1 approach $f_1^{(s)}(y_{1,1}(t))$ under information sharing as much as possible.
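The two-step logic of (12)-(13) can be sketched directly on top of the earlier sharing_flow_path1 helper: age-2 users split blindly at their aged belief, and age-1 users top the total up toward the sharing benchmark. This is an illustrative sketch with our own names.

```python
def hiding_flows(y11, y12, f1_prev, c0, cL, cH, eps=1e-3):
    """Equilibrium flows on stochastic path 1 under information hiding, per (12)-(13).

    y11, y12: private beliefs of age-1 and age-2 users.
    f1_prev:  last slot's total flow on path 1, observable to all users.
    Reuses sharing_flow_path1 (from the sketch after Lemma 1) as the rule (7).
    """
    def f_s(y):                                   # (7) evaluated at a private belief y
        return sharing_flow_path1(cL + y * (cH - cL), c0, eps)
    f12 = (1.0 - f1_prev) * f_s(y12)              # age-2 users split blindly, per (13)
    f11 = min(max(f_s(y11) - f12, 0.0), f1_prev)  # age-1 users top up toward (7), per (12)
    return f11, f12
```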

Next, we similarly formulate the long-term $\rho$-discounted cost function under the hiding mechanism below:

$$C^{\emptyset}(\mathbf{y}(t)) = c^{\emptyset}(\mathbf{y}(t)) + \rho C^{\emptyset}(\mathbf{y}(t+1)), \quad (14)$$

where the immediate social cost $c^{\emptyset}(\mathbf{y}(t))$ is defined similarly to (8), and the update of $\mathbf{y}(t+1)$ is given in (3) or (4), depending on the system belief sets $\mathbf{x}(t-1)$ and $\mathbf{x}(t)$.

III-C Formulation and Solution for Social Optimum

Recall that the sharing and hiding mechanisms cater to users' selfish interests to persuade them to follow the path recommendations in Lemmas 1 and 2. Differently, the socially optimal policy aims to centrally solve the optimal routing flow $f_i^*(t)\in[0,1]$ to minimize the long-run expected social cost for all users, by assuming users to be cooperative all the time. Let $C^*(\mathbf{x}(t))$ denote the long-term $\rho$-discounted cost function under the socially optimal policy, given as:

$$C^*(\mathbf{x}(t)) = \min_{f_i^*(t)\in[0,1]} c^*(\mathbf{x}(t)) + \rho C^*(\mathbf{x}(t+1)), \quad (15)$$

where $c^*(\mathbf{x}(t))$ is defined similarly to (8). If the platform asks the entire unit flow of arriving users to choose path 0, there would be no observation on stochastic path $i$. As a result, the posterior belief $x_i'(t)$ remains as $x_i(t)$. However, if the platform instructs some users to travel on path $i$, $x_i'(t)$ can be timely updated to 0 or 1 and will be further updated to $x_i(t+1)$ by (2).

In (15), even if the expected travel cost $\mathbb{E}[c_i(t)|x_i(t)]$ on path $i$ is much higher than the others, the socially optimal policy may still recommend a small flow $\epsilon>0$ of users to learn $x_i(t)$ on path $i$. As $\epsilon\rightarrow 0$, this infinitesimal flow of users' immediate total cost $\epsilon(\mathbb{E}[c_i(t)|x_i(t)]+\epsilon)$ is negligible. At the same time, their learned information on path $i$ can help reduce future costs for all users. Therefore, the socially optimal policy always decides $f_i^*(t)\geq\epsilon$ for any stochastic path $i$.

Next, we derive the optimal recommended flow $f_1^*(x_1(t))$ in the typical two-path network example.

Lemma 3.

Under the socially optimal policy, the recommended flow on stochastic path 1 is

$$f_1^*(x_1(t)) = \begin{cases} \epsilon, & \text{if } \mathbb{E}[c_1(t)|x_1(t)]\geq c_0+2,\\ 1, & \text{if } \mathbb{E}[c_1(t)|x_1(t)]\leq c_0-2,\\ \frac{1}{2}+\frac{c_0-\mathbb{E}[c_1(t)|x_1(t)]}{4}, & \text{otherwise,} \end{cases} \quad (16)$$

where $\epsilon>0$ is infinitesimal.

The proof of Lemma 3 is given in Appendix C of the supplemental material. By comparing (16) to (7) under the sharing mechanism, we observe that the socially optimal policy always explores stochastic path 1 with $f_1^*(x_1(t))\geq\epsilon$, even if $\mathbb{E}[c_1(t)|x_1(t)]$ is much larger than $c_0$. If $\mathbb{E}[c_1(t)|x_1(t)]<c_0$, then $f_1^*(x_1(t))=\frac{1}{2}+\frac{c_0-\mathbb{E}[c_1(t)|x_1(t)]}{4}$ in (16) is smaller than $f_1^{(s)}(x_1(t))=\frac{1}{2}+\frac{c_0-\mathbb{E}[c_1(t)|x_1(t)]}{2}$ in (7), reducing the overall congestion on this stochastic path.
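The contrast between (16) and (7) is easy to see numerically. The sketch below mirrors sharing_flow_path1 from the earlier example, with illustrative inputs of our own choosing:

```python
def optimal_flow_path1(exp_c1, c0, eps=1e-3):
    """Socially optimal flow f_1^*(x_1(t)) on stochastic path 1, per (16)."""
    if exp_c1 >= c0 + 2:
        return eps                         # never stop exploring path 1
    if exp_c1 <= c0 - 2:
        return 1.0
    return 0.5 + (c0 - exp_c1) / 4.0       # milder split than (7): /4 instead of /2

# When path 1 looks attractive, the optimum routes LESS flow there than selfish users:
print(optimal_flow_path1(0.8, 1.0))              # -> 0.55
print(sharing_flow_path1(exp_c1=0.8, c0=1.0))    # -> 0.60
```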

By comparing (16) to (12)-(13) under the hiding mechanism, the socially optimal policy adaptively uses the actual $x_1(t)$ to guide users' routing in different cases. However, the hiding mechanism uses $y_{1,1}(t)$ and $y_{1,2}(t)$ for partially blind routing decisions, which can be very different from the optimum. Based on (7), (12)-(13), and (16) of the three mechanisms/policies, we are ready to compare their system performances in Section IV to inspire our mechanism design.

IV Analytical comparison between sharing, hiding, and social optimum via PoA

In this section, we first conduct a comparative PoA analysis between the sharing mechanism and the socially optimal policy to prove the huge performance loss of the sharing mechanism. Subsequently, we also prove the huge PoA of the hiding mechanism as compared to the social optimum. These results motivate our new mechanism in Section V. Indeed, this section's analytical results provide important guidance for our later mechanism design, which is carefully crafted to operate between the information-sharing and information-hiding mechanisms.

Before analyzing the two mechanisms' performance loss, we first obtain the bounds of the public hazard belief $x_i(t)$ and the private belief $y_{i,d_i(t)}(t)$ for future use.

Lemma 4.

Given any initial belief $x_i(0)\in[0,1]$ for stochastic path $i$, both the public belief $x_i(t)$ and the private belief $y_{i,d_i(t)}(t)$ satisfy $x_i(t)\in[q_{LH},q_{HH}]$ and $y_{i,d_i(t)}(t)\in[q_{LH},q_{HH}]$ for any time $t\geq 1$.

The proof of Lemma 4 is given in Appendix D of the supplemental material. According to (2), the lower bound $q_{LH}$ and the upper bound $q_{HH}$ are attained when $x_i'(t)=0$ and $x_i'(t)=1$, respectively. If the current users observe $c_i(t)=c_L$, leading to $x_i'(t)=0$, the hazard belief $x_i(t+1)$ will be updated to its minimum value $q_{LH}$. Conversely, if $x_i'(t)=1$ with $c_i(t)=c_H$, $x_i(t+1)$ will be updated to its maximum value $q_{HH}$. Consequently, the updated hazard belief $x_i(t+1)$ always remains bounded within $[q_{LH},q_{HH}]$. Similarly, based on the updates of the private belief $y_{i,d_i(t)}(t)$ in (3) or (4), it is also constrained within $[q_{LH},q_{HH}]$.

Based on Lemma 4, if the minimal expected travel cost $\mathbb{E}[c_i(t)|x_i(t)=q_{LH}]$ under the good traffic condition of stochastic path $i$ is larger than $c_0$, selfish users will never explore stochastic path $i$ at any time. On the other hand, if the maximal expected travel cost $\mathbb{E}[c_i(t)|x_i(t)=q_{HH}]$ is smaller than $c_0$, selfish users will always choose path $i$ to exploit the low travel cost there. Therefore, to avoid the trivial cases of always exploring or always exploiting, we focus on $\mathbb{E}[c_i(t)|q_{LH}]<c_0$, $\mathbb{E}[c_i(t)|q_{HH}]>c_0$, and $\mathbb{E}[c_i(0)|x_i(0)]<c_0+1$ below.

IV-A PoA Analysis of Information Sharing Mechanism

We compare $f_i^{(s)}(\mathbf{x}(t))$ under the sharing mechanism in Section III-A to the optimal recommended flow $f_i^*(\mathbf{x}(t))$ under the social optimum in Section III-C to examine the sharing mechanism's efficiency loss. For example, by comparing $f_1^{(s)}(\mathbf{x}(t))$ in (7) and $f_1^*(\mathbf{x}(t))$ in (16) for the two-path network example, we find that the sharing mechanism misses both exploitation and exploration on stochastic path 1, as compared to the social optimum. Below, we examine their performance gap for an arbitrary number $N$ of stochastic paths.

We use the standard definition of the price of anarchy (PoA) in game theory [41], measuring the maximum ratio between the long-term social cost under the sharing mechanism in (9) and the minimal social cost in (15):

$$\text{PoA}^{(s)} = \max_{c_0, c_H, c_L, q_{LL}, q_{HH}, \mathbf{x}(t), \rho} \frac{C^{(s)}(\mathbf{x}(t))}{C^*(\mathbf{x}(t))}, \quad (17)$$

which is always larger than 1 by considering all system parameters.

Through involved worst-case analysis, we prove the lower bound of PoA(s)\text{PoA}^{(s)} in (17) caused by the sharing mechanism’s lack of exploration in the following proposition.

Proposition 1.

The PoA defined in (17) under the information-sharing mechanism in (9) satisfies

$$\text{PoA}^{(s)} \geq \frac{1-\rho+\rho\prod_{i=1}^{N}x_i(0)}{1-\rho+\big(\rho-\rho^{\frac{\lceil\log_{q_{HH}}\delta\rceil}{N}}\big)\prod_{i=1}^{N}x_i(0)}, \quad (18)$$

where $\delta$ is an infinitesimal. This lower bound is tight as $q_{LL}\rightarrow 1$, $q_{HH}\rightarrow 1$, $c_L=0$, $c_H\gg c_0$, $\frac{1-q_{LL}}{1-q_{HH}}\rightarrow 0$, and $\mathbb{E}[c_i(0)|x_i(0)]\rightarrow c_0+1$.

The proof of Proposition 1 is given in Appendix E of the supplemental material. In the proof, we consider the worst case of minimum exploration under the sharing mechanism. Given $\mathbb{E}[c_i(0)|x_i(0)]$ up to $c_0+1$ on path $i$, there will be an $\epsilon$ flow of users traveling on stochastic path $i$ at time $t=0$. Once they find $c_i(t)=c_H$ there at any $t\geq 1$, no user will explore this path again, as $q_{HH}\rightarrow 1$. However, the socially optimal policy (16) always recommends an $\epsilon$ flow of users to explore any path $i$ to learn the possible $c_i(t)=c_L$ there. As $\frac{1-q_{LL}}{1-q_{HH}}\rightarrow 0$, it takes them at most $\frac{\lceil\log_{q_{HH}}\delta\rceil}{N}$ time slots to find at least one path with $c_L=0$. After that, all users' long-term costs are significantly reduced due to $q_{LL}\rightarrow 1$. Then we compare the two long-term costs to derive the PoA lower bound in (18).

Remark 1.

In (18), once $\mathcal{O}(\frac{1}{1-\rho})\gg\mathcal{O}(\frac{1}{1-q_{HH}})$ makes $\rho^{\frac{k}{N}}\rightarrow 1$ (with $k=\lceil\log_{q_{HH}}\delta\rceil$), we have $\text{PoA}^{(s)}=\infty$, telling an infinitely large efficiency loss under the sharing mechanism.

In (18), the PoA lower bound decreases with the path number $N$, as more stochastic paths help risk pooling and make it easier for users to find $c_i(t)=c_L$ to approach the optimum. As $N\rightarrow\infty$ with infinitely many stochastic paths, $\text{PoA}^{(s)}$ in (18) approaches the optimum 1.
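To see how quickly the bound (18) blows up as $\rho\rightarrow 1$ while $q_{HH}$ stays fixed (Remark 1's regime), one can evaluate it numerically. A short sketch, with illustrative parameters of our own:

```python
import math

def poa_sharing_lower_bound(rho, x0, N, q_HH, delta=1e-6):
    """Numeric value of the right-hand side of (18), taking x_i(0) = x0 for all i."""
    k = math.ceil(math.log(delta) / math.log(q_HH))   # ceil(log_{q_HH} delta)
    prod_x0 = x0 ** N
    num = 1 - rho + rho * prod_x0
    den = 1 - rho + (rho - rho ** (k / N)) * prod_x0
    return num / den

# The bound grows without limit as rho -> 1 with q_HH fixed (Remark 1):
for rho in (0.9, 0.99, 0.999, 0.9999):
    print(rho, round(poa_sharing_lower_bound(rho, x0=0.5, N=2, q_HH=0.9), 2))
```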

IV-B PoA Analysis of Information Hiding Mechanism

In this subsection, we further analyze the PoA lower bound caused by the hiding mechanism. Let $\text{PoA}^{\emptyset}$ denote the PoA caused by the hiding mechanism, which is defined similarly to (17) by comparing the social cost $C^{\emptyset}(\mathbf{y}(t))$ in (14) with $C^*(\mathbf{x}(t))$ in (15). In the next proposition, we provide the worst-case analysis to prove $\text{PoA}^{\emptyset}$'s lower bound.

Proposition 2.

As compared to the minimal social cost in (15), the information-hiding mechanism in (14) results in

$$\text{PoA}^{\emptyset} \geq \frac{1-\rho^2+\rho^2\prod_{i=1}^{N}x_i(0)}{1-\rho+\big(\rho-\rho^{\frac{\lceil\log_{q_{HH}}\delta\rceil}{N}}\big)\prod_{i=1}^{N}x_i(0)}, \quad (19)$$

where $\delta$ is an infinitesimal. This lower bound is tight as $q_{LL}\rightarrow 1$, $q_{HH}\rightarrow 1$, $c_L=0$, $c_H\gg c_0$, $\frac{1-q_{LL}}{1-q_{HH}}\rightarrow 0$, and $\mathbb{E}[c_i(0)|x_i(0)]\rightarrow c_0+1$.

The proof of Proposition 2 is given in Appendix F of the supplemental material. The PoA bound (19) can be proved similarly by considering the minimum exploration case as in Proposition 1. However, different from the sharing mechanism that always knows the latest system information, users with information age $d_i(t)=2$ under the hiding mechanism have a one-slot latency in deciding between exploration and exploitation, which makes $\text{PoA}^{\emptyset}$ in (19) different from $\text{PoA}^{(s)}$ in (18).

Remark 2.

In (19), once $\mathcal{O}(\frac{1}{1-\rho})\gg\mathcal{O}(\frac{1}{1-q_{HH}})$ makes $\rho^{\frac{k}{N}}\rightarrow 1$ (with $k=\lceil\log_{q_{HH}}\delta\rceil$), we have $\text{PoA}^{\emptyset}=\infty$, telling an infinitely large efficiency loss under the information-hiding mechanism.

In (19), the lower bound of $\text{PoA}^{\emptyset}$ also decreases with the path number $N$. As $N\rightarrow\infty$ with infinitely many paths, $\text{PoA}^{\emptyset}$ approaches $1+\rho$, due to users' late exploitations of stochastic paths under information delay $d_i(t)=2$, which increase the total cost.

Given that neither the sharing mechanism nor the hiding mechanism performs well for non-myopic users with finite $N$ in practice, it is critical to design a new mechanism beyond simply sharing or hiding information.

V Design and Analysis of Our User-differential Informational Mechanism

To motivate non-myopic users to follow the socially optimal policy in (16) as much as possible, we aim to design the best mechanism to obtain the minimum possible PoA in this section. As pricing or side-payment mechanisms directly charge users, they are difficult to implement in existing traffic navigation platforms ([19, 20, 42]). Instead, we aim to design a non-monetary informational mechanism that indirectly restricts information channels to regulate users. Following the mechanism design literature (e.g., [22] and [23]), we provide its general definition below.

Definition 2 (Informational mechanism).

An informational mechanism defines a Bayesian game, where users send costless information and the social planner decides the disclosing level of observable variables based on received information.

By Definition 2, we notice that the information hiding mechanism in Section III-B belongs to informational mechanisms, by simply hiding all traffic information from selfish users. Yet its achieved PoA is arbitrarily large in Proposition 2. Additionally, we analyze existing deterministic recommendation mechanisms (e.g., [17, 18, 24]), which offer state-dependent deterministic recommendations to users while concealing other information, and demonstrate that these mechanisms still result in an infinite PoA, motivating our UPR mechanism design.

Lemma 5.

The existing deterministic-recommendation mechanisms make $\text{PoA}=\infty$ in this system.

The proof of Lemma 5 is given in Appendix G of the supplemental material. Intuitively, when a user receives the recommendation $\pi(t)=i$, it still infers the expected travel cost of each path based on its private information $y_{i,d_i(t)}(t)$. As a result, the recommendation $\pi(t)$ does not influence the user's selfish path choice, ultimately leading to an infinite PoA.

We next examine the efficiency limit of any informational mechanism for our HILL system below.

Proposition 3.

In our dynamic HILL system, under any informational mechanism in Definition 2, the achieved PoA satisfies

$$\text{PoA} \geq 1 + \frac{1}{4N+3}.$$

The proof of Proposition 3 is given in Appendix H of the supplemental material. To obtain this minimal possible PoA, we first show that no informational mechanism in Definition 2 can prevent users' maximum exploration of stochastic paths. Let $x_i(0)=q_{LH}$, $q_{LH}\rightarrow 0$, $q_{HH}<1$, and $c_L=0$ for any stochastic path $i\in\mathbb{N}$. Given $\mathbb{E}[c_i(0)|x_i(0)]=c_0-\frac{1}{N}$, there will be an $f_i^{(s)}(\mathbf{x}(0))=\frac{1}{N}$ flow of users traveling on each stochastic path $i$ at the initial $t=0$, according to (7). Since the initial hazard belief $x_i(0)=q_{LH}\rightarrow 0$, users will observe $c_i(0)=c_L$ with a good traffic condition on path $i$. Then each user is aware of the public belief $x_i(1)=q_{LH}$ on its chosen stochastic path $i$ with minimum information age $d_i(1)=1$ for the next time $t=1$. Consequently, they behave the same as under the sharing mechanism by continuing to travel on their chosen stochastic paths since $t=1$, making any informational mechanism fail to regulate them.

On the other hand, according to $f^*(\mathbf{x}(t))$ in (16), the socially optimal policy ensures a $\frac{1}{2(N+1)}$ flow of users choosing deterministic path 0 to avoid congestion costs on the stochastic paths. In this case, the resulting minimal PoA is $1+\frac{1}{4N+3}$.
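For concreteness, the single-slot computation behind this ratio can be condensed as follows; this is our own sketch of the worst case above (with $c_L=0$, $x_i(0)=q_{LH}\rightarrow 0$, and $c_0=\frac{1}{N}$ so that $\mathbb{E}[c_i(0)|x_i(0)]=c_0-\frac{1}{N}\rightarrow 0$), not the full proof in Appendix H.

```latex
% Selfish users split evenly across the N good stochastic paths:
%   f_i^{(s)} = 1/N, f_0^{(s)} = 0  =>  c^{(s)} = \sum_i (1/N)(0 + 1/N) = 1/N.
% The optimum equalizes marginal costs c_0 + 2 f_0^* = 0 + 2 f_i^*:
%   f_0^* = 1/(2(N+1)),  f_i^* = f_0^* + c_0/2 = (2N+1)/(2N(N+1)),
%   c^* = f_0^*(c_0 + f_0^*) + N (f_i^*)^2 = (4N+3)/(4N(N+1)).
% Hence the ratio per slot (and in the long run, since this slot repeats):
\[
  \frac{c^{(s)}}{c^*}
  = \frac{1/N}{(4N+3)/(4N(N+1))}
  = \frac{4(N+1)}{4N+3}
  = 1 + \frac{1}{4N+3}.
\]
```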

Building upon Proposition 3, we want to design the best informational mechanism to achieve $\text{PoA}=1+\frac{1}{4N+3}$. Based on the worst-case analysis in the proof above, when $c_i(t-1)=c_L$, no informational mechanism is able to prevent users with minimum information age $d_i(t)=1$ from exploiting stochastic path $i$ at the current time. When a number $1\leq n\leq N$ of stochastic paths reach their minimum public hazard belief (i.e., $x_i(t)=q_{LH}$ for path $i$ according to Lemma 4), selfish users have the maximum tendency to exploit these paths, and the resultant routing flow on path $i$ reaches its maximum:

$$\bar{f}^{(s)}(n,t) = \min\left\{\frac{1}{n},\ \frac{c_0+1-\mathbb{E}[c_i(t)|x_i(t)=q_{LH}]}{n+1}\right\}. \quad (20)$$

Given the number $n$ of stochastic paths in good condition at time $t$, $\bar{f}^{(s)}(n,t)$ is known to all users. Alternatively, when $c_i(t-1)=c_H$, no informational mechanism can persuade users with minimum age $d_i(t)=1$ to choose path $i$ at $t$. Hence, we can only focus on incentivizing the remaining users with non-trivial information age $d_i(t)=2$ to align with the platform's optimal path recommendations.

On one hand, different from the hiding mechanism that blinds all users, this user-differential approach inspires us to differentiate our path recommendations to users of differential $d_i(t)$. On the other hand, unlike the sharing mechanism, our mechanism design does not share with users the public belief set history $\{\mathbf{x}(1),\cdots,\mathbf{x}(t)\}$ but only the previous flow distribution, to improve efficiency. As users (with information age $d_i(t)=2$) can reverse-engineer the system's path recommendations to infer the actual hazard belief, we propose probabilistic recommendations below.

In the following mechanism, let $f_i^{(\text{UPR})}(t)\in[\epsilon,\bar{f}^{(s)}(n,t)]$ represent the recommended flow of path $i$ at time $t$, which is always positive to keep learning from all stochastic paths over time.

Definition 3 (User-differential Probabilistic Recommendation (UPR) mechanism).

At any given time t1t\geq 1, the platform operates in the following two steps:

Step 1: Selective information hiding. First, the platform hides the public belief history $\{\mathbf{x}(1),\cdots,\mathbf{x}(t)\}$ from all users but discloses the previous routing flow (i.e., $f_i^{(\text{UPR})}(t-1)$ for any path $i$).

Step 2: Probabilistic Recommendation. Then the platform performs the following user-differential random signaling strategies to disclose selected information.

  • For the $f_i^{(\text{UPR})}(t-1)$ flow of users observing $c_i(t-1)=c_L$ on path $i$ with minimum age $d_i(t)=1$, the platform just recommends them to continue and make their own selfish routing decisions in (12).

  • For the other users with $d_i(t)=2$ on any stochastic path $i\in I(t):=\{i\in\mathbb{N}\,|\,\bar{f}^{(s)}(n,t)>f_i^{(\text{UPR})}(t-1)\}$, the platform independently recommends each of them to travel on path $\pi(t)=i$ with probability

    $$\mathbf{Pr}(\pi(t)=i\,|\,x_i(t)) = \begin{cases} \epsilon, & \text{if high hazard belief } x_i(t)=q_{HH},\\ \frac{\bar{f}^{(s)}(n,t)-f_i^{(\text{UPR})}(t-1)}{1-f_i^{(\text{UPR})}(t-1)}, & \text{if low hazard belief } x_i(t)=q_{LH}. \end{cases} \quad (21)$$

Besides, the platform independently recommends each of them to travel on deterministic path $\pi(t)=0$ with probability

$$\mathbf{Pr}(\pi(t)=0\,|\,x_i(t)) = 1 - \sum_{i\in I(t)}\mathbf{Pr}(\pi(t)=i\,|\,x_i(t)).$$

Here, our UPR mechanism does not prevent users who just observed $c_i(t-1)=c_L$ with minimum age $d_i(t)=1$ from pursuing selfish interests by exploiting stochastic path $i$ at time $t$. However, those users who just observed $c_j(t-1)=c_H$ on another path $j$, as well as the users who just traveled on deterministic path 0, all have information age $d_i(t)=2$; they receive probabilistic path recommendations in (21) to deviate from a strong-hazard path and try other paths.

Our mechanism proactively designs (21) in a persuasive way: a user with $d_i(t)=2$ hearing the realized recommendation $\pi(t)=i$ will reverse-engineer that path $i$ is in a good condition $x_i(t)=q_{LH}$, as the corresponding recommendation probability $\frac{\bar{f}^{(s)}(n,t)-f_i^{(\text{UPR})}(t-1)}{1-f_i^{(\text{UPR})}(t-1)}\gg\epsilon$ by comparing the two cases of (21). On the other hand, if a user is recommended path $\pi(t)=0$, it will similarly believe that any stochastic path $i$ has a bad traffic condition with $x_i(t)=q_{HH}$. As a result, it will follow this recommendation to choose path 0.

In summary, our UPR mechanism enables users to exploit $c_i(t)=c_L$ on stochastic path $i$ when the public hazard belief $x_i(t)=q_{LH}$ on this path is low. On the other hand, when stochastic path $i$ has a high hazard belief $x_i(t)=q_{HH}$, our UPR mechanism still ensures an $\epsilon$ flow of users exploring this path to learn for the future. Thus, our UPR mechanism successfully avoids the worst-case scenarios with minimum exploration under both the sharing and hiding mechanisms. Note that the main idea of our UPR mechanism can also be generalized to regulate selfish players in other restless MAB games (e.g., [36, 37]). In these problems, the social planner provides differential recommendations to users based on their private information about each arm (path), which influences their posterior beliefs and consequently regulates their selfish arm choices.
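A minimal sketch of the per-user signaling step in (21) follows, reduced to a single stochastic path for readability; in the full mechanism of Definition 3, the leftover probability mass is routed to path 0 across all $i\in I(t)$. All names are our own.

```python
import random

def upr_recommend(age, high_hazard, f_upr_prev, n, c0, exp_c_low, eps=1e-3):
    """One user's random recommendation for a stochastic path i under UPR.

    age:         the user's information age d_i(t) (1 or 2).
    high_hazard: True if x_i(t) = q_HH, False if x_i(t) = q_LH.
    f_upr_prev:  recommended flow f_i^{(UPR)}(t-1), disclosed to all users;
                 assumed below the cap f_bar, i.e., path i is in I(t).
    n:           number of stochastic paths currently at minimum belief q_LH.
    exp_c_low:   E[c_i(t) | x_i(t) = q_LH], used in the flow cap (20).
    Returns 'self' (route selfishly), 'i' (path i), or '0' (path 0).
    """
    if age == 1:
        return 'self'                       # age-1 users route selfishly via (12)
    f_bar = min(1.0 / n, (c0 + 1 - exp_c_low) / (n + 1))       # cap (20)
    if high_hazard:
        p_i = eps                           # keep an eps flow exploring, per (21)
    else:
        p_i = (f_bar - f_upr_prev) / (1.0 - f_upr_prev)        # top up to f_bar
    return 'i' if random.random() < p_i else '0'
```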

Following [43] and [44], we define users’ Interim Individual Rationality (IIR) to ensure their long-term participation in our UPR mechanism.

Definition 4 (Interim Individual Rationality).

A mechanism ensures Interim Individual Rationality for users with private information if, prior to the realization of the cost incurred by following the mechanism, each user anticipates achieving a cost level that is at least as low as their reservation cost.

Based on the above analysis and Definition 4, we prove that our UPR mechanism ensures IIR for all users and achieves close-to-optimal performance in the following theorem.

Theorem 1.

Our UPR mechanism in Definition 3 ensures interim individual rationality for all non-myopic users to follow the system’s random path recommendations. It greatly reduces PoA from arbitrarily large in (18) and (19) to the minimum possible ratio:

\text{PoA}^{(\text{UPR})}=1+\frac{1}{4N+3}, \quad (22)

which is always no larger than $\frac{8}{7}$ for any $N\geq 1$.

The proof of Theorem 1 is given in Appendix I of the supplemental material. Recall that Proposition 3 shows no informational mechanism can achieve a PoA less than $1+\frac{1}{4N+3}$; hence, our UPR mechanism is optimal in reducing the PoA. The worst case under our UPR mechanism is the same as in Proposition 3, which occurs at users' maximum exploration when all stochastic paths are in good traffic conditions with $x_i(t)=q_{LH}$.

In addition to the worst-case PoA analysis presented in Theorem 1, we further conduct experiments using real datasets to examine the average performance of our UPR mechanism.

Figure 2: A hybrid road network with two origins, World Expo Museum and Shanghai Station, and a single destination, the Bund, Shanghai’s most popular shopping and sightseeing area. Users randomly leave from World Expo Museum and Shanghai Station each day, and subsequently choose a path to reach the Bund.

VI Experiment Validation Using Real Datasets

In this section, we conduct extensive experiments to evaluate our UPR mechanism’s average performance versus the information sharing mechanism (used by Google Maps and Waze), the latest hiding mechanism in routing game literature (e.g., [17, 32, 24]), and the social optimum in terms of the average long-term social cost.

Without much loss of generality, we focus on populous cities exhibiting time-varying road conditions and heavy traffic congestion. As such, we use live traffic datasets from Baidu Maps [48] and real-time traffic speed data during peak hours to train our traffic model. Building on our parallel transportation network in Fig. 1(a), we extend it to a hybrid network by introducing two origin-destination pairs, Shanghai Station (origin 1) to the Bund and World Expo Museum (origin 2) to the Bund, with overlapping paths (i.e., Yan'An Elevated Road and East Yan'An Road) in Fig. 2. This setup captures the typical flow of visitors traveling from origin 1 and origin 2 to the Bund.

For the route from origin 1 to the Bund, the options are:

  • Path 1-1: via First Haining Road, Henan Road, and then East Yan’An Road.

  • Path 1-2: via North Chengdu Road, Yan'An Elevated Road, and East Yan'An Road.

For the route from origin 2 to the Bund, the options are:

  • Path 2-1: via North-South Elevated Road, Yan’An Elevated Road, and East Yan’An Road.

  • Path 2-2: via South Zhongshan Road and East Zhongshan No.2 Road.

The mean arrival rates are $\alpha^1=102$ and $\alpha^2=56$ cars every 5 minutes at origin 1 and origin 2, respectively, with a standard deviation of 5.62. To develop a practical congestion model with travel costs, we mine and analyze extensive data on the SpeedBand values across the eight roads, which fluctuate every 5 minutes during the peak-hour periods. Our data analysis verifies that the traffic conditions on North Chengdu Road, Yan'An Elevated Road, and East Yan'An Road in path 1-2, as well as North-South Elevated Road in path 2-1, can be well approximated by a Markov chain with two discretized states (high and low traffic) for learning, as depicted in Fig. 1(b). In contrast, Haining Road and Henan Road in path 1-1, as well as South Zhongshan Road and East Zhongshan No.2 Road in path 2-2, display more deterministic conditions. Following methodologies similar to [45, 46, 47], we train a hidden Markov model (HMM) and obtain the transition probability matrices N_CD, YA_E, E_YA, and NS_E for North Chengdu Road, Yan'An Elevated Road, East Yan'An Road, and North-South Elevated Road, respectively:

\text{N\_CD}=\begin{bmatrix}0.6239&0.3761\\0.3525&0.6475\end{bmatrix},\quad \text{YA\_E}=\begin{bmatrix}0.7873&0.2127\\0.3039&0.6961\end{bmatrix},
\text{E\_YA}=\begin{bmatrix}0.8443&0.1557\\0.1326&0.8674\end{bmatrix},\quad \text{NS\_E}=\begin{bmatrix}0.8730&0.1270\\0.0998&0.9002\end{bmatrix}.
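To illustrate how a trained matrix drives the two-state congestion model, the short Python sketch below samples one peak-hour trace of a road's low/high traffic states; the 5-minute slot length and the YA_E matrix come from the setup above, while the function name and seeding are our own illustrative choices.

import numpy as np

# Trained transition matrix for Yan'An Elevated Road (row = current state,
# column = next state), with state 0 = low traffic and state 1 = high traffic.
YA_E = np.array([[0.7873, 0.2127],
                 [0.3039, 0.6961]])

def simulate_road(P, x0, slots, seed=0):
    """Sample traffic states over `slots` 5-minute slots, starting from
    hazard belief x0 = Pr(high traffic at t=0)."""
    rng = np.random.default_rng(seed)
    state = int(rng.random() < x0)
    trace = [state]
    for _ in range(slots - 1):
        state = int(rng.random() < P[state, 1])  # transition from current row
        trace.append(state)
    return trace

print(simulate_road(YA_E, x0=0.6, slots=30))  # one 150-minute peak-hour trace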

In this hybrid two-origin network, users assess the travel latency as the travel cost. Initially, before the peak hours, the four stochastic roads' hazard beliefs satisfy $x_{\text{NCD}}(0)=0.5$, $x_{\text{YAE}}(0)=0.6$, $x_{\text{EYA}}(0)=0.7$, and $x_{\text{NSE}}(0)=0.3$.

Under both the sharing and hiding mechanisms, users at origin 1 estimate the combined travel costs of North Chengdu Road and Yan'An Elevated Road as $\mathbb{E}[c_{1\text{-}2}(t)]$ for path 1-2, incorporating the congestion cost from the expected flow of users on path 2-1. Since paths 1-1 and 1-2 share East Yan'An Road, users at origin 1 compare $\mathbb{E}[c_{1\text{-}2}(t)]$ with the fixed cost $c_{1\text{-}1}$ (covering the combined travel costs of Haining Road and Henan Road) to make routing decisions. Similarly, users at origin 2 estimate the combined travel costs of North-South Elevated Road, Yan'An Elevated Road, and East Yan'An Road for path 2-1 as $\mathbb{E}[c_{2\text{-}1}(t)]$. They then compare $\mathbb{E}[c_{2\text{-}1}(t)]$ with the fixed cost $c_{2\text{-}2}$ to determine, based on self-interest, which path minimizes their personal travel costs. In contrast, the socially optimal policy continuously searches for the flow allocation that minimizes the long-term social cost for all users.

To apply our UPR mechanism in Definition 3, we combine the three roads on paths 1-2 and 2-1 and update the parameters in (21) to provide probabilistic recommendations. These recommendations direct users with differential information ages toward specific paths from origin 1 and origin 2, based on the public hazard beliefs of the four roads as in (21). In addition, we set the discount factor $\rho=0.95$ for the long-term social costs and $\epsilon=1$ as the minimum recommended flow in (7), (16), and (21). After training this practical traffic model and applying the mechanisms, we run 100 experiments (each with 30 time slots, or equivalently 150 minutes) to average the long-term social costs during peak hours, as sketched below.
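A minimal sketch of our evaluation loop, assuming a caller-supplied `simulate_run(slots)` that returns the per-slot social costs (in minutes) of one peak-hour run under a given mechanism; the 100 runs, 30 slots, and $\rho=0.95$ follow the setup above, and everything else is illustrative.

def discounted_cost(run_costs, rho=0.95):
    """Long-term discounted social cost of one run: sum_t rho^t * c_t."""
    return sum((rho ** t) * c for t, c in enumerate(run_costs))

def average_performance(simulate_run, runs=100, slots=30):
    """Average the long-term discounted social cost over independent runs.
    `simulate_run(slots)` is assumed to return a list of per-slot costs."""
    return sum(discounted_cost(simulate_run(slots)) for _ in range(runs)) / runs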

Figure 3: Average long-term social costs (in minutes) under information sharing, hiding, the optimum, and our UPR mechanism versus time horizon TT.

Fig. 3 compares the long-term social cost performance of the information sharing, hiding, socially optimal, and UPR mechanisms versus the time horizon $T$. It demonstrates that our UPR mechanism has less than $5\%$ efficiency loss from the social optimum for any time horizon $T$, while the sharing and hiding mechanisms suffer around $20\%$ and $40\%$ efficiency losses, respectively.

VII Conclusions

In this paper, we study how to regulate human-in-the-loop learning (HILL) in repeated routing games, where we face non-myopic users with differential past observations and need new (preferably non-monetary) mechanisms to persuade users to adhere to optimal path recommendations. We first prove that, under both the information sharing mechanism in use and the routing literature's latest hiding mechanism, the resultant price of anarchy (PoA) for measuring the efficiency loss from the social optimum can approach infinity, indicating an arbitrarily poor exploration-exploitation tradeoff over time. Then we propose a novel user-differential probabilistic recommendation (UPR) mechanism to differentiate and randomize path recommendations for users with differential learning histories. We prove that our UPR mechanism ensures interim individual rationality for all users and significantly reduces $\text{PoA}=\infty$ to the close-to-optimal $\text{PoA}=1+\frac{1}{4N+3}$, which cannot be further reduced by any other non-monetary mechanism. In addition to the theoretical analysis, we conduct extensive experiments using real-world datasets to generalize our routing graphs and validate the close-to-optimal performance of our UPR mechanism.

References

  • [1] T. Haselton, “How to use Google Waze,” https://www.cnbc.com/2018/11/13/how-to-use-google-waze-for-directions-and-avoiding-traffic.html, 2018.
  • [2] S. Olariu, “Vehicular crowdsourcing for congestion support in smart cities,” Smart Cities, vol. 4, no. 2, pp. 662–685, 2021.
  • [3] J. Kang, J. Zhang, H. Yang, D. Ye, and M. S. Hossain, “When metaverses meet vehicle road cooperation: Multi-agent drl-based stackelberg game for vehicular twins migration,” IEEE Internet of Things Journal, 2024.
  • [4] S. Vasserman, M. Feldman, and A. Hassidim, “Implementing the wisdom of Waze,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
  • [5] M. R. Hasan, A. L. Bazzan, E. Friedman, and A. Raja, “A multiagent solution to overcome selfish routing in transportation networks,” in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2016, pp. 1850–1855.
  • [6] A. Slivkins et al., “Introduction to multi-armed bandits,” Foundations and Trends® in Machine Learning, vol. 12, no. 1-2, pp. 1–286, 2019.
  • [7] F. Li, D. Yu, H. Yang, J. Yu, H. Karl, and X. Cheng, “Multi-armed-bandit-based spectrum scheduling algorithms in wireless networks: A survey,” IEEE Wireless Communications, vol. 27, no. 1, pp. 24–30, 2020.
  • [8] S. Gupta, S. Chaudhari, G. Joshi, and O. Yağan, “Multi-armed bandits with correlated arms,” IEEE Transactions on Information Theory, vol. 67, no. 10, pp. 6711–6732, 2021.
  • [9] A. Bozorgchenani, S. Maghsudi, D. Tarchi, and E. Hossain, “Computation offloading in heterogeneous vehicular edge networks: On-line and off-policy bandit solutions,” IEEE Transactions on Mobile Computing, vol. 21, no. 12, pp. 4233–4248, 2021.
  • [10] H. Wang, Y. Yang, E. Wang, W. Liu, Y. Xu, and J. Wu, “Truthful user recruitment for cooperative crowdsensing task: A combinatorial multi-armed bandit approach,” IEEE Transactions on Mobile Computing, vol. 22, no. 7, pp. 4314–4331, 2022.
  • [11] F. Li, X. Yuan, L. Wang, H. Yang, D. Yu, W. Lyu, and X. Cheng, “Collaborative learning in general graphs with limited memorization: Complexity, learnability, and reliability,” IEEE/ACM Transactions on Networking, vol. 31, no. 6, pp. 3222–3237, 2023.
  • [12] C. Shi and C. Shen, “Federated multi-armed bandits,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 11, 2021, pp. 9603–9611.
  • [13] L. Yang, Y.-Z. J. Chen, S. Pasteris, M. Hajiesmaili, J. Lui, and D. Towsley, “Cooperative stochastic bandits with asynchronous agents and constrained feedback,” Advances in Neural Information Processing Systems, vol. 34, pp. 8885–8897, 2021.
  • [14] L. Yang, Y.-Z. J. Chen, M. Hajiesmaili, J. C. Lui, and D. Towsley, “Distributed bandits with heterogeneous agents,” in IEEE INFOCOM 2022-IEEE Conference on Computer Communications. IEEE, 2022, pp. 200–209.
  • [15] J. Zhu and J. Liu, “Distributed multi-armed bandits,” IEEE Transactions on Automatic Control, 2023.
  • [16] W. Mazurczyk, S. Wendzel, S. Zander, A. Houmansadr, and K. Szczypiorski, Information hiding in communication networks: fundamentals, mechanisms, applications, and countermeasures.   John Wiley & Sons, 2016.
  • [17] H. Tavafoghi and D. Teneketzis, “Informational incentives for congestion games,” in 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2017, pp. 1285–1292.
  • [18] M. Wu and S. Amin, “Information design for regulating traffic flows under uncertain network state,” in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2019, pp. 671–678.
  • [19] B. L. Ferguson, P. N. Brown, and J. R. Marden, “The effectiveness of subsidies and tolls in congestion games,” IEEE Transactions on Automatic Control, vol. 67, no. 6, pp. 2729–2742, 2021.
  • [20] Y. Yue, B. L. Ferguson, and J. R. Marden, “Incentive design for congestion games with unincentivizable users,” in 2021 60th IEEE Conference on Decision and Control (CDC).   IEEE, 2021, pp. 4515–4520.
  • [21] H. Li and L. Duan, “Online pricing incentive to sample fresh information,” IEEE Transactions on Network Science and Engineering, vol. 10, no. 1, pp. 514–526, 2022.
  • [22] D. Fudenberg and J. Tirole, Game theory.   MIT press, 1991.
  • [23] T. Börgers, An introduction to the theory of mechanism design.   Oxford University Press, USA, 2015.
  • [24] Y. Li, C. Courcoubetis, and L. Duan, “Recommending paths: Follow or not follow?” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications.   IEEE, 2019, pp. 928–936.
  • [25] I. Kremer, Y. Mansour, and M. Perry, “Implementing the “wisdom of the crowd”,” Journal of Political Economy, vol. 122, no. 5, pp. 988–1012, 2014.
  • [26] Y. Mansour, A. Slivkins, V. Syrgkanis, and Z. S. Wu, “Bayesian exploration: Incentivizing exploration in Bayesian games,” Operations Research, vol. 70, no. 2, pp. 1105–1127, 2022.
  • [27] X. Hu, D. Ngo, A. Slivkins, and S. Z. Wu, “Incentivizing combinatorial bandit exploration,” Advances in Neural Information Processing Systems, vol. 35, pp. 37173–37183, 2022.
  • [28] H. Li and L. Duan, “When congestion games meet mobile crowdsourcing: Selective information disclosure,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 5, 2023, pp. 5739–5746.
  • [29] E. Kamenica and M. Gentzkow, “Bayesian persuasion,” American Economic Review, vol. 101, no. 6, pp. 2590–2615, 2011.
  • [30] S. Das, E. Kamenica, and R. Mirka, “Reducing congestion through information design,” in 2017 55th annual allerton conference on communication, control, and computing (allerton).   IEEE, 2017, pp. 1279–1284.
  • [31] J. Kang, Y. Zhong, M. Xu, J. Nie, J. Wen, H. Du, D. Ye, X. Huang, D. Niyato, and S. Xie, “Tiny multi-agent drl for twins migration in uav metaverses: A multi-leader multi-follower stackelberg game approach,” IEEE Internet of Things Journal, 2024.
  • [32] F. Farhadi and D. Teneketzis, “Dynamic information design: a simple problem on optimal sequential information disclosure,” Dynamic Games and Applications, vol. 12, no. 2, pp. 443–484, 2022.
  • [33] M. Bernasconi, M. Castiglioni, A. Celli, A. Marchesi, F. Trovò, and N. Gatti, “Optimal rates and efficient algorithms for online bayesian persuasion,” in International Conference on Machine Learning.   PMLR, 2023, pp. 2164–2183.
  • [34] H. Li and L. Duan, “Human-in-the-loop learning for dynamic congestion games,” IEEE Transactions on Mobile Computing, 2024.
  • [35] E. Meigs, F. Parise, A. Ozdaglar, and D. Acemoglu, “Optimal dynamic information provision in traffic routing,” arXiv preprint arXiv:2001.03232, 2020.
  • [36] C. Tekin and M. Liu, “Online learning of rested and restless bandits,” IEEE Transactions on Information Theory, vol. 58, no. 8, pp. 5588–5611, 2012.
  • [37] S. Wang, L. Huang, and J. Lui, “Restless-UCB, an efficient and low-complexity algorithm for online restless bandits,” Advances in Neural Information Processing Systems, vol. 33, pp. 11878–11889, 2020.
  • [38] E. Castillo, P. Jimenez, J. M. Menendez, and A. J. Conejo, “The observability problem in traffic models: algebraic and topological methods,” IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 2, pp. 275–287, 2008.
  • [39] M. Salari, L. Kattan, W. H. Lam, H. Lo, and M. A. Esfeh, “Optimization of traffic sensor location for complete link flow observability in traffic network considering sensor failure,” Transportation Research Part B: Methodological, vol. 121, pp. 216–251, 2019.
  • [40] L. He, G. Sun, D. Niyato, H. Du, F. Mei, J. Kang, M. Debbah et al., “Generative ai for game theory-based mobile networking,” arXiv preprint arXiv:2404.09699, 2024.
  • [41] E. Koutsoupias and C. Papadimitriou, “Worst-case equilibria,” in Annual symposium on theoretical aspects of computer science.   Springer, 1999, pp. 404–413.
  • [42] F. Li, Y. Chai, H. Yang, P. Hu, and L. Duan, “Incentivizing Massive Unknown Workers for Budget-Limited Crowdsensing: From Off-Line and On-Line Perspectives,” IEEE/ACM Transactions on Networking, 2024.
  • [43] P. Cramton, R. Gibbons, and P. Klemperer, “Dissolving a partnership efficiently,” Econometrica: Journal of the Econometric Society, pp. 615–632, 1987.
  • [44] G. Kosenok and S. Severinov, “Individually rational, budget-balanced mechanisms and allocation of surplus,” Journal of Economic Theory, vol. 140, no. 1, pp. 126–161, 2008.
  • [45] S. R. Eddy, “Profile hidden markov models.” Bioinformatics (Oxford, England), vol. 14, no. 9, pp. 755–763, 1998.
  • [46] Z. Chen, J. Wen, and Y. Geng, “Predicting future traffic using hidden markov models,” in 2016 IEEE 24th international conference on network protocols (ICNP).   IEEE, 2016, pp. 1–6.
  • [47] A. Saad, H. F. Schepker, B. Staehle, and R. Knorr, “Whitespace prediction using hidden markov model based maximum likelihood classification,” in 2019 IEEE 89th Vehicular Technology Conference (VTC2019-Spring).   IEEE, 2019, pp. 1–7.
  • [48] Baidu, “Baidu Maps Open Platform,” https://lbsyun.baidu.com/index.php?title=webapi/traffic, 2023.
[Uncaptioned image] Hongbo Li (S’24-M’24) received the Ph.D. degree from Singapore University of Technology and Design (SUTD) in 2024. He is currently a Postdoctoral Research Fellow with the Pillar of Engineering Systems and Design, SUTD. In 2024, he was a Visiting Scholar at The Ohio State University (OSU), Columbus, OH, USA. His research interests include networked AI, game theory and mechanism design, and machine learning theory.
[Uncaptioned image] Lingjie Duan (S’09-M’12-SM’17) received the Ph.D. degree from The Chinese University of Hong Kong in 2012. He is an Associate Professor at the Singapore University of Technology and Design (SUTD) and is an Associate Head of Pillar (AHOP) of Engineering Systems and Design. In 2011, he was a Visiting Scholar at University of California at Berkeley, Berkeley, CA, USA. His research interests include network economics and game theory, network security and privacy, energy harvesting wireless communications, and mobile crowdsourcing. He is an Associate Editor of IEEE/ACM Transactions on Networking and IEEE Transactions on Mobile Computing. He was an Editor of IEEE Transactions on Wireless Communications and IEEE Communications Surveys and Tutorials. He also served as a Guest Editor of the IEEE Journal on Selected Areas in Communications Special Issue on Human-in-the-Loop Mobile Networks, as well as IEEE Wireless Communications Magazine. He is a General Chair of WiOpt 2023 Conference and is a regular TPC member of some other top conferences (e.g., INFOCOM, MobiHoc, SECON). He received the SUTD Excellence in Research Award in 2016 and the 10th IEEE ComSoc Asia-Pacific Outstanding Young Researcher Award in 2015.

-A Proof of Lemma 1

We consider a typical two-path network with one deterministic path (path 0) and one stochastic path (path 1) to derive the three cases of (7) in Lemma 1, depending on the relationship between the cost $c_0$ of deterministic path 0 and $\mathbb{E}[c_1(t)|x_1(t)]$ of stochastic path 1.

Under the information-sharing mechanism, a selfish user always chooses the path that minimizes his own long-term cost at any $t$. Define $Q^{(s)}\big(x_1(t)|f_1^{(s)}(x_1(t))\big)$ to be the minimal long-run expected cost of a single user, given that an $f_1^{(s)}(x_1(t))$ flow of users other than himself chooses path 1. Denote this user's selfish decision under the sharing mechanism by $\pi^{(s)}(t)$. We obtain:

Q^{(s)}\big(x_1(t)|f_1^{(s)}(x_1(t))\big) = \min\Big\{c_0+1-f_1^{(s)}(x_1(t))+\rho Q_0^{(s)}\big(x_1(t+1)|f_1^{(s)}(x_1(t))\big),\ \mathbb{E}[c_1(t)|x_1(t)]+f_1^{(s)}(x_1(t))+\rho Q_1^{(s)}\big(x_1(t+1)|f_1^{(s)}(x_1(t))\big)\Big\} \quad (23)

by estimating $f_1^{(s)}(x_1(t))$ under $x_1(t)$, where $Q_0^{(s)}\big(x_1(t+1)|f_1^{(s)}(x_1(t))\big)$ and $Q_1^{(s)}\big(x_1(t+1)|f_1^{(s)}(x_1(t))\big)$ are the cost-to-go functions under $\pi^{(s)}(t)=0$ and $\pi^{(s)}(t)=1$, respectively.

From (23), if $f_1^{(s)}(x_1(t))$ is strictly greater than 0 under $\mathbb{E}[c_1(t)|x_1(t)]<c_0+1$, then $Q_0^{(s)}\big(x_1(t+1)|f_1^{(s)}(x_1(t))\big)=Q_1^{(s)}\big(x_1(t+1)|f_1^{(s)}(x_1(t))\big)$. This is because some other users will travel on stochastic path 1 and share their observation there, so this user can always exploit the shared information on path 1 in the next time slot, no matter whether he chooses path $\pi^{(s)}(t)=1$ or not. In this case, he only compares the immediate costs of the two paths in (23) to make his decision. Therefore, at the Nash equilibrium, the immediate travel costs on the two paths satisfy

c_0+1-f_1^{(s)}(x_1(t))=\mathbb{E}[c_1(t)|x_1(t)]+f_1^{(s)}(x_1(t)).

Then the solution of $f_1^{(s)}(x_1(t))$ has the following three cases:

  • If $\mathbb{E}[c_1(t)|x_1(t)]\leq c_0-1$, then $f_1^{(s)}(x_1(t))=1$ with the maximum flow on stochastic path 1, which is the second case of (7).

  • If $c_0-1<\mathbb{E}[c_1(t)|x_1(t)]<c_0+1$, we obtain $f_1^{(s)}(x_1(t))=\frac{1}{2}+\frac{c_0-\mathbb{E}[c_1(t)|x_1(t)]}{2}$, which is the third case of (7).

  • On the other hand, if $\mathbb{E}[c_1(t)|x_1(t)]\geq c_0+1$, then $f_1^{(s)}(x_1(t))$ may be zero, with no other users choosing path 1. In this case, $Q_0^{(s)}\big(x_1(t+1)|0\big)\neq Q_1^{(s)}\big(x_1(t+1)|\epsilon\big)$, as there will be no information update if this user also chooses deterministic path 0. If $Q_0^{(s)}\big(x_1(t+1)|0\big)>Q_1^{(s)}\big(x_1(t+1)|\epsilon\big)$, this user chooses stochastic path 1 for a lower long-term cost, and thus $f_1^{(s)}(x_1(t))=\epsilon$. Otherwise, no user chooses stochastic path 1 and $f_1^{(s)}(x_1(t))=0$. In summary, $f_1^{(s)}(x_1(t))=\epsilon\cdot\mathbf{1}\{Q^{(s)}(x_1(t)|\epsilon)\leq Q^{(s)}(x_1(t)|0)\}$, which is the first case of (7).

This completes the proof of decision-making for the two-path network.

For an arbitrary number $N\geq 2$ of stochastic paths, if the travel cost of a stochastic path $i$ is larger than those of the other paths, a user likewise compares his long-term costs $Q_0^{(s)}(\mathbf{x}(t)|f_i^{(s)}(\mathbf{x}(t))=0)$ and $Q_1^{(s)}(\mathbf{x}(t)|f_i^{(s)}(\mathbf{x}(t))=\epsilon)$. In all other cases, he balances the immediate cost on path $i$ against the other paths to reach the Nash equilibrium.
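The three cases above compress into a few lines. In this sketch (our own naming), the first case's indicator is passed in as a boolean `explore_better`, standing for $\mathbf{1}\{Q^{(s)}(x_1(t)|\epsilon)\leq Q^{(s)}(x_1(t)|0)\}$, since it depends on the cost-to-go comparison rather than on immediate costs.

def selfish_flow(c0, Ec1, epsilon, explore_better):
    """Selfish equilibrium flow f_1^(s)(x_1(t)) on stochastic path 1, per (7).
    Ec1 is E[c_1(t)|x_1(t)]; explore_better stands for the indicator
    1{Q^(s)(x_1(t)|eps) <= Q^(s)(x_1(t)|0)} of the first case."""
    if Ec1 >= c0 + 1:                     # first case: long-term comparison
        return epsilon if explore_better else 0.0
    if Ec1 <= c0 - 1:                     # second case: maximum flow
        return 1.0
    return 0.5 + (c0 - Ec1) / 2.0         # third case: interior equilibrium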

-B Proof of Lemma 2

We still consider the typical two-path network to derive $f_{1,1}^{\emptyset}(y_{1,1}(t))$ and $f_{1,2}^{\emptyset}(y_{1,2}(t))$ for users with information age $d_1(t)=1$ and $d_1(t)=2$, respectively.

Given an $f_1^{\emptyset}(\mathbf{y}(t-1))$ flow of users exploring path 1 in the last time slot, the current flow of users with $d_1(t)=1$ is also $f_1^{\emptyset}(\mathbf{y}(t-1))$. As these users hold the latest information on this path, they can infer the private belief $y_{1,2}(t)$ of the users with $d_1(t)=2$ according to (4), as well as the exact flow $f_{1,2}^{\emptyset}(y_{1,2}(t))$ of those users on path 1. They then decide $f^{\emptyset}_{1,1}(y_{1,1}(t))$ to minimize their own costs. Thus, we first derive $f^{\emptyset}_{1,2}(y_{1,2}(t))$ for users with $d_1(t)=2$.

For the $\big(1-f_1^{\emptyset}(\mathbf{y}(t-1))\big)$ flow of users with $d_1(t)=2$, they only hold the coarse belief $y_{1,2}(t)$ when deciding $f^{\emptyset}_{1,2}(y_{1,2}(t))$ to minimize their expected long-run costs. To realize the expected flow $f_1^{(s)}(y_{1,2}(t))$ at the Nash equilibrium, each of them chooses stochastic path 1 with probability $f_1^{(s)}(y_{1,2}(t))$, which is derived from (7). Therefore, the routing flow of users with $d_1(t)=2$ on path 1 is:

f^{\emptyset}_{1,2}(y_{1,2}(t))=\big(1-f_1^{\emptyset}(\mathbf{y}(t-1))\big)f_1^{(s)}(y_{1,2}(t)).

We now return to users with minimum information age $d_1(t)=1$ to decide $f^{\emptyset}_{1,1}(y_{1,1}(t))$. As these users know the exact $x_1(t)$ of path 1, they aim to realize the routing flow $f^{(s)}_1(y_{1,1}(t))$ on path 1 at the Nash equilibrium. Given the $f^{\emptyset}_{1,2}(y_{1,2}(t))$ flow of users already choosing stochastic path 1 in (13), they compare $f^{(s)}_1(y_{1,1}(t))$ with $f^{\emptyset}_{1,2}(y_{1,2}(t))$ to make decisions:

  • If $f^{\emptyset}_{1,2}(y_{1,2}(t))\geq f^{(s)}_1(y_{1,1}(t))$, none of the users with $d_1(t)=1$ will choose path 1.

  • Otherwise, an $f^{(s)}_1(y_{1,1}(t))-f^{\emptyset}_{1,2}(y_{1,2}(t))$ flow of users with $d_1(t)=1$ chooses path 1. This flow is upper bounded by the total flow of users with minimum age $d_1(t)=1$, i.e., $f_1^{\emptyset}(\mathbf{y}(t-1))$.

In summary, we finally obtain $f^{\emptyset}_{1,1}(y_{1,1}(t))$ in (12). Given $f^{\emptyset}_{1,1}(y_{1,1}(t))$ in (12) and $f^{\emptyset}_{1,2}(y_{1,2}(t))$ in (13), the total flow of users finally choosing path 1 is $f^{\emptyset}_{1,1}(y_{1,1}(t))+f^{\emptyset}_{1,2}(y_{1,2}(t))$, as combined in the sketch below.
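Under the two-path setting of this proof, the age-differentiated flows combine as in the sketch below (our own naming); `f_s_age1` and `f_s_age2` are the equilibrium flows from (7) evaluated at the private beliefs $y_{1,1}(t)$ and $y_{1,2}(t)$.

def hiding_flows(f_prev, f_s_age1, f_s_age2):
    """Flows on stochastic path 1 under the hiding mechanism.
    f_prev:   last slot's flow on path 1 (= today's age-1 population)
    f_s_age1: target equilibrium flow f_1^(s)(y_{1,1}(t)) of age-1 users
    f_s_age2: equilibrium choice probability f_1^(s)(y_{1,2}(t)) of age-2 users
    """
    f12 = (1.0 - f_prev) * f_s_age2              # age-2 users, as in (13)
    f11 = min(max(f_s_age1 - f12, 0.0), f_prev)  # age-1 users top up, capped
    return f11 + f12                             # total flow on path 1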

-C Proof of Lemma 3

In (15), even if the expected travel cost $\mathbb{E}[c_i(t)|x_i(t)]$ on path $i$ is much higher than on the others, the socially optimal policy may still recommend a small flow $\epsilon>0$ of users to learn $x_i(t)$ on path $i$. As $\epsilon\rightarrow 0$, this infinitesimal flow's immediate total cost $\epsilon(\mathbb{E}[c_i(t)|x_i(t)]+\epsilon)$ is negligible. At the same time, the information they learn on path $i$ helps reduce future costs for all users. Therefore, the socially optimal policy always sets $f_i^{*}(t)\geq\epsilon$ for any stochastic path $i$.

As some users always travel on stochastic path 1 to update $x_1(t+1)$, according to (15), the socially optimal policy only needs to minimize the immediate social cost $c^{*}(x_1(t))$ defined in (8). To derive the optimal flow $f_1^{*}(x_1(t))$ in the two-path network, we calculate the first-order derivative of $c^{*}(x_1(t))$:

\frac{\partial c^{*}(x_1(t))}{\partial f_1^{*}(x_1(t))}=4f_1^{*}(x_1(t))+\mathbb{E}[c_1(t)|x_1(t)]-c_0-2.

Then there are also three cases for deciding the optimal routing flow $f_1^{*}(x_1(t))$:

  • If $\mathbb{E}[c_1(t)|x_1(t)]\leq c_0-2$, then $\frac{\partial c^{*}(x_1(t))}{\partial f_1^{*}(x_1(t))}\leq 0$ always holds. In this case, $f_1^{*}(x_1(t))=1$, which is the second case of (16).

  • If $\mathbb{E}[c_1(t)|x_1(t)]\geq c_0+2$, then $\frac{\partial c^{*}(x_1(t))}{\partial f_1^{*}(x_1(t))}\geq 0$ always holds. In this case, $f_1^{*}(x_1(t))=\epsilon$, which is the first case of (16).

  • Otherwise, by solving $\frac{\partial c^{*}(x_1(t))}{\partial f_1^{*}(x_1(t))}=0$, we obtain $f_1^{*}(x_1(t))=\frac{1}{2}+\frac{c_0-\mathbb{E}[c_1(t)|x_1(t)]}{4}$, which is the third case of (16).

This completes the proof of Lemma 3.
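For comparison with the selfish flow in (7), here is the optimal flow (16) in the same sketch style: the interior case divides the cost gap by 4 rather than 2, which is why selfish users route too much flow onto an attractive stochastic path.

def optimal_flow(c0, Ec1, epsilon):
    """Socially optimal flow f_1^*(x_1(t)) on stochastic path 1, per (16)."""
    if Ec1 >= c0 + 2:                 # first case: keep epsilon exploration
        return epsilon
    if Ec1 <= c0 - 2:                 # second case: maximum flow
        return 1.0
    return 0.5 + (c0 - Ec1) / 4.0     # third case: first-order condition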

-D Proof of Lemma 4

We first prove that $x_i(t+1)$ increases monotonically with $x'_i(t)$ in (2). Then we prove $x_i(t+1)$ is lower- and upper-bounded by $q_{LH}$ and $q_{HH}$, attained at $x'_i(t)=0$ and $x'_i(t)=1$, respectively. Finally, according to the definitions of $y_{i,1}(t)$ in (3) and $y_{i,2}(t)$ in (4), the same bounds hold for $y_{i,d_i(t)}(t)$.

According to (2), we obtain the first-order derivative of $x_i(t+1)$ w.r.t. $x'_i(t)$ below:

\frac{\partial x_i(t+1)}{\partial x'_i(t)}=q_{HH}-q_{LH},

which is always greater than 0 for the Markov chain in Fig. 1(b). If $x'_i(t)=0$ after a good observation, then $x_i(t+1)=q_{LH}$, the minimum; if $x'_i(t)=1$, we obtain the maximum hazard belief $x_i(t+1)=q_{HH}$.

Therefore, the public belief $x_i(t)$ lies in the range $[q_{LH},q_{HH}]$ for any $t$, and the private belief $y_{i,d_i(t)}(t)$ satisfies the same bounds, as the sketch below verifies at the two endpoints.
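A two-line endpoint check of Lemma 4, using the linear form of the belief update (2) implied by the constant derivative above (our reading of (2)):

def next_belief(x_post, q_LH, q_HH):
    """Belief update (2): x_i(t+1) = x'_i(t)*q_HH + (1 - x'_i(t))*q_LH,
    the linear form implied by the derivative q_HH - q_LH above."""
    return x_post * q_HH + (1.0 - x_post) * q_LH

# Beliefs stay inside [q_LH, q_HH]: the endpoints map exactly to the bounds.
assert next_belief(0.0, 0.1, 0.9) == 0.1 and next_belief(1.0, 0.1, 0.9) == 0.9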

-E Proof of Proposition 1

We prove this lower bound of PoA in (18) by constructing the worst-case scenario.

At the initial time $t=0$ with $\mathbb{E}[c_i(0)|x_i(0)]\rightarrow c_0+1$, according to (7), a flow of selfish users will explore each stochastic path $i$. As there are $N$ stochastic paths, with probability $1-\prod_{i=1}^{N}x_i(0)$ they find at least one stochastic path $i$ with $c_i(0)=c_L$. In this case, the expected social cost caused by selfish users satisfies

C^{(s)}(\mathbf{x}(0)) = (c_0+1-\epsilon)(1-\epsilon)+N\mathbb{E}[c_i(0)|x_i(0)]\epsilon+\rho C^{(s)}(\mathbf{x}(1))
\geq c_0+1+\sum_{j=1}^{\infty}\rho^j\Big((c_0+1)\prod_{i=1}^{N}x_i(0)+\frac{1}{N}\big(1-\prod_{i=1}^{N}x_i(0)\big)\Big)
\geq c_0+1+\sum_{j=1}^{\infty}\rho^j(c_0+1)\prod_{i=1}^{N}x_i(0)
= c_0+1+\frac{\rho(c_0+1)\prod_{i=1}^{N}x_i(0)}{1-\rho}. \quad (24)

Here the first inequality is due to

C^{(s)}(\mathbf{x}(1))\geq\sum_{j=0}^{\infty}\rho^j\Big((c_0+1)\prod_{i=1}^{N}x_i(0)+\frac{1}{N}\big(1-\prod_{i=1}^{N}x_i(0)\big)\Big),

which holds because selfish users never explore any path $i$ again after finding $x_i(t)=q_{HH}$ there (an event of probability $\prod_{i=1}^{N}x_i(0)$), and $C^{(s)}(\mathbf{x}(1))$ is lower bounded by the best case in which all users keep exploiting the $N$ stochastic paths (with probability $1-\prod_{i=1}^{N}x_i(0)$). Note that $\mathbb{E}[x_i(t)]=x_i(0)$ according to (2) for any time $t\geq 1$.

Given the selfish social cost $C^{(s)}(\mathbf{x}(0))$ in (24), we next calculate the optimal social cost. Different from the minimum exploration of the selfish policy, the socially optimal policy always lets an $\epsilon$ flow of users explore each stochastic path $i$. Even if they find bad traffic conditions on path $i$, an $\epsilon$ flow keeps exploring this path until good conditions are found. If users observe $c_i(t)=c_H$ on path $i$, let $k$ denote the maximum number of time slots before observing the good condition $c_i(t+k)=c_L$ again on path $i$, which satisfies $q_{HH}^k\leq\delta$ for an infinitesimal $\delta\rightarrow 0$. Here $x_i(t)$ becomes $q_{HH}$ after observing $c_H$ on path $i$, according to Lemma 4. Then we obtain $k=\lceil\log_{q_{HH}}\delta\rceil$. After observing $c_L$ again, all users can exploit this path at a near-zero cost. As there are $N$ paths, after at most $\frac{k}{N}$ time slots, users observe at least one path in a good condition. Given $q_{LL}\rightarrow 1$ and $\frac{1-q_{LL}}{1-q_{HH}}\rightarrow 0$, we finally obtain the minimum social cost based on the above analysis:

C^{*}(\mathbf{x}(0)) \leq c_0+1+\Big(1-\prod_{i=1}^{N}x_i(0)\Big)\sum_{j=1}^{\infty}\rho^j+\prod_{i=1}^{N}x_i(0)\,\rho C^{*}(q_{HH})
\leq c_0+1+\Big(1-\prod_{i=1}^{N}x_i(0)\Big)\frac{\rho}{1-\rho}+\prod_{i=1}^{N}x_i(0)\sum_{\tau=1}^{\lceil\log_{q_{HH}}\delta\rceil/N}\rho^{\tau}(c_0+1)
= c_0+1+\Big(1-\prod_{i=1}^{N}x_i(0)\Big)\frac{\rho}{1-\rho}+\frac{\big(\rho-\rho^{\lceil\log_{q_{HH}}\delta\rceil/N}\big)(c_0+1)}{1-\rho}\prod_{i=1}^{N}x_i(0). \quad (25)

Combining the social cost $C^{(s)}(\mathbf{x}(0))$ caused by the selfish users in (24) and the minimum social cost $C^{*}(\mathbf{x}(0))$ in (25), we obtain

\text{PoA}^{(s)} \geq \frac{C^{(s)}(\mathbf{x}(0))}{C^{*}(\mathbf{x}(0))}
= \frac{1-\rho+\rho\prod_{i=1}^{N}x_i(0)}{1-\rho+\big(1-\prod_{i=1}^{N}x_i(0)\big)\frac{\rho}{c_0+1}+\big(\rho-\rho^{\lceil\log_{q_{HH}}\delta\rceil/N}\big)\prod_{i=1}^{N}x_i(0)}
= \frac{1-\rho+\rho\prod_{i=1}^{N}x_i(0)}{1-\rho+\big(\rho-\rho^{\lceil\log_{q_{HH}}\delta\rceil/N}\big)\prod_{i=1}^{N}x_i(0)},

since $c_0\gg 1$ makes $\frac{\rho}{c_0+1}\rightarrow 0$.

This PoA lower bound approaches infinity as $\rho^{\lceil\log_{q_{HH}}\delta\rceil/N}\rightarrow 1$, which occurs under $\mathcal{O}(\frac{1}{1-\rho})\gg\mathcal{O}(\frac{1}{1-q_{HH}})$, as the sketch below illustrates numerically.
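To see how quickly this bound blows up, the sketch below evaluates the final expression for sample parameters; `prod_x` stands for $\prod_{i=1}^{N}x_i(0)$ and `delta` for the stopping threshold $\delta$, all chosen for illustration only.

import math

def poa_sharing_lower_bound(rho, q_HH, prod_x, N, delta=1e-3):
    """Evaluate the PoA lower bound for the sharing mechanism derived above."""
    k = math.ceil(math.log(delta, q_HH))      # slots until q_HH^k <= delta
    return (1 - rho + rho * prod_x) / (
        1 - rho + (rho - rho ** (k / N)) * prod_x)

# The bound grows without limit as rho -> 1 much faster than q_HH:
for rho in (0.9, 0.99, 0.999):
    print(rho, poa_sharing_lower_bound(rho, q_HH=0.9, prod_x=0.5, N=2))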

-F Proof of Proposition 2

To prove Proposition 2, we consider the same worst-case scenario with minimum exploration as in the proof in Appendix E.

Under the information-hiding mechanism, according to (12) and (13), a small $\epsilon$ flow of users chooses each stochastic path $i$ to travel. For this $\epsilon$ flow of users, if they observe $c_i(0)=c_L$ on path $i$, they keep choosing this path at the next time slot $t=1$. If this $f_i^{\emptyset}(0)=\epsilon$ flow of users observes $c_i(0)=c_H$ there, none of them will travel on this path again from $t=1$ onward, based on $f_{i,1}^{\emptyset}(y_{i,1}(t))$ in (12). For the other $1-\epsilon$ flow of users choosing path 0 at time $t=0$, there is no observation to update their posterior belief $y_{i,2}(1)$ for the next time slot $t=1$. Thus, at time $t=1$, these users with information age $d_i(1)=2$ still choose path 0. However, at the end of time $t=1$, they can observe $f^{\emptyset}_i(1)$ to update their beliefs $y_{i,2}(2)$:

  • If they observe $f^{\emptyset}_i(1)=0$, they infer that $c_i(0)=c_H$ and keep choosing path 0 from the next time slot $t=2$ onward.

  • If they observe $f^{\emptyset}_i(1)=\epsilon$, they infer that $c_i(0)=c_L$ and choose stochastic path $i$ with its good condition from $t=2$ onward.

Based on the analysis above and our proof of Proposition 1 in Appendix E, we calculate the social cost under the hiding mechanism as:

C^{\emptyset}(\mathbf{x}(0)) \geq c_0+1+\rho(c_0+1)+\sum_{j=2}^{\infty}\rho^j\Big((c_0+1)\prod_{i=1}^{N}x_i(0)+\frac{1}{N}\big(1-\prod_{i=1}^{N}x_i(0)\big)\Big)
\geq (1+\rho)(c_0+1)+\frac{\rho^2(c_0+1)\prod_{i=1}^{N}x_i(0)}{1-\rho},

where $f_i^{\emptyset}(1)=\epsilon$ implies that the discounted social cost for all users at time $t=1$ is $\rho(c_0+1)$.

The socially optimal policy still always lets some users explore each stochastic path, and the resultant minimum social cost $C^{*}(\mathbf{x}(0))$ is the same as (25) in Appendix E. Therefore, the PoA under the hiding mechanism satisfies

\text{PoA}^{\emptyset}\geq\frac{1-\rho^2+\rho^2\prod_{i=1}^{N}x_i(0)}{1-\rho+\big(\rho-\rho^{k/N}\big)\prod_{i=1}^{N}x_i(0)},

which approaches infinity under the same conditions as in Proposition 1.

-G Proof of Lemma 5

We demonstrate that the deterministic recommendation mechanism results in an infinite PoA by showing that users continue to make routing decisions as in the information-hiding mechanism.

Under the deterministic recommendation mechanism, when a user is advised to select a stochastic path, say $\pi(t)=i$, the only information it receives is $f_i^{*}(t)>0$. This limited information does not alter the user's posterior belief $y_{i,d_i(t)}$ about path $i$, leading it to estimate its expected travel cost as it would without any recommendation. Consequently, the user disregards this recommendation. Similarly, if a user is advised to take deterministic path $\pi(t)=0$, its posterior beliefs regarding all the stochastic paths remain unaffected. Thus, each user continues to follow $f_i^{\emptyset}(\mathbf{y}(t))$ when deciding on a path, resulting in $\text{PoA}=\infty$, as established in Proposition 2.

-H Proof of Proposition 3

To prove Proposition 3, we consider the worst case with maximum exploration on stochastic paths.

Since users with minimum information age $d_i(t)=1$ have complete information on path $i$, no informational mechanism can change their path decisions. For example, consider $q_{LH}\rightarrow 0$, $q_{HH}<1$, $c_L=0$, $x_i(t)=q_{LH}$ and $\mathbb{E}[c_i(t)|x_i(t)=q_{LH}]<c_0-1$. Users with $d_i(t)=1$ know the exact hazard belief $x_i(t)=q_{LH}$ on path $i$. As the travel cost on path $i$ satisfies $\mathbb{E}[c_i(t)|x_i(t)=q_{LH}]<c_0-1$, these users always choose path $i$ to exploit the low cost there. Motivated by this example, we construct the worst case with maximum exploration, in which every user has $d_i(t)=1$, to prove Proposition 3.

Let $x_i(0)=q_{LH}$, $q_{LH}\rightarrow 0$, $q_{HH}<1$, $c_L=0$ and $\mathbb{E}[c_i(0)|x_i(0)]=c_0-\frac{1}{N}$ for any stochastic path $i\in\mathbf{N}$. According to our analysis of $f^{(s)}_i(\mathbf{x}(0))$ in (7), a $\frac{1}{N}$ flow of users travels on each stochastic path $i$ at the initial time $t=0$. Each user is then aware of the public belief $x_i(1)=q_{LH}$ on his chosen stochastic path $i$, with minimum information age $d_i(1)=1$ for the next time $t=1$. Consequently, under $\mathbb{E}[c_i(1)|x_i(1)]=c_0-\frac{1}{N}$, they behave the same as under the sharing mechanism by continuing to choose their formerly chosen paths from $t=1$ onward, so any informational mechanism fails to regulate them. The resulting social cost caused by selfish users is

C^{(s)}(\mathbf{x}(0)) = \sum_{j=0}^{\infty}\rho^j\Big(\mathbb{E}[c_i(1)|x_i(1)]+\frac{1}{N}\Big) = \frac{\mathbb{E}[c_i(1)|x_i(1)]+\frac{1}{N}}{1-\rho}.

On the other hand, according to (16) in Lemma 3, the socially optimal policy sets the flow of users choosing deterministic path 0 to

f^{*}(\mathbf{x}(t))=\frac{1}{2(N+1)},

so as to reduce the congestion cost on the stochastic paths, leaving each stochastic path a flow of $\frac{2N+1}{2N(N+1)}$. Then we obtain the minimum social cost as:

C^{*}(\mathbf{x}(0)) = \sum_{j=0}^{\infty}\rho^j\bigg(\Big(1-\frac{1}{2(N+1)}\Big)\Big(\mathbb{E}[c_i(0)|x_i(0)]+\frac{2N+1}{2N(N+1)}\Big)+\frac{1}{2(N+1)}\Big(c_0+\frac{1}{2(N+1)}\Big)\bigg) = \frac{(4N+3)\big(\mathbb{E}[c_i(1)|x_i(1)]+\frac{1}{N}\big)}{4(N+1)(1-\rho)},

which is derived by substituting $c_0=\mathbb{E}[c_i(1)|x_i(1)]+\frac{1}{N}$ and letting $\mathbb{E}[c_i(1)|x_i(1)]\rightarrow 0$ (as $q_{LH}\rightarrow 0$ and $c_L=0$). Finally, the resulting minimum PoA is

\text{PoA}\geq\frac{C^{(s)}(\mathbf{x}(0))}{C^{*}(\mathbf{x}(0))}=\frac{4(N+1)}{4N+3}.

Therefore, no informational mechanism can reduce the PoA below $\frac{4(N+1)}{4N+3}=1+\frac{1}{4N+3}$; the snippet below numerically recovers this ratio.
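A quick numerical sanity check of this construction, a sketch under the stated limits $\mathbb{E}[c_i]\rightarrow 0$ and $c_0=\frac{1}{N}$: we minimize the immediate social cost over the flow on path 0 on a grid and compare it with the selfish cost. The helper name and grid size are illustrative.

def prop3_poa(N, grid=10000):
    """Numerically recover PoA = 4(N+1)/(4N+3) for the example above,
    with E[c_i] = 0 and c_0 = 1/N (every path costs c_0 under selfish play)."""
    c0, E = 1.0 / N, 0.0
    def social_cost(f0):                 # immediate social cost, flow f0 on path 0
        fs = (1.0 - f0) / N              # symmetric flow on each stochastic path
        return (1.0 - f0) * (E + fs) + f0 * (c0 + f0)
    c_opt = min(social_cost(j / grid) for j in range(grid + 1))
    return (E + 1.0 / N) / c_opt         # selfish cost over optimal cost

print(prop3_poa(1), 8 / 7)   # both are about 1.1429 for N = 1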

-I Proof of Theorem 1

We first analyze the recommendations received by users with different private information under our UPR mechanism. Then, we prove that our UPR mechanism ensures IIR for all non-myopic users, prompting them to follow the system's random path recommendations. Finally, we demonstrate that under our UPR mechanism's random recommendations, users only over-explore stochastic paths, which reduces the PoA to the minimum ratio in (22), as stated in Proposition 3.

At any given time $t\geq 1$, our UPR mechanism provides differential recommendations to users who have different past observations, in order to influence their path decisions:

  • 1. For the $f_i^{(\text{UPR})}(t-1)$ flow of users who observed $c_i(t-1)=c_L$ on path $i$ with minimum age $d_i(t)=1$, no informational mechanism can stop them from exploiting this path, based on our analysis in Proposition 3. Therefore, the platform simply recommends that they follow their own selfish routing decisions in (7), and these users stick to path $i$ at time $t$.

  • 2. For users with $d_i(t)=2$ for stochastic path $i$, the platform provides probabilistic recommendations for them to choose any stochastic path $i\in I(t):=\{i\in\mathbf{N}\,|\,\bar{f}^{(s)}(n,t)>f^{(\text{UPR})}_i(t-1)\}$, where $\bar{f}^{(s)}(n,t)$ is the maximum flow of users on any path $i$. The condition in set $I(t)$ means that the platform recommends some users with $d_i(t)=2$ to choose path $i$ at the current time slot $t$ only if the flow on path $i$ in the last time slot satisfies $f^{(\text{UPR})}_i(t-1)<\bar{f}^{(s)}(n,t)$. If $f^{(\text{UPR})}_i(t-1)\geq\bar{f}^{(s)}(n,t)$, our UPR mechanism simply lets the former $f^{(\text{UPR})}_i(t-1)$ flow of users with $d_i(t)=1$ follow their selfish decisions to determine $f_i^{(\text{UPR})}(t)=\bar{f}^{(s)}(n,t)$, as in case 1 above.

In the second case, given $f^{(\text{UPR})}_i(t-1)<\bar{f}^{(s)}(n,t)$, if path $i$ is in a good condition, our UPR mechanism aims to realize an $f_i^{(\text{UPR})}(t)=\bar{f}^{(s)}(n,t)$ flow of users exploiting this path. Conversely, if path $i$ is in a bad traffic condition, our UPR mechanism aims to ensure that at least an $\epsilon$ flow of users chooses this path, in order to achieve the social optimum in (16). Note that if $f_i^{(s)}(\mathbf{x}(t))>\epsilon$ under $x_i(t)=q_{HH}$, then in addition to the recommended flow $\epsilon$, users with $d_i(t)=1$ make their selfish decisions $f_i^{\emptyset}(\mathbf{x}(t))$ according to (12) to choose this path.

Based on the above analysis, we next prove that, whatever private recommendation $\pi(t)$ a user receives under our UPR mechanism, it is interim individually rational for the user to follow it.

Suppose a user with $d_i(t)=2$ receives the private path recommendation $\pi(t)=i$ under our UPR mechanism. According to (20), this user infers the probability of a good condition on stochastic path $i$ as:

\mathbf{Pr}(x_i(t)=q_{LH}|\pi(t)=i) = \frac{\mathbf{Pr}(x_i(t)=q_{LH},\pi(t)=i)}{\mathbf{Pr}(\pi(t)=i)}
= \frac{\mathbf{Pr}(\pi(t)=i|x_i(t)=q_{LH})\mathbf{Pr}(x_i(t)=q_{LH})}{\sum_{x_i(t)\in\{q_{LH},q_{HH}\}}\mathbf{Pr}(\pi(t)=i|x_i(t))\mathbf{Pr}(x_i(t))}
= \frac{\frac{\bar{f}^{(s)}(n,t)-f^{(\text{UPR})}_i(t-1)}{1-f^{(\text{UPR})}_i(t-1)}\,(1-x_i(t-1))}{\frac{\bar{f}^{(s)}(n,t)-f^{(\text{UPR})}_i(t-1)}{1-f^{(\text{UPR})}_i(t-1)}\,(1-x_i(t-1))+\epsilon\, x_i(t-1)}, \quad (26)

which approaches 1 as $\epsilon\rightarrow 0$. Here the first equality follows from Bayes' rule and the second from the law of total probability. Based on (26), this user is almost certain that $x_i(t)=q_{LH}$, as the sketch below confirms numerically.
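The Bayesian inference (26) in code (our own naming), assuming the prior $\mathbf{Pr}(x_i(t)=q_{LH})=1-x_i(t-1)$ used above; the posterior of a good condition approaches 1 as $\epsilon$ shrinks.

def posterior_good(f_bar, f_prev, x_prev, eps):
    """Posterior Pr(x_i(t)=q_LH | pi(t)=i) of a good condition, per (26)."""
    p_rec_good = (f_bar - f_prev) / (1.0 - f_prev)  # Pr(pi=i | q_LH), from (21)
    num = p_rec_good * (1.0 - x_prev)               # good-state branch
    return num / (num + eps * x_prev)               # add bad-state branch eps

for eps in (0.1, 0.01, 0.001):      # posterior -> 1 as eps -> 0
    print(eps, posterior_good(f_bar=0.6, f_prev=0.2, x_prev=0.5, eps=eps))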

Then this user infers the expected travel cost on path ii as:

\lim_{\epsilon\rightarrow 0}\mathbb{E}[c_i(t)|\pi(t)=i] = \lim_{\epsilon\rightarrow 0}\Big\{\mathbf{Pr}(x_i(t)=q_{LH}|\pi(t)=i)\mathbb{E}[c_i(t)|x_i(t)=q_{LH}]+\mathbf{Pr}(x_i(t)=q_{HH}|\pi(t)=i)\mathbb{E}[c_i(t)|x_i(t)=q_{HH}]\Big\} = \mathbb{E}[c_i(t)|x_i(t)=q_{LH}],

which is the minimum expected cost. As $f^{(\text{UPR})}_i(t-1)\leq\bar{f}^{(s)}(n,t)$ by our analysis above, following the recommendation $\pi(t)=i$ to choose stochastic path $i$ is interim individually rational for this user.

Similarly, if the user receives recommendation $\pi(t)=0$, it infers that every stochastic path $i$ is almost certainly in a bad condition with $x_i(t)=q_{HH}$. Suppose the other users follow our UPR recommendations. Then, for any stochastic path in a good condition $x_i(t)=q_{LH}$, the expected flow of users on it is $\bar{f}^{(s)}(n,t)$. Therefore, following the recommendation to choose path 0 is also interim individually rational.

Finally, since users always follow our UPR mechanism's recommendations, we prove that UPR attains the minimum PoA in (22), i.e., the lower bound in Proposition 3.

According to the recommendation probability $\mathbf{Pr}(\pi(t)=i|x_i(t))$ in (21), if stochastic path $i$ has a bad traffic condition $x_i(t)=q_{HH}$, our UPR mechanism recommends a $(1-f_i^{(\text{UPR})}(t-1))\epsilon$ flow of users with information age $d_i(t)=2$ to choose this path. The $f_i^{(\text{UPR})}(t-1)$ flow of users with $d_i(t)=1$, who follow (12) to make their selfish routing decisions, reverse-engineer the recommended flow $\epsilon$ to make their routing decision $f_i^{\emptyset}(\mathbf{x}(t))$. In this case, the total flow of users choosing path $i$ under our UPR mechanism satisfies

f^{(\text{UPR})}_i(t) = (1-f_i^{(\text{UPR})}(t-1))\epsilon + f_i^{\emptyset}(\mathbf{x}(t)) = \max\{f_i^{(s)}(y_{i,1}(t)),\,(1-f_i^{(\text{UPR})}(t-1))\epsilon\}, \quad (27)

where the second equality is derived by (12).

Conversely, if path $i$ has a good traffic condition $x_i(t)=q_{LH}$, our UPR mechanism recommends an expected flow $\bar{f}^{(s)}(n,t)-f^{(\text{UPR})}_i(t-1)$ of users with information age $d_i(t)=2$ to choose this path. As the other $f^{(\text{UPR})}_i(t-1)$ flow of users with $d_i(t)=1$ sticks to path $i$, our UPR mechanism achieves

f^{(\text{UPR})}_i(t)=\bar{f}^{(s)}(n,t). \quad (28)

Based on our analysis of (27) and (28), the flow $f^{(\text{UPR})}_i(t)$ under our UPR mechanism always satisfies

f^{(\text{UPR})}_i(t)\geq f_i^{*}(\mathbf{x}(t)),

where $f_i^{*}(\mathbf{x}(t))$ is derived by (16) in Lemma 3. This means that our UPR mechanism always prevents users' under-exploration of any stochastic path $i$, so the infinite PoA values in Propositions 1 and 2 are successfully avoided.

As users can only over-explore stochastic paths, the worst-case scenario is the maximum over-exploration in Proposition 3, where all users choose stochastic paths such that $f^{(s)}(\mathbf{x}(t))=\bar{f}^{(s)}(n,t)$. Since $x_i(t)=q_{LH}$ always holds in this worst case, our UPR mechanism always recommends an $f^{(\text{UPR})}_i(t)=\bar{f}^{(s)}(n,t)$ flow of users on path $i$, based on (28). Therefore, our UPR mechanism leads to the same PoA of $1+\frac{1}{4N+3}$ as in Proposition 3, which cannot be further reduced. The sketch below summarizes the realized flow update.
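Putting (27) and (28) together, the realized UPR flow on one stochastic path reduces to the small sketch below; `f_s_age1` plays the role of the age-1 users' selfish target flow $f_i^{(s)}(y_{i,1}(t))$, and all names are illustrative.

def upr_flow(prev_flow, good_condition, f_bar, f_s_age1, eps):
    """Realized flow f_i^(UPR)(t) on stochastic path i under UPR.
    Good condition (x_i(t) = q_LH): top the flow up to f_bar, as in (28).
    Bad condition (x_i(t) = q_HH): epsilon exploration plus any selfish
    age-1 flow, i.e., the max in (27)."""
    if good_condition:
        return f_bar
    return max(f_s_age1, (1.0 - prev_flow) * eps)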