
Algorithmic collusion in a two-sided market: A rideshare example*

Pravesh Koirala and Forrest Laine
Department of Computer Science, Vanderbilt University, Nashville, TN, USA
[email protected], [email protected]
*This work was not supported by any organization.
Abstract

With dynamic pricing on the rise, firms are using sophisticated algorithms for price determination. These algorithms are often non-interpretable, and there has been recent interest in their seemingly emergent ability to tacitly collude with each other without any prior communication whatsoever. Most previous work investigates algorithmic collusion for simple reinforcement learning (RL) based algorithms operating on a basic market model. Instead, we explore the collusive tendencies of Proximal Policy Optimization (PPO), a state-of-the-art continuous state/action space RL algorithm, on a complex double-sided hierarchical market model of rideshare. For this purpose, we extend a mathematical program network (MPN) based rideshare model to a temporal multi origin-destination setting and use PPO to solve for a repeated duopoly game. Our results indicate that PPO can converge to either a competitive or a collusive equilibrium depending upon the underlying market characteristics, even when the hyper-parameters are held constant.

I INTRODUCTION

Dynamic pricing has seen a large uptrend ever since internet-enabled handheld devices became commonplace. Smartphones, equipped with GPS and anytime connectivity, can continuously push location (and other relevant) statistics to centralized servers, which can then estimate localized demand patterns. Based on this demand information, sophisticated machine learning models can instantaneously produce an optimal price for any given spatio-temporal region. Industries like gasoline, ticketing, and online retail are known to use dynamic pricing [1]. Recently, platform economies like rideshares have been using dynamic pricing to hike prices during periods of high demand, i.e. surge, with the intent of ensuring high service availability for priority requests. It is argued that in platform economies like rideshares, dynamic pricing produces better outcomes both in the short term, by incentivizing drivers to accept more rides, and in the long term, by shaping drivers' schedules around identified surge periods [2]. Regardless, an increasing number of industries seem to be gravitating towards dynamic pricing, including food giants like Wendy's, which recently withdrew plans for proposed surge pricing following a sharp backlash [3].

With dynamic pricing gaining traction, much has been invested in devising complex models that can accurately predict an optimum price given informative attributes of any spatio-temporal frame. However, most sophisticated machine learning models are effectively black boxes [4]. They are known to learn non-trivial latent representations of the given dataset, and often these representations are not readily interpretable. A recent concern with these machine learning based pricing agents is the possibility of tacit collusion. Collusion refers to the tendency of rational agents to cooperate with each other with the intent of earning profits that would be higher than if they were competing, i.e. supra-competitive profits. While in most economies around the world coordinated collusion among firms is heavily penalized in the spirit of promoting healthy competition and maintaining a free market economy, emergent collusion in pricing algorithms is a perplexing phenomenon from not only a regulatory but also a technological perspective. Perhaps most important is the fact that it is not entirely clear what causes collusive behavior to emerge in disconnected algorithmic agents. Some works argue that certain hyperparameters give rise to collusion [5], while others argue that superior algorithms cause this behavior [6]. Irrespective of these unknowns, evidence suggests that algorithmic collusion does exist, with a study claiming an increase of almost 28% in margins in the German retail gasoline market after the adoption of dynamic pricing by duopolies [7].

Much of the related literature has investigated simple algorithms like Q-learning in simpler market settings like single-sided duopoly competition to argue about potential collusion, but markets are often complex (hierarchical, multi-sided, etc.) and industries are moving towards increasingly powerful methods to optimize their profits. It is not apparent, then, whether the collusive behavior observed in the established literature also emerges in such non-trivial market regimes under modern algorithms. Also, to our knowledge, there is no prior work that studies algorithmic collusion as a function of market responsiveness, i.e. the rate at which a market adjusts to pricing changes. We believe that this is an important market attribute that warrants exploration, as it signifies how aware a market is of the underlying pricing algorithms. In this paper, we study collusive behaviors of sophisticated pricing algorithms operating in a complex platform market, namely rideshare (used synonymously with ridehailing throughout this work, as per established usage). We choose rideshare for this study as it is a flagship example of a hierarchical market, with its two networks (passengers and drivers) and various inter/intra-network externalities. We begin with a mathematical program network (MPN) [8] based model of rideshare first presented by Koirala and Laine [9] and extend it to a dynamic multi origin-destination setting. Since we want to investigate a modern reinforcement learning algorithm for potential collusion, we use Proximal Policy Optimization (PPO) [10], which operates in a continuous state/action space and is a much more realistic algorithm for pricing agents in today's context. We further extend our study by comparing the collusive behavior of the agents under different intrinsic market characteristics. In summary, our major contributions are as follows:

  • We investigate potential collusive behaviors in a non-trivial market with modern reinforcement learning algorithms.

  • We compare the emergent collusive behavior under different levels of market responsiveness.

To the best of our knowledge, these contributions are novel and add substantively to the literature on algorithmic pricing and collusion. The rest of the paper is organized as follows: in section II, we present a brief overview of the existing literature on algorithmic collusion. In section III, we describe our MPN-based model, and in section IV, we explain our experimental setup. Finally, in section V, we describe our obtained results and conclude our arguments in section VI.

II Literature Review

Huck et al. [11] first showed that simple learning agents using a trial-and-error approach can learn to collude, without even having knowledge of the payoff function, in a Cournot duopoly. The seminal study by Waltman et al. [5] showed, via simulation, that Q-learning agents can learn to collude with each other in Cournot oligopoly games, and while they may not learn optimal collusive behavior, their joint profit is better than that of perfect competition. They also analytically argue that potential collusion results from the exploratory tendencies of the algorithms. Salcedo [12] showed that in a dynamic game where firms use a fixed pricing algorithm that can be revised over time and be decoded by competitors, collusion is not only possible but inevitable.

Perhaps the most defining work in algorithmic pricing is by Calvano et al. [13], who argued that Q-learning agents in an oligopoly model learn to charge supracompetitive prices without any explicit communication with each other. The distinguishing feature of their work is the finding that collusion among agents is enforced via punishments in case of deviation. A counter-argument is proposed by Miklós-Thal and Tucker [14], who argue that algorithms with better demand forecasting can, in fact, lead to an increase in social welfare. Brown and MacKay [6] performed a study on the pricing behavior of firms with asymmetric pricing technology. They showed that if a single firm adopts superior pricing technology, all firms are able to obtain higher prices, and if all firms adopt automated high-frequency algorithms, collusive pricing emerges. Klein [15] presented a different pricing model based on Q-learning where agents, instead of deciding on prices simultaneously, take a turn-based approach. They found that even under this market regime, collusive pricing does emerge. Abada and Lambin [16] extended results from [5, 13] to a multi-agent pricing game to buy and sell a storable good. They argued that seeming collusion may originate in imperfect exploration (where algorithms lock in rapidly to a state of exploitation) and not algorithmic sophistication. A work by Boer et al. [17] questioned the results obtained by Calvano [13] and argued that simple Q-learning algorithms are not a cartel concern. Sanchez-Cartas and Katsamakas [18] compared the Particle Swarm Optimization (PSO) algorithm and Q-learning for logit, Hotelling, and linear demand models and found that PSO sets more competitive prices than Q-learning.

As mentioned earlier, most of these works are based on simple models like Q-learning and PSO. Hettich [19] replicated the results of [13] using a Deep Q-Network based architecture and showed significantly faster convergence and, under a certain state reformulation, collusive outcomes for wider oligopolies (i.e. with more participants). However, their model is still a discrete-space model, and the first continuous-space model to investigate algorithmic collusion seems to be from Wang [20], who used a natural policy gradient method to demonstrate a counter-result, i.e. they show that their policy-gradient-based implementation converges to the competitive Nash equilibrium. They go on to suggest that algorithmic collusion might be a feature of the learning algorithm being used. Our study, like Wang [20], uses a continuous-space reinforcement learning algorithm, i.e. PPO. But unlike other existing works, with their focus on simple single-sided market duopolies, we model a complicated two-sided market, ridesharing, that has inherent externalities. Furthermore, we primarily study the role of market responsiveness, i.e. the rate at which the market adjusts to pricing decisions, in the outcome of the equilibrium and potential collusion.

III Model

The ridesharing economy has ample moving parts. First, there are the strategic agents, i.e. platforms like Uber, Lyft, Didi, etc., who decide what to charge and how much to pay to, respectively, their passengers and drivers, who, in turn, decide which platform to use and serve. When choosing a platform, many of the passengers and drivers are typically not tied to a specific one, i.e. they can use any of the available platforms depending upon the costs and the profits, and are thus engaged in what is called multi-homing behavior, adding further complexity to the model. With the drivers forming the supply network and the passengers forming the demand network, any model must also capture the interactions (externalities) within and across these two networks, i.e. the additional utility/cost derived by any participant from the presence/absence of participants in the same (same-side) or the complementary (cross-side) network. Concretely, a passenger is better served when there are more drivers on a platform due to reduced wait-times and vice-versa, which shows that there is a positive cross-side externality between the supply and demand networks. However, a passenger (driver) is worse off when there are more passengers (drivers) because of increased wait-times (decreased utilization), which is indicative of a negative same-side externality. A final modeling difficulty is due to the order of decisions, as there may be different equilibrium outcomes in the game depending upon who moves/commits to an action first.

In consideration of these challenges, we use the Mathematical Program Network (MPN) based ridesharing model introduced by Koirala and Laine [9] (other game theoretical models of rideshare can be found in [21, 22, 23, 24]). This model essentially consists of a partially ordered directed graph of mathematical optimizers which captures both the negative same-side externalities and the positive cross-side externalities inherent in the ridesharing market. Furthermore, in this model, the rideshare game is represented as a sequential game of three stages with the platforms moving first, followed by the drivers, and then the passengers. This scheme highlights the importance of the drivers as their allocation decision conclusively shapes the equilibrium outcome of the entire game. Lastly, this model has clear interpretations under both single-homing and multi-homing regimes.

Refer to caption
Figure 1: MPN based model for the ridesharing duopoly problem proposed by [9]. Platforms U and L decide wages and rates simultaneously for the drivers (D) and the passengers (P). The drivers decide which platform to work for followed by the passengers deciding which platform to use.

The terminology used in this model is as follows:

$U$: Platform U
$L$: Platform L
$O$: Outside option (public transit)
$d \in \mathbb{R}^{+}$: Average distance of a trip
$\lambda \in \mathbb{R}^{+}$: Wait cost multiplier
$r^{u/l/o} \in \mathbb{R}^{+}$: Rate charged per mile by $U/L/O$
$c^{u/l/o} \in \mathbb{R}^{+}$: Commission paid per mile by $U/L/O$
$p^{u/l/o} \in [0,1]$: Proportions of passengers using $U/L/O$
$a^{u/l} \in [0,1]$: Availability of drivers on $U/L$
$g \in \mathbb{R}^{+}$: Gas cost per mile

The game consists of three stages, with the platforms first deciding on the rate to charge and the commission to pay out per mile. Following this, the drivers collectively self-allocate themselves between the two platforms. Finally, the passengers decide which platform (or the public transit) to use depending upon the monetary and waiting costs. A node of the MPN is defined as $\{f,\xi,J\}$, where $f$ denotes the utility/cost being optimized, $\xi$ denotes the feasible set, and $J$ denotes the endogenous variables being optimized over. The solution of an MPN depends upon both the nodes and the edges of the MPN (which establish a hierarchical relationship resembling a Stackelberg process), and we refer readers to the original work introducing MPNs for further details [8]. For this particular problem, the MPN nodes of figure 1 are defined as follows (maximization objectives in bold):

$$U := \left\{\begin{aligned} f&:\boldsymbol{dp^{u}(r^{u}-c^{u})}\\ \xi&:r^{u},c^{u}\geq 0\\ J&:r^{u},c^{u}\end{aligned}\right\}\quad(1)$$

$$L := \left\{\begin{aligned} f&:\boldsymbol{dp^{l}(r^{l}-c^{l})}\\ \xi&:r^{l},c^{l}\geq 0\\ J&:r^{l},c^{l}\end{aligned}\right\}\quad(2)$$

$$D := \left\{\begin{aligned} f&:\boldsymbol{dp^{u}(c^{u}-g)+dp^{l}(c^{l}-g)}\\ \xi&:0\leq a^{u},a^{l}\leq 1 ~\cap~ a^{u}+a^{l}\leq p^{u}+p^{l}\\ J&:a^{u},a^{l}\end{aligned}\right\}\quad(3)$$

$$P := \left\{\begin{aligned} f&:p^{u}\left(dr^{u}+\lambda\frac{p^{u}}{a^{u}}\right)+p^{l}\left(dr^{l}+\lambda\frac{p^{l}}{a^{l}}\right)+p^{o}\left(dr^{o}+\lambda p^{o}\right)\\ \xi&:0\leq p^{u},p^{l},p^{o}\leq 1 ~\cap~ p^{u}+p^{l}+p^{o}=1\\ J&:p^{u},p^{l},p^{o}\end{aligned}\right\}\quad(4)$$

These nodes, in tandem with the edges shown in figure 1, define the MPN for the ridesharing problem. As can be seen, this model captures the key intricacies of the target economy. The profits of the platforms depend not only upon the rates they charge and the commissions they pay, but also upon the population that uses the service. The population using the service, in turn, depends upon the rates being charged and the availability of drivers, which itself depends upon the commissions being paid out and the number of passengers willing to use the platform. Furthermore, this model also captures the network externalities: if more drivers are present, the wait cost for passengers decreases and vice-versa; similarly, if more passengers are present, the profits for the drivers increase and vice-versa. A detailed exposition of this model can be found in [9]. Below, we mention some of the key findings of this model (an illustrative numerical evaluation of the objectives follows the list):

  1. In the perfectly competitive equilibrium, the solution degenerates to one where $r^{u}=c^{u}=r^{l}=c^{l}$, i.e. all platforms charge the same and pay the same. In this solution, no platform is able to earn any profit.

  2. There are two kinds of collusive outcomes possible: double-sided, where platforms collude by agreeing on both rates and commissions, and single-sided, where platforms collude by agreeing on the bare minimum commission payout.

  3. Since the platforms obtain the highest profit when the commission payout is low, and since it is less suspicious and more stable to maintain, single-sided collusion is the natural outcome of this game.
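To make the single-OD objectives (1)-(4) concrete, the following minimal Python sketch evaluates the platform, driver, and passenger objectives for given prices and allocations. The numeric values and helper names are purely illustrative and are not taken from the paper's codebase.

```python
# Illustrative evaluation of the single-OD node objectives (1)-(4).
# All numeric values below are made up for demonstration only.

d, g, lam, r_o = 5.0, 5.0, 2.0, 10.0   # trip distance, gas cost, wait multiplier, transit rate

def platform_profit(p, r, c):
    """Objective (1)/(2): d * p * (r - c)."""
    return d * p * (r - c)

def driver_profit(p_u, p_l, c_u, c_l):
    """Objective (3): d*p_u*(c_u - g) + d*p_l*(c_l - g)."""
    return d * p_u * (c_u - g) + d * p_l * (c_l - g)

def passenger_cost(p_u, p_l, p_o, r_u, r_l, a_u, a_l):
    """Objective (4): monetary cost plus congestion-style wait cost."""
    return (p_u * (d * r_u + lam * p_u / a_u)
            + p_l * (d * r_l + lam * p_l / a_l)
            + p_o * (d * r_o + lam * p_o))

# Example: symmetric prices and an even passenger/driver split.
r_u = r_l = 8.0
c_u = c_l = 6.0
a_u = a_l = 0.5
p_u, p_l, p_o = 0.4, 0.4, 0.2
print(platform_profit(p_u, r_u, c_u))                      # profit of platform U
print(driver_profit(p_u, p_l, c_u, c_l))                   # aggregate driver profit
print(passenger_cost(p_u, p_l, p_o, r_u, r_l, a_u, a_l))   # aggregate passenger cost
```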

III-A Extension to temporal multi origin-destination (OD) graphs

The aforementioned model depicts a one-shot interaction across a single origin and destination. We extend it to a temporal multi origin-destination version suitable for a city-like setting. Concretely, we consider a fully-connected OD graph $G$ with $N$ nodes and $E$ edges, with the corresponding edge-weights $e_{ij}$ representing the proportion of passengers at node $i$ who want to move to node $j$, where $e_{ii}=0~\forall i\in[N]$ and $[N]=\{1,2,\ldots,N\}$. Similarly, we also consider a distance matrix $\mathbb{D}$, with the same structure as $G$, but with edge-weights $d_{ij}$ representing the effective distance between any two nodes. These distances are allowed to be asymmetric because of traffic conditions and other extraneous factors, and the distance from any node to itself is taken to be 0.
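For concreteness, the two-node instance used later in section IV can be written directly as arrays; this is a minimal sketch and the variable names are ours.

```python
import numpy as np

# The two-node instance used in Section IV, written as plain arrays.
# E[i, j]: proportion of passengers at node i who want to travel to node j (E[i, i] = 0).
E = np.array([[0.0, 0.9],
              [0.2, 0.0]])

# D[i, j]: effective (possibly asymmetric) distance from node i to node j.
D = np.array([[0.0, 5.0],
              [2.0, 0.0]])

N = E.shape[0]
assert np.allclose(np.diag(E), 0.0) and np.allclose(np.diag(D), 0.0)
```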

Refer to caption
Figure 2: A graph with multiple origin-destinations. Nodes in this graph represent locations, whereas edges indicate routes. Edge weights represent the proportions of passengers wanting to move to the corresponding nodes (e.g. $e_{31}$: the proportion of passengers at node 3 who want to go to node 1). The distances of these routes are given by the $\mathbb{D}$ matrix (e.g. $d_{13}$: the distance from node 1 to node 3).

In this temporal multi OD setting, the platforms optimize rates and commissions for each edge, i.e. $r^{u/l}_{ij}(t), c^{u/l}_{ij}(t)~\forall (i,j)\in E$, the drivers optimize their allocations for each node, i.e. $a^{u/l}_{i}(t)~\forall i\in[N]$, and the passengers optimize their preference for each edge, i.e. $p^{u/l/o}_{ij}(t)$, at a given time $t$. We further introduce population distribution parameters for drivers and passengers, $\Pi^{D}, \Pi^{P}$, such that $\Pi^{D}_{i}(t), \Pi^{P}_{i}(t)$ indicate the number of drivers and passengers at node $i$ at time $t$. The dynamics of these distribution parameters are defined as follows:

$$\dot{\Pi}^{P}_{j}(t) = \sum_{i\in[N]\setminus\{j\}}{\left(\mathcal{F}^{u}_{ij}(t)+\mathcal{F}^{l}_{ij}(t)+\mathcal{F}^{o}_{ij}(t)\right)} - \sum_{i\in[N]\setminus\{j\}}{\left(\mathcal{F}^{u}_{ji}(t)+\mathcal{F}^{l}_{ji}(t)+\mathcal{F}^{o}_{ji}(t)\right)}\quad(5)$$

$$\dot{\Pi}^{D}_{j}(t) = \sum_{i\in[N]\setminus\{j\}}{\left(\mathcal{F}^{u}_{ij}(t)+\mathcal{F}^{l}_{ij}(t)\right)} - \sum_{i\in[N]\setminus\{j\}}{\left(\mathcal{F}^{u}_{ji}(t)+\mathcal{F}^{l}_{ji}(t)\right)}\quad(6)$$

where $\mathcal{F}^{u/l/o}_{ij}(t)$ indicates the instantaneous flow of traffic via the corresponding platform from node $i$ to $j$ at time $t$, defined as:

$$\mathcal{F}^{o}_{ij}(t) = \hat{\mathcal{P}}^{o}_{ij}(t)\quad(7)$$
$$\mathcal{F}^{u/l}_{ij}(t) = \min\left(\hat{\mathcal{P}}^{u/l}_{ij}(t), \hat{\mathcal{D}}^{u/l}_{ij}(t)\right)\quad(8)$$

where $\hat{\mathcal{P}}^{u/l/o}_{ij}(t)$ and $\hat{\mathcal{D}}^{u/l}_{ij}(t)$ are defined as the available passenger and driver flow from node $i$ to $j$ via the corresponding platforms at the given time. Intuitively, equation (7) encodes the assumption that the outside option, i.e. public transit, is pervasive and all passengers willing to take public transit are accommodated at any time, whereas equation (8) indicates that the total flow via a platform is constrained by both the available flow of passengers and drivers, with a one-to-one driver-passenger correspondence for that platform at that time. The available flows are calculated as follows:

$$\hat{\mathcal{P}}^{u/l/o}_{ij}(t) = \Pi^{P}_{i}(t)~e_{ij}~{p^{u/l/o}}^{*}_{ij}(t)\quad(9)$$
$$\hat{\mathcal{D}}^{u/l}_{ij}(t) = \Pi^{D}_{i}(t)~e_{ij}~{a^{u/l}}^{*}_{i}(t)\quad(10)$$

where ${p^{u/l/o}}^{*}_{ij}(t)$ and ${a^{u/l}}^{*}_{i}(t)$ are the optimum responses of passengers and drivers at time $t$ for the given market wage and rate conditions (see subsection III-C).
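A minimal sketch of how equations (5)-(10) can be stepped forward numerically is shown below. It assumes the optimal responses ${p^{u/l/o}}^{*}$ and ${a^{u/l}}^{*}$ have already been computed (subsection III-C); the function name, array layout, and the explicit Euler step are our own illustrative choices rather than the paper's exact implementation.

```python
import numpy as np

def step_populations(Pi_P, Pi_D, E, p_u, p_l, p_o, a_u, a_l, dt=0.01):
    """One explicit Euler step of the population dynamics (5)-(6).

    Pi_P, Pi_D   : (N,) passenger / driver populations per node
    E            : (N, N) OD proportions e_ij
    p_u, p_l, p_o: (N, N) optimal passenger splits per edge
    a_u, a_l     : (N,) optimal driver allocations per node
    """
    # Available passenger and driver flows, equations (9)-(10).
    P_u = Pi_P[:, None] * E * p_u
    P_l = Pi_P[:, None] * E * p_l
    P_o = Pi_P[:, None] * E * p_o
    D_u = Pi_D[:, None] * E * a_u[:, None]
    D_l = Pi_D[:, None] * E * a_l[:, None]

    # Realized flows, equations (7)-(8): transit is unconstrained; platform flow
    # is limited by whichever of passenger or driver flow is scarcer.
    F_u, F_l, F_o = np.minimum(P_u, D_u), np.minimum(P_l, D_l), P_o

    # Net in-flow minus out-flow at each node, equations (5)-(6).
    tot_P = F_u + F_l + F_o
    tot_D = F_u + F_l
    dPi_P = tot_P.sum(axis=0) - tot_P.sum(axis=1)
    dPi_D = tot_D.sum(axis=0) - tot_D.sum(axis=1)

    return Pi_P + dt * dPi_P, Pi_D + dt * dPi_D
```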

III-B Temporal multi OD mathematical program network

The extension from a single-shot single OD setting to a temporal multi OD graph setting changes the underlying mathematical program network significantly. In particular, the corresponding Extended MPN (eMPN) must have multiple nodes to accommodate different OD pairs and it must also be temporal in nature. The key changes for different actors are:

  1. The passengers' nodes are now $N^{2}-N$ in number, as the passengers optimize for each edge (except self-edges). The individual cost function for each node is the same as (4), with a key difference being that the wait-time multiplier $\lambda$ differs at each node depending upon the current passenger and driver population at that node. This emphasizes the fact that passengers (say, using $U$) are more sensitive to wait-times when there are fewer drivers around (even if they all drive for $U$); a small numerical sketch of this per-edge cost follows this list. The eMPN node $P$ then changes to:

    $$P_{ij}(t) := \left\{\begin{aligned} f&:p^{u}_{ij}(t)\left(d_{ij}r^{u}_{ij}(t)+\lambda_{i}(t)\frac{p^{u}_{ij}(t)}{a^{u}_{i}(t)}\right)+p^{l}_{ij}(t)\left(d_{ij}r^{l}_{ij}(t)+\lambda_{i}(t)\frac{p^{l}_{ij}(t)}{a^{l}_{i}(t)}\right)+p^{o}_{ij}(t)\left(d_{ij}r^{o}+\lambda_{i}(t)p^{o}_{ij}(t)\right)\\ \xi&:0\leq p^{u/l/o}_{ij}\leq 1 ~\cap~ p^{u}_{ij}(t)+p^{l}_{ij}(t)+p^{o}_{ij}(t)=1 ~\cap~ \lambda_{i}(t)=\lambda\frac{\Pi^{P}_{i}(t)}{\Pi^{D}_{i}(t)}\\ J&:\{p^{u},p^{l},p^{o}\}_{ij}(t)\end{aligned}\right\}\quad(11)$$
  2. There are $N$ eMPN nodes for drivers. A major difference in these new nodes is that we eschew the matching constraint $a^{u}+a^{l}\leq p^{u}+p^{l}$, as it is already modeled by the flow constraint of equation (8). It is shown in [9] that drivers prefer to participate fully in the ridesharing market as long as they can break even. Therefore, we add a total-covering constraint of the form $\forall i,~a^{u}_{i}(t)+a^{l}_{i}(t)=1$. The eMPN driver nodes then become (the objective function is in bold because this is a maximizer):

    $$D_{i}(t) := \left\{\begin{aligned} f&:\boldsymbol{\sum_{j\in[N]}{d_{ij}p^{u}_{ij}(t)(c^{u}_{ij}(t)-g)+d_{ij}p^{l}_{ij}(t)(c^{l}_{ij}(t)-g)}}\\ \xi&:0\leq a^{u/l}_{i}(t)\leq 1 ~\cap~ a^{u}_{i}(t)+a^{l}_{i}(t)=1\\ J&:\{a^{u},a^{l}\}_{i}(t)\end{aligned}\right\}\quad(12)$$
  3. The platforms' profit functions are now dependent upon the flows (equation (8)) in the graph. The relevant eMPN nodes are specified as:

    $$\{U/L\}(t) := \left\{\begin{aligned} f&:\boldsymbol{\sum_{(i,j)\in E}{d_{ij}\mathcal{F}^{u/l}_{ij}(t)\left(r^{u/l}_{ij}(t)-c^{u/l}_{ij}(t)\right)}}\\ \xi&:r^{u/l}_{ij}(t), c^{u/l}_{ij}(t)\geq 0\\ J&:\{r^{u/l},c^{u/l}\}_{ij}\end{aligned}\right\}\quad(13)$$
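As a small numerical illustration of the per-node wait multiplier $\lambda_{i}(t)$ and the passenger cost in (11), the following sketch evaluates the cost of one edge's passenger split; all values are made up for demonstration only.

```python
# Illustrative evaluation of the passenger node cost (11) for a single edge (i, j).
lam_base = 2.0
Pi_P_i, Pi_D_i = 2000.0, 500.0          # populations at origin node i (made-up values)
lam_i = lam_base * Pi_P_i / Pi_D_i      # dynamic wait-cost multiplier lambda_i(t)

d_ij, r_u, r_l, r_o = 5.0, 8.0, 8.5, 10.0
a_u_i, a_l_i = 0.6, 0.4                 # driver allocation at node i

def edge_cost(p_u, p_l, p_o):
    """Passenger cost on edge (i, j) as in equation (11)."""
    return (p_u * (d_ij * r_u + lam_i * p_u / a_u_i)
            + p_l * (d_ij * r_l + lam_i * p_l / a_l_i)
            + p_o * (d_ij * r_o + lam_i * p_o))

print(edge_cost(0.4, 0.3, 0.3))
```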

The final eMPN is shown in figure 3.

Refer to caption
Figure 3: eMPN for temporal multi-OD graphs. This is an extension of the MPN depicted in figure 1. There are $N$ nodes for drivers and $N^{2}-N$ nodes for passengers.

III-C Optimum Response and Optimization Objective

At each instant, after the platforms set the wages and rates for each edge of the graph, an optimum self-allocation of the drivers and passengers takes place. For a graph with two nodes and a single edge between them, results from [9] provide the theoretical optimum allocation. However, our formulation differs from [9] in that the wait costs are dynamic across the edges and the driver allocation is per-node instead of per-edge. Therefore, we take the following approach to determine the optimum driver and passenger allocations at a future time $t+\partial t$:

  1. The platforms choose $r^{u/l}_{ij}(t+\partial t), c^{u/l}_{ij}(t+\partial t)$.

  2. With
$$A^{D}(t) := \{a^{u}_{1}(t),\ldots,a^{u}_{N}(t),a^{l}_{1}(t),\ldots,a^{l}_{N}(t)\}$$
being the optimum allocation at time $t$, the drivers choose $a_{N}$ candidate allocations $A^{D}_{1}(t+\partial t), A^{D}_{2}(t+\partial t),\ldots$ such that $A^{D}_{i}(t+\partial t)=\mathrm{clip}(A^{D}(t)+\Delta A^{D}_{i})$, where the clip function clips the elements of each array to lie between 0 and 1 and $\Delta A^{D}_{i}$ is a vector of perturbations defined as
$$\Delta A^{D}_{i}=\{\delta a^{u}_{1},\delta a^{u}_{2},\ldots,\delta a^{u}_{N},1-\delta a^{u}_{1},\ldots,1-\delta a^{u}_{N}\}$$
where each $\delta a^{u}_{i}\sim\mathcal{U}\left(-\delta a,+\delta a\right)$. Informally, $\delta a$ defines the allowed width of the perturbation, which, in turn, decides how much the driver market may adjust in the next time-step.

  3. For each candidate allocation, the optimum passenger allocation for each edge, with the per-node dynamic wait-cost multiplier $\lambda_{i}(t)$, is calculated according to the best response function of [9], and the optimum allocations ${p^{u/l/o}}^{*}_{ij\in E}(t+\partial t)$ are obtained.

  4. Among all candidate driver allocations and their optimum passenger responses, the one that maximizes the sum of drivers' profits across all nodes is chosen as the approximately optimal driver allocation, and all ${a^{u/l}}^{*}_{i\in[N]}(t+\partial t)$ are obtained, as sketched below. This gradient-free approach to obtaining an approximate solution for a sequential optimization problem is in line with what is proposed in [25]. All driver nodes are in Nash equilibrium with each other, and maximizing the sum of their utilities, i.e. finding a Pareto-optimal solution, may not always yield the true Nash solution, but we choose this method to simplify the computation.
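A minimal sketch of this candidate-and-select search is given below. The passenger best response of [9] and the driver profit evaluation are abstracted as placeholder callables (`passenger_best_response`, `driver_profit`); for simplicity the sketch perturbs only the allocation to $U$ and recovers the allocation to $L$ from the total-covering constraint, which is one possible reading of the perturbation scheme rather than the exact implementation.

```python
import numpy as np

def propose_driver_allocations(a_u_prev, n_candidates, delta_a):
    """Step 2: perturb the previous allocation and clip to [0, 1].

    a_u_prev : (N,) previous per-node allocation to platform U; the allocation
               to L follows from the total-covering constraint a_l = 1 - a_u.
    """
    N = a_u_prev.shape[0]
    perturb = np.random.uniform(-delta_a, delta_a, size=(n_candidates, N))
    return np.clip(a_u_prev[None, :] + perturb, 0.0, 1.0)   # (n_candidates, N)

def select_allocation(a_u_candidates, passenger_best_response, driver_profit):
    """Steps 3-4: keep the candidate maximizing the summed driver profit."""
    best_a_u, best_profit = None, -np.inf
    for a_u in a_u_candidates:
        p_star = passenger_best_response(a_u)            # optimal passenger splits per edge
        profit = driver_profit(a_u, 1.0 - a_u, p_star)   # sum of per-node driver profits
        if profit > best_profit:
            best_a_u, best_profit = a_u, profit
    return best_a_u
```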

It is evident that the hyperparameters $a_{N}$ and, more importantly, $\delta a$ determine how the supply market (drivers) responds to exogenous changes. In particular, we look into the following two cases:

III-C1 Responsive supply market

In this case, we set $\delta a$ to a high value (say 1), which causes drivers to rapidly adjust their allocation across the nodes in response to the rate/wage fluctuations of the platforms.

III-C2 Lagging supply market

In this case, we set $\delta a$ to a low value (say 0.05). This causes a slow change in the driver allocation and can be thought of as a measure of intrinsic inertia whereby drivers prefer to keep driving for the platform they have been driving for all along. This is a reasonable stylization of the real-world market.

Refer to caption
(a) Profit at the end of the episode for the responsive market ($\delta a=1$)
Refer to caption
(b) Profit at the end of the episode for the lagging market ($\delta a=0.05$)
Figure 4: Exponential Moving Average (EMA) smoothed instantaneous profits at the end of each episode in both cases. The profits converge as the models better learn the market response. In the responsive supply case, competition drives down the profits for both platforms but in the lagging supply case, algorithms learn to extract consistent profits by colluding.

III-D Optimization Objective

In the extended rideshare model, the overarching goal of the platforms is to maximize their running profit. In other words, both platforms seek to maximize the following objective:

$$U^{*} := \max_{r^{u}_{ij}(t),\,c^{u}_{ij}(t)} \int_{t=0}^{T}{U(t)\,\partial t}\quad(14)$$
$$L^{*} := \max_{r^{l}_{ij}(t),\,c^{l}_{ij}(t)} \int_{t=0}^{T}{L(t)\,\partial t}\quad(15)$$

where $U(t), L(t)$ are as defined in equation (13). In practice, we discretize the model using a small time-step $\Delta t=0.01$ for computational purposes. To solve this optimization problem, we use the algorithm described below.
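Under this discretization, the integrals in (14)-(15) are approximated by finite sums; for example, for platform $U$ (the analogous expression holds for $L$):

$$U^{*} \approx \max_{r^{u}_{ij},\,c^{u}_{ij}} \sum_{k=0}^{\lfloor T/\Delta t\rfloor} U(k\Delta t)\,\Delta t$$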

III-E Proximal Policy Optimization (PPO) for Multi-Agent Reinforcement Learning

PPO is a state-of-the-art on-policy policy-gradient reinforcement learning algorithm that uses a surrogate objective to provide the flexibility of trust-region policy gradient methods while still being easy to implement. Due to its popularity, PPO has been used in domains such as robotics, trading, aviation, autonomous driving, cloud application management, etc. [26, 27, 28, 29, 30]. In this paper, we refrain from discussing the exact mechanisms of the algorithm and instead refer our readers to the seminal work that first introduced it [10] for additional details. We now delineate our usage of PPO for solving the discretized version of the aforementioned optimization problem.

We consider a repeated game over time steps $t=0,1,2,\ldots$. At each time step $t$, the market state $S_{t}$ is given by the distribution and allocation of drivers and passengers across the nodes and edges of the OD graph:

$$S_{t} = \left[\{\Pi^{P}\}_{i}, \{\Pi^{D}\}_{i}, \{a^{u}\}_{i}, \{a^{l}\}_{i}, \{p^{u}\}_{ij}, \{p^{l}\}_{ij}, \{p^{o}\}_{ij}\right] \quad \forall i\in[N],~(i,j)\in E$$

We implement two identical PPO-based reinforcement learning algorithms, $PPO^{U}$ and $PPO^{L}$, used respectively by platforms $U$ and $L$ in a decentralized fashion. These algorithms do not explicitly communicate with each other. In fact, they do not even have knowledge of the competitor's price at any moment. They merely utilize the observed market allocation state $S_{t}$ to produce actions $\mathcal{A}^{U/L}_{t}=[r^{u/l}_{ij}(t+1), c^{u/l}_{ij}(t+1)]$ in order to maximize the corresponding discounted rewards at each time step. The decisions are made simultaneously, and the obtained platform-specific rates and commissions are used to induce the market response described in subsection III-C which, along with the dynamics in equations (5) and (6), constructs the state $S_{t+1}$. The process continues for a defined number of steps, i.e. the episode length. At the end of each episode, the underlying policies of both agents are trained using the observed action/reward pairs. Since the transition depends upon both algorithms, which are continually changing, this is a non-stationary Markov decision process, and hence there are no theoretical guarantees of convergence.
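A schematic of this decentralized interaction loop is sketched below. The environment and agents are stand-in stubs (all class and method names are ours, not the repository's API), intended only to show the structure of the repeated game: simultaneous pricing, a shared market state, and end-of-episode policy updates.

```python
import numpy as np

N = 2                          # OD nodes
OBS_DIM = 3 * N**2 + N         # state-space dimension quoted in Section IV
ACT_DIM = 2 * N**2 - 2 * N     # per-platform action-space dimension

class StubAgent:
    def act(self, obs):
        # Placeholder for the PPO policy: random rates/commissions in [5, 20].
        return np.random.uniform(5, 20, size=ACT_DIM)
    def store(self, obs, action, reward):
        pass                   # a real agent would buffer the trajectory here
    def update(self):
        pass                   # a real agent would run the PPO update here

class StubMarket:
    def reset(self):
        return np.zeros(OBS_DIM)
    def step(self, actions):
        # Placeholder for the market best response and dynamics (5)-(6).
        return np.zeros(OBS_DIM), {"U": 0.0, "L": 0.0}

env, agents = StubMarket(), {"U": StubAgent(), "L": StubAgent()}
for epoch in range(2):                     # 500 epochs in the actual experiment
    obs = env.reset()
    for t in range(2048):                  # episode length used in the paper
        acts = {k: a.act(obs) for k, a in agents.items()}   # simultaneous pricing
        next_obs, rewards = env.step(acts)
        for k, a in agents.items():
            a.store(obs, acts[k], rewards[k])
        obs = next_obs
    for a in agents.values():
        a.update()                         # end-of-episode policy training
```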

IV Experiment

As the number of OD nodes $N$ increases, the state space becomes $3N^{2}+N$-dimensional and the action space for each platform becomes $2N^{2}-2N$-dimensional. Since we are dealing with a non-stationary Markov decision process, to facilitate convergence and obtain a proof of concept, we work with a small OD graph with $N=2$. The relevant adjacency matrix, the cost matrix, and the initial driver/passenger distributions are given by:

$$OD=\begin{pmatrix}0&0.9\\ 0.2&0\end{pmatrix},\quad \mathbb{D}=\begin{pmatrix}0&5\\ 2&0\end{pmatrix}$$
$$\Pi^{D}(0)=\begin{pmatrix}500\\ 1000\end{pmatrix},\quad \Pi^{P}(0)=\begin{pmatrix}2000\\ 3000\end{pmatrix}$$

We set the gas cost to $g=5$, the base wait-time multiplier to $\lambda=2$, and the public transit per-mile cost to $r^{o}=10$. We also limit the rates and commissions to lie in $[5,20]$ to promote convergence, and fix the number of candidate driver allocations per step to $a_{N}=10$. The initial allocations are randomly initialized as ${a^{u}}_{i}, {a^{l}}_{i}\sim\mathcal{N}(0.5, 0.01)$ and $p^{u}_{ij}, p^{l}_{ij}, p^{o}_{ij}\sim\mathcal{N}(\frac{1}{3}, 0.01)$. All values are appropriately clipped to maintain the relevant constraints. We then train our simulation for both the responsive and the lagging supply market with, respectively, $\delta a=\{1, 0.05\}$, using an episode length of $2048$. At the end of each episode, the models are trained via PPO and the state is reinitialized as discussed.
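For reference, this initialization can be written as a short sketch. The renormalization steps are our own way of enforcing the stated constraints; the exact procedure in the released code may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2

a_u = rng.normal(0.5, 0.01, size=N)
a_l = rng.normal(0.5, 0.01, size=N)
a_u, a_l = a_u / (a_u + a_l), a_l / (a_u + a_l)                  # enforce a_u + a_l = 1

p = np.clip(rng.normal(1 / 3, 0.01, size=(3, N, N)), 0.0, 1.0)   # p_u, p_l, p_o per edge
p /= p.sum(axis=0, keepdims=True)                                # enforce p_u + p_l + p_o = 1

Pi_D0 = np.array([500.0, 1000.0])
Pi_P0 = np.array([2000.0, 3000.0])
g, lam, r_o = 5.0, 2.0, 10.0
rate_bounds, a_N = (5.0, 20.0), 10
delta_a = {"responsive": 1.0, "lagging": 0.05}
episode_length = 2048
```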

Refer to caption
(a) Rates and commissions for L on edge $0 \rightarrow 1$
Refer to caption
(b) Rates and commissions for L on edge $1 \rightarrow 0$
Refer to caption
(c) Rates and commissions for U on edge $0 \rightarrow 1$
Refer to caption
(d) Rates and commissions for U on edge $1 \rightarrow 0$
Figure 5: EMA smoothed rates and commissions for a responsive market. The competition on the supply side drives the commission towards the rate, reducing platform profits.
Refer to caption
(a) Rates and commissions for L on edge $0 \rightarrow 1$
Refer to caption
(b) Rates and commissions for L on edge $1 \rightarrow 0$
Refer to caption
(c) Rates and commissions for U on edge $0 \rightarrow 1$
Refer to caption
(d) Rates and commissions for U on edge $1 \rightarrow 0$
Figure 6: EMA smoothed rates and commissions for a lagging market. There are signs of collusion on the supply side (low driver commissions) leading to increased profit margins.

We continue this process for 500 epochs. Our market model is implemented as a PettingZoo environment [31], and we base our decentralized learning algorithm on an openly available PPO implementation [32]. The code for the paper, including all configuration parameters for PPO, can be found at https://github.com/PraveshKoirala/RideshareMARL.

V Results and Discussion

All simulations are run on a machine with a 12th Gen Intel(R) Core(TM) i7-12700 2.10 GHz processor and 32 GB of RAM; no GPU was used. Simulating both the responsive and the lagging market took approximately 27 minutes. The results of the simulation are depicted in figures 4, 5, and 6. All figures have been smoothed using an Exponential Moving Average with the smoothing parameter $\alpha$ set to 0.5 in order to facilitate readability (original values are still visible). As can be observed in figure 4, the instantaneous profits for the responsive market eventually die out toward the end of the episode, in contrast to the profits for the lagging market. While these profits are governed not just by the rates and commissions of the platforms but also by the available passenger and driver distributions and the resulting flows (equation (13)), we theorize that the responsiveness of the drivers contributes to the loss of profit, as the platforms compete aggressively on the supply side by increasing commissions in addition to the existing competition for passengers. This is substantiated by figure 5, where it can be seen that the commission payout toward the end of each episode converges toward the charged rate, signifying competition. In contrast, in the lagging market (figure 6), the commissions do not significantly increase and instead steadily converge toward the minimum driver cost ($g=5$), showing that the platforms opt to collude, rather than compete, on the supply side, earning supracompetitive profits (figure 4).

VI Conclusion

The obtained results have significant implications for the literature on algorithmic collusion. First and foremost, they establish that modern algorithms like PPO have the potential for emergent collusive behavior in a complex platform market. Additionally, these results show that even when the hyperparameters of a learning algorithm are held constant, certain market characteristics like responsiveness can either induce competition or encourage collusion. This is an interesting result for concerned regulatory bodies, as it suggests that markets with high inertia (i.e. slow response) should be prioritized when investigating potential algorithmic collusion. Lastly, this establishes that emergent collusive behavior can negatively impact certain market segments (like drivers) more than others. In this particular case, this result helps explain why gig workers are getting paid less despite their indispensability in the gig economy. Future work in this area could include a large-scale simulation of this (or a related) model to glean further insights into the role of market characteristics in facilitating tacit collusion among intelligent pricing agents.

On a final note, we want to stress that this work does not accuse any real-world platforms or their pricing agents of collusion. It merely provides results for a stylized model.

References

  • [1] E. Garbarino and O. F. Lee, “Dynamic pricing in internet retail: effects on consumer trust,” Psychology & Marketing, vol. 20, no. 6, pp. 495–513, 2003.
  • [2] M. K. Chen and M. Sheldon, “Dynamic pricing in a labor market: Surge pricing and flexible work on the uber platform.” Ec, vol. 16, p. 455, 2016.
  • [3] BBC News, “Wendy’s denies plans for surge pricing after backlash,” https://www.bbc.com/news/business-68427039, 2024, accessed: 2024-04-29.
  • [4] C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature machine intelligence, vol. 1, no. 5, pp. 206–215, 2019.
  • [5] L. Waltman and U. Kaymak, “Q-learning agents in a cournot oligopoly model,” Journal of Economic Dynamics and Control, vol. 32, no. 10, pp. 3275–3293, 2008.
  • [6] Z. Y. Brown and A. MacKay, “Competition in pricing algorithms,” American Economic Journal: Microeconomics, vol. 15, no. 2, pp. 109–156, 2023.
  • [7] S. Assad, R. Clark, D. Ershov, and L. Xu, “Algorithmic pricing and competition: Empirical evidence from the german retail gasoline market,” Journal of Political Economy, vol. 132, no. 3, pp. 000–000, 2024.
  • [8] F. Laine, “Mathematical program networks,” arXiv preprint arXiv:2404.03767, 2024.
  • [9] P. Koirala and F. Laine, “Decreasing wages in gig economy: A game theoretic explanation using mathematical program networks,” 2024.
  • [10] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [11] S. Huck, H.-T. Normann, and J. Oechssler, “Through trial and error to collusion,” International Economic Review, vol. 45, no. 1, pp. 205–224, 2004.
  • [12] B. Salcedo, “Pricing algorithms and tacit collusion,” Manuscript, Pennsylvania State University, 2015.
  • [13] E. Calvano, G. Calzolari, V. Denicolo, and S. Pastorello, “Artificial intelligence, algorithmic pricing, and collusion,” American Economic Review, vol. 110, no. 10, pp. 3267–3297, 2020.
  • [14] J. Miklós-Thal and C. Tucker, “Collusion by algorithm: Does better demand prediction facilitate coordination between sellers?” Management Science, vol. 65, no. 4, pp. 1552–1561, 2019.
  • [15] T. Klein, “Autonomous algorithmic collusion: Q-learning under sequential pricing,” The RAND Journal of Economics, vol. 52, no. 3, pp. 538–558, 2021.
  • [16] I. Abada and X. Lambin, “Artificial intelligence: Can seemingly collusive outcomes be avoided?” Management Science, vol. 69, no. 9, pp. 5042–5065, 2023.
  • [17] A. V. den Boer, J. M. Meylahn, and M. P. Schinkel, “Artificial collusion: Examining supracompetitive pricing by q-learning algorithms,” Amsterdam Law School Research Paper, no. 2022-25, 2022.
  • [18] J. M. Sanchez-Cartas and E. Katsamakas, “Artificial intelligence, algorithmic competition and market structures,” IEEE Access, vol. 10, pp. 10 575–10 584, 2022.
  • [19] M. Hettich, “Algorithmic collusion: Insights from deep learning,” Available at SSRN 3785966, 2021.
  • [20] Y. Wang, “Will algorithms learn to collude?” 2022.
  • [21] A. Nikzad, “Thickness and competition in ride-sharing markets.” [Online]. Available: https://papers.ssrn.com/abstract=3065672
  • [22] F. Bernstein, G. A. DeCroix, and N. B. Keskin, “Competition between two-sided platforms under demand and supply congestion effects,” Manufacturing & Service Operations Management, vol. 23, no. 5, pp. 1043–1061, 2021.
  • [23] K. A. Bryan and J. S. Gans, “A theory of multihoming in rideshare competition,” Journal of Economics & Management Strategy, vol. 28, no. 1, pp. 89–96, 2019.
  • [24] J. Bai and C. S. Tang, “Can two competing on-demand service platforms be profitable?” International Journal of Production Economics, vol. 250, p. 108672, 2022.
  • [25] P. Koirala and F. Laine, “Monte carlo optimization for solving multilevel stackelberg games,” arXiv preprint arXiv:2312.03282, 2023.
  • [26] G. C. Lopes, M. Ferreira, A. da Silva Simões, and E. L. Colombini, “Intelligent control of a quadrotor with proximal policy optimization reinforcement learning,” in 2018 Latin American Robotic Symposium, 2018 Brazilian Symposium on Robotics (SBR) and 2018 Workshop on Robotics in Education (WRE).   IEEE, 2018, pp. 503–508.
  • [27] W. Funika, P. Koperek, and J. Kitowski, “Automatic management of cloud applications with use of proximal policy optimization,” in International Conference on Computational Science.   Springer, 2020, pp. 73–87.
  • [28] E. Bøhn, E. M. Coates, S. Moe, and T. A. Johansen, “Deep reinforcement learning attitude control of fixed-wing uavs using proximal policy optimization,” in 2019 international conference on unmanned aircraft systems (ICUAS).   IEEE, 2019, pp. 523–533.
  • [29] F. Ye, X. Cheng, P. Wang, C.-Y. Chan, and J. Zhang, “Automated lane change strategy using proximal policy optimization-based deep reinforcement learning,” in 2020 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2020, pp. 1746–1752.
  • [30] S. Lin and P. A. Beling, “An end-to-end optimal trade execution framework based on proximal policy optimization,” in Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, 2021, pp. 4548–4554.
  • [31] J. Terry, B. Black, N. Grammel, M. Jayakumar, A. Hari, R. Sullivan, L. S. Santos, C. Dieffendahl, C. Horsch, R. Perez-Vicente, et al., “Pettingzoo: Gym for multi-agent reinforcement learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 15 032–15 043, 2021.
  • [32] S. Kim. (2022) Mujoco-pytorch: Ppo, ddpg, sac, implementation on mujuco environment. [Online]. Available: https://github.com/seolhokim/Mujoco-Pytorch