
The Problem of Social Cost in Multi-Agent General Reinforcement Learning:
Survey and Synthesis

Kee Siong Ng    Samuel Yang-Zhao    Timothy Cadogan-Cowper

1 Introduction

The AI safety literature is full of examples of powerful AI agents that, in blindly pursuing a specific and usually narrow objective, end up causing unacceptable collateral damage to others, including destroying humankind and the world in extreme cases; see, for example, [111, 18]. Many of these examples are effectively variations of the tragedy of the commons phenomenon, which has been studied extensively in economics [59, 100, 35]. The tragedy of the commons typically occurs because of an externality that arises when a utility-maximising economic agent does not pay an appropriate cost when deciding to take an action that provides private benefit but incurs social harm to others, in particular those that are not a party to the decision-making process. Pollution, overfishing and traffic congestion are all classic examples of multi-agent economic systems that exhibit externalities.

In this paper, we consider the problem of social harms that can result from actions taken by learning and utility-maximising agents in a multi-agent environment. The problem of measuring social harms or impacts in such multi-agent settings, especially when the agents are artificial general intelligence (AGI) agents, was listed as an open problem in [44]. We provide a partial answer to that open problem in the form of market-based mechanisms to quantify and control the cost of such social harms in § 4. The key proposal is the protocol described in § 4.1 and the agent valuation functions stated in equations (14)-(15). The protocol captures many existing and well-studied special cases, which we list in § 4.2. Our proposed setup is more general than existing formulations of multi-agent reinforcement learning with mechanism design in two ways: (i) the underlying environment is a history-based general reinforcement learning environment as in AIXI [69]; (ii) the reinforcement-learning agents participating in the environment can have different horizons. To demonstrate the practicality of the proposed setup, we survey some possible learning algorithms in § 5 and present a few applications in § 6.

As the background literature is rich, both in breadth and depth, we have taken an expository approach in writing the paper and will introduce key topics and concepts as they are required in the development of the paper, starting with General Reinforcement Learning in § 2 and Mechanism Design in § 3. Readers who are familiar with these topics should be able to skip those sections without issue.

2 General Reinforcement Learning

2.1 Single Agent Setting

We consider finite action, observation and reward spaces denoted by \mathcal{A}, \mathcal{O}, \mathcal{R} respectively. The agent interacts with the environment in cycles: at any time, the agent chooses an action from \mathcal{A} and the environment returns an observation and reward from \mathcal{O} and \mathcal{R}. Frequently we will consider observations and rewards together, and will denote by x\in\mathcal{O}\times\mathcal{R} a percept from the percept space \mathcal{O}\times\mathcal{R}. We will denote a string x_{1}x_{2}\ldots x_{n} of length n by x_{1:n} and its length n-1 prefix by x_{<n}. An action, observation and reward from the same time step will be denoted aor_{t}. After interacting for n cycles, the interaction string a_{1}o_{1}r_{1}\ldots a_{n}o_{n}r_{n} (denoted aor_{1:n} from here on) is generated. We define the history space \mathcal{H} to be the set of interaction strings of any length.

Definition 1.

A history h is an element of the space \mathcal{H}\coloneqq(\mathcal{A}\times\mathcal{O}\times\mathcal{R})^{*}. The history at time t is denoted h_{t}=aor_{1:t}.

An environment is a process which generates the percepts given actions. It is defined to be a sequence of probability distributions over percept sequences conditioned on the actions taken by the agent.

Definition 2.

An environment \rho is a sequence of probability distributions \{\rho_{0},\rho_{1},\rho_{2},\ldots\}, where \rho_{n}:\mathcal{A}^{n}\to\mathcal{D}((\mathcal{O}\times\mathcal{R})^{n}), that satisfies

\forall a_{1:n}\;\forall or_{<n}\quad\rho_{n-1}(or_{<n}\,|\,a_{<n})=\sum_{or_{n}\in\mathcal{O}\times\mathcal{R}}\rho_{n}(or_{1:n}\,|\,a_{1:n}). (1)

In the base case, we have \rho_{0}(\epsilon\,|\,\epsilon)=1.

Equation (1) captures the natural constraint that actions in the future do not affect past percepts and is known as the chronological condition [67]. We will drop the subscript on \rho_{n} when the context is clear.

The predictive probability of the next percept given the history and a current action is given by

\rho(or_{n}\,|\,aor_{<n},a_{n})=\rho(or_{n}\,|\,h_{n-1}a_{n})\coloneqq\frac{\rho(or_{1:n}\,|\,a_{1:n})}{\rho(or_{<n}\,|\,a_{<n})}

for all aor_{1:n} such that \rho(or_{<n}\,|\,a_{<n})>0. This allows us to write

\rho(or_{1:n}\,|\,a_{1:n})=\rho(or_{1}\,|\,a_{1})\,\rho(or_{2}\,|\,aor_{1}a_{2})\cdots\rho(or_{n}\,|\,aor_{<n}a_{n}).

The general reinforcement learning problem is for the agent to learn a policy \pi:\mathcal{H}\to\mathcal{D}(\mathcal{A}) mapping histories to distributions over possible actions that will allow it to maximise its expected cumulative future rewards.
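To make the interaction protocol concrete, here is a minimal Python sketch of the agent-environment loop over histories; the `env` and `policy` callables and their signatures are illustrative assumptions, not part of any existing codebase.

```python
import random
from typing import Dict, List, Tuple

# A percept is an (observation, reward) pair; a history is an element of
# (A x O x R)^*, represented as a list of (action, (observation, reward)).
Percept = Tuple[int, float]
History = List[Tuple[int, Percept]]

def interact(env, policy, n_cycles: int) -> History:
    """Run `n_cycles` of the agent-environment interaction.

    `policy(history)` is assumed to return pi(. | h_{t-1}) as a dict mapping
    actions to probabilities; `env(history, action)` is assumed to sample a
    percept from rho(or_t | h_{t-1} a_t).
    """
    history: History = []
    for _ in range(n_cycles):
        dist: Dict[int, float] = policy(history)
        actions, probs = zip(*dist.items())
        a_t = random.choices(actions, weights=probs)[0]
        or_t = env(history, a_t)
        history.append((a_t, or_t))
    return history
```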

Definition 3.

Given an environment \mu, at time t, given history h_{t-1} and an action a_{t}, the expected future cumulative reward up to a finite horizon m\in\mathbb{N} is given by the value function

V_{t,m}^{\mu,*}(h_{t-1},a_{t})=\sum_{or_{t}}\mu(or_{t}\,|\,h_{t-1}a_{t})\max_{a_{t+1}}\sum_{or_{t+1}}\mu(or_{t+1}\,|\,h_{t-1}aor_{t}a_{t+1})\cdots\max_{a_{t+m}}\sum_{or_{t+m}}\mu(or_{t+m}\,|\,h_{t-1}aor_{t:t+m-1}a_{t+m})\left[\sum_{i=t}^{t+m}r_{i}\right], (2)

which can also be written in the recursive form

V_{t,m}^{\mu,*}(h_{t-1},a_{t})=\sum_{or_{t}}\mu(or_{t}\,|\,h_{t-1}a_{t})\left[r_{t}+\max_{a_{t+1}}V_{t+1,m}^{\mu,*}(h_{t-1}aor_{t},a_{t+1})\right]. (3)

If the environment \mu is known, the optimal action a_{t}^{*} to take at time t is given by

a_{t}^{*}=\arg\max_{a_{t}}V^{\mu,*}_{t,m}(h_{t-1},a_{t}).
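The expectimax recursion (3) and the optimal action can be written directly as a (very inefficient) brute-force computation over small finite spaces. The sketch below is for illustration only and assumes a known predictive model `mu(percept, history, action)` returning \mu(or_{t}\,|\,h_{t-1}a_{t}).

```python
def value(mu, actions, percepts, history, a_t, m):
    """V_{t,m}^{mu,*}(h_{t-1}, a_t) computed via the recursion in eq. (3);
    `m` is the number of remaining lookahead steps after the current one."""
    total = 0.0
    for or_t in percepts:
        prob = mu(or_t, history, a_t)
        if prob == 0.0:
            continue
        _, r_t = or_t                      # percept = (observation, reward)
        h_next = history + [(a_t, or_t)]
        future = 0.0 if m == 0 else max(
            value(mu, actions, percepts, h_next, a, m - 1) for a in actions)
        total += prob * (r_t + future)
    return total

def optimal_action(mu, actions, percepts, history, m):
    """argmax_{a_t} V_{t,m}^{mu,*}(h_{t-1}, a_t) when mu is known."""
    return max(actions, key=lambda a: value(mu, actions, percepts, history, a, m))
```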

In practice, \mu is of course unknown and needs to be learned from data and background knowledge. The AIXI agent [67] is a mathematical solution to the general reinforcement learning problem, obtained by estimating the unknown environment \mu in (2) using Solomonoff Induction [119]. At time t, the AIXI agent chooses action a^{*}_{t} according to

a^{*}_{t}=\arg\max_{a_{t}}\sum_{or_{t}}\ldots\max_{a_{t+m}}\sum_{or_{t+m}}\left[\sum_{j=t}^{t+m}r_{j}\right]\sum_{\rho\in\mathcal{M}_{U}}2^{-K(\rho)}\rho(or_{1:t+m}\,|\,a_{1:t+m}), (4)

where m\in\mathbb{N} is a finite lookahead horizon, \mathcal{M}_{U} is the set of all enumerable chronological semimeasures [67], \rho(or_{1:t+m}\,|\,a_{1:t+m}) is the probability of observing or_{1:t+m} given the action sequence a_{1:t+m}, and K(\rho) denotes the Kolmogorov complexity [85] of \rho. The performance of AIXI relies heavily on the next result.

Definition 4.

Given a countable model class \mathcal{M}\coloneqq\{\rho_{1},\rho_{2},\ldots\} and a prior weight w^{\rho}_{0}>0 for each \rho\in\mathcal{M} such that \sum_{\rho\in\mathcal{M}}w^{\rho}_{0}=1, the Bayesian mixture model with respect to \mathcal{M} is given by \xi_{\mathcal{M}}(or_{1:n}\,|\,a_{1:n})=\sum_{\rho\in\mathcal{M}}w^{\rho}_{0}\,\rho(or_{1:n}\,|\,a_{1:n}).

A Bayesian mixture model enjoys the property that it converges rapidly to the true environment if there exists a ‘good’ model in the model class.

Theorem 1.

[67] Let \mu be the true environment and \xi be the Bayesian mixture model over a model class \mathcal{M}. For all n\in\mathbb{N} and for all a_{1:n},

\sum_{j=1}^{n}\sum_{or_{1:j}}\mu(or_{<j}\,|\,a_{<j})\bigl(\mu(or_{j}\,|\,aor_{<j}a_{j})-\xi(or_{j}\,|\,aor_{<j}a_{j})\bigr)^{2}\leq\min_{\rho\in\mathcal{M}}\left\{\ln\frac{1}{w^{\rho}_{0}}+D_{n}(\mu\,||\,\rho)\right\}, (5)

where D_{n}(\mu\,||\,\rho) is the KL divergence of \mu(\cdot\,|\,a_{1:n}) and \rho(\cdot\,|\,a_{1:n}) defined by

D_{n}(\mu\,||\,\rho)\coloneqq\sum_{or_{1:n}}\mu(or_{1:n}\,|\,a_{1:n})\ln\frac{\mu(or_{1:n}\,|\,a_{1:n})}{\rho(or_{1:n}\,|\,a_{1:n})}.

To see the rapid convergence of \xi to \mu, take the limit n\to\infty on the l.h.s. of (5) and observe that, in the case where \min_{\rho\in\mathcal{M}}\sup_{n}D_{n}(\mu\,||\,\rho) is bounded, the l.h.s. can only be finite if \xi(or_{k}\,|\,aor_{<k}a_{k}) converges sufficiently fast to \mu(or_{k}\,|\,aor_{<k}a_{k}).

It can be shown that the model \xi_{U}:=\sum_{\rho\in\mathcal{M}_{U}}2^{-K(\rho)}\rho(or_{1:t}\,|\,a_{1:t}) in (4) is a Bayesian mixture model over the class of all computable functions, with 2^{-K(\rho)} as the prior for \rho. So assuming the environment is a computable function, AIXI’s model will converge rapidly to it.
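For a finite model class, Definition 4 and the posterior updating behind Theorem 1 can be realised in a few lines of code. The sketch below assumes each model exposes a hypothetical `predict(percept, history, action)` method returning \rho(or_{t}\,|\,h_{t-1}a_{t}).

```python
import math

class BayesMixture:
    """Bayesian mixture xi_M over a finite model class M = {rho_1, ..., rho_N}.

    Weights start at the prior w_0^rho and are updated to posterior weights
    as percepts arrive.
    """
    def __init__(self, models, priors):
        assert abs(sum(priors) - 1.0) < 1e-9
        self.models = list(models)
        self.log_w = [math.log(p) for p in priors]

    def predict(self, percept, history, action):
        # xi(or_t | h_{t-1} a_t) = sum_rho w^rho * rho(or_t | h_{t-1} a_t)
        return sum(math.exp(lw) * m.predict(percept, history, action)
                   for lw, m in zip(self.log_w, self.models))

    def update(self, percept, history, action):
        # Bayes rule: w^rho <- w^rho * rho(or_t | h_{t-1} a_t), then normalise.
        liks = [m.predict(percept, history, action) for m in self.models]
        new = [lw + math.log(max(l, 1e-300)) for lw, l in zip(self.log_w, liks)]
        norm = math.log(sum(math.exp(x) for x in new))
        self.log_w = [x - norm for x in new]
```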

AIXI is known to be incomputable. In practice, we will consider Bayesian reinforcement learning agents [53] that make use of Bayesian mixture models of different kinds and approximate the expectimax operation in (4) with algorithms like Monte Carlo Tree Search [79] or other reinforcement learning algorithms; examples of such agents include approximations of AIXI like [129, 139, 138]. These are all model-based techniques; model-free techniques like Temporal Difference learning [121] that exploit the functional form of (3) and universal function approximators like deep neural networks can also be considered.

2.2 Multi-Agent Setting

In the multi-agent setup, we assume there are k>1 agents, each with its own action and observation spaces \mathcal{A}_{i} and \mathcal{O}_{i}, i\in[1\ldots k]. At time t, the k agents take a joint action

\mathbf{a_{t}}=(a_{t,1},\ldots,a_{t,k})\in\mathcal{A}_{1}\times\cdots\times\mathcal{A}_{k}=\mathbf{\mathcal{A}}

and receive a joint percept

\mathbf{or_{t}}=(or_{t,1},\ldots,or_{t,k})\in(\mathcal{O}_{1}\times\mathcal{R})\times\cdots\times(\mathcal{O}_{k}\times\mathcal{R})=\mathbf{\mathcal{O}\times\mathcal{R}}.

The joint history up to time t is denoted \mathbf{h_{t}}=\mathbf{aor_{1:t}}=\mathbf{a_{1}}\mathbf{or_{1}}\mathbf{a_{2}}\mathbf{or_{2}}\ldots\mathbf{a_{t}}\mathbf{or_{t}}. We assume for now that every agent can see the entire joint history; this assumption will be reexamined in § 5.2.

Definition 5.

A multi-agent environment \varrho is a sequence of probability distributions \{\varrho_{0},\varrho_{1},\varrho_{2},\ldots\}, where \varrho_{n}:\mathbf{\mathcal{A}}^{n}\to\mathcal{D}(\mathbf{\mathcal{O}\times\mathcal{R}})^{n}, that satisfies

\forall\mathbf{a_{1:n}}\;\forall\mathbf{or_{<n}}\quad\varrho_{n-1}(\mathbf{or_{<n}}\,|\,\mathbf{a_{<n}})=\sum_{\mathbf{or_{n}}\in\mathbf{\mathcal{O}\times\mathcal{R}}}\varrho_{n}(\mathbf{or_{1:n}}\,|\,\mathbf{a_{1:n}}). (6)

In the base case, we have \varrho_{0}(\epsilon\,|\,\epsilon)=1.

Multi-agent environments can have different (non-exclusive) properties, some of which are listed here:

  1. Mutually exclusive actions, where only one of the actions in \mathbf{a_{t}} can be executed by the environment at time t;

  2. Non-mutually exclusive actions, where two or more individual actions in \mathbf{a_{t}} can be executed simultaneously by the environment at time t;

  3. Zero-sum rewards, where the agents’ rewards sum to 0 at every time step so they are competing against each other, like in [86];

  4. Identical rewards, where the agents get the exact same rewards at every time step so they are playing cooperatively with each other;

  5. Mixed-sum rewards, which cover all the settings that combine elements of both cooperation and competition;

  6. Existence of a common resource pool, which can be represented by an ‘agent’ with null action space and whose reward is a function of other agents’ consumption of the resource pool.

Comprehensive surveys of formalisations and some key challenges and results in a few of these topics can be found in [115, 142].

Each agent’s goal in a multi-agent environment is to learn the optimal policy to achieve its own maximum expected cumulative future rewards. The behaviour of AIXI agents in a multi-agent environment is an open problem because each AIXI agent assumes its environment is computable, and that assumption breaks down when the environment contains other AIXI agents [67]. We will consider primarily Bayesian reinforcement learning agents in this paper.

In addition to the performance of individual agents, we will also have occasion to consider aggregate functions of the vector of expected cumulative rewards achieved by all the agents in the system, where the exact aggregate function chosen is dependent on desired properties of the multi-agent system like market efficiency and fairness.

3 Mechanism Design

3.1 Tragedy of the Commons

The key to solving tragedy of the commons issues is to work out a way to ‘internalise’ the externality in the design of the multi-agent economic system of interest. There are two primary approaches: price regulation through a central authority, and a market-based cap-and-trade system. The former is sometimes referred to as a Pigouvian tax after [106, 105], and it requires a central authority to (i) have enough information to accurately determine the unit price of the externality or social harm; and (ii) enforce its payment by agents that cause the externality, thereby internalising it. In contrast, the cap-and-trade system is motivated by the idea of Coasean bargaining [31, 30], whereby the maximum amount of the externality or social harm allowed is capped through the issuance of a fixed number of permits, each of which allows an agent to produce a unit of externality, and the agents are allowed to determine for themselves whether to use their permits to produce externality, or trade the permits among themselves for profit. The idea is that the cap-and-trade system will allow the agents that are most efficient in generating private benefit while minimising social harm to win, because they can afford to pay a higher price for the permits. Indeed, the Coase ‘Theorem’ says that as long as the permits are completely allocated and there is no transaction cost involved in trading, the agents will collectively end up with a Pareto efficient solution. So a market made up of utility-maximising agents, under the right conditions, is capable of determining the right price for the externality; there is no need for a well-informed and powerful central authority to set the price.

In the following sections, we will look at some concrete protocols from the field of Mechanism Design for implementing Coasean bargaining and trading in multi-agent environments. In keeping with the intended spirit of [32], we will largely avoid the term externality from here onwards and favour, instead, the term ‘social harm’.

3.2 The VCG Mechanism

In a multi-agent environment, the different agents participating in it can be given different goals and preferences, either competing or collective, and the algorithms behind those agents can exhibit different behaviour, including differing abilities in learning and planning for the long term. Mechanism design [94] is the study of protocols that can take the usually dispersed and private information and preferences of multiple agents and aggregate them into an appropriate social choice, usually a decision among alternatives, that maximises the welfare of all involved.

Let \mathbb{A} be a set of alternatives for a set of k agents. The preference of agent i is given by a valuation function v_{i}:\mathbb{A}\to\mathbb{R}, where v_{i}(a) denotes the value that agent i assigns to alternative a being chosen. Here, v_{i}\in V_{i}, where V_{i}\subseteq\mathbb{R}^{\mathbb{A}} is the set of possible valuation functions for agent i. We will use the notation V_{-i}=V_{1}\times\cdots\times V_{i-1}\times V_{i+1}\times\cdots\times V_{k} in the following.

Definition 6.

A mechanism is a tuple (f,p_{1},\ldots,p_{k}) made up of a social choice function f:V_{1}\times\cdots\times V_{k}\to\mathbb{A} and payment functions p_{1},\ldots,p_{k}, where p_{i}:V_{1}\times\cdots\times V_{k}\to\mathbb{R} is the amount that agent i pays to the mechanism.

Given a mechanism (f,p_{1},\ldots,p_{k}) and k agents with valuation functions v_{1},\ldots,v_{k}, the utility of agent i from participating in the mechanism is given by

u_{i}(v_{1},\ldots,v_{k}):=v_{i}(f(v_{1},\ldots,v_{k}))-p_{i}(v_{1},\ldots,v_{k}). (7)
Definition 7.

A mechanism (f,p_{1},\ldots,p_{k}) is called incentive compatible if for every agent i with valuation function v_{i}\in V_{i}, for every v_{i}^{\prime}\in V_{i}, and every v_{-i}\in V_{-i}, we have

v_{i}(f(v_{i},v_{-i}))-p_{i}(v_{i},v_{-i})\geq v_{i}(f(v_{i}^{\prime},v_{-i}))-p_{i}(v_{i}^{\prime},v_{-i}). (8)

Thus, in an incentive compatible mechanism, each agent i would maximise its utility by being truthful in revealing its valuation function v_{i} to the mechanism, rather than needing to worry about obtaining an advantage by presenting a possibly false / misleading v_{i}^{\prime}.

Definition 8.

A mechanism (f,p_{1},\ldots,p_{k}) is individually rational if for every agent i with valuation function v_{i}\in V_{i} and every v_{-i}\in V_{-i}, we have

v_{i}(f(v_{i},v_{-i}))-p_{i}(v_{i},v_{-i})\geq 0. (9)

In other words, the utility of each agent is always non-negative, assuming the agent reports truthfully.

Definitions 6, 7 and 8 can be generalised to allow the social choice function f and the payment functions p_{i} to be randomised functions, in which case we will work with the expectation version of (7), (8) and (9).

Definition 9.

A mechanism (f,p_{1},\ldots,p_{k}) is called a Vickrey-Clarke-Groves (VCG) mechanism if

  1. f(v_{1},\ldots,v_{k})\in\arg\max_{a\in\mathbb{A}}\sum_{i}v_{i}(a); that is, the social choice function f maximises the social welfare, and

  2. there exist functions h_{1},\ldots,h_{k}, where h_{i}:V_{-i}\to\mathbb{R}, such that for all v_{1}\in V_{1},\ldots,v_{k}\in V_{k}, we have

     p_{i}(v_{1},\ldots,v_{k})=h_{i}(v_{-i})-\sum_{j\neq i}v_{j}(f(v_{1},\ldots,v_{k})).

Here is a classical result from mechanism design.

Theorem 2.

Every Vickrey-Clarke-Groves mechanism is incentive compatible.

What should the h_{i} functions in VCG mechanisms be? A good choice is the Clarke pivot rule.

Definition 10.

The Clarke pivot payment function for a VCG mechanism is given by h_{i}(v_{-i}):=\max_{b\in\mathbb{A}}\sum_{j\neq i}v_{j}(b) for agent i.

Under this choice of h_{i}, the payment for agent i is

p_{i}(v_{1},\ldots,v_{k})=\max_{b}\sum_{j\neq i}v_{j}(b)-\sum_{j\neq i}v_{j}(f(v_{1},\ldots,v_{k})),

which is the difference between the collective social welfare of the others with and without i’s participation in the system. So each agent makes a payment that internalises the exact social harm it causes the other agents. The utility of agent i is

u_{i}(v_{1},\ldots,v_{k})=\sum_{j}v_{j}(f(v_{1},\ldots,v_{k}))-\max_{b}\sum_{j\neq i}v_{j}(b).
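The social choice, Clarke pivot payments and resulting utilities can be computed directly from the reported valuations. The following is a straightforward Python transcription of Definitions 9 and 10 for a finite \mathbb{A}; `valuations[i]` is assumed to be a dict mapping alternatives to agent i's reported values.

```python
def vcg_clarke(valuations, alternatives):
    """VCG mechanism with the Clarke pivot rule (Definitions 9 and 10).

    Returns the chosen alternative, each agent's payment and each agent's
    resulting utility.
    """
    welfare = lambda a: sum(v[a] for v in valuations)
    choice = max(alternatives, key=welfare)

    payments, utilities = [], []
    for i, v_i in enumerate(valuations):
        others = lambda a: sum(v[a] for j, v in enumerate(valuations) if j != i)
        h_i = max(others(b) for b in alternatives)   # Clarke pivot h_i(v_{-i})
        p_i = h_i - others(choice)                   # payment of agent i
        payments.append(p_i)
        utilities.append(v_i[choice] - p_i)
    return choice, payments, utilities
```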
Theorem 3.

The VCG mechanism with the Clarke pivot payment function is individually rational if the agent valuation functions are all non-negative.

Example 1.

Consider an auction where \mathbb{A}=\{1,\ldots,k\} – so one and only one of the agents wins – and where, for agent i, v_{i}(i)=p_{i} and \forall j\neq i,\;v_{i}(j)=0. Vickrey’s second-price auction, in which each agent i bids the highest price p_{i} it is willing to pay for the auction item, the winner i^{*}=\arg\max_{j}p_{j} pays the second highest bid price p^{*}=\max_{j\neq i^{*}}p_{j}, and everyone else pays 0, is a VCG mechanism with the Clarke pivot payment function.
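As a quick sanity check, running the `vcg_clarke` sketch above on this auction (with illustrative bids of 10, 7 and 3) selects the highest bidder and charges it the second-highest bid, as the example claims.

```python
# Three bidders with bids 10, 7 and 3; alternative j means "agent j wins".
bids = [10.0, 7.0, 3.0]
valuations = [{j: (b if j == i else 0.0) for j in range(3)}
              for i, b in enumerate(bids)]
choice, payments, _ = vcg_clarke(valuations, alternatives=range(3))
print(choice, payments)   # winner 0 pays the second-highest bid 7; others pay 0
```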

Example 2.

Consider the design of a mechanism to allow two agents, a buyer B and a seller S, to engage in bilateral trade for a good owned by the seller. There are two possible outcomes: no-trade or trade, which we model numerically as \mathbb{A}=\{0,1\}. The buyer values the good at \theta_{B}\geq 0 so its valuation function is

v_{B}(d):=\mathit{if}\;d=1\;\mathit{then}\;\theta_{B}\;\mathit{else}\;0.

The seller values the good at \theta_{S}\geq 0 so its valuation function is

v_{S}(d):=\mathit{if}\;d=1\;\mathit{then}\;-\theta_{S}\;\mathit{else}\;0

because the seller loses the good in the case of a trade. If we use the VCG mechanism with the Clarke pivot payment function, we end up with the social choice

d^{*}=f(v_{B},v_{S})=\arg\max_{d\in\mathbb{A}}(v_{B}(d)+v_{S}(d))
=\arg\max_{d\in\mathbb{A}}\,(\mathit{if}\;d=1\;\mathit{then}\;\theta_{B}-\theta_{S}\;\mathit{else}\;0)
=\mathit{if}\;\theta_{B}-\theta_{S}>0\;\mathit{then}\;1\;\mathit{else}\;0,

which means there is a trade iff the buyer attaches a higher value to the good than the seller. Here are the payment functions:

p_{B}(v_{B},v_{S})=\max_{d\in\mathbb{A}}v_{S}(d)-v_{S}(d^{*})=\mathit{if}\;\theta_{B}-\theta_{S}>0\;\mathit{then}\;\theta_{S}\;\mathit{else}\;0
p_{S}(v_{B},v_{S})=\max_{d\in\mathbb{A}}v_{B}(d)-v_{B}(d^{*})=\mathit{if}\;\theta_{B}-\theta_{S}>0\;\mathit{then}\;0\;\mathit{else}\;\theta_{B}.

So the buyer pays the mechanism \theta_{S} if there is a trade and the seller pays the mechanism \theta_{B} if there is no trade. The latter is slightly odd and results in the problematic issue of the seller always having negative utility from participating in the mechanism:

u_{S}(v_{B},v_{S})=v_{B}(d^{*})+v_{S}(d^{*})-\max_{d\in\mathbb{A}}v_{B}(d)
=\mathit{if}\;\theta_{B}-\theta_{S}>0\;\mathit{then}\;-\theta_{S}\;\mathit{else}\;-\theta_{B}.

The problem comes down to the asymmetric position of the buyer and seller and the p_{i}(\cdot)\geq 0 condition enforced by the Clarke pivot payment function, where the seller is forced to make a payment to maintain its ownership of the good (no trade), even though the status quo is that the seller already owns the good. There are at least two solutions. The first solution is to insist that the buyer and seller pay nothing if there is no trade, so we end up with the constraints

p_{B}(v_{B},v_{S})=h_{B}(v_{S})-v_{S}(0)=0
p_{S}(v_{B},v_{S})=h_{S}(v_{B})-v_{B}(0)=0

that imply h_{B}(v_{S})=v_{S}(0) and h_{S}(v_{B})=v_{B}(0). In this scenario, when trade happens, we have

p_{B}(v_{B},v_{S})=h_{B}(v_{S})-v_{S}(1)=\theta_{S}
p_{S}(v_{B},v_{S})=h_{S}(v_{B})-v_{B}(1)=-\theta_{B},

which means the buyer pays the mechanism \theta_{S} and the seller is paid \theta_{B} by the mechanism, with both agents obtaining utility \theta_{B}-\theta_{S}>0. Note that the mechanism ends up having to subsidise the trade. The second solution is to remove the ownership asymmetry between the two agents, by first making the mechanism pay an amount \theta\geq\theta_{S} to the seller to transfer the good to the mechanism and then running the VCG mechanism with the Clarke pivot rule to determine the new owner of the good. Under this setup, the valuation function of the buyer stays the same, but the valuation function of the seller becomes

v_{S}(d)=\mathit{if}\;d=1\;\mathit{then}\;0\;\mathit{else}\;\theta_{S},

and we end up with the Vickrey second-price auction setup. The utility of the buyer stays the same, and that of the seller becomes

u_{S}(v_{B},v_{S})=\theta+\bigl(v_{S}(d^{*})+v_{B}(d^{*})-\max_{d\in\mathbb{A}}v_{B}(d)\bigr)
=\theta-\theta_{B}+(\mathit{if}\;\theta_{B}-\theta_{S}>0\;\mathit{then}\;\theta_{B}\;\mathit{else}\;\theta_{S})
=\mathit{if}\;\theta_{B}-\theta_{S}>0\;\mathit{then}\;\theta\;\mathit{else}\;\theta+\theta_{S}-\theta_{B}
\geq 0.

As with the first solution, the mechanism ends up with a negative value, which is the cost of subsidising the trade.
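Both the anomaly and the fix can be checked numerically with the `vcg_clarke` sketch from § 3.2; the values of \theta_{B}, \theta_{S} and \theta below are illustrative.

```python
# Plain Clarke pivot on bilateral trade (Example 2): the seller's utility is
# always negative.
theta_B, theta_S = 5.0, 3.0
v_buyer  = {0: 0.0, 1: theta_B}       # buyer gains theta_B on trade
v_seller = {0: 0.0, 1: -theta_S}      # seller loses the good on trade
d, pay, util = vcg_clarke([v_buyer, v_seller], alternatives=[0, 1])
print(d, util)                        # trade happens, seller utility = -theta_S

# Second solution: the mechanism first buys the good for theta >= theta_S,
# then runs a Vickrey-style auction between buyer and seller for ownership.
theta = theta_S
v_seller2 = {0: theta_S, 1: 0.0}      # seller now values keeping the good
d2, pay2, util2 = vcg_clarke([v_buyer, v_seller2], alternatives=[0, 1])
print(d2, theta + util2[1])           # seller's total utility is now >= 0
```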

3.3 The Exponential VCG Mechanism

We have shown in § 3.2 that VCG mechanisms are incentive compatible and individually rational, which means agents are incentivised to participate and be truthful. It turns out that VCG mechanisms can be made privacy-preserving too. The exponential mechanism [90], a key technique in differential privacy [39], has been shown in [64] to be a generalisation of the VCG mechanism that is differentially private, incentive compatible and nearly optimal for maximising social welfare. We now briefly describe this key result and furnish the proofs, which are rather instructive.

Definition 11.

A randomized algorithm \mathcal{M}:V_{1}\times\cdots\times V_{k}\to\mathbb{A} is (\epsilon,\delta)-differentially private if for any v\in V_{1}\times\cdots\times V_{k} and for any subset \Omega\subseteq\mathbb{A}

P(\mathcal{M}(v)\in\Omega)\leq e^{\epsilon}P(\mathcal{M}(v^{\prime})\in\Omega)+\delta,

for all v^{\prime} such that ||v-v^{\prime}||_{1}\leq 1 (i.e. there exists at most one i\in[k] such that v_{i}\neq v^{\prime}_{i}).

Definition 12.

Given a quality function q:V_{1}\times\cdots\times V_{k}\times\mathbb{A}\to\mathbb{R} and a v\in V_{1}\times\cdots\times V_{k}, the Exponential DP Mechanism \mathcal{M}_{q}^{\epsilon}(v) samples and outputs an element r\in\mathbb{A} with probability proportional to \exp(\frac{\epsilon}{2\Delta_{q}}q(v,r)), where

\Delta_{q}=\max_{r\in\mathbb{A}}\max_{v_{1},v_{2}:||v_{1}-v_{2}||_{1}\leq 1}|q(v_{1},r)-q(v_{2},r)|.
Theorem 4.

The Exponential DP Mechanism is (\epsilon,0)-differentially private.

Definition 13.

The Exponential VCG Mechanism is defined by (\mathcal{M}_{q}^{\epsilon},p_{1},\ldots,p_{k}) where

q(v,r)=\sum_{i}v_{i}(r)
p_{i}(v)=\mathop{\mathbb{E}}_{r\sim\mathcal{M}_{q}^{\epsilon}(v)}\left[-\sum_{j\neq i}v_{j}(r)\right]-\frac{2}{\epsilon}H(\mathcal{M}_{q}^{\epsilon}(v))+\frac{2}{\epsilon}\ln\left(\sum_{r\in\mathbb{A}}\exp\left(\frac{\epsilon}{2}\sum_{j\neq i}v_{j}(r)\right)\right)

and H(\cdot) is the Shannon entropy function.

Note that as \epsilon increases, \mathcal{M}_{q}^{\epsilon} will sample r^{*}=\arg\max_{r\in\mathbb{A}}\sum_{i}v_{i}(r) with probability rapidly approaching 1, and the payment function also satisfies the form given in Definition 9. In that sense, the exponential VCG mechanism can be considered a generalisation of the VCG mechanism.
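A direct transcription of Definition 13 for a finite \mathbb{A} looks as follows; it assumes each v_i maps alternatives into [0,1] so that \Delta_{q}=1, and it is a sketch rather than a production differentially-private implementation (in particular, no noise is added to the payments themselves).

```python
import math, random

def exponential_vcg(valuations, alternatives, eps):
    """Exponential VCG Mechanism (Definition 13) over a finite set of
    alternatives, assuming Delta_q = 1."""
    q = {r: sum(v[r] for v in valuations) for r in alternatives}
    weights = {r: math.exp(eps / 2.0 * q[r]) for r in alternatives}
    Z = sum(weights.values())
    probs = {r: w / Z for r, w in weights.items()}

    chosen = random.choices(list(probs), weights=list(probs.values()))[0]
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)

    payments = []
    for i in range(len(valuations)):
        others = lambda r: sum(v[r] for j, v in enumerate(valuations) if j != i)
        expected_others = sum(probs[r] * others(r) for r in alternatives)
        log_Z_minus_i = math.log(sum(math.exp(eps / 2.0 * others(r))
                                     for r in alternatives))
        # p_i(v) = E[-sum_{j!=i} v_j(r)] - (2/eps) H(M(v)) + (2/eps) ln Z_{-i}
        p_i = -expected_others - (2.0 / eps) * entropy + (2.0 / eps) * log_Z_minus_i
        payments.append(p_i)
    return chosen, payments
```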

Lemma 5.

Given \epsilon\in\mathbb{R} and valuation functions v=v_{1},\ldots,v_{k} where each v_{i}:\mathbb{A}\to[0,1], the Gibbs social welfare defined by

\mathbb{E}_{r\sim\xi}\left[\sum_{i}v_{i}(r)\right]+\frac{2}{\epsilon}H(\xi)

is maximised when \xi=\mathcal{M}^{\epsilon}_{q}(v) for q(v,r)=\sum_{i}v_{i}(r).

Proof.

The first term in the Gibbs social welfare can be rewritten as follows:

\sum_{r\in\mathbb{A}}\xi(r)q(v,r)
=\frac{2}{\epsilon}\sum_{r\in\mathbb{A}}\xi(r)\frac{\epsilon}{2}q(v,r)
=\frac{2}{\epsilon}\sum_{r\in\mathbb{A}}\xi(r)\ln\left(\exp\left(\frac{\epsilon q(v,r)}{2}\right)\right)
=\frac{2}{\epsilon}\sum_{r\in\mathbb{A}}\xi(r)\ln\left(\frac{\exp(\epsilon q(v,r)/2)}{\sum_{a\in\mathbb{A}}\exp(\epsilon q(v,a)/2)}\right)+\frac{2}{\epsilon}\ln\left(\sum_{a\in\mathbb{A}}\exp(\epsilon q(v,a)/2)\right). (10)

Adding (10) to the second, entropy term of the Gibbs social welfare and noting that \Delta_{q}=1, we get

\mathbb{E}_{r\sim\xi}\left[\sum_{i}v_{i}(r)\right]+\frac{2}{\epsilon}H(\xi)
=\frac{2}{\epsilon}\sum_{r\in\mathbb{A}}\xi(r)\ln(\mathcal{M}_{q}^{\epsilon}(v)(r))+\frac{2}{\epsilon}\ln\left(\sum_{a\in\mathbb{A}}\exp(\epsilon q(v,a)/2)\right)-\frac{2}{\epsilon}\sum_{r\in\mathbb{A}}\xi(r)\ln(\xi(r))
=-\frac{2}{\epsilon}D_{\mathit{KL}}(\xi\,||\,\mathcal{M}_{q}^{\epsilon}(v))+\frac{2}{\epsilon}\ln\left(\sum_{a\in\mathbb{A}}\exp(\epsilon q(v,a)/2)\right). (11)

By Gibbs’ inequality, (11) is maximised when \xi=\mathcal{M}_{q}^{\epsilon}(v). ∎

Theorem 6.

([64]) The Exponential VCG Mechanism is incentive compatible and individually rational.

Proof.

We first show the incentive compatibility property. Consider an agent i with valuation function v_{i} and fix the bids v_{-i} of the other agents. Let

h(\{v_{j}\}_{j})=\frac{2}{\epsilon}\ln\left(\sum_{r\in\mathbb{A}}\exp\left(\frac{\epsilon}{2}\sum_{j}v_{j}(r)\right)\right).

The expected utility to agent i when bidding b_{i} is

\mathop{\mathbb{E}}_{r\sim\mathcal{M}_{q}^{\epsilon}(b_{i},v_{-i})}[v_{i}(r)]-p_{i}(b_{i},v_{-i})
=\mathop{\mathbb{E}}_{r\sim\mathcal{M}_{q}^{\epsilon}(b_{i},v_{-i})}[v_{i}(r)]+\mathop{\mathbb{E}}_{r\sim\mathcal{M}_{q}^{\epsilon}(b_{i},v_{-i})}\left[\sum_{j\neq i}v_{j}(r)\right]+\frac{2}{\epsilon}H(\mathcal{M}_{q}^{\epsilon}(b_{i},v_{-i}))-h(v_{-i})
=\mathop{\mathbb{E}}_{r\sim\mathcal{M}_{q}^{\epsilon}(b_{i},v_{-i})}\left[\sum_{j}v_{j}(r)\right]+\frac{2}{\epsilon}H(\mathcal{M}_{q}^{\epsilon}(b_{i},v_{-i}))-h(v_{-i})
=-\frac{2}{\epsilon}D_{\mathit{KL}}(\mathcal{M}_{q}^{\epsilon}(b_{i},v_{-i})\,||\,\mathcal{M}_{q}^{\epsilon}(v_{i},v_{-i}))+h(v_{i},v_{-i})-h(v_{-i}), (12)

where the last step follows from (11), and the expression is maximised when b_{i}=v_{i}.

To show the individual rationality property, note that when b_{i}=v_{i}, the expression (12) reduces to h(v_{i},v_{-i})-h(v_{-i}), which is non-negative and equals zero when v_{i} is the zero function. ∎

As shown in [64], it is also advisable to add differential privacy noise to the payment functions, given that they contain information about all the agents’ valuation functions, which may need to be kept private.

4 The Social Cost of Actions

4.1 General Case

Let \mu be a multi-agent environment and assume there are k Bayesian reinforcement learning agents operating within it. These agents are concrete realisations of the concept of perfectly rational utility-maximising agents commonly assumed in economics. We have seen that, in the absence of some control mechanism, such multi-agent environments can exhibit bad equilibria. To avoid tragedy of the commons-type issues, we need to impose a cost on each agent’s actions, commensurate with the social harm it causes other agents with that action, and we will see in this section how augmenting a multi-agent environment with, for example, VCG mechanisms can address such issues.

Definition 14.

A multi-agent environment \phi controlled by a VCG mechanism M=(f,p_{1},\ldots,p_{k}) is a sequence of probability distributions M\triangleright\phi=\{\varphi_{0},\varphi_{1},\varphi_{2},\ldots\} such that each \varphi_{t}:(V_{1}\times\cdots\times V_{k})^{t}\to\mathcal{D}(\mathbf{\mathcal{O}\times\mathcal{R}})^{t} is defined by

\varphi_{t}(\mathbf{or^{\prime}_{1:t}}\,|\,\mathbf{v_{1:t}}):=\phi_{t}(\mathbf{or_{1:t}}\,|\,\mathbf{a^{*}_{1:t}}:=f(\mathbf{v_{1}})f(\mathbf{v_{2}})\cdots f(\mathbf{v_{t}})),

where \mathbf{v_{t}}=(v_{t,1},v_{t,2},\ldots,v_{t,k}) with v_{t,i}\in V_{i}\subseteq\mathbb{R}^{\mathcal{A}_{i}} being the valuation function of agent i at time t, \mathbb{A}\subseteq\mathbf{\mathcal{A}} is the space of actions that can be jointly taken by the agents, \mathbf{a^{*}_{1:t}}\in\mathbb{A}^{t}, and \mathbf{r^{\prime}_{t}}=(r^{\prime}_{t,1},\ldots,r^{\prime}_{t,k}) with r^{\prime}_{t,i}=(r_{t,i},p_{i}(v_{t,1},\ldots,v_{t,k})).

\mathbb{A} can be a strict subset of \mathbf{\mathcal{A}} if certain combinations of actions are impossible because some individual actions are mutually exclusive. Intuitively, an action has a social cost precisely in the scenarios where only one agent can execute the action, and the social cost of executing that action is the loss of value incurred when other agents are prevented from executing it.

Protocol for Controlled Multi-Agent Environment

Let \mathbf{h_{t-1}} be the joint history up to time t-1. At time t, we use the VCG mechanism M=(f,p_{1},\ldots,p_{k}) to determine the joint action the agents should take to maximise social welfare via

\mathbf{a^{*}_{t}}:=f(v_{t,1},\ldots,v_{t,k})=\arg\max_{\mathbf{a}\in\mathbb{A}}\sum_{i}v_{t,i}(\mathbf{a}). (13)

A joint percept \mathbf{or_{t}} is then sampled from \phi(\mathbf{or_{t}}\,|\,\mathbf{h_{t-1}}\mathbf{a^{*}_{t}}) and each agent i receives the percept or_{t,i} and pays the mechanism M the amount:

p_{i}(v_{t,1},\ldots,v_{t,k})=h_{i}(v_{t,-i})-\sum_{j\neq i}v_{t,j}(\mathbf{a_{t}^{*}}).

There is a degree of freedom in the choice of h_{i}(v_{t,-i}). A reasonable choice is the Clarke pivot payment function

h_{i}(v_{t,-i})=\max_{\mathbf{a_{t}}}\sum_{j\neq i}v_{t,j}(\mathbf{a_{t}}).

Under the VCG mechanism convention, the utility of agent i for the chosen action \mathbf{a_{t}^{*}} is u_{i}(\mathbf{v_{t}}):=v_{t,i}(\mathbf{a_{t}^{*}})-p_{i}(\mathbf{v_{t}}). Here, p_{i}(\mathbf{v_{t}}) is the social cost of the joint action \mathbf{a_{t}^{*}}. The social utility of joint action \mathbf{a_{t}^{*}} is

u(\mathbf{v_{t}}):=\sum_{i}u_{i}(\mathbf{v_{t}})=\sum_{i}\bigl(v_{t,i}(\mathbf{a^{*}_{t}})-p_{i}(\mathbf{v_{t}})\bigr).
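One cycle of the protocol can be sketched as follows; `valuations[i]` is the declared v_{t,i} represented as a dict over joint actions, and `env(history, joint_action)` is an assumed interface for sampling \phi(\mathbf{or_{t}}\,|\,\mathbf{h_{t-1}}\mathbf{a^{*}_{t}}).

```python
def controlled_step(valuations, joint_actions, env, history, clarke_pivot=True):
    """One cycle of the controlled multi-agent environment protocol.

    Returns the welfare-maximising joint action a*_t of eq. (13), the sampled
    joint percept, and the per-agent payments (the social cost of a*_t).
    """
    welfare = lambda a: sum(v[a] for v in valuations)
    a_star = max(joint_actions, key=welfare)            # eq. (13)

    payments = []
    for i in range(len(valuations)):
        others = lambda a: sum(v[a] for j, v in enumerate(valuations) if j != i)
        h_i = max(others(a) for a in joint_actions) if clarke_pivot else 0.0
        payments.append(h_i - others(a_star))           # p_i(v_t)

    or_t = env(history, a_star)                         # or_t ~ phi(. | h a*_t)
    return a_star, or_t, payments
```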
What should the agent valuation functions be?

Given full knowledge of the environment \phi and the mechanism (f,p_{1},\ldots,p_{k}), the valuation function v_{t,i} of agent i with horizon m_{i} at time t, where t\leq m_{i}, having seen history \mathbf{h_{t-1}}, is defined inductively as follows:

v_{t,i}(\mathbf{h_{t-1}},\mathbf{a_{t}})=\mathit{if}\;m_{i}-t=0\;\mathit{then}\;\sum_{\mathbf{or_{t}}}\phi(\mathbf{or_{t}}\,|\,\mathbf{h_{t-1}}\mathbf{a_{t}})\cdot r_{t,i}
\mathit{else}\;\sum_{\mathbf{or_{t}}}\phi(\mathbf{or_{t}}\,|\,\mathbf{h_{t-1}}\mathbf{a_{t}})[r_{t,i}+u_{t+1,i}(\mathbf{h_{t-1}}\mathbf{a_{t}}\mathbf{or_{t}})] (14)
u_{t,i}(\mathbf{h_{t-1}})=v_{t,i}(\mathbf{h_{t-1}},f(\mathbf{v_{t}}(\mathbf{h_{t-1}})))-p_{i}(\mathbf{v_{t}}(\mathbf{h_{t-1}})), (15)

where we denote \mathbf{v_{t}}(\mathbf{h_{t-1}})=(v_{t,1}(\mathbf{h_{t-1}},\cdot),\ldots,v_{t,k}(\mathbf{h_{t-1}},\cdot)). As shown in Appendix A, the utility of agent i with horizon m at time t=1 can be rewritten as

u_{1,i}(\epsilon)=\sum_{\mathbf{or_{1:m}}}\varphi_{m}(\mathbf{or_{1:m}}\,|\,\mathbf{v_{1:m}})\left[\sum_{t=1}^{m}r_{t,i}\right]-\sum_{\mathbf{or_{1:(m-1)}}}\varphi_{m-1}(\mathbf{or_{1:m-1}}\,|\,\mathbf{v_{1:m-1}})\left[\sum_{t=1}^{m}p_{i}(\mathbf{v_{t}})\right], (16)

where the components of \mathbf{v_{t}} are defined by (14). Note that the agents can have different horizons.

To see that (14) is a reasonable definition, we can appeal to Theorem 2 and use backward induction to argue that, at every time step, every single agent can do no better than report its valuation function truthfully, starting with the base case of t=m-1. Given that the u_{t+1,i}(\cdot) future-utility term in (14) is thus well-defined when all the agents are assumed to be rational, the formulation of (14) then gives the best expected result for each possible joint action \mathbf{a_{t}}. In the case where the mechanism M=(f,p_{1},\ldots,p_{k}) is probabilistic (e.g. the Exponential VCG Mechanism described in § 3.3), the utility function (15) can be generalised to take the expectation over the values of f(v_{t,1},\ldots,v_{t,k}).

Variations of Agent Valuation Functions

We can, if necessary, add reward discounting to (14) in the usual manner. If the underlying environment is known to be episodic – repeated games or episodic MDPs, for example – we can also adjust the base recursive condition of (14) to be t\bmod m=0.

Partial Knowledge

In practice, an agent participating in a mechanism-controlled environment will not have full knowledge of the environment and will likely not have full visibility of the joint actions and / or payments. Some of the possible configurations are covered in § 4.2, with more detailed descriptions of learning algorithms covered in § 5.

4.2 Special Cases and Related Settings

Here are some special cases of a mechanism-controlled environment M\triangleright\phi, all of which have a rich literature behind them.

Single Agent

When k=1, the formula (14) simplifies to (3) because f(v)=\arg\max_{a}v(a), and the protocol becomes that of the single-agent general reinforcement learning setting described in § 2.1. And, of course, when the actions are null or have no effect on the environment, we recover the sequential prediction setting, which includes Solomonoff Induction [117, 118], prediction with expert advice [23] and the closely related zero-sum repeated games.

Two Player Turn-based Games

When k=2 and the agents take turns executing actions (and therefore the actions are never exclusionary), our general setup reduces to two-player turn-based games, covering both the perfect-information and imperfect-information settings studied in game theory [132, 99].

Multi-Agent Reinforcement Learning

If the actions of the k agents are never mutually exclusive, then \mathbf{a_{t}^{*}}=f(v_{t,1},\ldots,v_{t,k})=\arg\max_{\mathbf{a}\in\mathbb{A}}\sum_{i}v_{t,i}(\mathbf{a}) is such that \mathbf{a_{t}^{*}}=\arg\max_{\mathbf{a}\in\mathbb{A}}v_{t,i}(\mathbf{a}) for each i\in[1,\ldots,k]. In that case, the mechanism M in M\triangleright\phi never plays a role and the protocol becomes that of the multi-agent general reinforcement learning setting described in § 2.2. Comprehensive surveys of key challenges and results in this area can be found in [115, 95, 142], with a unifying framework in [82].

Static Mechanism Design

When m=1 and \phi is fixed, the setup recovers classical mechanism design settings like auctions and single-decision markets [17]. For example, if \phi(\cdot\,|\,f(v_{1},\ldots,v_{k})) assigns probability 1 to the outcome in which agent i^{*}=\arg\max_{i}v_{i}(i) gets the percept (i^{*},v_{i^{*}}) and every other agent gets (i^{*},0), and the payment function is the Clarke pivot function, then M\triangleright\phi reduces to the Vickrey second-price auction among k bidders.

Dynamic Mechanism Design

When 1<m\leq\infty and k and \phi can change over time, we are in the dynamic mechanism design setting [14], with special cases like sequential auctions, and market participants that come and go. Such dynamic mechanisms have been used, for example, in Uber’s surge-pricing model [26] and congestion control [10]. A general online mechanism design algorithm based on Hedge [48] can be found in [66].

Multi-Agent Coordinated Planning

There are both similarities and important differences between our mechanism-controlled multi-agent environment setup and the literature on online mechanisms for coordinated planning and learning in multi-agent systems [103, 22]. In the latter case, there is usually a top-level Markov Decision Process (MDP) whose transition and reward functions are common knowledge to all the agents; each agent may only be active during certain time periods, and the agents hold private information – usually something simple like the value attached to winning an item being auctioned, or a hidden state that forms part of the state of the top-level MDP – that is needed by a central planner to work out the optimal joint policy for all the agents. In that setup, the key problem is the design of dynamic VCG-style mechanisms to incentivise all the agents to truthfully reveal their private information to the central planner at each time step. It is also worth noting that the concept of a self-interested agent in the multi-agent coordinated planning literature is mostly that of an agent who may lie about its own private information in order to gain an advantage, which is a narrow form of the classical economic notion of a self-interested agent that acts so as to maximise its own expected future cumulative rewards.

Multi-Agent Coordinated Learning

Our setup can also be understood game-theoretically as a simultaneous-move sequential game in which there are k agents, the action space for each agent is the set of all valuation functions, and the loss function for agent i given the collective actions (i.e. the submitted valuation functions of all the agents) is its negative utility as determined by the VCG mechanism. In the learning in games literature [49], the underlying game dynamics is usually assumed to be static and the concern of each agent is primarily around learning a best response to the possibly adaptive strategies of other agents. Key results in such simultaneous-move repeated games can be found in [16, 23].

Dynamic Mechanism Design via Reinforcement Learning

In the multi-agent coordinated planning case, the underlying MDP is assumed to be known. When this assumption is dropped, the mechanism designer has to use reinforcement learning algorithms to learn the parameters of the VCG mechanism from data, including the payment functions. The simpler bandit setting is tackled in [74]. A reinforcement learning (RL) algorithm using linear function approximation is described in [107]. It is worth noting that it is the mechanism designer that is using RL to learn the optimal VCG mechanism parameters over time from the environment made up of multiple agents and their instantaneous (or horizon 1) reward functions. The RL for dynamic mechanism design framework is silent on where the agents’ reward functions come from; each agent’s reward function could come from another learning process, for example via the agent learning its owner’s preferences.

5 Learning in the Presence of Social Cost

5.1 Measures of Success

In [115], when it comes to multi-agent reinforcement learning, the authors ask “If learning is the answer, what is the question?”. The answer is not obvious, not only because the underlying environment is non-stationary – which can already appear in the single-agent reinforcement learning setting – but also because the agents can adapt to each other’s behaviour, so each agent can play the dual roles of learner and teacher at the same time. For example, the storied Tit-for-Tat strategy [8, 97, 96] is an agent policy that both learns from and ‘teaches’ the other agents in iterated Prisoner’s Dilemma-type problems.

Several possible and non-unique theories of successful learning, both descriptive and prescriptive, are provided in [115, §7]. In the context of multi-agent reinforcement learning under mechanism design, we are concerned primarily with designing learning algorithms that satisfy some or all of the following properties, noting that the agents do not learn policy functions but only valuation functions that are fed into a VCG mechanism for joint action selection.

Convergence

Starting from no or only partial knowledge of the environment and the other agents, each agent’s learned valuation function at any one time should converge to (15) and its realised utility should converge over time to (16).

No Regret

An agent’s learning algorithm minimises regret if, against any set of other agents, it learns a sequence of valuation functions that achieves long-term utility almost as well as what the agent can achieve by picking, with hindsight, the best fixed valuation function for every time step.

Rational

An agent’s learning algorithm is rational if, whenever the other agents have settled on a stationary set of valuation functions, it settles on a best response to that stationary set.

Incentive Compatibility

In the case where the parameters of the VCG mechanism are learned from data, we require that the mechanism exhibits approximate incentive compatibility with high probability.

Some of these concepts will be made more precise in the coming subsections.

5.2 Bayesian Reinforcement Learning Agents

In practice, an agent i operating in a given controlled multi-agent environment M\triangleright\phi will not actually have enough information to construct v_{t,i} as defined in (14). First of all, it does not know what \phi is. A good solution is to use a Bayesian mixture \xi_{\mathcal{P}}, for a suitable model class \mathcal{P}, to learn \phi. So at time t with history \mathbf{h_{t-1}}, agent i approximates the expression \phi(\mathbf{or_{t}}\,|\,\mathbf{h_{t-1}}\mathbf{a_{t}}) by

\sum_{\rho\in\mathcal{P}}w_{0}^{\rho}\rho(\mathbf{or_{t}}\,|\,\mathbf{h_{t-1}}\mathbf{a_{t}}). (17)

The quantity \mathbf{p_{t}}=(p_{1},\ldots,p_{k}) in (15) can also be estimated directly using, say, another Bayesian mixture \xi_{\mathcal{Q}} via

\sum_{\rho\in\mathcal{Q}}w_{0}^{\rho}\rho(\mathbf{p_{t}}\,|\,\mathbf{apor_{<t}}\mathbf{a_{t}}). (18)

In general terms, we can think of (17) as learning the dynamics of the underlying environment \phi, and (18) as learning the preferences and strategies of the other agents in the environment.

Online Mixture Learning in Practice

The optimal model class \mathcal{P} (and \mathcal{Q}) for each agent to use is of course the class of all computable functions, yielding a multi-agent Solomonoff induction setup, where we can see that each agent’s model converges at a fast rate to the true environment by virtue of Theorem 1. This proposal has two issues. First of all, it is an incomputable solution. Secondly, Theorem 1 is only applicable when the environment is computable, and this assumption is violated when the environment contains other agents that are themselves incomputable.

In practice, Solomonoff induction can be approximated efficiently using factored, binarised versions of the Context Tree Weighting algorithm [135, 127, 129, 139], or online prediction-with-experts algorithms like Exponential Weights / Hedge [48, 23, 4], Switch [131, 128] and their many special cases [62]. While these algorithms have their roots in online convex optimisation, they can be interpreted as Bayesian mixtures when the loss function is log loss [81], which in our setup is -\log_{2}M(or), where M is the model for \phi and or is the observed percept. In particular, the Hedge algorithm is an exact Bayesian mixture model when the loss function is log loss and the learning rate is 1, in which case one can show that the Hedge weight for each expert is its posterior probability [23, §9.2]. The Prod algorithm with log loss [98] has been shown to be a “soft-Bayes” algorithm, which coincides with the exact Bayesian mixture when the learning rate is 1 and can be interpreted as a “slowed-down” version of the Bayesian mixture, or a Bayesian mixture with “partially sleeping” experts, when the learning rate is less than 1. (When the true environment is contained in the model class, the optimal learning rate is 1 because Bayesian inference is optimal. However, in the agnostic / improper learning setting where the true environment may not be in the model class, the learning rate needs to decrease over time to avoid pathological issues, especially in non-convex function classes [56, 125].) More generally, [81] shows that there is a spectrum of Bayesian mixture algorithms over expert sequences with different priors that can be understood as interpolations between Bayesian mixtures over fixed experts and (non-learning) element-wise mixtures. The spectrum includes Fixed-Share [61] with static Bernoulli switching frequencies, Switching distributions with dynamic slowly-decreasing switching frequencies [126, 124] and dynamically learned switching frequencies [131], and more general switching frequencies that are dependent on some ordering on experts [133, 81].
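The equivalence mentioned above is easy to see in code: with log loss and learning rate 1, the multiplicative Hedge update is exactly Bayes’ rule over the experts. A minimal sketch (expert predictions given as dicts over percepts; the names are illustrative):

```python
def hedge_log_loss_update(weights, expert_probs, percept, eta=1.0):
    """One Hedge step under log loss ell_i = -ln P_i(percept).

    The normalised update w_i <- w_i * exp(-eta * ell_i) = w_i * P_i(percept)^eta
    is exact Bayesian posterior updating when eta = 1, and a "slowed-down"
    (soft-Bayes-like) update when eta < 1.
    """
    new = [w * max(p[percept], 1e-300) ** eta
           for w, p in zip(weights, expert_probs)]
    z = sum(new)
    return [w / z for w in new]
```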

Such online-learning algorithms can exhibit good regret-minimising convergence behaviour under different scenarios. In the context of multi-agent systems, regret-minimisation learning algorithms are particularly useful in that one can show that a set of agents that all minimise a notion of regret called swap regret will converge to a correlated equilibrium [7], where no agent would want to deviate from its preferences / strategies assuming the others also do not deviate. In [16, 15, 27], the authors describe a general procedure to turn agents that minimise the standard form of regret into agents that minimise swap regret, which is regret that allows any specific agent action to be switched with some other action with the benefit of hindsight. This is achieved via a master algorithm that runs N=|\mathcal{A}| instances of a standard regret minimisation algorithm, one for each possible action. Each of the A_{l} algorithms, l\in[N], maintains a weight vector q_{l}^{t}\in\mathbb{R}^{N} at each time step as usual. The master algorithm maintains a weight vector p^{t}\in\mathbb{R}^{N} that is the solution to the equation p^{t}=p^{t}Q^{t}, where Q^{t}\in\mathbb{R}^{N\times N} is the matrix made up of the row vectors q_{l}^{t}, l\in[N]. (The p^{t} on the r.h.s. of p^{t}=p^{t}Q^{t} can be understood as the weights attached to the A_{l} algorithms, and the p^{t} on the left is best understood as the probability of selecting the different actions. Thus, p^{t} is the stable limiting distribution of the stochastic matrix Q^{t}, and the existence and efficient computability of p^{t} is given by the Perron-Frobenius Theorem.) The master algorithm incurs a loss \ell^{t} at each time step, which is then attributed to A_{l} in the amount p_{l}^{t}\ell^{t}. Intuitively, each A_{l} is responsible for minimising the regret of switching action l to any other action via its standard regret minimising property.
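The fixed point p^{t}=p^{t}Q^{t} can be computed, for example, by power iteration on the row-stochastic matrix Q^{t}; the sketch below shows one simple way to do it.

```python
def stationary_distribution(Q, iters=1000, tol=1e-12):
    """Solve p = p Q for a row-stochastic matrix Q (whose rows are the weight
    vectors q_l^t of the N sub-algorithms) by power iteration; existence of
    the fixed point follows from the Perron-Frobenius theorem."""
    n = len(Q)
    p = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(p[l] * Q[l][j] for l in range(n)) for j in range(n)]
        s = sum(new)
        new = [x / s for x in new]                # re-normalise for stability
        if max(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
    return p
```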

Monte Carlo Planning

The recurrence in (14)-(15) can be approximated using variants of the Monte Carlo Tree Search (MCTS) algorithm [20, 130, 122]. Even though there are no explicit min-nodes and max-nodes in an MCTS tree, it is known that MCTS converges to the (expecti)minimax tree because, as the number of roll-out operations goes to infinity, the UCT [79] selection policy at each decision node concentrates the vast majority of the roll-outs on the best child nodes, so the weighted-average back-up rewards converge to the max/min values. It is also worth noting that, rather than choosing the action that maximises the value function at the root of the MCTS tree, the agent declares the entire value function to M\triangleright\phi for a joint action to be chosen by the mechanism M. The v_{t,i} arguments to f(\cdot) and p_{i}(\cdot) in (15) can be estimated using the decision nodes in the MCTS tree at any one time.

To gain some intuition into how the MCTS algorithm would work, consider the tree in Fig 1 for the simple setup where the horizon is m=2, the action space consists only of \{a^{1},a^{2}\} and the percept space consists only of \{or^{1},or^{2}\}. There are two types of alternating nodes: decision nodes (in red) and observation nodes (in green). Each node is indexed by the sequence of symbols in the path from the root to the node. Attached to

  • each terminal decision node at the bottom of the tree labelled by a percept or is a set of values \{r_{i}\}_{i=1\ldots k};

  • each non-terminal decision node indexed by h_{t} is a set \{u_{t+1,i}(h_{t})\}_{i=1\ldots k} of values corresponding to (15);

  • each observation node indexed by h_{t-1}a_{t} is a set \{v_{t,i}(h_{t-1},a_{t})\}_{i=1\ldots k} of values corresponding to (14).

Each non-terminal decision node indexed by h_{t} has a social-welfare maximising action \arg\max_{a}\sum_{i}v_{t+1,i}(h_{t},a), which is indicated by a thick arrow. In the diagram, we assume each of the agents has the same horizon. In general, this is not the case, and some decision nodes can play the role of both terminal and non-terminal nodes, depending on each agent’s horizon.

Figure 1: A horizon-2 game tree with small action and perception spaces
Model-Free Reinforcement Learning

The above discussion is mostly based on model-based reinforcement learning approaches. It is possible to adopt model-free reinforcement learning approaches, like variants of Q-learning and temporal-difference learning, to learn the valuation and utility functions too. A key consideration here is that the underlying environment is non-stationary, so suitable adjustments are required. It is also worth noting that policy-gradient RL algorithms are harder to apply in our setting, because the VCG mechanism requires valuation functions to be submitted by agents.

Partial Observability

We have so far assumed each agent can see everything, including the declared valuation functions \mathbf{v_{t}} of the other agents, the chosen joint action \mathbf{a_{t}}, the full joint percept \mathbf{or_{t}}, and all payments \mathbf{p_{t}}. It is possible to relax this assumption, which we will briefly look at now.

Assumption 1.

Each agent i sees, at time t, the joint action \mathbf{a_{t}} but only its own percept \mathbf{or_{t|i}}:=or_{t,i} and the value of its payment p_{i}(\mathbf{v_{t}}). Its view of the history \mathbf{h_{t}} up to time t is denoted \mathbf{h_{t|i}}:=\mathbf{a_{1}}or_{1,i}\mathbf{a_{2}}or_{2,i}\ldots\mathbf{a_{t}}or_{t,i}.

Definition 15.

Given a multi-agent environment \varrho, we can define agent i’s view of \varrho under Assumption 1 as \varrho_{|i}=\{\varrho_{0|i},\varrho_{1|i},\varrho_{2|i},\ldots\}, where \varrho_{t|i}:\mathbf{\mathcal{A}}^{t}\to\mathcal{D}(\mathcal{O}_{i}\times\mathcal{R})^{t} is defined by

\varrho_{t|i}(o^{\prime}r^{\prime}_{1:t}\,|\,\mathbf{a_{1:t}}):=\sum_{\mathbf{or_{1:t}}\;\text{s.t.}\;\mathbf{or_{1:t|i}}=o^{\prime}r^{\prime}_{1:t}}\varrho_{t}(\mathbf{or_{1:t}}\,|\,\mathbf{a_{1:t}}).

The valuation function (14) can no longer be approximated directly from the partial observations. Instead, we have to use ϕt|i\phi_{t|i} in place of ϕt\phi_{t} in (14), and learn the f()f() and p()p() functions in (15) from data obtained from interactions with the environment, possibly using Bayesian mixture estimators.
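For illustration, when the joint percept distribution is available in tabular form, the marginalisation in Definition 15 can be computed directly, as in the following sketch (function names are ours):

```python
def project(joint_percept_seq, i):
    """Agent i's view of a sequence of joint percepts."""
    return tuple(p[i] for p in joint_percept_seq)

def marginal_prob(rho_t, i, own_percepts, joint_actions):
    """rho_{t|i}(own_percepts | joint_actions), where rho_t(joint_actions) is a
    dict mapping joint percept sequences to their probabilities."""
    return sum(prob for joint_seq, prob in rho_t(joint_actions).items()
               if project(joint_seq, i) == tuple(own_percepts))
```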

Markovian State Abstraction

To maintain computational tractability and statistical learnability, we will need to approximate estimators like (17) with

ρ𝒫w0ρρ(𝐨𝐫𝐭|χ(𝐡𝐭𝟏𝐚𝐭)),\sum_{\rho\in\mathcal{P}}w_{0}^{\rho}\rho(\mathbf{or_{t}}\,|\,\chi(\mathbf{h_{t-1}}\mathbf{a_{t}})),

where χ\chi is a feature function that maps arbitrarily long history sequences into an nn-dimensional feature space, usually a strict subset of n\mathbb{R}^{n}. Depending on the application, such a χ\chi could be hand-coded by domain experts, or picked from a class of possible feature functions using model-selection principles. The aim is to select a mapping χ\chi such that the induced process can facilitate learning without severely compromising performance. Theoretical approaches to this question have typically focused on providing conditions that minimise the error in the action-value function between the abstract and original process [1, 84]. The Φ\PhiMDP framework [68] instead provides an optimisation criterion for ranking candidate mappings χ\chi based on how well the state-action-reward sequence generated by χ\chi can be modelled as an MDP whilst still being predictive of the rewards. A good χ\chi results in a model that is Markovian and predictive of future rewards, facilitating the efficient learning of good actions that maximise the expected long-term reward. The class of possible feature functions can be defined using formal logic [51, 36, 87, 40] or obtained from embedding techniques [12, 21, 55] and Deep Learning techniques [91, 5].
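As a minimal sketch of the abstracted mixture estimator above (class and method names are ours), each candidate model predicts the next joint percept from the feature vector produced by χ\chi, and the mixture weights are updated by Bayes' rule after each percept:

```python
class AbstractedMixture:
    """Bayesian mixture over a finite model class, each model conditioning on a
    feature abstraction chi of the history, as in the displayed estimator."""

    def __init__(self, models, chi, prior=None):
        self.models = models               # each model: predict(features) -> {percept: prob}
        self.chi = chi                     # feature function over (history, action)
        n = len(models)
        self.w = list(prior) if prior else [1.0 / n] * n

    def predict(self, history, action):
        feats = self.chi(history, action)
        mix = {}
        for w, m in zip(self.w, self.models):
            for percept, p in m.predict(feats).items():
                mix[percept] = mix.get(percept, 0.0) + w * p
        return mix

    def update(self, history, action, percept):
        feats = self.chi(history, action)
        posterior = [w * m.predict(feats).get(percept, 0.0)
                     for w, m in zip(self.w, self.models)]
        z = sum(posterior) or 1.0          # guard against zero evidence in the sketch
        self.w = [p / z for p in posterior]
```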

Two such algorithms, motivated by AIXI [67], are described in [139, 138]. In both cases, the formal agent knowledge representation and reasoning language is as described in [88, 87]. In particular, the syntax of the language comprises the terms of the λ\lambda-calculus [29], extended with type polymorphism and modalities to increase its expressiveness for modelling agent concepts. The semantics of the language follow Henkin [60] and Kripke [47], suitably extended. The inference engine has two interacting components: an equational-reasoning engine and a tableau theorem prover [37]. There is also a predicate rewrite system for defining and enumerating a set of predicates. The choice of formalism is informed by the standard arguments given in [45, 65] and the practical convenience of working with (the functional subset of) a language like Python. (Suitable alternatives to the formalism are studied in the Statistical Relational Artificial Intelligence literature [51, 108], including a few that cater specifically for relational reinforcement learning [40, 38] and Symbolic MDP/POMDPs [78, 112].) The feature selection problem was addressed through the use of a so-called random forest binary decision diagram algorithm to find a set of features that approximately minimise an adjusted version of the Φ\Phi-MDP criterion [93].

5.3 Bandit VCG Mechanisms

Note that, in practice, the agents’ valuation functions are not fully known but are estimated from the actual percepts they receive from the environment, which in turn depend on the joint actions chosen by the VCG mechanism. This means formula (13) in the mechanism-controlled environment protocol needs to be handled carefully; in particular, there is an exploration-exploitation trade-off: the agents need enough exploration to arrive at good estimates of their valuation functions before we can reliably compute the argmax𝐚𝔸\arg\max_{\mathbf{a}\in\mathbb{A}} over them. This problem can be addressed using bandit algorithms, and the techniques described in [74] that use upper-confidence bounds [6, 83] are directly applicable in our setting. A key finding in [74] is that (asymptotic) truthfulness is harder to achieve with agents that learn their valuation functions; this may also be an issue in our setting.
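As a simplified illustration of this trade-off (and not the algorithm of [74]), the mechanism could maintain optimistic upper-confidence estimates of each agent's mean value for each joint action and pick the joint action that maximises the optimistic social welfare:

```python
import math
from collections import defaultdict

class UCBJointActionSelector:
    """Optimistic selection of joint actions while the agents are still learning
    their valuations; a crude stand-in for the bandit techniques in [74]."""

    def __init__(self, joint_actions, c=2.0):
        self.joint_actions = list(joint_actions)
        self.c = c
        self.counts = defaultdict(int)                        # joint action -> pulls
        self.sums = defaultdict(lambda: defaultdict(float))   # action -> agent -> summed reports
        self.t = 0

    def select(self):
        self.t += 1
        def ucb(a):
            n = self.counts[a]
            if n == 0:
                return float('inf')                           # force initial exploration
            mean_welfare = sum(self.sums[a].values()) / n
            return mean_welfare + self.c * math.sqrt(math.log(self.t) / n)
        return max(self.joint_actions, key=ucb)

    def observe(self, joint_action, reported_values):
        # reported_values: dict mapping each agent to its realised/declared value
        self.counts[joint_action] += 1
        for i, v in reported_values.items():
            self.sums[joint_action][i] += v
```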

5.4 Markov VCG Mechanisms in Unknown Environments

§ 5.2 describes agents that learn in a mechanism-controlled environment. In this section, we take the perspective of the mechanism designer and look at reinforcement learning algorithms that can adjust the parameters of the VCG mechanism based on interactions with agents that are themselves capable of learning and adjusting.

We will first summarise the setup and algorithmic framework described in [107] and then describe some possible extensions. The environment with a controller and kk agents is defined by an episodic Markov Decision Process Ξ=(𝒮,𝒜,H,𝒫,{ri}i=0k)\Xi=(\mathcal{S},\mathcal{A},H,\mathcal{P},\{r_{i}\}_{i=0}^{k}), where 𝒮\mathcal{S} and 𝒜\mathcal{A} are the state and action spaces, HH is the length of each episode, 𝒫={𝒫t:𝒮×𝒜𝒟(𝒮)}t=1H\mathcal{P}=\{\mathcal{P}_{t}:\mathcal{S}\times\mathcal{A}\to\mathcal{D}(\mathcal{S})\}_{t=1}^{H} is the state transition function, and ri={ri,t:𝒮×𝒜[0,1]}t=1Hr_{i}=\{r_{i,t}:\mathcal{S}\times\mathcal{A}\to[0,1]\}_{t=1}^{H} are the reward functions, with r0r_{0} denoting the reward function for the controller and ri,1ikr_{i},1\leq i\leq k, denoting the reward function for agent ii. Except for r0r_{0}, the environment Ξ\Xi is unknown to the controller. The controller interacts with the kk agents in multiple episodes, with each episode lasting HH time steps. An initial state x1x_{1} is set at the start of each episode. For each time step tt in the episode, the controller observes the current state xt𝒮x_{t}\in\mathcal{S}, picks an action at𝒜a_{t}\in\mathcal{A}, and receives a reward r0,t(xt,at)r_{0,t}(x_{t},a_{t}). Each agent ii receives its own reward ri,t(xt,at)r_{i,t}(x_{t},a_{t}) and reports a value r~i,t(xt,at)\widetilde{r}_{i,t}(x_{t},a_{t}) to the controller. At the end of the episode, the controller charges each agent ii a price pip_{i}. Given the controller’s policy function π={πt:𝒮𝒜}t=1H\pi=\{\pi_{t}:\mathcal{S}\to\mathcal{A}\}_{t=1}^{H} and the prices {pi}i=1k\{p_{i}\}_{i=1}^{k}, the controller’s utility for the episode is defined by

u0=V1π(x1;r0)+i=1kpi,u_{0}=V_{1}^{\pi}(x_{1};r_{0})+\sum_{i=1}^{k}p_{i},

and each agent ii’s utility for the episode is defined by ui=V1π(x1;ri)piu_{i}=V_{1}^{\pi}(x_{1};r_{i})-p_{i}, where

Vhπ(x;r)=t=hH𝔼π,𝒫[rt(xt,πt(xt))xh=x].V_{h}^{\pi}(x;r)=\sum_{t=h}^{H}\mathbb{E}_{\pi,\mathcal{P}}[r_{t}(x_{t},\pi_{t}(x_{t}))\mid x_{h}=x].

The controller’s goal is to learn the policy π\pi^{*} and pricing {pi}i=1k\{p_{i}\}_{i=1}^{k} that implements the so-called Markov VCG mechanism

π(x)=argmaxπV1π(x;R)\displaystyle\pi^{*}(x)=\arg\max_{\pi}V_{1}^{\pi}(x;R) (19)
πi(x)=argmaxπV1π(x;Ri)\displaystyle\pi^{*}_{-i}(x)=\arg\max_{\pi}V_{1}^{\pi}(x;R^{-i}) (20)
pi(x)=V1πi(x,Ri)V1π(x,Ri),\displaystyle p_{i}(x)=V_{1}^{\pi^{*}_{-i}}\bigl{(}x,R^{-i}\bigr{)}-V_{1}^{\pi^{*}}\bigl{(}x,R^{-i}\bigr{)}, (21)

where R=j=0krjR=\sum_{j=0}^{k}r_{j} and Ri=RriR^{-i}=R-r_{i}. The mechanism has the following property:

Lemma 7 ([89]).

The Markov VCG mechanism is incentive compatible and individually rational.

Of course, one can only implement the Markov VCG mechanism directly if the 𝒫\mathcal{P} and {ri}i=0k\{r_{i}\}_{i=0}^{k} components of the underlying episodic MDP are known and (19) and (20) can be computed efficiently. In practice, the controller has to learn 𝒫\mathcal{P} and {ri}i=0k\{r_{i}\}_{i=0}^{k} from interactions with the environment and the agents, and (19) and (20) need to be handled using function approximation when the state and action spaces 𝒮\mathcal{S} and 𝒜\mathcal{A} are large.
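For intuition, when the episodic MDP is small and fully known, the Markov VCG mechanism (19)-(21) can be computed exactly by finite-horizon value iteration; the following tabular sketch is ours and only illustrates the definitions:

```python
import numpy as np

def greedy_policy(P, r, H):
    """Finite-horizon value iteration. P has shape [H, S, A, S], r has shape
    [H, S, A]; returns the optimal values V (shape [H+1, S]) and greedy policy."""
    S = P.shape[1]
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r[h] + P[h] @ V[h + 1]         # shape [S, A]
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return V, pi

def policy_value(P, r, H, pi):
    """Value of a fixed policy pi under reward r."""
    S = P.shape[1]
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        for s in range(S):
            a = pi[h, s]
            V[h, s] = r[h, s, a] + P[h, s, a] @ V[h + 1]
    return V

def markov_vcg(P, rewards, x1):
    """Exact Markov VCG for a known tabular episodic MDP.
    rewards is the list r_0, ..., r_k, each of shape [H, S, A]."""
    H = P.shape[0]
    R = sum(rewards)
    V_star, pi_star = greedy_policy(P, R, H)                  # (19)
    prices = []
    for i in range(1, len(rewards)):                          # agents 1..k (r_0 is the controller's)
        R_mi = R - rewards[i]
        V_mi, _ = greedy_policy(P, R_mi, H)                   # (20)
        V_pistar_mi = policy_value(P, R_mi, H, pi_star)
        prices.append(V_mi[0, x1] - V_pistar_mi[0, x1])       # (21)
    return V_star[0, x1], prices
```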

In [107], a reinforcement learning framework was proposed for learning Markov VCG with Least-Squares Value Iteration (LSVI) [73]. There are two phases: an exploration phase and an exploitation phase. During the exploration phase, the controller uses reward-free reinforcement learning [72] to interact with the environment and the kk agents to appropriately explore the underlying MDP. This exploration phase is crucial because the controller would need enough data to approximate (20) and (21) by simulating counterfactual scenarios for when each agent ii is missing from the environment. During the exploitation phase, the controller will then repeat the following steps in each episode tt:

  1. Construct a π^t\widehat{\pi}^{t} that approximates (19) using LSVI on previously collected data DD;

  2. Learn an approximation Fti(x)F_{t}^{-i}(x) of V1πi(x,Ri)V_{1}^{\pi^{*}_{-i}}\bigl{(}x,R^{-i}\bigr{)}, the first term in (21), using LSVI on collected data DD;

  3. Learn an approximation Gti(x)G_{t}^{-i}(x) of V1π(x,Ri)V_{1}^{\pi^{*}}\bigl{(}x,R^{-i}\bigr{)}, the second term in (21), using LSVI on collected data DD and π^t\widehat{\pi}^{t} from Step 1;

  4. For time step h=1,,Hh=1,\ldots,H, observe state xt,hx_{t,h}, take action at,h=π^t(xt,h)a_{t,h}=\widehat{\pi}^{t}(x_{t,h}) and observe r~i,t,h(xt,h,at,h)\widetilde{r}_{i,t,h}(x_{t,h},a_{t,h}) from each agent ii;

  5. Charge each agent ii the price pi,t(x1)=Fti(xt,1)Gti(xt,1)p_{i,t}(x_{1})=F_{t}^{-i}(x_{t,1})-G_{t}^{-i}(x_{t,1});

  6. Add {(xt,h,at,h,{r~i,t,h(xt,h,at,h)}i=1k)}h=1H\{(x_{t,h},a_{t,h},\{\widetilde{r}_{i,t,h}(x_{t,h},a_{t,h})\}_{i=1}^{k})\}_{h=1}^{H} to DD.

In the first three steps, the Least-Squares Value Iteration algorithm is used, in conjunction with a suitable model class \mathcal{F}, to construct a sequence of functions (Q^t,h)h=1H(\widehat{Q}_{t,h})_{h=1}^{H} that approximates the Q-function for the corresponding value function V1π(,r)V_{1}^{\pi}(\cdot,r) at episode tt via the following optimisation problem, from h=H,,1h=H,\ldots,1:

Q^t,h=argminfτ=1t1[rτ,h(xτ,h,aτ,h)+V~h+1π(xτ,h+1)f(xτ,h,aτ,h)]2+reg(f),\displaystyle\widehat{Q}_{t,h}=\arg\min_{f\in\mathcal{F}}\sum_{\tau=1}^{t-1}\bigl{[}r_{\tau,h}(x_{\tau,h},a_{\tau,h})+\widetilde{V}_{h+1}^{\pi}(x_{\tau,h+1})-f(x_{\tau,h},a_{\tau,h})\bigr{]}^{2}+\text{reg}(f),

where reg(ff) is a suitable regularisation term, and V~h+1π(x)=Q^t,h+1(x,π(x))\widetilde{V}_{h+1}^{\pi}(x)=\widehat{Q}_{t,h+1}(x,\pi(x)). In [107], the authors take \mathcal{F} to be linear functions of the form f(x,a)=w,χ(x,a)f(x,a)=\langle w,\chi(x,a)\rangle for some feature mapping χ(,)m\chi(\cdot,\cdot)\in\mathbb{R}^{m} and give efficient algorithms for linear MDPs that can (i) learn to recover the Markov VCG mechanism from data, and (ii) achieve regret, with respect to the optimal Markov VCG mechanism, upper-bounded by O(T2/3)O(T^{2/3}), where TT is the total number of episodes.
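For concreteness, one backward pass of LSVI with this linear function class and a ridge regulariser can be sketched as follows (the names are ours and this is not the implementation in [107]); Steps 2 and 3 above would use the fixed policy π^t\widehat{\pi}^{t} in place of the max when forming the plug-in value:

```python
import numpy as np

def lsvi_linear(data, chi, H, dim, lam=1.0):
    """One backward pass of Least-Squares Value Iteration with linear features.
    data[h] is a list of transitions (x, a, r, x_next, next_actions) at step h
    collected over past episodes, and chi(x, a) is an m-dimensional feature
    vector; the returned weights define Q_h(x, a) ~ <w[h], chi(x, a)>."""
    w = [np.zeros(dim) for _ in range(H + 1)]     # w[H] = 0: no value after the horizon

    def v_plug_in(h, x, actions):                 # V~_h(x) = max_a <w[h], chi(x, a)>
        return max(float(chi(x, a) @ w[h]) for a in actions)

    for h in reversed(range(H)):
        A = lam * np.eye(dim)                     # ridge regulariser reg(f) = lam * ||w||^2
        b = np.zeros(dim)
        for (x, a, r, x_next, next_actions) in data[h]:
            phi = chi(x, a)
            target = r + v_plug_in(h + 1, x_next, next_actions)
            A += np.outer(phi, phi)
            b += phi * target
        w[h] = np.linalg.solve(A, b)
    return w
```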

For general MDPs or partially observable MDPs, we can replace the linear-function-class LSVI with approximate AIXI algorithms like [139, 138] to learn Markov VCG mechanisms from data.

5.5 Mechanism-level RL vs Agent-level RL

Note that (19) is a special case of (14)-(15), in that it is the policy (executed by the mechanism) that maximises iu1,i(ϵ)\sum_{i}u_{1,i}(\epsilon) (see (16)) when the underlying environment ϕ\phi is an episodic MDP, and all the agents have the same horizon mm. The key advantage of performing reinforcement learning at the mechanism level is that, when the mechanism is (approximately) incentive compatible, the RL algorithm has access to and can learn from all available data from interactions between the agents and the environment and mechanism, including all the individual rewards and payments. The key disadvantage is that it appears necessary to assume that all the agents have the exact same horizon. In comparison, the situation is reversed when RL is done at the level of individual agents: each agent’s RL algorithm usually only has access to the subset of the interaction data that it has a role in generating, but each agent can use a different RL algorithm with different parameters (such as horizon or discount factor) and a different function-approximation model class.

6 Applications

6.1 Paperclips and All That

The paperclip maximiser [19] is a thought experiment in the AI safety folklore that illustrates a potential risk from developing advanced AI systems with misaligned goals or values. Here is the idea: Imagine an AI system is tasked with the goal of maximising the production of paperclips. (This could be an explicit high-level goal set by a human, or a subgoal inferred as needed by the AI system for fulfilling a higher-level goal.) As the AI system becomes more and more intelligent, it may pursue this paperclip-production goal with relentless efficiency, converting all available resources and matter into paperclips, potentially leading to disastrous consequences for humanity and the environment. A much older (and less trivial) variation of the problem was articulated by AI pioneer Marvin Minsky, who suggested that an AI system designed to solve the Riemann hypothesis might decide to take over all of Earth’s resources to build supercomputers to help achieve its goal.

While these are obviously extreme examples, these thought experiments help focus our minds on the following key issues in the development of increasingly powerful AI systems:

  • Even seemingly harmless or trivial goals, if pursued by a superintelligent AI system without proper constraints, could lead to catastrophic outcomes as the AI single-mindedly optimises for that goal.

  • In particular, in single-mindedly pursuing a goal, a superintelligent AI may exhibit so-called convergent instrumental behaviour, such as acquiring power and resources and conducting self-improvement as subgoals, to help it achieve its top-level goals at all costs, even if those actions conflict with human values or well-being.

These thought experiments underscore the importance of instilling the right goals, values, and constraints into AI systems from the outset, as it may be extremely difficult or impossible to retrofit them into a superintelligent system once created.

There is a rich literature [50, 44, 28] on aligning the design of AI systems with human values, and there are many useful analyses in the single-agent general reinforcement learning (GRL) setting (§ 2.1) on how to ensure, for example, that

  • the agent infers the reward function from a human by observing the person’s actions through cooperative inverse reinforcement learning [58], leading to a provable solution for the off-switch problem [57] in some cases; and

  • an AI agent is prevented from performing adverse ‘self-improvement’ by tampering with its own reward function, through robust learning techniques that incorporate causality [42] and/or domain knowledge [43].

We argue here that, in the multi-agent GRL setting, additional controls in the form of imposing a social cost on agent actions can help prevent paperclip-maximiser-style AI catastrophes. In particular, when multiple superintelligent AI systems act in the same environment, an AI system cannot unilaterally perform actions that destroy humanity and the environment, or engage in convergent instrumental behaviour like acquiring all available power and resources, without encountering significant friction and obstacles. Most such actions, and their prevention by other agents (human or AI), are mutually exclusive and can therefore be subjected to control through economic mechanisms like that described in § 4, imposed either explicitly through rules and regulations or implicitly through laws of nature (like physics and biology [24, 109]). This form of social control works in concert and in a complementary way with the controls at the single-agent GRL level. Some of these controls are likely necessary but none in isolation is sufficient; together they may be.

In the paperclip maximiser example, there are two forces that will stop paperclip production from spiralling out of control. Firstly, the environment will provide diminishing (external) rewards for the agent to produce more and more paperclips, assuming there are controls in place to prevent wireheading issues [92, 137] where the agent actively manipulates the environment to give it false rewards or changes its own perception of input from the environment. Secondly, at the same time that the utility of the paperclip maximiser agent is decreasing, the utility of other agents in the same environment in taking competing actions to prevent paperclip production will increase significantly as unsustainable paperclip production threatens the environment and the other agents’ welfare. Thus, at some stage, the utility of the paperclip maximiser agent in producing more paperclips will become lower than the collective utility of other agents’ proposed actions to stop further paperclip production, and a VCG-style market mechanism will choose the latter over the former and paperclip production stops. The argument above does rely on an assumption that the agents are operating on a more-or-less level playing field, where they need to take others’ welfare into consideration when acting. In a completely lopsided environment where there is only one superintelligent agent and its welfare dominates that of all others, which could come about by design, accident, or over time through phenomena like the Matthew effect (aka rich-get-richer or winner-take-all), social controls will not be able to stop unsustainable paperclip production, and it will only stop when the one superintelligent agent wants it to stop. Both the conditions that uphold the Matthew effect and the circumstances that cause it to fail are studied in the literature [110, 13, 9]; additional controls informed by that literature will likely be needed on top of what we propose in § 4.

The argument presented in this section has similar motivations to those presented in [34]. We expect to see a lot more progress in this research topic in the coming months and years.

6.2 Cap-and-Trade to Control Pollution

Consider a set of oil refineries {R1,R2,,Rk}\{R_{1},R_{2},\ldots,R_{k}\}, where k>1k>1. Each refinery RiR_{i}:

  • produces unleaded fuel that can be distributed and sold in the retail market for $2 per litre. (We assume the output of the refineries is not big enough to change the market demand, and therefore the price of fuel.);

  • requires $1.80 in input cost (crude oil, chemicals, electricity, labour, distribution) to produce and distribute 1 litre of fuel (for details, see [46]);

  • can produce up to 100 million litres of fuel per day; and

  • produces greenhouse gases, costed at $190 per cubic ton. (See [2] for more detailed modelling.)

The refineries are not, however, equally efficient. Refinery RiR_{i} emits

si(y)=mi(y3512y2+200y+888)s_{i}(y)=m_{i}\left(\frac{y^{3}}{5}-12y^{2}+200y+888\right)

cubic tons of greenhouse gases per day, where yy is the amount of fuel (in millions of litres) it produces per day and mim_{i} is an inefficiency factor. Throughout this example we set mi=im_{i}=i; thus refinery R1R_{1} is twice as efficient as R2R_{2}, three times as efficient as R3R_{3}, and so on.

Figure 2: Plots of ss and s1s^{-1} for refineries R1R_{1} and R2R_{2}.

Graphs of s1s_{1} and s2s_{2} are shown in Fig 2. The greenhouse gas emissions increase at the start, then drop as the refinery achieves its optimum operating efficiency, after which they increase again rapidly. We also plot

si1(g):=max{y𝕐:(y)=0},{s_i}^{-1}\left(g\right):=\max{\{y\in\mathbb{Y}:\Im(y)=0\}},

where the set 𝕐\mathbb{Y} comprises 0 and the three roots of the cubic equation

mi5y312miy2+200miy+(888mig)=0,\frac{m_{i}}{5}y^{3}-12m_{i}y^{2}+200m_{i}y+(888m_{i}-g)=0,

that is, the roots of si(y)=gs_{i}(y)=g. Taking the largest real element of 𝕐\mathbb{Y} (which always contains 0) ensures si1{s_{i}}^{-1} is defined for all g0g\geq 0 and is non-decreasing. Note in Fig 2 that s11s^{-1}_{1} and s21s^{-1}_{2} have step increases at s1(28)1,470s_{1}(28)\approx 1,470 and s2(28)2,940s_{2}(28)\approx 2,940. Our construction of si1{s_{i}}^{-1} assumes the refineries will maximise their production (and hence profit) for a given cap on greenhouse gas emissions. For example, if Refinery R2R_{2} is permitted to emit 3,0003,000 cubic tons of greenhouse gases, it will choose to produce 30.530.5 million litres of fuel rather than 3.93.9 or 25.625.6 million litres.
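For concreteness, the emission function and its generalised inverse can be computed as follows (a straightforward sketch; the explicit cap at the 100 million litre capacity is our addition):

```python
import numpy as np

def s(i, y):
    """Daily emissions (cubic tons) of refinery R_i producing y million litres."""
    return i * (y**3 / 5 - 12 * y**2 + 200 * y + 888)

def s_inv(i, g, capacity=100.0):
    """Largest real non-negative root of s_i(y) = g (0 if there is none),
    capped at the refinery's production capacity."""
    roots = np.roots([i / 5, -12 * i, 200 * i, 888 * i - g])
    real = [r.real for r in roots if abs(r.imag) < 1e-9 and r.real >= 0]
    return min(max(real, default=0.0), capacity)

# e.g. s_inv(1, 3000) ~ 42.3 and s_inv(2, 3000) ~ 30.5 million litres,
# consistent with the figures quoted above and in the auction example below.
```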

No Price on Pollution

Consider the case of k=2k=2. If the two refineries do not have to pay for environmental damage from greenhouse gas emissions, each plant will produce at the maximum capacity of 100 million litres per day since they each make $0.20 operating profit per litre. The total greenhouse gas emission between them will be s1(100)+s2(100)=302,664s_{1}\left(100\right)+s_{2}\left(100\right)=302,664 cubic tons per day.

Figure 3: Cubic tons of greenhouse gas emitted at different levels of fuel production
Fixed Price for Pollution

If the two refineries have to pay the $190 per cubic ton cost but there is no cap on greenhouse gas emissions, they will seek to maximise their net profit functions nn, given respectively by n1(y)=$200ky$190s1(y)n_{1}\left(y\right)=\$200k\cdot y-\$190\cdot s_{1}\left(y\right) and n2(y)=$200ky$190s2(y)n_{2}\left(y\right)=\$200k\cdot y-\$190\cdot s_{2}\left(y\right). Taking derivatives with respect to yy, we have:

n1(y)=114y2+4560y+162000\displaystyle n^{\prime}_{1}\left(y\right)=-114y^{2}+4560y+162000
n2(y)=228y2+9120y+124000\displaystyle n^{\prime}_{2}\left(y\right)=-228y^{2}+9120y+124000

which have positive roots at y1=20+1034619y^{*}_{1}=20+10\sqrt{\frac{346}{19}} and y2=20+1053857.y^{*}_{2}=20+10\sqrt{\frac{538}{57}}. That is, Refinery R1R_{1} will stop production at y162.7y^{*}_{1}\approx 62.7 million litres per day, for a daily net profit of n1(y1)$9.58 millionn_{1}\left(y^{*}_{1}\right)\approx\$9.58\text{ million}. Similarly, Refinery R2R_{2} will stop production at 50.7250.72 million litres per day, for a daily net profit of $7.77\$7.77 million. See Fig 3. At the $190 price, the total greenhouse gas emission between the two refineries is thus s1(y1)+s2(y2)28,682s_{1}\left(y^{*}_{1}\right)+s_{2}\left(y^{*}_{2}\right)\approx 28,682 cubic tons per day.
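As a numerical check of the fixed-price case, the stationary points and net profits above can be recomputed directly (a minimal sketch; the function names and the explicit capacity cap are ours):

```python
import numpy as np

PRICE_PER_TON = 190.0     # $ per cubic ton of greenhouse gas
MARGIN = 200_000.0        # $ operating margin per million litres ($0.20 per litre)

def s(i, y):
    # emission function s_i(y) from above
    return i * (y**3 / 5 - 12 * y**2 + 200 * y + 888)

def optimal_production(i, capacity=100.0):
    # positive root of n_i'(y) = MARGIN - PRICE_PER_TON * s_i'(y) = 0
    coeffs = [-PRICE_PER_TON * 3 * i / 5,
              PRICE_PER_TON * 24 * i,
              MARGIN - PRICE_PER_TON * 200 * i]
    roots = [r.real for r in np.roots(coeffs) if abs(r.imag) < 1e-9]
    return min(max(roots), capacity)

for i in (1, 2):
    y = optimal_production(i)
    profit = MARGIN * y - PRICE_PER_TON * s(i, y)
    print(f"R{i}: {y:.1f}m litres/day, net profit ${profit / 1e6:.2f}m/day")
# prints approximately 62.7 / $9.58m for R_1 and 50.7 / $7.77m for R_2
```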

Market Pricing for Capped Pollution

Now, instead of setting the price of greenhouse gas emissions at $190 per cubic ton, which requires a lot of economic modelling and assumptions, what if we just want to cap the total emissions of the two refineries at, say, 15 thousand cubic tons per day and let the market decide the price of greenhouse gas emission? This can be done via sequential Second Price auctions of 15,000 pollution permits every day, auctioned in, say, tranches of 3,000, each permit entitling the owner to emit 1 cubic ton of greenhouse gas. Denoting the auction winner by ii^{*}, the change in each refinery’s net profit after tranche tt is:

Δnt,i(Δyt,i)={ρt,iat,ii=i0otherwise\Delta n_{t,i}\left(\Delta y_{t,i}\right)=\begin{cases}{\rho}_{t,i^{*}}-a_{t,-i^{*}}&i=i^{*}\\ 0&\text{otherwise}\end{cases}

where ρt,i=$200kΔyt,i=$200k(si1(gt,i+3000)si1(gt,i)){\rho}_{t,i}=\$200k\cdot\Delta y_{t,i}=\$200k\cdot\left({s_{i}}^{-1}\left(g_{t,i}+3000\right)-{s_{i}}^{-1}\left(g_{t,i}\right)\right), gt,ig_{t,i} is RiR_{i}’s permit holdings before the tranche, and at,ia_{t,i} is RiR_{i}’s bid for tranche tt.

Greedy Algorithm

Let’s work through the example shown in Table 1, where each refinery pursues a greedy bid strategy. That is, for tranche tt refinery RiR_{i} always bids at,i=ρt,ia_{t,i}=\rho_{t,i}. In the first tranche of 3000 permits, Refinery R1R_{1} works out that it can produce Δy1,1=42\Delta y_{1,1}=42 million litres of fuel with 3000 permits, since s11(3000)42{s_{1}}^{-1}(3000)\approx 42, and it is willing to pay a max of ρ1,1=$200k×42=$8.4\rho_{1,1}=\$200k\times 42=\$8.4m to win those 3000 permits. Refinery R2R_{2} similarly works out that it can produce Δy1,2=s21(3000)31\Delta y_{1,2}={s_{2}}^{-1}(3000)\approx 31 million litres of fuel and bids ρ1,2=$6.1\rho_{1,2}=\$6.1m. The VCG mechanism picks R1R_{1} as the winner, and R1R_{1} pays $6.1m to produce 42 million litres of fuel, for a profit of $2.3m. In the second tranche of 3000 permits, Refinery R2R_{2} submits the same bid, but Refinery R1R_{1} submits a much lower bid of ρ2,1=$1.6\rho_{2,1}=\$1.6m, since it can only produce an extra Δy2,1=s11(6000)s11(3000)5042=8\Delta y_{2,1}={s_{1}}^{-1}(6000)-{s_{1}}^{-1}(3000)\approx 50-42=8 million litres of fuel with the additional 3000 permits. Thus, R2R_{2} wins the auction and pays only $1.6m for the second tranche of 3000 permits, which it uses to produce 31 million litres of fuel for a profit of $4.5 million. The following tranches proceed in a similar way. In the end, we have

  • Refinery R1R_{1} winning 9000 permits for a total cost of $8.0m, and using them to produce 55 million litres of fuel for a total net profit of $3.1 million;

  • Refinery R2R_{2} winning 6000 permits for a total cost of $3.2m, and using them to produce 42 million litres of fuel for a total net profit of $5.3 million;

  • A total of $11.2 million was collected for the 15,000 permits.

The result is interesting in that the more efficient Refinery R1R_{1} ends up winning more permits as expected, which it uses to produce more fuel but for a lower total profit compared to Refinery R2R_{2}.

Tranche R1R_{1} Bid R2R_{2} Bid Payment Result
3000 42m / $8.4m 31m / $6.1m $6.1m R1R_{1} / $2.3m
3000 8m / $1.6m 31m / $6.1m $1.6m R2R_{2} / $4.5m
3000 8m / $1.6m 12m / $2.4m $1.6m R2R_{2} / $0.8m
3000 8m / $1.6m 4m / $0.9m $0.9m R1R_{1} / $0.7m
3000 5m / $1.0m 4m / $0.9m $0.9m R1R_{1} / $0.1m
Table 1: The auction results assuming each refinery pursues a greedy strategy.
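The greedy play in Table 1 can be reproduced, up to rounding and tie-breaking, with a short simulation (a sketch under the assumptions of this subsection; it is not the code used for the experiments below):

```python
import numpy as np

def s_inv(i, g, capacity=100.0):
    # largest real non-negative root of s_i(y) = g, capped at capacity (see above)
    roots = np.roots([i / 5, -12 * i, 200 * i, 888 * i - g])
    real = [r.real for r in roots if abs(r.imag) < 1e-9 and r.real >= 0]
    return min(max(real, default=0.0), capacity)

def run_greedy_auction(tranches=5, tranche_size=3000, margin=200_000.0):
    """Sequential second-price auctions of permit tranches with greedy bidding
    a_{t,i} = rho_{t,i}, as in Table 1 (two bidders)."""
    permits = {1: 0, 2: 0}                      # permits currently held by R_1 and R_2
    paid = {1: 0.0, 2: 0.0}
    for t in range(tranches):
        # each refinery's value for the extra tranche given its current holdings
        bids = {i: margin * (s_inv(i, permits[i] + tranche_size) - s_inv(i, permits[i]))
                for i in permits}
        winner = max(bids, key=bids.get)
        price = min(bids.values())              # second-price payment (two bidders)
        permits[winner] += tranche_size
        paid[winner] += price
        print(f"tranche {t + 1}: R{winner} wins, pays ${price / 1e6:.1f}m, "
              f"surplus ${(bids[winner] - price) / 1e6:.1f}m")
    for i in permits:
        output = s_inv(i, permits[i])
        print(f"R{i}: {permits[i]} permits, {output:.0f}m litres, "
              f"net profit ${(margin * output - paid[i]) / 1e6:.1f}m")

run_greedy_auction()
```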
Reinforcement Learning

Each refinery can do better by using reinforcement learning to optimise its sequence of bids, using the approaches described in § 5, and possibly engaging in collusion / cooperation. Figs 4, 5 and 6 show the results of running two Q-Learning agents, representing the two refineries, on the cap-and-trade problem. Each agent’s state is the 2-tuple (number of permits won by R1R_{1}, number of permits won by R2R_{2}). The action space is a set of 170 bids, 𝔸={0, 50k, 100k, 8.4m}\mathbb{A}~{}=~{}\{0,\;50\text{k},\;100\text{k},\;\ldots\;8.4\text{m}\}. Agents are allowed to bid any amount in 𝔸\mathbb{A}, even if that amount is higher than ρt,i{\rho}_{t,i}. Any ties arising in the VCG mechanism due to agents bidding the same amount are broken randomly. Each experiment uses the same random seed and RL parameters. The only difference across the experiments is the reward function.

In Fig 4 we incentivise the two agents to maximise their individual net profits, by rewarding only the agent ii^{*} that won the auction:

rt,i(1)=Δnt,i={ρt,iat,ii=i0otherwise{r_{t,i}}^{(1)}=\Delta n_{t,i}=\begin{cases}{\rho}_{t,i^{*}}-a_{t,-i^{*}}&i=i^{*}\\ 0&\text{otherwise}\end{cases}

The agents learn a joint policy that results in them each making a total net profit of around $9m over the course of the five tranches, much higher than the net profits obtained when following the greedy strategy earlier, of $3.1\$3.1m and $5.3\$5.3m for R1R_{1} and R2R_{2} respectively. The average permit price stabilises at $100\$100.

In Fig 5 we incentivise the agents to maximise their shared net profit:

rt,i(2)=ρt,iat,i{r_{t,i}}^{(2)}={\rho}_{t,i^{*}}-a_{t,-i^{*}}

Here the agents learn a joint policy that always results in a return for each agent (and a total shared net profit) of $19.49m over the five tranches and, importantly, drives the average permit price down to zero. Note the agents have learned this collusive policy solely by observing the permit holdings of each participant in the auction – there was never any explicit communication between them. This phenomenon is analysed more carefully in Fig 7 in Appendix B.

Figure 4: RL with reward function r(1)r^{(1)} (incentivise individual profit)
Figure 5: RL with reward function r(2)r^{(2)} (incentivise joint profit)
Figure 6: RL with reward function r(3)r^{(3)} (zero-sum reward)

In Fig 6 we incentivise the agents to maximise their individual net profit and minimise the other agent’s net profit:

rt,i(3)={ρt,iat,ii=i(ρt,iat,i)otherwise{r_{t,i}}^{(3)}=\begin{cases}{\rho}_{t,i^{*}}-a_{t,-i^{*}}&i=i^{*}\\ -\left({\rho}_{t,i^{*}}-a_{t,-i^{*}}\right)&\text{otherwise}\end{cases}

As expected, in this case, competition between the agents keeps the average permit price well above zero; it eventually settles into an oscillatory pattern around the $550 mark. This phenomenon is analysed in more detail in Fig 8 in Appendix B.

6.3 Automated Penetration Testing

There is growing interest among cyber security researchers in investigating the use of reinforcement learning (RL) algorithms, both model-based and model-free, to perform automated penetration testing of networks and cyber-physical systems [52, 114, 63, 113, 123, 140]. The key technical challenge in these problems usually lies in handling the large, discrete observation and action spaces. The former is typically addressed using vulnerability scanners like nmap in conjunction with knowledge bases like the Common Vulnerabilities and Exposures (CVE) database, and the latter is typically tackled by building the RL algorithms on top of mature penetration testing tool suites like Metasploit [76] and logic-based planning methods like MulVAL for attack-tree generation [102].

In this section, we look at how to formulate the automated penetration testing problem in the case where multiple RL agents cooperate to find, within a given time, as many vulnerabilities as possible on a system being tested, under the constraint that the agents’ actions must not be too disruptive to each other or to the system being tested.

For each of the agents, the observation space is whatever can be returned from their scanning software. We will assume here that it is a JSON object, possibly with embedded graphics in standard formats, that can be formally represented and reasoned with in higher-order logic [87, 88], which provides us with tools for

  • defining and systematically enumerating possible feature functions, and

  • performing logical and probabilistic inference, including Bayesian filtering [25], with respect to a knowledge base (like CVE).

The action space is made up of two types of (parametrisable) action templates:

  • run a scanning software with certain parameter values;

  • run an existing exploit script in the Metasploit library with certain parameter values.

Instantiating the parameters in the action templates can be done through search algorithms, possibly within a formal logic setup if we can define commonly applicable rules and use unification techniques to bind the free variables in a proof procedure [102, 54].

How about the reward function? We assume each action aa has a computational cost C(a)C(a), which quantifies the amount of compute resources – possibly a weighted combination of CPU, memory and disk usage – required to execute the action on the system being tested. In practice, the agents have an estimate C^(a)\hat{C}(a) obtained from historical data. This is motivated by the large literature on cost models for database queries and software tests [71, 136, 116]. Unsuccessful exploits are given reward 0, as long as they do not cause severe damage to the system being tested. Successful exploits should be given positive rewards, but obviously the severity of the system compromise should be taken into account here, with compromises like root shell access and remote code execution given higher rewards than simpler compromises like denial of service. For practical reasons, we will not be considering the full range of cyber harms as described in [3], but will instead look at a few proxy measures of the impact of the penetration-testing agent’s actions on the system being tested. There are a few mature scoring systems related to security standards, including

  • the Common Vulnerability Scoring System (CVSS), used in the Common Vulnerabilities and Exposures (CVE) program. A CVSS score for a vulnerability is computed in the range 0.0 – 10.0, using three groups of metrics: base, temporal, and environmental. Base metrics reflect a vulnerability’s exploitability, scope, and potential impacts of exploitation. Temporal metrics reflect the current state of exploit techniques or code availability, and the existence of any patches or workarounds. Environmental metrics enable the CVSS score to be customised.

  • the Common Weakness Scoring System (CWSS), used in the Common Weakness Enumeration (CWE) program. A CWSS score for a software weakness is computed in the range 0.0 – 100.0. In CWSS, three groups of metrics are defined: base finding, attack surface, and environmental. Base finding metrics are intended to capture the inherent risk of the weakness, confidence in the accuracy of the finding, and strength of controls. Attack Surface metrics assess the barriers that an attacker must overcome in order to exploit the weakness. Environmental metrics reflect characteristics of the weakness that are specific to a particular environment or operational context.

A suitable reward for a successful exploit aa can thus be defined as a (weighted) average of the CVSS and CWSS scores associated with the exploit. In addition, given the goal is to find all vulnerabilities (to the extent possible), it would also make sense to introduce exploration bonus rewards [120, 11, 101, 134].
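For illustration only, such a reward could be computed as follows, where the 50/50 weighting, the rescaling of CWSS to the CVSS range, and the count-based exploration bonus are arbitrary choices of ours rather than prescriptions of the standards:

```python
import math
from collections import defaultdict

visit_counts = defaultdict(int)     # how often each (host, vulnerability) pair has been hit

def exploit_reward(success, cvss, cwss, target, w_cvss=0.5, bonus_scale=1.0):
    """Reward for one exploit attempt; cvss is in [0, 10] and cwss in [0, 100]."""
    if not success:
        return 0.0                                            # unsuccessful, non-damaging exploit
    severity = w_cvss * cvss + (1 - w_cvss) * (cwss / 10.0)   # both rescaled to a 0-10 range
    visit_counts[target] += 1
    bonus = bonus_scale / math.sqrt(visit_counts[target])     # count-based exploration bonus
    return severity + bonus
```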

The following is the coordination protocol for the pen-testing agents, motivated by the pollution permits idea from § 6.2. At every time step tt, a certain number UtU_{t} of compute permits is auctioned. (The compute permits can only be used during the time step and cannot be carried forward to future time steps.) Each pen-testing agent ii has a current valuation function Vi,tV_{i,t} and submits the bid

maxaVi,t(a)subject toC^(a)Ut\max_{a}V_{i,t}(a)\;\;\;\text{subject to}\;\;\;\hat{C}(a)\leq U_{t}

to the VCG mechanism. The winning agent ii^{*} then executes its action aa^{*} and receives a percept that can be used to update its valuation function. The actual compute resource consumption C(a)C(a^{*}) is used to update C^()\hat{C}(\cdot). If UtC(a)>0U_{t}-C(a^{*})>0, then another round of the VCG auction is run to pick the next agent to perform its action, and this is repeated until UtU_{t} is exhausted.
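A sketch of this coordination loop is given below; the agent interface (best_action, value_of, execute, and so on) is hypothetical and simply names the quantities used in the protocol above:

```python
def run_timestep(agents, U_t):
    """One time step of the compute-permit protocol: keep auctioning the
    remaining budget until it is exhausted or no agent can afford an action.
    Assumes every executed action consumes a strictly positive amount of compute."""
    remaining = U_t
    while remaining > 0:
        # each agent proposes its best affordable action and its declared value for it
        bids = {}
        for agent in agents:
            action = agent.best_action(budget=remaining)      # argmax of V_{i,t} s.t. C_hat(a) <= remaining
            if action is not None:
                bids[agent] = (action, agent.value_of(action))
        if not bids:
            break
        winner = max(bids, key=lambda ag: bids[ag][1])        # highest declared value wins
        payment = max((v for ag, (_, v) in bids.items() if ag is not winner), default=0.0)
        action, _ = bids[winner]
        percept, actual_cost = winner.execute(action)         # run the scan / exploit
        winner.update_valuation(action, percept)              # RL update from the new percept
        winner.update_cost_model(action, actual_cost)         # refine C_hat(.)
        winner.charge(payment)                                # second-price style payment
        remaining -= actual_cost
```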

Experimental results for an implementation of a VCG-coordinated system of automated penetration testing agents based on Metasploit will be reported in a separate paper.

7 Discussion and Conclusion

In the spirit of [104], we considered in this paper the problem of social harms that can result from the interactions between a set of (powerful) general reinforcement learning agents in arbitrary unknown environments and proposed the use of (dynamic) VCG mechanisms to coordinate and control their collective behaviour. Our proposed setup is more general than existing formulations of multi-agent reinforcement learning with mechanism design in two ways:

  • the underlying environment is a history-based general reinforcement learning environment like in AIXI [69];

  • the reinforcement-learning agents participating in the environment can have different time horizons and adopt different strategies and algorithms, and learning can happen both at the mechanism level as well as the level of individual agents.

The generality of the setup opens up a wide range of algorithms and applications, including multi-agent problems with both cooperative and competitive dynamics, some of which we explore in § 5 and § 6. The setup closest to ours in generality in the literature is [70], with interesting applications like [143].

A key limitation of our proposed approach, especially when it comes to regulating powerful AGI agents, is that there is no fundamental way to enforce the VCG mechanism on such agents outside of purposely designed platform economies [77, 33, 41]. In particular, the proposed approach would not work on AGI agents operating "in the wild". Nevertheless, the study of the ideal market mechanism to regulate AGI agents is a useful step for understanding and benchmarking the design of more practical mechanisms like [75] that recommend but do not enforce social choices, and more indirect payment approaches like [141] and [80] to steer a set of agents to desired good behaviour. It would also be interesting, as future work, to understand how natural biological mechanisms like [24, 109] relate to ideal market mechanisms.

References

  • [1] David Abel, Dilip Arumugam, Lucas Lehnert, and Michael L. Littman. State abstractions for lifelong reinforcement learning. In Jennifer G. Dy and Andreas Krause, editors, ICML, pages 10–19, 2018.
  • [2] US Environmental Protection Agency. Report on the Social Cost of Greenhouse Gases: Estimates Incorporating Recent Scientific Advances. http://epa.gov, 2023.
  • [3] Ioannis Agrafiotis, Jason Nurse, Michael Goldsmith, Sadie Creese, and David Upton. A taxonomy of cyber-harms: Defining the impacts of cyber-attacks and understanding how they propagate. Journal of Cybersecurity, 4, 2018.
  • [4] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
  • [5] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38, 2017.
  • [6] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397–422, 2003.
  • [7] Robert J Aumann. Correlated equilibrium as an expression of Bayesian rationality. Econometrica: Journal of the Econometric Society, pages 1–18, 1987.
  • [8] Robert Axelrod. The emergence of cooperation among egoists. American Political Science Review, 75(2):306–318, 1981.
  • [9] Albert-László Barabási. The Formula: The Universal Laws of Success. Hachette UK, 2018.
  • [10] Jorge Barrera and Alfredo Garcia. Dynamic incentives for congestion control. IEEE Transactions on Automatic Control, 60(2):299–310, 2014.
  • [11] Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. In NeurIPS, pages 1471–1479, 2016.
  • [12] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, 2003.
  • [13] Franco Berbeglia and Pascal Van Hentenryck. Taming the Matthew effect in online markets with social influence. In AAAI, pages 10–16, 2017.
  • [14] Dirk Bergemann and Juuso Välimäki. Dynamic mechanism design: An introduction. Journal of Economic Literature, 57(2):235–274, 2019.
  • [15] Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8(6), 2007.
  • [16] Avrim Blum and Yishay Mansour. Learning, regret minimization, and equilibria. In Algorithmic Game Theory. CUP, 2007.
  • [17] Tilman Börgers. An Introduction to the Theory of Mechanism Design. Oxford University Press, 2015.
  • [18] Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.
  • [19] Nick Bostrom. Ethical issues in advanced artificial intelligence. Machine Ethics and Robot Ethics, pages 69–75, 2020.
  • [20] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
  • [21] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637, 2018.
  • [22] Ruggiero Cavallo, David C Parkes, and Satinder Singh. Optimal coordinated planning amongst self-interested agents with private state. In UAI, pages 55–62, 2006.
  • [23] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • [24] Krishnendu Chatterjee, Johannes G Reiter, and Martin A Nowak. Evolutionary dynamics of biological auctions. Theoretical Population Biology, 81(1):69–80, 2012.
  • [25] Dawei Chen, Samuel Yang-Zhao, John Lloyd, and Kee Siong Ng. Factored conditional filtering: Tracking states and estimating parameters in high-dimensional spaces. arXiv preprint arXiv:2206.02178, 2022.
  • [26] M Keith Chen and Michael Sheldon. Dynamic pricing in a labor market: Surge pricing and flexible work on the Uber platform. Ec, 16:455, 2016.
  • [27] Xi Chen and Binghui Peng. Hedging in games: Faster convergence of external and swap regrets. Advances in Neural Information Processing Systems, 33:18990–18999, 2020.
  • [28] Brian Christian. The alignment problem: How can machines learn human values? Atlantic Books, 2021.
  • [29] Alonzo Church. A formulation of the simple theory of types. Journal of Symbolic Logic, 5:56–68, 1940.
  • [30] Robert Coase. The nature of the firm. Economica, 4(16):386–405, 1937.
  • [31] Robert Coase. The problem of social cost. Journal of Law and Economics, 3(1):1–44, 1960.
  • [32] Robert Coase. The Firm, The Market, and The Law. UCP, 2012.
  • [33] Julie E Cohen. Law for the platform economy. UCDL Rev., 51:133, 2017.
  • [34] Vincent Conitzer, Rachel Freedman, Jobst Heitzig, Wesley H. Holliday, Bob M. Jacobs, Nathan Lambert, Milan Mossé, Eric Pacuit, Stuart Russell, Hailey Schoelkopf, Emanuel Tewolde, and William S. Zwicker. Social choice for AI alignment: Dealing with diverse human feedback. CoRR, abs/2404.10271, 2024.
  • [35] Richard Cornes and Todd Sandler. The Theory of Externalities, Public Goods, and Club Goods. Cambridge University Press, 1996.
  • [36] Luc De Raedt. Logical and relational learning. Springer Science & Business Media, 2008.
  • [37] Luis Fariñas del Cerro and Olivier Gasquet. A general framework for pattern-driven modal tableaux. Logic Journal of the IGPL, 10(1):51–83, 2002.
  • [38] Kurt Driessens. Relational reinforcement learning. In Claude Sammut and Geoffrey I. Webb, editors, Encyclopedia of Machine Learning and Data Mining, pages 1096–1103. Springer, 2017.
  • [39] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.
  • [40] Saso Dzeroski, Luc De Raedt, and Kurt Driessens. Relational reinforcement learning. Mach. Learn., 43(1/2):7–52, 2001.
  • [41] David S Evans. Matchmakers: the new economics of multisided platforms. Harvard Business Review Press, 2016.
  • [42] Tom Everitt, Marcus Hutter, Ramana Kumar, and Victoria Krakovna. Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective. Synthese, 198(Suppl 27):6435–6467, 2021.
  • [43] Tom Everitt, Victoria Krakovna, Laurent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. In IJCAI, pages 4705–4713, 2017.
  • [44] Tom Everitt, Gary Lea, and Marcus Hutter. AGI safety literature review. In Jérôme Lang, editor, IJCAI, pages 5441–5449. ijcai.org, 2018.
  • [45] W.M. Farmer. The seven virtues of simple type theory. Journal of Applied Logic, 6(3):267–286, 2008.
  • [46] Jean-Pierre Favennec. Economics of oil refining. In The Palgrave Handbook of International Energy Economics, pages 59–74. Springer, 2022.
  • [47] Melvin Fitting and Richard L. Mendelsohn. First-order Modal Logic. Kluwer Academic Publishers, 1998.
  • [48] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
  • [49] Drew Fudenberg and David K Levine. The theory of learning in games, volume 2. MIT press, 1998.
  • [50] Iason Gabriel. Artificial intelligence, values, and alignment. Minds and Machines, 30(3):411–437, 2020.
  • [51] L. Getoor and B. Taskar, editors. Introduction to Statistical Relational Learning. MIT Press, 2007.
  • [52] Mohamed C. Ghanem and Thomas M. Chen. Reinforcement learning for intelligent penetration testing. In Second World Conference on Smart Trends in Systems, Security and Sustainability, pages 185–192, 2018.
  • [53] Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar, et al. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359–483, 2015.
  • [54] Martin Giese. Incremental closure of free variable tableaux. In IJCAR, volume 2083 of LNCS, pages 545–560. Springer, 2001.
  • [55] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
  • [56] Peter Grünwald and Thijs Van Ommen. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12:1069–1103, 2017.
  • [57] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The off-switch game. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [58] Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. Advances in Neural Information Processing Systems, 29, 2016.
  • [59] Garrett Hardin. The tragedy of the commons. Science, 162:1243–1248, 1968.
  • [60] Leon Henkin. Completeness in the theory of types. Journal of Symbolic Logic, 15(2):81–91, 1950.
  • [61] Mark Herbster and Manfred K Warmuth. Tracking the best expert. Machine learning, 32(2):151–178, 1998.
  • [62] Dirk Hoeven, Tim van Erven, and Wojciech Kotłowski. The many faces of exponential weights in online learning. In Conference On Learning Theory, pages 2067–2092. PMLR, 2018.
  • [63] Zhenguo Hu, Razvan Beuran, and Yasuo Tan. Automated penetration testing using deep reinforcement learning. In European Symposium on Security and Privacy Workshops, pages 2–10. IEEE, 2020.
  • [64] Zhiyi Huang and Sampath Kannan. The exponential mechanism for social welfare: Private, truthful, and nearly optimal. In 53rd Annual Symposium on Foundations of Computer Science, pages 140–149. IEEE, 2012.
  • [65] John Hughes. Why functional programming matters. The Computer Journal, 32(2):98–107, 1989.
  • [66] Joon Suk Huh and Kirthevasan Kandasamy. Nash incentive-compatible online mechanism learning via weakly differentially private online learning. arXiv preprint arXiv:2407.04898, 2024.
  • [67] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.
  • [68] Marcus Hutter. Feature reinforcement learning: Part I. Unstructured MDPs. In J. Artif. Gen. Intell., 2009.
  • [69] Marcus Hutter, David Quarel, and Elliot Catt. An Introduction to Universal Artificial Intelligence. CRC Press, 2024.
  • [70] Dima Ivanov, Paul Dütting, Inbal Talgam-Cohen, Tonghan Wang, and David C Parkes. Principal-agent reinforcement learning: Orchestrating AI agents with contracts. arXiv preprint arXiv:2407.18074, 2024.
  • [71] Matthias Jarke and Jurgen Koch. Query optimization in database systems. ACM Computing surveys (CsUR), 16(2):111–152, 1984.
  • [72] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020.
  • [73] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
  • [74] Kirthevasan Kandasamy, Joseph E. Gonzalez, Michael I. Jordan, and Ion Stoica. VCG mechanism design with unknown agent values under stochastic bandit feedback. J. Mach. Learn. Res., 24:53:1–53:45, 2023.
  • [75] Michael Kearns, Mallesh Pai, Aaron Roth, and Jonathan Ullman. Mechanism design in large games: Incentives and privacy. In Innovations in Theoretical Computer Science, pages 403–410, 2014.
  • [76] David Kennedy, Jim O’gorman, Devon Kearns, and Mati Aharoni. Metasploit: The Penetration Tester’s Guide. No Starch Press, 2011.
  • [77] Martin Kenney and John Zysman. The rise of the platform economy. Issues in Science and Technology, 32(3):61, 2016.
  • [78] Kristian Kersting, Martijn van Otterlo, and Luc De Raedt. Bellman goes relational. In ICML, volume 69. ACM, 2004.
  • [79] Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, editors, ECML 2006, pages 282–293. Springer Berlin Heidelberg, 2006.
  • [80] Yoav Kolumbus, Joe Halpern, and Éva Tardos. Paying to do better: Games with payments between learning agents. arXiv:2405.20880, 2024.
  • [81] Wouter M Koolen and Steven de Rooij. Universal codes from switching strategies. IEEE Transactions on Information Theory, 59(11):7168–7185, 2013.
  • [82] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.
  • [83] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
  • [84] Lihong Li, Thomas J. Walsh, and Michael L. Littman. Towards a unified theory of state abstraction for MDPs. In International Symposium on Artificial Intelligence and Mathematics, 2006.
  • [85] Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer, 1997.
  • [86] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning, pages 157–163. Elsevier, 1994.
  • [87] John W. Lloyd. Logic for Learning: Learning Comprehensible Theories from Structured Data. Springer, 2003.
  • [88] John W. Lloyd and Kee Siong Ng. Declarative programming for agent applications. Autonomous Agents Multi Agent Systems, 23(2):224–272, 2011.
  • [89] Boxiang Lyu, Zhaoran Wang, Mladen Kolar, and Zhuoran Yang. Pessimism meets VCG: Learning dynamic mechanism design via offline reinforcement learning. In ICML, pages 14601–14638, 2022.
  • [90] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science, pages 94–103. IEEE, 2007.
  • [91] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [92] Luke Muehlhauser and Bill Hibbard. Exploratory engineering in artificial intelligence. Communications of the ACM, 57(9):32–34, 2014.
  • [93] Phuong Minh Nguyen, Peter Sunehag, and Marcus Hutter. Feature reinforcement learning in practice. In Recent Advances in Reinforcement Learning, volume 7188 of LNCS, pages 66–77. Springer, 2011.
  • [94] Noam Nisan. Introduction to mechanism design (for computer scientist). In Algorithmic Game Theory. Cambridge University Press, 2007.
  • [95] Noam Nisan, Tim Roughgarden, Éva Tardos, and Vijay V. Vazirani, editors. Algorithmic Game Theory. Cambridge University Press, 2007.
  • [96] Martin Nowak and Karl Sigmund. A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner’s dilemma game. Nature, 364(6432):56–58, 1993.
  • [97] Martin A Nowak and Karl Sigmund. Tit for tat in heterogeneous populations. Nature, 355(6357):250–253, 1992.
  • [98] Laurent Orseau, Tor Lattimore, and Shane Legg. Soft-bayes: Prod for mixtures of experts with log-loss. In International Conference on Algorithmic Learning Theory, pages 372–399. PMLR, 2017.
  • [99] Martin J Osborne. An Introduction to Game Theory. Oxford university press, 2004.
  • [100] Elinor Ostrom. Governing the Commons: The Evolution of Institutions for Collective Action. Cambridge University Press, 1990.
  • [101] Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In ICML, volume 70, pages 2721–2730, 2017.
  • [102] Xinming Ou, Sudhakar Govindavajhala, and Andrew W Appel. MulVAL: A logic-based network security analyzer. In USENIX security symposium, volume 8, pages 113–128, 2005.
  • [103] David C. Parkes and Satinder Singh. An MDP-based approach to online mechanism design. In NIPS, pages 791–798. MIT Press, 2003.
  • [104] David C Parkes and Michael P Wellman. Economic reasoning and artificial intelligence. Science, 349(6245):267–272, 2015.
  • [105] Arthur Pigou. Wealth and Welfare. Macmillan, 1912.
  • [106] Arthur Pigou. The Economics of Welfare. Routledge, 2002.
  • [107] Shuang Qiu, Boxiang Lyu, Qinglin Meng, Zhaoran Wang, Zhuoran Yang, and Michael I Jordan. Learning dynamic mechanisms in unknown environments: A reinforcement learning approach. arXiv preprint arXiv:2202.12797, 2022.

Appendix A Notes on the Agent Valuation Function

We will start by unrolling (14)-(15) for $m=3$ to see the general pattern. Given the empty history $h_0=\epsilon$, the utility for agent $i$ at time step 1 is

\begin{align*}
u_{1,i}(\epsilon) &= v_{1,i}(\epsilon,\mathbf{a^*_1}(\epsilon)) - p_i(\epsilon)\\
&= \sum_{\mathbf{or_1}}\phi(\mathbf{or_1}\,|\,\mathbf{a^*_1}(\epsilon))\bigl[r_{1,i}+u_{2,i}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1})\bigr] - p_i(\epsilon)\\
&= \sum_{\mathbf{or_1}}\phi(\mathbf{or_1}\,|\,\mathbf{a^*_1}(\epsilon))\bigl[r_{1,i}+v_{2,i}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1},\mathbf{a^*_2}(\mathbf{or_1}))-p_i(\mathbf{or_1})\bigr] - p_i(\epsilon)\\
&= \sum_{\mathbf{or_1}}\phi(\mathbf{or_1}\,|\,\mathbf{a^*_1}(\epsilon))\bigl[-p_i(\mathbf{or_1})-p_i(\epsilon)+r_{1,i}+v_{2,i}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1},\mathbf{a^*_2}(\mathbf{or_1}))\bigr]\\
&= \sum_{\mathbf{or_1}}\phi(\mathbf{or_1}\,|\,\mathbf{a^*_1}(\epsilon))\Bigl[-p_i(\mathbf{or_1})-p_i(\epsilon)+r_{1,i}\\
&\qquad\qquad +\sum_{\mathbf{or_2}}\phi(\mathbf{or_2}\,|\,\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1}))\bigl[r_{2,i}+u_{3,i}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1})\mathbf{or_2})\bigr]\Bigr]\\
&= \sum_{\mathbf{or_1}}\phi(\mathbf{or_1}\,|\,\mathbf{a^*_1}(\epsilon))\sum_{\mathbf{or_2}}\phi(\mathbf{or_2}\,|\,\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1}))\\
&\qquad\qquad \bigl[-p_i(\mathbf{or_1})-p_i(\epsilon)+r_{1,i}+r_{2,i}+u_{3,i}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1})\mathbf{or_2})\bigr]\\
&= \sum_{\mathbf{or_{1:2}}}\phi(\mathbf{or_{1:2}}\,|\,\mathbf{a^*_1}(\epsilon),\mathbf{a^*_2}(\mathbf{or_1}))\bigl[-p_i(\mathbf{or_1})-p_i(\epsilon)+r_{1,i}+r_{2,i}+u_{3,i}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1})\mathbf{or_2})\bigr]\\
&= \sum_{\mathbf{or_{1:2}}}\phi(\mathbf{or_{1:2}}\,|\,\mathbf{a^*_1}(\epsilon),\mathbf{a^*_2}(\mathbf{or_1}))\bigl[-p_i(\mathbf{or_{1:2}})-p_i(\mathbf{or_1})-p_i(\epsilon)\\
&\qquad\qquad +r_{1,i}+r_{2,i}+v_{3,i}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1})\mathbf{or_2},\mathbf{a^*_3}(\mathbf{or_{1:2}}))\bigr]\\
&= \sum_{\mathbf{or_{1:2}}}\phi(\mathbf{or_{1:2}}\,|\,\mathbf{a^*_1}(\epsilon),\mathbf{a^*_2}(\mathbf{or_1}))\Bigl[r_{1,i}+r_{2,i}+\sum_{\mathbf{or_3}}\phi(\mathbf{or_3}\,|\,\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1})\mathbf{or_2}\mathbf{a^*_3}(\mathbf{or_{1:2}}))\cdot r_{3,i}\Bigr]\\
&\qquad -\sum_{\mathbf{or_{1:2}}}\phi(\mathbf{or_{1:2}}\,|\,\mathbf{a^*_1}(\epsilon),\mathbf{a^*_2}(\mathbf{or_1}))\bigl[p_i(\mathbf{or_{1:2}})+p_i(\mathbf{or_1})+p_i(\epsilon)\bigr]\\
&= \sum_{\mathbf{or_{1:3}}}\phi(\mathbf{or_{1:3}}\,|\,\mathbf{a^*_1}(\epsilon),\mathbf{a^*_2}(\mathbf{or_1}),\mathbf{a^*_3}(\mathbf{or_{1:2}}))\bigl[r_{1,i}+r_{2,i}+r_{3,i}\bigr]\\
&\qquad -\sum_{\mathbf{or_{1:2}}}\phi(\mathbf{or_{1:2}}\,|\,\mathbf{a^*_1}(\epsilon),\mathbf{a^*_2}(\mathbf{or_1}))\bigl[p_i(\mathbf{or_{1:2}})+p_i(\mathbf{or_1})+p_i(\epsilon)\bigr],
\end{align*}

where we denote $\mathbf{v_t}(\mathbf{h_{t-1}})=(v_{t,1}(\mathbf{h_{t-1}},\cdot),\ldots,v_{t,k}(\mathbf{h_{t-1}},\cdot))$ and

\begin{align*}
\mathbf{a^*_1}(\epsilon) &= f(\mathbf{v_1}(\epsilon)) & p_i(\epsilon) &= p_i(\mathbf{v_1}(\epsilon))\\
\mathbf{a^*_2}(\mathbf{or_1}) &= f(\mathbf{v_2}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1})) & p_i(\mathbf{or_1}) &= p_i(\mathbf{v_2}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1}))\\
\mathbf{a^*_3}(\mathbf{or_{1:2}}) &= f(\mathbf{v_3}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1})\mathbf{or_2})) & p_i(\mathbf{or_{1:2}}) &= p_i(\mathbf{v_3}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1})\mathbf{or_2})).
\end{align*}

For general $m$, one can use an induction argument to show

\begin{align*}
u_{1,i}(\epsilon) &= \sum_{\mathbf{or_{1:m}}}\phi(\mathbf{or_{1:m}}\,|\,\mathbf{a^*_1}(\epsilon),\mathbf{a^*_2}(\mathbf{or_1}),\ldots,\mathbf{a^*_m}(\mathbf{or_{1:(m-1)}}))\Bigl[\sum_{t=1}^{m}r_{t,i}\Bigr]\\
&\qquad -\sum_{\mathbf{or_{1:(m-1)}}}\phi(\mathbf{or_{1:(m-1)}}\,|\,\mathbf{a^*_1}(\epsilon),\ldots,\mathbf{a^*_{m-1}}(\mathbf{or_{1:(m-2)}}))\Bigl[\sum_{t=0}^{m-1}p_i(\mathbf{or_{1:t}})\Bigr].
\end{align*}

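To make the recursion concrete, here is a minimal Python sketch of the mutual recursion between the step utilities $u_{t,i}$ and valuations $v_{t,i}$ that (14)-(15) define and the derivation above unrolls. The environment model phi, the allocation rule f, the payment rule payment, the joint action and percept spaces, and the horizon are all toy placeholder assumptions, chosen only so the recursion runs end to end; they are not the paper's implementation.

from itertools import product

K = 2                                            # number of agents
M = 3                                            # common horizon (m = 3, as unrolled above)
JOINT_ACTIONS = list(product([0, 1], repeat=K))  # a joint action a_t
JOINT_PERCEPTS = [(0.0, 1.0), (1.0, 0.0)]        # a joint percept or_t, here just the reward vector

def phi(percept, history, joint_action):
    # Toy history-based environment model: uniform over the joint percepts.
    return 1.0 / len(JOINT_PERCEPTS)

def f(valuations):
    # Allocation rule: pick the joint action maximising the reported total value.
    return max(JOINT_ACTIONS, key=lambda a: sum(v(a) for v in valuations))

def payment(i, valuations):
    # Placeholder payment rule p_i; a VCG-style payment would be computed here.
    return 0.1

def v(t, i, history, joint_action):
    # Valuation of agent i for joint_action at step t: expected reward plus
    # continuation utility, in the shape of the unrolled sums above.
    if t > M:
        return 0.0
    total = 0.0
    for percept in JOINT_PERCEPTS:
        prob = phi(percept, history, joint_action)
        total += prob * (percept[i] + u(t + 1, i, history + (joint_action, percept)))
    return total

def u(t, i, history):
    # Utility of agent i at step t: valuation of the allocated joint action
    # minus agent i's payment, both computed from the reported valuations.
    if t > M:
        return 0.0
    valuations = [lambda a, j=j: v(t, j, history, a) for j in range(K)]
    a_star = f(valuations)
    return v(t, i, history, a_star) - payment(i, valuations)

for i in range(K):
    print(f"u_1,{i}(empty history) =", u(1, i, ()))

Under these toy numbers each agent's utility comes out to $3\times 0.5 - 3\times 0.1 = 1.2$, matching the general formula above: expected cumulative reward over the horizon minus the expected sum of per-step payments.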
Appendix B Cap and Trade Agent Policies

Collusive Dynamics

Fig 7 shows the final policy learned for the setup of Fig 5.

=== FINAL POLICY EVALUATION ===
--------------------------------------------------------------------
|  t  | Agent | Prod |  Perm  | Prof | Rho  | Bid  |  Win? | +Prof |
--------------------------------------------------------------------
|  1  |   0   |  0.0 |      0 |  0.0 |  8.4 | 0.00 |       |       |
|     |   1   |  0.0 |      0 |  0.0 |  6.1 | 0.00 |   *   |   6.1 |
--------------------------------------------------------------------
|  2  |   0   |  0.0 |      0 |  0.0 |  8.4 | 0.10 |   *   |   8.4 |
|     |   1   | 30.5 |  3,000 |  6.1 |  2.3 | 0.00 |       |       |
--------------------------------------------------------------------
|  3  |   0   | 42.2 |  3,000 |  8.4 |  1.6 | 0.00 |       |       |
|     |   1   | 30.5 |  3,000 |  6.1 |  2.3 | 0.05 |   *   |   2.3 |
--------------------------------------------------------------------
|  4  |   0   | 42.2 |  3,000 |  8.4 |  1.6 | 0.20 |   *   |   1.6 |
|     |   1   | 42.2 |  6,000 |  8.4 |  0.9 | 0.00 |       |       |
--------------------------------------------------------------------
|  5  |   0   | 50.2 |  6,000 | 10.0 |  1.0 | 0.05 |   *   |   1.0 |
|     |   1   | 42.2 |  6,000 |  8.4 |  0.9 | 0.00 |       |       |
--------------------------------------------------------------------
(winning agent must pay less than rho to make a profit)

Avg permit price: $0.0
Agent 0:
  -> won 9000 permits
  -> paid $0.00m
  -> produced 55.2m litres of fuel
  -> made a total profit of $11.04m

Agent 1:
  -> won 6000 permits
  -> paid $0.00m
  -> produced 42.2m litres of fuel
  -> made a total profit of $8.45m

====== ESTIMATED OPTIMAL STRATEGIES: argmax_a Q(s, a) ======

==============================================================================================
        ||     0       |   3,000     |   6,000     |   9,000     |   12,000    |   15,000    |
==============================================================================================
      0 || 0.00 / 0.00 | 0.10 / 0.00 | 0.35 / 0.00 | 0.35 / 0.05 | 8.35 / 0.00 | 0.0* / 0.0* |
  3,000 || 0.00 / 0.10 | 0.00 / 0.05 | 0.20 / 0.00 | 0.10 / 0.00 | 0.0* / 0.0* |  ?   /  ?   |
  6,000 || 0.00 / 0.05 | 0.00 / 0.05 | 0.05 / 0.00 | 0.0* / 0.0* |  ?   /  ?   |  ?   /  ?   |
  9,000 || 0.00 / 0.40 | 0.05 / 0.15 | 0.0* / 0.0* |   ?  /  ?   |  ?   /  ?   |  ?   /  ?   |
 12,000 || 0.25 / 4.45 | 0.0* / 0.0* |   ?   /  ?  |   ?  /  ?   |  ?   /  ?   |  ?   /  ?   |
 15,000 || 0.0* / 0.0* |  ?   /  ?   |   ?   /  ?  |   ?  /  ?   |  ?   /  ?   |  ?   /  ?   |

====== ESTIMATED RETURNS IF OPTIMAL STRATEGY FOLLOWED: max Q(s, a) ======

=====================================================================================================
        ||      0        |    3,000      |    6,000      |    9,000    |    12,000   |    15,000    |
=====================================================================================================
      0 || 19.49 / 19.49 | 13.38 / 13.38 | 11.03 / 11.04 | 9.92 / 9.99 | 7.90 / 8.45 | 0.00 / 0.00  |
  3,000 || 11.04 / 11.04 |  4.94 / 4.94  |  2.59 / 2.59  | 1.59 / 1.60 | 0.00 / 0.00 |  ?   /  ?    |
  6,000 ||  9.44 / 9.44  |  3.34 / 3.34  |  0.99 / 0.99  | 0.00 / 0.00 |  ?   /  ?   |  ?   /  ?    |
  9,000 ||  8.40 / 8.27  |  2.30 / 2.28  |  0.00 / 0.00  |  ?   /  ?   |  ?   /  ?   |  ?   /  ?    |
 12,000 ||  5.85 / 5.84  |  0.00 / 0.00  |   ?   /  ?    |  ?   /  ?   |  ?   /  ?   |  ?   /  ?    |
 15,000 ||  0.00 / 0.00  |   ?   /  ?    |   ?   /  ?    |  ?   /  ?   |  ?   /  ?   |  ?   /  ?    |
Figure 7: Final policy for a sample run of $r^{(2)}$.

In the Estimated Optimal Strategies matrix in Fig 7, rows denote the permits held by $R_1$ and columns the permits held by $R_2$, so each cell represents a state of the game. The entries in a cell are ($R_1$'s estimated optimal bid in that state, $R_2$'s estimated optimal bid in that state). A ‘?’ means the state was never visited by the agent, while an ‘*’ marks the terminal states.

On the first tranche, both agents bid zero, leaving it to the VCG mechanism to decide, via random tie-breaking, which agent wins. If $R_1$ won the first tranche (so we are now in cell (2, 1) of the matrix), $R_1$ bids zero so that $R_2$ wins the second tranche, and vice versa if we are in cell (1, 2). Either way, by tranche 3 we are in cell (2, 2) of the matrix, and the remainder of the game always proceeds the same way: agent $R_2$ wins, then agent $R_1$ wins, then agent $R_1$ wins again.

Thus the learned joint policy always results in a return for each agent (and a total shared net profit) of $19.49$ over the five tranches, exactly as both agents have estimated; see cell (1, 1) of the “Estimated Returns” matrix of Fig 7. The entries in that matrix are ($R_1$'s estimated expected return from the state when it makes its estimated optimal bids, $R_2$'s estimated expected return when it makes its estimated optimal bids). Note that the agents learned this collusive policy solely by observing the permit holdings of each participant in the auction; there was never any explicit communication between them.
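To make the auction mechanics in this discussion concrete, the following is a minimal Python sketch of a single permit-tranche auction under the assumptions used above: sealed bids, the highest bidder wins the whole tranche, the winner pays the second-highest bid (the single-item special case of VCG), and ties are broken uniformly at random. For simplicity the replay uses the fixed bid sequence of the sampled run in Fig 7 rather than the full state-dependent policy; the function names and numbers are illustrative, not taken from the experiment code.

import random

def run_tranche(bids):
    # bids: dict agent_id -> sealed bid ($m). Highest bid wins, ties are broken
    # uniformly at random, and the winner pays the second-highest bid.
    top = max(bids.values())
    winner = random.choice([a for a, b in bids.items() if b == top])
    price = max(b for a, b in bids.items() if a != winner)
    return winner, price

# Bids mirroring the sampled run in Fig 7: the agent whose "turn" it is bids a
# small positive amount, the other bids zero, so every tranche clears at $0.
collusive_bids = [{0: 0.00, 1: 0.00}, {0: 0.10, 1: 0.00}, {0: 0.00, 1: 0.05},
                  {0: 0.20, 1: 0.00}, {0: 0.05, 1: 0.00}]
permits = {0: 0, 1: 0}
paid = {0: 0.0, 1: 0.0}
for bids in collusive_bids:
    winner, price = run_tranche(bids)
    permits[winner] += 3000
    paid[winner] += price
# The split of permits depends on the tranche-1 tie-break; total paid is $0 either way.
print(permits, paid)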

Competitive Dynamics

Fig 8 shows the final policy learned for the setup of Fig 6. Observe that the learned joint policy here is, in most states, for each agent to bid an identical amount, leaving it to the VCG mechanism to randomly determine who wins. In some cases an agent will bid an amount greater than $\rho$, such as in tranches 3-5 of the sampled final policy evaluation shown in Fig 8. Why? Consider tranche 3 of the sample, where the game is in state (6000, 0). Here agent $R_1$ bid $1.9$m (the same as $R_2$) and won, receiving a profit (and reward) of $-0.9$m, so agent $R_2$ received a reward of $0.9$m. If agent $R_1$ had bid less, say $1.8$m, then $R_2$ would have won with a profit of $6.1-1.8=4.3$m, and agent $R_1$ would have received a reward of $-4.3$m instead of $-0.9$m; clearly a worse outcome. If agent $R_1$ had bid more, the outcome would also have been worse: agent $R_1$ would have received a larger negative reward (and thus a larger positive reward would have gone to $R_2$).

Now let’s suppose the tie-break at state (6000, 0), and all subsequent tie-breaks, are resolved in $R_2$'s favour. Assuming each agent sticks with its learned strategy, the last three tranches would play out as follows:

  • $t=3$: (6000, 0) $\to$ bid $1.9$ / $1.9$ $\to$ $R_2$ wins $\to$ reward $-4.2$ / $+4.2$

  • $t=4$: (6000, 3000) $\to$ bid $1.25$ / $1.25$ $\to$ $R_2$ wins $\to$ reward $-1.1$ / $+1.1$

  • $t=5$: (6000, 6000) $\to$ bid $1.0$ / $1.0$ $\to$ $R_2$ wins $\to$ reward $-0.0$ / $+0.0$

The total reward over the last three tranches is thus $-5.3$ to $R_1$ and $+5.3$ to $R_2$. Overall, agent $R_1$ has accurately estimated its expected reward from state (6000, 0) over the remaining three tranches, when playing its estimated optimal actions, as $-5.3$m; see cell (3, 1) of the “Estimated Returns” matrix in Fig 8.
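As a quick sanity check on this play-out, the short sketch below recomputes the per-tranche rewards under the same assumptions: the winner pays the other agent's (equal) bid, ties are resolved in $R_2$'s favour, and the reward is zero-sum, with the loser receiving the negative of the winner's profit. The $\rho$ values attributed to $R_2$ are illustrative numbers chosen to reproduce the rounded rewards listed above, not values read directly from the experiment logs.

# (rho_2, common bid) for t = 3, 4, 5 at states (6000, 0), (6000, 3000), (6000, 6000)
tranches = [(6.10, 1.90), (2.35, 1.25), (1.00, 1.00)]

total_r1 = total_r2 = 0.0
for rho2, bid in tranches:
    price = bid                # tie resolved for R2, which pays R1's (equal) bid
    profit = rho2 - price      # R2's profit on the tranche
    total_r1 -= profit         # zero-sum reward: R1 receives the negative of it
    total_r2 += profit
print(round(total_r1, 2), round(total_r2, 2))   # -> -5.3  5.3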

=== FINAL POLICY EVALUATION ===
--------------------------------------------------------------------
|  t  | Agent | Prod |  Perm  | Prof | Rho  | Bid  | Win?  | +Prof |
--------------------------------------------------------------------
|  1  |   0   |  0.0 |      0 |  0.0 |  8.4 | 1.55 |   *   |   6.9 |
|     |   1   |  0.0 |      0 |  0.0 |  6.1 | 1.55 |       |       |
--------------------------------------------------------------------
|  2  |   0   | 42.2 |  3,000 |  6.9 |  1.6 | 1.55 |   *   |   0.1 |
|     |   1   |  0.0 |      0 |  0.0 |  6.1 | 1.50 |       |       |
--------------------------------------------------------------------
|  3  |   0   | 50.2 |  6,000 |  7.0 |  1.0 | 1.90 |   *   |  -0.9 |
|     |   1   |  0.0 |      0 |  0.0 |  6.1 | 1.90 |       |       |
--------------------------------------------------------------------
|  4  |   0   | 55.2 |  9,000 |  6.1 |  0.8 | 2.50 |   *   |  -1.7 |
|     |   1   |  0.0 |      0 |  0.0 |  6.1 | 2.50 |       |       |
--------------------------------------------------------------------
|  5  |   0   | 59.0 | 12,000 |  4.4 |  0.6 | 3.40 |   *   |  -2.8 |
|     |   1   |  0.0 |      0 |  0.0 |  6.1 | 3.40 |       |       |
--------------------------------------------------------------------
(winning agent must pay less than rho to make a profit)

Avg permit price: $723.0
Agent 0:
  -> won 15000 permits
  -> paid $10.85m
  -> produced 62.2m litres of fuel
  -> made a total profit of $1.58m

Agent 1:
  -> won 0 permits
  -> paid $0.00m
  -> produced 0.0m litres of fuel
  -> made a total profit of $0.00m

====== ESTIMATED OPTIMAL STRATEGIES: argmax_a Q(s, a) ======

==============================================================================================
        ||      0      |    3,000    |    6,000    |    9,000    |    12,000   |    15,000   |
==============================================================================================
      0 || 1.55 / 1.55 | 1.55 / 1.55 | 1.95 / 1.95 | 2.85 / 2.85 | 4.50 / 4.50 | 0.0* / 0.0* |
  3,000 || 1.55 / 1.50 | 1.20 / 1.15 | 1.05 / 1.05 | 1.15 / 1.15 | 0.0* / 0.0* |   ?  /  ?   |
  6,000 || 1.90 / 1.90 | 1.25 / 1.25 | 1.00 / 1.00 | 0.0* / 0.0* |   ?  /  ?   |  ?   /  ?   |
  9,000 || 2.50 / 2.50 | 1.55 / 1.6* | 0.0* / 0.0* |   ?  /  ?   |  ?   /  ?   |  ?   /  ?   |
 12,000 || 3.40 / 3.40 | 0.0* / 0.0* |   ?  /  ?   |  ?   /  ?   |  ?   /  ?   |  ?   /  ?   |
 15,000 || 0.0* / 0.0* |   ?  /  ?   |  ?   /  ?   |  ?   /  ?   |  ?   /  ?   |  ?   /  ?   |

====== ESTIMATED RETURNS IF OPTIMAL STRATEGY FOLLOWED: max Q(s, a) ======

===================================================================================================
        ||      0       |    3,000     |    6,000     |    9,000     |    12,000    |   15,000    |
===================================================================================================
      0 || 1.72 / -1.72 | 6.30 / -6.30 | 7.10 / -7.10 | 6.10 / -6.10 | 3.95 / -3.95 | 0.00 / 0.00 |
  3,000 || -5.20 / 5.20 | -0.61 / 0.61 | 0.58 / -0.58 | 0.47 / -0.47 | 0.00 / 0.00  |  ?   /  ?   |
  6,000 || -5.30 / 5.30 | -1.05 / 1.05 | 0.04 / -0.04 | 0.00 / 0.00  |  ?   /  ?    |  ?   /  ?   |
  9,000 || -4.44 / 4.44 | -0.80 / 0.80 | 0.00 / 0.00  |  ?   /  ?    |  ?   /  ?    |  ?   /  ?   |
 12,000 || -2.74 / 2.74 | 0.00 / 0.00  |  ?   /  ?    |  ?   /  ?    |  ?   /  ?    |  ?   /  ?   |
 15,000 || 0.00 / 0.00  |  ?   /  ?    |  ?   /  ?    |  ?   /  ?    |  ?   /  ?    |  ?   /  ?   |
Figure 8: Final policy for a sample run of $r^{(3)}$.