
The Problem of Social Cost in Multi-Agent General Reinforcement Learning:
Survey and Synthesis

Kee Siong Ng    Samuel Yang-Zhao    Timothy Cadogan-Cowper

1 Introduction

The AI safety literature is full of examples of powerful AI agents that, in blindly pursuing a specific and usually narrow objective, end up causing unacceptable collateral damage to others, including destroying humankind and the world in extreme cases; see, for example, [111, 18]. Many of these examples are effectively variations of the tragedy of the commons phenomenon, which has been studied extensively in economics [59, 100, 35]. The tragedy of the commons typically occurs because of an externality that arises when a utility-maximising economic agent does not pay an appropriate cost when deciding to take an action that provides private benefit but incurs social harm to others, in particular those that are not a party to the decision-making process. Pollution, overfishing and traffic congestion are all classic examples of multi-agent economic systems that exhibit externalities.

In this paper, we consider the problem of social harms that can result from actions taken by learning and utility-maximising agents in a multi-agent environment. The problem of measuring social harms or impacts in such multi-agent settings, especially when the agents are artificial general intelligence (AGI) agents, was listed as an open problem in [44]. We provide a partial answer to that open problem in the form of market-based mechanisms to quantify and control the cost of such social harms in § 4. The key proposal is the protocol described in § 4.1 and the agent valuation functions stated in equations (14)-(15). The protocol captures many existing and well-studied special cases, which we list in § 4.2. Our proposed setup is more general than existing formulations of multi-agent reinforcement learning with mechanism design in two ways: (i) the underlying environment is a history-based general reinforcement learning environment as in AIXI [69]; (ii) the reinforcement-learning agents participating in the environment can have different horizons. To demonstrate the practicality of the proposed setup, we survey some possible learning algorithms in § 5 and present a few applications in § 6.

As the background literature is rich, both in breadth and depth, we have taken an expository approach in writing the paper and will introduce key topics and concepts as they are required in the development of the paper, starting with General Reinforcement Learning in § 2 and Mechanism Design in § 3. Readers who are familiar with these topics should be able to skip those sections without issue.

2 General Reinforcement Learning

2.1 Single Agent Setting

We consider finite action, observation and reward spaces denoted by \mathcal{A}, \mathcal{O}, \mathcal{R} respectively. The agent interacts with the environment in cycles: at any time, the agent chooses an action from \mathcal{A} and the environment returns an observation and reward from \mathcal{O} and \mathcal{R}. Frequently we will consider observations and rewards together, and will denote by x\in\mathcal{O}\times\mathcal{R} a percept from the percept space \mathcal{O}\times\mathcal{R}. We will denote a string x_{1}x_{2}\ldots x_{n} of length n by x_{1:n} and its length n-1 prefix by x_{<n}. An action, observation and reward from the same time step will be denoted aor_{t}. After interacting for n cycles, the interaction string a_{1}o_{1}r_{1}\ldots a_{n}o_{n}r_{n} (denoted aor_{1:n} from here on) is generated. We define the history space \mathcal{H} to be the set of interaction strings of any length.

Definition 1.

A history h is an element of the space \mathcal{H}\coloneqq(\mathcal{A}\times\mathcal{O}\times\mathcal{R})^{*}. The history at time t is denoted h_{t}=aor_{1:t}.

An environment is a process which generates the percepts given actions. It is defined to be a sequence of probability distributions over percept sequences conditioned on the actions taken by the agent.

Definition 2.

An environment \rho is a sequence of probability distributions \{\rho_{0},\rho_{1},\rho_{2},\ldots\}, where \rho_{n}:\mathcal{A}^{n}\to\mathcal{D}((\mathcal{O}\times\mathcal{R})^{n}), that satisfies

\forall a_{1:n}\;\forall or_{<n}\quad\rho_{n-1}(or_{<n}\,|\,a_{<n})=\sum_{or_{n}\in\mathcal{O}\times\mathcal{R}}\rho_{n}(or_{1:n}\,|\,a_{1:n}). (1)

In the base case, we have \rho_{0}(\epsilon\,|\,\epsilon)=1.

Equation (1) captures the natural constraint that actions in the future do not affect past percepts and is known as the chronological condition [67]. We will drop the subscript on \rho_{n} when the context is clear.

The predictive probability of the next percept given the history and a current action is given by

\rho(or_{n}\,|\,aor_{<n},a_{n})=\rho(or_{n}\,|\,h_{n-1}a_{n})\coloneqq\frac{\rho(or_{1:n}\,|\,a_{1:n})}{\rho(or_{<n}\,|\,a_{<n})}

for all aor_{1:n} such that \rho(or_{<n}\,|\,a_{<n})>0. This allows us to write

\rho(or_{1:n}\,|\,a_{1:n})=\rho(or_{1}\,|\,a_{1})\,\rho(or_{2}\,|\,aor_{1}a_{2})\cdots\rho(or_{n}\,|\,aor_{<n}a_{n}).

The general reinforcement learning problem is for the agent to learn a policy \pi:\mathcal{H}\to\mathcal{D}(\mathcal{A}) mapping histories to distributions over possible actions that will allow it to maximise its expected cumulative future rewards.
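To make the interaction protocol concrete, here is a minimal Python sketch of the agent-environment loop over histories; the `env` and `policy` callables and their signatures are illustrative assumptions, not part of any existing codebase.

```python
import random
from typing import Dict, List, Tuple

# A percept is an (observation, reward) pair; a history is an element of
# (A x O x R)^*, represented as a list of (action, (observation, reward)).
Percept = Tuple[int, float]
History = List[Tuple[int, Percept]]

def interact(env, policy, n_cycles: int) -> History:
    """Run `n_cycles` of the agent-environment interaction.

    `policy(history)` is assumed to return pi(. | h_{t-1}) as a dict mapping
    actions to probabilities; `env(history, action)` is assumed to sample a
    percept from rho(or_t | h_{t-1} a_t).
    """
    history: History = []
    for _ in range(n_cycles):
        dist: Dict[int, float] = policy(history)
        actions, probs = zip(*dist.items())
        a_t = random.choices(actions, weights=probs)[0]
        or_t = env(history, a_t)
        history.append((a_t, or_t))
    return history
```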

Definition 3.

Given an environment \mu, at time t, given history h_{t-1} and an action a_{t}, the expected future cumulative reward up to a finite horizon m\in\mathbb{N} is given by the value function

V_{t,m}^{\mu,*}(h_{t-1},a_{t})=\sum_{or_{t}}\mu(or_{t}\,|\,h_{t-1}a_{t})\max_{a_{t+1}}\sum_{or_{t+1}}\mu(or_{t+1}\,|\,h_{t-1}aor_{t}a_{t+1})\cdots\max_{a_{t+m}}\sum_{or_{t+m}}\mu(or_{t+m}\,|\,h_{t-1}aor_{t:t+m-1}a_{t+m})\left[\sum_{i=t}^{t+m}r_{i}\right], (2)

which can also be written in the recursive form

V_{t,m}^{\mu,*}(h_{t-1},a_{t})=\sum_{or_{t}}\mu(or_{t}\,|\,h_{t-1}a_{t})\left[r_{t}+\max_{a_{t+1}}V_{t+1,m}^{\mu,*}(h_{t-1}aor_{t},a_{t+1})\right]. (3)

If the environment \mu is known, the optimal action a_{t}^{*} to take at time t is given by

a_{t}^{*}=\arg\max_{a_{t}}V^{\mu,*}_{t,m}(h_{t-1},a_{t}).
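The expectimax recursion (3) and the optimal action can be written directly as a (very inefficient) brute-force computation over small finite spaces. The sketch below is for illustration only and assumes a known predictive model `mu(percept, history, action)` returning \mu(or_{t}\,|\,h_{t-1}a_{t}).

```python
def value(mu, actions, percepts, history, a_t, m):
    """V_{t,m}^{mu,*}(h_{t-1}, a_t) computed via the recursion in eq. (3);
    `m` is the number of remaining lookahead steps after the current one."""
    total = 0.0
    for or_t in percepts:
        prob = mu(or_t, history, a_t)
        if prob == 0.0:
            continue
        _, r_t = or_t                      # percept = (observation, reward)
        h_next = history + [(a_t, or_t)]
        future = 0.0 if m == 0 else max(
            value(mu, actions, percepts, h_next, a, m - 1) for a in actions)
        total += prob * (r_t + future)
    return total

def optimal_action(mu, actions, percepts, history, m):
    """argmax_{a_t} V_{t,m}^{mu,*}(h_{t-1}, a_t) when mu is known."""
    return max(actions, key=lambda a: value(mu, actions, percepts, history, a, m))
```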

In practice, \mu is of course unknown and needs to be learned from data and background knowledge. The AIXI agent [67] is a mathematical solution to the general reinforcement learning problem, obtained by estimating the unknown environment \mu in (2) using Solomonoff Induction [119]. At time t, the AIXI agent chooses action a^{*}_{t} according to

a^{*}_{t}=\arg\max_{a_{t}}\sum_{or_{t}}\ldots\max_{a_{t+m}}\sum_{or_{t+m}}\left[\sum_{j=t}^{t+m}r_{j}\right]\sum_{\rho\in\mathcal{M}_{U}}2^{-K(\rho)}\rho(or_{1:t+m}\,|\,a_{1:t+m}), (4)

where m\in\mathbb{N} is a finite lookahead horizon, \mathcal{M}_{U} is the set of all enumerable chronological semimeasures [67], \rho(or_{1:t+m}\,|\,a_{1:t+m}) is the probability of observing or_{1:t+m} given the action sequence a_{1:t+m}, and K(\rho) denotes the Kolmogorov complexity [85] of \rho. The performance of AIXI relies heavily on the next result.

Definition 4.

Given a countable model class \mathcal{M}\coloneqq\{\rho_{1},\rho_{2},\ldots\} and a prior weight w^{\rho}_{0}>0 for each \rho\in\mathcal{M} such that \sum_{\rho\in\mathcal{M}}w^{\rho}_{0}=1, the Bayesian mixture model with respect to \mathcal{M} is given by \xi_{\mathcal{M}}(or_{1:n}\,|\,a_{1:n})=\sum_{\rho\in\mathcal{M}}w^{\rho}_{0}\,\rho(or_{1:n}\,|\,a_{1:n}).

A Bayesian mixture model enjoys the property that it converges rapidly to the true environment if there exists a ‘good’ model in the model class.

Theorem 1.

[67] Let \mu be the true environment and \xi be the Bayesian mixture model over a model class \mathcal{M}. For all n\in\mathbb{N} and for all a_{1:n},

\sum_{j=1}^{n}\sum_{or_{1:j}}\mu(or_{<j}\,|\,a_{<j})\bigl(\mu(or_{j}\,|\,aor_{<j}a_{j})-\xi(or_{j}\,|\,aor_{<j}a_{j})\bigr)^{2}\leq\min_{\rho\in\mathcal{M}}\left\{\ln\frac{1}{w^{\rho}_{0}}+D_{n}(\mu\,||\,\rho)\right\}, (5)

where D_{n}(\mu\,||\,\rho) is the KL divergence of \mu(\cdot\,|\,a_{1:n}) and \rho(\cdot\,|\,a_{1:n}) defined by

D_{n}(\mu\,||\,\rho)\coloneqq\sum_{or_{1:n}}\mu(or_{1:n}\,|\,a_{1:n})\ln\frac{\mu(or_{1:n}\,|\,a_{1:n})}{\rho(or_{1:n}\,|\,a_{1:n})}.

To see the rapid convergence of \xi to \mu, take the limit n\to\infty on the l.h.s. of (5) and observe that, in the case where \min_{\rho\in\mathcal{M}}\sup_{n}D_{n}(\mu\,||\,\rho) is bounded, the l.h.s. can only be finite if \xi(or_{k}\,|\,aor_{<k}a_{k}) converges sufficiently fast to \mu(or_{k}\,|\,aor_{<k}a_{k}).

It can be shown that the model \xi_{U}:=\sum_{\rho\in\mathcal{M}_{U}}2^{-K(\rho)}\rho(or_{1:t}\,|\,a_{1:t}) in (4) is a Bayesian mixture model over the class of all computable functions, with 2^{-K(\rho)} as the prior for \rho. So assuming the environment is a computable function, AIXI’s model will converge rapidly to it.
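For a finite model class, Definition 4 and the posterior updating behind Theorem 1 can be realised in a few lines of code. The sketch below assumes each model exposes a hypothetical `predict(percept, history, action)` method returning \rho(or_{t}\,|\,h_{t-1}a_{t}).

```python
import math

class BayesMixture:
    """Bayesian mixture xi_M over a finite model class M = {rho_1, ..., rho_N}.

    Weights start at the prior w_0^rho and are updated to posterior weights
    as percepts arrive.
    """
    def __init__(self, models, priors):
        assert abs(sum(priors) - 1.0) < 1e-9
        self.models = list(models)
        self.log_w = [math.log(p) for p in priors]

    def predict(self, percept, history, action):
        # xi(or_t | h_{t-1} a_t) = sum_rho w^rho * rho(or_t | h_{t-1} a_t)
        return sum(math.exp(lw) * m.predict(percept, history, action)
                   for lw, m in zip(self.log_w, self.models))

    def update(self, percept, history, action):
        # Bayes rule: w^rho <- w^rho * rho(or_t | h_{t-1} a_t), then normalise.
        liks = [m.predict(percept, history, action) for m in self.models]
        new = [lw + math.log(max(l, 1e-300)) for lw, l in zip(self.log_w, liks)]
        norm = math.log(sum(math.exp(x) for x in new))
        self.log_w = [x - norm for x in new]
```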

AIXI is known to be incomputable. In practice, we will consider Bayesian reinforcement learning agents [53] that make use of Bayesian mixture models of different kinds and approximate the expectimax operation in (4) with algorithms like Monte Carlo Tree Search [79] or other reinforcement learning algorithms; examples of such agents include approximations of AIXI like [129, 139, 138]. These are all model-based techniques; model-free techniques like Temporal Difference learning [121] that exploit the functional form of (3) and universal function approximators like deep neural networks can also be considered.

2.2 Multi-Agent Setting

In the multi-agent setup, we assume there are k>1 agents, each with its own action and observation spaces \mathcal{A}_{i} and \mathcal{O}_{i}, i\in[1\ldots k]. At time t, the k agents take a joint action

\mathbf{a_{t}}=(a_{t,1},\ldots,a_{t,k})\in\mathcal{A}_{1}\times\cdots\times\mathcal{A}_{k}=\mathbf{\mathcal{A}}

and receive a joint percept

\mathbf{or_{t}}=(or_{t,1},\ldots,or_{t,k})\in(\mathcal{O}_{1}\times\mathcal{R})\times\cdots\times(\mathcal{O}_{k}\times\mathcal{R})=\mathbf{\mathcal{O}\times\mathcal{R}}.

The joint history up to time t is denoted \mathbf{h_{t}}=\mathbf{aor_{1:t}}=\mathbf{a_{1}}\mathbf{or_{1}}\mathbf{a_{2}}\mathbf{or_{2}}\ldots\mathbf{a_{t}}\mathbf{or_{t}}. We assume for now that every agent can see the entire joint history; this assumption will be reexamined in § 5.2.

Definition 5.

A multi-agent environment \varrho is a sequence of probability distributions \{\varrho_{0},\varrho_{1},\varrho_{2},\ldots\}, where \varrho_{n}:\mathbf{\mathcal{A}}^{n}\to\mathcal{D}(\mathbf{\mathcal{O}\times\mathcal{R}})^{n}, that satisfies

\forall\mathbf{a_{1:n}}\;\forall\mathbf{or_{<n}}\quad\varrho_{n-1}(\mathbf{or_{<n}}\,|\,\mathbf{a_{<n}})=\sum_{\mathbf{or_{n}}\in\mathbf{\mathcal{O}\times\mathcal{R}}}\varrho_{n}(\mathbf{or_{1:n}}\,|\,\mathbf{a_{1:n}}). (6)

In the base case, we have \varrho_{0}(\epsilon\,|\,\epsilon)=1.

Multi-agent environments can have different (non-exclusive) properties, some of which are listed here:

  1. Mutually exclusive actions, where only one of the actions in \mathbf{a_{t}} can be executed by the environment at time t;

  2. Non-mutually exclusive actions, where two or more individual actions in \mathbf{a_{t}} can be executed simultaneously by the environment at time t;

  3. Zero-sum rewards, where the agents’ rewards sum to 0 at every time step so they are competing against each other, like in [86];

  4. Identical rewards, where the agents get the exact same rewards at every time step so they are playing cooperatively with each other;

  5. Mixed-sum rewards, which cover all the settings that combine elements of both cooperation and competition;

  6. Existence of a common resource pool, which can be represented by an ‘agent’ with null action space and whose reward is a function of other agents’ consumption of the resource pool.

Comprehensive surveys of formalisations and some key challenges and results in a few of these topics can be found in [115, 142].

Each agent’s goal in a multi-agent environment is to learn the optimal policy to achieve its own maximum expected cumulative future rewards. The behaviour of AIXI agents in a multi-agent environment is an open problem because each AIXI agent assumes its environment is computable, and that assumption breaks down when the environment contains other AIXI agents [67]. We will consider primarily Bayesian reinforcement learning agents in this paper.

In addition to the performance of individual agents, we will also have occasion to consider aggregate functions of the vector of expected cumulative rewards achieved by all the agents in the system, where the exact aggregate function chosen is dependent on desired properties of the multi-agent system like market efficiency and fairness.

3 Mechanism Design

3.1 Tragedy of the Commons

The key to solving tragedy of the commons issues is to work out a way to ‘internalise’ the externality in the design of the multi-agent economic system of interest. There are two primary approaches: price regulation through a central authority, and a market-based cap-and-trade system. The former is sometimes referred to as a Pigouvian tax after [106, 105], and it requires a central authority to (i) have enough information to accurately determine the unit price of the externality or social harm; and (ii) enforce its payment by agents that cause the externality, thereby internalising it. In contrast, the cap-and-trade system is motivated by the idea of Coasean bargaining [31, 30], whereby the maximum amount of the externality or social harm allowed is capped through the issuance of a fixed number of permits, each of which allows an agent to produce a unit of externality, and the agents are allowed to determine for themselves whether to use their permits to produce externality, or trade the permits among themselves for profit. The idea is that the cap-and-trade system will allow the agents that are most efficient in generating private benefit while minimising social harm to win, because they can afford to pay a higher price for the permits. Indeed, the Coase ‘Theorem’ says that as long as the permits are completely allocated and there is no transaction cost involved in trading, the agents will collectively end up with a Pareto efficient solution. So a market made up of utility-maximising agents, under the right conditions, is capable of determining the right price for the externality; there is no need for a well-informed and powerful central authority to set the price.

In the following sections, we will look at some concrete protocols from the field of Mechanism Design for implementing Coasean bargaining and trading in multi-agent environments. In keeping with the intended spirit of [32], we will largely avoid the term externality from here onwards and favour, instead, the term ‘social harm’.

3.2 The VCG Mechanism

In a multi-agent environment, the different agents participating in it can be given different goals and preferences, either competing or collective, and the algorithms behind those agents can exhibit different behaviour, including differing abilities in learning and planning for the long term. Mechanism design [94] is the study of protocols that can take the usually dispersed and private information and preferences of multiple agents and aggregate them into an appropriate social choice, usually a decision among alternatives, that maximises the welfare of all involved.

Let \mathbb{A} be a set of alternatives for a set of k agents. The preference of agent i is given by a valuation function v_{i}:\mathbb{A}\to\mathbb{R}, where v_{i}(a) denotes the value that agent i assigns to alternative a being chosen. Here, v_{i}\in V_{i}, where V_{i}\subseteq\mathbb{R}^{\mathbb{A}} is the set of possible valuation functions for agent i. We will use the notation V_{-i}=V_{1}\times\cdots\times V_{i-1}\times V_{i+1}\times\cdots\times V_{k} in the following.

Definition 6.

A mechanism is a tuple (f,p_{1},\ldots,p_{k}) made up of a social choice function f:V_{1}\times\cdots\times V_{k}\to\mathbb{A} and payment functions p_{1},\ldots,p_{k}, where p_{i}:V_{1}\times\cdots\times V_{k}\to\mathbb{R} is the amount that agent i pays to the mechanism.

Given a mechanism (f,p_{1},\ldots,p_{k}) and k agents with valuation functions v_{1},\ldots,v_{k}, the utility of agent i from participating in the mechanism is given by

u_{i}(v_{1},\ldots,v_{k}):=v_{i}(f(v_{1},\ldots,v_{k}))-p_{i}(v_{1},\ldots,v_{k}). (7)
Definition 7.

A mechanism (f,p_{1},\ldots,p_{k}) is called incentive compatible if for every agent i with valuation function v_{i}\in V_{i}, for every v_{i}^{\prime}\in V_{i}, and every v_{-i}\in V_{-i}, we have

v_{i}(f(v_{i},v_{-i}))-p_{i}(v_{i},v_{-i})\geq v_{i}(f(v_{i}^{\prime},v_{-i}))-p_{i}(v_{i}^{\prime},v_{-i}). (8)

Thus, in an incentive compatible mechanism, each agent i would maximise its utility by being truthful in revealing its valuation function v_{i} to the mechanism, rather than needing to worry about obtaining an advantage by presenting a possibly false / misleading v_{i}^{\prime}.

Definition 8.

A mechanism (f,p_{1},\ldots,p_{k}) is individually rational if for every agent i with valuation function v_{i}\in V_{i} and every v_{-i}\in V_{-i}, we have

v_{i}(f(v_{i},v_{-i}))-p_{i}(v_{i},v_{-i})\geq 0. (9)

In other words, the utility of each agent is always non-negative, assuming the agent reports truthfully.

Definitions 6, 7 and 8 can be generalised to allow the social choice function f and the payment functions p_{i} to be randomised functions, in which case we will work with the expectation version of (7), (8) and (9).

Definition 9.

A mechanism (f,p_{1},\ldots,p_{k}) is called a Vickrey-Clarke-Groves (VCG) mechanism if

  1. f(v_{1},\ldots,v_{k})\in\arg\max_{a\in\mathbb{A}}\sum_{i}v_{i}(a); that is, the social choice function f maximises the social welfare, and

  2. there exist functions h_{1},\ldots,h_{k}, where h_{i}:V_{-i}\to\mathbb{R}, such that for all v_{1}\in V_{1},\ldots,v_{k}\in V_{k}, we have

     p_{i}(v_{1},\ldots,v_{k})=h_{i}(v_{-i})-\sum_{j\neq i}v_{j}(f(v_{1},\ldots,v_{k})).

Here is a classical result from mechanism design.

Theorem 2.

Every Vickrey-Clarke-Groves mechanism is incentive compatible.

What should the h_{i} functions in VCG mechanisms be? A good choice is the Clarke pivot rule.

Definition 10.

The Clarke pivot payment function for a VCG mechanism is given by h_{i}(v_{-i}):=\max_{b\in\mathbb{A}}\sum_{j\neq i}v_{j}(b) for agent i.

Under this choice of h_{i}, the payment for agent i is

p_{i}(v_{1},\ldots,v_{k})=\max_{b}\sum_{j\neq i}v_{j}(b)-\sum_{j\neq i}v_{j}(f(v_{1},\ldots,v_{k})),

which is the difference between the collective social welfare of the others with and without i’s participation in the system. So each agent makes a payment that internalises the exact social harm it causes the other agents. The utility of agent i is

u_{i}(v_{1},\ldots,v_{k})=\sum_{j}v_{j}(f(v_{1},\ldots,v_{k}))-\max_{b}\sum_{j\neq i}v_{j}(b).
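The social choice, Clarke pivot payments and resulting utilities can be computed directly from the reported valuations. The following is a straightforward Python transcription of Definitions 9 and 10 for a finite \mathbb{A}; `valuations[i]` is assumed to be a dict mapping alternatives to agent i's reported values.

```python
def vcg_clarke(valuations, alternatives):
    """VCG mechanism with the Clarke pivot rule (Definitions 9 and 10).

    Returns the chosen alternative, each agent's payment and each agent's
    resulting utility.
    """
    welfare = lambda a: sum(v[a] for v in valuations)
    choice = max(alternatives, key=welfare)

    payments, utilities = [], []
    for i, v_i in enumerate(valuations):
        others = lambda a: sum(v[a] for j, v in enumerate(valuations) if j != i)
        h_i = max(others(b) for b in alternatives)   # Clarke pivot h_i(v_{-i})
        p_i = h_i - others(choice)                   # payment of agent i
        payments.append(p_i)
        utilities.append(v_i[choice] - p_i)
    return choice, payments, utilities
```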
Theorem 3.

The VCG mechanism with the Clarke pivot payment function is individually rational if the agent valuation functions are all non-negative.

Example 1.

Consider an auction where \mathbb{A}=\{1,\ldots,k\} – so one and only one of the agents wins – and where, for agent i, v_{i}(i)=p_{i} and \forall j\neq i,\;v_{i}(j)=0. Vickrey’s second-price auction, in which each agent i bids the highest price p_{i} it is willing to pay for the auction item, the winner i^{*}=\arg\max_{j}p_{j} pays the second highest bid price p^{*}=\max_{j\neq i^{*}}p_{j}, and everyone else pays 0, is a VCG mechanism with the Clarke pivot payment function.
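As a quick sanity check, running the `vcg_clarke` sketch above on this auction (with illustrative bids of 10, 7 and 3) selects the highest bidder and charges it the second-highest bid, as the example claims.

```python
# Three bidders with bids 10, 7 and 3; alternative j means "agent j wins".
bids = [10.0, 7.0, 3.0]
valuations = [{j: (b if j == i else 0.0) for j in range(3)}
              for i, b in enumerate(bids)]
choice, payments, _ = vcg_clarke(valuations, alternatives=range(3))
print(choice, payments)   # winner 0 pays the second-highest bid 7; others pay 0
```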

Example 2.

Consider the design of a mechanism to allow two agents, a buyer B and a seller S, to engage in bilateral trade for a good owned by the seller. There are two possible outcomes: no-trade or trade, which we model numerically as \mathbb{A}=\{0,1\}. The buyer values the good at \theta_{B}\geq 0 so its valuation function is

v_{B}(d):=\mathit{if}\;d=1\;\mathit{then}\;\theta_{B}\;\mathit{else}\;0.

The seller values the good at \theta_{S}\geq 0 so its valuation function is

v_{S}(d):=\mathit{if}\;d=1\;\mathit{then}\;-\theta_{S}\;\mathit{else}\;0

because the seller loses the good in the case of a trade. If we use the VCG mechanism with the Clarke pivot payment function, we end up with the social choice

d^{*}=f(v_{B},v_{S})=\arg\max_{d\in\mathbb{A}}(v_{B}(d)+v_{S}(d))
=\arg\max_{d\in\mathbb{A}}\,(\mathit{if}\;d=1\;\mathit{then}\;\theta_{B}-\theta_{S}\;\mathit{else}\;0)
=\mathit{if}\;\theta_{B}-\theta_{S}>0\;\mathit{then}\;1\;\mathit{else}\;0,

which means there is a trade iff the buyer attaches a higher value to the good than the seller. Here are the payment functions:

p_{B}(v_{B},v_{S})=\max_{d\in\mathbb{A}}v_{S}(d)-v_{S}(d^{*})=\mathit{if}\;\theta_{B}-\theta_{S}>0\;\mathit{then}\;\theta_{S}\;\mathit{else}\;0
p_{S}(v_{B},v_{S})=\max_{d\in\mathbb{A}}v_{B}(d)-v_{B}(d^{*})=\mathit{if}\;\theta_{B}-\theta_{S}>0\;\mathit{then}\;0\;\mathit{else}\;\theta_{B}.

So the buyer pays the mechanism \theta_{S} if there is a trade and the seller pays the mechanism \theta_{B} if there is no trade. The latter is slightly odd and results in the problematic issue of the seller always having negative utility from participating in the mechanism:

u_{S}(v_{B},v_{S})=v_{B}(d^{*})+v_{S}(d^{*})-\max_{d\in\mathbb{A}}v_{B}(d)
=\mathit{if}\;\theta_{B}-\theta_{S}>0\;\mathit{then}\;-\theta_{S}\;\mathit{else}\;-\theta_{B}.

The problem comes down to the asymmetric position of the buyer and seller and the p_{i}(\cdot)\geq 0 condition enforced by the Clarke pivot payment function, where the seller is forced to make a payment to maintain its ownership of the good (no trade), even though the status quo is that the seller already owns the good. There are at least two solutions. The first solution is to insist that the buyer and seller pay nothing if there is no trade, so we end up with the constraints

p_{B}(v_{B},v_{S})=h_{B}(v_{S})-v_{S}(0)=0
p_{S}(v_{B},v_{S})=h_{S}(v_{B})-v_{B}(0)=0

that imply h_{B}(v_{S})=v_{S}(0) and h_{S}(v_{B})=v_{B}(0). In this scenario, when trade happens, we have

p_{B}(v_{B},v_{S})=h_{B}(v_{S})-v_{S}(1)=\theta_{S}
p_{S}(v_{B},v_{S})=h_{S}(v_{B})-v_{B}(1)=-\theta_{B},

which means the buyer pays the mechanism \theta_{S} and the seller is paid \theta_{B} by the mechanism, with both agents obtaining utility \theta_{B}-\theta_{S}>0. Note that the mechanism ends up having to subsidise the trade. The second solution is to remove the ownership asymmetry between the two agents, by first making the mechanism pay an amount \theta\geq\theta_{S} to the seller to transfer the good to the mechanism and then running the VCG mechanism with the Clarke pivot rule to determine the new owner of the good. Under this setup, the valuation function of the buyer stays the same, but the valuation function of the seller becomes

v_{S}(d)=\mathit{if}\;d=1\;\mathit{then}\;0\;\mathit{else}\;\theta_{S},

and we end up with the Vickrey second-price auction setup. The utility of the buyer stays the same, and that of the seller becomes

u_{S}(v_{B},v_{S})=\theta+\bigl(v_{S}(d^{*})+v_{B}(d^{*})-\max_{d\in\mathbb{A}}v_{B}(d)\bigr)
=\theta-\theta_{B}+(\mathit{if}\;\theta_{B}-\theta_{S}>0\;\mathit{then}\;\theta_{B}\;\mathit{else}\;\theta_{S})
=\mathit{if}\;\theta_{B}-\theta_{S}>0\;\mathit{then}\;\theta\;\mathit{else}\;\theta+\theta_{S}-\theta_{B}
\geq 0.

As with the first solution, the mechanism ends up with a negative value, which is the cost of subsidising the trade.
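Both the anomaly and the fix can be checked numerically with the `vcg_clarke` sketch from § 3.2; the values of \theta_{B}, \theta_{S} and \theta below are illustrative.

```python
# Plain Clarke pivot on bilateral trade (Example 2): the seller's utility is
# always negative.
theta_B, theta_S = 5.0, 3.0
v_buyer  = {0: 0.0, 1: theta_B}       # buyer gains theta_B on trade
v_seller = {0: 0.0, 1: -theta_S}      # seller loses the good on trade
d, pay, util = vcg_clarke([v_buyer, v_seller], alternatives=[0, 1])
print(d, util)                        # trade happens, seller utility = -theta_S

# Second solution: the mechanism first buys the good for theta >= theta_S,
# then runs a Vickrey-style auction between buyer and seller for ownership.
theta = theta_S
v_seller2 = {0: theta_S, 1: 0.0}      # seller now values keeping the good
d2, pay2, util2 = vcg_clarke([v_buyer, v_seller2], alternatives=[0, 1])
print(d2, theta + util2[1])           # seller's total utility is now >= 0
```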

3.3 The Exponential VCG Mechanism

We have shown in § 3.2 that VCG mechanisms are incentive compatible and individually rational, which means agents are incentivised to participate and be truthful. It turns out that VCG mechanisms can be made privacy-preserving too. The exponential mechanism [90], a key technique in differential privacy [39], has been shown in [64] to be a generalisation of the VCG mechanism that is differentially private, incentive compatible and nearly optimal for maximising social welfare. We now briefly describe this key result and furnish the proofs, which are rather instructive.

Definition 11.

A randomized algorithm \mathcal{M}:V_{1}\times\cdots\times V_{k}\to\mathbb{A} is (\epsilon,\delta)-differentially private if for any v\in V_{1}\times\cdots\times V_{k} and for any subset \Omega\subseteq\mathbb{A}

P(\mathcal{M}(v)\in\Omega)\leq e^{\epsilon}P(\mathcal{M}(v^{\prime})\in\Omega)+\delta,

for all v^{\prime} such that ||v-v^{\prime}||_{1}\leq 1 (i.e. there exists at most one i\in[k] such that v_{i}\neq v^{\prime}_{i}).

Definition 12.

Given a quality function q:V_{1}\times\cdots\times V_{k}\times\mathbb{A}\to\mathbb{R} and a v\in V_{1}\times\cdots\times V_{k}, the Exponential DP Mechanism \mathcal{M}_{q}^{\epsilon}(v) samples and outputs an element r\in\mathbb{A} with probability proportional to \exp(\frac{\epsilon}{2\Delta_{q}}q(v,r)), where

\Delta_{q}=\max_{r\in\mathbb{A}}\max_{v_{1},v_{2}:||v_{1}-v_{2}||_{1}\leq 1}|q(v_{1},r)-q(v_{2},r)|.
Theorem 4.

The Exponential DP Mechanism is (\epsilon,0)-differentially private.

Definition 13.

The Exponential VCG Mechanism is defined by (\mathcal{M}_{q}^{\epsilon},p_{1},\ldots,p_{k}) where

q(v,r)=\sum_{i}v_{i}(r)
p_{i}(v)=\mathop{\mathbb{E}}_{r\sim\mathcal{M}_{q}^{\epsilon}(v)}\left[-\sum_{j\neq i}v_{j}(r)\right]-\frac{2}{\epsilon}H(\mathcal{M}_{q}^{\epsilon}(v))+\frac{2}{\epsilon}\ln\left(\sum_{r\in\mathbb{A}}\exp\left(\frac{\epsilon}{2}\sum_{j\neq i}v_{j}(r)\right)\right)

and H(\cdot) is the Shannon entropy function.

Note that as \epsilon increases, \mathcal{M}_{q}^{\epsilon} will sample r^{*}=\arg\max_{r\in\mathbb{A}}\sum_{i}v_{i}(r) with probability rapidly approaching 1, and the payment function also satisfies the form given in Definition 9. In that sense, the exponential VCG mechanism can be considered a generalisation of the VCG mechanism.
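A direct transcription of Definition 13 for a finite \mathbb{A} looks as follows; it assumes each v_i maps alternatives into [0,1] so that \Delta_{q}=1, and it is a sketch rather than a production differentially-private implementation (in particular, no noise is added to the payments themselves).

```python
import math, random

def exponential_vcg(valuations, alternatives, eps):
    """Exponential VCG Mechanism (Definition 13) over a finite set of
    alternatives, assuming Delta_q = 1."""
    q = {r: sum(v[r] for v in valuations) for r in alternatives}
    weights = {r: math.exp(eps / 2.0 * q[r]) for r in alternatives}
    Z = sum(weights.values())
    probs = {r: w / Z for r, w in weights.items()}

    chosen = random.choices(list(probs), weights=list(probs.values()))[0]
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)

    payments = []
    for i in range(len(valuations)):
        others = lambda r: sum(v[r] for j, v in enumerate(valuations) if j != i)
        expected_others = sum(probs[r] * others(r) for r in alternatives)
        log_Z_minus_i = math.log(sum(math.exp(eps / 2.0 * others(r))
                                     for r in alternatives))
        # p_i(v) = E[-sum_{j!=i} v_j(r)] - (2/eps) H(M(v)) + (2/eps) ln Z_{-i}
        p_i = -expected_others - (2.0 / eps) * entropy + (2.0 / eps) * log_Z_minus_i
        payments.append(p_i)
    return chosen, payments
```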

Lemma 5.

Given \epsilon\in\mathbb{R} and valuation functions v=v_{1},\ldots,v_{k} where each v_{i}:\mathbb{A}\to[0,1], the Gibbs social welfare defined by

\mathbb{E}_{r\sim\xi}\left[\sum_{i}v_{i}(r)\right]+\frac{2}{\epsilon}H(\xi)

is maximised when \xi=\mathcal{M}^{\epsilon}_{q}(v) for q(v,r)=\sum_{i}v_{i}(r).

Proof.

The first term in the Gibbs social welfare can be rewritten as follows:

\sum_{r\in\mathbb{A}}\xi(r)q(v,r)
=\frac{2}{\epsilon}\sum_{r\in\mathbb{A}}\xi(r)\frac{\epsilon}{2}q(v,r)
=\frac{2}{\epsilon}\sum_{r\in\mathbb{A}}\xi(r)\ln\left(\exp\left(\frac{\epsilon q(v,r)}{2}\right)\right)
=\frac{2}{\epsilon}\sum_{r\in\mathbb{A}}\xi(r)\ln\left(\frac{\exp(\epsilon q(v,r)/2)}{\sum_{a\in\mathbb{A}}\exp(\epsilon q(v,a)/2)}\right)+\frac{2}{\epsilon}\ln\left(\sum_{a\in\mathbb{A}}\exp(\epsilon q(v,a)/2)\right). (10)

Adding (10) to the second, entropy term of the Gibbs social welfare and noting that \Delta_{q}=1, we get

\mathbb{E}_{r\sim\xi}\left[\sum_{i}v_{i}(r)\right]+\frac{2}{\epsilon}H(\xi)
=\frac{2}{\epsilon}\sum_{r\in\mathbb{A}}\xi(r)\ln(\mathcal{M}_{q}^{\epsilon}(v)(r))+\frac{2}{\epsilon}\ln\left(\sum_{a\in\mathbb{A}}\exp(\epsilon q(v,a)/2)\right)-\frac{2}{\epsilon}\sum_{r\in\mathbb{A}}\xi(r)\ln(\xi(r))
=-\frac{2}{\epsilon}D_{\mathit{KL}}(\xi\,||\,\mathcal{M}_{q}^{\epsilon}(v))+\frac{2}{\epsilon}\ln\left(\sum_{a\in\mathbb{A}}\exp(\epsilon q(v,a)/2)\right). (11)

By Gibbs’ inequality, (11) is maximised when \xi=\mathcal{M}_{q}^{\epsilon}(v). ∎

Theorem 6.

([64]) The Exponential VCG Mechanism is incentive compatible and individually rational.

Proof.

We first show the incentive compatibility property. Consider an agent i with valuation function v_{i} and fix the bids v_{-i} of the other agents. Let

h(\{v_{j}\}_{j})=\frac{2}{\epsilon}\ln\left(\sum_{r\in\mathbb{A}}\exp\left(\frac{\epsilon}{2}\sum_{j}v_{j}(r)\right)\right).

The expected utility to agent i when bidding b_{i} is

\mathop{\mathbb{E}}_{r\sim\mathcal{M}_{q}^{\epsilon}(b_{i},v_{-i})}[v_{i}(r)]-p_{i}(b_{i},v_{-i})
=\mathop{\mathbb{E}}_{r\sim\mathcal{M}_{q}^{\epsilon}(b_{i},v_{-i})}[v_{i}(r)]+\mathop{\mathbb{E}}_{r\sim\mathcal{M}_{q}^{\epsilon}(b_{i},v_{-i})}\left[\sum_{j\neq i}v_{j}(r)\right]+\frac{2}{\epsilon}H(\mathcal{M}_{q}^{\epsilon}(b_{i},v_{-i}))-h(v_{-i})
=\mathop{\mathbb{E}}_{r\sim\mathcal{M}_{q}^{\epsilon}(b_{i},v_{-i})}\left[\sum_{j}v_{j}(r)\right]+\frac{2}{\epsilon}H(\mathcal{M}_{q}^{\epsilon}(b_{i},v_{-i}))-h(v_{-i})
=-\frac{2}{\epsilon}D_{\mathit{KL}}(\mathcal{M}_{q}^{\epsilon}(b_{i},v_{-i})\,||\,\mathcal{M}_{q}^{\epsilon}(v_{i},v_{-i}))+h(v_{i},v_{-i})-h(v_{-i}), (12)

where the last step follows from (11), and the expression is maximised when b_{i}=v_{i}.

To show the individual rationality property, note that when b_{i}=v_{i}, the expression (12) reduces to h(v_{i},v_{-i})-h(v_{-i}), which is non-negative and equals zero when v_{i} is the zero function. ∎

As shown in [64], it is also advisable to add differential privacy noise to the payment functions, given that they contain information about all the agents’ valuation functions, which may need to be kept private.

4 The Social Cost of Actions

4.1 General Case

Let \mu be a multi-agent environment and assume there are k Bayesian reinforcement learning agents operating within it. These agents are concrete realisations of the concept of perfectly rational utility-maximising agents commonly assumed in economics. We have seen that, in the absence of some control mechanism, such multi-agent environments can exhibit bad equilibria. To avoid tragedy of the commons-type issues, we need to impose a cost on each agent’s actions, commensurate with the social harm it causes other agents with that action, and we will see in this section how augmenting a multi-agent environment with, for example, VCG mechanisms can address such issues.

Definition 14.

A multi-agent environment \phi controlled by a VCG mechanism M=(f,p_{1},\ldots,p_{k}) is a sequence of probability distributions M\triangleright\phi=\{\varphi_{0},\varphi_{1},\varphi_{2},\ldots\} such that each \varphi_{t}:(V_{1}\times\cdots\times V_{k})^{t}\to\mathcal{D}(\mathbf{\mathcal{O}\times\mathcal{R}})^{t} is defined by

\varphi_{t}(\mathbf{or^{\prime}_{1:t}}\,|\,\mathbf{v_{1:t}}):=\phi_{t}(\mathbf{or_{1:t}}\,|\,\mathbf{a^{*}_{1:t}}:=f(\mathbf{v_{1}})f(\mathbf{v_{2}})\cdots f(\mathbf{v_{t}})),

where \mathbf{v_{t}}=(v_{t,1},v_{t,2},\ldots,v_{t,k}) with v_{t,i}\in V_{i}\subseteq\mathbb{R}^{\mathcal{A}_{i}} being the valuation function of agent i at time t, \mathbb{A}\subseteq\mathbf{\mathcal{A}} is the space of actions that can be jointly taken by the agents, \mathbf{a^{*}_{1:t}}\in\mathbb{A}^{t}, and \mathbf{r^{\prime}_{t}}=(r^{\prime}_{t,1},\ldots,r^{\prime}_{t,k}) with r^{\prime}_{t,i}=(r_{t,i},p_{i}(v_{t,1},\ldots,v_{t,k})).

\mathbb{A} can be a strict subset of \mathbf{\mathcal{A}} if certain combinations of actions are impossible because some individual actions are mutually exclusive. Intuitively, an action has a social cost precisely in the scenarios where only one agent can execute the action, and the social cost of executing that action is the loss of value incurred when other agents are prevented from executing it.

Protocol for Controlled Multi-Agent Environment

Let \mathbf{h_{t-1}} be the joint history up to time t-1. At time t, we use the VCG mechanism M=(f,p_{1},\ldots,p_{k}) to determine the joint action the agents should take to maximise social welfare via

\mathbf{a^{*}_{t}}:=f(v_{t,1},\ldots,v_{t,k})=\arg\max_{\mathbf{a}\in\mathbb{A}}\sum_{i}v_{t,i}(\mathbf{a}). (13)

A joint percept \mathbf{or_{t}} is then sampled from \phi(\mathbf{or_{t}}\,|\,\mathbf{h_{t-1}}\mathbf{a^{*}_{t}}) and each agent i receives the percept or_{t,i} and pays the mechanism M the amount:

p_{i}(v_{t,1},\ldots,v_{t,k})=h_{i}(v_{t,-i})-\sum_{j\neq i}v_{t,j}(\mathbf{a_{t}^{*}}).

There is a degree of freedom in the choice of h_{i}(v_{t,-i}). A reasonable choice is the Clarke pivot payment function

h_{i}(v_{t,-i})=\max_{\mathbf{a_{t}}}\sum_{j\neq i}v_{t,j}(\mathbf{a_{t}}).

Under the VCG mechanism convention, the utility of agent i for the chosen action \mathbf{a_{t}^{*}} is u_{i}(\mathbf{v_{t}}):=v_{t,i}(\mathbf{a_{t}^{*}})-p_{i}(\mathbf{v_{t}}). Here, p_{i}(\mathbf{v_{t}}) is the social cost of the joint action \mathbf{a_{t}^{*}}. The social utility of joint action \mathbf{a_{t}^{*}} is

u(\mathbf{v_{t}}):=\sum_{i}u_{i}(\mathbf{v_{t}})=\sum_{i}\bigl(v_{t,i}(\mathbf{a^{*}_{t}})-p_{i}(\mathbf{v_{t}})\bigr).
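One cycle of the protocol can be sketched as follows; `valuations[i]` is the declared v_{t,i} represented as a dict over joint actions, and `env(history, joint_action)` is an assumed interface for sampling \phi(\mathbf{or_{t}}\,|\,\mathbf{h_{t-1}}\mathbf{a^{*}_{t}}).

```python
def controlled_step(valuations, joint_actions, env, history, clarke_pivot=True):
    """One cycle of the controlled multi-agent environment protocol.

    Returns the welfare-maximising joint action a*_t of eq. (13), the sampled
    joint percept, and the per-agent payments (the social cost of a*_t).
    """
    welfare = lambda a: sum(v[a] for v in valuations)
    a_star = max(joint_actions, key=welfare)            # eq. (13)

    payments = []
    for i in range(len(valuations)):
        others = lambda a: sum(v[a] for j, v in enumerate(valuations) if j != i)
        h_i = max(others(a) for a in joint_actions) if clarke_pivot else 0.0
        payments.append(h_i - others(a_star))           # p_i(v_t)

    or_t = env(history, a_star)                         # or_t ~ phi(. | h a*_t)
    return a_star, or_t, payments
```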
What should the agent valuation functions be?

Given full knowledge of the environment \phi and the mechanism (f,p_{1},\ldots,p_{k}), the valuation function v_{t,i} of agent i with horizon m_{i} at time t, where t\leq m_{i}, having seen history \mathbf{h_{t-1}}, is defined inductively as follows:

v_{t,i}(\mathbf{h_{t-1}},\mathbf{a_{t}})=\mathit{if}\;m_{i}-t=0\;\mathit{then}\;\sum_{\mathbf{or_{t}}}\phi(\mathbf{or_{t}}\,|\,\mathbf{h_{t-1}}\mathbf{a_{t}})\cdot r_{t,i}
\mathit{else}\;\sum_{\mathbf{or_{t}}}\phi(\mathbf{or_{t}}\,|\,\mathbf{h_{t-1}}\mathbf{a_{t}})[r_{t,i}+u_{t+1,i}(\mathbf{h_{t-1}}\mathbf{a_{t}}\mathbf{or_{t}})] (14)
u_{t,i}(\mathbf{h_{t-1}})=v_{t,i}(\mathbf{h_{t-1}},f(\mathbf{v_{t}}(\mathbf{h_{t-1}})))-p_{i}(\mathbf{v_{t}}(\mathbf{h_{t-1}})), (15)

where we denote \mathbf{v_{t}}(\mathbf{h_{t-1}})=(v_{t,1}(\mathbf{h_{t-1}},\cdot),\ldots,v_{t,k}(\mathbf{h_{t-1}},\cdot)). As shown in Appendix A, the utility of agent i with horizon m at time t=1 can be rewritten as

u_{1,i}(\epsilon)=\sum_{\mathbf{or_{1:m}}}\varphi_{m}(\mathbf{or_{1:m}}\,|\,\mathbf{v_{1:m}})\left[\sum_{t=1}^{m}r_{t,i}\right]-\sum_{\mathbf{or_{1:(m-1)}}}\varphi_{m-1}(\mathbf{or_{1:m-1}}\,|\,\mathbf{v_{1:m-1}})\left[\sum_{t=1}^{m}p_{i}(\mathbf{v_{t}})\right], (16)

where the components of \mathbf{v_{t}} are defined by (14). Note that the agents can have different horizons.

To see that (14) is a reasonable definition, we can appeal to Theorem 2 and use backward induction to argue that, at every time step, every single agent can do no better than report its valuation function truthfully, starting with the base case of t=m-1. Given that the u_{t+1,i}(\cdot) future-utility term in (14) is thus well-defined when all the agents are assumed to be rational, the formulation of (14) then gives the best expected result for each possible joint action \mathbf{a_{t}}. In the case where the mechanism M=(f,p_{1},\ldots,p_{k}) is probabilistic (e.g. the Exponential VCG Mechanism described in § 3.3), the utility function (15) can be generalised to take the expectation over the values of f(v_{t,1},\ldots,v_{t,k}).

Variations of Agent Valuation Functions

We can, if necessary, add reward discounting to (14) in the usual manner. If the underlying environment is known to be episodic – repeated games or episodic MDPs, for example – we can also adjust the base recursive condition of (14) to be t\bmod m=0.

Partial Knowledge

In practice, an agent participating in a mechanism-controlled environment will not have full knowledge of the environment and will likely not have full visibility of the joint actions and / or payments. Some of the possible configurations are covered in § 4.2, with more detailed descriptions of learning algorithms covered in § 5.

4.2 Special Cases and Related Settings

Here are some special cases of a mechanism-controlled environment M\triangleright\phi, all of which have a rich literature behind them.

Single Agent

When k=1, the formula (14) simplifies to (3) because f(v)=\arg\max_{a}v(a), and the protocol becomes that of the single-agent general reinforcement learning setting described in § 2.1. And, of course, when the actions are null or have no effect on the environment, we recover the sequential prediction setting, which includes Solomonoff Induction [117, 118], prediction with expert advice [23] and the closely related zero-sum repeated games.

Two Player Turn-based Games

When k=2 and the agents take turns executing actions (and therefore the actions are never exclusionary), our general setup reduces to two-player turn-based games, covering both the perfect-information and imperfect-information settings studied in game theory [132, 99].

Multi-Agent Reinforcement Learning

If the actions of the k agents are never mutually exclusive, then \mathbf{a_{t}^{*}}=f(v_{t,1},\ldots,v_{t,k})=\arg\max_{\mathbf{a}\in\mathbb{A}}\sum_{i}v_{t,i}(\mathbf{a}) is such that \mathbf{a_{t}^{*}}=\arg\max_{\mathbf{a}\in\mathbb{A}}v_{t,i}(\mathbf{a}) for each i\in[1,\ldots,k]. In that case, the mechanism M in M\triangleright\phi never plays a role and the protocol becomes that of the multi-agent general reinforcement learning setting described in § 2.2. Comprehensive surveys of key challenges and results in this area can be found in [115, 95, 142], with a unifying framework in [82].

Static Mechanism Design

When m=1 and \phi is fixed, the setup recovers classical mechanism design settings like auctions and single-decision markets [17]. For example, if \phi(\cdot\,|\,f(v_{1},\ldots,v_{k})) assigns probability 1 to the outcome in which agent i^{*}=\arg\max_{i}v_{i}(i) gets the percept (i^{*},v_{i^{*}}) and every other agent gets (i^{*},0), and the payment function is the Clarke pivot function, then M\triangleright\phi reduces to the Vickrey second-price auction among k bidders.

Dynamic Mechanism Design

When 1<m\leq\infty and k and \phi can change over time, we are in the dynamic mechanism design setting [14], with special cases like sequential auctions, and market participants that come and go. Such dynamic mechanisms have been used, for example, in Uber’s surge-pricing model [26] and congestion control [10]. A general online mechanism design algorithm based on Hedge [48] can be found in [66].

Multi-Agent Coordinated Planning

There are both similarities and important differences between our mechanism-controlled multi-agent environment setup and the literature on online mechanisms for coordinated planning and learning in multi-agent systems [103, 22]. In the latter case, there is usually a top-level Markov Decision Process (MDP) whose transition and reward functions are common knowledge to all the agents; each agent may only be active during certain time periods, and the agents hold private information – usually something simple like the value attached to winning an item being auctioned, or a hidden state that forms part of the state of the top-level MDP – that is needed by a central planner to work out the optimal joint policy for all the agents. In that setup, the key problem is the design of dynamic VCG-style mechanisms to incentivise all the agents to truthfully reveal their private information to the central planner at each time step. It is also worth noting that the concept of a self-interested agent in the multi-agent coordinated planning literature is mostly that of an agent who may lie about its own private information in order to gain an advantage, which is a narrow form of the classical economic notion of a self-interested agent that acts so as to maximise its own expected future cumulative rewards.

Multi-Agent Coordinated Learning

Our setup can also be understood game-theoretically as a simultaneous-move sequential game in which there are k agents, the action space for each agent is the set of all valuation functions, and the loss function for agent i given the collective actions (i.e. the submitted valuation functions of all the agents) is its negative utility as determined by the VCG mechanism. In the learning in games literature [49], the underlying game dynamics is usually assumed to be static and the concern of each agent is primarily around learning a best response to the possibly adaptive strategies of other agents. Key results in such simultaneous-move repeated games can be found in [16, 23].

Dynamic Mechanism Design via Reinforcement Learning

In the multi-agent coordinated planning case, the underlying MDP is assumed to be known. When this assumption is dropped, the mechanism designer has to use reinforcement learning algorithms to learn the parameters of the VCG mechanism from data, including the payment functions. The simpler bandit setting is tackled in [74]. A reinforcement learning (RL) algorithm using linear function approximation is described in [107]. It is worth noting that it is the mechanism designer that is using RL to learn the optimal VCG mechanism parameters over time from the environment made up of multiple agents and their instantaneous (or horizon 1) reward functions. The RL for dynamic mechanism design framework is silent on where the agents’ reward functions come from; each agent’s reward function could come from another learning process, for example via the agent learning its owner’s preferences.

5 Learning in the Presence of Social Cost

5.1 Measures of Success

In [115], when it comes to multi-agent reinforcement learning, the authors ask “If learning is the answer, what is the question?”. The answer is not obvious, not only because the underlying environment is non-stationary – which can already appear in the single-agent reinforcement learning setting – but also because the agents can adapt to each other’s behaviour, so each agent can play the dual roles of learner and teacher at the same time. For example, the storied Tit-for-Tat strategy [8, 97, 96] is an agent policy that both learns from and ‘teaches’ the other agents in iterated Prisoner’s Dilemma-type problems.

Several possible and non-unique theories of successful learning, both descriptive and prescriptive, are provided in [115, §7]. In the context of multi-agent reinforcement learning under mechanism design, we are concerned primarily with designing learning algorithms that satisfy some or all of the following properties, noting that the agents do not learn policy functions but only valuation functions that are fed into a VCG mechanism for joint action selection.

Convergence

Starting from no or only partial knowledge of the environment and the other agents, each agent’s learned valuation function at any one time should converge to (15) and its realised utility should converge over time to (16).

No Regret

An agent’s learning algorithm minimises regret if, against any set of other agents, it learns a sequence of valuation functions that achieves long-term utility almost as well as what the agent can achieve by picking, with hindsight, the best fixed valuation function for every time step.

Rational

An agent’s learning algorithm is rational if, whenever the other agents have settled on a stationary set of valuation functions, it settles on a best response to that stationary set.

Incentive Compatibility

In the case where the parameters of the VCG mechanism are learned from data, we require that the mechanism exhibits approximate incentive compatibility with high probability.

Some of these concepts will be made more precise in the coming subsections.

5.2 Bayesian Reinforcement Learning Agents

In practice, an agent i operating in a given controlled multi-agent environment M\triangleright\phi will not actually have enough information to construct v_{t,i} as defined in (14). First of all, it does not know what \phi is. A good solution is to use a Bayesian mixture \xi_{\mathcal{P}}, for a suitable model class \mathcal{P}, to learn \phi. So at time t with history \mathbf{h_{t-1}}, agent i approximates the expression \phi(\mathbf{or_{t}}\,|\,\mathbf{h_{t-1}}\mathbf{a_{t}}) by

\sum_{\rho\in\mathcal{P}}w_{0}^{\rho}\rho(\mathbf{or_{t}}\,|\,\mathbf{h_{t-1}}\mathbf{a_{t}}). (17)

The quantity \mathbf{p_{t}}=(p_{1},\ldots,p_{k}) in (15) can also be estimated directly using, say, another Bayesian mixture \xi_{\mathcal{Q}} via

\sum_{\rho\in\mathcal{Q}}w_{0}^{\rho}\rho(\mathbf{p_{t}}\,|\,\mathbf{apor_{<t}}\mathbf{a_{t}}). (18)

In general terms, we can think of (17) as learning the dynamics of the underlying environment \phi, and (18) as learning the preferences and strategies of the other agents in the environment.

Online Mixture Learning in Practice

The optimal model class \mathcal{P} (and \mathcal{Q}) for each agent to use is of course the class of all computable functions, yielding a multi-agent Solomonoff induction setup, where we can see that each agent’s model converges at a fast rate to the true environment by virtue of Theorem 1. This proposal has two issues. First of all, it is an incomputable solution. Secondly, Theorem 1 is only applicable when the environment is computable, and this assumption is violated when the environment contains other agents that are themselves incomputable.

In practice, Solomonoff induction can be approximated efficiently using factored, binarised versions of the Context Tree Weighting algorithm [135, 127, 129, 139], or online prediction-with-experts algorithms like Exponential Weights / Hedge [48, 23, 4], Switch [131, 128] and their many special cases [62]. While these algorithms have their roots in online convex optimisation, they can be interpreted as Bayesian mixtures when the loss function is log loss [81], which in our setup is -\log_{2}M(or), where M is the model for \phi and or is the observed percept. In particular, the Hedge algorithm is an exact Bayesian mixture model when the loss function is log loss and the learning rate is 1, in which case one can show that the Hedge weight for each expert is its posterior probability [23, §9.2]. The Prod algorithm with log loss [98] has been shown to be a “soft-Bayes” algorithm, which coincides with the exact Bayesian mixture when the learning rate is 1 and can be interpreted as a “slowed-down” version of the Bayesian mixture, or a Bayesian mixture with “partially sleeping” experts, when the learning rate is less than 1. (When the true environment is contained in the model class, the optimal learning rate is 1 because Bayesian inference is optimal. However, in the agnostic / improper learning setting where the true environment may not be in the model class, the learning rate needs to decrease over time to avoid pathological issues, especially in non-convex function classes [56, 125].) More generally, [81] shows that there is a spectrum of Bayesian mixture algorithms over expert sequences with different priors that can be understood as interpolations between Bayesian mixtures over fixed experts and (non-learning) element-wise mixtures. The spectrum includes Fixed-Share [61] with static Bernoulli switching frequencies, Switching distributions with dynamic slowly-decreasing switching frequencies [126, 124] and dynamically learned switching frequencies [131], and more general switching frequencies that are dependent on some ordering on experts [133, 81].
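The equivalence mentioned above is easy to see in code: with log loss and learning rate 1, the multiplicative Hedge update is exactly Bayes’ rule over the experts. A minimal sketch (expert predictions given as dicts over percepts; the names are illustrative):

```python
def hedge_log_loss_update(weights, expert_probs, percept, eta=1.0):
    """One Hedge step under log loss ell_i = -ln P_i(percept).

    The normalised update w_i <- w_i * exp(-eta * ell_i) = w_i * P_i(percept)^eta
    is exact Bayesian posterior updating when eta = 1, and a "slowed-down"
    (soft-Bayes-like) update when eta < 1.
    """
    new = [w * max(p[percept], 1e-300) ** eta
           for w, p in zip(weights, expert_probs)]
    z = sum(new)
    return [w / z for w in new]
```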

Such online-learning algorithms can exhibit good regret-minimising convergence behaviour under different scenarios. In the context of multi-agent systems, regret-minimisation learning algorithms are particularly useful in that one can show that a set of agents that all minimise a notion of regret called swap regret will converge to a correlated equilibrium [7], where no agent would want to deviate from its preferences / strategies assuming the others also do not deviate. In [16, 15, 27], the authors describe a general procedure to turn agents that minimise the standard form of regret into agents that minimise swap regret, which is regret that allows any specific agent action to be switched with some other action with the benefit of hindsight. This is achieved via a master algorithm that runs N=|\mathcal{A}| instances of a standard regret minimisation algorithm, one for each possible action. Each of the A_{l} algorithms, l\in[N], maintains a weight vector q_{l}^{t}\in\mathbb{R}^{N} at each time step as usual. The master algorithm maintains a weight vector p^{t}\in\mathbb{R}^{N} that is the solution to the equation p^{t}=p^{t}Q^{t}, where Q^{t}\in\mathbb{R}^{N\times N} is the matrix made up of the row vectors q_{l}^{t}, l\in[N]. (The p^{t} on the r.h.s. of p^{t}=p^{t}Q^{t} can be understood as the weights attached to the A_{l} algorithms, and the p^{t} on the left is best understood as the probability of selecting the different actions. Thus, p^{t} is the stable limiting distribution of the stochastic matrix Q^{t}, and the existence and efficient computability of p^{t} is given by the Perron-Frobenius Theorem.) The master algorithm incurs a loss \ell^{t} at each time step, which is then attributed to A_{l} in the amount p_{l}^{t}\ell^{t}. Intuitively, each A_{l} is responsible for minimising the regret of switching action l to any other action via its standard regret minimising property.
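The fixed point p^{t}=p^{t}Q^{t} can be computed, for example, by power iteration on the row-stochastic matrix Q^{t}; the sketch below shows one simple way to do it.

```python
def stationary_distribution(Q, iters=1000, tol=1e-12):
    """Solve p = p Q for a row-stochastic matrix Q (whose rows are the weight
    vectors q_l^t of the N sub-algorithms) by power iteration; existence of
    the fixed point follows from the Perron-Frobenius theorem."""
    n = len(Q)
    p = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(p[l] * Q[l][j] for l in range(n)) for j in range(n)]
        s = sum(new)
        new = [x / s for x in new]                # re-normalise for stability
        if max(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
    return p
```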

Monte Carlo Planning

The recurrence in (14)-(15) can be approximated using variants of the Monte Carlo Tree Search (MCTS) algorithm [20, 130, 122]. Even though there are no explicit min-nodes and max-nodes in an MCTS tree, it is known that MCTS converges to the (expecti)minimax tree because, as the number of roll-out operations goes to infinity, the UCT [79] selection policy at each decision node concentrates the vast majority of the roll-outs on the best child nodes, so the weighted-average back-up rewards converge to the max/min values. It is also worth noting that, rather than choosing the action that maximises the value function at the root of the MCTS tree, the agent declares the entire value function to M\triangleright\phi for a joint action to be chosen by the mechanism M. The v_{t,i} arguments to f(\cdot) and p_{i}(\cdot) in (15) can be estimated using the decision nodes in the MCTS tree at any one time.

To gain some intuition into how the MCTS algorithm would work, consider the tree in Fig 1 for the simple setup where the horizon is m=2, the action space consists only of \{a^{1},a^{2}\} and the percept space consists only of \{or^{1},or^{2}\}. There are two types of alternating nodes: decision nodes (in red) and observation nodes (in green). Each node is indexed by the sequence of symbols in the path from the root to the node. Attached to

  • each terminal decision node at the bottom of the tree labelled by a percept or is a set of values \{r_{i}\}_{i=1\ldots k};

  • each non-terminal decision node indexed by h_{t} is a set \{u_{t+1,i}(h_{t})\}_{i=1\ldots k} of values corresponding to (15);

  • each observation node indexed by h_{t-1}a_{t} is a set \{v_{t,i}(h_{t-1},a_{t})\}_{i=1\ldots k} of values corresponding to (14).

Each non-terminal decision node indexed by h_{t} has a social-welfare maximising action \arg\max_{a}\sum_{i}v_{t+1,i}(h_{t},a), which is indicated by a thick arrow. In the diagram, we assume each of the agents has the same horizon. In general, this is not the case, and some decision nodes can play the role of both terminal and non-terminal nodes, depending on each agent’s horizon.

Figure 1: A horizon-2 game tree with small action and perception spaces
Model-Free Reinforcement Learning

The above discussion is mostly based on model-based reinforcement learning approaches. It is possible to adopt model-free reinforcement learning approaches, like variants of Q-learning and temporal-difference learning, to learn the valuation and utility functions too. A key consideration here is that the underlying environment is non-stationary, so suitable adjustments are required. It is also worth noting that policy-gradient RL algorithms are harder to apply in our setting, because the VCG mechanism requires valuation functions to be submitted by agents.

Partial Observability

We have so far assumed each agent can see everything, including the declared valuation functions \mathbf{v_{t}} of the other agents, the chosen joint action \mathbf{a_{t}}, the full joint percept \mathbf{or_{t}}, and all payments \mathbf{p_{t}}. It is possible to relax this assumption, which we will briefly look at now.

Assumption 1.

Each agent i sees, at time t, the joint action \mathbf{a_{t}} but only its own percept \mathbf{or_{t|i}}:=or_{t,i} and the value of its payment p_{i}(\mathbf{v_{t}}). Its view of the history \mathbf{h_{t}} up to time t is denoted \mathbf{h_{t|i}}:=\mathbf{a_{1}}or_{1,i}\mathbf{a_{2}}or_{2,i}\ldots\mathbf{a_{t}}or_{t,i}.

Definition 15.

Given a multi-agent environment \varrho, we can define agent i’s view of \varrho under Assumption 1 as \varrho_{|i}=\{\varrho_{0|i},\varrho_{1|i},\varrho_{2|i},\ldots\}, where \varrho_{t|i}:\mathbf{\mathcal{A}}^{t}\to\mathcal{D}(\mathcal{O}_{i}\times\mathcal{R})^{t} is defined by

\varrho_{t|i}(o^{\prime}r^{\prime}_{1:t}\,|\,\mathbf{a_{1:t}}):=\sum_{\mathbf{or_{1:t}}\;\text{s.t.}\;\mathbf{or_{1:t|i}}=o^{\prime}r^{\prime}_{1:t}}\varrho_{t}(\mathbf{or_{1:t}}\,|\,\mathbf{a_{1:t}}).

The valuation function (14) can no longer be approximated directly from the partial observations. Instead, we have to use ϕt|i\phi_{t|i} in place of ϕt\phi_{t} in (14), and learn the f()f() and p()p() functions in (15) from data obtained from interactions with the environment, possibly using Bayesian mixture estimators.
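For illustration, when the joint percept distribution is available in tabular form, the marginalisation in Definition 15 can be computed directly, as in the following sketch (function names are ours):

```python
def project(joint_percept_seq, i):
    """Agent i's view of a sequence of joint percepts."""
    return tuple(p[i] for p in joint_percept_seq)

def marginal_prob(rho_t, i, own_percepts, joint_actions):
    """rho_{t|i}(own_percepts | joint_actions), where rho_t(joint_actions) is a
    dict mapping joint percept sequences to their probabilities."""
    return sum(prob for joint_seq, prob in rho_t(joint_actions).items()
               if project(joint_seq, i) == tuple(own_percepts))
```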

Markovian State Abstraction

To maintain computational tractability and statistical learnability, we will need to approximate estimators like (17) with

ρ𝒫w0ρρ(𝐨𝐫𝐭|χ(𝐡𝐭𝟏𝐚𝐭)),\sum_{\rho\in\mathcal{P}}w_{0}^{\rho}\rho(\mathbf{or_{t}}\,|\,\chi(\mathbf{h_{t-1}}\mathbf{a_{t}})),

where χ\chi is a feature function that maps arbitrarily long history sequences into an nn-dimensional feature space, usually a strict subset of n\mathbb{R}^{n}. Depending on the application, such a χ\chi could be hand-coded by domain experts, or picked from a class of possible feature functions using model-selection principles. The aim is to select a mapping χ\chi such that the induced process can facilitate learning without severely compromising performance. Theoretical approaches to this question have typically focused on providing conditions that minimise the error in the action-value function between the abstract and original process [1, 84]. The Φ\PhiMDP framework [68] instead provides an optimisation criterion for ranking candidate mappings χ\chi based on how well the state-action-reward sequence generated by χ\chi can be modelled as an MDP whilst still being predictive of the rewards. A good χ\chi results in a model that is Markovian and predictive of future rewards, facilitating the efficient learning of good actions that maximise the expected long-term reward. The class of possible feature functions can be defined using formal logic [51, 36, 87, 40] or obtained from embedding techniques [12, 21, 55] and Deep Learning techniques [91, 5].
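As a minimal sketch of the abstracted mixture estimator above (class and method names are ours), each candidate model predicts the next joint percept from the feature vector produced by χ\chi, and the mixture weights are updated by Bayes' rule after each percept:

```python
class AbstractedMixture:
    """Bayesian mixture over a finite model class, each model conditioning on a
    feature abstraction chi of the history, as in the displayed estimator."""

    def __init__(self, models, chi, prior=None):
        self.models = models               # each model: predict(features) -> {percept: prob}
        self.chi = chi                     # feature function over (history, action)
        n = len(models)
        self.w = list(prior) if prior else [1.0 / n] * n

    def predict(self, history, action):
        feats = self.chi(history, action)
        mix = {}
        for w, m in zip(self.w, self.models):
            for percept, p in m.predict(feats).items():
                mix[percept] = mix.get(percept, 0.0) + w * p
        return mix

    def update(self, history, action, percept):
        feats = self.chi(history, action)
        posterior = [w * m.predict(feats).get(percept, 0.0)
                     for w, m in zip(self.w, self.models)]
        z = sum(posterior) or 1.0          # guard against zero evidence in the sketch
        self.w = [p / z for p in posterior]
```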

Two such algorithms, motivated by AIXI [67], are described in [139, 138]. In both cases, the formal agent knowledge representation and reasoning language is as described in [88, 87]. In particular, the syntax of the language comprises the terms of the λ\lambda-calculus [29], extended with type polymorphism and modalities to increase its expressiveness for modelling agent concepts. The semantics of the language follow Henkin [60] and Kripke [47], suitably extended. The inference engine has two interacting components: an equational-reasoning engine and a tableau theorem prover [37]. There is also a predicate rewrite system for defining and enumerating a set of predicates. The choice of formalism is informed by the standard arguments given in [45, 65] and the practical convenience of working with (the functional subset of) a language like Python. (Suitable alternatives to the formalism are studied in the Statistical Relational Artificial Intelligence literature [51, 108], including a few that cater specifically for relational reinforcement learning [40, 38] and Symbolic MDP/POMDPs [78, 112].) The feature selection problem was addressed through the use of a so-called random forest binary decision diagram algorithm to find a set of features that approximately minimise an adjusted version of the Φ\Phi-MDP criterion [93].

5.3 Bandit VCG Mechanisms

Note that, in practice, the agents’ valuation functions are not fully known but are estimated from the actual percepts they receive from the environment, which in turn depend on the joint actions chosen by the VCG mechanism. This means formula (13) in the mechanism-controlled environment protocol needs to be handled carefully; in particular, there is an exploration-exploitation trade-off: the agents need enough exploration to arrive at good estimates of their valuation functions before we can reliably compute the argmax𝐚𝔸\arg\max_{\mathbf{a}\in\mathbb{A}} over them. This problem can be addressed using bandit algorithms, and the techniques described in [74] that use upper-confidence bounds [6, 83] are directly applicable in our setting. A key finding in [74] is that (asymptotic) truthfulness is harder to achieve with agents that learn their valuation functions; this may also be an issue in our setting.
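As a simplified illustration of this trade-off (and not the algorithm of [74]), the mechanism could maintain optimistic upper-confidence estimates of each agent's mean value for each joint action and pick the joint action that maximises the optimistic social welfare:

```python
import math
from collections import defaultdict

class UCBJointActionSelector:
    """Optimistic selection of joint actions while the agents are still learning
    their valuations; a crude stand-in for the bandit techniques in [74]."""

    def __init__(self, joint_actions, c=2.0):
        self.joint_actions = list(joint_actions)
        self.c = c
        self.counts = defaultdict(int)                        # joint action -> pulls
        self.sums = defaultdict(lambda: defaultdict(float))   # action -> agent -> summed reports
        self.t = 0

    def select(self):
        self.t += 1
        def ucb(a):
            n = self.counts[a]
            if n == 0:
                return float('inf')                           # force initial exploration
            mean_welfare = sum(self.sums[a].values()) / n
            return mean_welfare + self.c * math.sqrt(math.log(self.t) / n)
        return max(self.joint_actions, key=ucb)

    def observe(self, joint_action, reported_values):
        # reported_values: dict mapping each agent to its realised/declared value
        self.counts[joint_action] += 1
        for i, v in reported_values.items():
            self.sums[joint_action][i] += v
```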

5.4 Markov VCG Mechanisms in Unknown Environments

§ 5.2 describes agents that learn in a mechanism-controlled environment. In this section, we take the perspective of the mechanism designer and look at reinforcement learning algorithms that can adjust the parameters of the VCG mechanism based on interactions with agents that are themselves capable of learning and adjusting.

We will first summarise the setup and algorithmic framework described in [107] and then describe some possible extensions. The environment with a controller and kk agents is defined by an episodic Markov Decision Process Ξ=(𝒮,𝒜,H,𝒫,{ri}i=0k)\Xi=(\mathcal{S},\mathcal{A},H,\mathcal{P},\{r_{i}\}_{i=0}^{k}), where 𝒮\mathcal{S} and 𝒜\mathcal{A} are the state and action spaces, HH is the length of each episode, 𝒫={𝒫t:𝒮×𝒜𝒟(𝒮)}t=1H\mathcal{P}=\{\mathcal{P}_{t}:\mathcal{S}\times\mathcal{A}\to\mathcal{D}(\mathcal{S})\}_{t=1}^{H} is the state transition function, and ri={ri,t:𝒮×𝒜[0,1]}t=1Hr_{i}=\{r_{i,t}:\mathcal{S}\times\mathcal{A}\to[0,1]\}_{t=1}^{H} are the reward functions, with r0r_{0} denoting the reward function for the controller and ri,1ikr_{i},1\leq i\leq k, denoting the reward function for agent ii. Except for r0r_{0}, the environment Ξ\Xi is unknown to the controller. The controller interacts with the kk agents in multiple episodes, with each episode lasting HH time steps. An initial state x1x_{1} is set at the start of each episode. For each time step tt in the episode, the controller observes the current state xt𝒮x_{t}\in\mathcal{S}, picks an action at𝒜a_{t}\in\mathcal{A}, and receives a reward r0,t(xt,at)r_{0,t}(x_{t},a_{t}). Each agent ii receives its own reward ri,t(xt,at)r_{i,t}(x_{t},a_{t}) and reports a value r~i,t(xt,at)\widetilde{r}_{i,t}(x_{t},a_{t}) to the controller. At the end of the episode, the controller charges each agent ii a price pip_{i}. Given the controller’s policy function π={πt:𝒮𝒜}t=1H\pi=\{\pi_{t}:\mathcal{S}\to\mathcal{A}\}_{t=1}^{H} and the prices {pi}i=1k\{p_{i}\}_{i=1}^{k}, the controller’s utility for the episode is defined by

u0=V1π(x1;r0)+i=1kpi,u_{0}=V_{1}^{\pi}(x_{1};r_{0})+\sum_{i=1}^{k}p_{i},

and each agent ii’s utility for the episode is defined by ui=V1π(x1;ri)piu_{i}=V_{1}^{\pi}(x_{1};r_{i})-p_{i}, where

Vhπ(x;r)=t=hH𝔼π,𝒫[rt(xt,πt(xt))xh=x].V_{h}^{\pi}(x;r)=\sum_{t=h}^{H}\mathbb{E}_{\pi,\mathcal{P}}[r_{t}(x_{t},\pi_{t}(x_{t}))\mid x_{h}=x].

The controller’s goal is to learn the policy π\pi^{*} and pricing {pi}i=1k\{p_{i}\}_{i=1}^{k} that implements the so-called Markov VCG mechanism

π(x)=argmaxπV1π(x;R)\displaystyle\pi^{*}(x)=\arg\max_{\pi}V_{1}^{\pi}(x;R) (19)
πi(x)=argmaxπV1π(x;Ri)\displaystyle\pi^{*}_{-i}(x)=\arg\max_{\pi}V_{1}^{\pi}(x;R^{-i}) (20)
pi(x)=V1πi(x,Ri)V1π(x,Ri),\displaystyle p_{i}(x)=V_{1}^{\pi^{*}_{-i}}\bigl{(}x,R^{-i}\bigr{)}-V_{1}^{\pi^{*}}\bigl{(}x,R^{-i}\bigr{)}, (21)

where R=j=0krjR=\sum_{j=0}^{k}r_{j} and Ri=RriR^{-i}=R-r_{i}. The mechanism has the following property:

Lemma 7 ([89]).

The Markov VCG mechanism is incentive compatible and individually rational.

Of course, one can only implement the Markov VCG mechanism directly if the 𝒫\mathcal{P} and {ri}i=0k\{r_{i}\}_{i=0}^{k} components of the underlying episodic MDP are known and (19) and (20) can be computed efficiently. In practice, the controller has to learn 𝒫\mathcal{P} and {ri}i=0k\{r_{i}\}_{i=0}^{k} from interactions with the environment and the agents, and (19) and (20) need to be handled using function approximation when the state and action spaces 𝒮\mathcal{S} and 𝒜\mathcal{A} are large.
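For intuition, when the episodic MDP is small and fully known, the Markov VCG mechanism (19)-(21) can be computed exactly by finite-horizon value iteration; the following tabular sketch is ours and only illustrates the definitions:

```python
import numpy as np

def greedy_policy(P, r, H):
    """Finite-horizon value iteration. P has shape [H, S, A, S], r has shape
    [H, S, A]; returns the optimal values V (shape [H+1, S]) and greedy policy."""
    S = P.shape[1]
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r[h] + P[h] @ V[h + 1]         # shape [S, A]
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return V, pi

def policy_value(P, r, H, pi):
    """Value of a fixed policy pi under reward r."""
    S = P.shape[1]
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        for s in range(S):
            a = pi[h, s]
            V[h, s] = r[h, s, a] + P[h, s, a] @ V[h + 1]
    return V

def markov_vcg(P, rewards, x1):
    """Exact Markov VCG for a known tabular episodic MDP.
    rewards is the list r_0, ..., r_k, each of shape [H, S, A]."""
    H = P.shape[0]
    R = sum(rewards)
    V_star, pi_star = greedy_policy(P, R, H)                  # (19)
    prices = []
    for i in range(1, len(rewards)):                          # agents 1..k (r_0 is the controller's)
        R_mi = R - rewards[i]
        V_mi, _ = greedy_policy(P, R_mi, H)                   # (20)
        V_pistar_mi = policy_value(P, R_mi, H, pi_star)
        prices.append(V_mi[0, x1] - V_pistar_mi[0, x1])       # (21)
    return V_star[0, x1], prices
```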

In [107], a reinforcement learning framework was proposed for learning Markov VCG with Least-Squares Value Iteration (LSVI) [73]. There are two phases: an exploration phase and an exploitation phase. During the exploration phase, the controller uses reward-free reinforcement learning [72] to interact with the environment and the kk agents to appropriately explore the underlying MDP. This exploration phase is crucial because the controller would need enough data to approximate (20) and (21) by simulating counterfactual scenarios for when each agent ii is missing from the environment. During the exploitation phase, the controller will then repeat the following steps in each episode tt:

  1. Construct a π^t\widehat{\pi}^{t} that approximates (19) using LSVI on previously collected data DD;

  2. Learn an approximation Fti(x)F_{t}^{-i}(x) of V1πi(x,Ri)V_{1}^{\pi^{*}_{-i}}\bigl{(}x,R^{-i}\bigr{)}, the first term in (21), using LSVI on collected data DD;

  3. Learn an approximation Gti(x)G_{t}^{-i}(x) of V1π(x,Ri)V_{1}^{\pi^{*}}\bigl{(}x,R^{-i}\bigr{)}, the second term in (21), using LSVI on collected data DD and π^t\widehat{\pi}^{t} from Step 1;

  4. For time step h=1,,Hh=1,\ldots,H, observe state xt,hx_{t,h}, take action at,h=π^t(xt,h)a_{t,h}=\widehat{\pi}^{t}(x_{t,h}) and observe r~i,t,h(xt,h,at,h)\widetilde{r}_{i,t,h}(x_{t,h},a_{t,h}) from each agent ii;

  5. Charge each agent ii the price pi,t(x1)=Fti(xt,1)Gti(xt,1)p_{i,t}(x_{1})=F_{t}^{-i}(x_{t,1})-G_{t}^{-i}(x_{t,1});

  6. Add {(xt,h,at,h,{r~i,t,h(xt,h,at,h)}i=1k)}h=1H\{(x_{t,h},a_{t,h},\{\widetilde{r}_{i,t,h}(x_{t,h},a_{t,h})\}_{i=1}^{k})\}_{h=1}^{H} to DD.

In the first three steps, the Least-Squares Value Iteration algorithm is used, in conjunction with a suitable model class \mathcal{F}, to construct a sequence of functions (Q^t,h)h=1H(\widehat{Q}_{t,h})_{h=1}^{H} that approximates the Q-function for the corresponding value function V1π(,r)V_{1}^{\pi}(\cdot,r) at episode tt via the following optimisation problem, from h=H,,1h=H,\ldots,1:

Q^t,h=argminfτ=1t1[rτ,h(xτ,h,aτ,h)+V~h+1π(xτ,h+1)f(xτ,h,aτ,h)]2+reg(f),\displaystyle\widehat{Q}_{t,h}=\arg\min_{f\in\mathcal{F}}\sum_{\tau=1}^{t-1}\bigl{[}r_{\tau,h}(x_{\tau,h},a_{\tau,h})+\widetilde{V}_{h+1}^{\pi}(x_{\tau,h+1})-f(x_{\tau,h},a_{\tau,h})\bigr{]}^{2}+\text{reg}(f),

where reg(ff) is a suitable regularisation term, and V~h+1π(x)=Q^t,h+1(x,π(x))\widetilde{V}_{h+1}^{\pi}(x)=\widehat{Q}_{t,h+1}(x,\pi(x)). In [107], the authors take \mathcal{F} to be linear functions of the form f(x,a)=w,χ(x,a)f(x,a)=\langle w,\chi(x,a)\rangle for some feature mapping χ(,)m\chi(\cdot,\cdot)\in\mathbb{R}^{m} and give efficient algorithms for linear MDPs that can (i) learn to recover the Markov VCG mechanism from data, and (ii) achieve regret, with respect to the optimal Markov VCG mechanism, upper-bounded by O(T2/3)O(T^{2/3}), where TT is the total number of episodes.
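For concreteness, one backward pass of LSVI with this linear function class and a ridge regulariser can be sketched as follows (the names are ours and this is not the implementation in [107]); Steps 2 and 3 above would use the fixed policy π^t\widehat{\pi}^{t} in place of the max when forming the plug-in value:

```python
import numpy as np

def lsvi_linear(data, chi, H, dim, lam=1.0):
    """One backward pass of Least-Squares Value Iteration with linear features.
    data[h] is a list of transitions (x, a, r, x_next, next_actions) at step h
    collected over past episodes, and chi(x, a) is an m-dimensional feature
    vector; the returned weights define Q_h(x, a) ~ <w[h], chi(x, a)>."""
    w = [np.zeros(dim) for _ in range(H + 1)]     # w[H] = 0: no value after the horizon

    def v_plug_in(h, x, actions):                 # V~_h(x) = max_a <w[h], chi(x, a)>
        return max(float(chi(x, a) @ w[h]) for a in actions)

    for h in reversed(range(H)):
        A = lam * np.eye(dim)                     # ridge regulariser reg(f) = lam * ||w||^2
        b = np.zeros(dim)
        for (x, a, r, x_next, next_actions) in data[h]:
            phi = chi(x, a)
            target = r + v_plug_in(h + 1, x_next, next_actions)
            A += np.outer(phi, phi)
            b += phi * target
        w[h] = np.linalg.solve(A, b)
    return w
```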

For general MDPs or partially observable MDPs, we can replace the linear-function-class LSVI with approximate AIXI algorithms like [139, 138] to learn Markov VCG mechanisms from data.

5.5 Mechanism-level RL vs Agent-level RL

Note that (19) is a special case of (14)-(15), in that it is the policy (executed by the mechanism) that maximises iu1,i(ϵ)\sum_{i}u_{1,i}(\epsilon) (see (16)) when the underlying environment ϕ\phi is an episodic MDP, and all the agents have the same horizon mm. The key advantage of performing reinforcement learning at the mechanism level is that, when the mechanism is (approximately) incentive compatible, the RL algorithm has access to and can learn from all available data from interactions between the agents and the environment and mechanism, including all the individual rewards and payments. The key disadvantage is that it appears necessary to assume that all the agents have the exact same horizon. In comparison, the situation is reversed when RL is done at the level of individual agents: each agent’s RL algorithm usually only has access to the subset of the interaction data that it has a role in generating, but each agent can use a different RL algorithm with different parameters (such as horizon or discount factor) and a different function-approximation model class.

6 Applications

6.1 Paperclips and All That

The paperclip maximiser [19] is a thought experiment in the AI safety folklore that illustrates a potential risk from developing advanced AI systems with misaligned goals or values. Here is the idea: Imagine an AI system is tasked with the goal of maximising the production of paperclips. (This could be an explicit high-level goal set by a human, or a subgoal inferred as needed by the AI system for fulfilling a higher-level goal.) As the AI system becomes more and more intelligent, it may pursue this paperclip-production goal with relentless efficiency, converting all available resources and matter into paperclips, potentially leading to disastrous consequences for humanity and the environment. A much older (and less trivial) variation of the problem was articulated by AI pioneer Marvin Minsky, who suggested that an AI system designed to solve the Riemann hypothesis might decide to take over all of Earth’s resources to build supercomputers to help achieve its goal.

While these are obviously extreme examples, these thought experiments help focus our minds on the following key issues in the development of increasingly powerful AI systems:

  • Even seemingly harmless or trivial goals, if pursued by a superintelligent AI system without proper constraints, could lead to catastrophic outcomes as the AI single-mindedly optimises for that goal.

  • In particular, in single-mindedly pursuing a goal, a superintelligent AI may exhibit so-called convergent instrumental behaviour, such as acquiring power and resources and conducting self-improvement as subgoals, to help it achieve its top-level goals at all costs, even if those actions conflict with human values or well-being.

These thought experiments underscore the importance of instilling the right goals, values, and constraints into AI systems from the outset, as it may be extremely difficult or impossible to retrofit them into a superintelligent system once created.

There is a rich literature [50, 44, 28] on aligning the design of AI systems with human values, and there are many useful analyses in the single-agent general reinforcement learning (GRL) setting (§ 2.1) on how to ensure, for example, that

  • the agent infers the reward function from a human by observing the person’s actions through cooperative inverse reinforcement learning [58], leading to a provable solution for the off-switch problem [57] in some cases; and

  • an AI agent is prevented from performing adverse ‘self-improvement’ by tampering with its own reward function, through robust learning techniques that incorporate causality [42] and/or domain knowledge [43].

We argue here that, in the multi-agent GRL setting, additional controls in the form of imposing a social cost on agent actions can help prevent paperclip-maximiser-style AI catastrophes. In particular, when multiple superintelligent AI systems act in the same environment, an AI system cannot unilaterally perform actions that destroy humanity and the environment, or engage in convergent instrumental behaviour like acquiring all available power and resources, without encountering significant friction and obstacles. Most such actions, and their prevention by other agents (human or AI), are mutually exclusive and can therefore be subjected to control through economic mechanisms like that described in § 4, imposed either explicitly through rules and regulations or implicitly through laws of nature (like physics and biology [24, 109]). This form of social control works in concert and in a complementary way with the controls at the single-agent GRL level. Some of these controls are likely necessary but none in isolation is sufficient; together they may be.

In the paperclip maximiser example, there are two forces that will stop paperclip production from spiralling out of control. Firstly, the environment will provide diminishing (external) rewards for the agent to produce more and more paperclips, assuming there are controls in place to prevent wireheading issues [92, 137] where the agent actively manipulates the environment to give it false rewards or changes its own perception of input from the environment. Secondly, at the same time that the utility of the paperclip maximiser agent is decreasing, the utility of other agents in the same environment in taking competing actions to prevent paperclip production will increase significantly as unsustainable paperclip production threatens the environment and the other agents’ welfare. Thus, at some stage, the utility of the paperclip maximiser agent in producing more paperclips will become lower than the collective utility of other agents’ proposed actions to stop further paperclip production, and a VCG-style market mechanism will choose the latter over the former and paperclip production stops. The argument above does rely on an assumption that the agents are operating on a more-or-less level playing field, where they need to take others’ welfare into consideration when acting. In a completely lopsided environment where there is only one superintelligent agent and its welfare dominates that of all others, which could come about by design, accident, or over time through phenomena like the Matthew effect (aka rich-get-richer or winner-take-all), social controls will not be able to stop unsustainable paperclip production, and it will only stop when the one superintelligent agent wants it to stop. Both the conditions that uphold the Matthew effect and the circumstances that cause it to fail are studied in the literature [110, 13, 9]; additional controls informed by that literature will likely be needed on top of what we propose in § 4.

The argument presented in this section has similar motivations to those presented in [34]. We expect to see a lot more progress in this research topic in the coming months and years.

6.2 Cap-and-Trade to Control Pollution

Consider a set of oil refineries {R1,R2,,Rk}\{R_{1},R_{2},\ldots,R_{k}\}, where k>1k>1. Each refinery RiR_{i}:

  • produces unleaded fuel that can be distributed and sold in the retail market for $2 per litre. (We assume the output of the refineries is not big enough to change the market demand, and therefore the price of fuel.);

  • requires $1.80 in input cost (crude oil, chemicals, electricity, labour, distribution) to produce and distribute 1 litre of fuel (for details, see [46]);

  • can produce up to 100 million litres of fuel per day; and

  • produces greenhouse gases, costed at $190 per cubic ton. (See [2] for more detailed modelling.)

The refineries are not, however, equally efficient. Refinery RiR_{i} emits

si(y)=mi(y3512y2+200y+888)s_{i}(y)=m_{i}\left(\frac{y^{3}}{5}-12y^{2}+200y+888\right)

cubic tons of greenhouse gases per day, where yy is the amount of fuel (in millions of litres) it produces per day and mim_{i} is an inefficiency factor. Throughout this example we set mi=im_{i}=i; thus refinery R1R_{1} is twice as efficient as R2R_{2}, three times as efficient as R3R_{3}, and so on.

Figure 2: Plots of ss and s1s^{-1} for refineries R1R_{1} and R2R_{2}.

Graphs of s1s_{1} and s2s_{2} are shown in Fig 2. The greenhouse gas emissions increase at the start, then drop as the refinery achieves its optimum operating efficiency, after which they increase again rapidly. We also plot

si1(g):=max{y𝕐:(y)=0},{s_i}^{-1}\left(g\right):=\max{\{y\in\mathbb{Y}:\Im(y)=0\}},

where the set 𝕐\mathbb{Y} comprises 0 and the three roots of the cubic equation

mi5y312miy2+200miy+(888mig)=0,\frac{m_{i}}{5}y^{3}-12m_{i}y^{2}+200m_{i}y+(888m_{i}-g)=0,

that is, the roots of si(y)=gs_{i}(y)=g. Taking the largest real element of 𝕐\mathbb{Y} (which always contains 0) ensures si1{s_{i}}^{-1} is defined for all g0g\geq 0 and is non-decreasing. Note in Fig 2 that s11s^{-1}_{1} and s21s^{-1}_{2} have step increases at s1(28)1,470s_{1}(28)\approx 1,470 and s2(28)2,940s_{2}(28)\approx 2,940. Our construction of si1{s_{i}}^{-1} assumes the refineries will maximise their production (and hence profit) for a given cap on greenhouse gas emissions. For example, if Refinery R2R_{2} is permitted to emit 3,0003,000 cubic tons of greenhouse gases, it will choose to produce 30.530.5 million litres of fuel rather than 3.93.9 or 25.625.6 million litres.
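For concreteness, the emission function and its generalised inverse can be computed as follows (a straightforward sketch; the explicit cap at the 100 million litre capacity is our addition):

```python
import numpy as np

def s(i, y):
    """Daily emissions (cubic tons) of refinery R_i producing y million litres."""
    return i * (y**3 / 5 - 12 * y**2 + 200 * y + 888)

def s_inv(i, g, capacity=100.0):
    """Largest real non-negative root of s_i(y) = g (0 if there is none),
    capped at the refinery's production capacity."""
    roots = np.roots([i / 5, -12 * i, 200 * i, 888 * i - g])
    real = [r.real for r in roots if abs(r.imag) < 1e-9 and r.real >= 0]
    return min(max(real, default=0.0), capacity)

# e.g. s_inv(1, 3000) ~ 42.3 and s_inv(2, 3000) ~ 30.5 million litres,
# consistent with the figures quoted above and in the auction example below.
```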

No Price on Pollution

Consider the case of k=2k=2. If the two refineries do not have to pay for environmental damage from greenhouse gas emissions, each plant will produce at the maximum capacity of 100 million litres per day since they each make $0.20 operating profit per litre. The total greenhouse gas emission between them will be s1(100)+s2(100)=302,664s_{1}\left(100\right)+s_{2}\left(100\right)=302,664 cubic tons per day.

Figure 3: Cubic tons of greenhouse gas emitted at different levels of fuel production
Fixed Price for Pollution

If the two refineries have to pay the $190 per cubic ton cost but there is no cap on greenhouse gas emissions, they will seek to maximise their net profit functions nn, given respectively by n1(y)=$200ky$190s1(y)n_{1}\left(y\right)=\$200k\cdot y-\$190\cdot s_{1}\left(y\right) and n2(y)=$200ky$190s2(y)n_{2}\left(y\right)=\$200k\cdot y-\$190\cdot s_{2}\left(y\right). Taking derivatives with respect to yy, we have:

n1(y)=114y2+4560y+162000\displaystyle n^{\prime}_{1}\left(y\right)=-114y^{2}+4560y+162000
n2(y)=228y2+9120y+124000\displaystyle n^{\prime}_{2}\left(y\right)=-228y^{2}+9120y+124000

which have positive roots at y1=20+1034619y^{*}_{1}=20+10\sqrt{\frac{346}{19}} and y2=20+1053857.y^{*}_{2}=20+10\sqrt{\frac{538}{57}}. That is, Refinery R1R_{1} will stop production at y162.7y^{*}_{1}\approx 62.7 million litres per day, for a daily net profit of n1(y1)$9.58 millionn_{1}\left(y^{*}_{1}\right)\approx\$9.58\text{ million}. Similarly, Refinery R2R_{2} will stop production at 50.7250.72 million litres per day, for a daily net profit of $7.77\$7.77 million. See Fig 3. At the $190 price, the total greenhouse gas emission between the two refineries is thus s1(y1)+s2(y2)28,682s_{1}\left(y^{*}_{1}\right)+s_{2}\left(y^{*}_{2}\right)\approx 28,682 cubic tons per day.
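As a numerical check of the fixed-price case, the stationary points and net profits above can be recomputed directly (a minimal sketch; the function names and the explicit capacity cap are ours):

```python
import numpy as np

PRICE_PER_TON = 190.0     # $ per cubic ton of greenhouse gas
MARGIN = 200_000.0        # $ operating margin per million litres ($0.20 per litre)

def s(i, y):
    # emission function s_i(y) from above
    return i * (y**3 / 5 - 12 * y**2 + 200 * y + 888)

def optimal_production(i, capacity=100.0):
    # positive root of n_i'(y) = MARGIN - PRICE_PER_TON * s_i'(y) = 0
    coeffs = [-PRICE_PER_TON * 3 * i / 5,
              PRICE_PER_TON * 24 * i,
              MARGIN - PRICE_PER_TON * 200 * i]
    roots = [r.real for r in np.roots(coeffs) if abs(r.imag) < 1e-9]
    return min(max(roots), capacity)

for i in (1, 2):
    y = optimal_production(i)
    profit = MARGIN * y - PRICE_PER_TON * s(i, y)
    print(f"R{i}: {y:.1f}m litres/day, net profit ${profit / 1e6:.2f}m/day")
# prints approximately 62.7 / $9.58m for R_1 and 50.7 / $7.77m for R_2
```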

Market Pricing for Capped Pollution

Now, instead of setting the price of greenhouse gas emissions at $190 per cubic ton, which requires a lot of economic modelling and assumptions, what if we just want to cap the total emissions of the two refineries at, say, 15 thousand cubic tons per day and let the market decide the price of greenhouse gas emission? This can be done via sequential Second Price auctions of 15,000 pollution permits every day, auctioned in, say, tranches of 3,000, each permit entitling the owner to emit 1 cubic ton of greenhouse gas. Denoting the auction winner by ii^{*}, the change in each refinery’s net profit after tranche tt is:

Δnt,i(Δyt,i)={ρt,iat,ii=i0otherwise\Delta n_{t,i}\left(\Delta y_{t,i}\right)=\begin{cases}{\rho}_{t,i^{*}}-a_{t,-i^{*}}&i=i^{*}\\ 0&\text{otherwise}\end{cases}

where ρt,i=$200kΔyt,i=$200k(si1(gt,i+3000)si1(gt,i)){\rho}_{t,i}=\$200k\cdot\Delta y_{t,i}=\$200k\cdot\left({s_{i}}^{-1}\left(g_{t,i}+3000\right)-{s_{i}}^{-1}\left(g_{t,i}\right)\right), gt,ig_{t,i} is RiR_{i}’s permit holdings before the tranche, and at,ia_{t,i} is RiR_{i}’s bid for tranche tt.

Greedy Algorithm

Let’s work through the example shown in Table 1, where each refinery pursues a greedy bid strategy. That is, for tranche tt refinery RiR_{i} always bids at,i=ρt,ia_{t,i}=\rho_{t,i}. In the first tranche of 3000 permits, Refinery R1R_{1} works out that it can produce Δy1,1=42\Delta y_{1,1}=42 million litres of fuel with 3000 permits, since s11(3000)42{s_{1}}^{-1}(3000)\approx 42, and it is willing to pay a max of ρ1,1=$200k×42=$8.4\rho_{1,1}=\$200k\times 42=\$8.4m to win those 3000 permits. Refinery R2R_{2} similarly works out that it can produce Δy1,2=s21(3000)31\Delta y_{1,2}={s_{2}}^{-1}(3000)\approx 31 million litres of fuel and bids ρ1,2=$6.1\rho_{1,2}=\$6.1m. The VCG mechanism picks R1R_{1} as the winner, and R1R_{1} pays $6.1m to produce 42 million litres of fuel, for a profit of $2.3m. In the second tranche of 3000 permits, Refinery R2R_{2} submits the same bid, but Refinery R1R_{1} submits a much lower bid of ρ2,1=$1.6\rho_{2,1}=\$1.6m, since it can only produce an extra Δy2,1=s11(6000)s11(3000)5042=8\Delta y_{2,1}={s_{1}}^{-1}(6000)-{s_{1}}^{-1}(3000)\approx 50-42=8 million litres of fuel with the additional 3000 permits. Thus, R2R_{2} wins the auction and pays only $1.6m for the second tranche of 3000 permits, which it uses to produce 31 million litres of fuel for a profit of $4.5 million. The following tranches proceed in a similar way. In the end, we have

  • Refinery R1R_{1} winning 9000 permits for a total cost of $8.0m, and using them to produce 55 million litres of fuel for a total net profit of $3.1 million;

  • Refinery R2R_{2} winning 6000 permits for a total cost of $3.2m, and using them to produce 42 million litres of fuel for a total net profit of $5.3 million;

  • A total of $11.2 million was collected for the 15,000 permits.

The result is interesting in that the more efficient Refinery R1R_{1} ends up winning more permits as expected, which it uses to produce more fuel but for a lower total profit compared to Refinery R2R_{2}.

Tranche R1R_{1} Bid R2R_{2} Bid Payment Result
3000 42m / $8.4m 31m / $6.1m $6.1m R1R_{1} / $2.3m
3000 8m / $1.6m 31m / $6.1m $1.6m R2R_{2} / $4.5m
3000 8m / $1.6m 12m / $2.4m $1.6m R2R_{2} / $0.8m
3000 8m / $1.6m 4m / $0.9m $0.9m R1R_{1} / $0.7m
3000 5m / $1.0m 4m / $0.9m $0.9m R1R_{1} / $0.1m
Table 1: The auction results assuming each refinery pursues a greedy strategy.
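The greedy play in Table 1 can be reproduced, up to rounding and tie-breaking, with a short simulation (a sketch under the assumptions of this subsection; it is not the code used for the experiments below):

```python
import numpy as np

def s_inv(i, g, capacity=100.0):
    # largest real non-negative root of s_i(y) = g, capped at capacity (see above)
    roots = np.roots([i / 5, -12 * i, 200 * i, 888 * i - g])
    real = [r.real for r in roots if abs(r.imag) < 1e-9 and r.real >= 0]
    return min(max(real, default=0.0), capacity)

def run_greedy_auction(tranches=5, tranche_size=3000, margin=200_000.0):
    """Sequential second-price auctions of permit tranches with greedy bidding
    a_{t,i} = rho_{t,i}, as in Table 1 (two bidders)."""
    permits = {1: 0, 2: 0}                      # permits currently held by R_1 and R_2
    paid = {1: 0.0, 2: 0.0}
    for t in range(tranches):
        # each refinery's value for the extra tranche given its current holdings
        bids = {i: margin * (s_inv(i, permits[i] + tranche_size) - s_inv(i, permits[i]))
                for i in permits}
        winner = max(bids, key=bids.get)
        price = min(bids.values())              # second-price payment (two bidders)
        permits[winner] += tranche_size
        paid[winner] += price
        print(f"tranche {t + 1}: R{winner} wins, pays ${price / 1e6:.1f}m, "
              f"surplus ${(bids[winner] - price) / 1e6:.1f}m")
    for i in permits:
        output = s_inv(i, permits[i])
        print(f"R{i}: {permits[i]} permits, {output:.0f}m litres, "
              f"net profit ${(margin * output - paid[i]) / 1e6:.1f}m")

run_greedy_auction()
```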
Reinforcement Learning

Each refinery can do better by using reinforcement learning to optimise its sequence of bids, using the approaches described in § 5, and possibly engaging in collusion / cooperation. Figs 4, 5 and 6 show the results of running two Q-Learning agents, representing the two refineries, on the cap-and-trade problem. Each agent’s state is the 2-tuple (number of permits won by R1R_{1}, number of permits won by R2R_{2}). The action space is a set of 170 bids, 𝔸={0, 50k, 100k, 8.4m}\mathbb{A}~{}=~{}\{0,\;50\text{k},\;100\text{k},\;\ldots\;8.4\text{m}\}. Agents are allowed to bid any amount in 𝔸\mathbb{A}, even if that amount is higher than ρt,i{\rho}_{t,i}. Any ties arising in the VCG mechanism due to agents bidding the same amount are broken randomly. Each experiment uses the same random seed and RL parameters. The only difference across the experiments is the reward function.

In Fig 4 we incentivise the two agents to maximise their individual net profits, by rewarding only the agent ii^{*} that won the auction:

rt,i(1)=Δnt,i={ρt,iat,ii=i0otherwise{r_{t,i}}^{(1)}=\Delta n_{t,i}=\begin{cases}{\rho}_{t,i^{*}}-a_{t,-i^{*}}&i=i^{*}\\ 0&\text{otherwise}\end{cases}

The agents learn a joint policy that results in them each making a total net profit of around $9m over the course of the five tranches, much higher than the net profits obtained when following the greedy strategy earlier, of $3.1\$3.1m and $5.3\$5.3m for R1R_{1} and R2R_{2} respectively. The average permit price stabilises at $100\$100.

In Fig 5 we incentivise the agents to maximise their shared net profit:

rt,i(2)=ρt,iat,i{r_{t,i}}^{(2)}={\rho}_{t,i^{*}}-a_{t,-i^{*}}

Here the agents learn a joint policy that always results in a return for each agent (and a total shared net profit) of $19.49m over the five tranches and, importantly, drives the average permit price down to zero. Note the agents have learned this collusive policy solely by observing the permit holdings of each participant in the auction – there was never any explicit communication between them. This phenomenon is analysed more carefully in Fig 7 in Appendix B.

Figure 4: RL with reward function r(1)r^{(1)} (incentivise individual profit)
Figure 5: RL with reward function r(2)r^{(2)} (incentivise joint profit)
Figure 6: RL with reward function r(3)r^{(3)} (zero-sum reward)

In Fig 6 we incentivise the agents to maximise their individual net profit and minimise the other agent’s net profit:

rt,i(3)={ρt,iat,ii=i(ρt,iat,i)otherwise{r_{t,i}}^{(3)}=\begin{cases}{\rho}_{t,i^{*}}-a_{t,-i^{*}}&i=i^{*}\\ -\left({\rho}_{t,i^{*}}-a_{t,-i^{*}}\right)&\text{otherwise}\end{cases}

As expected, in this case, competition between the agents keeps the average permit price well above zero; it eventually settles into an oscillatory pattern around the $550 mark. This phenomenon is analysed in more detail in Fig 8 in Appendix B.

6.3 Automated Penetration Testing

There is growing interest among cyber security researchers in investigating the use of reinforcement learning (RL) algorithms, both model-based and model-free, to perform automated penetration testing of networks and cyber-physical systems [52, 114, 63, 113, 123, 140]. The key technical challenge in these problems usually lies in handling the large, discrete observation and action spaces. The former is typically addressed using vulnerability scanners like nmap in conjunction with knowledge bases like the Common Vulnerabilities and Exposures (CVE) database, and the latter is typically tackled by building the RL algorithms on top of mature penetration testing tool suites like Metasploit [76] and logic-based planning methods like MulVAL for attack-tree generation [102].

In this section, we look at how to formulate the automated penetration testing problem in the case where multiple RL agents cooperate to find, within a given time, as many vulnerabilities as possible on a system being tested, under the constraint that the agents’ actions must not be too disruptive to each other or to the system being tested.

For each of the agents, the observation space is whatever can be returned from their scanning software. We will assume here that it is a JSON object, possibly with embedded graphics in standard formats, that can be formally represented and reasoned with in higher-order logic [87, 88], which provides us with tools for

  • defining and systematically enumerating possible feature functions, and

  • performing logical and probabilistic inference, including Bayesian filtering [25], with respect to a knowledge base (like CVE).

The action space is made up of two types of (parametrisable) action templates:

  • run a scanning software with certain parameter values;

  • run an existing exploit script in the Metasploit library with certain parameter values.

Instantiating the parameters in the action templates can be done through search algorithms, possibly within a formal logic setup if we can define commonly applicable rules and use unification techniques to bind the free variables in a proof procedure [102, 54].

How about the reward function? We assume each action aa has a computational cost C(a)C(a), which quantifies the amount of compute resources – possibly a weighted combination of CPU, memory and disk usage – required to execute the action on the system being tested. In practice, the agents have an estimate C^(a)\hat{C}(a) obtained from historical data. This is motivated by the large literature on cost models for database queries and software tests [71, 136, 116]. Unsuccessful exploits are given reward 0, as long as they do not cause severe damage to the system being tested. Successful exploits should be given positive rewards, but obviously the severity of the system compromise should be taken into account here, with compromises like root shell access and remote code execution given higher rewards than simpler compromises like denial of service. For practical reasons, we will not be considering the full range of cyber harms as described in [3], but will instead look at a few proxy measures of the impact of the penetration-testing agent’s actions on the system being tested. There are a few mature scoring systems related to security standards, including

  • the Common Vulnerability Scoring System (CVSS), used in the Common Vulnerabilities and Exposures (CVE) program. A CVSS score for a vulnerability is computed in the range 0.0 – 10.0, using three groups of metrics: base, temporal, and environmental. Base metrics reflect a vulnerability’s exploitability, scope, and potential impacts of exploitation. Temporal metrics reflect the current state of exploit techniques or code availability, and the existence of any patches or workarounds. Environmental metrics enable the CVSS score to be customised.

  • the Common Weakness Scoring System (CWSS), used in the Common Weakness Enumeration (CWE) program. A CWSS score for a software weakness is computed in the range 0.0 – 100.0. In CWSS, three groups of metrics are defined: base finding, attack surface, and environmental. Base finding metrics are intended to capture the inherent risk of the weakness, confidence in the accuracy of the finding, and strength of controls. Attack Surface metrics assess the barriers that an attacker must overcome in order to exploit the weakness. Environmental metrics reflect characteristics of the weakness that are specific to a particular environment or operational context.

A suitable reward for a successful exploit aa can thus be defined as a (weighted) average of the CVSS and CWSS scores associated with the exploit. In addition, given the goal is to find all vulnerabilities (to the extent possible), it would also make sense to introduce exploration bonus rewards [120, 11, 101, 134].
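For illustration only, such a reward could be computed as follows, where the 50/50 weighting, the rescaling of CWSS to the CVSS range, and the count-based exploration bonus are arbitrary choices of ours rather than prescriptions of the standards:

```python
import math
from collections import defaultdict

visit_counts = defaultdict(int)     # how often each (host, vulnerability) pair has been hit

def exploit_reward(success, cvss, cwss, target, w_cvss=0.5, bonus_scale=1.0):
    """Reward for one exploit attempt; cvss is in [0, 10] and cwss in [0, 100]."""
    if not success:
        return 0.0                                            # unsuccessful, non-damaging exploit
    severity = w_cvss * cvss + (1 - w_cvss) * (cwss / 10.0)   # both rescaled to a 0-10 range
    visit_counts[target] += 1
    bonus = bonus_scale / math.sqrt(visit_counts[target])     # count-based exploration bonus
    return severity + bonus
```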

The following is the coordination protocol for the pen-testing agents, motivated by the pollution permits idea from § 6.2. At every time step tt, a certain number UtU_{t} of compute permits is auctioned. (The compute permits can only be used during the time step and cannot be carried forward to future time steps.) Each pen-testing agent ii has a current valuation function Vi,tV_{i,t} and submits the bid

maxaVi,t(a)subject toC^(a)Ut\max_{a}V_{i,t}(a)\;\;\;\text{subject to}\;\;\;\hat{C}(a)\leq U_{t}

to the VCG mechanism. The winning agent ii^{*} then executes its action aa^{*} and receives a percept that can be used to update its valuation function. The actual compute resource consumption C(a)C(a^{*}) is used to update C^()\hat{C}(\cdot). If UtC(a)>0U_{t}-C(a^{*})>0, then another round of the VCG auction is run to pick the next agent to perform its action, and this is repeated until UtU_{t} is exhausted.
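A sketch of this coordination loop is given below; the agent interface (best_action, value_of, execute, and so on) is hypothetical and simply names the quantities used in the protocol above:

```python
def run_timestep(agents, U_t):
    """One time step of the compute-permit protocol: keep auctioning the
    remaining budget until it is exhausted or no agent can afford an action.
    Assumes every executed action consumes a strictly positive amount of compute."""
    remaining = U_t
    while remaining > 0:
        # each agent proposes its best affordable action and its declared value for it
        bids = {}
        for agent in agents:
            action = agent.best_action(budget=remaining)      # argmax of V_{i,t} s.t. C_hat(a) <= remaining
            if action is not None:
                bids[agent] = (action, agent.value_of(action))
        if not bids:
            break
        winner = max(bids, key=lambda ag: bids[ag][1])        # highest declared value wins
        payment = max((v for ag, (_, v) in bids.items() if ag is not winner), default=0.0)
        action, _ = bids[winner]
        percept, actual_cost = winner.execute(action)         # run the scan / exploit
        winner.update_valuation(action, percept)              # RL update from the new percept
        winner.update_cost_model(action, actual_cost)         # refine C_hat(.)
        winner.charge(payment)                                # second-price style payment
        remaining -= actual_cost
```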

Experimental results for an implementation of a VCG-coordinated system of automated penetration testing agents based on Metasploit will be reported in a separate paper.

7 Discussion and Conclusion

In the spirit of [104], we considered in this paper the problem of social harms that can result from the interactions between a set of (powerful) general reinforcement learning agents in arbitrary unknown environments and proposed the use of (dynamic) VCG mechanisms to coordinate and control their collective behaviour. Our proposed setup is more general than existing formulations of multi-agent reinforcement learning with mechanism design in two ways:

  • the underlying environment is a history-based general reinforcement learning environment like in AIXI [69];

  • the reinforcement-learning agents participating in the environment can have different time horizons and adopt different strategies and algorithms, and learning can happen both at the mechanism level as well as the level of individual agents.

The generality of the setup opens up a wide range of algorithms and applications, including multi-agent problems with both cooperative and competitive dynamics, some of which we explore in § 5 and § 6. The setup closest to ours in generality in the literature is [70], with interesting applications like [143].

A key limitation of our proposed approach, especially when it comes to regulating powerful AGI agents, is that there is no fundamental way to enforce the VCG mechanism on such agents outside of purposely designed platform economies [77, 33, 41]. In particular, the proposed approach would not work on AGI agents operating "in the wild". Nevertheless, the study of the ideal market mechanism to regulate AGI agents is a useful step for understanding and benchmarking the design of more practical mechanisms like [75] that recommend but do not enforce social choices, and more indirect payment approaches like [141] and [80] to steer a set of agents to desired good behaviour. It would also be interesting, as future work, to understand how natural biological mechanisms like [24, 109] relate to ideal market mechanisms.

References

  • [1] David Abel, Dilip Arumugam, Lucas Lehnert, and Michael L. Littman. State abstractions for lifelong reinforcement learning. In Jennifer G. Dy and Andreas Krause, editors, ICML, pages 10–19, 2018.
  • [2] US Environmental Protection Agency. Report on the Social Cost of Greenhouse Gases: Estimates Incorporating Recent Scientific Advances. http://epa.gov, 2023.
  • [3] Ioannis Agrafiotis, Jason Nurse, Michael Goldsmith, Sadie Creese, and David Upton. A taxonomy of cyber-harms: Defining the impacts of cyber-attacks and understanding how they propagate. Journal of Cybersecurity, 4, 2018.
  • [4] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
  • [5] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38, 2017.
  • [6] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397–422, 2003.
  • [7] Robert J Aumann. Correlated equilibrium as an expression of Bayesian rationality. Econometrica: Journal of the Econometric Society, pages 1–18, 1987.
  • [8] Robert Axelrod. The emergence of cooperation among egoists. American Political Science Review, 75(2):306–318, 1981.
  • [9] Albert-László Barabási. The Formula: The Universal Laws of Success. Hachette UK, 2018.
  • [10] Jorge Barrera and Alfredo Garcia. Dynamic incentives for congestion control. IEEE Transactions on Automatic Control, 60(2):299–310, 2014.
  • [11] Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. In NeurIPS, pages 1471–1479, 2016.
  • [12] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, 2003.
  • [13] Franco Berbeglia and Pascal Van Hentenryck. Taming the Matthew effect in online markets with social influence. In AAAI, pages 10–16, 2017.
  • [14] Dirk Bergemann and Juuso Välimäki. Dynamic mechanism design: An introduction. Journal of Economic Literature, 57(2):235–274, 2019.
  • [15] Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8(6), 2007.
  • [16] Avrim Blum and Yishay Mansour. Learning, regret minimization, and equilibria. In Algorithmic Game Theory. CUP, 2007.
  • [17] Tilman Börgers. An Introduction to the Theory of Mechanism Design. Oxford University Press, 2015.
  • [18] Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.
  • [19] Nick Bostrom. Ethical issues in advanced artificial intelligence. Machine Ethics and Robot Ethics, pages 69–75, 2020.
  • [20] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
  • [21] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637, 2018.
  • [22] Ruggiero Cavallo, David C Parkes, and Satinder Singh. Optimal coordinated planning amongst self-interested agents with private state. In UAI, pages 55–62, 2006.
  • [23] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • [24] Krishnendu Chatterjee, Johannes G Reiter, and Martin A Nowak. Evolutionary dynamics of biological auctions. Theoretical Population Biology, 81(1):69–80, 2012.
  • [25] Dawei Chen, Samuel Yang-Zhao, John Lloyd, and Kee Siong Ng. Factored conditional filtering: Tracking states and estimating parameters in high-dimensional spaces. arXiv preprint arXiv:2206.02178, 2022.
  • [26] M Keith Chen and Michael Sheldon. Dynamic pricing in a labor market: Surge pricing and flexible work on the Uber platform. Ec, 16:455, 2016.
  • [27] Xi Chen and Binghui Peng. Hedging in games: Faster convergence of external and swap regrets. Advances in Neural Information Processing Systems, 33:18990–18999, 2020.
  • [28] Brian Christian. The alignment problem: How can machines learn human values? Atlantic Books, 2021.
  • [29] Alonzo Church. A formulation of the simple theory of types. Journal of Symbolic Logic, 5:56–68, 1940.
  • [30] Robert Coase. The nature of the firm. Economica, 4(16):386–405, 1937.
  • [31] Robert Coase. The problem of social cost. Journal of Law and Economics, 3(1):1–44, 1960.
  • [32] Robert Coase. The Firm, The Market, and The Law. UCP, 2012.
  • [33] Julie E Cohen. Law for the platform economy. UCDL Rev., 51:133, 2017.
  • [34] Vincent Conitzer, Rachel Freedman, Jobst Heitzig, Wesley H. Holliday, Bob M. Jacobs, Nathan Lambert, Milan Mossé, Eric Pacuit, Stuart Russell, Hailey Schoelkopf, Emanuel Tewolde, and William S. Zwicker. Social choice for AI alignment: Dealing with diverse human feedback. CoRR, abs/2404.10271, 2024.
  • [35] Richard Cornes and Todd Sandler. The Theory of Externalities, Public Goods, and Club Goods. Cambridge University Press, 1996.
  • [36] Luc De Raedt. Logical and relational learning. Springer Science & Business Media, 2008.
  • [37] Luis Fariñas del Cerro and Olivier Gasquet. A general framework for pattern-driven modal tableaux. Logic Journal of the IGPL, 10(1):51–83, 2002.
  • [38] Kurt Driessens. Relational reinforcement learning. In Claude Sammut and Geoffrey I. Webb, editors, Encyclopedia of Machine Learning and Data Mining, pages 1096–1103. Springer, 2017.
  • [39] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.
  • [40] Saso Dzeroski, Luc De Raedt, and Kurt Driessens. Relational reinforcement learning. Mach. Learn., 43(1/2):7–52, 2001.
  • [41] David S Evans. Matchmakers: the new economics of multisided platforms. Harvard Business Review Press, 2016.
  • [42] Tom Everitt, Marcus Hutter, Ramana Kumar, and Victoria Krakovna. Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective. Synthese, 198(Suppl 27):6435–6467, 2021.
  • [43] Tom Everitt, Victoria Krakovna, Laurent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. In IJCAI, pages 4705–4713, 2017.
  • [44] Tom Everitt, Gary Lea, and Marcus Hutter. AGI safety literature review. In Jérôme Lang, editor, IJCAI, pages 5441–5449. ijcai.org, 2018.
  • [45] W.M. Farmer. The seven virtues of simple type theory. Journal of Applied Logic, 6(3):267–286, 2008.
  • [46] Jean-Pierre Favennec. Economics of oil refining. In The Palgrave Handbook of International Energy Economics, pages 59–74. Springer, 2022.
  • [47] Melvin Fitting and Richard L. Mendelsohn. First-order Modal Logic. Kluwer Academic Publishers, 1998.
  • [48] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
  • [49] Drew Fudenberg and David K Levine. The theory of learning in games, volume 2. MIT press, 1998.
  • [50] Iason Gabriel. Artificial intelligence, values, and alignment. Minds and Machines, 30(3):411–437, 2020.
  • [51] L. Getoor and B. Taskar, editors. Introduction to Statistical Relational Learning. MIT Press, 2007.
  • [52] Mohamed C. Ghanem and Thomas M. Chen. Reinforcement learning for intelligent penetration testing. In Second World Conference on Smart Trends in Systems, Security and Sustainability, pages 185–192, 2018.
  • [53] Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar, et al. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359–483, 2015.
  • [54] Martin Giese. Incremental closure of free variable tableaux. In IJCAR, volume 2083 of LNCS, pages 545–560. Springer, 2001.
  • [55] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
  • [56] Peter Grünwald and Thijs Van Ommen. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12:1069–1103, 2017.
  • [57] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The off-switch game. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [58] Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. Advances in Neural Information Processing Systems, 29, 2016.
  • [59] Garrett Hardin. The tragedy of the commons. Science, 162:1243–1248, 1968.
  • [60] Leon Henkin. Completeness in the theory of types. Journal of Symbolic Logic, 15(2):81–91, 1950.
  • [61] Mark Herbster and Manfred K Warmuth. Tracking the best expert. Machine learning, 32(2):151–178, 1998.
  • [62] Dirk Hoeven, Tim van Erven, and Wojciech Kotłowski. The many faces of exponential weights in online learning. In Conference On Learning Theory, pages 2067–2092. PMLR, 2018.
  • [63] Zhenguo Hu, Razvan Beuran, and Yasuo Tan. Automated penetration testing using deep reinforcement learning. In European Symposium on Security and Privacy Workshops, pages 2–10. IEEE, 2020.
  • [64] Zhiyi Huang and Sampath Kannan. The exponential mechanism for social welfare: Private, truthful, and nearly optimal. In 53rd Annual Symposium on Foundations of Computer Science, pages 140–149. IEEE, 2012.
  • [65] John Hughes. Why functional programming matters. The Computer Journal, 32(2):98–107, 1989.
  • [66] Joon Suk Huh and Kirthevasan Kandasamy. Nash incentive-compatible online mechanism learning via weakly differentially private online learning. arXiv preprint arXiv:2407.04898, 2024.
  • [67] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.
  • [68] Marcus Hutter. Feature reinforcement learning: Part I. Unstructured MDPs. In J. Artif. Gen. Intell., 2009.
  • [69] Marcus Hutter, David Quarel, and Elliot Catt. An Introduction to Universal Artificial Intelligence. CRC Press, 2024.
  • [70] Dima Ivanov, Paul Dütting, Inbal Talgam-Cohen, Tonghan Wang, and David C Parkes. Principal-agent reinforcement learning: Orchestrating AI agents with contracts. arXiv preprint arXiv:2407.18074, 2024.
  • [71] Matthias Jarke and Jurgen Koch. Query optimization in database systems. ACM Computing surveys (CsUR), 16(2):111–152, 1984.
  • [72] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020.
  • [73] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
  • [74] Kirthevasan Kandasamy, Joseph E. Gonzalez, Michael I. Jordan, and Ion Stoica. VCG mechanism design with unknown agent values under stochastic bandit feedback. J. Mach. Learn. Res., 24:53:1–53:45, 2023.
  • [75] Michael Kearns, Mallesh Pai, Aaron Roth, and Jonathan Ullman. Mechanism design in large games: Incentives and privacy. In Innovations in Theoretical Computer Science, pages 403–410, 2014.
  • [76] David Kennedy, Jim O’gorman, Devon Kearns, and Mati Aharoni. Metasploit: The Penetration Tester’s Guide. No Starch Press, 2011.
  • [77] Martin Kenney and John Zysman. The rise of the platform economy. Issues in Science and Technology, 32(3):61, 2016.
  • [78] Kristian Kersting, Martijn van Otterlo, and Luc De Raedt. Bellman goes relational. In ICML, volume 69. ACM, 2004.
  • [79] Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, editors, ECML 2006, pages 282–293. Springer Berlin Heidelberg, 2006.
  • [80] Yoav Kolumbus, Joe Halpern, and Éva Tardos. Paying to do better: Games with payments between learning agents. arXiv:2405.20880, 2024.
  • [81] Wouter M Koolen and Steven de Rooij. Universal codes from switching strategies. IEEE Transactions on Information Theory, 59(11):7168–7185, 2013.
  • [82] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.
  • [83] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
  • [84] Lihong Li, Thomas J. Walsh, and Michael L. Littman. Towards a unified theory of state abstraction for MDPs. In International Symposium on Artificial Intelligence and Mathematics, 2006.
  • [85] Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer, 1997.
  • [86] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning, pages 157–163. Elsevier, 1994.
  • [87] John W. Lloyd. Logic for Learning: Learning Comprehensible Theories from Structured Data. Springer, 2003.
  • [88] John W. Lloyd and Kee Siong Ng. Declarative programming for agent applications. Autonomous Agents Multi Agent Systems, 23(2):224–272, 2011.
  • [89] Boxiang Lyu, Zhaoran Wang, Mladen Kolar, and Zhuoran Yang. Pessimism meets VCG: Learning dynamic mechanism design via offline reinforcement learning. In ICML, pages 14601–14638, 2022.
  • [90] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science, pages 94–103. IEEE, 2007.
  • [91] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [92] Luke Muehlhauser and Bill Hibbard. Exploratory engineering in artificial intelligence. Communications of the ACM, 57(9):32–34, 2014.
  • [93] Phuong Minh Nguyen, Peter Sunehag, and Marcus Hutter. Feature reinforcement learning in practice. In Recent Advances in Reinforcement Learning, volume 7188 of LNCS, pages 66–77. Springer, 2011.
  • [94] Noam Nisan. Introduction to mechanism design (for computer scientist). In Algorithmic Game Theory. Cambridge University Press, 2007.
  • [95] Noam Nisan, Tim Roughgarden, Éva Tardos, and Vijay V. Vazirani, editors. Algorithmic Game Theory. Cambridge University Press, 2007.
  • [96] Martin Nowak and Karl Sigmund. A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner’s dilemma game. Nature, 364(6432):56–58, 1993.
  • [97] Martin A Nowak and Karl Sigmund. Tit for tat in heterogeneous populations. Nature, 355(6357):250–253, 1992.
  • [98] Laurent Orseau, Tor Lattimore, and Shane Legg. Soft-bayes: Prod for mixtures of experts with log-loss. In International Conference on Algorithmic Learning Theory, pages 372–399. PMLR, 2017.
  • [99] Martin J Osborne. An Introduction to Game Theory. Oxford university press, 2004.
  • [100] Elinor Ostrom. Governing the Commons: The Evolution of Institutions for Collective Action. Cambridge University Press, 1990.
  • [101] Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In ICML, volume 70, pages 2721–2730, 2017.
  • [102] Xinming Ou, Sudhakar Govindavajhala, and Andrew W Appel. MulVAL: A logic-based network security analyzer. In USENIX security symposium, volume 8, pages 113–128, 2005.
  • [103] David C. Parkes and Satinder Singh. An MDP-based approach to online mechanism design. In NIPS, pages 791–798. MIT Press, 2003.
  • [104] David C Parkes and Michael P Wellman. Economic reasoning and artificial intelligence. Science, 349(6245):267–272, 2015.
  • [105] Arthur Pigou. Wealth and Welfare. Macmillan, 1912.
  • [106] Arthur Pigou. The Economics of Welfare. Routledge, 2002.
  • [107] Shuang Qiu, Boxiang Lyu, Qinglin Meng, Zhaoran Wang, Zhuoran Yang, and Michael I Jordan. Learning dynamic mechanisms in unknown environments: A reinforcement learning approach. arXiv preprint arXiv:2202.12797, 2022.

Appendix A Notes on the Agent Valuation Function

We will start by unrolling (14)-(15) for $m=3$ to see the general pattern. Given the empty history $h_0=\epsilon$, the utility for agent $i$ at time step 1 is

\begin{align*}
u_{1,i}(\epsilon) &= v_{1,i}(\epsilon,\mathbf{a^*_1}(\epsilon)) - p_i(\epsilon)\\
&= \sum_{\mathbf{or_1}}\phi(\mathbf{or_1}\,|\,\mathbf{a^*_1}(\epsilon))\bigl[r_{1,i}+u_{2,i}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1})\bigr] - p_i(\epsilon)\\
&= \sum_{\mathbf{or_1}}\phi(\mathbf{or_1}\,|\,\mathbf{a^*_1}(\epsilon))\bigl[r_{1,i}+v_{2,i}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1},\mathbf{a^*_2}(\mathbf{or_1}))-p_i(\mathbf{or_1})\bigr] - p_i(\epsilon)\\
&= \sum_{\mathbf{or_1}}\phi(\mathbf{or_1}\,|\,\mathbf{a^*_1}(\epsilon))\bigl[-p_i(\mathbf{or_1})-p_i(\epsilon)+r_{1,i}+v_{2,i}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1},\mathbf{a^*_2}(\mathbf{or_1}))\bigr]\\
&= \sum_{\mathbf{or_1}}\phi(\mathbf{or_1}\,|\,\mathbf{a^*_1}(\epsilon))\Bigl[-p_i(\mathbf{or_1})-p_i(\epsilon)+r_{1,i}\\
&\qquad\qquad +\sum_{\mathbf{or_2}}\phi(\mathbf{or_2}\,|\,\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1}))\bigl[r_{2,i}+u_{3,i}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1})\mathbf{or_2})\bigr]\Bigr]\\
&= \sum_{\mathbf{or_1}}\phi(\mathbf{or_1}\,|\,\mathbf{a^*_1}(\epsilon))\sum_{\mathbf{or_2}}\phi(\mathbf{or_2}\,|\,\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1}))\\
&\qquad\qquad \bigl[-p_i(\mathbf{or_1})-p_i(\epsilon)+r_{1,i}+r_{2,i}+u_{3,i}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1})\mathbf{or_2})\bigr]\\
&= \sum_{\mathbf{or_{1:2}}}\phi(\mathbf{or_{1:2}}\,|\,\mathbf{a^*_1}(\epsilon),\mathbf{a^*_2}(\mathbf{or_1}))\bigl[-p_i(\mathbf{or_1})-p_i(\epsilon)+r_{1,i}+r_{2,i}+u_{3,i}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1})\mathbf{or_2})\bigr]\\
&= \sum_{\mathbf{or_{1:2}}}\phi(\mathbf{or_{1:2}}\,|\,\mathbf{a^*_1}(\epsilon),\mathbf{a^*_2}(\mathbf{or_1}))\bigl[-p_i(\mathbf{or_{1:2}})-p_i(\mathbf{or_1})-p_i(\epsilon)\\
&\qquad\qquad +r_{1,i}+r_{2,i}+v_{3,i}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1})\mathbf{or_2},\mathbf{a^*_3}(\mathbf{or_{1:2}}))\bigr]\\
&= \sum_{\mathbf{or_{1:2}}}\phi(\mathbf{or_{1:2}}\,|\,\mathbf{a^*_1}(\epsilon),\mathbf{a^*_2}(\mathbf{or_1}))\Bigl[r_{1,i}+r_{2,i}+\sum_{\mathbf{or_3}}\phi(\mathbf{or_3}\,|\,\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1})\mathbf{or_2}\mathbf{a^*_3}(\mathbf{or_{1:2}}))\cdot r_{3,i}\Bigr]\\
&\qquad -\sum_{\mathbf{or_{1:2}}}\phi(\mathbf{or_{1:2}}\,|\,\mathbf{a^*_1}(\epsilon),\mathbf{a^*_2}(\mathbf{or_1}))\bigl[p_i(\mathbf{or_{1:2}})+p_i(\mathbf{or_1})+p_i(\epsilon)\bigr]\\
&= \sum_{\mathbf{or_{1:3}}}\phi(\mathbf{or_{1:3}}\,|\,\mathbf{a^*_1}(\epsilon),\mathbf{a^*_2}(\mathbf{or_1}),\mathbf{a^*_3}(\mathbf{or_{1:2}}))\bigl[r_{1,i}+r_{2,i}+r_{3,i}\bigr]\\
&\qquad -\sum_{\mathbf{or_{1:2}}}\phi(\mathbf{or_{1:2}}\,|\,\mathbf{a^*_1}(\epsilon),\mathbf{a^*_2}(\mathbf{or_1}))\bigl[p_i(\mathbf{or_{1:2}})+p_i(\mathbf{or_1})+p_i(\epsilon)\bigr],
\end{align*}

where we denote $\mathbf{v_t}(\mathbf{h_{t-1}})=(v_{t,1}(\mathbf{h_{t-1}},\cdot),\ldots,v_{t,k}(\mathbf{h_{t-1}},\cdot))$ and

\begin{align*}
\mathbf{a^*_1}(\epsilon) &= f(\mathbf{v_1}(\epsilon)) & p_i(\epsilon) &= p_i(\mathbf{v_1}(\epsilon))\\
\mathbf{a^*_2}(\mathbf{or_1}) &= f(\mathbf{v_2}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1})) & p_i(\mathbf{or_1}) &= p_i(\mathbf{v_2}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1}))\\
\mathbf{a^*_3}(\mathbf{or_{1:2}}) &= f(\mathbf{v_3}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1})\mathbf{or_2})) & p_i(\mathbf{or_{1:2}}) &= p_i(\mathbf{v_3}(\mathbf{a^*_1}(\epsilon)\mathbf{or_1}\mathbf{a^*_2}(\mathbf{or_1})\mathbf{or_2})).
\end{align*}

For general $m$, one can use an induction argument to show

\begin{align*}
u_{1,i}(\epsilon) &= \sum_{\mathbf{or_{1:m}}}\phi(\mathbf{or_{1:m}}\,|\,\mathbf{a^*_1}(\epsilon),\mathbf{a^*_2}(\mathbf{or_1}),\ldots,\mathbf{a^*_m}(\mathbf{or_{1:(m-1)}}))\Bigl[\sum_{t=1}^{m}r_{t,i}\Bigr]\\
&\qquad -\sum_{\mathbf{or_{1:(m-1)}}}\phi(\mathbf{or_{1:(m-1)}}\,|\,\mathbf{a^*_1}(\epsilon),\ldots,\mathbf{a^*_{m-1}}(\mathbf{or_{1:(m-2)}}))\Bigl[\sum_{t=0}^{m-1}p_i(\mathbf{or_{1:t}})\Bigr].
\end{align*}

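To make the recursion concrete, here is a minimal Python sketch of the mutual recursion between the step utilities $u_{t,i}$ and valuations $v_{t,i}$ that (14)-(15) define and the derivation above unrolls. The environment model phi, the allocation rule f, the payment rule payment, the joint action and percept spaces, and the horizon are all toy placeholder assumptions, chosen only so the recursion runs end to end; they are not the paper's implementation.

from itertools import product

K = 2                                            # number of agents
M = 3                                            # common horizon (m = 3, as unrolled above)
JOINT_ACTIONS = list(product([0, 1], repeat=K))  # a joint action a_t
JOINT_PERCEPTS = [(0.0, 1.0), (1.0, 0.0)]        # a joint percept or_t, here just the reward vector

def phi(percept, history, joint_action):
    # Toy history-based environment model: uniform over the joint percepts.
    return 1.0 / len(JOINT_PERCEPTS)

def f(valuations):
    # Allocation rule: pick the joint action maximising the reported total value.
    return max(JOINT_ACTIONS, key=lambda a: sum(v(a) for v in valuations))

def payment(i, valuations):
    # Placeholder payment rule p_i; a VCG-style payment would be computed here.
    return 0.1

def v(t, i, history, joint_action):
    # Valuation of agent i for joint_action at step t: expected reward plus
    # continuation utility, in the shape of the unrolled sums above.
    if t > M:
        return 0.0
    total = 0.0
    for percept in JOINT_PERCEPTS:
        prob = phi(percept, history, joint_action)
        total += prob * (percept[i] + u(t + 1, i, history + (joint_action, percept)))
    return total

def u(t, i, history):
    # Utility of agent i at step t: valuation of the allocated joint action
    # minus agent i's payment, both computed from the reported valuations.
    if t > M:
        return 0.0
    valuations = [lambda a, j=j: v(t, j, history, a) for j in range(K)]
    a_star = f(valuations)
    return v(t, i, history, a_star) - payment(i, valuations)

for i in range(K):
    print(f"u_1,{i}(empty history) =", u(1, i, ()))

Under these toy numbers each agent's utility comes out to $3\times 0.5 - 3\times 0.1 = 1.2$, matching the general formula above: expected cumulative reward over the horizon minus the expected sum of per-step payments.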
Appendix B Cap and Trade Agent Policies

Collusive Dynamics

Fig 7 shows the final policy learned for the setup of Fig 5.

=== FINAL POLICY EVALUATION ===
--------------------------------------------------------------------
|  t  | Agent | Prod |  Perm  | Prof | Rho  | Bid  |  Win? | +Prof |
--------------------------------------------------------------------
|  1  |   0   |  0.0 |      0 |  0.0 |  8.4 | 0.00 |       |       |
|     |   1   |  0.0 |      0 |  0.0 |  6.1 | 0.00 |   *   |   6.1 |
--------------------------------------------------------------------
|  2  |   0   |  0.0 |      0 |  0.0 |  8.4 | 0.10 |   *   |   8.4 |
|     |   1   | 30.5 |  3,000 |  6.1 |  2.3 | 0.00 |       |       |
--------------------------------------------------------------------
|  3  |   0   | 42.2 |  3,000 |  8.4 |  1.6 | 0.00 |       |       |
|     |   1   | 30.5 |  3,000 |  6.1 |  2.3 | 0.05 |   *   |   2.3 |
--------------------------------------------------------------------
|  4  |   0   | 42.2 |  3,000 |  8.4 |  1.6 | 0.20 |   *   |   1.6 |
|     |   1   | 42.2 |  6,000 |  8.4 |  0.9 | 0.00 |       |       |
--------------------------------------------------------------------
|  5  |   0   | 50.2 |  6,000 | 10.0 |  1.0 | 0.05 |   *   |   1.0 |
|     |   1   | 42.2 |  6,000 |  8.4 |  0.9 | 0.00 |       |       |
--------------------------------------------------------------------
(winning agent must pay less than rho to make a profit)

Avg permit price: $0.0
Agent 0:
  -> won 9000 permits
  -> paid $0.00m
  -> produced 55.2m litres of fuel
  -> made a total profit of $11.04m

Agent 1:
  -> won 6000 permits
  -> paid $0.00m
  -> produced 42.2m litres of fuel
  -> made a total profit of $8.45m

====== ESTIMATED OPTIMAL STRATEGIES: argmax_a Q(s, a) ======

==============================================================================================
        ||     0       |   3,000     |   6,000     |   9,000     |   12,000    |   15,000    |
==============================================================================================
      0 || 0.00 / 0.00 | 0.10 / 0.00 | 0.35 / 0.00 | 0.35 / 0.05 | 8.35 / 0.00 | 0.0* / 0.0* |
  3,000 || 0.00 / 0.10 | 0.00 / 0.05 | 0.20 / 0.00 | 0.10 / 0.00 | 0.0* / 0.0* |  ?   /  ?   |
  6,000 || 0.00 / 0.05 | 0.00 / 0.05 | 0.05 / 0.00 | 0.0* / 0.0* |  ?   /  ?   |  ?   /  ?   |
  9,000 || 0.00 / 0.40 | 0.05 / 0.15 | 0.0* / 0.0* |   ?  /  ?   |  ?   /  ?   |  ?   /  ?   |
 12,000 || 0.25 / 4.45 | 0.0* / 0.0* |   ?   /  ?  |   ?  /  ?   |  ?   /  ?   |  ?   /  ?   |
 15,000 || 0.0* / 0.0* |  ?   /  ?   |   ?   /  ?  |   ?  /  ?   |  ?   /  ?   |  ?   /  ?   |

====== ESTIMATED RETURNS IF OPTIMAL STRATEGY FOLLOWED: max Q(s, a) ======

=====================================================================================================
        ||      0        |    3,000      |    6,000      |    9,000    |    12,000   |    15,000    |
=====================================================================================================
      0 || 19.49 / 19.49 | 13.38 / 13.38 | 11.03 / 11.04 | 9.92 / 9.99 | 7.90 / 8.45 | 0.00 / 0.00  |
  3,000 || 11.04 / 11.04 |  4.94 / 4.94  |  2.59 / 2.59  | 1.59 / 1.60 | 0.00 / 0.00 |  ?   /  ?    |
  6,000 ||  9.44 / 9.44  |  3.34 / 3.34  |  0.99 / 0.99  | 0.00 / 0.00 |  ?   /  ?   |  ?   /  ?    |
  9,000 ||  8.40 / 8.27  |  2.30 / 2.28  |  0.00 / 0.00  |  ?   /  ?   |  ?   /  ?   |  ?   /  ?    |
 12,000 ||  5.85 / 5.84  |  0.00 / 0.00  |   ?   /  ?    |  ?   /  ?   |  ?   /  ?   |  ?   /  ?    |
 15,000 ||  0.00 / 0.00  |   ?   /  ?    |   ?   /  ?    |  ?   /  ?   |  ?   /  ?   |  ?   /  ?    |
Figure 7: Final policy for a sample run of $r^{(2)}$.

In the Estimated Optimal Strategies matrix in Fig 7, rows denote the permits held by $R_1$ and columns the permits held by $R_2$, so each cell represents a state of the game. The entries in a cell are ($R_1$'s estimated optimal bid in that state, $R_2$'s estimated optimal bid in that state). A ‘?’ means the state was never visited by the agent, while an ‘*’ marks the terminal states.

On the first tranche, both agents bid zero, leaving it to the VCG mechanism to decide, via random tie-breaking, which agent wins. If $R_1$ won the first tranche (so we are now in cell (2, 1) of the matrix), $R_1$ bids zero so that $R_2$ wins the second tranche, and vice versa if we are in cell (1, 2). Either way, by tranche 3 we are in cell (2, 2) of the matrix, and the remainder of the game always proceeds the same way: agent $R_2$ wins, then agent $R_1$ wins, then agent $R_1$ wins again.

Thus the learned joint policy always results in a return for each agent (and a total shared net profit) of $19.49$ over the five tranches, exactly as both agents have estimated; see cell (1, 1) of the “Estimated Returns” matrix of Fig 7. The entries in that matrix are ($R_1$'s estimated expected return from the state when it makes its estimated optimal bids, $R_2$'s estimated expected return when it makes its estimated optimal bids). Note that the agents learned this collusive policy solely by observing the permit holdings of each participant in the auction; there was never any explicit communication between them.
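To make the auction mechanics in this discussion concrete, the following is a minimal Python sketch of a single permit-tranche auction under the assumptions used above: sealed bids, the highest bidder wins the whole tranche, the winner pays the second-highest bid (the single-item special case of VCG), and ties are broken uniformly at random. For simplicity the replay uses the fixed bid sequence of the sampled run in Fig 7 rather than the full state-dependent policy; the function names and numbers are illustrative, not taken from the experiment code.

import random

def run_tranche(bids):
    # bids: dict agent_id -> sealed bid ($m). Highest bid wins, ties are broken
    # uniformly at random, and the winner pays the second-highest bid.
    top = max(bids.values())
    winner = random.choice([a for a, b in bids.items() if b == top])
    price = max(b for a, b in bids.items() if a != winner)
    return winner, price

# Bids mirroring the sampled run in Fig 7: the agent whose "turn" it is bids a
# small positive amount, the other bids zero, so every tranche clears at $0.
collusive_bids = [{0: 0.00, 1: 0.00}, {0: 0.10, 1: 0.00}, {0: 0.00, 1: 0.05},
                  {0: 0.20, 1: 0.00}, {0: 0.05, 1: 0.00}]
permits = {0: 0, 1: 0}
paid = {0: 0.0, 1: 0.0}
for bids in collusive_bids:
    winner, price = run_tranche(bids)
    permits[winner] += 3000
    paid[winner] += price
# The split of permits depends on the tranche-1 tie-break; total paid is $0 either way.
print(permits, paid)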

Competitive Dynamics

Fig 8 shows the final policy learned for the setup of Fig 6. Observe that the learned joint policy here is, in most states, for each agent to bid an identical amount, leaving it to the VCG mechanism to randomly determine who wins. In some cases an agent will bid an amount greater than $\rho$, such as in tranches 3-5 of the sampled final policy evaluation shown in Fig 8. Why? Consider tranche 3 of the sample, where the game is in state (6000, 0). Here agent $R_1$ bid $1.9$m (the same as $R_2$) and won, receiving a profit (and reward) of $-0.9$m, so agent $R_2$ received a reward of $0.9$m. If agent $R_1$ had bid less, say $1.8$m, then $R_2$ would have won with a profit of $6.1-1.8=4.3$m, and agent $R_1$ would have received a reward of $-4.3$m instead of $-0.9$m; clearly a worse outcome. If agent $R_1$ had bid more, the outcome would also have been worse: agent $R_1$ would have received a larger negative reward (and thus a larger positive reward would have gone to $R_2$).

Now let’s suppose the tie-break at state (6000, 0), and all subsequent tie-breaks, are resolved in $R_2$'s favour. Assuming each agent sticks with its learned strategy, the last three tranches would play out as follows:

  • $t=3$: (6000, 0) $\to$ bid $1.9$ / $1.9$ $\to$ $R_2$ wins $\to$ reward $-4.2$ / $+4.2$

  • $t=4$: (6000, 3000) $\to$ bid $1.25$ / $1.25$ $\to$ $R_2$ wins $\to$ reward $-1.1$ / $+1.1$

  • $t=5$: (6000, 6000) $\to$ bid $1.0$ / $1.0$ $\to$ $R_2$ wins $\to$ reward $-0.0$ / $+0.0$

The total reward over the last three tranches is thus $-5.3$ to $R_1$ and $+5.3$ to $R_2$. Overall, agent $R_1$ has accurately estimated its expected reward from state (6000, 0) over the remaining three tranches, when playing its estimated optimal actions, as $-5.3$m; see cell (3, 1) of the “Estimated Returns” matrix in Fig 8.
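As a quick sanity check on this play-out, the short sketch below recomputes the per-tranche rewards under the same assumptions: the winner pays the other agent's (equal) bid, ties are resolved in $R_2$'s favour, and the reward is zero-sum, with the loser receiving the negative of the winner's profit. The $\rho$ values attributed to $R_2$ are illustrative numbers chosen to reproduce the rounded rewards listed above, not values read directly from the experiment logs.

# (rho_2, common bid) for t = 3, 4, 5 at states (6000, 0), (6000, 3000), (6000, 6000)
tranches = [(6.10, 1.90), (2.35, 1.25), (1.00, 1.00)]

total_r1 = total_r2 = 0.0
for rho2, bid in tranches:
    price = bid                # tie resolved for R2, which pays R1's (equal) bid
    profit = rho2 - price      # R2's profit on the tranche
    total_r1 -= profit         # zero-sum reward: R1 receives the negative of it
    total_r2 += profit
print(round(total_r1, 2), round(total_r2, 2))   # -> -5.3  5.3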

=== FINAL POLICY EVALUATION ===
--------------------------------------------------------------------
|  t  | Agent | Prod |  Perm  | Prof | Rho  | Bid  | Win?  | +Prof |
--------------------------------------------------------------------
|  1  |   0   |  0.0 |      0 |  0.0 |  8.4 | 1.55 |   *   |   6.9 |
|     |   1   |  0.0 |      0 |  0.0 |  6.1 | 1.55 |       |       |
--------------------------------------------------------------------
|  2  |   0   | 42.2 |  3,000 |  6.9 |  1.6 | 1.55 |   *   |   0.1 |
|     |   1   |  0.0 |      0 |  0.0 |  6.1 | 1.50 |       |       |
--------------------------------------------------------------------
|  3  |   0   | 50.2 |  6,000 |  7.0 |  1.0 | 1.90 |   *   |  -0.9 |
|     |   1   |  0.0 |      0 |  0.0 |  6.1 | 1.90 |       |       |
--------------------------------------------------------------------
|  4  |   0   | 55.2 |  9,000 |  6.1 |  0.8 | 2.50 |   *   |  -1.7 |
|     |   1   |  0.0 |      0 |  0.0 |  6.1 | 2.50 |       |       |
--------------------------------------------------------------------
|  5  |   0   | 59.0 | 12,000 |  4.4 |  0.6 | 3.40 |   *   |  -2.8 |
|     |   1   |  0.0 |      0 |  0.0 |  6.1 | 3.40 |       |       |
--------------------------------------------------------------------
(winning agent must pay less than rho to make a profit)

Avg permit price: $723.0
Agent 0:
  -> won 15000 permits
  -> paid $10.85m
  -> produced 62.2m litres of fuel
  -> made a total profit of $1.58m

Agent 1:
  -> won 0 permits
  -> paid $0.00m
  -> produced 0.0m litres of fuel
  -> made a total profit of $0.00m

====== ESTIMATED OPTIMAL STRATEGIES: argmax_a Q(s, a) ======

==============================================================================================
        ||      0      |    3,000    |    6,000    |    9,000    |    12,000   |    15,000   |
==============================================================================================
      0 || 1.55 / 1.55 | 1.55 / 1.55 | 1.95 / 1.95 | 2.85 / 2.85 | 4.50 / 4.50 | 0.0* / 0.0* |
  3,000 || 1.55 / 1.50 | 1.20 / 1.15 | 1.05 / 1.05 | 1.15 / 1.15 | 0.0* / 0.0* |   ?  /  ?   |
  6,000 || 1.90 / 1.90 | 1.25 / 1.25 | 1.00 / 1.00 | 0.0* / 0.0* |   ?  /  ?   |  ?   /  ?   |
  9,000 || 2.50 / 2.50 | 1.55 / 1.6* | 0.0* / 0.0* |   ?  /  ?   |  ?   /  ?   |  ?   /  ?   |
 12,000 || 3.40 / 3.40 | 0.0* / 0.0* |   ?  /  ?   |  ?   /  ?   |  ?   /  ?   |  ?   /  ?   |
 15,000 || 0.0* / 0.0* |   ?  /  ?   |  ?   /  ?   |  ?   /  ?   |  ?   /  ?   |  ?   /  ?   |

====== ESTIMATED RETURNS IF OPTIMAL STRATEGY FOLLOWED: max Q(s, a) ======

===================================================================================================
        ||      0       |    3,000     |    6,000     |    9,000     |    12,000    |   15,000    |
===================================================================================================
      0 || 1.72 / -1.72 | 6.30 / -6.30 | 7.10 / -7.10 | 6.10 / -6.10 | 3.95 / -3.95 | 0.00 / 0.00 |
  3,000 || -5.20 / 5.20 | -0.61 / 0.61 | 0.58 / -0.58 | 0.47 / -0.47 | 0.00 / 0.00  |  ?   /  ?   |
  6,000 || -5.30 / 5.30 | -1.05 / 1.05 | 0.04 / -0.04 | 0.00 / 0.00  |  ?   /  ?    |  ?   /  ?   |
  9,000 || -4.44 / 4.44 | -0.80 / 0.80 | 0.00 / 0.00  |  ?   /  ?    |  ?   /  ?    |  ?   /  ?   |
 12,000 || -2.74 / 2.74 | 0.00 / 0.00  |  ?   /  ?    |  ?   /  ?    |  ?   /  ?    |  ?   /  ?   |
 15,000 || 0.00 / 0.00  |  ?   /  ?    |  ?   /  ?    |  ?   /  ?    |  ?   /  ?    |  ?   /  ?   |
Figure 8: Final policy for a sample run of $r^{(3)}$.