Approximate Estimation of High-dimension Execution Skill for Dynamic Agents in Continuous Domains
Abstract
In many real-world continuous action domains, human agents must decide which actions to attempt and then execute those actions to the best of their ability. However, humans cannot execute actions without error. Human performance in these domains can potentially be improved by the use of AI to aid in decision-making. One requirement for an AI to correctly reason about what actions a human agent should attempt is a correct model of that human’s execution error, or skill. Recent work has demonstrated successful techniques for estimating this execution error with various types of agents across different domains. However, this previous work made several assumptions that limit the application of these ideas to real-world settings. First, previous work assumed that the error distributions were symmetric normal, which meant that only a single parameter had to be estimated. In reality, agent error distributions might exhibit arbitrary shapes and should be modeled more flexibly. Second, it was assumed that the execution error of the agent remained constant across all observations. Especially for human agents, execution error changes over time, and this must be taken into account to obtain effective estimates. To overcome both of these shortcomings, we propose a novel particle-filter-based estimator for this problem. After describing the details of this approximate estimator, we experimentally explore various design decisions and compare performance with previous skill estimators in a variety of settings to showcase the improvements. The outcome is an estimator capable of generating more realistic, time-varying execution skill estimates of agents, which can then be used to assist agents in making better decisions and improve their overall performance.
1 Introduction
In many real-world settings, agents are required to select and execute continuous actions. There are two factors that can influence the success of an agent in these domains. First, the actions that could be selected might vary in their quality. Second, agents typically cannot execute those actions with perfect precision due to some amount of execution error that varies from agent to agent. As an example, imagine that you just joined a darts tournament. For your first game, you are assigned to play against agent A. You get to observe a match between agent A and agent B right before your match. While you are observing, can you gauge the skill level of agent A? Specifically, can you assess the ability of A to select effective target actions? Is it clear how accurately agent A can perform the selected actions? Is it possible to get any insights just by observing the actions agent A is executing? Now picture this from a different perspective. Imagine that you are assigned to coach agent A instead. Can you assess agent A’s abilities in terms of the questions presented above, and determine which areas should be focused on for improvement? Imagine having a tool to help with these types of questions.
The skill estimation problem was introduced as a framework to address these questions [4, 6]. This prior work has focused on estimating the decision-making skill and execution skill of an agent, given only observations of the noisy, executed actions. Multiple methods have been proposed and shown to be successful at producing accurate skill estimates under different assumptions about the decision-making ability of the agent. In particular, the JEEDS method, which simultaneously estimates the decision-making and execution skill of an agent, has been shown to work extremely well with a wide variety of agents [6]. In this paper, we explore the question of how to translate this success to domains in which some of the assumptions made by JEEDS are not satisfied. The two particular assumptions that we investigate are 1) that the execution error of an agent can be accurately modeled by a symmetric Gaussian distribution, and 2) that the execution skill of an agent is stationary, i.e., that it does not change while the agent is being observed.
The focus on symmetric Gaussian distributions was leveraged by the JEEDS method to allow only a single execution skill parameter to be estimated. However, in many cases, a higher-dimensional representation of execution skill is desired. One case where this can occur is when actions are executed in a low-dimensional space, but a more nuanced and flexible error distribution is desired. An example of this would be the game of darts. The action space has two dimensions, but in real life, errors are not necessarily distributed symmetrically in the 2D action space. To model execution error as a multivariate Gaussian distribution with an arbitrary covariance matrix, there are now three parameters to estimate (the variances in the two principal dimensions and the correlation between them). This yields a three-dimensional space over which to maintain a distribution and perform inference to estimate the execution skill of the observed agent. The JEEDS method maintained beliefs over a discretization of the single execution skill parameter, but its processing time was linear in the number of hypotheses utilized. Increasing to three dimensions drastically reduces the coverage of the hypotheses and/or increases the running time of the algorithm.
Another case where a higher-dimensional representation of execution skill is desired is when the action space itself is high-dimensional. An example is computational billiards, where the action space has 5 dimensions [2, 3]. As each dimension has a different scale and physical meaning, it is anticipated that execution error will be different in each dimension, resulting in 5 separate parameters to maintain a distribution over and estimate. In either case, the result is a need for the ability to estimate execution skill in a higher dimension.
The focus on stationary execution skill is perhaps not an oversimplification when observations of an agent acting are gathered over a limited time frame, but even then, it might be desirable to see the effects of fatigue on execution skill. Over longer time horizons, for example when monitoring the execution skill of an athlete over an entire season or career, previous methods must certainly be adapted to produce time-varying skill estimates that can shed insight into an agent's changing abilities.
To accurately estimate agent skill and overcome these two limitations of previous work, in this paper we propose a novel particle-filter-based estimator. This new estimator provides the ability to estimate execution skill in higher dimensions, giving the designer more flexibility over how the algorithm scales. It also provides a framework in which time-varying execution skill can be estimated. A final contribution of this paper is the application of this novel method to real-world pitching data from Major League Baseball (MLB), demonstrating its ability to give insight into a pitcher's abilities, an important prerequisite for any AI system that might provide target action suggestions.
The remainder of the paper will proceed as follows: first, in Section 2, the necessary background, notation, definitions, and related work will be provided that enable us to present the proposed method and give context to its contributions. Section 3 will describe the proposed skill estimation method. Section 4 will describe the experimental procedure we will use, and the results of the experiments will be presented in Section 5. In Section 6, we apply the proposed method to data from MLB pitchers and compare the results to those from a previous method. Finally, in Section 7 we conclude.
2 Background
This section presents the relevant terminology and notation required to present the proposed method.
2.1 Problem Statement
The environments of interest for the skill estimation problem are those that feature continuous action spaces and can be modeled as Markov Decision Processes (MDPs). These MDPs consist of a set of states $\mathcal{S}$, a continuous set of actions $\mathcal{A}$, a reward function $R(s, a)$, and a transition function $T(s' \mid s, a)$, which specifies a distribution over potential subsequent states for any given state and action combination. An agent is an entity that acts in an MDP and consists of two components: one for decision-making and one for action execution. The decision-making component utilizes information about the current state of the MDP and the agent's execution component to determine a target action, or more generally, a probability distribution over target actions, given the state. The distribution can be of any form so long as it produces a final intended, or target, action for a given state. The execution component refers to how accurately an agent can execute intended actions. This is represented by a probability distribution over random perturbations; every time the agent acts, a perturbation is sampled from this distribution and added to the target action. The resulting action, which is then actually executed in the MDP, is referred to as the executed action. Figure 1 depicts the relationship between these components as part of the agent's interaction with the environment.
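To make this agent model concrete, the following minimal Python sketch shows an executed action arising as a target action plus a sampled perturbation. The two-dimensional action space, the Gaussian noise model, and all names here are illustrative assumptions rather than part of the formal definition.

```python
import numpy as np

def execute(target_action, noise_cov, rng):
    """Executed action = target action + a perturbation sampled from the
    agent's execution noise distribution (a zero-mean Gaussian here)."""
    perturbation = rng.multivariate_normal(np.zeros(len(target_action)), noise_cov)
    return target_action + perturbation

# Example: the agent aims at (0, 0) mm, with noise that is much larger
# vertically than horizontally (an asymmetric error distribution).
rng = np.random.default_rng(0)
noise_cov = np.array([[10.0 ** 2, 0.0],
                      [0.0, 40.0 ** 2]])
executed_action = execute(np.array([0.0, 0.0]), noise_cov, rng)
```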

The concepts of decision-making skill and execution skill were introduced in [3] as terms to describe the quality of an agent’s reasoning and the agent’s execution accuracy, respectively, in our setting. As described above, this notion of skill gave rise to the skill estimation problem [4, 6]. Precisely, this problem is to estimate parameters describing the agent’s decision-making and execution components, given observations of the agent acting in the MDP. These observations contain descriptions of the state as well as the executed action, but crucially, they do not include the target action. Methods have been proposed for estimating both execution skill and decision-making skill, and have been shown to be effective at producing skill estimates, under certain assumptions.
2.2 Previous Skill Estimation Methods
Several estimation methods have been proposed in the literature and shown to be effective at producing execution skill estimates. These methods differ in the type of information they utilize. The Observed Reward (OR) method [4] focused on estimating an agent’s execution skill by analyzing the rewards it receives during interactions with the environment. This method compares the average observed reward obtained by the agent with the rewards that would be expected from perfectly rational agents with a variety of execution skill levels. The final estimate is produced by interpolation. The OR method assumes that agents are perfectly rational, which limits its applicability. It was shown to produce accurate estimates for rational agents, but fails when agents do not make decisions optimally.
To relax the perfect-rationality assumption, an alternative method, The Bayesian Approach (TBA) (later referred to as AXE in [6]), was proposed [5]. This method models the skill estimation problem as a Bayesian network. Given the network, probabilistic inference can be performed to produce execution skill estimates, under the assumption that the agent would select one of a set of focal actions, or another action uniformly at random. This led to improved performance with a wider variety of agents, but required the definition of a set of focal points for a domain.
To avoid the need for any assumptions about an agent’s decision-making skill, later work proposed an alternate Bayesian network that explicitly incorporates the decision-making component as a random variable [6]. The resulting Joint Estimation of Execution and Decision-making Skill (JEEDS) method estimates both an agent’s execution and decision-making skill levels simultaneously. Reward information is utilized to reason about the agents’ decision-making, and the executed actions give insight into execution skill. JEEDS was shown to produce far more accurate skill estimates for agents across the entire spectrum of rationality levels.
Our proposed method, described in Section 3, utilizes a similar Bayesian approach as the prior methods, but differs in its modeling of the agent’s skill as time-varying random variables in the dynamic Bayesian network. Our proposed method also utilizes a different Bayesian inference algorithm to determine its skill estimates given the observations. The performance of the proposed method will be experimentally compared to JEEDS, since JEEDS is the best-performing existing method.
2.3 Related Work
The topic of skill estimation has connections to many other lines of research. Methods for estimating the execution error of darts players have been proposed [25, 16], but those methods assume that the target action of the player is known. Recent work has proposed a novel handicap system for darts that utilizes execution skill estimates of the players [11]. A previous execution skill estimation technique [5] was used as a vital component in an application to baseball pitching [15]. In that paper, the interaction between pitcher and batter was modeled as a game, and equilibrium strategies for the pitcher were computed under various assumptions. This gives a good example of how execution skill estimates can be leveraged within larger frameworks to provide AI assistance to humans in real-world domains. Another paper focused on correlating the shape of a pitcher's error distribution with their pitching mechanics, which was done in a controlled setting where targets could be specified for the pitcher [20].
The notion of decision-making skill has been defined for players in sports [1], and also relates to work on bounded rationality [21]. The specific softmax model that we utilize in this paper has been used in many other works, including the definition of quantal response equilibrium [14]. It has been used to adjust the strength of AI agents for Go [26], and to model human behavior [27]. Other work has proposed methods for determining this parameter for agents, given observations of their actions in two-player zero-sum games [13].
Estimating the properties of other agents in games is often called opponent modeling, and skill estimation can be seen as a method for determining important properties of observed agents [17]. Opponent modeling has been explored in the context of poker [8, 9, 10], real-time strategy games [19], and n-player games [22].
One could model the execution error as part of the transition function of the underlying MDP, resulting in a different MDP for each agent in the same environment. One potential way to view skill estimation work is as an attempt to separate transition information that is agent-specific from environment-specific information. This could potentially be helpful for transfer learning, an important problem in reinforcement learning [23, 28]. For example, how can a strategy developed for an agent with one execution skill level be utilized by an agent with a different execution skill level?
Finally, our proposed skill estimation method utilizes standard Bayesian reasoning techniques [18]. Particle filters have been used extensively in robotics to track the state of a robotic system [24]. One previous study used a particle filter for opponent modeling in Kuhn poker [7]. The task was to infer the parameters representing the strategy being used by the opponent in the game. At a high level, our approach is similar to theirs, but all of the specific details (the components of the particle filter, the nature of the observations, and the overall setting) are different.
3 Monte Carlo Skill Estimation
This section presents the details of the proposed method: Monte Carlo Skill Estimation (MCSE). MCSE operates within a Bayesian framework, representing agent skill levels, target actions, and executed actions as random variables. A probability distribution over the space of possible skill levels will be updated each time an executed action is observed. What differentiates MCSE from prior work is the fact that the skill levels are represented as time-varying random variables in the dynamic Bayesian network, and that a particle filter is used as the inference method to reason about the involved random variables. The remainder of this section will describe the MCSE method in detail.
3.1 Bayesian Framework
The dynamic Bayesian network used to guide inference for MCSE is shown in Figure 2. It consists of the following random variables:
- $E$ is a multivariate random variable whose dimensions correspond to the execution skill parameters of the agent that are to be estimated. $E$ is unobserved and is not directly influenced by anything else. Execution skill estimation methods perform inference to update a probability distribution over possible execution skill levels, given the observations. A specific set of execution skill distribution parameters will be denoted $e$.
- $\Lambda$ is a random variable corresponding to the decision-making skill of the agent. It is also unobserved and not influenced by anything else, as it is considered an inherent characteristic of the agent. Decision-making skill estimation methods infer a distribution over possible decision-making skill levels, given the observations. A specific decision-making skill level will be denoted $\lambda$.
- $A_\tau$ is a random variable indicating the target action for a given observation. This random variable is also not directly observed, but is influenced by the execution and decision-making skill random variables $E$ and $\Lambda$. A specific target action for a given observation will be referred to as $a_\tau$. The influence of the current state on the target action is left implicit.
- $X$ is a random variable indicating the action that is actually executed and observed. A specific executed action will be referred to as $x$. This random variable is directly influenced by the target action random variable $A_\tau$, as well as the execution skill random variable $E$. The conditional distribution $P(X = x \mid A_\tau = a_\tau, E = e)$ is the true distribution over executed actions given an intended action $a_\tau$ and an execution skill level $e$, and corresponds directly to the execution noise distribution with parameters $e$, centered on $a_\tau$.
All of the variables are indexed by the observation number since, in this work, we model agents' skill as time-varying, and the target and executed actions will, in general, also differ from observation to observation. Moreover, we focus on episodic environments, but we hypothesize that MCSE can be successfully extended to sequential domains in the same way JEEDS was extended [6], by replacing the reward function with an optimal action-value function.

3.2 Particle Filter Approach Overview
The proposed MCSE method utilizes a particle filter to perform probabilistic inference over the space of possible skill levels. A set of $N$ particles will be maintained, and this set will represent the current probability distribution over the agent's skill. Each particle corresponds to a complete specification of an agent's skill, including specific execution skill parameters $e$ and rationality parameter $\lambda$. Every time a state and an executed action are observed, each particle computes a weight corresponding to the probability of an agent with that particle's skill level executing the observed action, and multiplies its previous weight by this new weight. Periodically, a new set of particles is resampled from the old set, with replacement. When resampling occurs, the weight of each newly resampled particle is initialized to 1. When a skill estimate is required, it is computed as the weighted average of all particle skill parameters. Details for each of these components are given in the following sections.
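A minimal Python skeleton of this loop is sketched below. The helper callables (sample_particle, particle_weight, should_resample, resample, estimate) are placeholders for the components detailed in Sections 3.3 through 3.5; the sketch is illustrative rather than a definitive implementation of MCSE.

```python
import numpy as np

def run_mcse(observations, n_particles, sample_particle, particle_weight,
             should_resample, resample, estimate, rng):
    """Skeleton of the MCSE particle-filter loop described above."""
    particles = [sample_particle(rng) for _ in range(n_particles)]
    weights = np.ones(n_particles)
    estimates = []
    for state, executed_action in observations:
        # Multiply each particle's running weight by the probability of the
        # observed executed action under that particle's skill hypothesis.
        for i, particle in enumerate(particles):
            weights[i] *= particle_weight(particle, state, executed_action)
        # Periodically resample (with replacement) and reset weights to 1.
        if should_resample(weights):
            particles, weights = resample(particles, weights, rng)
        estimates.append(estimate(particles, weights))
    return estimates
```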
3.3 Execution Skill Model
The MCSE method requires the ability to compute the probability of executing any specific action, given the parameters of an execution noise distribution and a target action. This will be represented abstractly by a probability density function (pdf) $p_e(a \mid a_\tau)$, where $e$ represents a specific set of execution skill parameters. Thus, the conditional probability distribution $P(X = x \mid A_\tau = a_\tau, E = e) = p_e(x \mid a_\tau)$. This pdf can be made concrete given the specific parameterized execution skill distribution that will be used in a setting.
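As one concrete instantiation, assuming the bivariate Gaussian parameterization used later in Section 4.1, the pdf could be computed as in the following sketch (the function and parameter names are illustrative).

```python
import numpy as np
from scipy.stats import multivariate_normal

def execution_pdf(executed, target, sigma_x, sigma_y, rho):
    """Density of an executed action given the target action, for an
    execution noise model parameterized by (sigma_x, sigma_y, rho)."""
    cov = np.array([[sigma_x ** 2, rho * sigma_x * sigma_y],
                    [rho * sigma_x * sigma_y, sigma_y ** 2]])
    return multivariate_normal(mean=target, cov=cov).pdf(executed)

# Example: density of landing 20 mm to the right of the intended target.
p = execution_pdf(np.array([20.0, 0.0]), np.array([0.0, 0.0]),
                  sigma_x=15.0, sigma_y=30.0, rho=0.25)
```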
3.4 Decision-making Skill Model
The MCSE method uses the softmax function, shown in Equation 1, to model imperfect decision-making as a function of the rationality parameter $\lambda$, as was done in [6]. This function provides a distribution over target actions, given a specific state $s$, execution skill parameters $e$, and decision-making parameter $\lambda$. Higher values of $\lambda$ correspond to increased decision-making skill in this model.
$$P(a_\tau \mid s, e, \lambda) = \frac{\exp\left(\lambda\, \rho(s, a_\tau, e)\right)}{\int_{\mathcal{A}} \exp\left(\lambda\, \rho(s, a', e)\right)\, da'} \qquad (1)$$
The function $\rho(s, a_\tau, e)$ represents the expected reward in state $s$ when the agent aims at target action $a_\tau$ with execution skill level $e$. This is computed as $\rho(s, a_\tau, e) = \int_{\mathcal{A}} R(s, a)\, p_e(a \mid a_\tau)\, da$.
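In practice the action space is discretized (Section 4.3 uses a 5 mm grid), so Equation 1 can be evaluated over a finite set of candidate target actions. The sketch below is a minimal discretized version, with illustrative names and a standard max-subtraction step added for numerical stability.

```python
import numpy as np

def softmax_target_distribution(expected_rewards, lam):
    """Discretized version of Equation 1: probability of each candidate
    target action, given its expected reward rho(s, a, e) and rationality lam."""
    z = lam * np.asarray(expected_rewards, dtype=float)
    z -= z.max()                      # numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

# Example: three candidate targets with expected rewards 10, 12, and 5.
probs = softmax_target_distribution([10.0, 12.0, 5.0], lam=0.5)
```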
3.5 MCSE Method Description
The specific steps of the MCSE method are now provided. The details for some of the steps will be given in subsequent sections.
1. Initialize the set of $N$ particles, where each particle specifies a complete skill hypothesis $(e, \lambda)$, with all skill parameters sampled from uniform distributions over their respective ranges and all particle weights set to 1.
2. For each observed state and executed action, update the weight of every particle (Section 3.5.1), resample and perturb the particle set when the resampling criterion is met (Sections 3.5.2 and 3.5.3), and produce a skill estimate whenever one is requested (Section 3.5.4).
3.5.1 Weight Computation
The weight for a particle corresponds to $P(x \mid s, e, \lambda)$, which is the probability of observing executed action $x$ in state $s$, given skill parameters $e$ and $\lambda$. The formula for this weight is shown in Equation 2, which was derived using the structure of the dynamic Bayesian network shown in Figure 2 and the MCSE components discussed earlier. The derivation follows that given in [6].
$$P(x \mid s, e, \lambda) = \int_{\mathcal{A}} p_e(x \mid a_\tau)\, P(a_\tau \mid s, e, \lambda)\, da_\tau \qquad (2)$$
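A discretized sketch of this computation is shown below: the unknown target action is marginalized out over a grid of candidate targets, with target probabilities coming from Equation 1 and likelihoods from the particle's execution noise pdf. The function signature is an illustrative assumption.

```python
import numpy as np

def particle_weight(executed, candidate_targets, target_probs, noise_pdf):
    """Discretized version of Equation 2.

    target_probs[i] approximates P(a_tau_i | s, e, lambda) from Equation 1,
    and noise_pdf(x, a_tau) returns p_e(x | a_tau) for this particle's
    execution skill parameters e.
    """
    likelihoods = np.array([noise_pdf(executed, target)
                            for target in candidate_targets])
    return float(np.sum(likelihoods * target_probs))
```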
3.5.2 Resampling Procedure
When the particles are resampled, a certain percentage ($\beta$) of them are resampled from the previous set, with replacement, where each particle is selected with probability proportional to its weight. Randomly initialized particles are then added to the new set until there are again $N$ particles. In this paper, we investigate two methods for determining when to resample the set of particles. The first is the simplest: the set of particles is resampled after every observation. The second computes the number of effective particles, $N_{\text{eff}}$, and only resamples when $N_{\text{eff}}$ falls below a certain threshold. $N_{\text{eff}}$ is computed by summing all of the particle weights.
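The sketch below illustrates one way this resampling step could look, following the description above ($N_{\text{eff}}$ taken as the sum of the particle weights, a fraction $\beta$ resampled in proportion to weight, the remainder re-initialized at random, and all weights reset to 1). The names and signatures are illustrative.

```python
import numpy as np

def effective_particles(weights):
    """N_eff as described above: the sum of the particle weights."""
    return float(np.sum(weights))

def resample(particles, weights, beta, sample_particle, rng):
    """Resample a fraction beta of the particles in proportion to their
    weights (with replacement), fill the remainder with freshly randomized
    particles, and reset all weights to 1."""
    n = len(particles)
    n_keep = int(beta * n)
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()
    kept_indices = rng.choice(n, size=n_keep, replace=True, p=probs)
    new_particles = [particles[i] for i in kept_indices]
    new_particles += [sample_particle(rng) for _ in range(n - n_keep)]
    return new_particles, np.ones(n)
```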
3.5.3 Particle Perturbation
When a single particle is resampled more than once, multiple copies of it are included in the set of particles. Without further steps, each of these copies would simply have the same weight and could not help cover the space of possible skill levels the agent could have. In this step, the parameters of each resampled particle are therefore perturbed. This is done by independently sampling from a zero-mean Gaussian distribution for each parameter. The standard deviation of each of these distributions is set to a fraction $\eta$ of the width of the corresponding parameter's range of possible values. We explore the performance of MCSE with different values of $\eta$ in Section 5.
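A sketch of this perturbation step is given below. The parameter ranges shown are hypothetical placeholders (the ranges actually used are those described in Section 4), and clipping perturbed values back into range is an added assumption for illustration.

```python
import numpy as np

# Hypothetical parameter ranges, used only for illustration.
PARAM_RANGES = {"sigma_x": (5.0, 150.0), "sigma_y": (5.0, 150.0),
                "rho": (-0.9, 0.9), "lam": (0.0, 10.0)}

def perturb(particle, eta, rng):
    """Add zero-mean Gaussian noise to each parameter of a resampled
    particle; each standard deviation is eta times the width of that
    parameter's range."""
    noisy = {}
    for name, value in particle.items():
        lo, hi = PARAM_RANGES[name]
        noisy_value = value + rng.normal(0.0, eta * (hi - lo))
        noisy[name] = float(np.clip(noisy_value, lo, hi))  # assumed clipping
    return noisy
```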
3.5.4 Generating a Skill Estimate
Finally, given a set of particles along with their corresponding weights, how is a skill estimate produced? In MCSE, the weighted average of all the particle parameters is computed, and the skill levels specified by those averaged parameters are the estimates produced by the method. If an estimate is required immediately following a resampling step, the newly randomized particles are not included in this weighted average.
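A minimal sketch of this estimate, representing each particle as a dictionary of skill parameters, is shown below; it omits the special case of excluding newly randomized particles immediately after resampling.

```python
import numpy as np

def skill_estimate(particles, weights):
    """Weighted average of each skill parameter across all particles."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return {name: float(sum(wi * p[name] for wi, p in zip(w, particles)))
            for name in particles[0]}
```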
4 Experimental Setup
In this section, the experiments that were used to evaluate the MCSE method are described. These experiments were done in a simulated variant of darts, a commonly used domain for execution skill estimation. This simulated experimental environment allows for the accuracy of different skill estimation techniques to be evaluated, since the true skill levels of the agents are known.
4.1 Experimental Domain: Two-Dimensional Darts
The traditional game of darts involves players throwing pointy projectiles (the darts) at a circular board. The player gets points depending on which region of the board the dart sticks in. The traditional game of darts has been the basis for much previous work in estimating agent skill [25, 16, 6]. We utilize the two-dimensional darts (2D-Darts) variant of the game introduced in [6], where the base rewards of the traditional dartboard are randomly shuffled. This variation guarantees that each dartboard presented to an agent is different and challenging, enabling the agent to showcase its decision-making and execution abilities more explicitly.
We parameterize execution skill in 2D-Darts using three parameters, which together define the covariance matrix of a bivariate Gaussian distribution. The three parameters we use are $\sigma_x$, $\sigma_y$, and $\rho$, which result in the covariance matrix
$$\Sigma = \begin{pmatrix} \sigma_x^2 & \rho\,\sigma_x\sigma_y \\ \rho\,\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}.$$
We represent $p_e$ by the probability density function of a bivariate Gaussian distribution with covariance $\Sigma$, centered on the target action.
4.2 Agents
Different types of agents, each with a unique decision-making component, were utilized for the experiments. These are the same types used in the experiments conducted in [6]. The Rational agent selects an optimal action, one that maximizes its expected reward with respect to its execution skill level. The Flip agent employs an epsilon-greedy-style strategy, selecting an optimal action with a probability determined by its rationality parameter and a uniformly random action otherwise. The Softmax agent selects its target action probabilistically according to Equation 1. The Deceptive agent considers the actions whose expected reward is at least a given fraction (determined by its rationality parameter) of the maximum possible expected reward and, from among those, selects the action that is farthest from an optimal action. Note that each agent with an imperfect decision-making component has a single parameter representing its level of rationality; for each, the higher this parameter, the more rational the agent. We assume that each agent has a correct model of its execution noise.
Each of these agent types was combined with a wide variety of execution skill levels in the experiments. For the experiments conducted in [6], only agents with stationary, symmetric execution skill distributions were tested. In contrast, for the current work, agents with both symmetric and asymmetric distributions were used. In addition, experiments were conducted both when agents have stationary execution skill distributions and when they have time-varying distributions. The results and findings for these experiments are presented in Section 5.1 and Section 5.2, respectively.
4.3 Experimental Procedure
The process used to conduct each experiment was as follows. First, each agent being used in an experiment was initialized with the same execution skill level and a random decision-making skill level sampled from its respective range of rationality parameters. Each agent then repeatedly saw a state, selected a target action, and had an execution noise perturbation, drawn from its execution noise distribution, added to its target action. The final executed action was then observed by each of the estimation methods, which produced an estimate after each observation. All agents in an experiment experienced the same sequence of environment states and the same sequence of execution noise perturbations, to reduce variance in the results.
A resolution of 5.0 mm was used to discretize the action space. The reward function for each state was convolved with the agent's execution noise distribution (using the same resolution) to compute the expected reward for all actions. Bivariate zero-mean Gaussian distributions were used for the true execution noise distributions, with $\sigma_x$ and $\sigma_y$ each drawn from a fixed range of values (in mm) and $\rho$ drawn from a fixed range of correlation values. The rationality parameter for each agent type was drawn from that type's own range.
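A sketch of how this convolution could be implemented on a rectangular grid is shown below, assuming the bivariate Gaussian noise model of Section 4.1; the kernel truncation at three standard deviations and all names are illustrative choices.

```python
import numpy as np
from scipy.signal import fftconvolve
from scipy.stats import multivariate_normal

def expected_reward_grid(reward_grid, res_mm, sigma_x, sigma_y, rho):
    """Expected reward for aiming at each grid cell: the reward function
    convolved with the (zero-mean) execution noise density."""
    half = 3.0 * max(sigma_x, sigma_y)
    xs = np.arange(-half, half + res_mm, res_mm)
    gx, gy = np.meshgrid(xs, xs, indexing="ij")
    cov = np.array([[sigma_x ** 2, rho * sigma_x * sigma_y],
                    [rho * sigma_x * sigma_y, sigma_y ** 2]])
    kernel = multivariate_normal(mean=[0.0, 0.0], cov=cov).pdf(np.dstack([gx, gy]))
    kernel *= res_mm ** 2            # convert density to probability mass
    return fftconvolve(reward_grid, kernel, mode="same")
```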
4.4 Evaluating Execution Skill
Previous estimators output a single number representing the standard deviation of the agent's symmetric execution noise distribution, and estimators were compared using the mean squared error (MSE) of this estimate relative to the true standard deviation. The MCSE method produces estimates for multiple execution skill parameters, so the MSE is no longer as meaningful for comparison, since an error for each parameter would need to be included. Instead, we use Jeffreys divergence (JD) to measure the difference between the estimated execution noise distribution and the agent's true execution noise distribution. The Jeffreys divergence between two distributions $P$ and $Q$ is defined as $JD(P, Q) = D_{KL}(P \,\|\, Q) + D_{KL}(Q \,\|\, P)$, where $D_{KL}$ is the Kullback-Leibler (KL) divergence. The KL divergence between two bivariate normal distributions can be computed in closed form from the corresponding means and covariance matrices.
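For reference, the closed-form KL divergence between multivariate normals, and the resulting Jeffreys divergence, can be computed as in the following sketch (function names are chosen for illustration).

```python
import numpy as np

def kl_gaussian(mu0, cov0, mu1, cov1):
    """Closed-form KL divergence KL(N(mu0, cov0) || N(mu1, cov1))."""
    k = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu0, dtype=float)
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - k
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def jeffreys_divergence(mu0, cov0, mu1, cov1):
    """Symmetrized KL divergence between two Gaussians, used to compare
    estimated and true execution noise distributions."""
    return (kl_gaussian(mu0, cov0, mu1, cov1)
            + kl_gaussian(mu1, cov1, mu0, cov0))
```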
4.5 Preliminary Experiments
The MCSE method has multiple parameters that must be specified. These include the number of particles to use ($N$), the amount of motion model noise to add to the parameters ($\eta$), the resampling strategy (after every observation or based on $N_{\text{eff}}$), the percentage of particles to resample ($\beta$), and the threshold to use with the $N_{\text{eff}}$-based resampling strategy. A preliminary set of experiments was conducted to explore different parameter values and select good parameters for the remaining experiments. These experiments only used stationary Rational agents, running for 100 observations in 2D-Darts. The results of this exploration are presented in Appendix A, but the general observations were that a larger number of particles improved performance, the motion model noise did not appear to have much of an impact, and higher resampling percentages generally led to better performance.
Based on this exploration, in the remaining 2D-Darts experiments the MCSE parameter combination used was $N = 1000$, $\eta = 0.005$, and $\beta = 90\%$, with resampling based on $N_{\text{eff}}$ using a threshold of 0.5. This combination had the smallest average JD after 100 observations in the preliminary experiments. We will refer to this estimator as MCSE.
The selected MCSE method was compared to the JEEDS estimator from previous work. In the 2D-Darts experiments, JEEDS utilized 33 hypothesis skill levels for both decision-making and execution skill, resulting in 1,089 combined hypotheses. This number was chosen so that JEEDS would have approximately the same number of hypothesis samples as MCSE. The decision-making skill level hypotheses for JEEDS were logarithmically distributed. All JEEDS beliefs were initialized to be uniform over the space of possible skill parameters.




5 Experimental Results
Two additional sets of experiments were conducted to compare the MCSE estimator to JEEDS, the previous best-performing execution skill estimation method. One set focused on agents with arbitrary, randomly generated execution skill covariance matrices, but whose skill didn’t change during the experiment. The other set used agents whose execution skill varied over the course of the observations, to explore how effectively MCSE can track the skill of dynamic agents. Details, results, and findings for each are presented next.




5.1 Estimating Execution Skill in Higher Dimensions
We first investigate the performance of MCSE at estimating execution skill in higher dimensions. To do this, the following process was repeated. First, a random set of execution skill parameters ($\sigma_x$, $\sigma_y$, $\rho$) was generated, each from its respective range. This execution skill level was used by all agents, each of which was additionally assigned a random decision-making skill level, again from each agent type's respective range of rationality parameters. These agents faced 100 randomly generated 2D-Darts states, and each estimator observed the sequence of states and executed actions, producing an execution skill estimate after each observation. At least 4000 agents of each type were included in these experiments.
Figure 4 shows the average JD for each method over the observations, where the average is taken across all experiments involving that agent type. It is immediately clear that the MCSE estimators are significantly more accurate than JEEDS when used on agents with arbitrary execution noise distributions. The average Jeffreys divergence for the MCSE estimators converges to values between 1 and 2 after 100 observations for all of the different agents, while the JEEDS average JD only nears 9 for the Rational agent and is between 12 and 15 for the other agent types. To give a sense of the scale of these JD values, a visualization is given in Figure 3, showing pairs of distributions and their corresponding JD.








To shed more light on where the improvement comes from, Figure 5 separates the agents into two groups: those with more symmetric execution noise distributions and the rest. The 14% of agents whose parameters were closest to symmetric ($\sigma_x$ close to $\sigma_y$ and $\rho$ close to zero) were categorized as symmetric. It would make sense for JEEDS to do better on the symmetric agents, as those agents match the assumption of symmetry made by the JEEDS method. The symmetric results, shown in the top row of Figure 5, show that MCSE is competitive with, and often slightly outperforms, JEEDS on these agents. The average JD for both estimators is still good, converging to values below 2. The asymmetric results show that these are the cases where JEEDS really struggles and MCSE does very well, with the plots closely resembling the overall plots shown in Figure 4.
From these experiments, we conclude that MCSE is much more effective at estimating execution skill in higher dimensions than previous methods, and that it does so consistently, even for agents that are not perfectly rational in their action selection.
5.2 Estimating Execution Skill Over Time
We next explore the ability of MCSE to estimate execution skill as it changes over time. To do this, we created variations of the Rational and Softmax agents with two different ways that skill can change over time. Each requires initial and final skill levels to be specified; one agent type changes skill abruptly at a single observation, and the other changes skill gradually across all observations. Abrupt changes occur at a random point in the middle third of the observations (between observations 33 and 66 when there are 100 observations). The execution skill levels for a given dynamic agent were created as follows. First, two ranges (in mm) were defined as representative of agents with more accurate and less accurate execution skill, respectively. Then, two samples were drawn from each range (one for $\sigma_x$ and one for $\sigma_y$, along with a random $\rho$ value) to create the initial and final skill levels for the agent. The order in which these were assigned varied, so an agent's skill level could progress from more accurate to less accurate or vice versa.
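The sketch below shows one way such per-observation skill schedules could be generated; the linear interpolation used for the gradual agent and the representation of a skill level as a parameter vector are assumptions made for illustration, as are the example values.

```python
import numpy as np

def skill_schedule(initial, final, n_obs, mode, rng):
    """Per-observation execution skill parameters for a dynamic agent.
    'abrupt' switches at a random observation in the middle third of the
    observations; 'gradual' interpolates across all observations."""
    initial = np.asarray(initial, dtype=float)
    final = np.asarray(final, dtype=float)
    if mode == "abrupt":
        switch = rng.integers(n_obs // 3, 2 * n_obs // 3)
        return [initial if t < switch else final for t in range(n_obs)]
    alphas = np.linspace(0.0, 1.0, n_obs)
    return [(1.0 - a) * initial + a * final for a in alphas]

# Example: skill drifting gradually from (10, 10, 0.0) to (60, 90, 0.5).
rng = np.random.default_rng(7)
schedule = skill_schedule([10.0, 10.0, 0.0], [60.0, 90.0, 0.5], 100, "gradual", rng)
```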




MCSE again exhibits smaller errors than JEEDS at this task. The abrupt execution skill changes cause massive spikes in JD for JEEDS, from which it never completely recovers. While MCSE also shows increased error around the changes, the increase is of lower magnitude and the estimate quickly recovers afterwards. The gradual agents present more of a challenge, as the MCSE estimators reach average JD error levels of over 5; this is still less than the error of the JEEDS method. Perhaps more troubling is the fact that, for the gradual agents, the error keeps increasing as the end of the observations nears. This is understandable, as the agent never stops changing its execution skill level, which turned out to be a greater challenge for both methods than we anticipated. Overall, MCSE shows much improvement over JEEDS at estimating dynamic skill, but there appears to still be room for improvement in the gradually changing skill case.
Experiments with MCSE using two different values of $\eta$ were also performed to investigate whether the amount of motion model noise affected performance on this task. The results, found in Appendix C, show that the noise level did not significantly affect performance.
6 Application: Skill in Baseball
We now demonstrate the applicability of MCSE to real-world Major League Baseball (MLB) data. In baseball, a pitcher attempts to throw a baseball so that a batter cannot successfully hit it (for the complete rules, see https://www.rulesofsport.com/sports/baseball.html). The execution skill of the pitcher is an important factor in the pitcher's success. Public data exists that gives the location of each pitch [12]. We use the same techniques as [15] to generate a reward function over the space of pitch locations for each pitch. The walks per inning pitched (BB/IP) statistic is generally considered a measure of a pitcher's accuracy. Using data from 2021, we selected three top-ranked pitchers by BB/IP and three bottom-ranked pitchers, each with at least 100 fastball (FF) pitches that we could feed into the MCSE and JEEDS estimators. The generalized variance (GV), the determinant of the estimated covariance matrix, for each pitcher on the FF pitch type is shown in Table 1, with the top-ranked pitchers shown above the bottom-ranked. This number gives a sense of the spread of a bivariate normal distribution, where higher numbers indicate less accuracy.
Table 1: Generalized variance (GV) and estimated strike-zone percentage for each pitcher's fastball (FF), as estimated by JEEDS and MCSE. The top three pitchers are the top-ranked by BB/IP in 2021; the bottom three are the bottom-ranked.

| Pitcher | BB/IP | GV (JEEDS) | GV (MCSE) | % in Strike Zone (JEEDS) | % in Strike Zone (MCSE) |
|---|---|---|---|---|---|
| Chris Martin | 0.091 | 0.162 | 0.019 | 63.25 | 94.51 |
| Jacob deGrom | 0.124 | 0.278 | 0.011 | 53.84 | 97.69 |
| Corey Kluber | 0.128 | 0.585 | 0.030 | 41.69 | 89.96 |
| Tanner Scott | 0.741 | 0.597 | 0.024 | 41.25 | 90.43 |
| Jake Diekman | 0.736 | 0.433 | 0.153 | 46.46 | 63.40 |
| Tucker Davidson | 0.673 | 0.424 | 0.017 | 46.75 | 95.10 |
The methods were generally able to distinguish between the two groups of pitchers (for each estimator, the average GV of the top-ranked pitchers is lower than that of the bottom-ranked pitchers), although the estimates vary significantly between estimators. We are unable to state which estimator is more accurate, as the ground-truth error distributions of the pitchers are unknown. Table 1 also shows, in the rightmost columns, our estimated probability of a pitch landing in the strike zone for each pitcher, assuming it was aimed at the exact middle of the zone. This was estimated using one million samples from each estimated distribution. Figure 7 shows the 50% confidence ellipses for the estimates produced by the MCSE estimator, giving a sense of the shape of the estimated error distributions. Each ellipse is centered on the average fastball pitch location for that pitcher. Most of the estimates have more variance in the y dimension than in the x dimension, demonstrating the value of estimating execution skill in higher dimensions.
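The strike-zone percentages were estimated by sampling, along the lines of the sketch below, which assumes a fixed rectangular zone centered on the aim point; the zone half-width and half-height arguments are placeholders, since the exact zone dimensions used are not reproduced here.

```python
import numpy as np

def strike_zone_probability(cov, zone_half_width, zone_half_height,
                            rng, n_samples=1_000_000):
    """Monte Carlo estimate of the probability that a pitch aimed at the
    exact middle of the strike zone lands inside it, given an estimated
    execution error covariance matrix."""
    samples = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples)
    inside = ((np.abs(samples[:, 0]) <= zone_half_width) &
              (np.abs(samples[:, 1]) <= zone_half_height))
    return float(inside.mean())
```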

7 Conclusions
In this paper, the Monte Carlo Skill Estimation (MCSE) method was introduced, which utilizes a particle filter to perform inference in a dynamic Bayesian network whose random variables correspond to the execution and decision-making skill of an observed agent. MCSE makes two contributions: 1) it can model and estimate execution skill in multiple dimensions, and 2) it can better handle agents whose execution skill varies over time. These properties make it a natural method to use on real-world data, as was demonstrated on MLB data. In the future, we plan to explore additional enhancements and extensions of the basic particle filter approach used in this paper, given the success of this initial exploration. We also plan to apply the MCSE method to additional real-world data sources.
References
- [1] D. Araujo, K. Davids, J. Chow, and P. Passos. The development of decision making skill in sport: an ecological dynamics perspective, pages 157–169. Nova Science Publishers, 2009.
- [2] C. Archibald, A. Altman, and Y. Shoham. Analysis of a winning computational billiards player. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1377–1382, 2009.
- [3] C. Archibald, A. Altman, and Y. Shoham. Success, strategy and skill: an experimental study. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, pages 1089–1096. International Foundation for Autonomous Agents and Multiagent Systems, 2010.
- [4] C. Archibald and D. Nieves-Rivera. Execution skill estimation. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’18, pages 1859–1861, Richland, SC, 2018. International Foundation for Autonomous Agents and Multiagent Systems.
- [5] C. Archibald and D. Nieves-Rivera. Bayesian execution skill estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):6014–6021, Jul. 2019.
- [6] C. Archibald and D. Nieves-Rivera. Estimating Agent Skill in Continuous Action Domains. Journal of Artificial Intelligence Research, 2024.
- [7] N. Bard and M. Bowling. Particle filtering for dynamic agent modelling in simplified poker. In Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI), pages 515–521, 2007.
- [8] N. Bard, M. Johanson, N. Burch, and M. Bowling. Online implicit agent modelling. In Proceedings of the Twelfth International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 255–262, 2013.
- [9] N. Bard, D. Nicholas, C. Szepesvari, and M. Bowling. Decision-theoretic clustering of strategies. In Proceedings of the Fourteenth International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2015. To Appear.
- [10] D. Billings, D. Papp, J. Schaeffer, and D. Szafron. Opponent modeling in poker. In AAAI/IAAI, pages 493–499, 1998.
- [11] T. C. Chan, C. Fernandes, and R. Walker. No more throwing darts at the wall: Developing fair handicaps for darts using a markov decision process. In MIT Sloan Sports Analytics Conference, 2024.
- [12] J. LeDoux. Introducing pybaseball: an open source package for baseball data analysis. https://jamesrledoux.com/projects/open-source/introducing-pybaseball/, 2017. Accessed: 04-20-2024.
- [13] C. K. Ling, F. Fang, and J. Z. Kolter. Large scale learning of agent rationality in two-player zero-sum games. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):6104–6111, Jul. 2019.
- [14] R. D. McKelvey and T. R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6–38, July 1995.
- [15] W. Melville, J. Melville, T. Dawson, D. Nieves-Rivera, C. Archibald, and D. Grimsman. A game theoretical approach to optimal pitch sequencing. In MIT Sloan Sports Analytics Conference, 2023.
- [16] T. Miller and C. Archibald. Monte carlo skill estimation for darts. In 2021 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1–8, 2021.
- [17] S. Nashed and S. Zilberstein. A survey of opponent modeling in adversarial domains. Journal of Artificial Intelligence Research, 73:277–327, 2022.
- [18] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall Press, Upper Saddle River, NJ, USA, 3rd edition, 2009.
- [19] F. Schadd, S. Bakkes, and P. Spronck. Opponent modeling in real-time strategy games. In GAMEON, pages 61–70, 2007.
- [20] M. Shinya, S. Tsuchiya, Y. Yamada, K. Nakazawa, K. Kudo, and S. Oda. Pitching form determines probabilistic structure of errors in pitch location. Journal of Sports Sciences, 35(21):2142–2147, 2017. PMID: 28102105.
- [21] H. A. Simon. Theories of bounded rationality. Decision and organization, 1(1):161–176, 1972.
- [22] N. Sturtevant, M. Zinkevich, and M. Bowling. ProbMaxn: Opponent modeling in n-player games. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI), pages 1057–1063, 2006.
- [23] M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7), 2009.
- [24] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, Cambridge, MA, 2005.
- [25] R. J. Tibshirani, A. Price, and J. Taylor. A statistician plays darts. Journal of the Royal Statistical Society. Series A (Statistics in Society), 174(1):213–226, 2011.
- [26] I.-C. Wu, T.-R. Wu, A.-J. Liu, H. Guei, and T. Wei. On strength adjustment for mcts-based programs. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):1222–1229, Jul. 2019.
- [27] R. Yang, C. Kiekintveld, F. Ordonez, M. Tambe, and R. John. Improved computational models of human behavior in security games. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 3, pages 1155–1156. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
- [28] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020.
Appendix A Preliminary Experiments
This section presents details for the two different rounds of experiments conducted to systematically explore a range of parameter values with the goal of identifying optimal parameters to be used for subsequent experiments presented in the paper.
A total of 9 different stationary Rational agents were evaluated in each round. These agents had execution skill levels with $(\sigma_x, \sigma_y)$ equal to (10, 10), (10, 100), and (100, 100) mm; three values of $\rho$ ($-0.75$, $0.0$, and $0.75$) were used with each of these execution skill levels. This gave the set of agents a mix of symmetric and asymmetric noise distributions. All of the agents were observed for 100 observations in the 2D-Darts domain.
A.1 Round 1: Parameter Sweep
The goal of this round of experiments was to determine a good value for the motion model noise $\eta$, select the most effective resampling strategy (resampling after every observation versus $N_{\text{eff}}$-based resampling), and identify a good percentage of particles to resample ($\beta$).
A total of 18 different estimators were evaluated. Table 2 presents the parameter combinations used for the estimators along with their final JD values, sorted from smallest to largest JD. The selected estimator (in bold) used $\eta = 0.005$ and $\beta = 90\%$ with $N_{\text{eff}}$-based resampling.
Table 2: Round 1 parameter combinations and their final JD, sorted from smallest to largest (selected estimator in bold).

| $\eta$ | $\beta$ (%) | $N_{\text{eff}}$-based? | JD |
|---|---|---|---|
| **0.005** | **90** | **yes** | **0.2052** |
| 0.005 | 95 | yes | 0.2064 |
| 0.005 | 75 | yes | 0.2339 |
| 0.002 | 75 | yes | 0.2366 |
| 0.005 | 95 | no | 0.2451 |
| 0.002 | 90 | yes | 0.2465 |
| 0.005 | 90 | no | 0.2579 |
| 0.002 | 90 | no | 0.2706 |
| 0.002 | 95 | yes | 0.2787 |
| 0.020 | 95 | yes | 0.3002 |
| 0.020 | 90 | yes | 0.3167 |
| 0.020 | 95 | no | 0.3471 |
| 0.002 | 95 | no | 0.3526 |
| 0.020 | 90 | no | 0.4018 |
| 0.020 | 75 | yes | 0.4314 |
| 0.005 | 75 | no | 0.4436 |
| 0.002 | 75 | no | 0.4656 |
| 0.020 | 75 | no | 0.6821 |
A.2 Round 2: Setting the Number of Particles
The goal of this round of experiments was to determine a good value for the number of particles ($N$) to use within a given estimator.
Experiments were conducted with the best-performing estimator from the previous round ($\eta = 0.005$, $\beta = 90\%$, with $N_{\text{eff}}$-based resampling) while varying $N$ over 50, 100, 500, 1000, 1500, and 2000. Table 3 presents the total number of experiments obtained for each selected agent across the different values of $N$. Figure 8 shows how the final JD changes as $N$ increases. $N = 1000$ was selected for the subsequent experiments, as its performance is comparable to that of $N = 1500$ and $N = 2000$ while reducing the running time of the experiments.
Table 3: Number of experiments obtained for each stationary Rational agent, parameterized by $(\sigma_x, \sigma_y, \rho)$, for each number of particles $N$.

| Rational Agent $(\sigma_x, \sigma_y, \rho)$ | 50 | 100 | 500 | 1000 | 1500 | 2000 |
|---|---|---|---|---|---|---|
| (10, 10, -0.75) | 1000 | 974 | 1147 | 1151 | 1095 | 941 |
| (10, 10, 0.0) | 1000 | 977 | 1159 | 1183 | 1128 | 986 |
| (10, 10, 0.75) | 1000 | 975 | 1153 | 1168 | 1106 | 978 |
| (10, 100, -0.75) | 953 | 945 | 936 | 1111 | 975 | 1043 |
| (10, 100, 0.0) | 955 | 948 | 947 | 1142 | 1007 | 1090 |
| (10, 100, 0.75) | 955 | 947 | 942 | 1123 | 992 | 1079 |
| (100, 100, -0.75) | 946 | 937 | 916 | 952 | 936 | 1220 |
| (100, 100, 0.0) | 947 | 940 | 928 | 974 | 961 | 1266 |
| (100, 100, 0.75) | 946 | 939 | 926 | 956 | 945 | 1249 |
| Total | 8702 | 8582 | 9054 | 9760 | 9145 | 9852 |

Appendix B Experiments Performed: Different Agent Types
| Agents | Total |
|---|---|
| Rational | 4255 |
| Softmax | 4092 |
| Flip | 4196 |
| Deceptive | 4129 |
| Abrupt Rational | 5838 |
| Gradual Rational | 5942 |
| Abrupt Softmax | 5633 |
| Gradual Softmax | 5948 |
Appendix C Experiments Performed: Dynamic Agents - Varying $\eta$




Appendix D Sample Distributions for Different Agent Types







