Proc. of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2023), May 29 – June 2, 2023, London, United Kingdom. A. Ricci, W. Yeoh, N. Agmon, B. An (eds.). Tufts University, Medford, MA, United States; Arizona State University, Tempe, AZ, United States.
Methods and Mechanisms for Interactive Novelty Handling in Adversarial Environments
Abstract.
Learning to detect, characterize, and accommodate novelties is a challenge that agents operating in open-world domains must address to guarantee satisfactory task performance. Certain novelties (e.g., changes in environment dynamics) can degrade task performance or prevent agents from accomplishing task goals altogether. In this paper, we introduce general methods and architectural mechanisms for detecting and characterizing different types of novelties, and for building an appropriate adaptive model to accommodate them using logical representations and reasoning methods. We demonstrate the effectiveness of the proposed methods in evaluations performed by a third party in the adversarial multi-agent board game Monopoly. The results show high novelty detection and accommodation rates across a variety of novelty types, including changes to the rules of the game as well as changes to the agent’s action capabilities.
Key words and phrases:
Open-world AI, Agent Architecture, Adaptive Multiagent Systems

1. Introduction: Open-world AI
Many classical adversarial AI tasks, such as game playing, take place in “closed-world” domains where all aspects of the domain—the types of entities, their properties, their actions, and the overall domain dynamics—are fixed. They are typically known to the agents before they start their task, and they do not change during task execution. Examples of such domains are “perfect information games” such as Chess, Go, or Ms. Pac-Man, where the rules of the game, the goals of the players, and the entire state of the game are always known by all agents Brown et al. (2018); Nash (1951); Perez-Liebana et al. (2016). This characteristic simplifies the game AI by limiting the number of novelties to instances of known types (e.g., a chess move with the bishop that a player has not seen before), thus allowing the development of the game AI without needing to anticipate any unknown scenarios outside the bounds of the system (e.g., a novel piece with novel rules being introduced).
In contrast, agents operating in an “open world” must be able to handle changes to entities and domain rules. Specifically, in the context of open-world games, the rules, the state, and the actions of other players might only be partially known or could change at any time, and the agent must discover these changes while playing the game Boney et al. (2021); Wanyana and Moodley (2021). Interactive novelties, in which agents interact with each other and with the environment, are especially challenging for any agent designed under a closed-world assumption Ponsen et al. (2007); Ganzfried and Sandholm (2011); Heinrich and Silver (2016). In open-world environments, the effects of actions and interactions can change during task execution, so making the wrong move or interacting with other agents in the wrong way can cause the agent to fail the task.
To tackle the challenges of interactive novelties in adversarial open worlds, we propose general methods and architectural mechanisms that allow AI agents to detect, characterize, and adapt to interactive novelties in adversarial games. We develop a general novelty-handling framework, as well as symbolic logical reasoning methods to detect, learn, and adapt to novelties in open-world environments. Our main contributions are (1) an architectural framework for handling interactive novelties in an adversarial environment, and (2) new logical reasoning approaches for characterizing novelties and accommodating them during planning (which expand the current state space, action space, and expected action effects).
2. Background and Related Work
Recent applications of multi-agent environments, such as multiplayer games Peng et al. (2017), Poker Billings et al. (1998), social robotic systems Barakova et al. (2018), and adversarial attack and defense Ren et al. (2020), involve adversarial elements and complex agent behaviors. Learning how to adapt to opponents’ strategies has therefore become an essential task for current AI architectures. Unlike collaborative AI, where all agents work together to pursue a team goal, adversarial AI agents must learn other agents’ behaviors to develop suitable strategies that maximize their own goals. This paper uses the open-world Monopoly environment as the primary test bed. Monopoly exhibits several main characteristics of an adversarial environment, such as unknown opponent behaviors, stochastic elements (e.g., dice rolls, community cards, and chance cards), and novelties in the game. These characteristics can be found in many real-world domains, such as stock market forecasting, self-driving vehicles, and cybersecurity.
Current cognitive architecture systems, such as those based on probabilistic graphical models Koller and Friedman (2009); Sucar (2015), provide an excellent tool that combines graph theory and probability theory to enable efficient probabilistic reasoning and learning. These models are widely used in the AI community and have produced state-of-the-art results, demonstrating their ability to handle several challenges of traditional cognitive architectures, such as perception, interaction, and adaptation. However, these approaches were developed primarily for closed-world environments. Even though they have shown excellent results in such settings, addressing open-world and interactive novelty remains a challenge.
Over the past two decades, many research studies have attempted to tackle the challenge of open-world AI. However, the challenge of integrating a general intelligence system capable of detecting and adapting to an open-world environment remains unsolved Peng et al. (2021); Goertzel (2014); Gizzi et al. (2021); Sarathy et al. (2021). Previous studies point out several challenges of integrating general AI systems, such as the difficulty of integrating the requisite capabilities (e.g., detecting novelty and adapting to it) and the difficulty of measuring the performance of the agent against human-like behavior Goertzel (2014). Reinforcement learning (RL) methods have been proposed in recent years as a solution for open-world environments Padakandla et al. (2020); Choi et al. (2000); Hester and Stone (2012); Arulkumaran et al. (2017). These methods use past and present experience to learn a new representation of the world or attempt to construct a suitable control policy in dynamically changing environments. However, RL and deep RL struggle to adapt to even small environmental changes: small pixel-level changes in Atari arcade games can cause an RL agent to fail the task, and adaptation to novelties may take as long as training the agent from scratch Goel et al. (2021). More recently, works in the explainable AI (XAI) literature have looked at answering contrastive queries Sreedharan et al. (2020), which could very well be about potential novelties in the world. However, applying this line of work to detecting open-world novelties would require an agent (assumed to be a human in the loop in XAI) to formulate and initiate queries to elicit the presence of novelties. Similarly, XAI works that initiate an explanatory dialogue Verma et al. (2022) depend on the human in the loop, rather than automated detection and characterization, to analyze and detect open-world novelties. Finally, there are works that learn terms in the user’s vocabulary Soni et al. (2022); the user can then use these terms to advise the agent on accommodating the novelty.
Current cognitive AI systems such as the Cognitive-Affective State System (CASS) and the Sigma cognitive architecture have attempted to address the open-world AI challenge Ustun et al. (2021); Ji et al. (2004); Goertzel (2014). Both architectures were constructed to solve the problem without updating their core components or characterizing the novelty. These approaches may improve the overall performance of the AI, but neither architecture can apprehend specific changes in the environment and accommodate those changes. Further development is needed for these architectures to perform well in realistic open-world environments, where parts of the available information, such as adversary mental models, transition functions, and agents’ actions and interactions, can change.
3. Preliminaries
3.1. n-Person Non-Cooperative Stochastic Turn-Based Games
We consider a non-cooperative stochastic game environment $G = \langle S, N, \{A_i\}_{i \in N}, P, R \rangle$ consisting of a finite, non-empty state space $S$; a set of $n$ players $N = \{1, \dots, n\}$; a finite action set $A_i$ for each player $i \in N$; a set of conditional transition probabilities between states, $P(s' \mid s, a_1, \dots, a_n)$, such that $\sum_{s' \in S} P(s' \mid s, a_1, \dots, a_n) = 1$; and a reward function $R$ such that each player $i$ receives reward $r_i = R_i(s, a_1, \dots, a_n)$, where $r_i \in \mathbb{R}$. An $n$-person stochastic game is turn-based if at each state there is exactly one player who determines the next state. In order to formulate the problem, we extend the action sets $A_i$ to be state dependent: for each particular state $s \in S$, there is a restricted action set $A_i(s) \subseteq A_i$, and there is at most one player $i$ such that $|A_i(s)| > 1$.
At the beginning of the game, all players start at the same initial state $s_0$. Each player $i$ independently performs an action $a_i^0 \in A_i(s_0)$. Given $s_0$ and the selected actions $(a_1^0, \dots, a_n^0)$, the next state $s_1$ is derived with probability $P(s_1 \mid s_0, a_1^0, \dots, a_n^0)$. Then, each player independently performs an action $a_i^1 \in A_i(s_1)$, and the next state $s_2$ is derived from $s_1$ and $(a_1^1, \dots, a_n^1)$ with probability $P(s_2 \mid s_1, a_1^1, \dots, a_n^1)$. The game continues in this fashion for an infinite number of steps, or until the goal is reached, and thus generates a random history $h_t = (s_0, a^0, s_1, a^1, \dots, s_t)$. Based on a partial history $h_t$, we can derive a conditional distribution over actions, the so-called strategy $\pi_i(\cdot \mid h_t) \in \Delta(A_i(s_t))$, where $\Delta(A_i(s_t))$ is the set of probability measures on $A_i(s_t)$. A strategy profile $\pi = (\pi_1, \dots, \pi_n)$, consisting of one strategy per player, is used to determine the next actions. Finally, the reward function is specified based on the transitions of the game, and $\gamma \in [0, 1)$ is the discount factor, which determines the relative importance of immediate and long-term future rewards.
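Under a strategy profile $\pi = (\pi_1, \dots, \pi_n)$, the objective each player $i$ maximizes can be written as the expected discounted return; the display below uses the notation introduced above and is included only to make the role of the discount factor explicit:

$V_i^{\pi}(s_0) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R_i\big(s_t, a_1^t, \dots, a_n^t\big) \,\Big|\, s_0\right], \qquad 0 \le \gamma < 1.$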
3.2. Interactive Novelty
In general, novelty refers to a change in the environment that the agent can apprehend neither from its own knowledge base nor from its past experience. In this paper, we address the challenge of detecting and accommodating interactive novelty. More specifically, interactive novelty refers to changes in the agent’s actions, interactions, and relations:
• Novelty Level 1 [Action]: New classes or attributes of external agent behavior.
• Novelty Level 2 [Interaction]: New classes or attributes of dynamic, local properties of behaviors impacting multiple entities.
• Novelty Level 3 [Relation]: New classes or attributes of static, local properties of the relationship between multiple entities.
We denote by $I$ the interaction set of the agent, which represents the agent’s capability to interact with other agents or with the environment. The relation set $E$ represents the relationship of the agent with other agents or with the environment, such that the relationship is exhibited as part of the history, or action sequence. Each action $a$ in the action set $A$ is defined by a preconditions set $pre(a)$ and an effects set $\mathit{eff}(a)$. The preconditions set of an action includes all the conditions that need to be satisfied in order to execute the action, while the effects set indicates the expected results after a successful execution of action $a$.
The set of interactive novelties consists of all the changes that can occur in the action set, interaction set, and relation set. In this setting, action novelty refers to changes in the action space, action preconditions, or action effects. We denote by $A'$ the new action set, which contains all actions unknown to the agent, such that $A' \cap A = \emptyset$ and no action in $A'$ appears in the agent’s knowledge base $KB$. We assume that the preconditions and effects of the new action set are completely unknown to the agent, and both must be discovered through the agent’s interactions. Similarly, we can define the new interaction set $I'$ and the new relation set $E'$, and formulate interaction novelty and relation novelty accordingly.
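To make this formalization concrete, the following minimal Python sketch (the data structures and names are ours, not the agent’s implementation) represents each action by its precondition and effect sets and flags an observed action as an action novelty when it is either absent from the knowledge base or has effects that differ from the expected ones:

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # conditions that must hold before execution
    effects: frozenset        # expected state changes after a successful execution

@dataclass
class KnowledgeBase:
    actions: dict = field(default_factory=dict)  # action name -> Action

    def is_action_novelty(self, observed_name, observed_effects):
        """An action novelty is an action unknown to the knowledge base, or a
        known action whose observed effects differ from its expected effects."""
        known = self.actions.get(observed_name)
        if known is None:
            return True  # the action is not part of the agent's action set A
        return frozenset(observed_effects) != known.effects

# Usage: a stay-in-jail action unknown to the knowledge base is flagged as novel.
kb = KnowledgeBase({"pay_jail_fine": Action("pay_jail_fine",
                                            frozenset({"in_jail"}),
                                            frozenset({"not_in_jail", "cash_reduced"}))})
print(kb.is_action_novelty("stay_in_jail", {"in_jail"}))                      # True
print(kb.is_action_novelty("pay_jail_fine", {"not_in_jail", "cash_reduced"})) # False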
3.3. Problem Formulation: Interactive Novelty Detection and Adaptation
The integrated architecture allows us to map all the essential information about the environment described in Section 3.1 to the knowledge base $KB$. Based on this information, we construct the strategy using the truncated-rollout MCTS solver. However, because interactive novelties may occur throughout the course of the game, the plan must be adjusted in order to accommodate new actions, interactions, or relations.
As described in Section 3.1, the pre-novelty environment is represented as a non-cooperative stochastic turn-based game $G = \langle S, N, \{A_i\}_{i \in N}, P, R \rangle$.

In order to detect and accommodate interactive novelties, we define a detection function $D$ that determines whether there is any unexpected change in the environment after the agent selects an action $a$ in state $s$ and observes the next state $s'$, or whether a new action was performed. In addition, an identification function $C$ characterizes the cause of the change based on logical reasoning. The purpose of these functions is to construct a representation of the new, post-novelty environment $G' = \langle S', N, \{A'_i\}_{i \in N}, P', R' \rangle$, where $S'$ is the new state space post-novelty, $A'_i$ is the new finite action set of each agent $i$ in the environment, $P'$ is the new conditional transition function, and $R'$ is the new reward function post-novelty. From this model of the post-novelty world $G'$, we modify the current strategy profile $\pi$ in order to adapt to the changes in the environment, as described in the next section.
4. Adversarial Domain: Open-World Monopoly
4.1. Environment Implementation
Monopoly is a multi-player adversarial board game in which all players start at the same position. The game supports up to four players, as depicted in Figure 1. Players roll dice to move across the board, and the ultimate goal is to be the last player standing after bankrupting all other players. This objective can be achieved by buying properties and railroads, monopolizing color sets, and developing houses on properties. If one player lands on a property owned by another player, they are charged rent or a fee. After monopolizing color sets and developing houses and hotels, players can charge higher fees when other players land on their properties. The game includes several surprise factors, such as chance cards, community cards, jail, auctions, and the ability to trade between agents. These elements can completely change the game; hence, any action in the game needs to be adapted to dice rolls, community cards, chance cards, and the decisions of other players. These characteristics make the game challenging for integrated planning and execution. In the game simulator, novelties can be injected on top of the standard game to study how the agent detects and accommodates these changes Haliem et al. (2021). The Open-World Monopoly environment was developed by the third-party team that ran the evaluation. Unlike traditional Monopoly, where all the states and actions of other agents are fully observable, Open-World Monopoly does not allow us to monitor all of the other agents’ actions and interactions Kejriwal and Thomas (2021); the environment is therefore partially observable.

4.2. Interactive Novelties in Monopoly
We implement the three categories of interactive novelty discussed in Section 3.2 in the classic Monopoly game. Some illustrative examples of novelty are described below:
• Action Novelty: This class of novelty can be illustrated through a stay-in-jail action. Under this novelty, a player may stay in jail as long as they want; however, each time the player receives rent while voluntarily staying in jail, they must pay a certain fee to the bank.
• Interaction Novelty: We illustrate interaction novelty through a loan interaction between two agents. For example, a player could send a loan request to another player and pay the loan back over a period of time that both parties agree on.
• Relation Novelty: We illustrate relation novelty through a relation property, where we enforce homogeneity between the properties of a monopolized color group. The player must improve a monopolized set of properties homogeneously within a given move. For example, suppose the player owns the three orange properties (a monopolized set). In the default game, the player could build a house on the first property and leave the other two unimproved. Under this novelty, if the player improves the first property in a move, they must also improve the second and third, so that the properties are homogeneously improved at the end of the move; otherwise, the improvement is revoked at the end of the move (a sketch of this check is given below).
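To make the homogeneity relation concrete, the following minimal Python sketch (hypothetical data structures, not the evaluation code) checks whether a monopolized color group was improved homogeneously during a move; a violation would cause the improvement to be revoked:

def improvements_are_homogeneous(houses_before, houses_after):
    """Both arguments map each property of one monopolized color group to its
    number of houses. Under the relation novelty, if any property in the group
    is improved during the move, all properties in the group must end the move
    at the same improvement level."""
    improved = any(houses_after[p] > houses_before[p] for p in houses_before)
    if not improved:
        return True  # nothing was improved this move, so nothing to enforce
    return len(set(houses_after.values())) == 1

# Example: improving only the first orange property violates the relation,
# so the improvement would be revoked at the end of the move.
before = {"St. James Place": 0, "Tennessee Avenue": 0, "New York Avenue": 0}
after  = {"St. James Place": 1, "Tennessee Avenue": 0, "New York Avenue": 0}
print(improvements_are_homogeneous(before, after))  # False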
5. The Architectural Framework
The architecture includes four main components: the environment interface, the novelty handling component, a knowledge base, and a planning agent, as shown in Figure 2. The Novelty Handler component was integrated into the Agent Development Environment (ADE) Andronache and Scheutz (2006), which supports the development of different integrated architectures. The Knowledge Base of the agent consists of the Belief Stack, Action Stack, Interaction Stack, and Relation Stack. The Planning Agent component develops and executes the plan based on the information in the knowledge base and the goal. The Monopoly Interface connects to the Monopoly API so that novelties can be injected into the environment. These novelties are detected and characterized by the novelty handler, which is developed using Answer Set Programming (ASP), a declarative programming paradigm oriented towards challenging search problems Baral (2003); Baral et al. (2004); Gebser et al. (2014). After a novelty is determined, the novelty handler updates the new actions, effects, or states in the knowledge base. When the agent receives the updated information, the planning agent reconstructs the plan according to the new knowledge base.
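The control flow through these components can be summarized by the following Python sketch, in which all function and object names are illustrative rather than the actual ADE interfaces:

def agent_turn(observation, knowledge_base, detect, characterize, plan):
    """One pass through the architecture: detect discrepancies between the
    expected and observed game state, characterize them, update the knowledge
    base, and replan from the updated model."""
    discrepancies = detect(observation, knowledge_base)
    if discrepancies:
        knowledge_base.update(characterize(discrepancies, knowledge_base))
    return plan(observation, knowledge_base)

# Usage with trivial stand-ins: no discrepancy is found, so the planner simply
# runs on the unchanged knowledge base.
kb = {"actions": {"buy_property", "pay_jail_fine"}}
next_action = agent_turn({"player1_cash": 500}, kb,
                         detect=lambda obs, kb: [],
                         characterize=lambda d, kb: {},
                         plan=lambda obs, kb: "roll_dice")
print(next_action)  # roll_dice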

5.1. Novelty Detection
We record the information provided by the game environment and compare it with our “expectation” of the game board state. This expectation is derived from the agent’s knowledge base of the game, including expected states, actions, action preconditions, and effects. The game environment then provides us with the actual game board states and the actions that have occurred between the current time step and the previous time our agent acted (e.g., after our agent lands on a property and buys it, all other agents act in order until it is our agent’s turn again). When we notice a discrepancy between the expected state and the actual state, we surmise that something has changed within the game, i.e., a novelty may have been introduced that makes some aspects of our domain representation incorrect. Such unpredicted changes require the agent to update its knowledge base accordingly (e.g., a new action is added to the action space of Monopoly). An example of the novelty-handling component is shown in Algorithm 1. The evaluation is run in a tournament setup (many games per tournament), as discussed in Section 6; therefore, when the agent detects a novelty in the current game, this information is used to adapt in subsequent games.
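The following Python sketch outlines this expectation-versus-observation comparison (a simplified stand-in for Algorithm 1; the state representation and names are ours):

def detect_novelty(expected_state, observed_state, known_actions):
    """Compare the predicted game-board state against the state reported by the
    environment and collect all discrepancies as candidate novelties."""
    discrepancies = []
    for key, expected_value in expected_state.items():
        if observed_state.get(key) != expected_value:
            discrepancies.append((key, expected_value, observed_state.get(key)))
    # Actions reported by the environment but missing from our action space are
    # also treated as candidate (action) novelties.
    for action in observed_state.get("performed_actions", []):
        if action not in known_actions:
            discrepancies.append(("unknown_action", None, action))
    return discrepancies  # an empty list means no novelty was detected this turn

# Example: the observed cash differs from the expected cash after paying the
# jail fine, so a discrepancy is reported.
expected = {"player1_cash": 450, "player1_in_jail": False}
observed = {"player1_cash": 477, "player1_in_jail": False, "performed_actions": []}
print(detect_novelty(expected, observed, {"pay_jail_fine", "buy_property"}))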
5.2. Novelty Characterization
Next, the agent uses a novelty identification module to characterize the novelty. This module has several sub-modules (which can be run in parallel), each focused on determining a specific novelty type. Each sub-module uses the same ASP code that is used for hypothetical reasoning about the effect of an action, with two changes. First, a particular parameter, the focus of that specific sub-module, which was originally a fact, is replaced by ASP “choice” rules that enumerate the different values the parameter can take. Second, constraints are added to remove possible answer sets in which the predicted game board state does not match the observed game board state. The answer sets of the resulting program give us the parameter values that reconcile the predicted and observed game board states. If there is only one answer set, and thus a unique parameter value, and this value differs from the value we had earlier, then we have identified a novelty. We can then update the ASP code used for hypothetical reasoning by simply replacing the earlier value of the parameter with the new value.
Below we first give a glimpse of how ASP can be used for reasoning about the next state and how that code can be minimally modified to infer a novelty.
To reason about the next state, the ASP code will first define the game parameters through facts such as the following:
dice_value(1..6).
player(player1;player2).
cash(1..1000).
asset("B&O_Railroad").
penalty(50).
Then rules of the following form are used to define actions and fluents.
action(sell_property(P,X)) :- player(P), asset(X).
fluent(asset_owned(P,V)) :- player(P), asset(V).
Properties of actions, such as their preconditions and their effects, are defined using rules of the following kind:
%Executability of selling assets
:- occurs(sell_property(P,V), T), player(P), asset(V), time(T),
   not holds(asset_owned(P,V),T).

%Effect of selling assets
not_holds(asset_owned(P,V),T+1) :- holds(asset_owned(P,V),T),
   occurs(sell_property(P,V),T), player(P), asset(V), time(T).
not_holds(asset_mortgaged(P,V),T+1) :- holds(asset_owned(P,V),T),
   occurs(sell_property(P,V),T), player(P), asset(V), time(T).
holds(current_cash(P,X+Y),T+1) :- holds(current_cash(P,X),T),
   occurs(sell_property(P,V),T), not holds(asset_mortgaged(P,V),T),
   asset_price(V,Y), player(P), asset(V), time(T).
not_holds(current_cash(P,X),T+1) :- holds(current_cash(P,X),T),
   occurs(sell_property(P,V),T), not holds(asset_mortgaged(P,V),T),
   asset_price(V,Y), player(P), asset(V), time(T).
holds(current_cash(P,X+Y),T+1) :- holds(current_cash(P,X),T),
   occurs(sell_property(P,V),T), holds(asset_mortgaged(P,V),T),
   asset_m_price(V,Y), player(P), asset(V), time(T).
not_holds(current_cash(P,X),T+1) :- holds(current_cash(P,X),T),
   occurs(sell_property(P,V),T), holds(asset_mortgaged(P,V),T),
   asset_m_price(V,Y), player(P), asset(V), time(T).

%Executability of paying jail fine
:- occurs(pay_jail_fine(P), T), player(P), time(T), not holds(in_jail(P), T).
:- occurs(pay_jail_fine(P), T), player(P), time(T), not holds(current_cash(P, _), T).
:- occurs(pay_jail_fine(P), T), player(P), time(T), holds(current_cash(P,X),T), X < 50.

%Effect of paying jail fine
not_holds(in_jail(P), T+1) :- holds(in_jail(P), T), occurs(pay_jail_fine(P), T),
   player(P), time(T).
not_holds(current_cash(P, X), T+1) :- holds(current_cash(P,X),T), holds(in_jail(P), T),
   occurs(pay_jail_fine(P), T), player(P), time(T).
holds(current_cash(P, X-50), T+1) :- holds(current_cash(P,X),T), holds(in_jail(P), T),
   occurs(pay_jail_fine(P), T), player(P), time(T).
The inertia rules are expressed as follows:
holds(F,T+1) :- fluent(F), holds(F,T), not not_holds(F,T+1), time(T).
not_holds(F,T+1) :- fluent(F), not_holds(F,T), not holds(F,T+1), time(T).
The initial state is defined using holds facts with respect to time step 0, such as:
holds(in_jail(player1), 0).
holds(current_cash(player1,500),0).
An action occurrence at time step 0 is then defined as a fact in the following form.
occurs(pay_jail_fine(player1),0).
Now when a complete ASP program with rules and facts of the above kind is run, we get an answer set from which we can determine the state of the world at time step 1.
Suppose that the answer set has the facts:
holds(in_jail(player1), 0).
occurs(pay_jail_fine(player1),0).
holds(current_cash(player1,500),0).
holds(current_cash(player1,450),1).
while our next observation gives us:
obs(current_cash(player1,477),1).
Our prediction that player1’s current_cash is 450 at time step 1 differs from our observation that it is 477, which suggests that there is a novelty. This can be determined by the following two simple rules.
discrepancy(F,T) :- fluent(F), time(T), holds(F,T), not obs(F,T).
discrepancy(F,T) :- fluent(F), time(T), not holds(F,T), obs(F,T).
While the above could have been implemented in any language, including Python (the simulator language, in which we also implemented it), having it in ASP makes it easier to take the next step, which is to find out what the novelty is.
To do this in ASP, we modify the above code by adding the following rules and removing “penalty(50)” (the jail fine in the Monopoly game) from the original code.
oneto200(1..500).
1 { penalty(X) : oneto200(X) } 1.   %choice rule
:- obs(current_cash(P,X),1), holds(current_cash(P,Y),1), X!=Y, player(P).
In the above, the first fact and the choice rule define the range of penalty values that we explore. If we had just those two rules, we would get multiple answer sets, with the penalty ranging from 1 to 500. The constraint (the last ASP rule) then eliminates all answer sets in which the observation about current_cash does not match the corresponding holds atom. The answer set that remains gives us the penalty value that makes the observation match the prediction, thus allowing us to characterize the novelty with respect to the penalty. In this particular case, the program has an answer set containing “penalty(23)”, characterizing the novelty that the penalty is now 23.
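The modified program can also be solved programmatically; the sketch below shows one way to do this with clingo’s Python API, assuming the clingo package is installed and that program holds the full encoding, including the choice rule and the constraint above:

import clingo

def characterize_penalty(program: str):
    """Solve the modified ASP program and return the penalty value(s) that
    reconcile the predicted game-board state with the observed one."""
    penalties = []

    def on_model(model):
        for atom in model.symbols(shown=True):
            if atom.name == "penalty":
                penalties.append(atom.arguments[0].number)

    ctl = clingo.Control(["0"])  # "0": enumerate all answer sets
    ctl.add("base", [], program)
    ctl.ground([("base", [])])
    ctl.solve(on_model=on_model)
    return penalties

# With the program described in the text, a single answer set remains and the
# returned list is [23], i.e., the jail fine is now 23.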
5.3. Novelty Accommodation
Since novelties in the state (features, dynamics, actions) mean the agent has to replan often, based on the most up-to-date information, we were interested in developing an online planning algorithm to determine the best action. However, in environments that are both long-horizon and stochastic, online planning approaches such as Monte-Carlo tree search quickly become intractable. To address this problem, we formulate a truncated-rollout-based algorithm that uses the updated domain dynamics (learned from detected novelties) for a few steps of each rollout and then uses a state evaluation function to approximate the return for the rest of that rollout. The evaluation function combines domain-specific components with a more general heuristic to approximate the return from the state reached after the truncated rollout. Furthermore, to ensure the agent adapts to detected novelties, we made both the environment simulator used for rollouts and the evaluation function flexible and conditioned on the environment attributes, using only a few tuned constants. Thus, whenever a novelty is detected, we update the relevant attributes in the simulator and the evaluation function before running the algorithm to decide the next action. Using this approach, we are able to incorporate novel information into the decision-making process and adapt efficiently. An example of the whole process is shown in Algorithm 2.
We now describe the rollout algorithm and the evaluation function in more detail. When choosing the next best action in a given state, we execute multiple rollouts for each possible action and compute the mean return value for each action. Each rollout terminates either when a terminal state is reached or after a fixed number of actions have been taken. The rollouts use the updated domain dynamics of the environment. Due to the potentially high branching factor, we keep these rollouts short, which also limits the effect of errors in our characterization of any novelties.
However, to infer some approximation of the long-term value of an action, we use an evaluation function. Our evaluation function consists of two components: one that is domain-specific and one that is heuristic-based and can be applied to any domain in general. The heuristic component involves relaxing the domain and performing another rollout on the relaxed domain to some depth $d$. Examples of relaxations include limiting adversarial agents’ actions and determinizing the domain dynamics; for instance, in the Monopoly domain, we prevent the agent from taking any buying or trading actions. The domain-specific component computes the value of the state as the sum of two terms, $V_{\mathit{assets}}$ and $V_{\mathit{rent}}$, where $V_{\mathit{assets}}$ is the value of all the assets the agent owns, and $V_{\mathit{rent}}$ is the maximum rent the agent would receive if it gained a monopoly over any color group, scaled down by how far the agent is from obtaining that monopoly.
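A minimal Python sketch of the truncated-rollout decision procedure described above is given below; the simulator interface (legal_actions, step, is_terminal) and the constants are illustrative assumptions, not the deployed implementation:

import random

def choose_action(state, simulator, evaluate, num_rollouts=20, depth=5):
    """Select the action whose short rollouts (using the novelty-updated
    dynamics) have the highest mean return, bootstrapping the value of the
    truncated rollout with the evaluation function."""
    best_action, best_value = None, float("-inf")
    for action in simulator.legal_actions(state):
        returns = []
        for _ in range(num_rollouts):
            s, total = simulator.step(state, action)  # (next state, reward)
            for _ in range(depth - 1):
                if simulator.is_terminal(s):
                    break
                # Non-terminal states are assumed to have at least one legal action.
                a = random.choice(simulator.legal_actions(s))
                s, r = simulator.step(s, a)
                total += r
            if not simulator.is_terminal(s):
                total += evaluate(s)  # domain-specific + heuristic components
            returns.append(total)
        mean_return = sum(returns) / len(returns)
        if mean_return > best_value:
            best_action, best_value = action, mean_return
    return best_action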
6. Evaluation & Results
6.1. External Evaluation
To maintain the integrity of the evaluation, all information about the novelties was hidden from our team, and all information about our architecture and methodologies was hidden from the evaluation team. The external evaluations were performed by the third-party team that originally created the Open-World Monopoly domain. Our agent was evaluated on the three types of interactive novelty: action, interaction, and relation. For each type, more than 20 different novelties were introduced during the evaluation process. Each novelty level also has three difficulty levels (shown in Table 1), which express how hard it is to detect and use the novelty. For instance, if the novelty involves an action, easy difficulty means the action is available for the agent to perform without any preconditions; medium difficulty means the action can be detected by observing other agents; and hard difficulty means the agent can only act under specific circumstances and may need to explore the environment to learn the action. In total, the agent was evaluated on more than 60 novelties, and at least 600 tournaments (100 games per tournament) were run to measure our agent’s performance. Each tournament started with the traditional Monopoly game (no novelty), and at an unspecified point during the tournament a novelty was introduced. To avoid ambiguity between novelties, only one specific novelty at a time was injected into a tournament. In our internal evaluation, ASP performed excellently in novelty detection and characterization. However, due to the characteristics of ASP, the run time of the novelty-handling component can be very high. Moreover, given the nature of Monopoly (a game can go on indefinitely) and our limited computational resources, we decided to implement the detection logic in Python, leveraging direct access to the simulator rather than first modeling and then detecting novelties in ASP, in order to optimize run time for the external evaluation.
Our agent was evaluated on four different metrics. M1 is the percentage of correctly detected trials (CDT), i.e., the percentage of trials that have at least one true positive and no false positives (FP). M2 is the percentage of FP (the agent reports a novelty when no novelty exists). M3 is the novelty reaction performance (NRP) before the novelty is introduced (pre-novelty), and M4 is the NRP after the novelty is introduced (post-novelty). To measure the NRP, our agent was evaluated against a heuristic agent that embeds some of the most common strategies in Monopoly (e.g., targeting specific colors, never buying certain properties, always reserving money). Finally, the NRP of the agent is computed from the win rates as follows:
$\text{NRP} = w / w_b$, where $w$ is the win rate of our agent (pre-novelty for M3, and post-novelty for M4) and $w_b$ is the win rate of the baseline agent.
| Novelty Level 1: Action | | | |
|---|---|---|---|
| Metrics | Easy | Medium | Hard |
| | Mean | Mean | Mean |
| M1 | | | |
| M2 | | | |
| M3 | | | |
| M4 | | | |
| Novelty Level 2: Interaction | | | |
| M1 | | | |
| M2 | | | |
| M3 | | | |
| M4 | | | |
| Novelty Level 3: Relation | | | |
| M1 | | | |
| M2 | | | |
| M3 | | | |
| M4 | | | |
The results suggest that our cognitive architecture provides strong solutions for the game regardless of the complexity of the environment and the level of novelty. Our agent achieved a perfect precision score (100% CDT and 0% FP) at all difficulty levels of action and interaction novelties, and a nearly perfect precision score on relation novelties; however, the agent missed some of the novelties at the hard difficulty level. These failures to detect certain novelties stem from the nature of the relation novelty category: such novelties can only be detected when a specific action is executed. Due to the stochasticity of Monopoly, the agent would sometimes not perform that specific action throughout the entire evaluation; to identify a relation novelty, the agent may need to perform a particular action in a specific state to reveal the novelty. For example, the relation property novelty discussed in Section 4.2 only manifests when the agent monopolizes the green color group (one of the most challenging groups to monopolize due to the cost of each property), so the agent may fail to detect the novelty if it never monopolizes that group during testing. The M3 and M4 NRP scores at all novelty levels show that our agent outperformed the baseline agent both before and after novelties were introduced. The scores in Table 1 indicate that our cognitive architecture and accommodation strategies allow the planning agent to handle interactive novelties effectively.
6.2. Internal Evaluation
6.2.1. Agent Performance Without Novelty Accommodation
To understand the effectiveness of the novelty handler components (detection, characterization, and accommodation), we conducted experiments on all the novelties and recorded the win rate of the MCTS agent with and without the support of the novelty handler, using a randomized number of games per tournament. Table 2 shows the overall performance of the MCTS agent with the novelty handler against the MCTS agent without it. The results suggest that the MCTS agent supported by the novelty handler outperforms the vanilla MCTS agent, with a significant difference in win rate, and indicate that the novelty handler components play an essential role in the agent’s performance in an adversarial open-world domain. Although some novelties have a substantial effect on the game while others barely affect it (nuisance novelties), the novelty handler mechanism still proves effective in enhancing the agent’s performance. For example, a restricted-color novelty can significantly affect the agent’s strategies for buying and trading properties, whereas other novelties, such as changes to selling houses or property rates, have minimal effect on the game.
| Novelty | Win rate of adaptive MCTS agent | Win rate of non-adaptive MCTS agent |
|---|---|---|
| | Mean ± SD | Mean ± SD |
| Action | | |
| Relation | | |
| Interaction | | |
6.2.2. Agent Performance Against Existing Methods
To gauge the performance level of our agent, we compare it against other Monopoly-playing agents. For this experiment, we evaluate our agent against the hybrid deep reinforcement learning agent of Bonjour et al. (2021), which combines proximal policy optimization (PPO) and double deep Q-learning (DDQN). The authors compare the standard reinforcement learning approach to their hybrid approach, and their experimental results show that the hybrid agents outperform traditional RL agents; notably, the hybrid PPO agent outperforms a fixed-policy agent developed based on the Monopoly world champion’s strategy. In our evaluation, we ran two instances of our agent against one of the fixed-policy agents and the hybrid deep reinforcement learning agent over a number of trials. The results, shown in Table 3, demonstrate our agent’s dominant performance against the hybrid reinforcement learning approach: our agents achieved a considerably higher win rate in the tournament than the hybrid learning agent.
| | Our Agent 1 | Our Agent 2 | Hybrid Agent |
|---|---|---|---|
| | Mean ± SD | Mean ± SD | Mean ± SD |
| Win ratio | | | |
7. Conclusion
We presented a new agent architecture for interactive novelty handling in adversarial environments that can detect, characterize, and accommodate novelties. The architecture is modeled on the way human cognition deals with environmental changes. First, we use ASP to detect and characterize interactive novelties (action, interaction, and relation). Then, we update our agent’s knowledge base with the detected novelties. Finally, we use the truncated-rollout MCTS agent to accommodate the novelty. The external evaluation results support the architecture’s effectiveness in handling different levels of interactive novelty.
However, the architecture has limitations in novelty characterization and in learning agent behavior. One limitation is its ability to learn opponents’ behaviors: our architecture does not explicitly model an opponent’s strategy in order to detect changes in other agents’ behaviors and adapt accordingly. To address this limitation, we propose two additional models that could become part of the novelty handler component. The first approach is to model opponents’ behavior using probabilistic reasoning Pearl (1988); in such models, we can learn the action probability distribution conditioned on the game state, which helps us detect changes in opponents’ behaviors.
Second, we would like to model opponents’ behavior using reinforcement learning. Recent applications of reinforcement learning show promising results in learning opponents’ behavior without access to the opponents’ observations and actions during either training or execution Papoudakis and Albrecht (2020); Radulescu et al. (2020). Ultimately, we believe that improving the model’s ability to predict other agents’ behaviors is the biggest area for growth.
This work was funded in part by DARPA grant W911NF-20-2-0006. We would like to thank Mayank Kejriwal, Shilpa Thomas, Min-Hsueh Chiu and other members of the University of Southern California team for the Monopoly simulator and agent evaluation.
References
- 198 (1988) 1988. THE MORGAN KAUFMANN SERIES IN REPRESENTATION AND REASONING. In Probabilistic Reasoning in Intelligent Systems, Judea Pearl (Ed.). Morgan Kaufmann, San Francisco (CA), i. https://doi.org/10.1016/B978-0-08-051489-5.50001-1
- Andronache and Scheutz (2006) Virgil Andronache and Matthias Scheutz. 2006. Ade - an Architecture Development Environment for Virtual and Robotic Agents. Int. J. Artif. Intell. Tools 15 (2006), 251–286.
- Arulkumaran et al. (2017) Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. 2017. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine 34, 6 (2017), 26–38.
- Barakova et al. (2018) E. I. Barakova, M. De Haas, W. Kuijpers, N. Irigoyen, and A. Betancourt. 2018. Socially grounded game strategy enhances bonding and perceived smartness of a humanoid robot. Connection Science 30, 1 (2018), 81–98. https://doi.org/10.1080/09540091.2017.1350938 arXiv:https://doi.org/10.1080/09540091.2017.1350938
- Baral (2003) Chitta Baral. 2003. Knowledge Representation, Reasoning and Declarative Problem Solving. Cambridge University Press. https://doi.org/10.1017/CBO9780511543357
- Baral et al. (2004) Chitta Baral, Michael Gelfond, and Richard Scherl. 2004. Using answer set programming to answer complex queries. In Proceedings of the Workshop on Pragmatics of Question Answering at HLT-NAACL 2004. Association for Computational Linguistics, Boston, Massachusetts, USA, 17–22. https://aclanthology.org/W04-2503
- Billings et al. (1998) Darse Billings, Denis Papp, Jonathan Schaeffer, and Duane Szafron. 1998. Opponent modeling in poker. Aaai/iaai 493, 499 (1998), 105.
- Boney et al. (2021) Rinu Boney, Alexander Ilin, Juho Kannala, and Jarno Seppanen. 2021. Learning to Play Imperfect-Information Games by Imitating an Oracle Planner. IEEE Transactions on Games (2021). https://doi.org/10.1109/TG.2021.3067723
- Bonjour et al. (2021) Trevor Bonjour, Marina Haliem, Aala Oqab Alsalem, Shilpa Thomas, Hongyu Li, Vaneet Aggarwal, Mayank Kejriwal, and Bharat K. Bhargava. 2021. Decision Making in Monopoly Using a Hybrid Deep Reinforcement Learning Approach. IEEE Transactions on Emerging Topics in Computational Intelligence 6 (2021), 1335–1344.
- Brown et al. (2018) Noam Brown, Tuomas Sandholm, and Brandon Amos. 2018. Depth-Limited Solving for Imperfect-Information Games. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2018/file/34306d99c63613fad5b2a140398c0420-Paper.pdf
- Choi et al. (2000) Samuel Choi, Dit-Yan Yeung, and Nevin Zhang. 2000. An Environment Model for Nonstationary Reinforcement Learning. In Advances in Neural Information Processing Systems, S. Solla, T. Leen, and K. Müller (Eds.), Vol. 12. MIT Press. https://proceedings.neurips.cc/paper/1999/file/e8d92f99edd25e2cef48eca48320a1a5-Paper.pdf
- Ganzfried and Sandholm (2011) Sam Ganzfried and Tuomas Sandholm. 2011. Game Theory-Based Opponent Modeling in Large Imperfect-Information Games. In International Foundation for Autonomous Agents and Multiagent Systems (Taipei, Taiwan) (AAMAS ’11). 533–540.
- Gebser et al. (2014) Martin Gebser, Roland Kaminski, Benjamin Kaufmann, and Torsten Schaub. 2014. Clingo = ASP + Control: Preliminary Report. CoRR abs/1405.3694 (2014). arXiv:1405.3694 http://arxiv.org/abs/1405.3694
- Gizzi et al. (2021) Evana Gizzi, Hassan Amel, Wo Wei Lin, Keenan Rhea, and Jivko Sinapov. 2021. Toward Creative Problem Solving Agents: Action Discovery through Behavior Babbling. 2021 IEEE International Conference on Development and Learning (ICDL) (2021), 1–7.
- Goel et al. (2021) Shivam Goel, Gyan Tatiya, Matthias Scheutz, and Jivko Sinapov. 2021. NovelGridworlds: A Benchmark Environment for Detecting and Adapting to Novelties in Open Worlds. In International Foundation for Autonomous Agents and Multiagent Systems (AAMAS).
- Goertzel (2014) Ben Goertzel. 2014. Artificial General Intelligence: Concept, State of the Art, and Future Prospects. Journal of Artificial General Intelligence 0 (01 2014). https://doi.org/10.2478/jagi-2014-0001
- Haliem et al. (2021) Marina Haliem, Trevor Bonjour, Aala Oqab Alsalem, Shilpa Thomas, Hongyu Li, Vaneet Aggarwal, Bharat Bhargava, and Mayank Kejriwal. 2021. Learning Monopoly Gameplay: A Hybrid Model-Free Deep Reinforcement Learning and Imitation Learning Approach. ArXiv: 2103.00683 (2021).
- Heinrich and Silver (2016) Johannes Heinrich and David Silver. 2016. Deep Reinforcement Learning from Self-Play in Imperfect-Information Games. ArXiv: 1603.01121 (2016). arXiv:1603.01121 http://arxiv.org/abs/1603.01121
- Hester and Stone (2012) Todd Hester and Peter Stone. 2012. Intrinsically Motivated Model Learning for a Developing Curious Agent. In AAMAS Adaptive Learning Agents (ALA) Workshop (Valencia, Spain).
- Ji et al. (2004) Qiang Ji, Wayne Gray, Markus Guhe, and Michael Schoelles. 2004. Towards an integrated cognitive architecture. (March 2004).
- J.P. (1988) J.P. 1988. Preface to the Fourth Printing. In Probabilistic Reasoning in Intelligent Systems, Judea Pearl (Ed.). Morgan Kaufmann, San Francisco (CA), ix. https://doi.org/10.1016/B978-0-08-051489-5.50006-0
- Kejriwal and Thomas (2021) Mayank Kejriwal and Shilpa Thomas. 2021. A multi-agent simulator for generating novelty in monopoly. Simulation Modelling Practice and Theory 112 (2021), 102364. https://doi.org/10.1016/j.simpat.2021.102364
- Koller and Friedman (2009) D. Koller and N. Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press. https://books.google.co.in/books?id=7dzpHCHzNQ4C
- Nash (1951) John Nash. 1951. Non-Cooperative Games. Annals of Mathematics 54, 2 (1951), 286–295. http://www.jstor.org/stable/1969529
- Padakandla et al. (2020) Sindhu Padakandla, Prabuchandran K. J., and Shalabh Bhatnagar. 2020. Reinforcement learning algorithm for non-stationary environments. Applied Intelligence 50, 11 (Jun 2020), 3590–3606. https://doi.org/10.1007/s10489-020-01758-5
- Papoudakis and Albrecht (2020) Georgios Papoudakis and Stefano V. Albrecht. 2020. Variational Autoencoders for Opponent Modeling in Multi-Agent Systems. CoRR abs/2001.10829 (2020). arXiv:2001.10829 https://arxiv.org/abs/2001.10829
- Pearl (1988) Judea Pearl. 1988. Chapter 3 - MARKOV AND BAYESIAN NETWORKS: Two Graphical Representations of Probabilistic Knowledge. In Probabilistic Reasoning in Intelligent Systems, Judea Pearl (Ed.). Morgan Kaufmann, San Francisco (CA), 77–141. https://doi.org/10.1016/B978-0-08-051489-5.50009-6
- Peng et al. (2017) Peng Peng, Ying Wen, Yaodong Yang, Quan Yuan, Zhenkun Tang, Haitao Long, and Jun Wang. 2017. Multiagent Bidirectionally-Coordinated Nets: Emergence of Human-level Coordination in Learning to Play StarCraft Combat Games. https://doi.org/10.48550/ARXIV.1703.10069
- Peng et al. (2021) Xiangyu Peng, Jonathan C. Balloch, and Mark O. Riedl. 2021. Detecting and Adapting to Novelty in Games. ArXiv: 2106.02204 (2021). arXiv:2106.02204 https://arxiv.org/abs/2106.02204
- Perez-Liebana et al. (2016) Diego Perez-Liebana, Spyridon Samothrakis, Julian Togelius, Tom Schaul, and Simon M Lucas. 2016. General video game ai: Competition, challenges and opportunities. In Thirtieth AAAI conference on artificial intelligence.
- Ponsen et al. (2007) Marc Ponsen, Pieter Spronck, Héctor Muñoz-Avila, and David W. Aha. 2007. Knowledge acquisition for adaptive game AI. Science of Computer Programming 67, 1 (2007), 59–75. https://doi.org/10.1016/j.scico.2007.01.006 Special Issue on Aspects of Game Programming.
- Radulescu et al. (2020) Roxana Radulescu, Timothy Verstraeten, Yijie Zhang, Patrick Mannion, Diederik M. Roijers, and Ann Nowé. 2020. Opponent Learning Awareness and Modelling in Multi-Objective Normal Form Games. CoRR abs/2011.07290 (2020). arXiv:2011.07290 https://arxiv.org/abs/2011.07290
- Ren et al. (2020) Kui Ren, Tianhang Zheng, Zhan Qin, and Xue Liu. 2020. Adversarial Attacks and Defenses in Deep Learning. Engineering 6, 3 (2020), 346–360. https://doi.org/10.1016/j.eng.2019.12.012
- Sarathy et al. (2021) Vasanth Sarathy, Daniel Kasenberg, Shivam Goel, Jivko Sinapov, and Matthias. Scheutz. 2021. SPOTTER: Extending Symbolic Planning Operators through Targeted Reinforcement Learning. In International Foundation for Autonomous Agents and Multiagent Systems (AAMAS).
- Soni et al. (2022) Utkarsh Soni, Sarath Sreedharan, Mudit Verma, Lin Guan, Matthew Marquez, and Subbarao Kambhampati. 2022. Towards customizable reinforcement learning agents: Enabling preference specification through online vocabulary expansion. https://doi.org/10.48550/ARXIV.2210.15096
- Sreedharan et al. (2020) Sarath Sreedharan, Utkarsh Soni, Mudit Verma, Siddharth Srivastava, and Subbarao Kambhampati. 2020. Bridging the Gap: Providing Post-Hoc Symbolic Explanations for Sequential Decision-Making Problems with Inscrutable Representations. arXiv preprint arXiv:2002.01080 (2020).
- Sucar (2015) Luis Enrique Sucar. 2015. Probabilistic Graphical Models. Advances in Computer Vision and Pattern Recognition. Springer, London.
- Ustun et al. (2021) Volkan Ustun, Paul S. Rosenbloom, Seyed Sajjadi, and Jeremy Nuttal. 2021. Controlling Synthetic Characters in Simulations: A Case for Cognitive Architectures and Sigma. CoRR abs/2101.02231 (2021). arXiv:2101.02231 https://arxiv.org/abs/2101.02231
- Verma et al. (2022) Mudit Verma, Ayush Kharkwal, and Subbarao Kambhampati. 2022. Advice Conformance Verification by Reinforcement Learning agents for Human-in-the-Loop. arXiv preprint arXiv:2210.03455 (2022).
- Wanyana and Moodley (2021) Tezira Wanyana and Deshendran Moodley. 2021. An Agent Architecture for Knowledge Discovery and Evolution. 241–256. https://doi.org/10.1007/978-3-030-87626-5_18