
Lipschitz Lifelong Monte Carlo Tree Search for Mastering Non-Stationary Tasks

Zuyuan Zhang
The George Washington University
[email protected]
Tian Lan
The George Washington University
[email protected]
Abstract

Monte Carlo Tree Search (MCTS) has proven highly effective in solving complex planning tasks by balancing exploration and exploitation using the Upper Confidence Bound for Trees (UCT). However, existing work has not considered MCTS-based lifelong planning, where an agent faces a non-stationary series of tasks – e.g., with varying transition probabilities and rewards – that are drawn sequentially throughout the operational lifetime. This paper presents LiZero for Lipschitz lifelong planning using MCTS. We propose a novel concept of adaptive UCT (aUCT) to transfer knowledge from a source task to the exploration/exploitation of a new task, depending on both the Lipschitz continuity between tasks and the confidence of knowledge in Monte Carlo action sampling. We analyze LiZero's acceleration factor in terms of improved sampling efficiency and also develop efficient algorithms to compute aUCT in an online fashion by both data-driven and model-based approaches, whose sampling complexity and error bounds are also characterized. Experiment results show that LiZero significantly outperforms existing MCTS and lifelong learning baselines in terms of much faster convergence (3~4x) to optimal rewards. Our results highlight the potential of LiZero to advance decision-making and planning in dynamic real-world environments.

1 Introduction

Monte Carlo Tree Search (MCTS) has demonstrated state-of-the-art performance in solving many challenging planning tasks, from playing the game of Go Silver et al. (2016) and chess Silver et al. (2017) to logistics planning. It performs look-ahead searches based on Monte Carlo sampling of the actions to balance efficient exploration and optimized exploitation in large search spaces. Recent efforts have focused on developing MCTS algorithms for real-world domains that require the elimination of certain standard assumptions. Examples include MuZero Schrittwieser et al. (2020a), which leverages the decoding of hidden states to avoid requiring knowledge of the game dynamics, and MAzero Liu et al. (2024), which performs multi-agent search through decentralized execution. However, existing work has not considered lifelong non-stationarity of task dynamics, which may manifest itself in many open-world domains, as the task environment can often vary over time or across scenarios. This calls for novel MCTS algorithms that can adapt in response and accumulate and exploit knowledge throughout the learning process.

We consider MCTS-based lifelong planning under non-stationarity. An agent faces a series of changing planning tasks – e.g., with varying transition probabilities and rewards – which are drawn sequentially throughout the operational lifetime. Transferring knowledge from prior experience to continually adapt Monte Carlo sampling of the actions, and thus speed up searches in new tasks, is a key question in this setting. We note that although continual and lifelong planning has been studied in the reinforcement learning (RL) context, e.g., learning models of the non-stationary task environment Xie et al. (2020), identifying reusable skills Lu et al. (2020), or estimating Bayesian sampling posteriors Fu et al. (2022), such prior work is not applicable to MCTS. Monte Carlo action sampling in MCTS relies on Upper Confidence Trees (UCT) or polynomial Upper Confidence Trees (pUCT) Auger et al. (2013); Matsuzaki (2018) to balance exploration and exploitation in large search spaces. To the best of our knowledge, there has been no existing work analyzing the transfer of knowledge from past MCTS searches to new tasks to enable adaptive UCT/pUCT rules in lifelong MCTS.

This paper proposes LiZero for Lipschitz lifelong planning using MCTS. We quantify a novel concept that the amount of knowledge transferable from a source task to the UCT/pUCT rule of a new task depends on both the similarity between the tasks and the confidence of the knowledge. More precisely, by defining a distance metric between two MDPs, we refine the concentration argument and derive a new adaptive UCT bound (denoted as aUCT in this paper) for lifelong MCTS. The aUCT is shown to consist of two components – relating to (i) the Lipschitz continuity between the two tasks and (ii) the confidence of knowledge due to the number of samples in Monte Carlo action sampling. Our results enable the development of a novel LiZero algorithm that makes use of prior experience to run an adaptive MCTS by simulating/traversing from the root node and selecting actions according to the aUCT rule, until reaching a leaf node. We also analyze aUCT's acceleration factor in terms of improved sampling efficiency due to cross-task transfer. It is shown that smaller task distance and higher confidence can both lead to higher acceleration in aUCT.

To support practical deployment of LiZero in lifelong planning, we need efficient solutions to compute aUCT in an online fashion. To this end, we develop practical algorithms to estimate the various terms in aUCT, and especially the distance metric between two MDPs, from either available state-action samples using a data-driven approach or a parameterized distance using a model-based (deep learning) approach. We provide rigorous analysis on the sampling complexity of the data-driven approach, to ensure arbitrarily small error with high probability, by modeling the non-stationary policy update process as a filtration – i.e., an increasing sequence of $\sigma$-algebras. For the model-based approach, we obtain an upper bound using a parameterized distance of the neural network models. These results enable effective LiZero application to open-world tasks. We evaluate LiZero on a series of learning tasks with varying transition probabilities and rewards. It is shown that LiZero significantly outperforms MCTS and lifelong RL baselines (e.g., Winands (2024); Kocsis and Szepesvári (2006); Cheng et al.; Schrittwieser et al. (2020b); Brafman and Tennenholtz (2002); Lecarpentier et al. (2021a)) in terms of faster convergence to higher optimal rewards. Utilizing the knowledge of only a few source tasks, LiZero achieves a 3~4x speedup with about 31% higher early reward in the first half of the learning process.

Our key contributions are as follows. First, we study theoretically the transfer of past experience in MCTS and develop a novel aUCT rule, depending on both Lipschitz continuity between tasks and the confidence of knowledge in Monte Carlo action sampling. It is proven to provide positive acceleration in MCTS due to cross-task transfer. Second, we develop LiZero for lifelong MCTS planning, with efficient methods for online estimation of aUCT and analytical error bounds. Finally, LiZero achieves significant speed-up over MCTS and lifelong RL baselines in lifelong planning.

2 Background

Monte Carlo Tree Search (MCTS) Kocsis and Szepesvári (2006); Silver et al. (2016); Schrittwieser et al. (2020a) is a heuristic search algorithm often applied to problems modeled as MDPs to handle exploration and exploitation dynamically. MCTS builds a search tree by exploring actions from the current state, simulating outcomes, and using those results to update estimated values for the selected actions. Normally, the problems solved by MCTS can be modeled using a Markov Decision Process (MDP) Even-Dar et al. (2004), which is formally defined as a tuple $\langle\mathcal{S},\mathcal{A},R,P\rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $R_{s}^{a}$ is the reward of taking action $a$ in state $s$, and $P$ is the transition probability matrix.

In the MCTS framework, the Upper Confidence Bound for Trees (UCT) Coulom (2006) and its variant, polynomial Upper Confidence Trees (pUCT) Matsuzaki (2018); Auger et al. (2013), are among the most commonly used selection strategies for balancing exploration and exploitation during node selection. Although these bounds are theoretically grounded and have achieved great empirical success, they are based on static environment assumptions and do not consider dynamic, non-stationary environments Pourshamsaei and Nobakhti (2024); Hernandez-Leal et al. (2017); Goldberg and Matarić (2003), where state transitions and reward distributions may change over time, thus requiring the transfer of past knowledge to the exploration/exploitation of new tasks. In this paper, we consider MCTS-based lifelong planning, where an agent faces a non-stationary series of tasks – e.g., with varying transition probabilities and rewards – which requires the development of new adaptive UCT bounds.

Lifelong reinforcement learning Lecarpentier et al. (2021b); Xie et al. (2020); Fu et al. (2022); Lu et al. (2020); Auger et al. (2013); Zou et al. (2024); Zhang et al. (2024a) is the process of learning a series of problems from unknown MDPs (Markov Decision Processes) online. Each time an MDP is sampled, it is treated as a separate RL (Reinforcement Learning) problem, in which the agent explores the environment and adjusts its own policy Da Silva et al. (2018); Hawasly and Ramamoorthy (2013); Abel et al. (2018); Zhang et al. (2024b, c). The goal is for the agent to interact with the environment using a policy $\pi$ to achieve the maximum expected reward. We can reasonably believe that the knowledge gained in similar MDPs can be reused. We note that despite recent work on continual and lifelong RL, e.g., learning models of the non-stationary task environment Xie et al. (2020), identifying reusable skills Lu et al. (2020), or estimating Bayesian sampling posteriors Fu et al. (2022), these prior works are not applicable to MCTS in lifelong learning settings. This requires re-examining and deriving adaptive UCT bounds for knowledge transfer.

3 Our Proposed Solution

3.1 Deriving adaptive Upper Confidence Bound (aUCT)

To derive the proposed aUCT rule, we consider a set of $m$ past known MDPs $\mathcal{M}_{1},\ldots,\mathcal{M}_{m}$ and their learned search policies $\pi_{1},\ldots,\pi_{m}$. Let $S$ and $A$ be their state and action spaces, respectively (without loss of generality, we assume that the MDPs have the same state and action spaces; otherwise, we can consider the extended MDPs defined on the union of their state and action spaces). Let $N_{i}(s,a)$ be the visit count of MDP $\mathcal{M}_{i}$ to state-action pair $(s\in S,a\in A)$, $W_{i}(s,a)$ denote its sampled return, and $Q^{N_{i}}_{\mathcal{M}_{i}}(s,a)=W_{i}(s,a)/N_{i}(s,a)$ be the learned estimate for the Q-value of MDP $\mathcal{M}_{i}$. Our goal is to apply this knowledge toward learning a new MDP, denoted by $\mathcal{M}$. To this end, we derive a new Lipschitz upper confidence bound for $\mathcal{M}$, which utilizes and transfers the knowledge from past MDPs $\mathcal{M}_{1},\ldots,\mathcal{M}_{m}$, thus obtaining an improved Monte Carlo action sampling strategy that limits the tree search on $\mathcal{M}$ to a smaller subset of sampled actions. We use $N(s,a)$ to denote the visit count of the new MDP to $(s\in S,a\in A)$, $W(s,a)$ to denote the sampled return, and thus $Q^{N}_{\mathcal{M}}(s,a)=W(s,a)/N(s,a)$ to denote its current Q-value estimate.

Our key idea in this paper is that an improved upper confidence bound for the new MDP $\mathcal{M}$ can be obtained by (i) analyzing the Lipschitz continuity between the past and new MDPs with respect to the upper confidence bounds and (ii) taking into account the confidence and aleatory uncertainty of the learned Q-value estimates to determine to what extent the learned knowledge from each $\mathcal{M}_{i}$ is pertinent. Intuitively, the more similar $\mathcal{M}$ and $\mathcal{M}_{i}$ are, and the more samples (and thus higher confidence) we have in the learned Q-value estimates, the less exploration we would need to perform for solving $\mathcal{M}$ through MCTS. Our analysis will lead to an improved upper confidence bound that guides the MCTS on the new MDP $\mathcal{M}$ over a much smaller subset of action samples, thus significantly improving the search performance. We start by introducing a definition of the distance between any two given MDPs, $\mathcal{M}=\langle R,P\rangle$ and $\mathcal{M}^{\prime}=\langle R^{\prime},P^{\prime}\rangle$, with reward functions $R,R^{\prime}$ and state transitions $P,P^{\prime}$, respectively. We choose a positive scaling factor $\kappa>0$ to combine the distances with respect to transition probabilities and rewards. Proofs of all theorems and corollaries are presented in the appendix.

Definition 3.1.

Given two MDPs $\mathcal{M}=\langle R,P\rangle$ and $\mathcal{M}^{\prime}=\langle R^{\prime},P^{\prime}\rangle$, and a distribution for sampling the state transitions $\mathcal{U}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}^{\prime}\rightarrow[0,1]$, we define the pseudometric between the MDPs as:

d(\mathcal{M},\mathcal{M}^{\prime})=\Delta R+\kappa\cdot\Delta P=\mathbb{E}_{(s,a,s^{\prime})\sim\mathcal{U}}\left[|R_{s}^{a}-R^{\prime a}_{s}|+\kappa\,|P_{ss^{\prime}}^{a}-P^{\prime a}_{ss^{\prime}}|\right].

Here $d(\mathcal{M},\mathcal{M}^{\prime})$ is our definition of the distance between two MDPs, $\mathcal{M}$ and $\mathcal{M}^{\prime}$. We choose $\mathcal{U}$ to be the uniform distribution for sampling the state transitions in this paper. In Section 4, we discuss practical algorithms to estimate the distance metric between two MDPs, from either available state-action samples using a data-driven approach or a parameterized distance using a model-based (deep learning) approach. The sampling complexity and error bounds are also analyzed.
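For concreteness, when the MDPs are small and tabular, the pseudometric in Definition 3.1 can be approximated by direct Monte Carlo sampling under $\mathcal{U}$. The sketch below is a minimal illustration under that assumption; the array names (R1, P1, R2, P2) and the sample budget are hypothetical and not part of our formal algorithm.

```python
import numpy as np

def mdp_distance(R1, P1, R2, P2, kappa=1.0, n_samples=10_000, rng=None):
    """Monte Carlo estimate of d(M, M') under a uniform sampling distribution U.

    R1, R2: reward arrays of shape (S, A); P1, P2: transition tensors of shape (S, A, S).
    """
    rng = np.random.default_rng() if rng is None else rng
    S, A = R1.shape
    s = rng.integers(0, S, size=n_samples)        # uniformly sampled states
    a = rng.integers(0, A, size=n_samples)        # uniformly sampled actions
    s_next = rng.integers(0, S, size=n_samples)   # uniformly sampled next states
    delta_r = np.abs(R1[s, a] - R2[s, a])                 # |R - R'|
    delta_p = np.abs(P1[s, a, s_next] - P2[s, a, s_next]) # |P - P'|
    return float(np.mean(delta_r + kappa * delta_p))
```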

Next, we prove the main result of this paper and show that the upper confidence bounds of $\mathcal{M}$ and $\mathcal{M}^{\prime}$ are Lipschitz continuous with respect to the distance $d(\mathcal{M},\mathcal{M}^{\prime})$. We obtain a new upper confidence bound for $\mathcal{M}$ by transferring the knowledge from the learned Q-value estimates $Q^{N^{\prime}}_{\mathcal{M}^{\prime}}(s,a)=W^{\prime}(s,a)/N^{\prime}(s,a)$ of MDP $\mathcal{M}^{\prime}$. Naturally, the bound also depends on the confidence of the learned Q-value estimates, relating to the visit counts $N(s,a)$ and $N^{\prime}(s,a)$.

Theorem 3.2 (Lipschitz aUCT Rule).

Consider two MDPs $\mathcal{M}$ and $\mathcal{M}^{\prime}$ with visit counts $N,N^{\prime}$ and corresponding Q-value estimates $Q_{\mathcal{M}}^{N}(s,a),Q_{\mathcal{M}^{\prime}}^{N^{\prime}}(s,a)$, respectively. With probability at least $(1-\delta)$ for some $\delta>0$, we have

\left|Q_{\mathcal{M}}^{N}(s,a)-Q_{\mathcal{M}^{\prime}}^{N^{\prime}}(s,a)\right|\leq L\cdot d(\mathcal{M},\mathcal{M}^{\prime})+P(N,N^{\prime})   (1)

where $L=1/(1-\gamma)$ is a Lipschitz constant, $d(\mathcal{M},\mathcal{M}^{\prime})$ is the distance between the MDPs, and $P(N,N^{\prime})$ is given by

P(N,N^{\prime})=\frac{2R_{\max}}{1-\gamma}\sqrt{\frac{\ln(2/\delta)}{2\min(N,N^{\prime})}}   (2)

In the theorem above, we show that the gap between the Q-value estimates of the two MDPs is bounded by two terms, i.e., a Lipschitz continuity term depending on the distance $d(\mathcal{M},\mathcal{M}^{\prime})$ between the two environments and a confidence term depending on the numbers $N,N^{\prime}$ of samples used to estimate the Q-values. The Lipschitz continuity term measures how much the learned knowledge of the source MDP $\mathcal{M}^{\prime}$ is pertinent to the new MDP $\mathcal{M}$, while the confidence term $P(N,N^{\prime})$ quantifies the sampling bias arising from statistical uncertainty due to limited sampling in MCTS. We note that as the number of samples $N$ goes to infinity, we have $Q_{\mathcal{M}}^{N}(s,a)\rightarrow Q_{\mathcal{M}}^{*}(s,a)$ in Theorem 3.2, approaching the true Q-value $Q_{\mathcal{M}}^{*}(s,a)$ of the new MDP. Our theorem effectively provides an upper confidence bound for the true Q-value of the new MDP, based on knowledge transfer from the source MDP. We also note that as both numbers $N,N^{\prime}$ go to infinity, the confidence term vanishes, i.e., $P(N,N^{\prime})\rightarrow 0$. Our theorem recovers Lipschitz lifelong RL Lecarpentier et al. (2021b) as a special case of our results, with respect to the true Q-values of the two MDPs.

We apply Theorem 3.2 to MCTS-based lifelong planning with a non-stationary series of $m$ tasks, $\mathcal{M}_{1},\ldots,\mathcal{M}_{m}$. Our goal is to obtain an improved bound on the true Q-value of the new task $\mathcal{M}$ based on knowledge transfer. To this end, we independently apply the knowledge from each past MDP, i.e., $Q^{N_{i}}_{\mathcal{M}_{i}}(s,a)=W_{i}(s,a)/N_{i}(s,a)$, to the new MDP. Taking the minimum of these bounds and letting $N\rightarrow\infty$ provides the tightest upper bound on the true Q-value $Q_{\mathcal{M}}^{*}(s,a)$ of the new MDP, which is defined as our aUCT bound, as it adaptively transfers knowledge from past tasks to the new task in MCTS-based lifelong planning. The result is summarized in the following corollary.

Corollary 3.3 (aUCT bound in lifelong planning).

Given MDPs $\mathcal{M}_{1},\ldots,\mathcal{M}_{m}$, the new MDP's true Q-value is bounded by $Q_{\mathcal{M}}^{*}(s,a)\leq U_{\rm aUCT}$ with probability at least $(1-\delta)$. The aUCT bound $U_{\rm aUCT}$ is given by

U_{\rm aUCT}(s,a)\triangleq\min_{1\leq i\leq m}\Bigg[Q_{\mathcal{M}_{i}}^{N_{i}}(s,a)+L\cdot d(\mathcal{M},\mathcal{M}_{i})+\frac{2R_{\max}}{1-\gamma}\sqrt{\frac{\ln(2/\delta)}{2N_{i}(s,a)}}\Bigg]   (3)

Obtaining this corollary is straightforward from Theorem 3.2 by taking $N\rightarrow\infty$ and considering the tightest bound over all knowledge transfers. In the context of MCTS-based lifelong planning, the more knowledge we have from solving past tasks, the more easily we can plan a new task, as the aUCT bound $U_{\rm aUCT}(s,a)$ takes the minimum over all past tasks. The confidence of past knowledge, i.e., the statistical uncertainty due to the sample count $N_{i}$, also affects the knowledge transfer to the new task.
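As an illustration, the aUCT bound in Eq. (3) is simple to evaluate once per-task Q-estimates, visit counts, and distance estimates are available. The following is a minimal sketch under those assumptions; the argument names (q_src, n_src, dist) are hypothetical placeholders for the quantities $Q_{\mathcal{M}_{i}}^{N_{i}}(s,a)$, $N_{i}(s,a)$, and $d(\mathcal{M},\mathcal{M}_{i})$.

```python
import numpy as np

def aUCT(q_src, n_src, dist, gamma, r_max, delta=0.05):
    """Evaluate U_aUCT(s, a) in Eq. (3) for one state-action pair.

    q_src[i]: Q-estimate Q_{M_i}^{N_i}(s, a) from source task i
    n_src[i]: visit count N_i(s, a) in source task i
    dist[i]:  estimated distance d(M, M_i) to the new task
    """
    L = 1.0 / (1.0 - gamma)  # Lipschitz constant from Theorem 3.2
    conf = (2 * r_max / (1 - gamma)) * np.sqrt(
        np.log(2 / delta) / (2 * np.maximum(np.asarray(n_src), 1)))
    bounds = np.asarray(q_src) + L * np.asarray(dist) + conf
    return float(bounds.min())  # tightest transferred bound over all source tasks
```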

3.2 Our Proposed LiZero Algorithm Using aUCT

We use the derived aUCT to design a highly efficient LiZero algorithm for MCTS-based lifelong planning. The LiZero algorithm transfers knowledge from past known tasks by computing $U_{\rm aUCT}(s,a)$ in Corollary 3.3. This requires an efficient estimate of the distance $d(\mathcal{M},\mathcal{M}_{i})$ (as defined in Definition 3.1) between the source MDPs and the new (target) MDP. We will present practical algorithms for such distance estimates in the next section, together with analysis of their sampling complexity and error bounds; here we first introduce our LiZero algorithm. We note that, during MCTS, direct exploration/search in the new task $\mathcal{M}$ also produces new knowledge and leads to an improved UCT bound for $\mathcal{M}$. Therefore, our proposed LiZero combines both knowledge transfer through $U_{\rm aUCT}(s,a)$ and knowledge from direct exploration/search in $\mathcal{M}$.

The search in our proposed LiZero algorithm is divided into three stages, repeated for a certain number of simulations. First, each simulation starts from the internal root state and finishes when the simulation reaches a leaf node. Let $Q^{N}_{\mathcal{M}}(s,a)=W(s,a)/N(s,a)$ be the current estimate of the new MDP and $N(s)=\sum_{a\in\mathcal{A}}N(s,a)$ be the visit count of state $s\in\mathcal{S}$. For each simulated time-step, LiZero chooses an action $a$ by maximizing a combined upper confidence bound based on aUCT, i.e.,

a=\arg\max_{a}\min\left[\frac{W(s,a)}{N(s,a)}+C\sqrt{\frac{\ln N(s)}{N(s,a)}},\ U_{\rm aUCT}(s,a)\right]

In practice, we can also use the maximum possible return $R_{\max}/(1-\gamma)$ as an initial value for the search. Next, at the final time-step of the simulation, the reward and state are computed by a dynamics function. A new node, corresponding to the leaf state, is then added to the search tree. Finally, at the end of the simulation, the statistics along the trajectory are updated. Let $G$ be the cumulative (discounted) return for state-action pair $(s,a)$ from the simulation. We update the statistics by:

Q^{N+1}_{\mathcal{M}}(s,a)\coloneqq\frac{N(s,a)\cdot Q^{N}_{\mathcal{M}}(s,a)+G}{N(s,a)+1},\qquad N(s,a)\coloneqq N(s,a)+1.

Intuitively, at the start of task $\mathcal{M}$'s MCTS, there are not sufficient samples available, and thus $U_{\rm aUCT}(s,a)$ serves as a tighter upper confidence bound than the one resulting from Monte Carlo action sampling in $\mathcal{M}$. As more samples are obtained during the search process, the standard UCT bound is expected to become tighter than $U_{\rm aUCT}(s,a)$. The use of both bounds ensures both efficient knowledge transfer and task-specific search. The pseudo-code of LiZero is provided in Appendix A.2.
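To make the selection and backup steps concrete, the sketch below implements the combined rule above and the statistics update using dictionaries keyed by $(s,a)$. This is a simplified sketch rather than our full implementation; the callable u_auct and the data structures W and N are assumptions made for illustration.

```python
import math

def select_action(s, actions, W, N, u_auct, C):
    """One LiZero selection step: standard UCT capped by the transferred aUCT bound."""
    n_s = sum(N.get((s, a), 0) for a in actions)  # N(s) = sum_a N(s, a)
    def score(a):
        n_sa = max(N.get((s, a), 0), 1)
        uct = W.get((s, a), 0.0) / n_sa + C * math.sqrt(math.log(max(n_s, 1)) / n_sa)
        return min(uct, u_auct(s, a))  # combined bound of Section 3.2
    return max(actions, key=score)

def backup(W, N, s, a, G):
    """Update statistics so that W/N realizes Q <- (N*Q + G)/(N+1), N <- N+1."""
    W[(s, a)] = W.get((s, a), 0.0) + G
    N[(s, a)] = N.get((s, a), 0) + 1
```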

For the proposed LiZero algorithm, we prove that it can result in accelerated convergence in MCTS. More precisely, we analyze the sampling complexity for the learned Q-value estimate $Q^{N}_{\mathcal{M}}(s,a)$ to converge to the true value $Q^{*}_{\mathcal{M}}(s,a)$, and demonstrate a strictly positive acceleration factor compared to standard UCT. The results are summarized in the following theorem.

Theorem 3.4.

To ensure convergence in a finite state-action space, i.e., $\max_{(s,a)}|Q^{N}_{\mathcal{M}}(s,a)-Q_{\mathcal{M}}^{*}(s,a)|\leq\epsilon$ with probability $1-\delta$, the number of samples required by standard UCT is

\tilde{O}\left(\frac{|\mathcal{S}|\cdot|\mathcal{A}|}{(1-\gamma)^{3}\epsilon^{2}}\ln\frac{1}{\delta}\right),   (4)

while the proposed LiZero algorithm requires:

\tilde{O}\left(\frac{1}{\Gamma}\cdot\frac{|\mathcal{S}|\cdot|\mathcal{A}|}{(1-\gamma)^{3}\epsilon^{2}}\ln\frac{1}{\delta}\right),   (5)

where $\Gamma>1$ is an acceleration factor given by

\Gamma=\frac{\sum_{(s,a)\in\mathcal{S}_{1}\cup\mathcal{S}_{0}}\frac{1}{(\Delta^{\mathcal{M}}_{(s,a)})^{2}}}{\sum_{(s,a)\in\mathcal{S}_{1}}1+\sum_{(s,a)\in\mathcal{S}_{0}}\frac{1}{(\Delta^{\mathcal{M}}_{(s,a)})^{2}}},   (6)

where $\mathcal{S}_{1}=\{(s,a)\mid\exists i:U_{\rm aUCT}(s,a)<Q^{*}_{\mathcal{M}}(s,a^{*})\}$ is the set of state-action pairs whose $U_{\rm aUCT}$ falls below the optimal return of $a^{*}$ in state $s$, $\mathcal{S}_{0}$ denotes the set of the remaining state-action pairs, and $\Delta^{\mathcal{M}}_{(s,a)}\propto[Q_{\mathcal{M}}^{*}(s,a^{*})-Q_{\mathcal{M}}^{*}(s,a)]$ is a normalized advantage in the range $[0,1]$.

The theorem shows that LiZero achieves a strictly improved acceleration $\Gamma>1$ with a reduced sampling complexity (by $1/\Gamma$) in terms of ensuring convergence to the optimal estimates, i.e., $\max_{(s,a)}|Q^{N}_{\mathcal{M}}(s,a)-Q_{\mathcal{M}}^{*}(s,a)|\leq\epsilon$ with probability $1-\delta$. Since the normalized advantage $\Delta^{\mathcal{M}}_{(s,a)}$ is in $[0,1]$, we have $1/\Delta^{\mathcal{M}}_{(s,a)}\geq 1$. It is then easy to see that the value of $\Gamma$ depends on the cardinality $|\mathcal{S}_{1}|$ and the normalized advantages $\Delta^{\mathcal{M}}_{(s,a)}$. More precisely, LiZero achieves higher acceleration when (i) our aUCT makes more actions $a$ less favorable, as $U_{\rm aUCT}(s,a)<Q^{*}_{\mathcal{M}}(s,a^{*})$ implies that the sub-optimality of action $a$ in $s$ can be more easily determined due to aUCT; or (ii) aUCT helps establish tighter bounds in cases with a smaller advantage, which naturally require more samples to distinguish the optimal actions – since $\Gamma$ increases as the normalized advantage becomes smaller for $(s,a)\in\mathcal{S}_{1}$, while being larger for $(s,a)\in\mathcal{S}_{0}$. These explain LiZero's ability to achieve much higher acceleration and lower sampling complexity, resulting from significantly reduced search spaces. We will evaluate this acceleration/speedup through experiments in Section 5.
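As a purely hypothetical numerical illustration of Eq. (6), the sketch below evaluates $\Gamma$ for a small, made-up set of normalized advantages; the specific values are for illustration only and are not taken from our experiments.

```python
import numpy as np

def acceleration_factor(adv_s1, adv_s0):
    """Evaluate Gamma in Eq. (6). adv_s1 / adv_s0 are normalized advantages in (0, 1]
    for state-action pairs in S_1 (pruned by aUCT) and in S_0 (the rest)."""
    adv_s1, adv_s0 = np.asarray(adv_s1), np.asarray(adv_s0)
    numer = np.sum(1.0 / adv_s1**2) + np.sum(1.0 / adv_s0**2)
    denom = len(adv_s1) + np.sum(1.0 / adv_s0**2)
    return numer / denom

# Hypothetical example: aUCT prunes three pairs with small advantages.
print(acceleration_factor([0.1, 0.2, 0.1], [0.5, 0.5]))  # yields Gamma > 1
```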

4 Estimating aUCT in Practice

To deploy LiZero in practice, we need to estimate aUCT, and in particular the distance $d(\mathcal{M},\mathcal{M}_{i})$ between two MDPs. Sampling all transitions under a uniform distribution $\mathcal{U}$, as defined in Definition 3.1, is clearly too expensive. Thus, we develop efficient algorithms to estimate the distance metric, from either available state-action samples using a data-driven approach or a parameterized distance using a model-based (deep learning) approach. In this section, we also provide rigorous analysis of the sampling complexity and error bounds of the proposed algorithms for distance estimation. The results allow us to readily implement LiZero in practical environments. We will later evaluate the performance of different distance estimators in Section 5 and present the numerical results.

More precisely, we first propose an algorithm to estimate the distance between two MDPs, $\mathcal{M}$ and $\mathcal{M}^{\prime}$, using trajectory samples drawn from their search policies during MCTS, and then make use of importance sampling to mitigate the bias. We will start by analyzing a stationary search policy and then extend the results to a non-stationary policy update process by modeling it as a filtration – i.e., an increasing sequence of $\sigma$-algebras. Next, since many practical problems involve extremely large or even continuous action and state spaces (i.e., $\mathcal{A}$ and $\mathcal{S}$), we further consider a model-based approach by learning neural network approximations of the MDPs – denoted by parameter sets $\phi$ and $\phi^{\prime}$, respectively – and then computing an upper bound on the distance using a parameterized distance of the neural network models. Analyses of sampling complexity and error bounds are provided as theorems in this section.

4.1 Sample-based Distance Estimate

During MCTS, transition samples are collected from the search to train a search policy $\pi$. It is easy to see that we can leverage these transition samples to estimate the distance $d(\mathcal{M},\mathcal{M}^{\prime})$ between two MDPs, as long as we address the bias arising from the gap between the search policy $\pi$ and the desired sampling distribution $\mathcal{U}$ in the distance definition $d(\mathcal{M},\mathcal{M}^{\prime})$. This also allows us to obtain a consistent estimate of the MDP distance, without depending on the search policy that is updated during training. We note that this bias can be addressed by importance sampling.

Let $\Delta X(s,a)=\Delta R_{s}^{a}+\kappa\Delta P_{s}^{a}$ be the distance metric for a given state-action pair $(s,a)$. We can rewrite the distance as $d(\mathcal{M},\mathcal{M}^{\prime})=\mathbb{E}_{(s,a)\sim\mathcal{U}}[\Delta X(s,a)]$. We denote by $p_{\mathcal{U}}(s,a)$ the probability (or density) of sampling $(s,a)$ according to distribution $\mathcal{U}$. Importance sampling implies:

\mathbb{E}_{(s,a)\sim\mathcal{U}}[\Delta X(s,a)]=\mathbb{E}_{(s,a)\sim\pi}\left[\frac{p_{\mathcal{U}}(s,a)}{\pi(s,a)}\cdot\Delta X(s,a)\right],   (7)

which can be readily computed from the collected transition samples following the search policy $\pi(s,a)$. Therefore, for a given set of samples $\{(s_{i},a_{i}),\ i=1,\ldots,n\}$ collected from a search policy $\pi(s,a)$, we can estimate the distance by the empirical mean:

\hat{d}_{1}=\frac{1}{n}\sum_{i=1}^{n}w_{i}\,\Delta X(s_{i},a_{i}),\quad{\rm with}\ w_{i}=\frac{p_{\mathcal{U}}(s_{i},a_{i})}{\pi(s_{i},a_{i})}   (8)

where $w_{i}$ is the importance sampling weight.

As long as the state-action pairs with $\pi(s,a)>0$ cover the support of $\mathcal{U}$, this estimator satisfies $\mathbb{E}[\hat{d}_{1}]=d(\mathcal{M},\mathcal{M}^{\prime})$, meaning it is unbiased. Let $\alpha$ be the "coverage" of policy $\pi(s,a)$, i.e., $\pi(s,a)\geq\alpha>0$, and $p_{\mathcal{U}}^{\max}$ be the maximum desired sampling probability. We summarize this result in the following theorem and state the sampling complexity for the estimator $\hat{d}_{1}$ to $\epsilon$-converge to $d(\mathcal{M},\mathcal{M}^{\prime})$.
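A minimal sketch of the estimator $\hat{d}_{1}$ in Eq. (8) is shown below, assuming callables for the policy probability, the uniform probability, and the per-pair difference $\Delta X(s,a)$; these function names are illustrative rather than part of a fixed API.

```python
import numpy as np

def estimate_d1(samples, pi_prob, p_uniform, delta_x):
    """Importance-sampling estimator d_hat_1 of Eq. (8).

    samples:   list of (s, a) pairs collected under the search policy pi
    pi_prob:   pi_prob(s, a)   -> sampling probability under pi
    p_uniform: p_uniform(s, a) -> desired probability under U
    delta_x:   delta_x(s, a)   -> |R - R'| + kappa * |P - P'| at (s, a)
    """
    weights = np.array([p_uniform(s, a) / pi_prob(s, a) for s, a in samples])
    values = np.array([delta_x(s, a) for s, a in samples])
    return float(np.mean(weights * values))
```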

Theorem 4.1 (Sampling Complexity under Stationarity).

Assume that for any $(s,a)$, the reward plus transition difference is bounded, i.e., $\Delta X(s,a)\in[0,b]$, and that there exists $\alpha$ such that $\pi(s,a)\geq\alpha>0$. When $n$ independent samples are used to estimate $\hat{d}_{1}$, we have

\Pr\{|\hat{d}_{1}-d(\mathcal{M},\mathcal{M}^{\prime})|\leq\epsilon\}\geq 1-\delta   (9)

for any $\delta\in(0,1)$, if the number of samples satisfies

n\geq\frac{1}{2\epsilon^{2}}\,b^{2}\left(\frac{p_{\mathcal{U}}^{\max}}{\alpha}\right)^{2}\ln\left(\frac{2}{\delta}\right).   (10)

Thus, we obtain a convergence guarantee in the sense of arbitrarily high probability $1-\delta$ and arbitrarily small error $\epsilon$ for estimating $d(\mathcal{M},\mathcal{M}^{\prime})$ using $\hat{d}_{1}$. The estimator $\hat{d}_{1}$ is unbiased and converges to the true distance once the number of samples is sufficiently large.

We note that in many practical settings, the search policy $\pi$ does not stick to a stationary distribution. Instead, it is continuously updated in each iteration, resulting in a non-stationary sequence of policies over time, i.e., $\pi_{1},\pi_{2},\dots,\pi_{k}$. Thus, the transition samples $(s_{k},a_{k})$ we obtain at each step $k$ for estimating the distance $d(\mathcal{M},\mathcal{M}^{\prime})$ are in fact drawn from a different $\pi_{k}$. We cannot assume that the samples follow a stationary distribution (nor that the $\{\Delta X^{w}_{k}\}$ are i.i.d.) in importance sampling. To address this problem, we model the non-stationary process of policy updates as a filtration – i.e., an increasing sequence of $\sigma$-algebras. In particular, we make the following assumption: at the $k$-th sampling step, the environment is forcibly reset to a predetermined policy $\pi_{k}$ or independently draws a state from an external memory. This assumption is reasonable because, in many episodic learning scenarios, the environment is inherently divided into episodes: at the beginning of each episode, the state is reset to some initial distribution (e.g., the opening state in Atari games or the initial pose in MuJoCo). This naturally results in the "reset" assumption.

In this setup, the policy $\pi_{k}$ at step $k$ is determined by information from step $k-1$ or earlier. Consequently, once $\pi_{k}$ is fixed, the (marginal) distribution of $\Delta X^{w}_{k}=\frac{p_{\mathcal{U}}(s_{k},a_{k})}{\pi_{k}(s_{k},a_{k})}\Delta X(s_{k},a_{k})$ is also fixed. Therefore, we can establish the filtration $\{\mathcal{F}_{k},k=1,2,\ldots\}$ as follows:

\mathcal{F}_{k-1}=\sigma\{\pi_{1},\ldots,\pi_{k},(s_{1},a_{1}),\ldots,(s_{k-1},a_{k-1})\},   (11)

where $\sigma\{\cdot\}$ denotes the smallest $\sigma$-algebra generated by the random elements. Thus, we obtain:

\mathbb{E}[\Delta X^{w}_{k}\mid\mathcal{F}_{k-1}]=\mathbb{E}_{(s_{k},a_{k})\sim\pi_{k}}\left[\frac{p_{\mathcal{U}}(s_{k},a_{k})}{\pi_{k}(s_{k},a_{k})}\cdot\Delta X(s_{k},a_{k})\right]=\mathbb{E}_{(s,a)\sim\mathcal{U}}[\Delta X(s,a)]=d(\mathcal{M},\mathcal{M}^{\prime})   (12)

This allows us to obtain another empirical estimator $\hat{d}_{2}$ using the filtration model. We analyze the sampling complexity of $\hat{d}_{2}$ and summarize the results in the following theorem.

Theorem 4.2 (Sampling Complexity under Non-Stationarity).

Under the same conditions as Theorem 4.1, when $n$ samples are used to estimate $\hat{d}_{2}$ under the filtration model, we have

\Pr\{|\hat{d}_{2}-d(\mathcal{M},\mathcal{M}^{\prime})|\leq\epsilon\}\geq 1-\delta   (13)

for any $\delta\in(0,1)$, if the number of samples satisfies

n\geq\frac{2}{\epsilon^{2}}\,b^{2}\left(\frac{p_{\mathcal{U}}^{\max}}{\alpha}\right)^{2}\ln\left(\frac{2}{\delta}\right).   (14)

This implies that more samples are needed for distance estimation when accounting for the non-stationarity of the policy update process.
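The corresponding estimator $\hat{d}_{2}$ differs from $\hat{d}_{1}$ only in that each sample is weighted by the policy $\pi_{k}$ that was active when it was collected. A minimal sketch under that assumption is given below; the tuple layout of the samples is hypothetical.

```python
import numpy as np

def estimate_d2(samples):
    """Filtration-based estimator d_hat_2 for non-stationary search policies.

    samples: iterable of (s_k, a_k, pi_k_prob, p_u_prob, delta_x_val), where
    pi_k_prob is the probability of (s_k, a_k) under the policy pi_k active at step k.
    """
    terms = [p_u / pi_k * dx for (_, _, pi_k, p_u, dx) in samples]
    return float(np.mean(terms))
```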

4.2 Model-based Distance Estimate

When the action and state spaces $\mathcal{A}$ and $\mathcal{S}$ are very large or even continuous, the sample-based method becomes increasingly expensive. Therefore, we propose a model-based approach that first approximates the dynamics of MDPs $\mathcal{M}$ and $\mathcal{M}^{\prime}$ using two neural networks and then estimates $d(\mathcal{M},\mathcal{M}^{\prime})$ based on the parameterized distance between the neural networks.

To this end, we need to establish a bound on $d(\mathcal{M},\mathcal{M}^{\prime})$ using the distance between their neural network parameters. We use a neural network $\Psi_{\phi}:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ to model the MDP dynamics. Many model-based learning algorithms, such as PILCO Deisenroth and Rasmussen (2011), MBPO Janner et al. (2019), PETS Chua et al. (2018), and MuZero Schrittwieser et al. (2020a), can be employed to learn the models of $\mathcal{M}$ and $\mathcal{M}^{\prime}$. Let $\phi$ be the neural network parameters of MDP $\mathcal{M}$ and $\phi^{\prime}$ be the neural network parameters of MDP $\mathcal{M}^{\prime}$. We define a distance in the parameter space:

\hat{d}_{\rm para}=\rho(\phi,\phi^{\prime})\geq 0,   (15)

where $\rho$ is a distance or divergence measure in the parameter space, such as the $\ell_{2}$-norm, $\ell_{1}$-norm, or certain kernel distances. Intuitively, if $\phi$ and $\phi^{\prime}$ are very close, the two neural networks are similar in fitting the dynamics of the respective MDPs, suggesting that the two MDPs should have a small distance. To characterize this more rigorously, we present the following theorem, which demonstrates that, under proper assumptions, the distance $\hat{d}_{\rm para}$ based on neural network parameters serves as an upper bound for the desired $d(\mathcal{M},\mathcal{M}^{\prime})$. Let $\kappa=R_{\max}\gamma/(1-\gamma)$ be a constant.

Theorem 4.3.

If the neural networks modeling $\mathcal{M}$ and $\mathcal{M}^{\prime}$ satisfy the Lipschitz condition, i.e., there exists a constant $L>0$ such that $\forall(s,a)$, $\|\Psi_{\phi}(s,a)-\Psi_{\phi^{\prime}}(s,a)\|_{1}\leq L\cdot\rho(\phi,\phi^{\prime})$, then we have:

d(\mathcal{M},\mathcal{M}^{\prime})\leq(1+\kappa)L\,\hat{d}_{\rm para}.   (16)

The theorem indicates that, by learning neural networks to model the MDP dynamics, we can estimate the distance $d(\mathcal{M},\mathcal{M}^{\prime})$ via the distance between the neural network parameters. This parameterized distance can be computed even for continuous action and state spaces.
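For illustration, computing $\hat{d}_{\rm para}$ with $\rho$ chosen as the $\ell_{2}$-norm amounts to flattening and differencing the two models' parameters. The sketch below assumes two PyTorch dynamics models of identical architecture and is not tied to any specific model-learning algorithm.

```python
import torch

def parameter_distance(model_a, model_b, p=2):
    """Compute d_hat_para = rho(phi, phi') as an l_p norm between flattened parameters.

    model_a, model_b: two torch.nn.Module dynamics models with identical architecture.
    """
    phi_a = torch.cat([w.detach().flatten() for w in model_a.parameters()])
    phi_b = torch.cat([w.detach().flatten() for w in model_b.parameters()])
    return torch.linalg.vector_norm(phi_a - phi_b, ord=p).item()

# By Theorem 4.3, d(M, M') <= (1 + kappa) * L * parameter_distance(model_a, model_b).
```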

5 Evaluations

Our experiments evaluate LiZero on a series of ten learning tasks with varying transition probabilities and rewards. We demonstrate LiZero's ability to transfer past knowledge in MCTS-based planning, resulting in significant convergence speedup (3~4x) and early reward improvement (about 31% average improvement during the first half of the learning process) in lifelong planning problems. All experiments are conducted on a Linux machine with an AMD EPYC 7513 32-core CPU and an NVIDIA RTX A6000 GPU, implemented in Python 3. All source code is made available in the supplementary material.

Figure 1 (panels a–d: Task 1, Task 2, Task 6, Task 10): Comparing LiZero with MCTS and lifelong RL baselines. We demonstrate the convergence of different algorithms on representative Tasks 1, 2, 6, and 10 in a non-stationary sequence of ten tasks. In Task 1, since no prior knowledge is yet available, our LiZero and the other MCTS baselines show similar convergence speed and optimal rewards. From Task 2 to Task 10, as more knowledge from past tasks gets transferred to the new task by LiZero, it outperforms all baselines with significantly improved convergence speed. In Task 10, with maximum past knowledge, LiZero demonstrates the largest improvement in convergence speed and optimal reward.
Name Task1 Task2 Task3 Task4 Task5 Task6 Task7 Task8 Task9 Task10 Total
LiZero-U 4.83±0.05 5.98±0.10 5.99±0.07 5.94±0.05 6.08±0.07 6.05±0.16 6.01±0.11 6.03±0.05 6.04±0.04 6.03±0.09 58.98
LiZero-P 4.65±0.06 5.89±0.11 5.90±0.12 5.90±0.03 5.62±0.19 5.68±0.22 5.76±0.12 5.87±0.03 5.78±0.06 5.79±0.20 56.84
LiZero-N 4.64±0.08 5.56±0.07 5.56±0.07 5.52±0.05 5.52±0.08 5.48±0.06 5.50±0.09 5.45±0.06 5.50±0.06 5.48±0.05 54.21
MCTS-R 4.51±0.07 4.43±0.08 4.32±0.11 4.24±0.05 4.18±0.07 4.24±0.10 4.25±0.03 4.47±0.06 4.34±0.03 4.39±0.08 43.37
MCTS-O 4.52±0.06 4.87±0.08 4.57±0.04 4.16±0.03 4.78±0.05 4.91±0.06 4.04±0.03 3.70±0.05 3.02±0.07 2.96±0.06 41.53
pUCT 4.66±0.04 4.71±0.06 4.69±0.13 4.77±0.09 4.74±0.04 4.87±0.05 4.94±0.06 4.72±0.05 4.86±0.07 4.77±0.03 47.73
RMax 1.02±0.02 1.05±0.01 1.01±0.02 1.03±0.01 1.04±0.01 1.05±0.01 1.03±0.03 1.04±0.02 1.03±0.02 1.03±0.01 10.33
LRMax 1.05±0.01 1.05±0.02 1.04±0.02 1.06±0.03 1.05±0.01 1.06±0.02 1.04±0.01 1.06±0.03 1.05±0.01 1.04±0.01 10.50
Table 1: The table summarizes the rewards and standard deviations obtained in sequential tasks. It shows that LiZero achieves about 31% early reward improvement on average, compared with MCTS baselines (including two versions of MCTS with UCT Winands (2024); Kocsis and Szepesvári (2006); Cheng et al. and one with pUCT similar to MuZero Schrittwieser et al. (2020b)) and lifelong RL baselines (including RMax Brafman and Tennenholtz (2002) and LRMax Lecarpentier et al. (2021a)). MCTS-R and MCTS-O demonstrate a similar level of performance, both better than lifelong RL and slightly below pUCT. With more accurate distance estimates – i.e., from LiZero-N to LiZero-P to LiZero-U – we observe further improvement due to better knowledge transfer that comes with more accurate aUCT.
Figure 2: LiZero shows a speedup of 3~4x compared with MCTS and lifelong RL baselines, in terms of achieving the same level of optimal rewards with higher sample efficiency. Figure 2(a) presents our ablation study comparing the different distance estimators in LiZero-U, LiZero-P, and LiZero-N, while MCTS-R can be viewed as a baseline without a distance estimator. The performance of these algorithms is provided in Table 1 and Figure 2 and thus not repeated here. The superior performance of LiZero indeed results from the use of aUCT in MCTS: the tighter the aUCT bounds, the higher the performance.

In the evaluation, we consider state-of-the-art baselines using MCTS and lifelong RL. In particular, we consider two versions of MCTS algorithms that leverage UCT Winands (2024); Kocsis and Szepesvári (2006); Cheng et al.: MCTS-R denotes a version that restarts the search from scratch for each new task, and MCTS-O denotes a version that is oblivious to the non-stationary task dynamics and continues to build upon the search tree from the past. We also consider state-of-the-art MCTS using pUCT, similar to MuZero and related algorithms Schrittwieser et al. (2020b). We include two lifelong RL algorithms: RMax Brafman and Tennenholtz (2002) and LRMax Lecarpentier et al. (2021a), the latter of which exploits a similar Lipschitz continuity in RL but does not consider MCTS using upper confidence bounds. We evaluate three versions of LiZero using different methods for estimating aUCT by computing the task distances, as presented in Section 4. LiZero-U employs a direct distance estimate based on Definition 3.1; LiZero-P is the data-driven distance estimator $\hat{d}_{2}$ using samples following the search policy; and LiZero-N is the neural-network-based estimator $\hat{d}_{\rm para}$ using parameter distances.

The experimental environment we used is a variation of the "tight" task by Abel et al. (2018). It generates a non-stationary sequence of ten learning tasks. Each task consists of a $25\times 25$ grid world, with the initial state located at the center, and four possible actions: up, down, left, and right. The three cells in the top-right corner and one cell in the bottom-left corner are designated as goal cells. For each task, the reward for the goal cells is randomly chosen from the range [0.9, 1]. The remaining cells randomly generate interference rewards within the range [0, 0.1]. Each task's state transition matrix selects its own slip probability (performing an action different from the chosen one) within the range [0, 0.1]. This ensures that the sequence of tasks has varying rewards and transition probabilities. Each task runs for 1,000 epochs. These operations are repeated multiple times to narrow the confidence intervals.
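A minimal sketch of such a task generator is shown below, following the description above; the exact placement of the three top-right goal cells and the data-structure layout are assumptions made for illustration and do not reproduce our exact experimental code.

```python
import numpy as np

def sample_task(rng, size=25, slip_max=0.1):
    """Generate one grid-world task in the non-stationary sequence (illustrative sketch)."""
    goal_reward = rng.uniform(0.9, 1.0)                 # reward for goal cells
    noise = rng.uniform(0.0, 0.1, size=(size, size))    # interference rewards
    slip = rng.uniform(0.0, slip_max)                   # prob. of executing a random action
    # Assumed goal placement: three cells near the top-right corner, one at bottom-left.
    goals = [(0, size - 1), (0, size - 2), (1, size - 1), (size - 1, 0)]
    return {"goal_reward": goal_reward, "noise": noise, "slip": slip, "goals": goals}

tasks = [sample_task(np.random.default_rng(seed)) for seed in range(10)]
```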

Figure 1 shows the convergence of different algorithms on representative Tasks 1, 2, 6, and 10 in a non-stationary series of ten tasks. As tasks are drawn sequentially, the LiZero-U, LiZero-P, and LiZero-N algorithms converge more rapidly than the MCTS and lifelong RL baselines. This speedup becomes evident as early as the second task (Task 2), while similar convergence is observed in Task 1 since no prior knowledge is yet available. From Task 2 to Task 10, as more knowledge from past tasks gets transferred to the new task by LiZero, it outperforms all baselines with significantly improved convergence speed. In Task 10, with maximum past knowledge, LiZero outperforms all baselines in both convergence speed and optimal reward. MCTS-O (which is oblivious to changing task dynamics) exhibits increasing deficiency as tasks further evolve, and performs worse than MCTS-R (which restarts the search from scratch).

In Table 1, we summarize the average rewards (and their standard deviations) obtained in sequential tasks by different algorithms during their first 500 epochs (i.e., the first half of the learning process). LiZero algorithms achieve about 31% early reward improvement on average. As for the MCTS baselines with UCT, MCTS-R shows similar reward across different tasks, while MCTS-O demonstrates higher volatility – due to its reliance on how task dynamics evolve. pUCT achieves higher performance due to the use of an improved probabilistic UCT similar to MuZero. All MCTS baselines show better results than the lifelong RL algorithms (i.e., RMax and LRMax), which are known to be less sample efficient and require more epochs for exploration/exploitation. With more accurate distance estimates – i.e., from LiZero-N to LiZero-P to LiZero-U – we observe further improved results due to better knowledge transfer that comes with more accurate aUCT calculations.

To evaluate the speedup of LiZero, Figure 2 shows the average number of epochs needed by different algorithms to achieve 60%, 70%, and 80% of the optimal reward, respectively. We note that LiZero shows a comfortable speedup of 3~4x compared with MCTS and lifelong RL baselines, while the RL baselines are clearly much less sample efficient than MCTS-based planning in general. We do not go beyond 80% in this plot since some baselines are never able to achieve more than 80% of the optimal reward that LiZero obtains. The results provide numerical examples of the acceleration factor $\Gamma$ characterized in Theorem 3.4. A more accurate aUCT bound (as in LiZero-U) generally means further acceleration/speedup.

Ablation Study. Our ablation study considers the impact of the distance estimator on performance. Figure 2(a) shows the distance estimates in LiZero-U, LiZero-P, and LiZero-N (each with decreasing accuracy) across the sequence of tasks, while for the purpose of the ablation study, MCTS-R can be viewed as an algorithm without a distance estimator. Comparing the performance of these algorithms in Table 1 and Figure 2, we see that the superior performance of LiZero indeed results from the use of aUCT in MCTS: the tighter the aUCT bounds we use, the higher the performance we can achieve. Using no distance estimator, and thus only UCT (in MCTS-R), leads to the lowest performance. Further, as tasks are drawn, the distance estimates decrease quickly, and by the third task, the estimate is already very small, implying accurate aUCT calculation for knowledge transfer.

6 Conclusions

We study theoretically the transfer of past experience in MCTS-based lifelong planning and develop a novel aUCT rule, depending on both the Lipschitz continuity between tasks and the confidence of knowledge in Monte Carlo action sampling. The proposed aUCT is proven to provide positive acceleration in MCTS due to cross-task transfer and enables the development of a new lifelong MCTS algorithm, namely LiZero. We also present efficient methods for online estimation of aUCT and provide analysis of the sampling complexity and error bounds. LiZero is implemented on a non-stationary series of learning tasks with varying transition probabilities and rewards. It outperforms MCTS and lifelong RL baselines with a 3~4x speed-up in solving new tasks and about 31% higher early reward.

Impact Statement

This paper proposes a novel framework for applying Monte Carlo Tree Search (MCTS) in lifelong learning settings, addressing the challenges posed by non-stationary environments and changing task dynamics. By introducing the adaptive Upper Confidence Bound for Trees (aUCT) and leveraging insights from previous MDPs (Markov Decision Processes), our work significantly enhances the efficiency and adaptability of decision-making algorithms across evolving tasks.

The broader societal implications of this research include its potential to improve AI applications in robotics, automated systems, and other domains requiring dynamic decision-making under uncertainty. For instance, this framework could be used in autonomous systems to adaptively respond to changing environments, thereby improving safety and reliability. At the same time, it is crucial to acknowledge and mitigate potential risks, such as unintended biases or over-reliance on prior knowledge that may not fully represent novel situations.

Ethical considerations for this work focus on its use in high-stakes applications, such as healthcare, finance, or defense, where decision-making under uncertainty could have significant consequences. Developers and practitioners should implement safeguards to ensure responsible deployment, including comprehensive testing in diverse scenarios and establishing clear boundaries for its use.

By advancing the state of the art in continual learning and decision-making, this research contributes to the development of more adaptable and intelligent AI systems while highlighting the importance of ethical and responsible innovation in AI technologies.

References

  • Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
  • Silver et al. (2017) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
  • Schrittwieser et al. (2020a) Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020a.
  • Liu et al. (2024) Qihan Liu, Jianing Ye, Xiaoteng Ma, Jun Yang, Bin Liang, and Chongjie Zhang. Efficient multi-agent reinforcement learning by planning. arXiv preprint arXiv:2405.11778, 2024.
  • Xie et al. (2020) Annie Xie, James Harrison, and Chelsea Finn. Deep reinforcement learning amidst lifelong non-stationarity. arXiv preprint arXiv:2006.10701, 2020.
  • Lu et al. (2020) Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Reset-free lifelong learning with skill-space planning. arXiv preprint arXiv:2012.03548, 2020.
  • Fu et al. (2022) Haotian Fu, Shangqun Yu, Michael Littman, and George Konidaris. Model-based lifelong reinforcement learning with bayesian exploration. Advances in Neural Information Processing Systems, 35:32369–32382, 2022.
  • Auger et al. (2013) David Auger, Adrien Couetoux, and Olivier Teytaud. Continuous upper confidence trees with polynomial exploration–consistency. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part I 13, pages 194–209. Springer, 2013.
  • Matsuzaki (2018) Kiminori Matsuzaki. Empirical analysis of puct algorithm with evaluation functions of different quality. In 2018 Conference on Technologies and Applications of Artificial Intelligence (TAAI), pages 142–147. IEEE, 2018.
  • Winands (2024) Mark HM Winands. Monte-carlo tree search. In Encyclopedia of computer graphics and games, pages 1179–1184. Springer, 2024.
  • Kocsis and Szepesvári (2006) Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006.
  • (12) Scott Cheng, Mahmut Kandemir, and Ding-Yong Hong. Speculative monte-carlo tree search. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
  • Schrittwieser et al. (2020b) Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, December 2020b. ISSN 1476-4687. doi: 10.1038/s41586-020-03051-4. URL http://dx.doi.org/10.1038/s41586-020-03051-4.
  • Brafman and Tennenholtz (2002) Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
  • Lecarpentier et al. (2021a) Erwan Lecarpentier, David Abel, Kavosh Asadi, Yuu Jinnai, Emmanuel Rachelson, and Michael L Littman. Lipschitz lifelong reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8270–8278, 2021a.
  • Even-Dar et al. (2004) Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Experts in a markov decision process. Advances in neural information processing systems, 17, 2004.
  • Coulom (2006) Rémi Coulom. Efficient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pages 72–83. Springer, 2006.
  • Pourshamsaei and Nobakhti (2024) Hossein Pourshamsaei and Amin Nobakhti. Predictive reinforcement learning in non-stationary environments using weighted mixture policy. Applied Soft Computing, 153:111305, 2024.
  • Hernandez-Leal et al. (2017) Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, and Enrique Munoz De Cote. A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183, 2017.
  • Goldberg and Matarić (2003) Dani Goldberg and Maja J Matarić. Maximizing reward in a non-stationary mobile robot environment. Autonomous Agents and Multi-Agent Systems, 6:287–316, 2003.
  • Lecarpentier et al. (2021b) Erwan Lecarpentier, David Abel, Kavosh Asadi, Yuu Jinnai, Emmanuel Rachelson, and Michael L. Littman. Lipschitz lifelong reinforcement learning, 2021b. URL https://arxiv.org/abs/2001.05411.
  • Zou et al. (2024) Yifei Zou, Zuyuan Zhang, Congwei Zhang, Yanwei Zheng, Dongxiao Yu, and Jiguo Yu. A distributed abstract mac layer for cooperative learning on internet of vehicles. IEEE Transactions on Intelligent Transportation Systems, 2024.
  • Zhang et al. (2024a) Congwei Zhang, Yifei Zou, Zuyuan Zhang, Dongxiao Yu, Jorge Torres Gómez, Tian Lan, Falko Dressler, and Xiuzhen Cheng. Distributed age-of-information scheduling with noma via deep reinforcement learning. IEEE Transactions on Mobile Computing, 2024a.
  • Da Silva et al. (2018) Felipe Leno Da Silva, Matthew E Taylor, and Anna Helena Reali Costa. Autonomously reusing knowledge in multiagent reinforcement learning. In IJCAI, pages 5487–5493, 2018.
  • Hawasly and Ramamoorthy (2013) Majd Hawasly and Subramanian Ramamoorthy. Lifelong learning of structure in the space of policies. In 2013 AAAI Spring Symposium Series, 2013.
  • Abel et al. (2018) David Abel, Yuu Jinnai, Sophie Yue Guo, George Konidaris, and Michael Littman. Policy and value transfer in lifelong reinforcement learning. In International Conference on Machine Learning, pages 20–29. PMLR, 2018.
  • Zhang et al. (2024b) Zuyuan Zhang, Hanhan Zhou, Mahdi Imani, Taeyoung Lee, and Tian Lan. Collaborative ai teaming in unknown environments via active goal deduction. arXiv preprint arXiv:2403.15341, 2024b.
  • Zhang et al. (2024c) Zuyuan Zhang, Mahdi Imani, and Tian Lan. Modeling other players with bayesian beliefs for games with incomplete information. arXiv preprint arXiv:2405.14122, 2024c.
  • Deisenroth and Rasmussen (2011) Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.
  • Janner et al. (2019) Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in neural information processing systems, 32, 2019.
  • Chua et al. (2018) Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in neural information processing systems, 31, 2018.
  • Qiao et al. (2024) Jing Qiao, Zuyuan Zhang, Sheng Yue, Yuan Yuan, Zhipeng Cai, Xiao Zhang, Ju Ren, and Dongxiao Yu. Br-defedrl: Byzantine-robust decentralized federated reinforcement learning with fast convergence and communication efficiency. In IEEE INFOCOM 2024-IEEE Conference on Computer Communications, pages 141–150. IEEE, 2024.
  • Gao et al. (2024) Mengtong Gao, Yifei Zou, Zuyuan Zhang, Xiuzhen Cheng, and Dongxiao Yu. Cooperative backdoor attack in decentralized reinforcement learning with theoretical guarantee. arXiv preprint arXiv:2405.15245, 2024.
  • Riis (2024) Søren Riis. Mastering nim and impartial games with weak neural networks: An alphazero-inspired multi-frame approach. arXiv preprint arXiv:2411.06403, 2024.
  • Chen et al. (2024) Yiwei Chen, Wenhao Li, XiuZhen Cheng, and Pengfei Hu. A survey of acoustic eavesdropping attacks: Principle, methods, and progress. High-Confidence Computing, page 100241, 2024.
  • Zhang et al. (2025) Zuyuan Zhang, Vaneet Aggarwal, and Tian Lan. Network diffuser for placing-scheduling service function chains with inverse demonstration. arXiv preprint arXiv:2501.05673, 2025.
  • Yin (2025) Chun-Wu Yin. Predefined time convergence guaranteed performance control for uncertain systems based on reinforcement learning. Engineering Applications of Artificial Intelligence, 140:109734, 2025.

Appendix A Appendix / supplemental material

A.1 Proof of Theorem 3.2

Proof.

Proof of Theorem 3.2. In the MCTS UCB algorithm, the estimated Q-values are obtained through multiple simulations, so we must analyze how differences in the simulation outcomes of the two MDPs affect the estimated Q-values.

A direct analysis is complicated by the randomness in the simulation processes of the two MDPs:

  • Transition randomness: Due to different transition probabilities, the two MDPs may move to different next states even when starting from the same state and action.

  • Action selection randomness: When using the UCB algorithm, action selection depends on the current statistical information, which in turn relies on the past simulation results.

This randomness makes it impossible to directly compare two independent random simulation processes Qiao et al. (2024); Gao et al. (2024); Riis (2024); Chen et al. (2024); Zhang et al. (2025); Yin (2025).

To eliminate the impact of randomness, we construct a coupled simulation process for the two MDPs on the same probability space, which allows a direct comparison between them; the additional errors caused by randomness are then incorporated into the analysis as error terms. For this purpose, we introduce the following assumption.

Assumption A.1.

Let us temporarily assume that the actions selected in each simulation are the same for the two MDPs.

  • Initial state consistency: Both simulations start from the same state s.

  • Action selection consistency: The same action a is chosen in each state.

Note: This is a strong assumption and may not hold in practice. We will discuss its impact later.

Thus, we can obtain the difference in cumulative rewards between the two MDPs in a single simulation as:

\displaystyle\Delta G=G_{M}-G_{M^{\prime}}=\sum\limits_{t=0}^{T}\gamma^{t}(R(s_{t}^{M},a_{t})-R^{\prime}(s_{t}^{M^{\prime}},a_{t})) (17)

where s_{t}^{M} and s_{t}^{M^{\prime}} are the states of the two MDPs at step t, and a_{t} is the action selected at step t.

We therefore obtain

\left|Q_{M}^{n_{1}}(s,a)-Q_{M^{\prime}}^{n_{2}}(s,a)\right|=\left|\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}G_{M,i}-\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}G_{M^{\prime},i}\right|\leq\bar{\Delta G}=\left|\frac{1}{n}\sum_{i=1}^{n}\Delta G_{i}\right| (18)

where n=\min\{n_{1},n_{2}\}. To estimate the expectation and variance of \Delta G, we need to analyze how differences in the state sequences affect the cumulative rewards.

We introduce the following quantities to characterize the state differences.

  • Probability of state difference: At each time step t, the probability that the states of the two MDPs differ is denoted p_{t}.

  • Initial state is the same: p_{0}=0.

  • State difference propagation: Due to differences in transition probabilities, state differences may accumulate in subsequent time steps.

Since the probability of state differences occurring at each step is difficult to calculate precisely, we can use the total variation distance to estimate the probability of transitioning to different states. We present the definition of the total variation distance between the transition probabilities of the two MDPs and a recursive method for calculating the probability of state differences.

Definition A.2.

Under action a_{t}, starting from state s_{t}, the total variation distance between the transition probabilities of the two MDPs is:

\displaystyle D_{TV}(P,P^{\prime})=\frac{1}{2}\sum\limits_{s^{\prime}}|P(s^{\prime}|s_{t},a_{t})-P^{\prime}(s^{\prime}|s_{t},a_{t})| (19)

Thus, starting from the same state s_{t} and action a_{t}, the probability that the two MDPs transition to different next states is at most D_{TV}(P,P^{\prime})\leq\frac{\Delta P}{2}.

Hence, the probability that a state difference has occurred can be bounded recursively:

\displaystyle p_{t+1}\leq p_{t}+(1-p_{t})\cdot D_{TV}(P,P^{\prime})\leq p_{t}+\frac{\Delta P}{2} (20)

Unrolling this recursion with p_{0}=0 gives

\displaystyle p_{t}\leq t\cdot\frac{\Delta P}{2} (21)

Thus, the expected difference in cumulative rewards can be bounded as:

\mathbb{E}[|\Delta G|] \leq\mathbb{E}\Bigl[\sum_{t=0}^{T}\gamma^{t}\bigl|R(s_{t}^{M},a_{t})-R^{\prime}(s_{t}^{M^{\prime}},a_{t})\bigr|\Bigr] (22)
\leq\sum_{t=0}^{T}\gamma^{t}\Bigl(\underbrace{\mathbb{E}\bigl[|R(s_{t}^{M},a_{t})-R^{\prime}(s_{t}^{M},a_{t})|\bigr]}_{\text{impact of reward function differences}}+\underbrace{\mathbb{E}\bigl[|R^{\prime}(s_{t}^{M},a_{t})-R^{\prime}(s_{t}^{M^{\prime}},a_{t})|\bigr]}_{\text{reward differences caused by state differences}}\Bigr)
\leq\sum_{t=0}^{T}\gamma^{t}(\Delta R+2R_{\max}\cdot p_{t})
\leq\frac{\Delta R}{1-\gamma}+\sum_{t=0}^{T}\gamma^{t}\cdot 2R_{\max}\cdot t\cdot\frac{\Delta P}{2}
=\frac{\Delta R}{1-\gamma}+R_{\max}\Delta P\sum_{t=0}^{T}t\gamma^{t}
\leq\frac{\Delta R}{1-\gamma}+R_{\max}\Delta P\cdot\frac{\gamma}{(1-\gamma)^{2}}

where the first two steps use the triangle inequality and the last step uses \sum_{t\geq 0}t\gamma^{t}=\frac{\gamma}{(1-\gamma)^{2}}.

To control the deviation of the empirical average, note that the cumulative rewards are bounded, so the difference satisfies

\displaystyle|\Delta G|\leq G_{\max}=\frac{2R_{\max}}{1-\gamma} (23)

By Hoeffding’s inequality:

\displaystyle P(|\bar{\Delta G}-\mathbb{E}[\bar{\Delta G}]|\geq\epsilon)\leq 2\exp(-\frac{2n\epsilon^{2}}{G_{\max}^{2}}) (24)

Thus, with probability at least 1-\delta, we have:

|\hat{Q}_{M}^{n}(s,a)-\hat{Q}_{M^{\prime}}^{n}(s,a)| \leq\mathbb{E}[|\bar{\Delta G}|]+G_{\max}\sqrt{\frac{\ln(2/\delta)}{2n}} (25)
\leq\frac{\Delta R}{1-\gamma}+R_{\max}\Delta P\cdot\frac{\gamma}{(1-\gamma)^{2}}+\frac{2R_{\max}}{1-\gamma}\sqrt{\frac{\ln(2/\delta)}{2n}}
=\frac{1}{1-\gamma}\Bigl(\Delta R+\frac{R_{\max}\gamma}{1-\gamma}\Delta P\Bigr)+\frac{2R_{\max}}{1-\gamma}\sqrt{\frac{\ln(2/\delta)}{2n}}
=L(\Delta R+\kappa\Delta P)+L_{2}
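For concreteness, the following is a minimal Python sketch that simply evaluates the right-hand side of Equation (25), with L=\frac{1}{1-\gamma}, \kappa=\frac{R_{\max}\gamma}{1-\gamma}, and L_{2}=\frac{2R_{\max}}{1-\gamma}\sqrt{\frac{\ln(2/\delta)}{2n}}; the function name and the numerical values of \Delta R, \Delta P, R_{\max}, \gamma, n, and \delta are illustrative assumptions, not quantities from the paper.

import math

def q_gap_bound(delta_R, delta_P, R_max, gamma, n, delta=0.05):
    """Evaluate the right-hand side of Eq. (25):
    L*(delta_R + kappa*delta_P) + L2, with
    L = 1/(1-gamma), kappa = R_max*gamma/(1-gamma),
    L2 = (2*R_max/(1-gamma)) * sqrt(ln(2/delta) / (2*n))."""
    L = 1.0 / (1.0 - gamma)
    kappa = R_max * gamma / (1.0 - gamma)
    L2 = (2.0 * R_max / (1.0 - gamma)) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return L * (delta_R + kappa * delta_P) + L2

# Illustrative numbers: closer tasks and more simulations tighten the bound.
print(q_gap_bound(delta_R=0.1, delta_P=0.05, R_max=1.0, gamma=0.9, n=1_000))
print(q_gap_bound(delta_R=0.1, delta_P=0.05, R_max=1.0, gamma=0.9, n=100_000))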

A.2 Proof of Theorem 3.4

Proof.

Proof of Theorem 3.4. First, we consider the case of a single MDP and assume that we have a "universal" upper bound U(s,a)\geq Q_{M}^{*}(s,a).

Lemma A.3.

Since U(s,a)\geq Q_{M}^{*}(s,a) holds for all (s,a), and initially Q(s,a)\leq U(s,a), every update maintains Q(s,a)\leq U(s,a) and Q(s,a)\geq(\text{a non-negative expected estimate}).

These two points hold because we update using Q(s,a)=\min\{\hat{Q}(s,a),U(s,a)\} and U(s,a)\geq Q^{*}(s,a). Throughout sampling, if \hat{Q}(s,a) significantly overestimates Q^{*}(s,a), it is truncated by U(s,a), which ensures Q(s,a)\leq U(s,a); once \hat{Q}(s,a) approaches Q^{*}(s,a), it is no longer truncated. The truncation therefore does not hinder the convergence of Q to Q^{*}.

Theorem A.4 (Convergence in a Single MDP).

If every state s and each of its available actions a receive infinitely many samples (i.e., every branch of the MCTS search tree is "continuously" expanded), then the Q(s,a) generated by the above update rule converges almost surely to Q_{M}^{*}(s,a).

Now we aim to show that after completing prior MDPs (tasks) \bar{M}_{1},\bar{M}_{2},\dots,\bar{M}_{m} and then switching to a new MDP M, the algorithm achieves faster convergence.

First, we analyze the classic scenario without transferred upper bounds. In a finite state-action space, the goal is to achieve, with probability at least 1-\delta:

\displaystyle\max_{(s,a)\in\mathcal{S}\times\mathcal{A}}|Q_{n}(s,a)-Q_{M}^{*}(s,a)|\leq\epsilon (26)

Standard UCT/UCB theory typically provides a sample complexity of \tilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\epsilon^{2}}\ln\frac{1}{\delta}\right). To prove the theorem, we only need to analyze the acceleration factor \Gamma by comparing the sampling complexity of our aUCT with that of standard UCT.

More specifically, examining each individual (s,a), the analysis resembles that of multi-armed bandits: each "suboptimal" (s,a) requires approximately \tilde{O}\left(\frac{1}{(\Delta^{M}_{(s,a)})^{2}}\ln\frac{1}{\delta}\right) samples, where \Delta^{M}_{(s,a)}=Q_{M}^{*}(s,a^{*})-Q_{M}^{*}(s,a) is the value gap between action a and the optimal action a^{*}. Summing the exploration costs over all state-action pairs gives a total of order \sum_{(s,a)}\frac{1}{(\Delta^{M}_{(s,a)})^{2}}.

Now we turn to the case with transferred upper bounds and analyze how they reduce the number of samples required on a new MDP.

To quantify this acceleration, we divide the state-action pairs (s,a) into two groups:

  • \mathcal{S}_{1}: pairs whose upper bounds are sufficiently tight, so that the truncated value lies below the value of the optimal action from the very beginning:

    \displaystyle\mathcal{S}_{1}=\left\{(s,a)\,|\,\exists i:U_{\bar{M}_{i}}(s,a)<Q_{M}^{0}(s,a)\right\} (27)
  • \mathcal{S}_{0}: pairs whose upper bounds are not "tight enough," i.e.,

    \displaystyle\mathcal{S}_{0}=\text{the remaining state-action pairs} (28)

For (s,a)\in\mathcal{S}_{1}:

We treat the sampling at a node as a multi-armed bandit. Let the true mean of the optimal arm be \mu^{*}. For an arm j in \mathcal{S}_{1}, its true mean is known to satisfy \mu_{j}\leq U_{j}<\mu^{*}.

Even if we truncate \hat{\mu}_{n}(j) at U_{j}, the UCB algorithm’s "optimistic estimate" for this arm at step n is:

\displaystyle Q_{n}(j)=\min\left\{\hat{\mu}_{n}(j),U_{j}\right\}+c\sqrt{\frac{\ln(n)}{N_{j}(n)}} (29)

Arm j is no longer selected once this optimistic estimate drops below the optimal mean, i.e., once

\displaystyle U_{j}+c\sqrt{\frac{\ln(n)}{N_{j}(n)}}<\mu^{*} (30)

Let \Delta=\mu^{*}-U_{j}. Condition (30) holds as long as:

\displaystyle\sqrt{\frac{\ln(n)}{N_{j}(n)}}\leq\frac{\Delta}{2c} (31)

since then Q_{n}(j) cannot exceed \mu^{*}-\Delta/2. Condition (31) is equivalent to

\displaystyle N_{j}(n)\geq\frac{4c^{2}\ln(n)}{\Delta^{2}} (32)

Hence each arm in \mathcal{S}_{1} is sampled at most \tilde{O}(\ln n) times.

Pairs (s,a)\in\mathcal{S}_{0} cannot be pruned by truncation. As in classic UCT, each such pair still requires approximately O\left(\frac{1}{(\Delta^{M}_{(s,a)})^{2}}\ln\frac{1}{\delta}\right) samples to determine that it is not as good as (s,a^{*}). Thus, the sampling complexity of our algorithm is:

\displaystyle X_{\rm aUCT}=\sum_{(s,a)\in\mathcal{S}_{1}}\tilde{O}\left(\ln n\right)+\sum_{(s,a)\in\mathcal{S}_{0}}\tilde{O}\left(\frac{1}{(\Delta^{M}_{(s,a)})^{2}}\ln\frac{1}{\delta}\right), (33)

Using the fact that \tilde{O}(\ln n)\sim\tilde{O}(\ln\frac{1}{\delta}), we can rewrite this as

\displaystyle X_{\rm aUCT}=\sum_{(s,a)\in\mathcal{S}_{1}}\tilde{O}\left(\ln\frac{1}{\delta}\right)+\sum_{(s,a)\in\mathcal{S}_{0}}\tilde{O}\left(\frac{1}{(\Delta^{M}_{(s,a)})^{2}}\ln\frac{1}{\delta}\right). (34)

In contrast, the sampling complexity of the standard UCT can be obtained using the same analysis, i.e.,

\displaystyle X_{\rm UCT}=\sum_{(s,a)\in\mathcal{S}_{0}\cup\mathcal{S}_{1}}\tilde{O}\left(\frac{1}{(\Delta^{M}_{(s,a)})^{2}}\ln\frac{1}{\delta}\right). (35)

Comparing the order bounds in Equation (35) and Equation (34), we obtain the acceleration factor \Gamma as

\displaystyle\Gamma=\frac{\sum_{(s,a)\in\mathcal{S}_{1}\cup\mathcal{S}_{0}}\frac{1}{(\Delta^{\mathcal{M}}_{(s,a)})^{2}}}{\sum_{(s,a)\in\mathcal{S}_{1}}(1)+\sum_{(s,a)\in\mathcal{S}_{0}}\frac{1}{(\Delta^{\mathcal{M}}_{(s,a)})^{2}}}, (36)

which is the desired result in the theorem.
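As an illustration of Equation (36), the following Python sketch computes \Gamma from a hypothetical set of value gaps \Delta^{\mathcal{M}}_{(s,a)} for the suboptimal pairs, together with an assumed indicator of membership in \mathcal{S}_{1}; all names and numbers are illustrative.

import numpy as np

def acceleration_factor(gaps, in_S1):
    """Compute the acceleration factor of Eq. (36).
    gaps:  value gaps Delta_{(s,a)} > 0 of the suboptimal pairs.
    in_S1: boolean mask, True if a transferred upper bound already
           truncates that pair below the optimal action (set S_1)."""
    gaps = np.asarray(gaps, dtype=float)
    in_S1 = np.asarray(in_S1, dtype=bool)
    inv_sq = 1.0 / gaps**2
    numerator = inv_sq.sum()                          # classic UCT cost over S_0 ∪ S_1
    denominator = in_S1.sum() + inv_sq[~in_S1].sum()  # aUCT: each S_1 pair contributes one unit, per Eq. (36)
    return numerator / denominator

# Illustrative example: small-gap (hard-to-distinguish) pairs that fall in S_1
# contribute the largest savings, so Gamma grows well above 1.
gaps = [0.05, 0.1, 0.2, 0.4]
print(acceleration_factor(gaps, in_S1=[True, True, False, False]))
print(acceleration_factor(gaps, in_S1=[False, False, False, False]))  # no transfer -> Gamma = 1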

A.3 Proof of Theorem 4.1

Proof.

Proof of Theorem 4.1. First, we establish unbiasedness and boundedness. For unbiasedness, we have:

\displaystyle\mathbb{E}[X_{i}]=\mathbb{E}_{(s,a)\sim\pi}[\frac{\mathcal{U}(s,a)}{\pi(s,a)}\cdot\Delta X(s,a)]=\mathbb{E}_{(s,a)\sim\mathcal{U}}[\Delta X(s,a)]=d(M,M^{\prime}) (37)

Therefore, \mathbb{E}[\hat{d}_{\mathcal{U}}]=d(M,M^{\prime}), i.e., \hat{d}_{\mathcal{U}} is an unbiased estimator. For boundedness, the importance weights satisfy:

\displaystyle w_{i}=\frac{\mathcal{U}(s_{i},a_{i})}{\pi(s_{i},a_{i})}\leq\frac{\mathcal{U}_{\max}}{\alpha} (38)

where \mathcal{U}_{\max}=\max_{(s,a)}\mathcal{U}(s,a)=\frac{1}{|\mathcal{S}|\cdot|\mathcal{A}|} and \alpha is a lower bound on \pi(s,a). Hence:

\displaystyle X_{i}=w_{i}\Delta X(s_{i},a_{i})\leq(\frac{\mathcal{U}_{\max}}{\alpha})b (39)

Thus X_{i}\in[0,C], where C=\frac{\mathcal{U}_{\max}}{\alpha}b.

Based on the above analysis, we have \bar{X}_{N}=\frac{1}{N}\sum_{i=1}^{N}X_{i}=\hat{d}_{\mathcal{U}} and \mu=\mathbb{E}[X_{i}]=d(M,M^{\prime}). By Hoeffding’s inequality for bounded variables X_{i}\in[0,C], we have:

\displaystyle\text{Pr}\{|\bar{X}_{N}-\mu|\geq\epsilon\}\leq 2\exp(-\frac{2N\epsilon^{2}}{C^{2}}) (40)

To bound the failure probability by \delta, we require:

\displaystyle 2\exp(-\frac{2N\epsilon^{2}}{C^{2}})\leq\delta\Leftrightarrow\exp(-\frac{2N\epsilon^{2}}{C^{2}})\leq\frac{\delta}{2}\Leftrightarrow-\frac{2N\epsilon^{2}}{C^{2}}\leq\ln\frac{\delta}{2}\Leftrightarrow\frac{2N\epsilon^{2}}{C^{2}}\geq\ln\frac{2}{\delta}\Leftrightarrow N\geq\frac{C^{2}}{2\epsilon^{2}}\ln\frac{2}{\delta} (41)

This is fulfilled if:

\displaystyle N\geq\frac{1}{2\epsilon^{2}}(\frac{\mathcal{U}_{\max}}{\alpha}b)^{2}\ln\frac{2}{\delta} (42)

in which case we obtain the high-probability error bound:

\displaystyle\text{Pr}\{|\hat{d}_{\mathcal{U}}-d(M,M^{\prime})|\leq\epsilon\}\geq 1-\delta (43)
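A minimal Python sketch of the estimator of Equation (37) and the sample-size rule of Equation (42) is given below; the toy values of \epsilon, \delta, \mathcal{U}_{\max}, \alpha, and b, as well as the helper names, are illustrative assumptions.

import math

def required_samples(eps, delta, U_max, alpha, b):
    """Sample size from Eq. (42): N >= (1/(2 eps^2)) * (U_max * b / alpha)^2 * ln(2/delta)."""
    C = (U_max / alpha) * b
    return math.ceil((C**2) / (2.0 * eps**2) * math.log(2.0 / delta))

def estimate_distance(samples, pi_prob, U_prob):
    """Unbiased importance-sampling estimate of d(M, M') as in Eq. (37).
    samples:  list of tuples ((s, a), delta_X) with (s, a) drawn from pi,
              where delta_X is the measured |Delta R| + kappa * Delta P on that pair.
    pi_prob:  function (s, a) -> pi(s, a), assumed >= alpha > 0.
    U_prob:   function (s, a) -> U(s, a), uniform: 1 / (|S| * |A|)."""
    weighted = [U_prob(s, a) / pi_prob(s, a) * dx for (s, a), dx in samples]
    return sum(weighted) / len(weighted)

# Illustrative usage with toy constants and two samples.
print(required_samples(eps=0.01, delta=0.05, U_max=1 / 20.0, alpha=0.02, b=1.0))
toy = [((0, 0), 0.12), ((1, 1), 0.08)]
print(estimate_distance(toy, pi_prob=lambda s, a: 0.5, U_prob=lambda s, a: 0.25))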

A.4 Proof of Theorem 4.2

Proof.

Proof of Theorem 4.2. We construct a martingale difference sequence. Let:

\displaystyle S_{n}:=\sum_{k=1}^{n}(X_{k}-d(M,M^{\prime})),\qquad Y_{k}:=X_{k}-\mathbb{E}[X_{k}|\mathcal{F}_{k-1}] (44)

According to the martingale condition in Equation (12), we have Y_{k}=X_{k}-d(M,M^{\prime}), and S_{n}=\sum_{k=1}^{n}Y_{k} satisfies \mathbb{E}[Y_{k}|\mathcal{F}_{k-1}]=0. Thus, \{S_{n},\mathcal{F}_{n}\} is a martingale.

Since \pi_{k}(s,a)\geq\alpha implies w_{k}\leq\frac{\mathcal{U}_{\max}}{\alpha}, and \Delta X(s,a)\leq b implies X_{k}=w_{k}\Delta X(s_{k},a_{k})\leq\frac{\mathcal{U}_{\max}}{\alpha}b=:C, we have:

\displaystyle|Y_{k}|\leq\max\{X_{k},d(M,M^{\prime})\}\leq C (45)

According to the Azuma-Hoeffding inequality for bounded martingale differences, we have:

\displaystyle\text{Pr}\{|S_{N}|\geq t\}\leq 2\exp(-\frac{t^{2}}{2NC^{2}}) (46)

Let t=N\epsilon; then |S_{N}|\geq t is equivalent to \left|\sum_{k=1}^{N}X_{k}-Nd(M,M^{\prime})\right|\geq N\epsilon, that is:

\displaystyle|\hat{d}_{\mathcal{U}}^{(N)}-d(M,M^{\prime})|\geq\epsilon (47)

So:

\displaystyle\text{Pr}\{|\hat{d}_{\mathcal{U}}^{(N)}-d(M,M^{\prime})|\geq\epsilon\}\leq 2\exp(-\frac{N\epsilon^{2}}{2C^{2}}) (48)

Thus, as long as N\geq\frac{2C^{2}}{\epsilon^{2}}\ln\frac{2}{\delta}, we have \text{Pr}\{|\hat{d}_{\mathcal{U}}^{(N)}-d(M,M^{\prime})|\geq\epsilon\}\leq\delta. ∎
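The contrast with Theorem 4.1 can be made concrete with the short Python sketch below, which simply evaluates the two sample-size conditions (Equation (42) and the Azuma-based condition above) under assumed constants; the function names and numbers are illustrative.

import math

def required_samples_iid(eps, delta, C):
    """Theorem 4.1 (Eq. (42)): N >= C^2/(2*eps^2) * ln(2/delta) for i.i.d. sampling."""
    return math.ceil(C**2 / (2.0 * eps**2) * math.log(2.0 / delta))

def required_samples_adaptive(eps, delta, C):
    """Theorem 4.2 (Azuma-Hoeffding): N >= 2*C^2/eps^2 * ln(2/delta) when the
    sampling policy adapts over time (martingale differences)."""
    return math.ceil(2.0 * C**2 / eps**2 * math.log(2.0 / delta))

C = (1 / 20.0) / 0.02 * 1.0   # C = (U_max / alpha) * b with the toy constants used above
print(required_samples_iid(eps=0.01, delta=0.05, C=C),
      required_samples_adaptive(eps=0.01, delta=0.05, C=C))

Under these two bounds, adaptive sampling requires four times as many samples as the i.i.d. case for the same (\epsilon,\delta).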

A.5 Proof of Theorem 4.3

Proof.

Proof of Theorem 4.3. We decompose d_{\mathcal{U}} as follows.

d_{\mathcal{U}}(M,M_{i}) =\mathbb{E}_{(s,a)\sim\mathcal{U}}\Bigl[\underbrace{|R_{s}^{a}-R_{s}^{a,(i)}|}_{\text{reward difference}}+\underbrace{\kappa\sum_{s^{\prime}}|P_{ss^{\prime}}^{a}-P_{ss^{\prime}}^{a,(i)}|}_{\text{transition difference}}\Bigr] (49)
\simeq\mathbb{E}_{(s,a)\sim\mathcal{U}}\bigl[|R_{s}^{a}-R_{s}^{a,(i)}|+\kappa\,\|\Psi_{\phi}(s,a)-\Psi_{\phi_{i}}(s,a)\|_{1}\bigr]
\leq\mathbb{E}_{(s,a)\sim\mathcal{U}}\bigl[L_{3}\rho(\phi,\phi_{i})+\kappa L_{3}\rho(\phi,\phi_{i})\bigr]
\leq L_{3}\rho(\phi,\phi_{i})+\kappa L_{3}\rho(\phi,\phi_{i})
=(1+\kappa)L_{3}\rho(\phi,\phi_{i})
=(1+\kappa)L_{3}\hat{d}_{para}(M,M_{i})
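A minimal Python sketch of this parametric surrogate is shown below, taking \rho to be the \ell_{2} distance between flattened model parameters (as suggested in Algorithm 3) and treating L_{3} and \kappa as known constants; the weight shapes and numerical values are illustrative assumptions.

import numpy as np

def parametric_distance(phi, phi_i):
    """d_hat_para(M, M_i) = rho(phi, phi_i): l2 distance between the flattened
    parameter vectors of the two environment models."""
    diffs = [np.ravel(phi[k] - phi_i[k]) for k in phi]
    return float(np.linalg.norm(np.concatenate(diffs)))

def distance_upper_bound(phi, phi_i, L3, kappa):
    """Upper bound on d_U(M, M_i) from Eq. (49): (1 + kappa) * L3 * rho(phi, phi_i)."""
    return (1.0 + kappa) * L3 * parametric_distance(phi, phi_i)

# Illustrative example with random "network weights"; L3 and kappa are assumed constants.
rng = np.random.default_rng(0)
phi   = {"W1": rng.normal(size=(8, 4)), "b1": rng.normal(size=8)}
phi_i = {k: v + 0.01 * rng.normal(size=v.shape) for k, v in phi.items()}
print(distance_upper_bound(phi, phi_i, L3=2.0, kappa=4.5))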

Appendix B Pseudo-code

Algorithm 1 UMCTS
0:  {1,,M},𝒰,κ,L,L2(i),γ,Rmax,C,T\{\mathcal{M}_{1},\dots,\mathcal{M}_{M}\},\mathcal{U},\kappa,L,L_{2}^{(i)},\gamma,R_{\max},C,T
1:  for i=1i=1 to MM do
2:     Repeat sampling (s,a)(s,a) from the uniform distribution 𝒰\mathcal{U} to update RR and PP.
3:     for j=1j=1 to MM do
4:        d(i,j)𝔼(s,a,s)𝒰[|RsaR¯sa|+κ|PssaP¯ssa|]d(\mathcal{M}_{i},\mathcal{M}_{j})\leftarrow\mathbb{E}_{(s,a,s^{\prime})\sim\mathcal{U}}\bigl{[}\,|R_{s}^{a}-\overline{R}_{s}^{a}|+\kappa\,|P_{ss^{\prime}}^{a}-\overline{P}_{ss^{\prime}}^{a}|\bigr{]}
5:     end for
6:     Initialize root node s0s_{0}, set N(),N(,),W(,)N(\cdot),N(\cdot,\cdot),W(\cdot,\cdot) to 0
7:     for t=1t=1 to TT do
8:        Selection:
9:         Set current node ss0s\leftarrow s_{0}
10:        while child nodes of ss are fully expanded do
11:           Choose a=argmax𝑎(Q(s,a))a=\underset{a}{\mathrm{argmax}}\;\bigl{(}Q(s,a)\bigr{)}// using Eq. (*) below
12:           schild node after action as\leftarrow\text{child node after action $a$}
13:        end while
14:        Expansion:
15:         Expand one non-visited action anewa_{\mathrm{new}} at ss, sample ss^{\prime} from environment or model
16:         Create new child node ss^{\prime}, set N(s,)=0N(s^{\prime},\cdot)=0, W(s,)=0W(s^{\prime},\cdot)=0
17:        Simulation:
18:         Perform a (light) rollout or default policy from ss^{\prime} to terminal or horizon
19:         Receive cumulative reward GG
20:        Backpropagation:
21:         Traverse back from ss^{\prime} to s0s_{0} along visited path
22:        for all visited state-action pairs (s~,a~)(\tilde{s},\tilde{a}) do
23:           N(s~)N(s~)+1N(\tilde{s})\,\leftarrow\,N(\tilde{s})+1
24:           N(s~,a~)N(s~,a~)+1N(\tilde{s},\tilde{a})\,\leftarrow\,N(\tilde{s},\tilde{a})+1
25:           W(s~,a~)W(s~,a~)+GW(\tilde{s},\tilde{a})\,\leftarrow\,W(\tilde{s},\tilde{a})+G
26:           // Update Q(s~,a~)Q(\tilde{s},\tilde{a}) with UMCTS bound:
27:           U¯(s~,a~)Q¯(s~,a~)+Ld(,¯)+L2(i)U_{\bar{\mathcal{M}}}(\tilde{s},\tilde{a})\leftarrow Q_{\bar{\mathcal{M}}}^{*}(\tilde{s},\tilde{a})+L\cdot d(\mathcal{M},\bar{\mathcal{M}})+L_{2}^{(i)}
28:           U(s~,a~)min{Rmax1γ,U¯(s~,a~),}U(\tilde{s},\tilde{a})\leftarrow\min\bigl{\{}\frac{R_{\max}}{1-\gamma}\,,\,U_{\bar{\mathcal{M}}}(\tilde{s},\tilde{a}),\,\dots\bigr{\}}
29:           Q(s~,a~)min{W(s~,a~)N(s~,a~)+ClnN(s~)N(s~,a~),U(s~,a~)}()Q(\tilde{s},\tilde{a})\leftarrow\min\!\Bigl{\{}\dfrac{W(\tilde{s},\tilde{a})}{N(\tilde{s},\tilde{a})}+C\,\sqrt{\dfrac{\ln N(\tilde{s})}{N(\tilde{s},\tilde{a})}},\;U(\tilde{s},\tilde{a})\Bigr{\}}\quad(*)
30:        end for
31:     end for
32:  end for
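To make the update rule (*) concrete, the following Python sketch evaluates lines 27-29 of Algorithm 1 for a single state-action pair; the source-task Q-values, the distances d(\mathcal{M},\bar{\mathcal{M}}), and the constants L, L_{2}, C are illustrative assumptions, not values produced by the algorithm.

import math

def transferred_upper_bound(Q_bar_star, dist, L, L2, R_max, gamma):
    """Lines 27-28 of Algorithm 1: U(s,a) = min{ R_max/(1-gamma),
    Q*_{Mbar_i}(s,a) + L * d(M, Mbar_i) + L2 } over the stored source tasks.
    Q_bar_star: dict source-task id -> Q*_{Mbar_i}(s, a)
    dist:       dict source-task id -> d(M, Mbar_i)."""
    vmax = R_max / (1.0 - gamma)
    bounds = [Q_bar_star[i] + L * dist[i] + L2 for i in Q_bar_star]
    return min([vmax] + bounds)

def umcts_q_update(W_sa, N_sa, N_s, C, U_sa):
    """Update rule (*): truncate the UCB score at the transferred upper bound."""
    ucb = W_sa / N_sa + C * math.sqrt(math.log(N_s) / N_sa)
    return min(ucb, U_sa)

# Illustrative example: a tight transferred bound caps an over-optimistic UCB score.
U = transferred_upper_bound({"task1": 0.6, "task2": 0.8},
                            dist={"task1": 0.05, "task2": 0.3},
                            L=1.0, L2=0.02, R_max=1.0, gamma=0.9)
print(U, umcts_q_update(W_sa=9.0, N_sa=10, N_s=200, C=1.4, U_sa=U))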
Algorithm 2 UMCTS with Importance Sampling
0:  Tasks {1,,M}\{\mathcal{M}_{1},\dots,\mathcal{M}_{M}\}, each partially known; Uniform distribution 𝒰(s,a)\mathcal{U}(s,a);Lipschitz constants L,L2(i)L,L_{2}^{(i)}; Discount factor γ\gamma, maximum reward RmaxR_{\max}; Exploration constant CC; Number of search iterations TT;A (default) policy π\pi used in Simulation for importance sampling;
1:  Function Distance(,¯,π\mathcal{M},\bar{\mathcal{M}},\pi)
2:      ΔX(s,a)ΔRsa+κΔPsa\displaystyle\Delta X(s,a)\;\triangleq\;\Delta R_{s}^{a}\;+\;\kappa\,\Delta P_{s}^{a}
3:  return 𝔼(s,a)π[𝒰(s,a)π(s,a)ΔX(s,a)]\displaystyle\mathbb{E}_{(s,a)\sim\pi}\Bigl{[}\frac{\mathcal{U}(s,a)}{\pi(s,a)}\cdot\Delta X(s,a)\Bigr{]}
4:  // For each task i\mathcal{M}_{i}
5:  for i=1i=1 to MM do
6:     Initialize root node s0s_{0}, set N()=0,N(,)=0,W(,)=0N(\cdot)=0,\,N(\cdot,\cdot)=0,\,W(\cdot,\cdot)=0
7:     (Optionally maintain a buffer 𝒟i\mathcal{D}_{i} for storing samples (s,a))
8:     for t=1t=1 to TT do
9:        Selection:
10:         ss0s\;\leftarrow\;s_{0}
11:        while all actions from ss are fully expanded and ss not terminal do
12:           aargmax𝑎(Q(s,a))a\;\leftarrow\;\underset{a}{\mathrm{argmax}}\;\bigl{(}Q(s,a)\bigr{)}  // UCB or UMCTS criterion
13:           schild node after action as\;\leftarrow\;\text{child node after action }a
14:        end while
15:        Expansion:
16:        if ss not terminal then
17:           Choose one unvisited action anewa_{\mathrm{new}} at ss
18:           Sample next state sPi(s,anew)s^{\prime}\sim P_{i}(\cdot\mid s,a_{\mathrm{new}}) // from environment or model
19:           Create child node ss^{\prime}, set N(s,)=0,W(s,)=0N(s^{\prime},\cdot)=0,\,W(s^{\prime},\cdot)=0
20:        end if
21:        Simulation:
22:         Initialize cumulative reward G0G\leftarrow 0
23:         ssimss_{\mathrm{sim}}\leftarrow s^{\prime}
24:        while ssims_{\mathrm{sim}} is not terminal do
25:           Pick action asima_{\mathrm{sim}} by policy π(ssim)\pi(\cdot\mid s_{\mathrm{sim}})
26:           Observe reward rsim=Ri(ssim,asim)r_{\mathrm{sim}}=R_{i}(s_{\mathrm{sim}},a_{\mathrm{sim}})
27:           Observe next state snextPi(ssim,asim)s_{\mathrm{next}}\sim P_{i}(\cdot\mid s_{\mathrm{sim}},a_{\mathrm{sim}})
28:           GG+rsimG\leftarrow G+r_{\mathrm{sim}}
29:           // Update or record increments for Rsa,Ps,saR_{s}^{a},\,P_{s,s^{\prime}}^{a}
30:            ΔRssima\Delta R_{s_{\mathrm{sim}}}^{a}, ΔPssima\Delta P_{s_{\mathrm{sim}}}^{a} \leftarrow (computed from new sample)
31:           // Optionally store (ssim,asim)(s_{\mathrm{sim}},a_{\mathrm{sim}}) in 𝒟i\mathcal{D}_{i} for importance sampling
32:           ssimsnexts_{\mathrm{sim}}\leftarrow s_{\mathrm{next}}
33:        end while
34:        Backpropagation:
35:         Traverse from ss^{\prime} back to s0s_{0} along visited path
36:        for all visited pairs (s~,a~)(\tilde{s},\tilde{a}) do
37:           N(s~)N(s~)+1N(\tilde{s})\;\leftarrow\;N(\tilde{s})+1
38:           N(s~,a~)N(s~,a~)+1N(\tilde{s},\tilde{a})\;\leftarrow\;N(\tilde{s},\tilde{a})+1
39:           W(s~,a~)W(s~,a~)+GW(\tilde{s},\tilde{a})\;\leftarrow\;W(\tilde{s},\tilde{a})+G
40:           /* Use the Lipschitz bound with distance estimation */
41:           d(i,¯)Distance(i,¯,π)d(\mathcal{M}_{i},\bar{\mathcal{M}})\;\leftarrow\;\mathrm{Distance}\bigl{(}\mathcal{M}_{i},\bar{\mathcal{M}},\pi\bigr{)}
42:           U¯(s~,a~)Q¯(s~,a~)+Ld(i,¯)+L2(i)U_{\bar{\mathcal{M}}}(\tilde{s},\tilde{a})\;\leftarrow\;Q_{\bar{\mathcal{M}}}^{*}(\tilde{s},\tilde{a})\;+\;L\cdot d(\mathcal{M}_{i},\bar{\mathcal{M}})\;+\;L_{2}^{(i)}
43:           U(s~,a~)min{Rmax1γ,U¯(s~,a~),}U(\tilde{s},\tilde{a})\;\leftarrow\;\min\Bigl{\{}\dfrac{R_{\max}}{1-\gamma},\,U_{\bar{\mathcal{M}}}(\tilde{s},\tilde{a}),\dots\Bigr{\}}
44:           /* UMCTS update rule */
45:           Q(s~,a~)min{W(s~,a~)N(s~,a~)+ClnN(s~)N(s~,a~),U(s~,a~)}()Q(\tilde{s},\tilde{a})\;\leftarrow\;\min\Bigl{\{}\dfrac{W(\tilde{s},\tilde{a})}{N(\tilde{s},\tilde{a})}+C\,\sqrt{\dfrac{\ln N(\tilde{s})}{N(\tilde{s},\tilde{a})}},\;U(\tilde{s},\tilde{a})\Bigr{\}}\quad(*)
46:        end for
47:     end for
48:  end for
Algorithm 3 UMCTS with Neural Network Environment Model
0:  MDPs {1,,M}\{\mathcal{M}_{1},\dots,\mathcal{M}_{M}\}, each with trained neural network parameters {ϕ1,,ϕM}\{\phi_{1},\dots,\phi_{M}\}; A new MDP MM (partially known), with neural network Ψϕ:𝒮×𝒜Δ(𝒮)\Psi_{\phi}:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S}); A distance function ρ(ϕ,ϕi)0\rho(\phi,\phi_{i})\geq 0 on parameter space (e.g., 2\ell_{2}-norm); Define d^para(M,Mi)=ρ(ϕ,ϕi)\hat{d}_{para}(M,M_{i})=\rho(\phi,\phi_{i}); Lipschitz constants L,L2(i)L,L_{2}^{(i)}, discount factor γ\gamma, RmaxR_{\max}, exploration constant CC, iterations TT; A default (simulation) policy π\pi for rollouts
1:  // For each task MM (with parameter ϕ\phi) run UMCTS
2:  Initialize root node s0s_{0}, counters N()=0,N(,)=0,W(,)=0N(\cdot)=0,\,N(\cdot,\cdot)=0,\,W(\cdot,\cdot)=0
3:  for t=1t=1 to TT do
4:     Selection:
5:      ss0s\leftarrow s_{0}
6:     while all actions from s are expanded and s not terminal do
7:        aargmax𝑎(Q(s,a))a\;\leftarrow\;\underset{a}{\mathrm{argmax}}\;\bigl{(}Q(s,a)\bigr{)}
8:        schild node after action as\;\leftarrow\;\text{child node after action }a
9:     end while
10:     Expansion:
11:     if ss not terminal then
12:        choose an unvisited action anewa_{\mathrm{new}}
13:        sample sΨϕ(s,anew)s^{\prime}\sim\Psi_{\phi}(\cdot\mid s,a_{\mathrm{new}})// neural net predicts next state distribution
14:        create child node s’
15:        N(s,)0,W(s,)0N(s^{\prime},\cdot)\leftarrow 0,\;W(s^{\prime},\cdot)\leftarrow 0
16:     end if
17:     Simulation:
18:      G0G\leftarrow 0
19:      ssimss_{\mathrm{sim}}\leftarrow s^{\prime}
20:     while ssims_{\mathrm{sim}} not terminal do
21:        asimsample from π(ssim)a_{\mathrm{sim}}\leftarrow\text{sample from }\pi(\cdot\mid s_{\mathrm{sim}})
22:        // observe reward (possibly from real env or approximated by a learned reward model)
23:        rsim=R(ssim,asim)r_{\mathrm{sim}}=R(s_{\mathrm{sim}},a_{\mathrm{sim}})
24:        snextΨϕ(ssim,asim)s_{\mathrm{next}}\sim\Psi_{\phi}(\cdot\mid s_{\mathrm{sim}},a_{\mathrm{sim}})
25:        GG+rsimG\;\leftarrow\;G+r_{\mathrm{sim}}
26:        /* update ϕ\phi via gradient (e.g. supervised/unsupervised RL objective) */
27:         ϕϕηϕ(ϕ;(ssim,asim,snext))\phi\;\leftarrow\;\phi-\eta\,\nabla_{\phi}\mathcal{L}\bigl{(}\phi;(s_{\mathrm{sim}},a_{\mathrm{sim}},s_{\mathrm{next}})\bigr{)}
28:        ssimsnexts_{\mathrm{sim}}\;\leftarrow\;s_{\mathrm{next}}
29:     end while
30:     Backpropagation:
31:      traverse from ss^{\prime} back to s0s_{0}
32:     for all visited state-action pairs (s~,a~)(\tilde{s},\tilde{a}) do
33:        N(s~)N(s~)+1N(\tilde{s})\;\leftarrow\;N(\tilde{s})+1
34:        N(s~,a~)N(s~,a~)+1N(\tilde{s},\tilde{a})\;\leftarrow\;N(\tilde{s},\tilde{a})+1
35:        W(s~,a~)W(s~,a~)+GW(\tilde{s},\tilde{a})\;\leftarrow\;W(\tilde{s},\tilde{a})+G
36:        // parametric distance to previously trained model ϕi\phi_{i}
37:        d^para(M,Mi)ρ(ϕ,ϕi)\hat{d}_{para}(M,M_{i})\;\triangleq\;\rho(\phi,\phi_{i})
38:        // Lipschitz-based upper bound
39:        U¯(s~,a~)Q¯(s~,a~)+Ld^para(M,¯)+L2(i)U_{\bar{\mathcal{M}}}(\tilde{s},\tilde{a})\;\leftarrow\;Q_{\bar{\mathcal{M}}}^{*}(\tilde{s},\tilde{a})\;+\;L\cdot\hat{d}_{para}(M,\bar{\mathcal{M}})\;+\;L_{2}^{(i)}
40:        U(s~,a~)min{Rmax1γ,U¯(s~,a~),}U(\tilde{s},\tilde{a})\;\leftarrow\;\min\Bigl{\{}\frac{R_{\max}}{1-\gamma},\,U_{\bar{\mathcal{M}}}(\tilde{s},\tilde{a}),\dots\Bigr{\}}
41:        // UMCTS update rule
42:        Q(s~,a~)min{W(s~,a~)N(s~,a~)+ClnN(s~)N(s~,a~),U(s~,a~)}()Q(\tilde{s},\tilde{a})\;\leftarrow\;\min\Bigl{\{}\dfrac{W(\tilde{s},\tilde{a})}{N(\tilde{s},\tilde{a})}+C\sqrt{\dfrac{\ln N(\tilde{s})}{N(\tilde{s},\tilde{a})}},\;U(\tilde{s},\tilde{a})\Bigr{\}}\quad(*)
43:     end for
44:  end for