Lipschitz Lifelong Monte Carlo Tree Search for Mastering Non-Stationary Tasks
Abstract
Monte Carlo Tree Search (MCTS) has proven highly effective in solving complex planning tasks by balancing exploration and exploitation using Upper Confidence Bound for Trees (UCT). However, existing work has not considered MCTS-based lifelong planning, where an agent faces a non-stationary series of tasks – e.g., with varying transition probabilities and rewards – that are drawn sequentially throughout the operational lifetime. This paper presents LiZero for Lipschitz lifelong planning using MCTS. We propose a novel concept of adaptive UCT (aUCT) to transfer knowledge from a source task to the exploration/exploitation of a new task, depending on both the Lipschitz continuity between tasks and the confidence of knowledge in Monte Carlo action sampling. We analyze LiZero's acceleration factor in terms of improved sampling efficiency and also develop efficient algorithms to compute aUCT in an online fashion by both data-driven and model-based approaches, whose sampling complexity and error bounds are also characterized. Experimental results show that LiZero significantly outperforms existing MCTS and lifelong learning baselines in terms of much faster convergence (34x) to optimal rewards. Our results highlight the potential of LiZero to advance decision-making and planning in dynamic real-world environments.
1 Introduction
Monte Carlo Tree Search (MCTS) has demonstrated state-of-the-art performance in solving many challenging planning tasks, from playing the game of Go Silver et al. (2016) and chess to logistic planning Silver et al. (2017). It performs look-ahead searches based on Monte Carlo sampling of the actions to balance efficient exploration and optimized exploitation in large search spaces. Recent efforts have focused on developing MCTS algorithms for real-world domains that require the elimination of certain standard assumptions. Examples include MuZero Schrittwieser et al. (2020a), which leverages the decoding of hidden states to avoid requiring knowledge of the game dynamics, and MAzero Liu et al. (2024), which performs multi-agent search through decentralized execution. However, existing work has not considered lifelong non-stationarity of task dynamics, which may manifest itself in many open-world domains, as the task environment can often vary over time or across scenarios. This calls for novel MCTS algorithms that can adapt in response, accumulating and exploiting knowledge throughout the learning process.
We consider MCTS-based lifelong planning under non-stationarity. An agent faces a series of changing planning tasks – e.g., with varying transition probabilities and rewards – which are drawn sequentially throughout the operational lifetime. Transferring knowledge from prior experience to continually adapt Monte Carlo sampling of the actions, and thus speed up searches in new tasks, is a key question in this setting. We note that although continual and lifelong planning has been studied in the reinforcement learning (RL) context, e.g., learning models of the non-stationary task environment Xie et al. (2020), identifying reusable skills Lu et al. (2020), or estimating Bayesian sampling posteriors Fu et al. (2022), such prior work is not applicable to MCTS. Monte Carlo action sampling in MCTS relies on Upper Confidence Tree (UCT) or polynomial Upper Confidence Tree (pUCT) Auger et al. (2013); Matsuzaki (2018) to balance exploration and exploitation in large search spaces. To the best of our knowledge, there has been no existing work analyzing the transfer of knowledge from past MCTS searches to new tasks, thus enabling adaptive UCT/pUCT rules in lifelong MCTS.
This paper proposes LiZero for Lipschitz lifelong planning using MCTS. We quantify a novel concept that the amount of knowledge transferable from a source task to the UCT/pUCT rule of a new task depends on both the similarity between the tasks and the confidence of the knowledge. More precisely, by defining a distance metric between two MDPs, we refine the concentration argument and derive a new adaptive UCT bound (denoted as aUCT in this paper) for lifelong MCTS. The aUCT is shown to consist of two components – relating to (i) the Lipschitz continuity between the two tasks and (ii) the confidence of knowledge due to the number of samples in Monte Carlo action sampling. Our results enable the development of a novel LiZero algorithm that makes use of prior experience to run an adaptive MCTS by simulating/traversing from the root node and selecting actions according to the aUCT rule, until reaching a leaf node. We also analyze aUCT's acceleration factor in terms of improved sampling efficiency due to cross-task transfer. It is shown that smaller task distance and higher confidence both lead to higher acceleration in aUCT.
To support practical deployment of LiZero in lifelong planning, we need efficient solutions to compute aUCT in an online fashion. To this end, we develop practical algorithms to estimate the various terms in aUCT, and especially the distance metric between two MDPs, from either available state-action samples using a data-driven approach or a parameterized distance using a model-based (deep learning) approach. We provide rigorous analysis of the sampling complexity of the data-driven approach, ensuring arbitrarily small error with high probability, by modeling the non-stationary policy update process as a filtration – i.e., an increasing sequence of σ-algebras. For the model-based approach, we obtain an upper bound using a parameterized distance between the neural network models. These results enable effective application of LiZero to open-world tasks. We evaluate LiZero on a series of learning tasks with varying transition probabilities and rewards. It is shown that LiZero significantly outperforms MCTS and lifelong RL baselines (e.g., Winands (2024); Kocsis and Szepesvári (2006); Cheng et al.; Schrittwieser et al. (2020b); Brafman and Tennenholtz (2002); Lecarpentier et al. (2021a)) in terms of faster convergence to higher optimal rewards. Utilizing the knowledge of only a few source tasks, LiZero achieves a 34x speedup with about 31% higher early reward in the first half of the learning process.
Our key contributions are as follows. First, we study theoretically the transfer of past experience in MCTS and develop a novel aUCT rule, depending on both Lipschitz continuity between tasks and the confidence of knowledge in Monte Carlo action sampling. It is proven to provide positive acceleration in MCTS due to cross-task transfer. Second, we develop LiZero for lifelong MCTS planning, with efficient methods for online estimation of aUCT and analytical error bounds. Finally, LiZero achieves significant speed-up over MCTS and lifelong RL baselines in lifelong planning.
2 Background
Monte Carlo Tree Search (MCTS) Kocsis and Szepesvári (2006); Silver et al. (2016); Schrittwieser et al. (2020a) is a heuristic search algorithm often applied to problems modeled as MDPs to handle exploration and exploitation dynamically. MCTS builds a search tree by exploring actions from the current state, simulating outcomes, and using those results to update estimated values for the selected actions. Normally, the problems solved by MCTS can be modeled using a Markov Decision Process (MDP) Even-Dar et al. (2004), which is formally defined as a tuple $M = (S, A, R, P)$, where $S$ is the state space, $A$ is the action space, $R(s,a)$ is the reward of taking action $a$ in state $s$, and $P(s'|s,a)$ is the transition probability matrix.
In the MCTS framework, the Upper Confidence Bound for Trees (UCT) Coulom (2006) and its variant, polynomial Upper Confidence Trees (pUCT) Matsuzaki (2018); Auger et al. (2013), are among the most commonly used selection strategies for balancing exploration and exploitation during node selection. Although these bounds are theoretically grounded and have achieved great empirical success, they are based on static environment assumptions and do not consider dynamic, non-stationary environments Pourshamsaei and Nobakhti (2024); Hernandez-Leal et al. (2017); Goldberg and Matarić (2003), where state transitions and reward distributions may change over time, thus requiring the transfer of past knowledge to the exploration/exploitation of new tasks. In this paper, we consider MCTS-based lifelong planning, where an agent faces a non-stationary series of tasks – e.g., with varying transition probabilities and rewards – requiring the development of new adaptive UCT bounds.
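For reference, the standard UCT selection rule and its pUCT variant can be written as follows; this is a standard recap (not a contribution of this paper), with $c$ denoting the exploration constant and $P(s,a)$ a learned action prior.

```latex
% Standard UCT selection (Kocsis & Szepesvari, 2006):
a^{*} = \arg\max_{a}\; \hat{Q}(s,a) + c\,\sqrt{\frac{\ln N(s)}{N(s,a)}}

% pUCT variant used in AlphaZero/MuZero-style planners,
% where P(s,a) is a prior over actions produced by a learned policy:
a^{*} = \arg\max_{a}\; \hat{Q}(s,a) + c\,P(s,a)\,\frac{\sqrt{N(s)}}{1 + N(s,a)}
```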
Lifelong reinforcement learning Lecarpentier et al. (2021b); Xie et al. (2020); Fu et al. (2022); Lu et al. (2020); Auger et al. (2013); Zou et al. (2024); Zhang et al. (2024a) is the process of learning a series of problems from unknown MDPs (Markov Decision Processes) online. Each time an MDP is sampled, it is treated as a separate RL (Reinforcement Learning) problem, where the agent can understand the environment and adjust its own policy Da Silva et al. (2018); Hawasly and Ramamoorthy (2013); Abel et al. (2018); Zhang et al. (2024b, c). The goal is for the agent to interact with the environment using a policy that achieves the maximum expected reward. We can reasonably believe that the knowledge gained in similar MDPs can be reused. We note that despite recent work in the continual and lifelong RL context, e.g., learning models of the non-stationary task environment Xie et al. (2020), identifying reusable skills Lu et al. (2020), or estimating Bayesian sampling posteriors Fu et al. (2022), these prior works are not applicable to MCTS in lifelong learning settings. This requires re-examining and deriving adaptive UCT bounds for knowledge transfer.
3 Our Proposed Solution
3.1 Deriving adaptive Upper Confidence Bound (aUCT)
To derive the proposed aUCT rule, we consider a set of past known MDPs $\{M_1, \ldots, M_n\}$ and their learned search policies $\{\pi_1, \ldots, \pi_n\}$. Let $S$ and $A$ be their state and action spaces, respectively (without loss of generality, we assume that the MDPs have the same state and action spaces; otherwise, we can consider the extended MDPs defined on the union of their state and action spaces). Let $N_i(s,a)$ be the visit count of MDP $M_i$ to state-action pair $(s,a)$, $R_i(s,a)$ denote its sampled return, and $\hat{Q}_i(s,a)$ be the learned estimate of the Q-value of MDP $M_i$. Our goal is to apply this knowledge toward learning a new MDP, denoted by $M_{new}$. To this end, we derive a new Lipschitz upper confidence bound for $M_{new}$, which utilizes and transfers the knowledge from the past MDPs $M_1, \ldots, M_n$, thus obtaining an improved Monte Carlo action sampling strategy that limits the tree search on $M_{new}$ to a smaller subset of sampled actions. We use $N_{new}(s,a)$ to denote the visit count of the new MDP to $(s,a)$, $R_{new}(s,a)$ to denote the sampled return, and thus $\hat{Q}_{new}(s,a)$ to denote its current Q-value estimate.
Our key idea in this paper is that an improved upper confidence bound for the new MDP can be obtained by (i) analyzing the Lipschitz continuity between the past and new MDPs with respect to the upper confidence bounds and (ii) taking into account the confidence and aleatory uncertainty of the learned Q-value estimates to determine to what extent the learned knowledge from each $M_i$ is pertinent. Intuitively, the more similar $M_i$ and $M_{new}$ are, and the more samples (and thus higher confidence) we have in the learned Q-value estimates, the less exploration we would need to perform for solving $M_{new}$ through MCTS. Our analysis will lead to an improved upper confidence bound that guides the MCTS on the new MDP over a much smaller subset of action samples, thus significantly improving the search performance. We start by introducing a definition of the distance between any two given MDPs, $M_1$ and $M_2$, with reward functions $R_1, R_2$ and state transitions $P_1, P_2$, respectively. We choose a positive scaling factor $\alpha$ to combine the distances with respect to transition probabilities and rewards. Proofs of all theorems and corollaries are presented in the appendix.
Definition 3.1.
Given two MDPs $M_1 = (S, A, R_1, P_1)$ and $M_2 = (S, A, R_2, P_2)$, and a distribution $\rho$ for sampling the state transitions, we define the pseudometric between the MDPs as:

$$D(M_1, M_2) \;=\; \mathbb{E}_{(s,a)\sim\rho}\Big[\, \big|R_1(s,a) - R_2(s,a)\big| \;+\; \alpha\,\big\|P_1(\cdot \mid s,a) - P_2(\cdot \mid s,a)\big\|_1 \Big].$$
This pseudometric $D(M_1, M_2)$ is our definition of the distance between two MDPs $M_1$ and $M_2$. We choose $\rho$ to be the uniform distribution for sampling the state transitions in this paper. In Section 4, we discuss practical algorithms to estimate the distance metric between two MDPs, from either available state-action samples using a data-driven approach or a parameterized distance using a model-based (deep learning) approach. The sampling complexity and error bounds are also analyzed.
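As a concrete illustration of Definition 3.1, the sketch below evaluates the pseudometric for small tabular MDPs under a uniform $\rho$; the array representation and the scaling factor `alpha` are illustrative assumptions rather than part of the formal definition.

```python
import numpy as np

def mdp_distance(R1, P1, R2, P2, alpha=1.0):
    """Pseudometric between two tabular MDPs under a uniform sampling distribution.

    R1, R2: reward arrays of shape [S, A]
    P1, P2: transition tensors of shape [S, A, S] (rows sum to 1)
    alpha:  scaling factor combining reward and transition differences
    """
    reward_gap = np.abs(R1 - R2)               # |R1(s,a) - R2(s,a)|
    trans_gap = np.abs(P1 - P2).sum(axis=2)    # L1 gap of next-state distributions
    per_pair = reward_gap + alpha * trans_gap  # d(s,a) for every state-action pair
    return float(per_pair.mean())              # expectation under uniform rho

# Toy usage: two random 5-state, 3-action MDPs.
rng = np.random.default_rng(0)
S, A = 5, 3
R1, R2 = rng.random((S, A)), rng.random((S, A))
P1 = rng.dirichlet(np.ones(S), size=(S, A))
P2 = rng.dirichlet(np.ones(S), size=(S, A))
print(mdp_distance(R1, P1, R2, P2, alpha=0.5))
```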
Next, we prove the main result of this paper and show that the upper confidence bounds of $M_1$ and $M_2$ are Lipschitz continuous with respect to the distance $D(M_1, M_2)$. We obtain a new upper confidence bound for $M_2$ by transferring the knowledge from the learned Q-value estimates of MDP $M_1$. Obviously, the bound also depends on the confidence of the learned Q-value estimates, relating to the visit counts $N_1(s,a)$ and $N_2(s,a)$.
Theorem 3.2 (Lipschitz aUCT Rule).
Consider two MDPs $M_1$ and $M_2$ with visit counts $N_1(s,a)$, $N_2(s,a)$ and corresponding estimated Q-values $\hat{Q}_1(s,a)$, $\hat{Q}_2(s,a)$, respectively. With probability at least $1-\delta$ for some positive $\delta$, we have
$$\big|\hat{Q}_1(s,a) - \hat{Q}_2(s,a)\big| \;\le\; L \cdot D(M_1, M_2) \;+\; \epsilon_c, \qquad (1)$$
where $L$ is a Lipschitz constant, $D(M_1, M_2)$ is the distance between the MDPs, and the confidence term $\epsilon_c$ is given by
$$\epsilon_c \;=\; O\!\left(\sqrt{\log(1/\delta)}\left(\frac{1}{\sqrt{N_1(s,a)}} + \frac{1}{\sqrt{N_2(s,a)}}\right)\right). \qquad (2)$$
In the theorem above, we show that the estimated Q-values of the two MDPs are bounded by two terms, i.e., a Lipschitz continuity term depending on the distance between the two environments and a confidence term depending on the number of samples used to estimate the Q-values. The Lipschitz continuity term measures how much the learned knowledge of the source MDP $M_1$ is pertinent to the new MDP $M_2$, while the confidence term quantifies the sampling bias arising from statistical uncertainty due to limited sampling in MCTS. We note that as the number of samples $N_2(s,a)$ goes to infinity, $\hat{Q}_2(s,a)$ in Theorem 3.2 approaches the true Q-value of the new MDP. Our theorem thus effectively provides an upper confidence bound for the true Q-value of the new MDP, based on knowledge transfer from the source MDP. We also note that as both visit counts go to infinity, the confidence term becomes zero. Our theorem then recovers the Lipschitz lifelong RL result Lecarpentier et al. (2021b) as a special case, with respect to the true Q-values of the two MDPs.
We apply Theorem 3.2 to MCTS-based lifelong planning with a non-stationary series of tasks $M_1, \ldots, M_n, M_{new}$. Our goal is to obtain an improved bound on the true Q-value of the new task based on knowledge transfer. To this end, we independently apply the knowledge from each past MDP $M_i$, for $i = 1, \ldots, n$, to the new MDP. By taking the minimum of these bounds (and letting the new task's visit counts go to infinity, so that the bound applies to its true Q-value), we obtain the tightest upper bound on the true Q-value of the new MDP, which is defined as our aUCT bound, as it adaptively transfers knowledge from past tasks to the new task in MCTS-based lifelong planning. The result is summarized in the following corollary.
Corollary 3.3 (aUCT bound in lifelong planning).
Given MDPs $M_1, \ldots, M_n$, the new MDP's true Q-value $Q^*_{new}(s,a)$ is bounded by $U_{aUCT}(s,a)$ with probability at least $1-\delta$. The aUCT bound is given by
$$U_{aUCT}(s,a) \;=\; \min_{1 \le i \le n} \Big\{\, \hat{Q}_i(s,a) \;+\; L \cdot D(M_i, M_{new}) \;+\; \epsilon_{c,i} \,\Big\}, \qquad (3)$$

where $\epsilon_{c,i}$ is the confidence term of Theorem 3.2 associated with source task $M_i$.
Obtaining this corollary is straightforward from Theorem 3.2, by letting the new task's visit counts go to infinity and taking the tightest bound over all knowledge transfers. In the context of MCTS-based lifelong planning, the more knowledge we have from solving past tasks, the more easily we can plan for a new task, as the aUCT bound is taken as the minimum over all past tasks. The confidence of past knowledge, i.e., the statistical uncertainty due to the sampling numbers $N_i(s,a)$, also affects the knowledge transfer to the new task.
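To make the corollary concrete, the sketch below computes such a transferred bound for a single state-action pair; the Hoeffding-style confidence term and the variable names are illustrative assumptions, standing in for the exact term $\epsilon_{c,i}$ of Theorem 3.2.

```python
import math

def auct_bound(q_srcs, n_srcs, dists, L, delta, r_max=1.0):
    """aUCT-style bound for one (s, a): min over source tasks of
    transferred Q-estimate + Lipschitz term + confidence term (sketch).

    q_srcs: Q-value estimates hat{Q}_i(s, a) from past MDPs
    n_srcs: visit counts N_i(s, a) from past MDPs
    dists:  estimated distances D(M_i, M_new)
    L:      Lipschitz constant; delta: confidence level
    r_max:  return scale used in the illustrative confidence term
    """
    bounds = []
    for q, n, d in zip(q_srcs, n_srcs, dists):
        conf = r_max * math.sqrt(2.0 * math.log(2.0 / delta) / max(n, 1))
        bounds.append(q + L * d + conf)   # transfer from one source task
    return min(bounds)                    # tightest bound over all sources

# A close, well-sampled source task dominates a distant, poorly-sampled one.
print(auct_bound(q_srcs=[0.8, 0.6], n_srcs=[200, 50],
                 dists=[0.05, 0.3], L=1.0, delta=0.05))
```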
3.2 Our Proposed LiZero Algorithm Using aUCT
We use the derived aUCT to design a highly efficient LiZero algorithm for MCTS-based lifelong planning. The LiZero algorithm transfers knowledge from past known tasks by computing the aUCT bound $U_{aUCT}(s,a)$ in Corollary 3.3. It requires an efficient estimate of the distance (as defined in Definition 3.1) between the source MDPs and the new (target) MDP. We present practical algorithms for such distance estimation in the next section, together with analysis of their sampling complexity and error bounds; here we first introduce the LiZero algorithm itself. We note that, during MCTS, direct exploration/search in the new task also produces new knowledge and leads to an improved UCT bound for $M_{new}$. Therefore, our proposed LiZero combines both knowledge transfer through $U_{aUCT}$ and knowledge from direct exploration/search in $M_{new}$.
The search in our proposed LiZero algorithm is divided into three stages, repeated for a certain number of simulations. First, each simulation starts from the internal root state and finishes when the simulation reaches a leaf node. Let $\hat{Q}(s,a)$ be the current Q-value estimate of the new MDP and $N(s)$ be the visit count to state $s$. For each simulated time-step, LiZero chooses an action by maximizing a combined upper confidence bound based on aUCT, i.e.,

$$a \;=\; \arg\max_{a}\; \min\left\{\, U_{aUCT}(s,a),\;\; \hat{Q}(s,a) + c\,\sqrt{\frac{\ln N(s)}{N(s,a)}} \,\right\}.$$
In practice, we can also use the maximum possible return as an initial value for the search. Next, at the final time-step of the simulation, the reward and state are computed by a dynamics function. A new node, corresponding to the leaf state, is then added to the search tree. Finally, at the end of the simulation, the statistics along the trajectory are updated. Let $G(s,a)$ be the accumulated (discounted) reward for state-action pair $(s,a)$ from the simulation. We update the statistics by:

$$\hat{Q}(s,a) \;\leftarrow\; \frac{N(s,a)\,\hat{Q}(s,a) + G(s,a)}{N(s,a) + 1}, \qquad N(s,a) \;\leftarrow\; N(s,a) + 1.$$
Intuitively, at the start of the new task's MCTS, there are not yet sufficient samples available, and thus $U_{aUCT}$ serves as a tighter upper confidence bound than the one resulting from Monte Carlo action sampling in $M_{new}$. As more samples are obtained during the search process, the standard UCT bound is expected to become tighter than $U_{aUCT}$. The use of both bounds ensures both efficient knowledge transfer and task-specific search. The pseudo-code of LiZero is provided in Appendix A.2.
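The sketch below illustrates the selection and backup steps described above, assuming a minimal node structure with per-action statistics; the exploration constant and the way the transferred bound `u_auct` is supplied are illustrative assumptions.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    Q: dict = field(default_factory=dict)   # per-action value estimates
    N: dict = field(default_factory=dict)   # per-action visit counts
    visits: int = 0                         # total visits to this state

def lizero_select(node, u_auct, c=1.4):
    """Pick the action maximizing the combined bound min{aUCT, standard UCT}."""
    def combined(a):
        uct = node.Q[a] + c * math.sqrt(math.log(node.visits + 1) / (node.N[a] + 1))
        return min(u_auct[a], uct)
    return max(node.Q, key=combined)

def backup(node, action, ret):
    """Update statistics along the trajectory with the simulated return."""
    n = node.N[action]
    node.Q[action] = (n * node.Q[action] + ret) / (n + 1)   # running mean of returns
    node.N[action] = n + 1
    node.visits += 1

# Usage: two actions with transferred aUCT bounds.
node = Node(Q={"left": 0.5, "right": 0.4}, N={"left": 3, "right": 1}, visits=4)
a = lizero_select(node, u_auct={"left": 0.9, "right": 1.2})
backup(node, a, ret=0.7)
```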
For the proposed LiZero algorithm, we prove that it results in accelerated convergence in MCTS. More precisely, we analyze the sampling complexity for the learned Q-value estimate $\hat{Q}$ to converge to the true value $Q^*$, and demonstrate a strictly positive acceleration factor compared to standard UCT. The results are summarized in the following theorem.
Theorem 3.4.
To ensure convergence in a finite state-action space with probability at least $1-\delta$, the number of samples required by standard UCT is
$$n_{UCT} \;=\; O\!\left(\sum_{(s,a):\,\Delta(s,a)>0} \frac{\ln(1/\delta)}{\Delta^2(s,a)}\right), \qquad (4)$$
while the proposed LiZero algorithm requires:
$$n_{LiZero} \;=\; O\!\left(\frac{n_{UCT}}{\rho}\right), \qquad (5)$$
where $\rho \ge 1$ is an acceleration factor given by
(6) |
and $B$ is the state-action set where the aUCT bound $U_{aUCT}(s,a)$ of action $a$ is lower than the optimal return of state $s$; $\bar{\Delta}(s,a)$ is a normalized advantage in the range $(0,1]$.
The theorem shows that LiZero achieves a strictly improved acceleration with a reduced sampling complexity (by a factor of $\rho$), in terms of ensuring convergence to the optimal estimates with probability at least $1-\delta$. Since the normalized advantage lies in $(0,1]$, we have $\rho \ge 1$. It is then easy to see that the value of $\rho$ depends on the cardinality $|B|$ and the normalized advantage $\bar{\Delta}(s,a)$. More precisely, LiZero achieves higher acceleration when (i) our aUCT bound makes more actions less favorable, as $(s,a) \in B$ implies that the sub-optimality of action $a$ in state $s$ can be more easily determined due to aUCT; or (ii) aUCT helps establish tighter bounds in cases with a smaller advantage, which naturally require more samples to distinguish the optimal actions – since $\rho$ increases as the normalized advantage becomes smaller for $(s,a) \in B$, while remaining larger for $(s,a) \notin B$. These explain LiZero's ability to achieve much higher acceleration and lower sampling complexity, resulting from significantly reduced search spaces. We will evaluate this acceleration/speedup through experiments in Section 5.
4 Estimating aUCT in Practice
To deploy LiZero in practice, we need to estimate aUCT, and in particular, the distance between two MDPs. Sampling all transitions under a uniform distribution $\rho$, as required by Definition 3.1, is clearly too expensive. Thus, we develop efficient algorithms to estimate the distance metric, either from available state-action samples using a data-driven approach or via a parameterized distance using a model-based (deep learning) approach. In this section, we also provide rigorous analysis of the sampling complexity and error bounds of the proposed distance estimators. The results allow us to readily implement LiZero in practical environments. We will later evaluate the performance of different distance estimators in Section 5 and present the numerical results.
More precisely, we first propose an algorithm to estimate the distance between two MDPs, $M_1$ and $M_2$, using trajectory samples drawn from their search policies during MCTS, making use of importance sampling to mitigate the bias. We start by analyzing a stationary search policy and then extend the results to a non-stationary policy update process, by modeling it as a filtration – i.e., an increasing sequence of σ-algebras. Next, since many practical problems involve extremely large or even continuous action and state spaces, we further consider a model-based approach that learns neural network approximations of the MDPs – denoted by parameter sets $\theta_1$ and $\theta_2$, respectively – and then computes an upper bound on the distance using a parameterized distance between the neural network models. Analysis of sampling complexity and error bounds is provided as theorems in this section.
4.1 Sample-based Distance Estimate
During MCTS, transition samples are collected from the search to train a search policy $\pi$. It is easy to see that we can leverage these transition samples to estimate the distance between two MDPs, as long as we address the bias arising from the gap between the search policy $\pi$ and the desired sampling distribution $\rho$ in the distance definition $D(M_1, M_2)$. This also allows us to obtain a consistent estimate of the MDP distance that does not depend on the particular search policy being updated during training. We note that this bias can be addressed by importance sampling.
Let $d(s,a)$ be the distance term for a given state-action pair $(s,a)$, so that the distance can be rewritten as $D(M_1, M_2) = \mathbb{E}_{(s,a)\sim\rho}[d(s,a)]$. We denote by $\pi(s,a)$ the probability (or density) of sampling $(s,a)$ according to the search policy $\pi$. Importance sampling implies:
$$D(M_1, M_2) \;=\; \mathbb{E}_{(s,a)\sim\rho}\big[d(s,a)\big] \;=\; \mathbb{E}_{(s,a)\sim\pi}\!\left[\frac{\rho(s,a)}{\pi(s,a)}\, d(s,a)\right], \qquad (7)$$
which can be readily computed from the collected transition samples following the search policy $\pi$. Therefore, for a given set of samples $\{(s_i, a_i)\}_{i=1}^{n}$ collected from a search policy $\pi$, we can estimate the distance by the empirical mean:
$$\hat{D} \;=\; \frac{1}{n}\sum_{i=1}^{n} w(s_i, a_i)\, d(s_i, a_i), \qquad (8)$$
where $w(s_i, a_i) = \rho(s_i, a_i)/\pi(s_i, a_i)$ is the importance sampling weight.
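A minimal sketch of this importance-weighted estimator is shown below; `d_fn`, `rho_fn`, and `pi_fn` are assumed callables that the surrounding implementation would provide.

```python
import numpy as np

def is_distance_estimate(samples, d_fn, rho_fn, pi_fn):
    """Importance-sampling estimate of the MDP distance from on-policy samples.

    samples:      list of (s, a) pairs collected while following the search policy
    d_fn(s, a):   per-pair distance term (reward gap + scaled transition gap)
    rho_fn(s, a): desired sampling probability (e.g., uniform over S x A)
    pi_fn(s, a):  probability of (s, a) under the search policy
    """
    weights = np.array([rho_fn(s, a) / pi_fn(s, a) for s, a in samples])
    values = np.array([d_fn(s, a) for s, a in samples])
    return float(np.mean(weights * values))   # unbiased if pi covers the support of rho

# Toy usage over a tiny discrete space with stand-in callables.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
d_fn = lambda s, a: 0.1 * (s + a)
rho_fn = lambda s, a: 1.0 / len(pairs)          # uniform target distribution
pi_fn = lambda s, a: 0.4 if a == 0 else 0.1     # non-uniform search policy
print(is_distance_estimate(pairs, d_fn, rho_fn, pi_fn))
```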
As long as the state-action pairs with $\pi(s,a) > 0$ cover the support of $\rho$, this estimator satisfies $\mathbb{E}[\hat{D}] = D(M_1, M_2)$, meaning it is unbiased. Let $p_{\min}$ be the "coverage" of policy $\pi$, i.e., $p_{\min} = \min_{(s,a):\,\rho(s,a)>0} \pi(s,a)$, and let $\rho_{\max}$ be the maximum desired sampling probability. We summarize this result in the following theorem and state the sampling complexity for the estimator $\hat{D}$ to $\epsilon$-converge to $D(M_1, M_2)$.
Theorem 4.1 (Sampling Complexity under Stationarity).
Assume that for any $(s,a)$, the reward-plus-transition difference is bounded, i.e., $d(s,a) \le C$, and that there exists $p_{\min} > 0$ such that $\pi(s,a) \ge p_{\min}$ whenever $\rho(s,a) > 0$. When $n$ independent samples are used to compute $\hat{D}$, we have
$$\Pr\Big(\big|\hat{D} - D(M_1, M_2)\big| \le \epsilon\Big) \;\ge\; 1 - \delta, \qquad (9)$$
for any $\epsilon > 0$ and $\delta \in (0,1)$, if the number of samples satisfies
$$n \;\ge\; \frac{C^2 \rho_{\max}^2}{2\, p_{\min}^2\, \epsilon^2}\, \ln\frac{2}{\delta}. \qquad (10)$$
Thus, we obtain a convergence guarantee in the sense of arbitrarily high probability $1-\delta$ and arbitrarily small error $\epsilon$ for estimating $D(M_1, M_2)$ using $\hat{D}$. The estimator $\hat{D}$ is unbiased and converges to the true distance once the number of samples is sufficiently large.
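For instance, the required sample size under a Hoeffding-style bound of this form can be computed as follows; the specific constants plugged in are purely illustrative.

```python
import math

def samples_needed(C, rho_max, p_min, eps, delta):
    """Samples needed so that |D_hat - D| <= eps with prob. >= 1 - delta,
    for importance-weighted terms bounded by M = C * rho_max / p_min."""
    M = C * rho_max / p_min
    return math.ceil((M ** 2) / (2.0 * eps ** 2) * math.log(2.0 / delta))

print(samples_needed(C=2.0, rho_max=0.05, p_min=0.01, eps=0.1, delta=0.05))
```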
We note that in many practical settings, the search policy does not stick to a stationary distribution. Instead, it is continuously updated in each iteration, resulting in a non-stationary sequence of policies over time, i.e., $\pi_1, \pi_2, \ldots, \pi_n$. Thus, the transition samples $(s_t, a_t)$ we obtain at each step $t$ for estimating the distance are indeed drawn from a different $\pi_t$. We cannot assume that the samples follow a stationary distribution (nor that they are i.i.d.) in importance sampling. To address this problem, we model the non-stationary process of policy updates as a filtration – i.e., an increasing sequence of σ-algebras. In particular, we make the following assumption: at the $t$-th sampling step, the environment is forcibly reset to a predetermined policy or independently draws a state from an external memory. This assumption is reasonable because, in many episodic learning scenarios, the environment is inherently divided into episodes: at the beginning of each episode, the state is reset to some initial distribution (e.g., the opening state in Atari games or the initial pose in MuJoCo). This naturally results in the "reset" assumption.
In this setup, the policy $\pi_t$ at step $t$ is determined by information available at step $t-1$ or earlier. Consequently, once $\mathcal{F}_{t-1}$ is fixed, the (marginal) distribution of $(s_t, a_t)$ is also fixed. Therefore, we can establish the filtration as follows:
$$\mathcal{F}_t \;=\; \sigma\Big(\pi_1, (s_1, a_1), \pi_2, (s_2, a_2), \ldots, \pi_t, (s_t, a_t)\Big), \qquad (11)$$
where $\sigma(\cdot)$ denotes the smallest σ-algebra generated by the random elements. Thus, we obtain:
$$\mathbb{E}\!\left[\frac{\rho(s_t, a_t)}{\pi_t(s_t, a_t)}\, d(s_t, a_t) \,\Big|\, \mathcal{F}_{t-1}\right] \;=\; D(M_1, M_2). \qquad (12)$$
This allows us to obtain another empirical estimator $\hat{D}'$ (with per-step importance weights $\rho(s_t,a_t)/\pi_t(s_t,a_t)$) under the filtration model. We analyze the sampling complexity of $\hat{D}'$ and summarize the results in the following theorem.
Theorem 4.2 (Sampling Complexity under Non-Stationarity).
Under the same conditions as Theorem 4.1, when $n$ samples are collected under the non-stationary policies $\pi_1, \ldots, \pi_n$ and used to compute $\hat{D}'$, we have
$$\Pr\Big(\big|\hat{D}' - D(M_1, M_2)\big| \le \epsilon\Big) \;\ge\; 1 - \delta, \qquad (13)$$
for any $\epsilon > 0$ and $\delta \in (0,1)$, if the number of samples satisfies
$$n \;\ge\; \frac{2\, C^2 \rho_{\max}^2}{p_{\min}^2\, \epsilon^2}\, \ln\frac{2}{\delta}. \qquad (14)$$
This implies that more samples are needed to account for the non-stationarity of the policy update process when estimating the distance.
4.2 Model-based Distance Estimate
When the action and state spaces $A$ and $S$ are very large or even continuous, the sample-based method becomes increasingly expensive. Therefore, we propose a model-based approach that first approximates the dynamics of MDPs $M_1$ and $M_2$ with two neural networks and then estimates $D(M_1, M_2)$ based on the parameterized distance between the neural networks.
To this end, we need to establish a bound on $D(M_1, M_2)$ using the distance between their neural network parameters. We use a neural network to model the MDP dynamics. Many model-based learning algorithms, such as PILCO Deisenroth and Rasmussen (2011), MBPO Janner et al. (2019), PETS Chua et al. (2018), and MuZero Schrittwieser et al. (2020a), can be employed to learn the models of $M_1$ and $M_2$. Let $\theta_1$ be the neural network parameters of MDP $M_1$ and $\theta_2$ be the neural network parameters of MDP $M_2$. We define a distance in the parameter space:
$$d_\theta(M_1, M_2) \;=\; \mathrm{dist}(\theta_1, \theta_2), \qquad (15)$$
where $\mathrm{dist}(\cdot, \cdot)$ is a distance or divergence measure in the parameter space, such as the $\ell_1$-norm, $\ell_2$-norm, or certain kernel distances. Intuitively, if $\theta_1$ and $\theta_2$ are very close, the two neural networks are similar in fitting the dynamics of the respective MDPs, suggesting that the two MDPs should have a small distance. To provide a more rigorous characterization of this concept, we present the following theorem, which demonstrates that under proper assumptions, the distance based on neural network parameters can serve as an upper bound for the desired $D(M_1, M_2)$. Let $K$ be a constant.
Theorem 4.3.
If the neural networks modeling $M_1$ and $M_2$ satisfy the Lipschitz condition, i.e., there exists a constant $L_\theta$ such that the modeled rewards and transitions differ by at most $L_\theta \cdot \mathrm{dist}(\theta_1, \theta_2)$ at every state-action pair, then we have:
$$D(M_1, M_2) \;\le\; K\, L_\theta \cdot d_\theta(M_1, M_2). \qquad (16)$$
The theorem indicates that, by learning neural networks to model the MDP dynamics, we can estimate the distance $D(M_1, M_2)$ via the distance between the neural network parameters. This parameterized distance can be computed even for continuous action and state spaces.
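The sketch below shows how such a parameterized distance could be computed from two learned dynamics models with identical architectures; the choice of the $\ell_2$-norm is one of the options mentioned above, and the weight-list representation is an illustrative assumption.

```python
import numpy as np

def param_distance(theta1, theta2, p=2):
    """Parameter-space distance between two dynamics models (sketch).

    theta1, theta2: lists of weight arrays from two networks with the same
    architecture, trained to model M1 and M2, respectively.
    """
    v1 = np.concatenate([w.ravel() for w in theta1])
    v2 = np.concatenate([w.ravel() for w in theta2])
    return float(np.linalg.norm(v1 - v2, ord=p))   # l2-norm by default

# Toy usage with two small, nearly identical parameter sets.
theta_a = [np.ones((4, 3)), np.zeros(3)]
theta_b = [np.full((4, 3), 1.1), np.full(3, 0.05)]
print(param_distance(theta_a, theta_b))
# Under the Lipschitz assumption of Theorem 4.3, a scaled version of this
# quantity upper-bounds the MDP distance D(M1, M2).
```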
5 Evaluations
Our experiments evaluate LiZero on a series of ten learning tasks with varying transition probabilities and rewards. We demonstrate LiZero's ability to transfer past knowledge in MCTS-based planning, resulting in significant convergence speedup (34x) and early reward improvement (about 31% average improvement during the first half of the learning process) in lifelong planning problems. All experiments are conducted on a Linux machine with an AMD EPYC 7513 32-core CPU and an NVIDIA RTX A6000 GPU, implemented in Python 3. All source code is made available in the supplementary material.
[Figure 1: Convergence of different algorithms on representative Tasks 1, 2, 6, and 10 of the non-stationary task sequence.]
Table 1: Average rewards (± standard deviation) during the first 500 epochs of each task.

| Name | Task1 | Task2 | Task3 | Task4 | Task5 | Task6 | Task7 | Task8 | Task9 | Task10 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LiZero-U | 4.83±0.05 | 5.98±0.10 | 5.99±0.07 | 5.94±0.05 | 6.08±0.07 | 6.05±0.16 | 6.01±0.11 | 6.03±0.05 | 6.04±0.04 | 6.03±0.09 | 58.98 |
| LiZero-P | 4.65±0.06 | 5.89±0.11 | 5.90±0.12 | 5.90±0.03 | 5.62±0.19 | 5.68±0.22 | 5.76±0.12 | 5.87±0.03 | 5.78±0.06 | 5.79±0.20 | 56.84 |
| LiZero-N | 4.64±0.08 | 5.56±0.07 | 5.56±0.07 | 5.52±0.05 | 5.52±0.08 | 5.48±0.06 | 5.50±0.09 | 5.45±0.06 | 5.50±0.06 | 5.48±0.05 | 54.21 |
| MCTS-R | 4.51±0.07 | 4.43±0.08 | 4.32±0.11 | 4.24±0.05 | 4.18±0.07 | 4.24±0.10 | 4.25±0.03 | 4.47±0.06 | 4.34±0.03 | 4.39±0.08 | 43.37 |
| MCTS-O | 4.52±0.06 | 4.87±0.08 | 4.57±0.04 | 4.16±0.03 | 4.78±0.05 | 4.91±0.06 | 4.04±0.03 | 3.70±0.05 | 3.02±0.07 | 2.96±0.06 | 41.53 |
| pUCT | 4.66±0.04 | 4.71±0.06 | 4.69±0.13 | 4.77±0.09 | 4.74±0.04 | 4.87±0.05 | 4.94±0.06 | 4.72±0.05 | 4.86±0.07 | 4.77±0.03 | 47.73 |
| RMax | 1.02±0.02 | 1.05±0.01 | 1.01±0.02 | 1.03±0.01 | 1.04±0.01 | 1.05±0.01 | 1.03±0.03 | 1.04±0.02 | 1.03±0.02 | 1.03±0.01 | 10.33 |
| LRMax | 1.05±0.01 | 1.05±0.02 | 1.04±0.02 | 1.06±0.03 | 1.05±0.01 | 1.06±0.02 | 1.04±0.01 | 1.06±0.03 | 1.05±0.01 | 1.04±0.01 | 10.50 |
[Figure 2: (a) Estimated task distances across the sequence of tasks for LiZero-U, LiZero-P, and LiZero-N; (b) average number of epochs needed to reach 60%, 70%, and 80% of the optimal reward.]
In the evaluation, we consider several state-of-the-art baselines based on MCTS and lifelong RL. In particular, we consider two versions of MCTS algorithms that leverage UCT Winands (2024); Kocsis and Szepesvári (2006); Cheng et al.: MCTS-R denotes a version that restarts the search from scratch for each new task, and MCTS-O denotes a version that is oblivious to the non-stationary task dynamics and continues to build upon the search tree from the past. We also consider a state-of-the-art MCTS baseline using pUCT, similar to MuZero and related algorithms Schrittwieser et al. (2020b). We include two lifelong RL algorithms: RMax Brafman and Tennenholtz (2002) and LRMax Lecarpentier et al. (2021a); the latter exploits a similar Lipschitz continuity in RL but does not consider MCTS with upper confidence bounds. We evaluate three versions of LiZero using the different methods for estimating aUCT via the task distances presented in Section 4: LiZero-U employs a direct distance estimate based on Definition 3.1; LiZero-P is the data-driven distance estimator using samples following the search policy; and LiZero-N is the neural-network-based estimator using parameter distances.
The experimental environment we use is a variation of the "tight" task by Abel et al. (2018). It generates a non-stationary sequence of ten learning tasks. Each task consists of a grid world, with the initial state located at the center, and four possible actions: up, down, left, and right. The three cells in the top-right corner and one cell in the bottom-left corner are designated as goal cells. For each task, the reward for the goal cells is randomly chosen from the range [0.9, 1]. The remaining cells randomly generate interference rewards within the range [0, 0.1]. Each task's state transition matrix selects its own slip probability (i.e., the probability of performing an action different from the chosen one) within the range [0, 0.1]. This ensures that the sequence of tasks has varying rewards and transition probabilities. Each task is run for 1,000 epochs, and these runs are repeated multiple times to narrow the confidence intervals.
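For illustration, the sketch below generates one such non-stationary task sequence; the grid size and the exact choice of goal-cell coordinates are assumptions for this example, while the reward and slip-probability ranges follow the setup described above.

```python
import numpy as np

def sample_task(grid=5, seed=None):
    """Generate one grid-world task with per-task rewards and slip probability."""
    rng = np.random.default_rng(seed)
    rewards = rng.uniform(0.0, 0.1, size=(grid, grid))          # interference rewards
    goal_cells = [(0, grid - 1), (0, grid - 2), (1, grid - 1),   # top-right corner
                  (grid - 1, 0)]                                 # bottom-left corner
    for r, c in goal_cells:
        rewards[r, c] = rng.uniform(0.9, 1.0)                    # goal rewards in [0.9, 1]
    slip = rng.uniform(0.0, 0.1)                                 # per-task slip probability
    return {"rewards": rewards, "slip": slip, "start": (grid // 2, grid // 2)}

tasks = [sample_task(seed=i) for i in range(10)]                 # ten-task sequence
```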
Figure 1 shows the convergence of different algorithms on representative Tasks 1, 2, 6, and 10 in the non-stationary series of ten tasks. As tasks are drawn sequentially, the LiZero-U, LiZero-P, and LiZero-N algorithms converge more rapidly than the MCTS and lifelong RL baselines. This speedup becomes evident as early as the second task (Task 2), while similar convergence is observed in Task 1 since no prior knowledge is yet available. From Task 2 to Task 10, as more knowledge from past tasks gets transferred to the new task by LiZero, it outperforms all baselines with increasingly significant improvements in convergence speed. In Task 10, with the maximum amount of past knowledge, LiZero outperforms all baselines in both convergence speed and optimal reward. MCTS-O (which is oblivious to changing task dynamics) exhibits increasing deficiency as tasks further evolve and performs worse than MCTS-R (which restarts the search from scratch).
In Table 1, we summarize the average rewards (and their standard deviations) obtained in sequential tasks by different algorithms during their first 500 epochs (i.e., the first half of the learning process). The LiZero algorithms achieve about 31% early reward improvement on average. As for the MCTS baselines with UCT, MCTS-R shows similar rewards across different tasks, while MCTS-O demonstrates higher volatility due to its reliance on how the task dynamics evolve. pUCT achieves higher performance due to the use of an improved probabilistic UCT similar to MuZero. All MCTS baselines show better results than the lifelong RL algorithms (i.e., RMax and LRMax), which are known to be less sample efficient and require more epochs for exploration/exploitation. With more accurate distance estimates – i.e., from LiZero-N to LiZero-P and to LiZero-U – we observe further improved results due to better knowledge transfer coming from more accurate aUCT calculations.
To evaluate the speedup of LiZero, Figure 2 shows the average number of epochs needed by different algorithms to achieve 60%, 70%, and 80% of the optimal reward, respectively. We note that LiZero shows a comfortable speedup of 34x compared with the MCTS and lifelong RL baselines, while the RL baselines are clearly much less sample efficient than MCTS-based planning in general. We do not go beyond 80% in this plot since some baselines are never able to achieve more than 80% of the optimal reward that LiZero obtains. The results provide numerical examples of the acceleration factor characterized in Theorem 3.4. A more accurate aUCT bound (as in LiZero-U) generally means further acceleration/speedup.
Ablation Study. Our ablation study considers the impact of the distance estimator on performance. Figure 2(a) shows the distance estimates in LiZero-U, LiZero-P, and LiZero-N (each with decreasing accuracy) across the sequence of tasks, while for the purpose of the ablation study, MCTS-R can be viewed as an algorithm without a distance estimator. Comparing the performance of these algorithms in Table 1 and Figure 2, we see that the superior performance of LiZero indeed results from the use of aUCT in MCTS – the tighter the aUCT bound we use, the higher the performance we can achieve. Using no distance estimator, and thus only UCT (as in MCTS-R), leads to the lowest performance. Further, as tasks are drawn, the distance estimates decrease quickly, and by the third task the estimated distance is already very small, implying accurate aUCT calculation for knowledge transfer.
6 Conclusions
We study theoretically the transfer of past experience in MCTS-based lifelong planning and develop a novel aUCT rule, depending on both the Lipschitz continuity between tasks and the confidence of knowledge in Monte Carlo action sampling. The proposed aUCT is proven to provide positive acceleration in MCTS due to cross-task transfer and enables the development of a new lifelong MCTS algorithm, namely LiZero. We also present efficient methods for online estimation of aUCT and provide analysis of the sampling complexity and error bounds. LiZero is evaluated on a non-stationary series of learning tasks with varying transition probabilities and rewards. It outperforms MCTS and lifelong RL baselines with a 34x speed-up in solving new tasks and about 31% higher early reward.
Impact Statement
This paper proposes a novel framework for applying Monte Carlo Tree Search (MCTS) in lifelong learning settings, addressing the challenges posed by non-stationary environments and changing task dynamics. By introducing the adaptive Upper Confidence Bound for Trees (aUCT) and leveraging insights from previous MDPs (Markov Decision Processes), our work significantly enhances the efficiency and adaptability of decision-making algorithms across evolving tasks.
The broader societal implications of this research include its potential to improve AI applications in robotics, automated systems, and other domains requiring dynamic decision-making under uncertainty. For instance, this framework could be used in autonomous systems to adaptively respond to changing environments, thereby improving safety and reliability. At the same time, it is crucial to acknowledge and mitigate potential risks, such as unintended biases or over-reliance on prior knowledge that may not fully represent novel situations.
Ethical considerations for this work focus on its use in high-stakes applications, such as healthcare, finance, or defense, where decision-making under uncertainty could have significant consequences. Developers and practitioners should implement safeguards to ensure responsible deployment, including comprehensive testing in diverse scenarios and establishing clear boundaries for its use.
By advancing the state of the art in continual learning and decision-making, this research contributes to the development of more adaptable and intelligent AI systems while highlighting the importance of ethical and responsible innovation in AI technologies.
References
- Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
- Silver et al. (2017) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
- Schrittwieser et al. (2020a) Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020a.
- Liu et al. (2024) Qihan Liu, Jianing Ye, Xiaoteng Ma, Jun Yang, Bin Liang, and Chongjie Zhang. Efficient multi-agent reinforcement learning by planning. arXiv preprint arXiv:2405.11778, 2024.
- Xie et al. (2020) Annie Xie, James Harrison, and Chelsea Finn. Deep reinforcement learning amidst lifelong non-stationarity. arXiv preprint arXiv:2006.10701, 2020.
- Lu et al. (2020) Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Reset-free lifelong learning with skill-space planning. arXiv preprint arXiv:2012.03548, 2020.
- Fu et al. (2022) Haotian Fu, Shangqun Yu, Michael Littman, and George Konidaris. Model-based lifelong reinforcement learning with bayesian exploration. Advances in Neural Information Processing Systems, 35:32369–32382, 2022.
- Auger et al. (2013) David Auger, Adrien Couetoux, and Olivier Teytaud. Continuous upper confidence trees with polynomial exploration–consistency. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part I 13, pages 194–209. Springer, 2013.
- Matsuzaki (2018) Kiminori Matsuzaki. Empirical analysis of puct algorithm with evaluation functions of different quality. In 2018 Conference on Technologies and Applications of Artificial Intelligence (TAAI), pages 142–147. IEEE, 2018.
- Winands (2024) Mark HM Winands. Monte-carlo tree search. In Encyclopedia of computer graphics and games, pages 1179–1184. Springer, 2024.
- Kocsis and Szepesvári (2006) Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006.
- (12) Scott Cheng, Mahmut Kandemir, and Ding-Yong Hong. Speculative monte-carlo tree search. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- Schrittwieser et al. (2020b) Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, December 2020b. ISSN 1476-4687. doi: 10.1038/s41586-020-03051-4. URL http://dx.doi.org/10.1038/s41586-020-03051-4.
- Brafman and Tennenholtz (2002) Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
- Lecarpentier et al. (2021a) Erwan Lecarpentier, David Abel, Kavosh Asadi, Yuu Jinnai, Emmanuel Rachelson, and Michael L Littman. Lipschitz lifelong reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8270–8278, 2021a.
- Even-Dar et al. (2004) Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Experts in a markov decision process. Advances in neural information processing systems, 17, 2004.
- Coulom (2006) Rémi Coulom. Efficient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pages 72–83. Springer, 2006.
- Pourshamsaei and Nobakhti (2024) Hossein Pourshamsaei and Amin Nobakhti. Predictive reinforcement learning in non-stationary environments using weighted mixture policy. Applied Soft Computing, 153:111305, 2024.
- Hernandez-Leal et al. (2017) Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, and Enrique Munoz De Cote. A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183, 2017.
- Goldberg and Matarić (2003) Dani Goldberg and Maja J Matarić. Maximizing reward in a non-stationary mobile robot environment. Autonomous Agents and Multi-Agent Systems, 6:287–316, 2003.
- Lecarpentier et al. (2021b) Erwan Lecarpentier, David Abel, Kavosh Asadi, Yuu Jinnai, Emmanuel Rachelson, and Michael L. Littman. Lipschitz lifelong reinforcement learning, 2021b. URL https://arxiv.org/abs/2001.05411.
- Zou et al. (2024) Yifei Zou, Zuyuan Zhang, Congwei Zhang, Yanwei Zheng, Dongxiao Yu, and Jiguo Yu. A distributed abstract mac layer for cooperative learning on internet of vehicles. IEEE Transactions on Intelligent Transportation Systems, 2024.
- Zhang et al. (2024a) Congwei Zhang, Yifei Zou, Zuyuan Zhang, Dongxiao Yu, Jorge Torres Gómez, Tian Lan, Falko Dressler, and Xiuzhen Cheng. Distributed age-of-information scheduling with noma via deep reinforcement learning. IEEE Transactions on Mobile Computing, 2024a.
- Da Silva et al. (2018) Felipe Leno Da Silva, Matthew E Taylor, and Anna Helena Reali Costa. Autonomously reusing knowledge in multiagent reinforcement learning. In IJCAI, pages 5487–5493, 2018.
- Hawasly and Ramamoorthy (2013) Majd Hawasly and Subramanian Ramamoorthy. Lifelong learning of structure in the space of policies. In 2013 AAAI Spring Symposium Series, 2013.
- Abel et al. (2018) David Abel, Yuu Jinnai, Sophie Yue Guo, George Konidaris, and Michael Littman. Policy and value transfer in lifelong reinforcement learning. In International Conference on Machine Learning, pages 20–29. PMLR, 2018.
- Zhang et al. (2024b) Zuyuan Zhang, Hanhan Zhou, Mahdi Imani, Taeyoung Lee, and Tian Lan. Collaborative ai teaming in unknown environments via active goal deduction. arXiv preprint arXiv:2403.15341, 2024b.
- Zhang et al. (2024c) Zuyuan Zhang, Mahdi Imani, and Tian Lan. Modeling other players with bayesian beliefs for games with incomplete information. arXiv preprint arXiv:2405.14122, 2024c.
- Deisenroth and Rasmussen (2011) Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.
- Janner et al. (2019) Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in neural information processing systems, 32, 2019.
- Chua et al. (2018) Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in neural information processing systems, 31, 2018.
- Qiao et al. (2024) Jing Qiao, Zuyuan Zhang, Sheng Yue, Yuan Yuan, Zhipeng Cai, Xiao Zhang, Ju Ren, and Dongxiao Yu. Br-defedrl: Byzantine-robust decentralized federated reinforcement learning with fast convergence and communication efficiency. In IEEE INFOCOM 2024-IEEE Conference on Computer Communications, pages 141–150. IEEE, 2024.
- Gao et al. (2024) Mengtong Gao, Yifei Zou, Zuyuan Zhang, Xiuzhen Cheng, and Dongxiao Yu. Cooperative backdoor attack in decentralized reinforcement learning with theoretical guarantee. arXiv preprint arXiv:2405.15245, 2024.
- Riis (2024) Søren Riis. Mastering nim and impartial games with weak neural networks: An alphazero-inspired multi-frame approach. arXiv preprint arXiv:2411.06403, 2024.
- Chen et al. (2024) Yiwei Chen, Wenhao Li, XiuZhen Cheng, and Pengfei Hu. A survey of acoustic eavesdropping attacks: Principle, methods, and progress. High-Confidence Computing, page 100241, 2024.
- Zhang et al. (2025) Zuyuan Zhang, Vaneet Aggarwal, and Tian Lan. Network diffuser for placing-scheduling service function chains with inverse demonstration. arXiv preprint arXiv:2501.05673, 2025.
- Yin (2025) Chun-Wu Yin. Predefined time convergence guaranteed performance control for uncertain systems based on reinforcement learning. Engineering Applications of Artificial Intelligence, 140:109734, 2025.
Appendix A Appendix / supplemental material
A.1 Proof of Theorem 3.2
Proof.
Proof of Theorem 3.2. Since, in the MCTS UCB algorithm, the estimated Q-values are obtained through multiple simulations, we need to analyze how differences in the simulation results of the two MDPs affect the estimated Q-values.
However, due to the randomness involved in the simulation process of the two MDPs:
- Transition randomness: Due to different transition probabilities, the two MDPs may move to different next states even when starting from the same state and action.
- Action selection randomness: When using the UCB algorithm, action selection depends on the current statistical information, which in turn relies on the past simulation results.
The randomness mentioned above makes it impossible for us to compare two independent random simulation processes directly Qiao et al. (2024); Gao et al. (2024); Riis (2024); Chen et al. (2024); Zhang et al. (2025); Yin (2025).
To eliminate the impact of randomness, we need to construct a coupled simulation process for the two MDPs in the same probability space, allowing for a direct comparison between them. Then we will incorporate the additional errors caused by randomness into the analysis as error terms. For this purpose, we present the following assumptions.
Assumption A.1.
Let us temporarily assume that the actions selected in each simulation are the same for the two MDPs.
- Initial action consistency: The simulation starts from the same state.
- Action selection consistency: The same action is chosen in each state.
Note: This is a strong assumption and may not hold in practice. We will discuss its impact later.
Thus, we can obtain the difference in cumulative rewards between the two MDPs in a single simulation as:

$$\Delta G \;=\; \sum_{t=0}^{T} \gamma^{t}\, \Big[ R_1\big(s_t^{(1)}, a_t\big) - R_2\big(s_t^{(2)}, a_t\big) \Big], \qquad (17)$$

where $s_t^{(1)}$ and $s_t^{(2)}$ are the states of the two MDPs at step $t$, and $a_t$ is the action selected at step $t$.
So we can get
(18) |
To estimate the expectation and variance of $\Delta G$, we need to analyze how the differences in the state sequences affect the cumulative rewards.
We present several settings for the state differences.
- Probability of state difference: At each time step $t$, the probability that the states of the two MDPs differ is denoted as $p_t$.
- Initial state is the same: $p_0 = 0$.
- State difference propagation: Due to differences in transition probabilities, state differences may accumulate in subsequent time steps.
Since the probability of state differences occurring at each step is difficult to calculate precisely, we can use the total variation distance to estimate the probability of transitioning to different states. We present the definition of the total variation distance between the transition probabilities of the two MDPs and a recursive method for calculating the probability of state differences.
Definition A.2.
Under action $a$, starting from state $s$, the total variation distance between the transition probabilities of the two MDPs is:
$$d_{TV}(s,a) \;=\; \frac{1}{2} \sum_{s'} \Big| P_1(s' \mid s, a) - P_2(s' \mid s, a) \Big|. \qquad (19)$$
Thus, starting from the same state $s$ and action $a$, the probability that the two MDPs transition to different next states is at most $d_{TV}(s,a)$.
Thus, the probability of state differences occurring can be recursively expressed as:
$$p_{t+1} \;\le\; p_t + (1 - p_t)\,\max_{s,a} d_{TV}(s,a). \qquad (20)$$
So
$$p_t \;\le\; 1 - \Big(1 - \max_{s,a} d_{TV}(s,a)\Big)^{t} \;\le\; t \cdot \max_{s,a} d_{TV}(s,a). \qquad (21)$$
Thus, at each time step $t$, the expected difference in cumulative rewards is:
(22) | ||||
To estimate the variance of the cumulative reward difference, since the cumulative reward is bounded, its variance is also finite. We can easily obtain
(23) |
According to Hoeffding:
(24) |
Thus, with probability at least $1-\delta$, we have:
(25) | ||||
∎
A.2 Proof of Theorem 3.4
Proof.
Proof of Theorem 3.4. First, we consider the case of a single MDP and assume that we have a "universal" upper bound $U(s,a)$ on the true Q-values.
Lemma A.3.
Since $U(s,a) \ge Q^*(s,a)$ holds for all $(s,a)$, and initially $\hat{Q}(s,a) \le U(s,a)$, any update maintains $\hat{Q}(s,a) \le U(s,a)$ and $Q^*(s,a) \le U(s,a)$.
The above two points can be illustrated as follows. We update $\hat{Q}(s,a)$ using the truncated estimate $\min\{\hat{Q}(s,a), U(s,a)\}$, and since $U(s,a) \ge Q^*(s,a)$, even if $\hat{Q}(s,a)$ overestimates significantly during sampling, it will still be truncated by $U(s,a)$, ensuring that $\hat{Q}(s,a) \le U(s,a)$. When $\hat{Q}(s,a)$ gradually approaches $Q^*(s,a)$, it will no longer be truncated. Hence the truncation does not hinder the convergence of $\hat{Q}(s,a)$ to $Q^*(s,a)$.
Theorem A.4 (Convergence in a Single MDP).
If there are infinitely many samples for each state and its available actions (i.e., every branch in the MCTS search tree is "continuously" expanded), then the $\hat{Q}(s,a)$ generated by the above update formula almost surely converges to $Q^*(s,a)$.
Now we aim to demonstrate that, after completing certain MDPs (tasks) $M_1, \ldots, M_n$ and then switching to a new MDP $M_{new}$, the algorithm achieves faster convergence.
First, we analyze the classic scenario without transferred upper bounds. In a finite state-action space, we want to achieve the desired outcome with high probability $1-\delta$:
(26) |
Standard UCT/UCB theory typically provides a time complexity of $O\big(\sum_{(s,a)} \Delta^{-2}(s,a)\,\ln(1/\delta)\big)$. To prove this theorem, we just need to analyze the acceleration factor $\rho$, comparing the sampling complexity of our aUCT against that of standard UCT.
More specifically, if we examine each specific state $s$, the analysis often resembles that of multi-armed bandits: for a "suboptimal" action $a$, approximately $O\big(\Delta^{-2}(s,a)\ln(1/\delta)\big)$ samples are required, where $\Delta(s,a)$ is the value gap between the action and the optimal action. Summing up the exploration costs for all state-action pairs gives a total magnitude of $O\big(\sum_{(s,a):\,\Delta(s,a)>0} \Delta^{-2}(s,a)\ln(1/\delta)\big)$.
Now we introduce the case with upper bounds and analyze how to reduce the number of samples across different MDPs.
To quantitatively represent this acceleration, we divide the state-action pairs into two groups:
- Group $B$: the upper bounds are sufficiently tight, and the corresponding actions are truncated below the optimal action's value from the very beginning, i.e.,
$$U_{aUCT}(s,a) \;<\; V^*(s). \qquad (27)$$
- Group $B^c$: the upper bounds are not "tight enough," i.e.,
$$U_{aUCT}(s,a) \;\ge\; V^*(s). \qquad (28)$$
For $(s,a) \in B$: We treat each sampling step as a multi-armed bandit. Let the true mean of the optimal arm be $\mu^*$. For a certain arm $a \in B$, its true mean $\mu_a$ is known to satisfy $\mu_a \le U_{aUCT}(s,a) < \mu^*$.
Even if we truncate at $U_{aUCT}(s,a)$, the UCB algorithm's "optimistic estimate" for this arm at step $t$ is still:
(29) |
(30) |
Let . As long as:
(31) |
From the above, it is ensured that this optimistic estimate cannot exceed $\mu^*$. So
(32) |
from which we obtain the corresponding sampling time complexity for the arms in $B$.
For $(s,a) \notin B$, the pairs cannot be pruned by "truncation." They still require multiple samples, as in classic UCT, to determine whether they are truly optimal. For any such pair, we still need approximately $O\big(\Delta^{-2}(s,a)\ln(1/\delta)\big)$ samples to distinguish that it is not as good as the optimal action. Thus, the sampling complexity of our algorithm is:
(33) |
Using the fact that the normalized advantage $\bar{\Delta}(s,a) \in (0,1]$, we can rewrite this as
(34) |
In contrast, the sampling complexity of the standard UCT can be obtained using the same analysis, i.e.,
(35) |
Comparing the order bounds from Equation (35) and Equation (34), we can find the acceleration factor as
(36) |
which is the desired result in the theorem.
∎
A.3 Proof of Theorem 4.1
Proof.
Proof of Theorem 4.1. First, we need to establish unbiasedness and boundedness. For unbiasedness, we can derive:
$$\mathbb{E}_{(s,a)\sim\pi}\big[w(s,a)\, d(s,a)\big] \;=\; \sum_{(s,a)} \pi(s,a)\,\frac{\rho(s,a)}{\pi(s,a)}\, d(s,a) \;=\; \mathbb{E}_{(s,a)\sim\rho}\big[d(s,a)\big] \;=\; D(M_1, M_2). \qquad (37)$$
Therefore, $\mathbb{E}[\hat{D}] = D(M_1, M_2)$, meaning $\hat{D}$ is an unbiased estimator.
For boundedness, note that $w(s_i, a_i) = \rho(s_i, a_i)/\pi(s_i, a_i) \le \rho_{\max}/p_{\min}$ and $d(s_i, a_i) \le C$. So we can get

$$0 \;\le\; w(s_i, a_i)\, d(s_i, a_i) \;\le\; \frac{C\,\rho_{\max}}{p_{\min}} \;=:\; M. \qquad (38)$$
Based on the above analysis, each sample term lies in $[0, M]$ and has mean $D(M_1, M_2)$. According to Hoeffding's inequality, for any $\epsilon > 0$, we have:
$$\Pr\Big(\big|\hat{D} - D(M_1, M_2)\big| \ge \epsilon\Big) \;\le\; 2\exp\!\left(-\frac{2 n \epsilon^2}{M^2}\right). \qquad (40)$$
To achieve a confidence level of $1-\delta$, it requires:
$$2\exp\!\left(-\frac{2 n \epsilon^2}{M^2}\right) \;\le\; \delta. \qquad (41)$$
This is fulfilled if:
$$n \;\ge\; \frac{M^2}{2\epsilon^2}\ln\frac{2}{\delta} \;=\; \frac{C^2 \rho_{\max}^2}{2\, p_{\min}^2\, \epsilon^2}\ln\frac{2}{\delta}. \qquad (42)$$
There is then a high probability error upper bound:
$$\big|\hat{D} - D(M_1, M_2)\big| \;\le\; M\sqrt{\frac{\ln(2/\delta)}{2n}} \quad \text{with probability at least } 1-\delta. \qquad (43)$$
∎
A.4 Proof of Theorem 4.2
Proof.
Proof of Theorem 4.2. To construct a martingale difference, let:
$$X_t \;=\; \frac{\rho(s_t, a_t)}{\pi_t(s_t, a_t)}\, d(s_t, a_t) \;-\; D(M_1, M_2). \qquad (44)$$
According to the martingale condition in Equation (12), we know that $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = 0$, and the partial sum $S_k = \sum_{t=1}^{k} X_t$ satisfies $\mathbb{E}[S_k \mid \mathcal{F}_{k-1}] = S_{k-1}$. Thus, $\{S_k\}$ is a martingale.
Since $0 \le \frac{\rho(s_t,a_t)}{\pi_t(s_t,a_t)}\, d(s_t,a_t) \le M = \frac{C\rho_{\max}}{p_{\min}}$ and $0 \le D(M_1, M_2) \le M$, the martingale differences are bounded:
$$|X_t| \;\le\; M. \qquad (45)$$
According to the Azuma-Hoeffding inequality for bounded martingale differences, we have:
$$\Pr\big(|S_n| \ge a\big) \;\le\; 2\exp\!\left(-\frac{a^2}{2 n M^2}\right). \qquad (46)$$
Let $a = n\epsilon$; then $|S_n| \ge n\epsilon$ is equivalent to $|\hat{D}' - D(M_1, M_2)| \ge \epsilon$, that is:
$$\Pr\Big(\big|\hat{D}' - D(M_1, M_2)\big| \ge \epsilon\Big) \;\le\; 2\exp\!\left(-\frac{n\epsilon^2}{2M^2}\right). \qquad (47)$$
So:
$$2\exp\!\left(-\frac{n\epsilon^2}{2M^2}\right) \le \delta \quad \Longleftrightarrow \quad n \;\ge\; \frac{2M^2}{\epsilon^2}\ln\frac{2}{\delta} \;=\; \frac{2\, C^2 \rho_{\max}^2}{p_{\min}^2\, \epsilon^2}\ln\frac{2}{\delta}. \qquad (48)$$
Thus, as long as $n \ge \frac{2\, C^2 \rho_{\max}^2}{p_{\min}^2\, \epsilon^2}\ln\frac{2}{\delta}$, we have $\Pr\big(|\hat{D}' - D(M_1, M_2)| \le \epsilon\big) \ge 1-\delta$. ∎