Permutation Invariant Learning with High-Dimensional Particle Filters

Akhilan Boopathy^1∗, Aneesh Muppidi^2∗, Peggy Yang¹, Abhiram Iyer¹, William Yue¹, Ila Fiete¹
¹ Massachusetts Institute of Technology, ² Harvard University
^∗ Equal contribution
[email protected], [email protected]

Abstract

Sequential learning in deep models often suffers from challenges such as catastrophic forgetting and loss of plasticity, largely due to the permutation dependence of gradient-based algorithms, where the order of training data impacts the learning outcome. In this work, we introduce a novel permutation-invariant learning framework based on high-dimensional particle filters. We theoretically demonstrate that particle filters are invariant to the sequential ordering of training minibatches or tasks, offering a principled solution to mitigate catastrophic forgetting and loss-of-plasticity. We develop an efficient particle filter for optimizing high-dimensional models, combining the strengths of Bayesian methods with gradient-based optimization. Through extensive experiments on continual supervised and reinforcement learning benchmarks, including SplitMNIST, SplitCIFAR100, and ProcGen, we empirically show that our method consistently improves performance, while reducing variance compared to standard baselines. Project website and code is available here.

1 Introduction

What is the optimal order for training data? This question is fundamental to understanding how the sequencing of training data impacts machine learning model performance. In sequential learning settings, such as continual learning and lifelong learning, the sequencing of training data plays a crucial role in determining model performance. When models are trained on ordered minibatches of data, poor ordering—referred to as ”poor permutations”—can result in catastrophic forgetting and loss of plasticity (Wang et al., 2024; Abel et al., 2023).

In continual learning, models process tasks in a specific sequence. Unlike conventional training, where minibatch data is randomized, continual learning often relies on a strict sequence, making models prone to overfitting on newer tasks while losing performance on older tasks. This is known as catastrophic forgetting, where new information erases prior knowledge, severely degrading performance on earlier tasks (Kim & Han, 2023; van de Ven et al., 2022).

Similarly, in lifelong reinforcement learning (LRL), agents must adapt to new tasks sequentially. The order in which these tasks are presented can lead to loss of plasticity, limiting the agent’s ability to adapt to new environments (Muppidi et al., 2024; Lyle et al., 2022; Abbas et al., 2023; Sokar et al., 2023). This poor ordering can further manifest as negative transfer or primacy bias, where learning earlier tasks biases the agent towards those tasks, impeding adaptation to new tasks (Nikishin et al., 2022; Ahn et al., 2024).

To address these challenges, we propose a shift in perspective, viewing the problem through the lens of permutation invariance. By developing learning algorithms that are invariant to the order of data presentation, we can mitigate catastrophic forgetting and loss of plasticity. Our key insight is the use of particle filters, a probabilistic tool widely used in state estimation, to achieve this goal.

Particle filters excel at dynamically estimating system states from noisy data and are grounded in Bayesian inference (Thrun et al., 2005; Doucet et al., 2001b; Jonschkowski et al., 2018; Karkus et al., 2021; Corenflos et al., 2021; Pulido & van Leeuwen, 2019; Maken et al., 2022; Boopathy et al., 2024). However, their application to modern machine learning has been limited due to scalability issues in high-dimensional settings. In contrast, gradient-based optimization techniques such as gradient descent efficiently handle high-dimensional spaces but lack the probabilistic framework offered by particle filters.

In this work, we bridge this gap by proposing a novel particle filter designed specifically for high-dimensional learning. We show that by adapting particle filters to high-dimensional learning problems, we can achieve more robust permutation-invariant learning. Our approach provides a new perspective on training in sequential settings and also addresses the core challenges of catastrophic forgetting and loss of plasticity in a principled manner.

Our contributions are threefold:

•

Theoretically, we demonstrate that particle filters enable permutation-invariant learning, where the algorithm’s output remains consistent regardless of the training data order. We further show that this property naturally mitigates catastrophic forgetting and loss of plasticity.
•

We introduce a simple, gradient-based particle filter specifically tailored for high-dimensional parameter spaces. This filter retains the essential features of traditional particle filters while being computationally efficient and well-suited for typical machine-learning optimization tasks.
•

Through empirical evaluations on continual learning and lifelong reinforcement learning benchmarks, including SplitMNIST, SplitCIFAR100, and ProcGen, we show that our proposed particle filter achieves better performance and reduced variance compared to standard baselines. Additionally, we demonstrate that integrating this particle filter with continual learning and LRL methods increases overall performance and reduces performance variance.

2 Related Work

2.1 Particle Filters

Particle filters, or sequential Monte Carlo methods, are widely used for state estimation in non-linear and non-Gaussian settings. They represent probability distributions through a set of samples (particles), providing flexibility in capturing complex dynamics (Doucet et al., 2001a). In fields such as robotics, particle filters have been applied successfully to localization and mapping problems, where they handle uncertainty and non-linearities effectively (Thrun, 2002). However, a key limitation is their scalability: as the dimensionality of the problem increases, the number of particles needed grows exponentially, making them less practical in high-dimensional spaces like those in machine learning (Bengtsson et al., 2008). Recent efforts have focused on improving particle filter scalability through adaptive resampling and dimensionality reduction techniques (Li et al., 2015), but these approaches have not fully bridged the gap for large-scale machine learning applications. Our work addresses this gap by proposing a high-dimensional particle filter that is computationally efficient and well-suited for machine learning tasks.

2.2 Bayesian Model Averaging

Bayesian model averaging (BMA) is a powerful technique for integrating uncertainty into model predictions by averaging across multiple models (Hoeting et al., 1999; Wasserman, 2000). By weighting model predictions based on their posterior probabilities, BMA can provide more robust predictions and better capture model uncertainty compared to single-model approaches (Raftery et al., 2005). In modern machine learning, BMA has been employed to enhance performance and uncertainty estimation, notably in ensemble techniques (Lakshminarayanan et al., 2017; Wortsman et al., 2022). However, BMA has not been extensively explored in the context of continual or permutation-invariant learning, where uncertainty over tasks and sequential data plays a crucial role.

In this work, we demonstrate the benefits of particle filters in high-dimensional machine learning settings. We then describe a particular particle filter that functions as a BMA technique, and demonstrate its advantages empirically.

3 Particle Filters for Learning Problems

In this section, we first theoretically demonstrate two beneficial properties of particle filters generally on learning problems, namely 1) permutation-invariance and 2) avoidance of catastrophic forgetting and loss of plasticity. We then describe a particular particle filter suitable for high-dimensional learning problems.

3.1 Setup

Consider a learning problem that provides a sequence of loss functions of model parameters, and the goal of learning is to minimize the sum of the loss functions. For instance, the loss function at each time step might correspond to the cross-entropy loss on a minibatch of points for a classification problem. We denote the model parameters at time $t$ as $x_{t}\in\mathbb{R}^{d}$ and the loss function at time $t$ as $L_{t}\in\mathbb{R}^{d}\to\mathbb{R}$ . The goal is to find an $x$ minimizing $\sum_{t=1}^{T}L_{t}(x)$ , where $T$ is the total number of updates.

How can we apply particle filters to this learning problem? To do this, we suppose that instead of learning a single model, we learn a distribution of models following a Bayesian approach. Specifically, suppose that at time $t=0$ , we initially start with a prior distribution of candidate models; each $L_{t}$ corresponds to an observation that updates the likelihood of each model. Specifically, we suppose that the likelihood of the model $x$ is set as:

P(L_{t}|x)=e^{-L_{t}(x)}

(1)

This likelihood function increases the likelihood of models that achieve lower loss values. We denote the prior distribution of models as $p_{0}$ and the posterior distribution after having observed $L_{1}$ through $L_{t}$ as $p_{t}$ . Then, $p_{T}$ is given by:

p_{T}(x)=Z_{T}p_{0}(x)\Pi_{t=1}^{T}P(L_{t}|x)=Z_{T}p_{0}(x)e^{-\sum_{t=1}^{T}L_{t}(x)}

(2)

where $Z_{T}$ is a normalization factor that ensures $p_{T}$ integrates to $1$ . Observe that this posterior places high density in regions where the summed loss is low.

Particle filters enable the computation of $p_{T}(x)$ by incrementally computing estimates of $p_{t}(x)$ . Specifically, given $p_{t}(x)$ , we may compute $p_{t+1}(x)$ as:

p_{t+1}(x)=\frac{p_{t}(x)P(L_{t+1}|x)}{\int p_{t}(x^{\prime})P(L_{t+1}|x^{\prime})dx^{\prime}}=\frac{p_{t}(x)e^{-L_{t+1}(x)}}{\int p_{t}(x^{\prime})e^{-L_{t+1}(x^{\prime})}dx^{\prime}}

(3)

This Bayesian update equation may often be intractable to compute exactly, particularly when $p_{t}$ does not have a known parametric form. Instead of tracking $p_{t}$ exactly, particle filters track an estimate $\hat{p}_{t}$ instead:

\hat{p}_{t}(x)=\sum_{i}w_{t}^{(i)}\delta(x-x^{(i)}_{t})

(4)

where $\delta$ is a delta function, $x^{(i)}_{t}$ represents the $i$ th particle at time $t$ and $w_{t}^{(i)}$ represents the weight of the particle at time $t$ . Each particle filter then has a different method of estimating the Bayesian update of Equation 3. After all updates are complete, an ensemble of particles is available, each of which is an estimate of the global minimizer of $\sum_{t=1}^{T}L_{t}(x)$ . We denote the output distribution of a particle filter initialized at $\hat{p}_{0}$ trained on a sequence of loss functions $L_{1},...L_{T}$ as $\hat{p}_{0}[L_{1},...L_{T}]$ .

Since particle filters aim to approximate Bayesian updates, we suppose that each update outputs a set of particles close to the true posterior. To formalize this, suppose that there exists a symmetric, non-negative discrepancy measure $D(p,q)$ that satisfies the triangle inequality:

D(p,q)\leq D(p,r)+D(r,q)

(5)

for all $p,q,r$ . Furthermore, suppose $D(p,p)=0$ for all $p$ .

Now, suppose that the particle filter satisfies the following two conditions:

D(\hat{p}[L],\hat{q}[L])\leq CD(\hat{p},\hat{q})

(6)

and

D(\hat{p}[L],\frac{p(\cdot)e^{-L(\cdot)}}{\int p(x)e^{-L(x)}dx})\leq CD(\hat{p},p)+\epsilon

(7)

for some constants $C$ and $\epsilon$ . $C$ can be interpreted as the estimation error propagating from the previous iteration, while $\epsilon$ can be viewed as error between each particle filter update and the true Bayesian update. This allows the discrepancy at time $T$ to be bounded as:

D(\hat{p}_{0}[L_{1},...L_{T}],\frac{p(\cdot)e^{-\sum_{t}L_{t}(\cdot)}}{\int p(x)e^{-\sum_{t}L_{t}(x)}dx})\leq C^{T}D(\hat{p}_{0},p_{0})+\epsilon\frac{C^{T}-1}{C-1}

(8)

This decomposes the discrepancy at time $T$ as a term depending on the initial discrepancy $D(\hat{p}_{0},p_{0})$ and a term depending on the incremental discrepancy $\epsilon$ .

3.2 Permutation-invariance

Next, we demonstrate that particle filters are approximately permutation-invariant: they produce an output that is nearly invariant to the ordering of loss functions $L_{t}$ . We show that $\hat{p}_{0}[L_{1},...L_{T}]$ is similar to $\hat{p}_{0}[L_{\sigma_{1}},...L_{\sigma_{T}}]$ where $\sigma_{1},\sigma_{2},...\sigma_{T}$ is a permutation of $1,2,...T$ .

Theorem 1.

Suppose $\sigma_{1},\sigma_{2},...\sigma_{T}$ is a permutation of $1,2,...T$ such that $N$ swaps of adjacent elements are required to convert $\sigma_{1},\sigma_{2},...\sigma_{T}$ to $1,2,...T$ . Denote the initialized particle filter as $\hat{p}_{0}$ . Then,

D(\hat{p}_{0}[L_{1},...L_{T}],\hat{p}_{0}[L_{\sigma_{1}},...L_{\sigma_{T}}])\leq 2N\epsilon C^{T-2}\frac{C^{2}-1}{C-1}

(9)

See Appendix A for a proof. This result demonstrates that particle filters are approximately permutation invariant, especially over small sequences. Standard learning algorithms such as gradient descent are notably not permutation-invariant: they tend to be highly dependent on the ordering of data points. Permutation-invariance enables learning algorithms with less stochastic outputs: in a perfectly permutation-invariant particle filter, the only potential sources of randomness are the initial selection of particles and the randomness in the particle filter updates themselves.

3.3 Avoiding catastrophic forgetting and loss of plasticity

Now, we demonstrate that particle filters naturally avoid catastrophic forgetting and loss of plasticity. Catastrophic forgetting can be formalized in our framework as the phenomenon where a learning algorithm trained on a sequence of losses $L_{1},...L_{T}$ performs poorly on the earlier losses it is trained on. Similarly, loss of plasticity corresponds to performing poorly on later losses. We provide an upper bound on the loss at any point in training:

Theorem 2.

Suppose that all loss functions are bounded in range $[0,M]$ . Suppose that there exists a constant $k$ such that for all loss functions $L$ and distributions $p,q$ :

\mathbb{E}_{x\sim p}[L(x)]-\mathbb{E}_{x\sim q}[L(x)]\leq kD(p,q)

(10)

Also, suppose the particle filter guarantees:

\mathbb{E}_{x\sim\hat{p}[L]}[L(x)]\leq\beta\mathbb{E}_{x\sim\hat{p}}[L(x)]

(11)

for all $L$ and $\hat{p}$ . Then,

\mathbb{E}_{x\sim\hat{p}_{0}[L_{1},L_{2},...L_{T}]}[L_{i}(x)]\leq\beta M+2kT\epsilon C^{T-2}\frac{C^{2}-1}{C-1}

(12)

See Appendix B for a proof. We make two key assumptions in this theorem: the difference in average loss under two different distributions can be bounded by $D$ and that each step in the particle filter reduces the loss on which it is trained by at least a fixed factor. We believe that the first assumption may in many cases be reasonable if the loss function is sufficiently slow-changing: small changes in the distribution over $x$ should not change the average loss value. The second assumption may also be reasonable under many settings for effective particle filters as well as other standard learning algorithms; with a fixed loss function, it corresponds to a linear convergence rate. Gradient descent, for example, satisfies this assumption on loss functions satisfying the Polyak-Łojasiewicz inequality. The resulting bound on the loss $L_{i}$ guarantees that the performance can be no worse than if there had only been a single update $L_{i}$ (yielding loss at most $\beta M$ ) plus an additional error term.

3.4 Gradient-based particle filter

Given the desirable properties of particle filters in learning problems, here, we describe a particular particle filter well suited to the high-dimensional spaces found in most machine learning settings.

Suppose that our particle filter’s particle distribution $\hat{p}_{t}(x)=\sum_{i}w_{t}^{(i)}\delta(x-x^{(i)}_{t})$ represents another distribution $\tilde{p}_{t}(x)$ constructed as:

\tilde{p}_{t}(x)=Z\sum_{i}w_{t}^{(i)}e^{-\frac{||x-x^{(i)}_{t}||^{2}}{2\sigma^{2}}}

(13)

where $\sigma$ is a variance parameter and $Z$ is a normalizing constant. Essentially, $\tilde{p}_{t}(x)$ replaces each delta function in $\hat{p}_{t}(x)$ with a isotropic Gaussian. We derive the particle filter’s update on $\hat{p}_{t}(x)$ as an approximation of the optimal Bayesian update applied to $\tilde{p}_{t}(x)$ . Observe that given a prior of $\tilde{p}_{t}(x)$ and the loss function $L_{t+1}$ , the posterior is proportional to:

e^{-L_{t+1}(x)}\sum_{i}w_{t}^{(i)}e^{-\frac{||x-x^{(i)}_{t}||^{2}}{2\sigma^{2}}}

(14)

We manipulate this expression until it can be expressed in the form of Equation 13. First, we make a linear approximation of $L_{t+1}$ centered at each particle $x^{(i)}_{t}$ :

L_{t+1}(x)\approx L_{t+1}(x^{(i)}_{t})+\nabla L_{t+1}(x^{(i)}_{t})^{T}(x-x^{(i)}_{t})

(15)

Pulling $e^{-L_{t+1}(x)}$ into the summation and applying this approximation:

\sum_{i}w_{t}^{(i)}e^{-\frac{||x-x^{(i)}_{t}||^{2}}{2\sigma^{2}}-L_{t+1}(x^{(i)}_{t})-\nabla L_{t+1}(x^{(i)}_{t})^{T}(x-x^{(i)}_{t})}

(16)

Grouping terms:

\sum_{i}w_{t}^{(i)}e^{-\frac{||x||^{2}}{2\sigma^{2}}+[\frac{x^{(i)}_{t}}{\sigma^{2}}-\nabla L_{t+1}(x^{(i)}_{t})]^{T}x}e^{-\frac{||x^{(i)}_{t}||^{2}}{2\sigma^{2}}-L_{t+1}(x^{(i)}_{t})+\nabla L_{t+1}(x^{(i)}_{t})^{T}x^{(i)}_{t}}

(17)

Completing the square in the exponent:

\sum_{i}w_{t}^{(i)}e^{-\frac{||x-x^{(i)}_{t+1}||^{2}}{2\sigma^{2}}}e^{\frac{||x^{(i)}_{t+1}||^{2}}{2\sigma^{2}}-\frac{||x^{(i)}_{t}||^{2}}{2\sigma^{2}}-L_{t+1}(x^{(i)}_{t})+\nabla L_{t+1}(x^{(i)}_{t})^{T}x^{(i)}_{t}}

(18)

where $x^{(i)}_{t+1}=x^{(i)}_{t}-\sigma^{2}\nabla L_{t+1}(x^{(i)}_{t})$ . Simplifying the constant terms:

\sum_{i}w_{t}^{(i)}e^{-\frac{||x-x^{(i)}_{t+1}||^{2}}{2\sigma^{2}}}e^{\frac{\sigma^{2}||\nabla L_{t+1}(x^{(i)}_{t})||^{2}}{2}-L_{t+1}(x^{(i)}_{t})}

(19)

Observe that under our linear approximation, $L_{t+1}(x_{t+1}^{(i)})=L_{t+1}(x^{(i)}_{t})-\sigma^{2}||\nabla L_{t+1}(x^{(i)}_{t})||^{2}$ . Thus, we may write the expression as:

\sum_{i}w_{t}^{(i)}e^{-\frac{||x-x^{(i)}_{t+1}||^{2}}{2\sigma^{2}}}e^{-\frac{L_{t+1}(x_{t+1}^{(i)})+L_{t+1}(x^{(i)}_{t})}{2}}

(20)

Finally, we define $w_{t+1}^{(i)}=w_{t}^{(i)}e^{-\frac{L_{t+1}(x_{t+1}^{(i)})+L_{t+1}(x^{(i)}_{t})}{2}}$ to arrive at our final approximation of the posterior:

\sum_{i}w_{t+1}^{(i)}e^{-\frac{||x-x^{(i)}_{t+1}||^{2}}{2\sigma^{2}}}

(21)

We represent this posterior with particles $x^{(i)}_{t+1}$ and respective weights $w^{(i)}_{t+1}$ .

We summarize the update equations of this particle filter below:

x^{(i)}_{t+1}=x^{(i)}_{t}-\sigma^{2}\nabla L_{t+1}(x^{(i)}_{t})

(22)

w_{t+1}^{(i)}=w_{t}^{(i)}e^{-\frac{L_{t+1}(x_{t+1}^{(i)})+L_{t+1}(x^{(i)}_{t})}{2}}

(23)

Algorithm 1 shows the full pseudocode of the filter. For simplicity, we do not normalize the weights of the particles at each iteration; this can be done once at the end of training. Intuitively, this filter updates the positions of the particles with gradient descent, but reweights the particles based on their performance at the old and new points, with lower-loss particles weighted higher. Like gradient descent and other gradient-based optimization procedures, this particle filter is well suited for optimization on high-dimensional spaces, while retaining the properties of a particle filters outlined in the prior sections such as permutation-invariance and avoidance of catastrophic forgetting. Figure 1 illustrates how our method converges to well-performing regions of the parameter space over time.

Input: Initial particles

\{x^{(i)}_{0}\}_{i=1}^{N}

, Loss functions

L_{1},...L_{T}

, Variance

\sigma^{2}

Output: Updated particles

\{x^{(i)}_{T},w^{(i)}_{T}\}_{i=1}^{N}

for

t=0

T-1

w^{(i)}_{0}=1

for

i=1

N

x^{(i)}_{t+1}=x^{(i)}_{t}-\sigma^{2}\nabla L_{t+1}(x^{(i)}_{t})

w_{t+1}^{(i)}=w_{t}^{(i)}e^{-\frac{L_{t+1}(x_{t+1}^{(i)})+L_{t+1}(x^{(i)}_{t})}{2}}

end for

S=0

for

i=1

N

S=S+w_{T}^{(i)}

end for

for

i=1

N

w_{T}^{(i)}=w_{T}^{(i)}/S

end for

Algorithm 1 Gradient-based particle filter

Refer to caption — Figure 1: Illustration of how our particle filter converges to well-performing regions of the parameter space over the course of training on SplitMNIST. The plot is constructed by using tSNE to map the particles into two dimensions, then representing each particle with a unimodal Gaussian of fixed variance.

Theoretical guarantees

Observe that this particle filter is built on gradient descent; thus it inherits the theoretical convergence guarantees of gradient descent. In particular, unlike gradient-free particle filters, this approach is suited for high-dimensional spaces just as gradient-based optimization methods require many fewer iterations relative to gradient-free methods in high-dimensions. What separates this particle filter from simply a model average of $N$ models independently trained with gradient descent? Unlike a simple model average, this approach retains the Bayesian estimation properties of a traditional particle filter; namely, its output is an approximation of the Bayesian posterior. This allows it to maintain the desirable properties of particle filters described earlier.

Next, we demonstrate that the particle filter indeed maintains fidelity to the Bayesian updates it is based on. Specifically, we demonstrate in a simplified setting that the given two particles with the same prior probability at initialization, the particle filer produces an output exactly matching the probability ratios of the true posterior at the final particle locations:

Theorem 3.

Suppose particles $x^{(i)}$ and $x^{(j)}$ are initialized with the same prior probability:

p_{0}(x^{(i)}_{0})=p_{0}(x^{(j)}_{0})

(24)

Furthermore, suppose that all loss functions are linear:

L_{t}(x)=g_{t}^{T}x+b_{t}

(25)

Then,

\frac{p_{T}(x^{(i)}_{T})}{p_{T}(x^{(j)}_{T})}=\frac{w^{(i)}_{T}}{w^{(j)}_{T}}

(26)

See Appendix C for a proof. This theorem guarantees that the particle filter indeed maintains weights in accordance with the true posterior distribution. Thus, it achieves the best of both worlds: it can perform optimization in high-dimensions while also approximating Bayes optimal solutions.

4 Experiments and Results

In this section, we empirically validate the permutation-invariance of our gradient-based weighted particle filter (hereafter referred to as the weighted particle filter or WPF) and demonstrate its effectiveness in mitigating catastrophic forgetting and loss of plasticity across continual and lifelong learning benchmarks.

Continual Learning Experiments:

We evaluate our weighted particle filter on continual learning benchmarks SplitMNIST (LeCun & Cortes, 2010), SplitCIFAR100 (Krizhevsky, 2009), and ProcGen (Cobbe et al., 2020). SplitMNIST is divided into 5 ”super class” splits, and SplitCIFAR100 into 20, both for class-incremental learning. For ProcGen, we use image-action trajectory datasets from the games Starpilot, Fruitbot, and Dodgeball, partitioned into 15 levels sampled using the hard distribution shift mode, similar to Mediratta et al. (2024).

We compare our weighted particle filter against established continual learning methods: Synaptic Intelligence (SI), Elastic Weight Consolidation (EWC), and Learning Without Forgetting (LWF) (Zenke et al., 2017; Kirkpatrick et al., 2016; Li & Hoiem, 2016). Since our particle filter is architecture-agnostic, we also combine it with SI, EWC, and LWF to evaluate their joint effectiveness.

In the ProcGen continual learning experiments, we use a supervised behavior cloning policy as our base model and compare it against other baseline particle filters also using supervised behavior cloning. All methods are implemented with identical architectures and learning parameters to ensure a fair comparison.

After training, we measure the average accuracy/return of both the weighted particle filter and the baseline models across all splits/levels using 10 different shuffled permutations of epoch splits. To assess permutation invariance, we calculate the task-specific variance in accuracy/return across these 10 permutations. These experiments are designed to evaluate the particle filter’s resistance to catastrophic forgetting.

Our weighted particle filter uses 100 particles, with test accuracy evaluated as a weighted average across particles. We compare this approach with three additional particle filter baselines: (1) a standard particle filter, (2) a gradient-based particle filter without weighting, and (3) traditional gradient descent (single particle). This allows us to assess the impact of particle weighting and the benefits of the Bayesian framework. The standard particle filter serves as a benchmark to evaluate performance on high-dimensional problems, and it operates by resampling particles based on their training loss performance. The gradient-based particle filter without weighting (referred to as averaging particles) is included as a baseline to determine the effectiveness of particle weighting. Full implementation details can be found in the appendix.

Table 1: Average accuracy and normalized variance across classes over 10 permutations for the weighted particle filter, standard particle filter methods, continual learning baselines, and continual learning baselines combined with the weighted particle filter on SplitMINST and SplitCIFAR100.

Method Average Accuracy % (SplitMNIST) Average Accuracy % (CIFAR100) Normalized Variance (SplitMNIST / SplitCIFAR100) Particle Methods Weighted Particle Filter 72.0 23.9 0.002 / 0.001 Averaging Particles 53.4 21.3 0.012 / 0.020 Baseline Particle Filter 50.1 19.8 0.001 / 0.006 Gradient Descent 48.7 20.1 0.032 / 0.001 Continual Learning Baselines EWC 66.3 23.2 0.186 / 0.010 EWC + PF 76.8 25.8 0.004 / 0.004 LWF 67.3 26.4 0.097 / 0.050 LWF + PF 79.2 29.0 0.012 / 0.007 SI 58.6 22.9 0.168 / 0.005 SI + PF 67.6 24.6 0.025 / 0.001

LRL Experiments:

In the lifelong reinforcement learning setting, we adopt the setup from Muppidi et al. (2024) and conduct experiments on the ProcGen games Starpilot, Fruitbot, and Dodgeball. Distribution shifts are introduced by sampling new procedurally generated levels every 2 million time steps. The agent’s performance is evaluated based on the average normalized return over the course of the lifelong experiment. Additionally, we measure the normalized variance across each level for 10 different permutations of the lifelong level sequences.

We evaluate proximal policy optimization (PPO) (Schulman et al., 2017), along with other LRL methods designed to prevent loss of plasticity—specifically, PPO combined with TRAC (Muppidi et al., 2024) and PPO combined with EWC (Kessler et al., 2022)—each tested both with and without our weighted particle filter.

4.1 Avoiding Catastrophic Forgetting and Loss of Plasticity

Performance Against Other Filters:

Tables 1 and 2 provide a summary of the performance comparison between our weighted particle filter and the baseline particle filters in both continual learning and lifelong RL experiments. Our weighted particle filter consistently achieves higher mean accuracy (averaged over classes and permutations) on SplitMNIST, SplitCIFAR100, and ProcGen Behavior Cloning datasets compared to the baseline particle filter, averaging particles, and traditional gradient descent (single particle).

Furthermore, Figure 3 demonstrates that our weighted particle filter is more resistant to loss of plasticity in LRL experiments compared to PPO using gradient descent. This underscores the advantage of incorporating particle weighting into the training process. This effect may align with the conclusions of Lyle et al. (2023); Sokar et al. (2023); Muppidi et al. (2024); Kumar et al. (2023), suggesting that when a model is not overly specialized to a specific task is better able to adapt to new tasks. In our approach, maintaining multiple particles—some tuned to domain-specific tasks and others oriented towards different sequential tasks—enables the agent to switch to well-performing particles when adapting to new environments, thereby preserving plasticity.

While theoretically, the baseline particle filter has the advantages of a Bayesian approach, because of the lack of gradient-based optimization, it fails. This lack of gradient-based optimization in high dimensions means it is essentially making random guesses, leading to performances that are close to what would be expected by chance. While gradient descent might show improved results in the latest epoch, it typically does so at the expense of previous epochs’ performance. Therefore, when the performance is averaged across all epochs, the result is diminished, approaching close to random chance levels.

Performance Compared to and Combined with Continual Learning and LRL Methods:

The results presented in Table 1 indicate that our weighted particle filter not only successfully avoids catastrophic forgetting but also outperforms all other methods in the SplitMNIST setting. In the SplitCIFAR100 dataset, our model closely competes with the top-performing continual learning model, LWF.

Interestingly, the greatest benefit is observed when we combine continual learning methods with our weighted particle filter. In all cases, the addition of the weighted particle filter increases accuracy.

In the lifelong RL experiments, our weighted particle filter consistently outperforms PPO + EWC across all games. Similar to the continual learning experiments, we observe that combining the weighted particle filter with PPO + EWC or PPO + TRAC results in an increase in average normalized return, demonstrating the effectiveness of these combined approaches.

Table 2: Comparison of Average Normalized Return and Variance of Normalized Return for Particle Methods and LRL baselines with Supervised BC and PPO on both continual and lifelong setups of Dodgeball, Starpilot, Fruitbot. Higher return and lower variance is better.

Method Average Normalized Return Variance of Normalized Return Dodgeball Starpilot Fruitbot Dodgeball Starpilot Fruitbot Particle Methods CL Supervised BC 0.28 0.35 0.33 0.16 0.11 0.13 Supervised BC + Weighted Particle Filter 0.63 0.52 0.48 0.08 0.04 0.05 Supervised BC + Averaging Particles 0.31 0.41 0.36 0.11 0.08 0.07 Supervised BC + Baseline Particle Filter 0.37 0.33 0.39 0.09 0.10 0.07 Particle Methods LRL PPO (Gradient Descent) 0.31 0.38 0.47 0.09 0.05 0.05 PPO + Weighted Particle Filter 0.40 0.55 0.63 0.04 0.03 0.03 PPO + Averaging Particles 0.34 0.40 0.44 0.09 0.03 0.04 PPO + Baseline Particle Filter 0.33 0.35 0.48 0.11 0.06 0.06 LRL Baselines PPO + TRAC 0.69 0.62 0.76 0.16 0.16 0.16 PPO + TRAC + Weighted Particle Filter 0.74 0.68 0.80 0.04 0.01 0.04 PPO + EWC 0.37 0.40 0.60 0.11 0.02 0.04 PPO + EWC + Weighted Particle Filter 0.42 0.48 0.64 0.06 0.01 0.01

Permutation Invariance

A distinctive feature of our gradient-based weighted particle filter is its permutation invariance. To validate this property, we evaluated the average normalized variances over classes or levels across 10 permutation runs for each experiment and each method in both the continual learning and lifelong reinforcement learning setups. Each run involved training on a different order of class datasets for SplitMNIST and SplitCIFAR100, or on a different order of levels for the ProcGen games.

Tables 1 and 2 show that, in all experiments, our Weighted Particle Filter exhibited lower variance compared to gradient descent. Additionally, when comparing continual learning or lifelong RL methods with and without the particle filter, we observe that the Weighted Particle Filter consistently increased performance/return and reduced variance.

Figure 2 effectively illustrates this relationship in the SplitMNIST and SplitCIFAR100 experiments. The bottom right region of each plot represents the ideal scenario of high accuracy and low task variance. It is evident from both plots that this optimal region is dominated by either the Weighted Particle Filter alone or continual learning methods combined with the Weighted Particle Filter, demonstrating the advantages of our approach.

5 Conclusion

Poor permutations of training data, such as strictly ordered minibatches, can lead to catastrophic forgetting and loss of plasticity. To overcome this challenge, we theoretically demonstrated that particle filters can be permutation-invariant, allowing them to mitigate the issues associated with poor ordering of training data. This permutation invariance offers a principled solution to avoiding catastrophic forgetting and preserving plasticity throughout learning.

Our results further highlight the effectiveness of a simple, gradient-based weighted particle filter in continual, lifelong, and permutation-invariant learning. Notably, our particle filter is domain-agnostic, significantly improving performance and reducing performance variance in both lifelong reinforcement learning and supervised continual learning settings. Moreover, our approach shows greater resistance to catastrophic forgetting and loss of plasticity. The success of our method lies in the combination of gradient-based updates, which make it suitable for high-dimensional problems, and Bayesian weight updates. Our approach paves the way for broader applications of particle filter methods in high-dimensional state spaces, particularly in modern machine learning.

References

Abbas et al. (2023) Zaheer Abbas, Rosie Zhao, Joseph Modayil, Adam White, and Marlos C Machado. Loss of plasticity in continual deep reinforcement learning. In Conference on Lifelong Learning Agents, pp. 620–636. PMLR, 2023.
Abel et al. (2023) David Abel, André Barreto, Benjamin Van Roy, Doina Precup, Hado van Hasselt, and Satinder Singh. A definition of continual reinforcement learning, 2023. URL https://arxiv.org/abs/2307.11046.
Ahn et al. (2024) Hongjoon Ahn, Jinu Hyeon, Youngmin Oh, Bosun Hwang, and Taesup Moon. Catastrophic negative transfer: An overlooked problem in continual reinforcement learning, 2024. URL https://openreview.net/forum?id=o7BwUyXz1f.
Bengtsson et al. (2008) Thomas Bengtsson, Peter Bickel, and Bo Li. Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems. In Probability and statistics: Essays in honor of David A. Freedman, volume 2, pp. 316–335. Institute of Mathematical Statistics, 2008.
Boopathy et al. (2024) Akhilan Boopathy, Aneesh Muppidi, Peggy Yang, Abhiram Iyer, William Yue, and Ila Fiete. Resampling-free particle filters in high-dimensions. ICRA, 2024.
Cobbe et al. (2020) Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In International conference on machine learning, pp. 2048–2056. PMLR, 2020.
Corenflos et al. (2021) Adrien Corenflos, James Thornton, George Deligiannidis, and Arnaud Doucet. Differentiable particle filtering via entropy-regularized optimal transport. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 2100–2111. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/corenflos21a.html.
Doucet et al. (2001a) Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential monte carlo methods. Sequential Monte Carlo methods in practice, pp. 3–14, 2001a.
Doucet et al. (2001b) Arnaud Doucet, Nando De Freitas, Neil James Gordon, et al. Sequential Monte Carlo methods in practice, volume 1. Springer, 2001b.
Hoeting et al. (1999) Jennifer A Hoeting, David Madigan, Adrian E Raftery, and Chris T Volinsky. Bayesian model averaging: a tutorial (with comments by m. clyde, david draper and ei george, and a rejoinder by the authors. Statistical science, 14(4):382–417, 1999.
Jonschkowski et al. (2018) Rico Jonschkowski, Divyam Rastogi, and Oliver Brock. Differentiable particle filters: End-to-end learning with algorithmic priors. Robotics: Science and Systems (RSS), 2018.
Karkus et al. (2021) Peter Karkus, Shaojun Cai, and David Hsu. Differentiable slam-net: Learning particle slam for visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2815–2825, June 2021.
Kessler et al. (2022) Samuel Kessler, Jack Parker-Holder, Philip Ball, Stefan Zohren, and Stephen J. Roberts. Same state, different task: Continual reinforcement learning without interference, 2022. URL https://arxiv.org/abs/2106.02940.
Kim & Han (2023) Dongwan Kim and Bohyung Han. On the stability-plasticity dilemma of class-incremental learning, 2023. URL https://arxiv.org/abs/2304.01663.
Kirkpatrick et al. (2016) James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. CoRR, abs/1612.00796, 2016. URL http://arxiv.org/abs/1612.00796.
Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. pp. 32–33, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
Kumar et al. (2023) Saurabh Kumar, Henrik Marklund, and Benjamin Van Roy. Maintaining plasticity via regenerative regularization. arXiv preprint arXiv:2308.11958, 2023.
Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. NeurIPS, 30, 2017.
LeCun & Cortes (2010) Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
Li et al. (2015) Tiancheng Li, Miodrag Bolic, and Petar M Djuric. Resampling methods for particle filtering: classification, implementation, and strategies. IEEE Signal processing magazine, 32(3):70–86, 2015.
Li & Hoiem (2016) Zhizhong Li and Derek Hoiem. Learning without forgetting. CoRR, abs/1606.09282, 2016. URL http://arxiv.org/abs/1606.09282.
Lyle et al. (2022) Clare Lyle, Mark Rowland, and Will Dabney. Understanding and preventing capacity loss in reinforcement learning. In International Conference on Learning Representations, 2022.
Lyle et al. (2023) Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks. In International Conference on Machine Learning, pp. 23190–23211. PMLR, 2023.
Maken et al. (2022) Fahira Afzal Maken, Fabio Ramos, and Lionel Ott. Stein particle filter for nonlinear, non-gaussian state estimation. IEEE Robotics and Automation Letters, 7(2):5421–5428, 2022.
Mediratta et al. (2024) Ishita Mediratta, Qingfei You, Minqi Jiang, and Roberta Raileanu. The generalization gap in offline reinforcement learning, 2024. URL https://arxiv.org/abs/2312.05742.
Muppidi et al. (2024) Aneesh Muppidi, Zhiyu Zhang, and Heng Yang. Fast trac: A parameter-free optimizer for lifelong reinforcement learning. In Advances in Neural Information Processing Systems, 2024. URL https://arxiv.org/abs/2405.16642.
Nikishin et al. (2022) Evgenii Nikishin, Max Schwarzer, Pierluca D’Oro, Pierre-Luc Bacon, and Aaron Courville. The primacy bias in deep reinforcement learning. In International Conference on Machine Learning, pp. 16828–16847. PMLR, 2022.
Pulido & van Leeuwen (2019) Manuel Pulido and Peter Jan van Leeuwen. Sequential monte carlo with kernel embedded mappings: The mapping particle filter. Journal of Computational Physics, 396:400–415, 2019.
Raftery et al. (2005) Adrian E Raftery, Tilmann Gneiting, Fadoua Balabdaoui, and Michael Polakowski. Using bayesian model averaging to calibrate forecast ensembles. Monthly weather review, 133(5):1155–1174, 2005.
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Sokar et al. (2023) Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The dormant neuron phenomenon in deep reinforcement learning. In International Conference on Machine Learning, pp. 32145–32168. PMLR, 2023.
Thrun (2002) S Thrun. Robotic mapping: A survey,”. Exploring Artificial Intelligence in the New Millenium/Morgan Kaufmann google schola, 2:237–267, 2002.
Thrun et al. (2005) Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic robotics. MIT Press, Cambridge, Mass., 2005. ISBN 0262201623 9780262201629. URL http://www.amazon.de/gp/product/0262201623/102-8479661-9831324?v=glance&n=283155&n=507846&s=books&v=glance.
van de Ven et al. (2022) Gido van de Ven, Tinne Tuytelaars, and Andreas Tolias. Three types of incremental learning. Nature Machine Intelligence, 4:1–13, 12 2022. doi: 10.1038/s42256-022-00568-3.
Wang et al. (2024) Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application, 2024. URL https://arxiv.org/abs/2302.00487.
Wasserman (2000) Larry Wasserman. Bayesian model selection and model averaging. Journal of mathematical psychology, 44(1):92–107, 2000.
Wortsman et al. (2022) Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In ICML, pp. 23965–23998. PMLR, 2022.
Zenke et al. (2017) Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3987–3995. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/zenke17a.html.

Appendix A Proof of Theorem 1

Proof.

We first show that $\hat{p}[L_{1},L_{2}]$ is similar to $\hat{p}[L_{2},L_{1}]$ . Observe that applying true Bayesian updates $L_{1}$ and $L_{2}$ to $\hat{p}$ following Equation 3 yields:

Z\hat{p}(x)e^{-L_{1}(x)-L_{2}(x)}

(27)

for some normalizing constant $Z$ , which is invariant to the ordering of the loss functions. By Equation 8, we have:

D(\hat{p}[L_{1},L_{2}],Z\hat{p}(\cdot)e^{-L_{1}(\cdot)-L_{2}(\cdot)})\leq C^{2}D(\hat{p},\hat{p})+\epsilon\frac{C^{2}-1}{C-1}=\epsilon\frac{C^{2}-1}{C-1}

(28)

∎

since $D(\hat{p},\hat{p})=0$ . Similarly, we have:

D(\hat{p}[L_{2},L_{1}],Z\hat{p}(\cdot)e^{-L_{1}(\cdot)-L_{2}(\cdot)})\leq\epsilon\frac{C^{2}-1}{C-1}

(29)

By the triangle inequality, we have:

D(\hat{p}[L_{1},L_{2}],\hat{p}[L_{2},L_{1}])\leq 2\epsilon\frac{C^{2}-1}{C-1}

(30)

Now, we bound the discrepancy between $\hat{p}[L_{1},L_{2}]$ and $\hat{p}[L_{2},L_{1}]$ when we apply $k$ additional updates $L_{3}$ through $L_{2+k}$ . By Equation 6, we have:

D(\hat{p}[L_{1},L_{2},L_{3},...L_{2+k}],\hat{p}[L_{2},L_{1},L_{3},...L_{2+k}])\leq 2\epsilon C^{k}\frac{C^{2}-1}{C-1}

(31)

We may apply this inequality to bound the discrepancy between particle filter outputs when any two adjacent losses are swapped:

D(\hat{p}[L_{1},L_{2},...L_{i-1},L_{i},L_{i+1},L_{i+2},...L_{T}],\hat{p}[L_{1},L_{2},...L_{i-1},L_{i+1},L_{i},L_{i+2},...L_{T}])\leq 2\epsilon C^{T-2}\frac{C^{2}-1}{C-1}

(32)

Thus, with $N$ swaps, using the triangle inequality, the discrepancy may be bounded as:

D(\hat{p}_{0}[L_{1},...L_{T}],\hat{p}_{0}[L_{\sigma_{1}},...L_{\sigma_{T}}])\leq 2N\epsilon C^{T-2}\frac{C^{2}-1}{C-1}

(33)

Appendix B Proof of Theorem 2

Proof.

We first bound the difference in loss between $\hat{p}_{0}[L_{1},L_{2},...L_{i-1},L_{i},L_{i+1},L_{i+2},...L_{T}]$ and $\hat{p}_{0}[L_{1},L_{2},...L_{i-1},L_{i+1},L_{i+2},...L_{T},L_{i}]$ . By Theorem 1, we have:

D(\hat{p}_{0}[L_{1},L_{2},...L_{i-1},L_{i},L_{i+1},L_{i+2},...L_{T}],\hat{p}_{0}[L_{1},L_{2},...L_{i-1},L_{i+1},L_{i+2},...L_{T},L_{i}])\leq 2T\epsilon C^{T-2}\frac{C^{2}-1}{C-1}

(34)

Applying the bound on the difference of $L$ under different distributions:

\mathbb{E}_{x\sim\hat{p}_{0}[L_{1},L_{2},...L_{i-1},L_{i},L_{i+1},L_{i+2},...L_{T}]}[L_{i}(x)]-\mathbb{E}_{x\sim\hat{p}_{0}[L_{1},L_{2},...L_{i-1},L_{i+1},L_{i+2},...L_{T},L_{i}]}[L_{i}(x)]\leq 2kT\epsilon C^{T-2}\frac{C^{2}-1}{C-1}

(35)

Now, applying the reduction in loss by training on $L_{i}$ :

\mathbb{E}_{x\sim\hat{p}_{0}[L_{1},L_{2},...L_{i-1},L_{i},L_{i+1},L_{i+2},...L_{T}]}[L_{i}(x)]\leq\beta\mathbb{E}_{x\sim\hat{p}_{0}[L_{1},L_{2},...L_{i-1},L_{i+1},L_{i+2},...L_{T}]}[L_{i}(x)]+2kT\epsilon C^{T-2}\frac{C^{2}-1}{C-1}

(36)

Finally, applying the absolute bound on the loss:

\mathbb{E}_{x\sim\hat{p}_{0}[L_{1},L_{2},...L_{T}]}[L_{i}(x)]\leq\beta M+2kT\epsilon C^{T-2}\frac{C^{2}-1}{C-1}

(37)

∎

Appendix C Proof of Theorem 3

Proof.

First, observe that since the losses are linear, particle $i$ at time $t$ has position:

x_{t}^{(i)}=x_{0}^{(i)}-\sigma^{2}\sum_{\tau=1}^{t}g_{\tau}

(38)

Next, observe that the weight of particle $i$ at the end of training may simply be expressed as the product of all weight updates:

w^{(i)}_{T}=e^{-\frac{1}{2}\sum_{t=1}^{T}L_{t}(x_{t}^{(i)})+L_{t}(x_{t-1}^{(i)})}

(39)

We omit the normalizing constant for notational convenience. Using the linearity of $L_{t}$ and the update equation $x_{t}^{(i)}=x_{t-1}^{(i)}-\sigma^{2}g_{t}$ :

w^{(i)}_{T}=e^{-\sum_{t=1}^{T}g_{t}^{T}(x_{t-1}^{(i)}-\frac{1}{2}\sigma^{2}g_{t})+b_{t}}

(40)

Now, expanding $x_{t-1}^{(i)}$ in terms of $g_{t}$ :

w^{(i)}_{T}=e^{-\sum_{t=1}^{T}g_{t}^{T}(x_{0}^{(i)}-\sigma^{2}\sum_{\tau=1}^{t-1}g_{\tau}-\frac{1}{2}\sigma^{2}g_{t})+b_{t}}

(41)

Rearranging terms:

w^{(i)}_{T}=e^{-\sum_{t=1}^{T}g_{t}^{T}x_{0}^{(i)}+\sigma^{2}\sum_{t=1}^{T}(\sum_{\tau=1}^{t-1}g_{t}^{T}g_{\tau}+\frac{1}{2}g_{t}^{T}g_{t})-\sum_{t=1}^{T}b_{t}}

(42)

Rewriting the double summation:

w^{(i)}_{T}=e^{-\sum_{t=1}^{T}g_{t}^{T}x_{0}^{(i)}+\frac{1}{2}\sigma^{2}\sum_{t=1}^{T}\sum_{\tau=1}^{T}g_{t}^{T}g_{\tau}-\sum_{t=1}^{T}b_{t}}

(43)

Rearranging terms again:

w^{(i)}_{T}=e^{-\sum_{t=1}^{T}g_{t}^{T}[x_{0}^{(i)}-\frac{1}{2}\sigma^{2}\sum_{\tau=1}^{T}g_{\tau}]-b_{t}}

(44)

Observe that $x_{T}^{(i)}=x_{0}^{(i)}-\frac{1}{2}\sigma^{2}\sum_{\tau=1}^{T}g_{\tau}$ ; thus, $x_{0}^{(i)}-\frac{1}{2}\sigma^{2}\sum_{\tau=1}^{T}g_{\tau}=\frac{1}{2}(x_{T}^{(i)}+x_{0}^{(i)})$ . Using this and the linearity of $L_{t}$ :

w^{(i)}_{T}=e^{-\frac{1}{2}\sum_{t=1}^{T}L_{t}(x_{0}^{(i)})+L_{t}(x_{T}^{(i)})}=e^{-\sum_{t=1}^{T}L_{t}(x_{T}^{(i)})-\sigma^{2}\frac{1}{2}\sum_{t=1}^{T}L_{t}(\sum_{\tau=1}^{T}g_{\tau})}

(45)

Next, note that $p_{T}(x_{T}^{(i)})$ is given by:

p_{T}(x_{T}^{(i)})=p_{0}(x_{T}^{(i)})e^{-\sum_{t=1}^{T}L_{t}(x_{T}^{(i)})}

(46)

where we again omit normalizing constants for convenience. Finally, applying the same equations for particle $j$

\frac{w^{(i)}_{T}}{w^{(j)}_{T}}=\frac{e^{-\sum_{t=1}^{T}L_{t}(x_{T}^{(i)})-\sigma^{2}\frac{1}{2}\sum_{t=1}^{T}L_{t}(\sum_{\tau=1}^{T}g_{\tau})}}{e^{-\sum_{t=1}^{T}L_{t}(x_{T}^{(i)})-\sigma^{2}\frac{1}{2}\sum_{t=1}^{T}L_{t}(\sum_{\tau=1}^{T}g_{\tau})}}=\frac{e^{-\sum_{t=1}^{T}L_{t}(x_{T}^{(i)})}}{e^{-\sum_{t=1}^{T}L_{t}(x_{T}^{(j)})}}=\frac{p_{T}(x_{T}^{(i)})}{p_{T}(x_{T}^{(j)})}

(47)

∎

Appendix D Experimental Setup and Details

SplitMNIST task:

In this task, our objective is to sequentially address a series of five binary classification tasks derived from the MNIST dataset. These tasks are designed to distinguish between pairs of digits, presenting a unique challenge in each case. The specific pairings are as follows:

•

Digits 0 and 1 ({0v1})
•

Digits 2 and 3 ({2v3})
•

Digits 4 and 5 ({4v5})
•

Digits 6 and 7 ({6v7})
•

Digits 8 and 9 ({8v9})

Split CIFAR100 Task:

This task involves the sequential solution of 20 different 5-class classification tasks. Each task is associated with a distinct category comprising a specific group of objects or entities. The categories, along with their corresponding class labels, are listed below:

•

Aquatic mammals: {beaver, dolphin, otter, seal, whale}
•

Fish: {aquarium fish, flatfish, ray, shark, trout}
•

Flowers: {orchid, poppy, rose, sunflower, tulip}
•

Food containers: {bottle, bowl, can, cup, plate}
•

Fruit and vegetables: {apple, mushroom, orange, pear, sweet pepper}
•

Household electrical devices: {clock, computer keyboard, lamp, telephone, television}
•

Household furniture: {bed, chair, couch, table, wardrobe}
•

Insects: {bee, beetle, butterfly, caterpillar, cockroach}
•

Large carnivores: {bear, leopard, lion, tiger, wolf}
•

Large man-made outdoor things: {bridge, castle, house, road, skyscraper}
•

Large natural outdoor scenes: {cloud, forest, mountain, plain, sea}
•

Large omnivores and herbivores: {camel, cattle, chimpanzee, elephant, kangaroo}
•

Medium-sized mammals: {fox, porcupine, possum, raccoon, skunk}
•

Non-insect invertebrates: {crab, lobster, snail, spider, worm}
•

People: {baby, boy, girl, man, woman}
•

Reptiles: {crocodile, dinosaur, lizard, snake, turtle}
•

Small mammals: {hamster, mouse, rabbit, shrew, squirrel}
•

Trees: {maple tree, oak tree, palm tree, pine tree, willow tree}
•

Vehicles 1: {bicycle, bus, motorcycle, pickup truck, train}
•

Vehicles 2: {lawn mower, rocket, streetcar, tank, tractor}

ProcGen Environment:

We use the ProcGen games Starpilot, Dodgeball, and Fruitbot, which employ procedural content generation to create new levels (corresponding to specific seeds) upon episode reset. We specifically use the hard mode to introduce distribution shifts and ensure the tasks are sufficiently challenging for both lifelong reinforcement learning and continual behavioral cloning.

Level Characteristics:

•

Observation: The observation space consists of an RGB image of shape 64x64x3, representing the state of the environment at each time step.
•

Action Space: The action space is discrete, with up to 15 possible actions depending on the game.
•

Reward: Rewards are provided in either dense or sparse formats, depending on the specific game.
•

Termination Condition: A boolean value indicates whether the episode has ended.

Offline Data Collection:

•
Levels Used:
- –
  
  Levels [0, 15): These levels are used for collecting trajectories, specifically for the lifelong RL setup. The agent in our BC experiments is trained sequentially, observing level 0 for 2 million steps, followed by level 1, and so on.
•
Expert Policies Training:
- –
  
  We follow a similar setup to Mediratta et al. (2024). A strong and well-trained PPO policy is used, which was trained for 20 million steps on 200 levels of each game. This approach ensures that the policy generalizes well and acts as a proficient expert agent for collecting trajectories.
•
Dataset Generation:
- –
  
  Expert Dataset: To generate the expert dataset, we rolled out the final checkpoint of the pretrained PPO model (i.e., the expert policy) across the 15 training levels (splits), collecting 100,000 transitions per level.

Lifelong RL Setup:

For the lifelong reinforcement learning setup, we followed the same experimental protocol as Muppidi et al. (2024). In the ProcGen experiments, individual game levels were generated using a seed value as the start_level parameter, which was incremented sequentially to create new levels. Every 2 million steps, a new level was introduced to the agent using the hard distribution mode. To assess permutation invariance, the sequence of start-level seeds was permuted 10 times, providing a diverse set of training orders for evaluation.

Model details:

Our Gradient-based particle filter uses 100 particles. Particles are initialized randomly from PyTorch’s nn.module network parameters. A small amount of noise is injected into these parameters in the beginning of training to increase exploration of the solution space. Our Gradient-descent implementation uses the same code, except we initialize the particle filter with only one particle. The averaging particle filter simply takes the average of the accuracies of all of the particles. The baseline particle filter does the following:

1.

Resamples particles from the existing pool with probabilities proportional to the exponential of the negative loss associated with each particle. .
2.

Applies perturbations to particles, enabling the exploration of the solution space.
3.

Updates the weights of the particles based on the new loss.

We have also incorporated three continual learning methods: SI, EWC, and LWF (van de Ven et al., 2022). Each of these methods has been implemented following the default, method-specific settings as prescribed in the (van de Ven et al., 2022) code implementation. These three models used a ”pure-domain” setting.

For a detailed implementation of our particle filter, please refer to our code submission.