
Proc. of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2023), May 29 – June 2, 2023, London, United Kingdom. A. Ricci, W. Yeoh, N. Agmon, B. An (eds.).

Affiliations: Lawrence Livermore National Laboratory (LLNL), Livermore, CA, USA; Texas A&M University, College Station, TX, USA; Brown University, Providence, RI, USA.

Multi-Agent Reinforcement Learning for Adaptive Mesh Refinement

Jiachen Yang [email protected] Ketan Mittal [email protected] Tarik Dzanic [email protected] Socratis Petrides [email protected] Brendan Keith [email protected] Brenden Petersen [email protected] Daniel Faissol [email protected]  and  Robert Anderson [email protected]
Abstract.

Adaptive mesh refinement (AMR) is necessary for efficient finite element simulations of complex physical phenomena, as it allocates a limited computational budget based on the need for higher or lower resolution, which varies over space and time. We present a novel formulation of AMR as a fully-cooperative Markov game, in which each element is an independent agent that makes refinement and de-refinement choices based on local information. We design a novel deep multi-agent reinforcement learning (MARL) algorithm called Value Decomposition Graph Network (VDGN), which solves the two core challenges that AMR poses for MARL: posthumous credit assignment due to agent creation and deletion, and unstructured observations due to the diversity of mesh geometries. For the first time, we show that MARL enables anticipatory refinement of regions that will encounter complex features at future times, thereby unlocking entirely new regions of the error-cost objective landscape that are inaccessible by traditional methods based on local error estimators. Comprehensive experiments show that VDGN policies significantly outperform error threshold-based policies in global error and cost metrics. We show that learned policies generalize to test problems with physical features, mesh geometries, and longer simulation times that were not seen in training. We also extend VDGN with multi-objective optimization capabilities to find the Pareto front of the tradeoff between cost and error.

Key words and phrases:
multi-agent reinforcement learning; adaptive mesh refinement; numerical analysis; graph neural network

1. Introduction

The finite element method (FEM) (Brenner and Scott, 2007) is instrumental to numerical simulation of partial differential equations (PDEs) in computational science and engineering (Reddy and Gartling, 2010; Monk et al., 2003). For multi-scale systems with large variations in local features, such as combinations of regions with large gradients that require high resolution and regions with flat solutions where coarse resolution is sufficient, an efficient trade-off between solution accuracy and computational cost requires the use of adaptive mesh refinement (AMR). The goal of AMR is to adjust the finite element mesh resolution dynamically during a simulation, by refining regions that can contribute the most to improvement in accuracy relative to computational cost.

For evolutionary (i.e., time-dependent) PDEs in particular, a long-standing challenge is to find anticipatory refinement strategies that optimize a long-term objective, such as an efficient tradeoff between final solution accuracy and cumulative degrees of freedom (DoF). Anticipatory refinement strategies would preemptively refine regions of the mesh that will contain solution features (e.g., large gradients) right before these features actually occur. This is hard for existing approaches to achieve. Traditional methods for AMR rely on estimating local refinement indicators (e.g., local error (Zienkiewicz and Zhu, 1992)) and heuristic marking strategies (e.g., greedy error-based marking) (Bangerth and Rannacher, 2013; Červený et al., 2019). Recent data-driven methods for mesh refinement apply supervised learning to learn a fast neural network estimator of the solution from a fixed dataset of pre-generated high-resolution solutions (Obiols-Sales et al., 2022; Wallwork et al., 2022). However, greedy strategies based on local information cannot in general produce an optimal sequence of anticipatory refinement decisions, as they lack information about features that may occur at subsequent time steps, while supervised methods do not directly optimize a given long-term objective. These challenges can be addressed by formulating AMR as a sequential decision-making problem and using reinforcement learning (RL) (Sutton and Barto, 2018) to optimize a given objective directly. However, current single-agent RL approaches for AMR either do not easily support refinement of multiple elements per solver step (Yang et al., 2021), face different definitions of the environment transition at training and test time (Foucart et al., 2022), or work by selecting a single global error threshold (Gillette et al., 2022), which poses difficulties for anticipatory refinement.

In this work, we present the first formulation of AMR as a Markov game (Owen, 1982; Littman, 1994) and propose a novel fully-cooperative deep multi-agent reinforcement learning (MARL) algorithm (Foerster, 2018; Hernandez-Leal et al., 2019; Yang, 2021) called Value Decomposition Graph Network (VDGN), shown in Figure 1, to train a team of independently and simultaneously acting agents, each of which is a decision-maker for an element, to optimize a global performance metric and find anticipatory refinement policies. Because refinement and de-refinement actions at each step of the AMR Markov game lead to the creation and deletion of agents, we face the posthumous credit assignment problem (Cohen et al., 2021): agents who contributed to a future reward are not necessarily present at the future time to experience it. We show that VDGN, by virtue of centralized training with decentralized execution (Oliehoek and Amato, 2016), addresses this key challenge. By leveraging graph networks as the inductive bias (Battaglia et al., 2018), VDGN supports meshes with a varying number of elements at each time step and can be applied to meshes of arbitrary size, depth, and geometry. Moreover, graph attention layers (Veličković et al., 2018) in VDGN enable each element to receive information from a large local neighborhood, so that it can anticipate incoming or outgoing solution features and learn to take preemptive actions. Experimentally, using the advection equation as a pedagogical benchmark problem, we show for the first time that MARL: (a) displays anticipatory refinement behavior; (b) generalizes to different initial conditions, initial mesh resolutions, simulation durations, and mesh geometries; (c) significantly improves error and cost metrics compared to local error threshold-based policies, and unlocks new regions of the error-cost optimization landscape. Furthermore, we augment VDGN with a multi-objective optimization method to train a single policy that discovers the Pareto front of error and cost.

Figure 1. A state-action-next state transition in the Markov game for AMR. Shown in detail are the action selection (rightward) and learning (leftward) processes in the Value Decomposition Graph Network.

2. Preliminaries

2.1. Finite Element Method

The finite element method (Brenner and Scott, 2007) models the domain of a PDE with a mesh that consists of nonoverlapping geometric elements. Using the weak formulation and choosing basis functions to represent the PDE solution on these elements, one obtains a system of linear equations that can be solved numerically. The shapes and sizes of the elements determine the solution error, while the computational cost is determined primarily by the number of elements. Improving the trade-off between error and cost is the objective of AMR. This work focuses purely on $h$-refinement, in which refinement of a coarse element produces multiple new fine elements (e.g., isotropic refinement of a 2D quadrilateral element produces four new elements), whereas de-refinement of a group of fine elements restores the original coarse element. In both cases, the original elements (the coarse element or the group of fine elements, respectively) are removed from the FEM discretization.

2.2. Notation

Each element is identified with an agent, denoted by $i\in\{1,\dotsc,n_{t}\}$, where $n_{t}$ is variable across time steps $t=0,1,\dotsc,T_{\max}$ within each episode of length $T_{\max}$, and $n_{t}\in[N_{\max}]$, where $N_{\max}$ is constrained by $\text{depth}_{\max}$ to be $n_{x}\cdot n_{y}\cdot 4^{\text{depth}_{\max}}$ in the case of quadrilateral elements. Let each element $i$ have its own individual observation space $\mathcal{S}^{i}$ and action space $\mathcal{A}^{i}$. Let $s$ denote the global state, let $o^{i}$ and $a^{i}$ denote agent $i$'s individual observation and action, respectively, and let $a := (a^{1},\dotsc,a^{n_{t}})$ denote the joint action of the $n_{t}$ agents. In the case of AMR, all agents have the same observation and action spaces. Let $R$ denote a single global reward shared by all agents and let $P$ denote the environment transition function, both of which are well-defined for any number of agents $n_{t}\in[N_{\max}]$. Let $\gamma\in(0,1]$ be the discount factor. The FEM solver time at episode step $t$ is denoted by $\tau$ (we omit the dependency $\tau(t)$ for brevity); it advances by $\tau_{\text{step}}$, in increments of $d\tau=0.002$, during each discrete step $t\rightarrow t+1$, up to the final simulation time $\tau_{f}$.

For element $i$ at time $t$, let $u^{i}_{t}$ and $\hat{u}^{i}_{t}$ denote the true and numerical solution, respectively, and let $c^{i}_{t} := \lVert u^{i}_{t}-\hat{u}^{i}_{t}\rVert_{2}$ denote the $\mathcal{L}_{2}$ norm of the error. Let $c_{t} := \sqrt{\sum_{i=1}^{n_{t}}(c^{i}_{t})^{2}}$ denote the global error of a mesh with $n_{t}$ elements. Let $\text{depth}(i)\in\{0,1,\dotsc,\text{depth}_{\max}\}$ denote the refinement depth of element $i$. Let $d_{t} := \sum_{s=0}^{t}\text{DoF}_{s}$ denote the cumulative degrees of freedom (DoF) of the mesh up to step $t$, a measure of computational cost, and let $d_{\text{thres}}$ be a threshold (i.e., a constraint) on the cumulative DoF seen during training.
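As a concrete illustration of the notation, the global error and cumulative DoF simply aggregate the per-element quantities; a minimal sketch in Python (function names are illustrative, not from the paper's code):

```python
import math

def global_error(element_errors):
    """Global mesh error c_t from the element-wise L2 errors c_t^i:
    the square root of the sum of squared element errors."""
    return math.sqrt(sum(c * c for c in element_errors))

def cumulative_dof(dofs_per_step):
    """Cumulative degrees of freedom d_t, summed over steps 0..t."""
    return sum(dofs_per_step)
```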

2.3. Value Decomposition Network

In the paradigm of centralized training with decentralized execution (CTDE) (Oliehoek and Amato, 2016), global state and reward information is used in centralized optimization of a team objective at training time, while decentralized execution allows each agent to take actions conditioned only on its own local observations, independently of other agents, at both training and test time. One simple yet effective way to implement this in value-based MARL is the Value Decomposition Network (VDN) (Sunehag et al., 2018). VDN learns within the class of global action-value functions $Q(s,a)$ that decompose additively:

\[ Q(s,a) := \sum_{i=1}^{n} Q^{i}(s^{i},a^{i}), \] (1)

where $Q^{i}$ is an individual utility function representing agent $i$'s contribution to the joint expected return.

This decomposition is amenable for use in Q-learning (Watkins and Dayan, 1992), as it satisfies the individual-global-max (IGM) condition:

\[ \operatorname*{argmax}_{\mathbf{a}\in\prod_{i=1}^{n}\mathcal{A}^{i}} Q(s,a) = \left[\operatorname*{argmax}_{a^{1}\in\mathcal{A}^{1}} Q^{1}(s^{1},a^{1}), \dotsc, \operatorname*{argmax}_{a^{n}\in\mathcal{A}^{n}} Q^{n}(s^{n},a^{n})\right]. \] (2)

This means the individual maxima of $Q^{i}$ provide the global maximum of the joint $Q$ function for the Q-learning update step (Watkins and Dayan, 1992; Mnih et al., 2015), which scales linearly rather than exponentially in the number of agents. Using function approximation $Q^{i}_{\theta}$ with parameters $\theta$, the VDN update equations using replay buffer $\mathcal{B}$ are:

\[ \theta \leftarrow \theta - \nabla_{\theta}\, \mathbb{E}_{(s_{t},a_{t},r_{t},s_{t+1})\sim\mathcal{B}}\left[\left(y_{t+1} - Q_{\theta}(s_{t},a_{t})\right)^{2}\right] \] (3)
\[ y_{t+1} := R_{t} + \bot\,\gamma\, Q_{\theta}(s_{t+1},a)\big|_{a=[\operatorname*{argmax}_{a^{i}} Q^{i}_{\theta}(s_{t+1}^{i},a^{i})]_{i=1}^{n}}, \] (4)

where $\bot$ is the stop-gradient operator.
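As a concrete illustration, the VDN update (eqs. 3 and 4) can be written compactly. The snippet below is a minimal PyTorch-style sketch; the paper's implementation uses the Graph Nets/TensorFlow stack and additional improvements such as double Q-learning, so the names `q_net`, `target_net`, and the batch layout are assumptions:

```python
import torch

def vdn_td_loss(q_net, target_net, batch, gamma=0.99):
    """Sketch of the VDN TD loss: joint Q is the sum of per-agent utilities (eq. 1),
    and the target uses a decentralized per-agent argmax, valid by IGM (eq. 2)."""
    obs, actions, reward, next_obs, done = batch  # lists of per-agent tensors + global reward

    # Q(s, a) = sum_i Q^i(o^i, a^i), with a^i the action actually taken.
    q_joint = sum(q_net(o).gather(-1, a) for o, a in zip(obs, actions)).squeeze(-1)

    with torch.no_grad():  # stop-gradient on the target, as in eq. 4
        # Per-agent greedy maximization gives the joint maximum (IGM).
        q_next = sum(target_net(o).max(dim=-1).values for o in next_obs)
        target = reward + gamma * (1.0 - done) * q_next

    return torch.nn.functional.mse_loss(q_joint, target)
```

The decentralized per-agent argmax in the target is what makes the maximization scale linearly, rather than exponentially, with the number of agents.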

3. Agent creation and deletion

Each agent’s refinement and de-refinement action has a long-term impact on the global error (e.g., refining before the arrival of a feature would reduce the error upon its arrival). However, since all elements that refine or de-refine are removed immediately and their individual trajectories terminate, they do not observe future states and future rewards. This is the posthumous multi-agent credit assignment problem (Cohen et al., 2021), which we propose to address using centralized training. First, we show that an environment with a variable but bounded number of agents can be written as a Markov game (Owen, 1982) (see proof in Appendix B).

Proposition 1.

Let $\mathcal{M}$ denote a multi-agent environment where the number of agents $n_{t}\in\{1,\dotsc,N_{\max}\}$ can change at every time step $t=0,1,\dotsc,T_{\max}$ due to agent-creation and agent-deletion actions in each agent's action space. At each time $t$, the environment is defined by the tuple $\left(\{\mathcal{S}^{i}\}_{i=1}^{n_{t}},\{\mathcal{A}^{i}\}_{i=1}^{n_{t}},R,P,\gamma,n_{t},N_{\max}\right)$. Then $\mathcal{M}$ can be written as a Markov game with a stationary global state space and joint action space that do not depend on the number of currently existing agents.

Centralized training for posthumous multi-agent credit assignment. Our key insight for addressing the posthumous credit assignment problem stems from Proposition 1: because the environment is Markov and stationary, we can use centralized training with a global reward to train a global state-action value function $Q(s,a)$ that (a) persists across time and (b) evaluates the expected future return of any $(s_{t},a_{t})$. Crucially, these two properties enable $Q(s,a)$ to sidestep the issue of posthumous credit assignment, since the value estimate of a global state will be updated by future rewards via temporal difference learning regardless of agent deletion and creation. To arrive at a truly multi-agent approach, we factorize the global action space so that each element uses its individual utility function $Q^{i}(o^{i},a^{i})$ to choose its own action from $\{\text{no-op},\text{refine},\text{de-refine}\}$. This immediately leads to the paradigm of centralized training with decentralized execution (Oliehoek and Amato, 2016), of which VDN (eq. 1) is an example. We discuss the pros and cons of other formulations in Appendix E.

Effective space. Agent creation and deletion means that the accessible region of the global state-action space changes over time during each episode. While this is not a new phenomenon, AMR is special in that the sizes of the informative subset of the global state and of the available action set depend directly on the current number of existing agents. Hence, a key observation for algorithm development is that a model-free multi-agent reinforcement learning algorithm only needs to account for the accessible state-action space $\prod_{i=1}^{n_{t}}\mathcal{S}^{i}\times\mathcal{A}^{i}$ at each time step, since the expansion or contraction of that space is part of the environment dynamics, which are accounted for implicitly by model-free MARL methods. In practice, this means the dummy states $s_{\nexists}$ do not need to be input to policies, and policies do not need to output the (mandatory) no-op actions for the $N_{\max}-n_{t}$ nonexistent elements at time $t$. This informs our concrete definition of the Markov game for AMR in Section 4.

4. AMR as a Markov game

State. The global state is defined by the collection of all individual element observations and pairwise relational features. The individual observation $o^{i}$ of element $i$ consists of:

  • $\log(c^{i}_{t})$: logarithm of the error at element $i$ at time $t$.

  • A 1-hot representation of the element's refinement depth.

Relational features $e^{ij}$ are defined for each pair of spatially adjacent elements $i,j$ that form an edge $e=(i,j)$ (directed from $j$ to $i$) in the graph representation of the mesh (see Section 5), as the concatenation of the following into a 1-dimensional vector:

  • $\mathbf{1}[\text{depth}(i)-\text{depth}(j)]$: 1-hot indicator of the difference in refinement depth between $i$ and $j$.

  • $\langle u_{j},\frac{\Delta x}{\lVert\Delta x\rVert^{2}_{2}}\rangle\cdot\tau_{\text{step}}$: dimensionless inner product of the velocity $u_{j}$ at element $j$ with the displacement vector between $i$ and $j$, where $\Delta x := x_{i}-x_{j}$ and $x_{i}$ is the center of element $i$.

We use the velocity $u_{j}$ at the sender element so that the receiver element is informed about incoming or outgoing PDE features and can act preemptively (a sketch of the edge feature is given below).
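A minimal sketch of how such an edge feature could be assembled; the exact one-hot encoding and array layout are assumptions, not the paper's code:

```python
import numpy as np

def edge_feature(depth_i, depth_j, u_j, x_i, x_j, tau_step, max_depth=2):
    """Relational feature e^{ij} for edge j -> i: a one-hot of the depth
    difference concatenated with the projected-velocity term."""
    # One-hot of depth(i) - depth(j), shifted so the index is non-negative.
    diff = depth_i - depth_j
    one_hot = np.zeros(2 * max_depth + 1)
    one_hot[diff + max_depth] = 1.0

    # Dimensionless inner product <u_j, dx / ||dx||^2> * tau_step.
    dx = np.asarray(x_i) - np.asarray(x_j)
    proj = float(np.dot(u_j, dx) / np.dot(dx, dx)) * tau_step

    return np.concatenate([one_hot, [proj]])
```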

Action. All elements have the same action space:

\[ \mathcal{A} := \{\texttt{no-op},\texttt{refine},\texttt{de-refine}\}, \]

where no-op means the element persists to the next decision step; refine means the element is equipartitioned into four smaller elements; de-refine means that the element opts to coalesce into a larger coarse element, subject to feasibility constraints specified by the transition function (see below).

Transition. Given the current state and the agents' joint action, which is chosen simultaneously by all agents, the transition $P\colon\mathcal{S}\times\mathcal{A}\to\mathcal{S}$ is defined by the following steps:

  1. Apply the de-refinement rules to each element $i$ whose action is de-refine (see the sketch after this list): (a) if, within its group of sibling elements, a majority (or tie) of elements chose de-refine, then the whole group is de-refined; (b) if it is at the coarsest level, i.e., $\text{depth}(i)=0$, or it belongs to a group of sibling elements in which any element chose to refine, then its choice is overridden to no-op.

  2. Apply refinement to each agent that chose refine.

  3. Step the FEM simulation forward in time by $\tau_{\text{step}}$ and compute a new solution on the updated mesh. An episode terminates when $\tau+d\tau>\tau_{f}$ or $d_{t}>d_{\text{thres}}$. This follows standard FEM procedure (Anderson et al., 2021), and knowledge of the transition dynamics is not used by the proposed model-free MARL approach.
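The de-refinement rules in step 1 can be summarized by the following sketch for a single sibling group; any detail beyond the rules stated above is an assumption:

```python
def resolve_derefinement(group_actions, group_depth):
    """Resolve de-refinement votes within one group of sibling elements.
    Returns the effective per-element actions after applying rules (a) and (b)."""
    n = len(group_actions)
    votes = sum(a == "de-refine" for a in group_actions)

    if group_depth == 0 or "refine" in group_actions:
        # Rule (b): coarsest level, or a sibling wants to refine -> override to no-op.
        return ["no-op" if a == "de-refine" else a for a in group_actions]
    if 2 * votes >= n:
        # Rule (a): majority (or tie) chose de-refine -> whole group is de-refined.
        return ["de-refine"] * n
    # Otherwise the de-refine requests are not honored.
    return ["no-op" if a == "de-refine" else a for a in group_actions]
```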

Reward. We carefully design a shaped reward (Ng et al., 1999) that encourages agents to minimize the final global error. Let $c_{0}=1.0$ be a dummy initial global error. The reward at step $t=1,2,\dotsc$ is

\[ R_{t} = \begin{cases} p\cdot(\tau-\tau_{f}) + \log(c_{t-1}), & \text{if } d_{t}>d_{\text{thres}} \wedge \tau+d\tau<\tau_{f} \\ \log(c_{t-1}/c_{t}) - p\cdot\max\!\left(0,\frac{d_{t}}{d_{\text{thres}}}-1\right), & \text{if } \tau+d\tau\geq\tau_{f} \\ \log(c_{t-1}/c_{t}), & \text{otherwise} \end{cases} \] (5)

The first case applies a penalty $p$ when the cumulative DoF exceeds the DoF threshold before the simulation final time is reached. The second case applies a penalty based on the amount by which the DoF threshold is exceeded when the final time is reached. The last case provides the agents with a dense learning signal based on potential-based reward shaping (Ng et al., 1999), so that the episode return (i.e., the cumulative reward at the end of the episode) is $R=-\log(c_{T_{\max}})$ in the absence of the other penalties.
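A direct transcription of eq. 5; the default penalty coefficient `p` shown here is a placeholder, not the paper's value:

```python
import math

def shaped_reward(c_prev, c_curr, tau, d_tau, tau_f, d_cum, d_thres, p=10.0):
    """Shaped reward of eq. 5, computed after a mesh update and solver step."""
    if d_cum > d_thres and tau + d_tau < tau_f:
        # DoF budget exhausted before the final time: terminal penalty.
        return p * (tau - tau_f) + math.log(c_prev)
    if tau + d_tau >= tau_f:
        # Final step: shaped error term plus penalty for exceeding the budget.
        return math.log(c_prev / c_curr) - p * max(0.0, d_cum / d_thres - 1.0)
    # Intermediate step: potential-based shaping, telescopes to -log(c_Tmax).
    return math.log(c_prev / c_curr)
```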

Objective. We consider a fully-cooperative Markov game in which all agents aim to find a shared parameterized policy $\pi_{\theta}\colon\mathcal{S}\times\mathcal{A}\to[0,1]$ that maximizes the objective

\[ \max_{\theta} J(\theta) := \mathbb{E}_{s_{t+1}\sim P(\cdot\mid s_{t},a),\, a\sim\pi_{\theta}(\cdot\mid s)}\left[\sum_{t=0}^{T}\gamma^{t}R_{t}\right]. \] (6)
Remark 1.

By using the time limit $\tau_{f}$ and DoF threshold $d_{\text{thres}}$ in the reward at training time, while not letting agents observe absolute time or cumulative DoF, agents must learn to make optimal decisions based on local observations, which enables generalization to longer simulation durations and larger DoF budgets at test time.

Remark 2.

For problems without an easily computable analytical solution for the true error, one may use an error estimator for the element observations. The reward, which is only needed at training time and not at test time, can still be based on the error with respect to a highly resolved reference simulation. Empirical results for this configuration are provided in Appendix D.

Figure 2. Graph Attention Layer. A softmax over all edges connected to node $i$ produces attention weights $\hat{a}^{ij}$ for edge $(i,j)$ (eq. 8). A weighted sum over values $b^{ij}$ with weights $\hat{a}^{ij}$ produces the updated node feature $\hat{v}^{i}$ (eq. 10).

5. Value Decomposition Graph Network

To enable anticipatory refinement in the time-dependent case, an element must observe a large enough neighborhood around itself. However, it is difficult to define a fixed local observation window for general mesh geometries, element shapes, and boundaries. Instead, we use Graph Networks (Scarselli et al., 2008; Battaglia et al., 2018) as a general inductive bias to learn representations of element interactions on arbitrary meshes.

Specifically, we construct a policy based on Graph Attention Networks (Veličković et al., 2018), which incorporate self-attention (Vaswani et al., 2017) into graph networks. At each step $t$, the mesh is represented as a graph $\mathcal{G}=(V_{t},E_{t})$. Each node $v^{i}$ in $V=\{v^{i}\}_{i=1:n_{t}}$ corresponds to element $i$, and its feature is initialized to be the element observation $o^{i}$. $E=\{e^{k}=(r^{k},s^{k})\}_{k=1:N^{e}}$ is a set of edges, where an edge $e^{k}$ exists between sender node $s^{k}$ and receiver node $r^{k}$ if and only if they are spatially adjacent (i.e., they share either a face or a vertex in the mesh). Its feature is initialized to be the relational feature $e^{r^{k}s^{k}}$.

5.1. Graph Attention Layer

In a graph attention layer, each node is updated by a weighted aggregation over its neighbors: weights are computed by self-attention using node features as queries and keys, then applied to values that are computed from node and edge features.

Self-attention weights $\hat{a}^{ij}$ for each node $i$ are computed as follows (see Figure 2): 1) we define queries, keys, and values as linear projections of node features, via weight matrices $W^{q}$, $W^{k}$, and $W^{v}$ (all in $\mathbb{R}^{d\times\dim(v)}$) shared across all nodes; 2) for each edge $(i,j)$, we compute a scalar pairwise interaction term $a^{ij}$ using the dot product of queries and keys; 3) for each receiver node $i$ with sender node $j\in\mathcal{N}_{i}$, we define the attention weight as the $j$-th component of a softmax over all neighbors $k\in\mathcal{N}_{i}$:

\[ a^{ij} := W^{q}v^{i}\cdot W^{k}v^{j} \quad \text{for } (i,j)\in E, \] (7)
\[ \hat{a}^{ij} := \texttt{softmax}_{j}(\{a^{ik}\}_{k}) = \frac{\exp(a^{ij})}{\sum_{k\in\mathcal{N}_{i}}\exp(a^{ik})} \quad \text{for } j\in\mathcal{N}_{i}. \] (8)

We use these attention weights to compute the new feature for each node $i$ as a linear combination, over its neighbors $j\in\mathcal{N}_{i}$, of projected values $W^{v}v^{j}$. Edge features $e^{ij}$, with linear projection $W^{e}\in\mathbb{R}^{d\times\dim(e)}$, capture the relational part of the observation:

\[ b^{ij} := W^{v}v^{j} + W^{e}e^{ij} \quad \text{for } (i,j)\in E, \] (9)
\[ \hat{v}^{i} := \sum_{j\in\mathcal{N}_{i}}\hat{a}^{ij}b^{ij} \quad \text{for } i\in V. \] (10)

Despite being a special case of the most general message-passing flavor of graph networks (Battaglia et al., 2018), graph attention networks separate the learning of $\hat{a}^{ij}$, the scalar importance of the interaction between $i$ and $j$ relative to other neighbors, from the learning of $b^{ij}$, the vector determining how $j$ affects $i$. This additional inductive bias reduces the functional search space and can improve learning with large receptive fields around each node, just as attention is useful for long-range interactions in sequence data (Vaswani et al., 2017).
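The single-head computation of eqs. 7–10 can be sketched as follows. This is an illustrative PyTorch loop over receiver nodes, not the paper's (batched) Graph Nets implementation:

```python
import torch
import torch.nn.functional as F

def graph_attention(v, e, edges, Wq, Wk, Wv, We):
    """Single-head graph attention with edge features (eqs. 7-10).
    v: [N, dv] node features; e: [E, de] edge features;
    edges: list of (receiver i, sender j) index pairs."""
    new_v = torch.zeros(len(v), Wv.shape[0])
    for i in range(len(v)):
        nbrs = [k for k, (r, _) in enumerate(edges) if r == i]
        if not nbrs:
            continue
        q = Wq @ v[i]
        # Scalar query-key interactions a^{ij} (eq. 7), softmax over neighbors (eq. 8).
        scores = torch.stack([(q * (Wk @ v[edges[k][1]])).sum() for k in nbrs])
        alpha = F.softmax(scores, dim=0)
        # Values from projected node and edge features (eq. 9), weighted sum (eq. 10).
        vals = torch.stack([Wv @ v[edges[k][1]] + We @ e[k] for k in nbrs])
        new_v[i] = (alpha.unsqueeze(-1) * vals).sum(dim=0)
    return new_v
```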

Multi-head graph attention layer. We extend graph attention by building on the mechanism of multi-head attention (Vaswani et al., 2017), which uses $H$ independent linear projections $(W^{q,h},W^{k,h},W^{v,h},W^{e,h})$ for queries, keys, values, and edges (all projected to dimension $d/H$), and results in $H$ independent sets of attention weights $\hat{a}^{ij,h}$, with $h=1,2,\dotsc,H$. This enables attention over different learned representation spaces and was found to stabilize learning (Veličković et al., 2018). The new node representation with multi-head attention is the concatenation of all output heads, followed by an output linear projection $W^{o}\in\mathbb{R}^{d\times d}$. Section A.1 shows the multi-head versions of eqs. 7, 8, 9 and 10.

5.2. Value Decomposition Graph Network

We use the graph attention layer to define the Value Decomposition Graph Network (VDGN), shown in Figure 1. Firstly, an independent graph network block (Battaglia et al., 2018) linearly projects node and edge features independently to $\mathbb{R}^{d}$.

Figure 3. VDGN layer

Each layer $F$ of VDGN involves two sub-layers: a multi-head graph attention layer followed by a fully-connected independent graph network block with ReLU nonlinearity. For each of these sub-layers, we use residual connections (He et al., 2016) and layer normalization (Ba et al., 2016), so that the transformation of an input graph $g$ is $\text{LayerNorm}(g+\text{SubLayer}(g))$. This is visualized in Figure 3. We define a VDGN stack as $S := F_{L}(F_{L-1}(\dotsm F_{1}(g)\dotsm))$ with $L$ unique layers without weight sharing, then define a single forward pass by $r$ recurrent applications of $S$: $P(g) = S(S(\dotsm S(g)\dotsm))$ with $r$ instances of $S$. Finally, to produce $|\mathcal{A}|$ action-values for each node (i.e., element) $i$, we apply a final graph attention layer whose output linear projection is $W^{o}_{\text{out}}\in\mathbb{R}^{|\mathcal{A}|\times d}$, so that each final node representation is $\hat{v}^{i}\in\mathbb{R}^{|\mathcal{A}|}$, interpreted as the individual element utility $Q(o^{i},a^{i})$ for all possible actions $a^{i}\in\mathcal{A}$.
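To make the layer structure concrete, here is a minimal PyTorch sketch of one VDGN layer (Figure 3), assuming an `attn_fn` callable that implements the multi-head graph attention of Section 5.1; the class and argument names are illustrative, not the paper's released code:

```python
import torch.nn as nn

class VDGNLayer(nn.Module):
    """One VDGN layer F: multi-head graph attention, then a node-wise MLP block,
    each sub-layer applied as LayerNorm(g + SubLayer(g))."""
    def __init__(self, d, attn_fn):
        super().__init__()
        self.attn_fn = attn_fn  # assumed: (nodes, edges, edge_index) -> updated node features
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, nodes, edges, edge_index):
        nodes = self.norm1(nodes + self.attn_fn(nodes, edges, edge_index))
        nodes = self.norm2(nodes + self.mlp(nodes))
        return nodes
```

Stacking $L$ such layers and applying the stack $r$ times recurrently gives the forward pass described above.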

VDGN is trained using eq. 1 in a standard Q-learning algorithm (Mnih et al., 2015) (see the leftward process in Figure 1). Since VDGN is fundamentally a $Q$-function-based method, we can extend it with a number of independent algorithmic improvements (Hessel et al., 2018) to the single-agent Deep Q Network (Mnih et al., 2015) that provide complementary gains in learning speed and performance. These include double Q-learning (Van Hasselt et al., 2016), dueling networks (Wang et al., 2016), and prioritized replay (Schaul et al., 2015). Details are provided in Section A.2. We did not employ other improvements such as noisy networks, distributional Q-learning, and multi-step returns, because the AMR environment dynamics are deterministic for the PDEs we consider and the episode horizon at training time is short.

5.3. Symmetries

Methods for FEM and AMR should respect symmetries of the physics being simulated. For the simulations of interest in this work, we require a refinement policy to satisfy two properties: a) spatial equivariance: given a spatial rotation or translation of the PDE, the mesh refinement decisions should also rotate or translate in the same way; b) time invariance: the same global PDE state at different absolute times should result in the same refinement decisions. By construction of node and edge features, and the fact that graph neural networks operate on individual nodes and edges independently, we have the following result, proved in Appendix B.

Proposition 2.

VDGN is equivariant to global rotations and translations of the error and velocity field, and it is time invariant.

5.4. Multi-objective VDGN

In applications where a user’s preference between minimizing error and reducing computational cost is not known until test time, one cannot a priori combine error and cost into a single scalar reward at training time. Instead, one must take a multi-objective optimization viewpoint (Hayes et al., 2022) and treat cost and error as separate components of a vector reward function $\mathbf{R}=(R^{c},R^{e})$. The components encourage lower DoFs and lower error, respectively, and are defined by

\[ R^{c}_{t} := \frac{d_{t-1}-d_{t}}{d_{\text{thres}}}; \qquad R^{e}_{t} := \log(c_{t-1})-\log(c_{t}). \] (11)

The objective is $\mathbf{J}(\theta) := \mathbb{E}_{s_{t+1}\sim P(\cdot\mid s_{t},a),\, a\sim\pi_{\theta}(\cdot\mid s)}\left[\sum_{t=0}^{T}\gamma^{t}\mathbf{R}_{t}\right] \in \mathbb{R}^{2}$, which is vector-valued. We focus on the widely applicable setting of linear preferences, whereby a user's scalar utility based on preference vector $\omega$ is $\omega^{T}\mathbf{R}$ (e.g., $\omega=[0.5,0.5]$ means the user cares equally about cost and error). At training time, we randomly sample $\omega\in\Omega$ in each episode and aim to find an optimal action-value function $\mathbf{Q}^{*}(s,a,\omega) := \arg_{Q}\sup_{\pi\in\Pi}\omega^{T}\mathbb{E}_{\pi}\left[\sum_{t=0}^{T}\gamma^{t}\mathbf{R}_{t}\right]$, where $\arg_{Q}$ extracts the vector $\mathbb{E}_{\pi}[\dotsc]$ corresponding to the supremum. We extend VDGN with Envelope Q-learning (Yang et al., 2019), a multi-objective RL method that efficiently finds the convex envelope of the Pareto front in multi-objective MDPs; see Section A.3 for details. Once trained, $\mathbf{Q}^{*}$ induces the optimal policy for any preference $\omega$ via the greedy policy $a^{*}=\operatorname*{argmax}_{a}\omega^{T}\mathbf{Q}^{*}(s,a,\omega)$.
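Once $\mathbf{Q}^{*}$ is trained, acting for a given preference reduces to a scalarized greedy step; a minimal sketch, where the `Q` callable returning a per-objective value vector is an assumption about the interface:

```python
import numpy as np

def greedy_action(Q, state, omega, actions):
    """Preference-conditioned greedy policy: pick the action maximizing
    omega^T Q*(s, a, omega), with Q returning a (cost, error) value vector."""
    scores = [float(np.dot(omega, Q(state, a, omega))) for a in actions]
    return actions[int(np.argmax(scores))]
```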

6. Experimental setup

We designed experiments to test the ability of VDGN to find generalizable AMR strategies that display anticipatory refinement behavior, and benchmark these policies against standard baselines on error and DoF metrics. We define the FEM environment in Section 6.1, and the implementation of our method and baselines in Section 6.2 and Appendix C. Results are analyzed in Section 7.

6.1. AMR Environment

We use MFEM (Anderson et al., 2021; mfem, [n.d.]) and PyMFEM (mfem, [n.d.]), a modular open-source library for FEM, to implement the Markov game for AMR. We ran experiments on the linear advection equation $\frac{\partial u}{\partial t}+\nu\nabla\cdot u=0$ with random initial conditions (ICs) for the velocity $\nu$ and solution $u(0)$, solving it with the FEM framework on a two-dimensional $L^{2}$ finite element space with periodic boundary conditions. Each discrete step of the Markov game is a mesh re-grid step, with $\tau_{\text{step}}$ of FEM simulation time elapsing between consecutive steps. The solution is represented using discontinuous first-order Lagrange polynomials, and the initial mesh is partitioned into $n_{x}\times n_{y}$ quadrilateral elements. Appendix C contains further FEM details on the mesh partition and the construction of element observations.

Linear advection is a useful benchmark for AMR despite its apparent simplicity, because the challenge of anticipatory refinement can be made arbitrarily hard by increasing the amount of simulation time $\tau_{\text{step}}$ that elapses between two consecutive steps of the Markov game (i.e., between mesh update steps). Intuitively, an optimal refinement strategy must refine the entire connected region that covers the propagation path of solution features with large gradients (i.e., high error on a coarse mesh), and maintain coarse elements everywhere else. Hence, the larger $\tau_{\text{step}}$ is, the harder it is for distant elements, which currently have low error but will experience large solution gradients later, to refine preemptively. Such long-distance preemptive refinement capability is exactly the key for future applications in which one prefers to have few re-meshing steps during a simulation due to their computational cost. Moreover, the existence of an analytic solution lets us benchmark against error threshold-based baselines under the ideal condition of access to a perfect error estimator.

Metric. Besides analyzing performance via the error $c$ and cumulative DoFs $d$ individually, we define an efficiency metric $\eta = 1-\sqrt{\tilde{c}^{2}+\tilde{d}^{2}} \in [0,1]$, where a higher value means higher efficiency. Here, $\tilde{c} := \frac{c-c_{\text{fine}}}{c_{\text{coarse}}-c_{\text{fine}}}$ is the normalized solution error and $\tilde{d} := \frac{d-d_{\text{coarse}}}{d_{\text{fine}}-d_{\text{coarse}}}$ is the normalized cumulative degrees of freedom (a measure of computational cost). The subscripts "fine" and "coarse" indicate that the quantity is computed on a uniformly fine or coarse mesh, respectively, that is held static (not refined/de-refined) over time. The uniformly fine and coarse meshes themselves attain efficiency $\eta=0$. Efficiency $\eta=1$ is unattainable in principle, since non-trivial problems require $d>d_{\text{coarse}}$.
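The efficiency metric can be computed directly from the error and DoF of a run together with those of the static coarse and fine reference meshes; a minimal sketch:

```python
import math

def efficiency(c, d, c_coarse, c_fine, d_coarse, d_fine):
    """Efficiency eta = 1 - sqrt(c_norm^2 + d_norm^2), where error and cumulative
    DoFs are normalized against static uniformly coarse/fine reference meshes."""
    c_norm = (c - c_fine) / (c_coarse - c_fine)
    d_norm = (d - d_coarse) / (d_fine - d_coarse)
    return 1.0 - math.sqrt(c_norm ** 2 + d_norm ** 2)
```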

6.2. Implementation and Baselines

The graph attention layer of VDGN was constructed using the Graph Nets library (Battaglia et al., 2018). We used a hidden dimension of 64 for all VDGN layers (except the output layer of size $|\mathcal{A}|$) and $H=2$ attention heads. For $\text{depth}_{\max}=1$, with initial $n_{x}=n_{y}=16$, we chose $L=2$ internal layers with $R=3$ recurrent passes; each Markov game step has $\tau_{\text{step}}=0.25$ and $\tau_{f}=0.75$ (hence $T_{\max}=3$). For $\text{depth}_{\max}=2$, with $n_{x}=n_{y}=8$, we used $L=R=2$; each Markov game step has $\tau_{\text{step}}=0.2$ and $\tau_{f}=0.8$ (hence $T_{\max}=4$). For each training episode, we uniformly sampled the starting position and velocity of a 2D isotropic Gaussian wave as the initial condition. The FEM solver time discretization was $d\tau=0.002$ throughout. See Appendix C for further architectural details and hyperparameters.

We compare with the class of local error-based Threshold policies, each member of which is defined by a tuple $(\theta_{r},\theta_{d},\tau_{\text{step}})$ as follows: every $\tau_{\text{step}}$ of simulation time, all elements whose true error exceeds $\theta_{r}$ are refined, while those with true error below $\theta_{d}$ are de-refined. These policies represent the ideal behavior of widely used AMR methods based on local error estimation, in the limit of perfectly accurate error estimation.
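A minimal sketch of this baseline's per-step marking rule (applied every $\tau_{\text{step}}$ of simulation time):

```python
def threshold_policy(element_errors, theta_r, theta_d):
    """Local error-based Threshold baseline: refine elements whose error exceeds
    theta_r, de-refine those below theta_d, otherwise leave them unchanged."""
    actions = []
    for err in element_errors:
        if err > theta_r:
            actions.append("refine")
        elif err < theta_d:
            actions.append("de-refine")
        else:
            actions.append("no-op")
    return actions
```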

Remark 3.

Crucially, note that the Threshold policy class does not necessarily contain the globally optimal policy for all AMR problems, because such policies are incapable of anticipatory refinement and cannot access the full error-cost objective landscape. Suppose an element $i$ with flat features and negligible error $c^{i}_{t}\ll 1$ at time $t$ needs to refine before the arrival of complex PDE features at $t+1$. If $\theta_{r}>c^{i}_{t}$, then element $i$ is not refined preemptively and a large error is incurred at $t+1$. If $\theta_{r}<c^{i}_{t}$, then many other elements $j$ with error $c^{j}_{t}>\theta_{r}$ are also refined at $t$, but they may not contain complex features at $t+1$, so the DoF cost is unnecessarily increased.

7. Experimental results

Overall, we find that VDGN policies display anticipatory refinement and generalize to different initial conditions, mesh resolutions, and simulation durations, thereby uncovering Pareto-efficient regions of the error-cost trade-off that were previously inaccessible to traditional error-estimator-based methods. VDGN policy runtimes are comparable to those of Threshold policies (see Table 4).

(a) $\text{depth}_{\max}=1$
(b) $\text{depth}_{\max}=2$
Figure 4. Global error versus simulation time of VDGN, compared with Threshold policies with different $\tau_{\text{step}}$ between each mesh update step. (a) VDGN with the longest duration $\tau_{\text{step}}=125\,d\tau$ has error growth comparable to Threshold with the shortest duration $\tau_{\text{step}}=1\,d\tau$. (b) VDGN significantly outperforms its Threshold counterpart with $\tau_{\text{step}}=100\,d\tau$.
Figure 5. At each RL step, VDGN refines the full path segment that will be traversed by the wave over many solver steps into the future. Here, and for all subsequent mesh visualizations, we show the process: refinements, solution after $\tau_{\text{step}}$, refinements, and so on.
Figure 6. VDGN chooses more level-1 refinement than necessary for the wave's location at $t+1$, so that level-2 refinement is possible for the wave's location at $t+2$.

7.1. Anticipatory Refinement

As discussed in Section 6.1, each mesh update incurs a computational cost, which means AMR practitioners prefer a long duration of simulation time between mesh update steps. Figure 4 shows the growth of global error versus simulation time for VDGN and Threshold policies with different $\tau_{\text{step}}$ between mesh update steps. In the case of $\text{depth}_{\max}=1$, VDGN was trained and tested using the longest duration $\tau_{\text{step}}=125\,d\tau$ (i.e., it has the fewest mesh updates), yet it matches the error of the most expensive Threshold policy, which updates the mesh after every $\tau_{\text{step}}=1\,d\tau$ (see Figure 4(a)). This is possible only because VDGN preemptively refines the contiguous region that will be traversed by the wave within $125\,d\tau$ (e.g., see Figure 5). In contrast, Threshold must update the mesh every $1\,d\tau$ to achieve this performance, since coarse elements that currently have negligible error due to their distance from the incoming feature do not refine before the feature's arrival.

Moreover, in the case of $\text{depth}_{\max}=2$, the agents learned to choose level-1 refinement at $t=1$ for a region much larger than the feature's periphery, so that these level-1 elements can preemptively refine to level 2 at $t=2$ before the feature passes over them. This is clearly seen in Figure 6. This enabled VDGN with $\tau_{\text{step}}=100\,d\tau$ (fewest update steps) to have an error growth rate close to that of Threshold with $\tau_{\text{step}}=1\,d\tau$ (see Figure 4(b)).

Symmetry. Comparing Figure 12 with Figure 5, we see that VDGN policies are equivariant to rotation of the initial conditions. Reflection equivariance is also visible for the opposite-moving waves in Figure 10, and translation equivariance can be seen in Figure 13. Note that perfect symmetry holds only for rotations by integer multiples of $\pi/2$ and translations by integer multiples of the width of a level-0 element; symmetry violations from mesh discretization are unavoidable for other values.

(a) $n_{x}=n_{y}=16$, max depth 1, 500 solver steps
(b) $n_{x}=n_{y}=8$, max depth 2
(c) $n_{x}=n_{y}=16$, max depth 1, 2500 solver steps
(d) Pareto front of multi-objective VDGN.
Figure 7. Trade-off between error and computational cost. VDGN unlocks regions inaccessible by threshold policies.
Table 1. Mean and standard error of the efficiency of VDGN versus Threshold policies, over 100 runs with uniformly random ICs. Policies used in the generalization tests were trained on isotropic Gaussian features on a square mesh with quadrilateral elements for less than $500\,d\tau$ per episode. The last row shows the efficiency ratio of VDGN to the best Threshold policy.
The first two columns ("Depth 1, 375 steps" and "Depth 2, 400 steps") are in-distribution; the remaining columns are generalization tests.

| $\theta_r$ | Depth 1, 375 steps | Depth 2, 400 steps | Triangular, 1250 steps | Depth 1, 2500 steps | Depth 2, 2500 steps | Anisotropic, 2500 steps | Ring, 2500 steps | Opposite, 2500 steps | Star, 750 steps |
| $5\times10^{-3}$ | 0.35 (0.01) | 0.35 (0.01) | 0.69 (0.02) | 0.52 (0.02) | 0.41 (0.01) | 0.62 (0.02) | 0.61 (0.02) | 0.56 (0.002) | 0.85 (0.01) |
| $5\times10^{-4}$ | 0.74 (0.02) | 0.51 (0.01) | 0.76 (0.01) | 0.57 (0.02) | 0.43 (0.02) | 0.64 (0.02) | 0.56 (0.02) | 0.61 (0.01) | 0.92 (0.01) |
| $5\times10^{-5}$ | 0.84 (0.01) | 0.56 (0.01) | 0.66 (0.02) | 0.46 (0.02) | 0.34 (0.02) | 0.53 (0.02) | 0.48 (0.02) | 0.50 (0.01) | 0.87 (0.01) |
| $5\times10^{-6}$ | 0.82 (0.01) | 0.55 (0.01) | 0.56 (0.02) | 0.37 (0.02) | 0.26 (0.01) | 0.42 (0.02) | 0.39 (0.02) | 0.42 (0.01) | 0.83 (0.01) |
| $5\times10^{-7}$ | 0.77 (0.01) | 0.50 (0.01) | 0.47 (0.02) | 0.30 (0.02) | 0.20 (0.01) | 0.33 (0.02) | 0.31 (0.02) | 0.32 (0.01) | 0.79 (0.01) |
| $5\times10^{-8}$ | 0.70 (0.01) | 0.44 (0.01) | 0.38 (0.02) | 0.24 (0.02) | 0.15 (0.01) | 0.26 (0.02) | 0.25 (0.02) | 0.26 (0.01) | 0.74 (0.01) |
| $5\times10^{-15}$ | 0.34 (0.01) | 0.27 (0.01) | 0.10 (0.01) | 0.08 (0.01) | 0.05 (0.003) | 0.07 (0.01) | 0.07 (0.01) | 0.07 (0.01) | 0.45 (0.02) |
| VDGN | 0.92 (0.01) | 0.66 (0.01) | 0.80 (0.01) | 0.84 (0.004) | 0.73 (0.01) | 0.78 (0.02) | 0.82 (0.01) | 0.63 (0.03) | 0.93 (0.01) |
| VDGN / best $\theta_r$ | 1.10 | 1.18 | 1.05 | 1.47 | 1.70 | 1.22 | 1.34 | 1.03 | 1.13 |

7.2. Pareto Optimality

Figure 7 shows that VDGN unlocks regions of the error-cost landscape that are inaccessible to the class of Threshold policies in all of the mesh configurations we tested. We ran a sweep over refinement thresholds $\theta_{r}\in\{5\times10^{-3},\dotsc,5\times10^{-8},5\times10^{-15}\}$ with de-refinement threshold $\theta_{d}=4\times10^{-15}$. In the cases of $\text{depth}_{\max}=1,2$ with 500 solver steps and $\text{depth}_{\max}=1$ with 2500 solver steps, Figures 7(a), 7(b) and 7(c) show that VDGN lies outside the empirical Pareto front formed by threshold-based policies, and that VDGN Pareto-dominates those policies for almost every value of $\theta_{r}$: given a desired error (cost), VDGN attains much lower cost (error). The "In-distribution" group in Table 1 shows that VDGN has significantly higher efficiency than Threshold policies for all tested threshold values, for $\text{depth}_{\max}=1,2$.

To understand the optimality of VDGN policies, we further compared multi-objective VDGN to a brute-force search for the best sequence of refinement actions in an anisotropic 1D advection problem with $n_{x}=64$, $n_{y}=1$ and two mesh update steps. To make the brute-force search tractable, we imposed the constraint that a contiguous region of $n$ elements is refined at each step (while all elements outside the region are de-refined), and we searched for the starting locations of the region that result in the lowest final global error. By varying $n$, this procedure produces an empirical Pareto front of such brute-force policies in the error-cost landscape, which we plot in Figure 7(d). For multi-objective VDGN, we trained a single policy and evaluated it with 100 randomly sampled preferences $\omega=[\alpha,1-\alpha]$, where $\alpha\sim\text{Unif}[0,1]$. Figure 7(d) shows that the single multi-objective VDGN policy produces a Pareto front that approaches the one formed by the brute-force policies. Moreover, we see that Threshold policies with various refinement thresholds are limited to a small section of the objective landscape, whereas VDGN unlocks previously inaccessible regions.

Figure 8. Policy trained on isotropic 2D Gaussian can be applied to anisotropic 2D Gaussian.
Figure 9. Policy trained on isotropic 2D Gaussian can be applied to ring functions.
Figure 10. Policy trained on one bump with one velocity can be applied to two bumps with opposite velocities.
Figure 11. Policy trained on square mesh, run on a star mesh.

7.3. Generalization

Longer time. At training time for VDGN, each episode consisted of approximately 400-500 FEM solver steps. We tested these policies on episodes with 2500 solver steps, which presents the agents with features outside their training distribution due to the accumulation of numerical error over time. Table 1 shows that: a) VDGN maintains the highest performance; b) the performance ratio between VDGN and the best Threshold policy actually increases compared to the shorter-time case (e.g., 1.47 in the "Depth 1, 2500 steps" column, as opposed to 1.10 in the "Depth 1, 375 steps" column). This is because the error of threshold-based policies accumulates quickly over time due to the lack of anticipatory refinement, whereas VDGN mitigates this effect. Figure 15 shows that VDGN sustains anticipatory refinement behavior in test episodes longer than the training episodes.

Out-of-distribution test problems. Even though VDGN policies were trained on square meshes with 2D isotropic Gaussian waves, we find that they generalize well to initial conditions and mesh geometries that are completely outside the training distribution. On anisotropic Gaussian waves (Figure 8), ring-shaped features (Figure 9), opposite-moving waves (Figure 10), and star-shaped meshes (Figure 11), VDGN significantly outperforms Threshold policies without any additional fine-tuning or training (see the "Generalization" group in Table 1). Figure 14 shows qualitatively that a policy trained on quadrilateral elements makes rational refinement decisions when deployed on triangular elements.

Scaling. Since VDGN is defined by individual node and edge computations with parameter sharing across nodes and edges, it is a local model that is agnostic to the size and scale of the global mesh. Figures 16 and 17 show that a policy trained on an $n_{x}=n_{y}=16$ mesh can be run, with rational refinement behavior, on an $n_{x}=n_{y}=64$ mesh.

8. Related work

A growing body of work leverages machine learning and deep neural networks (Goodfellow et al., 2016) to improve the trade-off between computational cost and accuracy of numerical methods: e.g., reinforcement learning for generating a fixed (non-adaptive) mesh (Pan et al., 2022), unsupervised clustering for marking and $p$-refinement (Tlales et al., 2022), and supervised learning for target resolution prediction (Obiols-Sales et al., 2022), error estimation (Wallwork et al., 2022), and mesh movement (Song et al., 2022). The following three works are the closest to ours. Yang et al. (2021) proposed a global single-agent RL approach for $h$-adaptive AMR; it does not naturally support refining multiple elements per mesh update step, and anticipatory refinement was not conclusively demonstrated. Gillette et al. (2022) work within the class of marking policies parameterized by an error threshold and showed that single-agent RL finds robust policies that dynamically choose the error threshold and outperform fixed-threshold policies on elliptic problems; however, threshold-based policies may not contain the optimal policy for time-dependent problems that require anticipatory refinement. Foucart et al. (2022) proposed a local single-agent RL approach in which the agent makes a decision for one randomly selected element at each step. At training time, the global solution is updated every time a single element action occurs; at test time, the agent faces a different environment transition, since the global solution is updated only after the agent has acted on all elements. Our multi-agent approach allows the environment transition to be defined identically at training and test time.

9. Conclusion

We have formulated a Markov game for adaptive mesh refinement, shown that centralized training addresses the posthumous credit assignment problem, and proposed a novel multi-agent reinforcement learning method called Value Decomposition Graph Network (VDGN) to train AMR policies directly from simulation. VDGN displays anticipatory refinement behavior, enabling it to unlock new regions of the error-cost objective landscape that were inaccessible to previous threshold-based AMR methods. We verified that trained policies work well on out-of-distribution test problems with PDE features, mesh geometries, and simulation durations not seen in training. Our work serves as a stepping stone toward applying multi-agent reinforcement learning to more complex problems in AMR.

Acknowledgments

The authors thank Andrew Gillette for helpful discussions in the project overall. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 and the LLNL-LDRD Program with project tracking #21-SI-001. LLNL-PROC-841328.

References

  • Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
  • Anderson et al. (2021) Robert Anderson, Julian Andrej, Andrew Barker, Jamie Bramwell, Jean-Sylvain Camier, Jakub Cerveny, Veselin Dobrev, Yohann Dudouit, Aaron Fisher, Tzanio Kolev, Will Pazner, Mark Stowell, Vladimir Tomov, Ido Akkerman, Johann Dahm, David Medina, and Stefano Zampini. 2021. MFEM: A modular finite element methods library. Computers & Mathematics with Applications 81 (2021), 42 – 74. Development and Application of Open-source Software for Problems with Numerical PDEs.
  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
  • Bangerth and Rannacher (2013) Wolfgang Bangerth and Rolf Rannacher. 2013. Adaptive finite element methods for differential equations. Birkhäuser.
  • Battaglia et al. (2018) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018).
  • Brenner and Scott (2007) Susanne Brenner and Ridgway Scott. 2007. The mathematical theory of finite element methods. Vol. 15. Springer Science & Business Media.
  • Červený et al. (2019) Jakub Červený, Veselin Dobrev, and Tzanio Kolev. 2019. Nonconforming mesh refinement for high-order finite elements. SIAM Journal on Scientific Computing 41, 4 (Jan. 2019), C367–C392. https://doi.org/10.1137/18m1193992
  • Cohen et al. (2021) Andrew Cohen, Ervin Teng, Vincent-Pierre Berges, Ruo-Ping Dong, Hunter Henry, Marwan Mattar, Alexander Zook, and Sujoy Ganguly. 2021. On the use and misuse of absorbing states in multi-agent reinforcement learning. arXiv preprint arXiv:2111.05992 (2021).
  • Foerster (2018) Jakob N Foerster. 2018. Deep multi-agent reinforcement learning. Ph.D. Dissertation. University of Oxford.
  • Foucart et al. (2022) Corbin Foucart, Aaron Charous, and Pierre FJ Lermusiaux. 2022. Deep Reinforcement Learning for Adaptive Mesh Refinement. arXiv preprint arXiv:2209.12351 (2022).
  • Gillette et al. (2022) Andrew Gillette, Brendan Keith, and Socratis Petrides. 2022. Learning robust marking policies for adaptive mesh refinement. arXiv preprint arXiv:2207.06339 (2022).
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT press.
  • Hasselt (2010) Hado Hasselt. 2010. Double Q-learning. Advances in neural information processing systems 23 (2010).
  • Hayes et al. (2022) Conor F Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M Zintgraf, Richard Dazeley, Fredrik Heintz, et al. 2022. A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems 36, 1 (2022), 1–59.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Hernandez-Leal et al. (2019) Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. 2019. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems 33, 6 (2019), 750–797.
  • Hessel et al. (2018) Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. 2018. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-second AAAI conference on artificial intelligence.
  • Littman (1994) Michael L Littman. 1994. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994. Elsevier, 157–163.
  • mfem ([n.d.]) mfem [n.d.]. MFEM: Modular Finite Element Methods [Software]. mfem.org. https://doi.org/10.11578/dc.20171025.1248
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
  • Monk et al. (2003) Peter Monk et al. 2003. Finite element methods for Maxwell’s equations. Oxford University Press.
  • Ng et al. (1999) Andrew Y Ng, Daishi Harada, and Stuart Russell. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In Icml, Vol. 99. 278–287.
  • Obiols-Sales et al. (2022) Octavi Obiols-Sales, Abhinav Vishnu, Nicholas Malaya, and Aparna Chandramowlishwaran. 2022. NUNet: Deep Learning for Non-Uniform Super-Resolution of Turbulent Flows. arXiv preprint arXiv:2203.14154 (2022).
  • Oliehoek and Amato (2016) Frans A Oliehoek and Christopher Amato. 2016. A concise introduction to decentralized POMDPs. Springer.
  • Owen (1982) Guillermo Owen. 1982. Game Theory: Second Edition. Academic Press, Orlando, Florida.
  • Pan et al. (2022) Jie Pan, Jingwei Huang, Gengdong Cheng, and Yong Zeng. 2022. Reinforcement learning for automatic quadrilateral mesh generation: a soft actor-critic approach. arXiv preprint arXiv:2203.11203 (2022).
  • Reddy and Gartling (2010) Junuthula Narasimha Reddy and David K Gartling. 2010. The finite element method in heat transfer and fluid dynamics. CRC press.
  • Scarselli et al. (2008) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2008), 61–80.
  • Schaul et al. (2015) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2015. Prioritized experience replay. In International Conference on Learning Representations.
  • Song et al. (2022) Wenbin Song, Mingrui Zhang, Joseph G Wallwork, Junpeng Gao, Zheng Tian, Fanglei Sun, Matthew D Piggott, Junqing Chen, Zuoqiang Shi, Xiang Chen, et al. 2022. M2N: Mesh Movement Networks for PDE Solvers. arXiv preprint arXiv:2204.11188 (2022).
  • Sunehag et al. (2018) Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. 2018. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2085–2087.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
  • Tlales et al. (2022) Kenza Tlales, Kheir-Eddine Otmani, Gerasimos Ntoukas, Gonzalo Rubio, and Esteban Ferrer. 2022. Machine learning adaptation for laminar and turbulent flows: applications to high order discontinuous Galerkin solvers. arXiv preprint arXiv:2209.02401 (2022).
  • Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 30.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In International Conference on Learning Representations.
  • Wallwork et al. (2022) Joseph G Wallwork, Jingyi Lu, Mingrui Zhang, and Matthew D Piggott. 2022. E2N: Error Estimation Networks for Goal-Oriented Mesh Adaptation. arXiv preprint arXiv:2207.11233 (2022).
  • Wang et al. (2016) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. 2016. Dueling network architectures for deep reinforcement learning. In International conference on machine learning. PMLR, 1995–2003.
  • Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine learning 8, 3-4 (1992), 279–292.
  • Yang (2021) Jiachen Yang. 2021. Cooperation in Multi-Agent Reinforcement Learning. Ph.D. Dissertation. Georgia Institute of Technology.
  • Yang et al. (2021) Jiachen Yang, Tarik Dzanic, Brenden Petersen, Jun Kudo, Ketan Mittal, Vladimir Tomov, Jean-Sylvain Camier, Tuo Zhao, Hongyuan Zha, Tzanio Kolev, et al. 2021. Reinforcement learning for adaptive mesh refinement. arXiv preprint arXiv:2103.01342 (2021).
  • Yang et al. (2019) Runzhe Yang, Xingyuan Sun, and Karthik Narasimhan. 2019. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. Advances in Neural Information Processing Systems 32 (2019).
  • Zienkiewicz and Zhu (1992) Olgierd Cecil Zienkiewicz and Jian Zhong Zhu. 1992. The superconvergent patch recovery and a posteriori error estimates. Part 1: The recovery technique. Internat. J. Numer. Methods Engrg. 33, 7 (1992), 1331–1364.

Appendix A Additional details on methods

A.1. Multi-head attention

Multi-head graph attention layer. We extend graph attention by building on the mechanism of multi-head attention (Vaswani et al., 2017), which uses $H$ independent linear projections $(W^{q,h}, W^{k,h}, W^{v,h}, W^{e,h})$ for queries, keys, values, and edges (all projected to dimension $d/H$), resulting in $H$ independent sets of attention weights $\hat{a}^{ij,h}$, with $h = 1, 2, \dotsc, H$. This enables the model to attend to different learned representation spaces and was previously found to stabilize learning (Veličković et al., 2018). The new node representation with multi-head attention is the concatenation of all output heads, followed by an output linear projection $W^{o} \in \mathbb{R}^{d\times d}$. Hence, eqs. 7, 8, 9 and 10 are extended to be:

$\hat{a}^{ij,h} := \dfrac{\exp(W^{q,h}v^{i}\cdot W^{k,h}v^{j})}{\sum_{l\in\mathcal{N}_{i}}\exp(W^{q,h}v^{i}\cdot W^{k,h}v^{l})}$ for $(i,j)\in E$, $h\in[H]$,   (12)
$b^{ij,h} := W^{v,h}v^{j} + W^{e,h}e^{ij}$ for $(i,j)\in E$, $h\in[H]$,   (13)
$\hat{v}^{i} := W^{o}\left(\big\|_{h=1}^{H}\sum_{j\in\mathcal{N}_{i}}\hat{a}^{ij,h}b^{ij,h}\right)$ for $i\in V$.   (14)
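To make the layer concrete, the following NumPy sketch evaluates eqs. 12, 13, and 14 on a toy graph. The shapes, random parameters, and adjacency format are illustrative assumptions for this example, not the actual VDGN implementation.

import numpy as np

def multi_head_graph_attention(V, E, neighbors, params, H):
    """V: (n, d) node features; E: dict {(i, j): edge feature}; neighbors: dict {i: [j, ...]};
    params: per-head projection matrices Wq, Wk, Wv, We and output projection Wo."""
    n, d = V.shape
    out = np.zeros((n, d))
    for i in range(n):
        heads = []
        for h in range(H):
            Wq, Wk, Wv, We = (params[k][h] for k in ("Wq", "Wk", "Wv", "We"))
            q = Wq @ V[i]                                    # query for node i (Eq. 12)
            logits = np.array([q @ (Wk @ V[j]) for j in neighbors[i]])
            a = np.exp(logits - logits.max()); a /= a.sum()  # softmax over neighbors
            msgs = [Wv @ V[j] + We @ E[(i, j)] for j in neighbors[i]]  # messages (Eq. 13)
            heads.append(sum(w * m for w, m in zip(a, msgs)))
        out[i] = params["Wo"] @ np.concatenate(heads)        # concat heads + output proj. (Eq. 14)
    return out

# Toy usage: d = 4, H = 2, edge feature size 3, three nodes on a path graph.
rng = np.random.default_rng(0)
d, H, de, n = 4, 2, 3, 3
params = {k: [rng.normal(size=(d // H, d)) for _ in range(H)] for k in ("Wq", "Wk", "Wv")}
params["We"] = [rng.normal(size=(d // H, de)) for _ in range(H)]
params["Wo"] = rng.normal(size=(d, d))
V = rng.normal(size=(n, d))
neighbors = {0: [1], 1: [0, 2], 2: [1]}
E = {(i, j): rng.normal(size=de) for i in neighbors for j in neighbors[i]}
print(multi_head_graph_attention(V, E, neighbors, params, H).shape)  # (3, 4)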

A.2. Improvements to value-based learning

The standard Deep Q-Network (DQN) (Mnih et al., 2015) can be improved with double Q-learning, a dueling network architecture, and prioritized replay; the same improvements apply to VDN and VDGN.

Double Q-learning. DQN maintains a main network $Q_{\theta}$ with trainable parameters $\theta$ and a target network $Q_{\theta'}$ with separate trainable parameters $\theta'$, which is used in the TD target

$y_{t+1} := R_{t} + \gamma \max_{a} Q_{\theta'}(s_{t+1}, a)\,.$   (15)

After every update of the main parameters $\theta$, the target parameters $\theta'$ are slowly updated via $\theta' \leftarrow \lambda\theta + (1-\lambda)\theta'$. In contrast, Double DQN (Van Hasselt et al., 2016; Hasselt, 2010) uses the main network to select the greedy action and the target network to evaluate its value:

$y_{t+1} := R_{t} + \gamma Q_{\theta'}\bigl(s_{t+1}, \operatorname{argmax}_{a} Q_{\theta}(s_{t+1}, a)\bigr)\,.$   (16)

This alleviates the overestimation caused by using a single network to both select and evaluate the greedy action. In the context of VDN and VDGN, we compute the TD target by using the target network $Q^{i}_{\theta'}$ to evaluate the greedy joint actions, which are selected by the main network $Q^{i}_{\theta}$:

$y_{t+1} := R_{t} + \gamma Q_{\theta'}(s_{t+1}, a)\big|_{a = [\operatorname{argmax}_{a^{i}} Q^{i}_{\theta}(s^{i}_{t+1}, a^{i})]_{i=1}^{n}}$   (17)
$Q_{\theta}(s,a) := \sum_{i=1}^{n} Q^{i}_{\theta}(s^{i}, a^{i}), \qquad Q_{\theta'}(s,a) := \sum_{i=1}^{n} Q^{i}_{\theta'}(s^{i}, a^{i})\,.$   (18)
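As a concrete illustration, the following sketch computes the double-Q target of eqs. 17 and 18 from per-agent Q-values. The arrays stand in for the outputs of the main and target graph networks; the shapes and values are assumptions made only for this example.

import numpy as np

def vdgn_double_q_target(R_t, gamma, q_main_next, q_target_next):
    """q_main_next, q_target_next: (n_agents, n_actions) per-agent Q-values at s_{t+1}."""
    greedy = np.argmax(q_main_next, axis=1)                  # greedy actions chosen by main network
    q_eval = q_target_next[np.arange(len(greedy)), greedy]   # evaluated by target network (Eq. 17)
    return R_t + gamma * q_eval.sum()                        # value decomposition: sum over agents (Eq. 18)

q_main = np.array([[0.2, 0.5], [1.0, 0.1]])
q_target = np.array([[0.3, 0.4], [0.9, 0.2]])
print(vdgn_double_q_target(R_t=1.0, gamma=0.99, q_main_next=q_main, q_target_next=q_target))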

Dueling Network. In DQN, the neural network takes the state $s$ as input and has $|\mathcal{A}|$ output nodes, the $i$-th of which represents the scalar $Q(s, a=i)$ for the $|\mathcal{A}|$ discrete actions. Dueling networks (Wang et al., 2016) exploit the fact that it sometimes suffices to estimate the value of a state without estimating the value of each action at that state, i.e., learning a state-value function $V^{\pi}(s)$ is enough. Practically, this separation of state-action values from state values is achieved by parameterizing $Q_{\phi,\alpha,\beta}(s,a)$ as the sum of a value function $V_{\phi,\beta}(s)$ and an advantage function $A_{\phi,\alpha}(s,a)$:

$Q_{\phi,\alpha,\beta}(s,a) = V_{\phi,\beta}(s) + A_{\phi,\alpha}(s,a)\,,$   (19)

where $\phi$ are shared parameters while $\alpha$ and $\beta$ are separate parameters specific to $A$ and $V$, respectively. In practice, the expression used is

$Q_{\phi,\alpha,\beta}(s,a) = V_{\phi,\beta}(s) + \left( A_{\phi,\alpha}(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A_{\phi,\alpha}(s,a') \right)\,,$   (20)

for stability and identifiability reasons. Equation 20 is used as the form for both the main network with parameters $\theta := (\phi,\alpha,\beta)$ and the target network with parameters $\theta' := (\phi',\alpha',\beta')$. Equation 20 is directly compatible with the Double DQN target in Equation 17. It is also directly compatible with VDGN, where we use two separate final graph attention layers to output $V(s^{i})$ and $A(s^{i}, a^{i})$ for each node $i$.
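A minimal sketch of this per-node dueling combination in Equation 20, with assumed array shapes, is:

import numpy as np

def dueling_q(values, advantages):
    """values: (n_agents,) per-node V(s^i); advantages: (n_agents, n_actions) per-node A(s^i, a^i)."""
    adv_centered = advantages - advantages.mean(axis=1, keepdims=True)  # subtract mean advantage (Eq. 20)
    return values[:, None] + adv_centered                               # per-agent Q-values

V = np.array([1.0, -0.5])
A = np.array([[0.2, -0.2, 0.0], [1.0, 0.0, -1.0]])
print(dueling_q(V, A))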

Prioritized replay. DQN stores online experiences in a finite rolling replay buffer $\mathcal{B}$ and periodically trains on batches of transitions sampled uniformly at random from $\mathcal{B}$. However, the frequency with which a transition is experienced (which determines its count of occurrence in $\mathcal{B}$) should not necessarily determine the probability of sampling and training on it: some transitions occur frequently but have low TD error (because the Q-values of the state are already well approximated), whereas other important transitions with high TD error (due to insufficient training) may occur rarely in $\mathcal{B}$. Prioritized replay (Schaul et al., 2015) assigns each transition $i = (s_{i}, a_{i}, r_{i}, s'_{i})$ a priority $p_{i}$ based on its TD error $\delta_{i} := y_{i} - Q(s_{i}, a_{i})$, and defines the probability of sampling transition $i$ as

$P(i) = \dfrac{p^{\alpha}_{i}}{\sum_{k} p^{\alpha}_{k}}\,,$   (21)

where $\alpha$ is a hyperparameter that interpolates between uniform sampling ($\alpha = 0$) and always sampling the highest-priority transition ($\alpha \rightarrow \infty$). We use the rank-based variant of prioritized replay, whereby $p_{i} := \frac{1}{\text{rank}(i)}$ and $\text{rank}(i)$ is the rank of transition $i$ when buffer $\mathcal{B}$ is sorted by $|\delta_{i}|$. We also use importance-sampling correction weights $w_{i} := \left(\frac{1}{N}\cdot\frac{1}{P(i)}\right)^{\beta}$, so that the agents learn with $w_{i}\delta_{i}$ rather than $\delta_{i}$. This is an orthogonal improvement to any replay-based off-policy reinforcement learning method, and hence is directly compatible with both Double DQN and the dueling network.
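The following sketch illustrates rank-based prioritized sampling with importance-sampling weights, following Equation 21 with $p_{i} = 1/\text{rank}(i)$. The buffer contents and the final weight normalization are illustrative assumptions, not the exact implementation.

import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.5, beta=0.4, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    N = len(td_errors)
    ranks = np.empty(N, dtype=int)
    ranks[np.argsort(-np.abs(td_errors))] = np.arange(1, N + 1)  # rank 1 = largest |TD error|
    p = (1.0 / ranks) ** alpha
    P = p / p.sum()                                              # sampling distribution (Eq. 21)
    idx = rng.choice(N, size=batch_size, p=P)
    w = (1.0 / (N * P[idx])) ** beta                             # importance-sampling weights
    return idx, w / w.max()                                      # normalize weights (common practice)

td = np.array([0.1, 2.0, 0.5, 0.05, 1.2])
idx, w = sample_prioritized(td, batch_size=3, rng=np.random.default_rng(0))
print(idx, w)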

A.3. Multi-objective VDGN

Working in the context of linear preferences, the goal is to find a set of policies $\Pi^{*}$ corresponding to the convex coverage set of the Pareto front:

$\Pi^{*} := \{\pi \in \Pi : \exists\,\omega \in \Omega,\ \forall\pi' \in \Pi,\ \omega^{T}J^{\pi} \geq \omega^{T}J^{\pi'}\}\,.$   (22)

To do so, we extend VDGN with envelope Q-learning (Yang et al., 2019), a multi-objective RL method that efficiently finds the convex envelope of the Pareto front in multi-objective MDPs. Envelope Q-learning searches over the space of already-learned preferences $\omega' \in \Omega$ to find the best TD target $\mathbf{y}(\omega)$ for a given preference $\omega$. The single-objective VDN update equations eqs. 3 and 4 become

$\theta \leftarrow \theta - \nabla_{\theta}\,\mathbb{E}_{s_{t},a_{t},s_{t+1}}\!\left[\lVert \mathbf{y}_{t+1} - \mathbf{Q}_{\theta}(s_{t},a_{t}) \rVert^{2}_{2}\right]$   (23)
$\mathbf{y}_{t+1}(\omega) := \mathbf{R}_{t} + \gamma \arg_{Q}\max_{a,\omega'} \omega^{T}\mathbf{Q}(s_{t+1}, a, \omega')$   (24)
$\max_{a,\omega'} \omega^{T}\mathbf{Q}(s, a, \omega') = \omega^{T}\mathbf{Q}(s,a)\big|_{a = [\operatorname{argmax}_{a^{i},\omega'} \omega^{T}\mathbf{Q}^{i}(s^{i}, a^{i}, \omega')]_{i=1}^{n}}$   (25)

where $\arg_{Q}\max_{a,\omega'} \omega^{T}\mathbf{Q}(s_{t+1}, a, \omega')$ extracts the vector $\mathbf{Q}(s_{t+1}, a^{*}, \omega^{*})$ such that $(a^{*}, \omega^{*})$ achieves the $\operatorname{argmax}$. Note that the global $\operatorname{argmax}$ is still efficiently computable via value decomposition, since

$\max_{a} \omega^{T}Q(s,a) = \sum_{i=1}^{n} \max_{a^{i}} \omega^{T}Q^{i}(s, a^{i})$   (26)
$\Rightarrow\ \operatorname{argmax}_{a} \omega^{T}Q = \left[\operatorname{argmax}_{a^{i}} \omega^{T}Q^{i}(s, a^{i})\right]_{i=1}^{n}\,.$   (27)
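A small sketch of this decomposed scalarized maximization (eqs. 26 and 27) for a fixed preference $\omega$ follows; the maximization over previously learned preferences $\omega'$ is omitted for brevity, and the tensor shapes are assumptions made for the example.

import numpy as np

def decomposed_scalarized_argmax(q_per_agent, w):
    """q_per_agent: (n_agents, n_actions, n_objectives) per-agent vector Q-values; w: (n_objectives,)."""
    scalarized = q_per_agent @ w                 # (n_agents, n_actions): w^T Q^i
    greedy = scalarized.argmax(axis=1)           # per-agent argmax (Eq. 27)
    total = scalarized.max(axis=1).sum()         # global max via summation (Eq. 26)
    return greedy, total

q = np.array([[[1.0, 0.0], [0.0, 1.0]],          # agent 1: two actions, two objectives
              [[0.5, 0.5], [0.2, 0.9]]])         # agent 2
w = np.array([0.7, 0.3])
print(decomposed_scalarized_argmax(q, w))        # greedy actions and scalarized team value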

Appendix B Proofs

See 1

Proof.

We prove this by construction. First define a dummy individual agent state $s_{\nexists}$, interpreted semantically as “nonexistence”. Define the global state space $\mathcal{S}$ to be $\mathcal{S} := \prod_{i=1}^{N_{\max}}(\mathcal{S}^{i} \cup \{s_{\nexists}\})$. Hence, any state $s_{t}$ with $n_{t}$ agents belongs to $\mathcal{S}$; more specifically, it belongs to the subspace $\left(\prod_{i=1}^{n_{t}} \mathcal{S}^{i}\right) \times \left(\prod_{i=n_{t}+1}^{N_{\max}} \{s_{\nexists}\}\right) \subset \mathcal{S}$.

Next, define the joint action space $\mathcal{A}$ to be $\mathcal{A} := \prod_{i=1}^{N_{\max}} \mathcal{A}^{i}$. At a time step during an episode when there are $n_{t}$ elements, the effectively available action space is just $\prod_{i=1}^{n_{t}} \mathcal{A}^{i}$, meaning that the joint action is written as $a = (a^{1},\dotsc,a^{n_{t}},\underbrace{\text{no-op},\dotsc,\text{no-op}}_{N_{\max}-n_{t}})$, where no-op is ignored by the reward and transition functions. The reward on the joint state-action space is defined by

$R\bigl((s^{1},\dotsc,s^{n_{t}},\underbrace{s_{\nexists},\dotsc,s_{\nexists}}_{N_{\max}-n_{t}}),\ (a^{1},\dotsc,a^{n_{t}},\underbrace{\text{no-op},\dotsc,\text{no-op}}_{N_{\max}-n_{t}})\bigr) = R\bigl((s^{1},\dotsc,s^{n_{t}}),(a^{1},\dotsc,a^{n_{t}})\bigr)\,,$

where we overload the notation $R$, since $R$ is already well-defined for a variable number of agents. Similarly, the transition function $P$, which is already well-defined for a variable number of agents, induces a transition function on the new state-action spaces $\mathcal{S}$ and $\mathcal{A}$. Hence the stationary Markov game is defined by the tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma, N_{\max})$. This completes the construction. ∎

See 4

Proof.

This is proved by induction on the number of graph attention layers.

Base case. In the original PDE, let $X = (x^{1},\dotsc,x^{n})$ denote the coordinates of elements $i = 1,\dotsc,n$, let $v = (v^{1},\dotsc,v^{n})$ denote the input node features, and let $e = \{e^{ij}\}_{(i,j)\in E}$ denote the input edge features. Let element $i$ with coordinate $x^{i}$ have node feature $v^{i}$ and edge feature set $\{e^{ij}\}_{j\in\mathcal{N}(i)}$. Suppose that the first layer of VDGN produces updated node features $\hat{v} = (\hat{v}^{1},\dotsc,\hat{v}^{n})$. Any rotation can be represented by multiplication with an orthogonal matrix $Q \in \mathbb{R}^{k\times k}$, where $k$ is the number of spatial dimensions. Upon rotation of the PDE solution and velocity field, we check that the element at coordinate $Qx^{i}$ produces the same node value as the value $\hat{v}^{i}$ that element $i$ would have produced without the rotation. First, note that the displacement vector of any edge $e$ and the velocity vector transform as $\hat{d}^{e} = Qd^{e}$ and $\hat{q} = Qq$, respectively. The node feature at $Qx^{i}$ is $v^{i}$, since the PDE solution has been rotated. The edge feature set at $Qx^{i}$ is $\{e^{ij}\}_{j\in\mathcal{N}(i)}$ because: a) by definition of rotation of the PDE, the initial conditions are also rotated, and hence the one-hot relative depth observation $\mathbf{1}[\text{depth}(i) - \text{depth}(j)]$ is unchanged; b) the inner-product part of the edge feature is unchanged because $\langle \hat{d}^{e}, \hat{q} \rangle = \langle Qd^{e}, Qq \rangle = \langle d^{e}, q \rangle$, since orthogonality of $Q$ implies $Q^{T}Q = I$. Hence the input node and edge features at $Qx^{i}$ in the rotated case equal those of element $i$ in the original case, so the updated node feature at $Qx^{i}$ is still $\hat{v}^{i}$. For translation, represented by addition of some vector $b \in \mathbb{R}^{k}$, the same reasoning applies: displacements and velocities are invariant to translation, so the element at $x^{i} + b$ in the translated case has the same node and edge observations as element $x^{i}$ in the original case, and the updated node feature at $x^{i} + b$ is still $\hat{v}^{i}$.

Inductive step. Suppose the claim holds for layer $l-1$. This means that the node representation $\hat{v}^{l-1}$ (used as input to layer $l$) at element $Qx^{i}$ in the transformed PDE is equal to the input node representation at element $x^{i}$ in the original PDE. The input edge representation is still $\{e^{ij}\}_{j\in\mathcal{N}(i)}$, since edge representations are not updated, and the reasoning for edges in the base case applies. Hence, layer $l$'s inputs in the transformed case are the same as its inputs in the original case, so the outputs $\hat{v}^{l}$ are also the same. This completes the proof of equivariance.

Time invariance. This follows immediately from the fact that absolute time is not used as an observable, and the fact that VDGN learns a Markov policy. ∎

Appendix C Additional implementation details

Ancillary algorithmic details. Every $t_{\text{train}}$ environment steps, we sample a batch of $M$ transitions from the replay buffer $\mathcal{B}$ and conduct one training step. We used the Adam optimizer with learning rate $\eta$ in TensorFlow (Abadi et al., 2016). Each agent has its own independent $\epsilon$-greedy exploration, with $\epsilon$ decaying linearly from $\epsilon_{\text{start}}$ to $\epsilon_{\text{end}}$ over $n_{\epsilon}$ episodes. Along with prioritized replay, we used a lower bound $p_{\text{unif}}$ on the probability of sampling each transition from the replay buffer. Hyperparameters for all algorithmic settings are listed in Table 2. We ran a simple population-based elimination search with 128 starting samples over the final exploration rate $\epsilon_{\text{end}}$, learning rate $\eta$, and target update rate $\lambda$ on one $16\times 16$ mesh, then used those hyperparameters for all experiments. We used the same prioritized replay hyperparameters as the original paper. The graph attention hyperparameters $L$ and $R$ should be chosen based on the spatial domain of influence (e.g., the number of elements along a direction) that should inform a refinement decision.
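For illustration, a minimal sketch of the exploration and target-update schedule described above (linear $\epsilon$ decay and the soft target update $\theta' \leftarrow \lambda\theta + (1-\lambda)\theta'$) is given below; the function names are placeholders and the sketch makes no claim to match the actual implementation.

def linear_epsilon(episode, eps_start=0.5, eps_end=0.01, n_eps=150_000):
    # Linear decay from eps_start to eps_end over n_eps episodes, then constant.
    frac = min(episode / n_eps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def soft_target_update(target_params, main_params, lam=0.0112):
    # theta' <- lam * theta + (1 - lam) * theta', applied parameter-wise.
    return [lam * m + (1.0 - lam) * t for m, t in zip(main_params, target_params)]

print(linear_epsilon(0), linear_epsilon(75_000), linear_epsilon(200_000))  # 0.5, 0.255, 0.01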

Table 2. Algorithm hyperparameters
Name | Symbol | Value
Batch size | $M$ | 16
Replay buffer size | $|\mathcal{B}|$ | 10000
Exploration decay episodes | $n_{\epsilon}$ | 150000
Exploration end | $\epsilon_{\text{end}}$ | 0.01
Exploration start | $\epsilon_{\text{start}}$ | 0.5
Learning rate | $\eta$ | $5\times 10^{-4}$
Prioritized replay exponent | $\alpha$ | 0.5
Prioritized replay importance exponent | $\beta$ | 0.4
Env steps per train | $t_{\text{train}}$ | 4
Target update rate | $\lambda$ | 0.0112
Uniform sampling probability | $p_{\text{unif}}$ | 0.001
Attention layer size | $d$ | 64
Number of attention heads | $H$ | 2
Number of attention layers | $L$ | 2
Number of recurrent passes | $R$ | depth 1: 3; depth 2: 2
Table 3. Environment settings
Name | Symbol | Value
DoF threshold | $d_{\text{thres}}$ | 5620
Initial error threshold | $c_{\text{thres}}$ | $5\times 10^{-4}$
Solver time increment | $d\tau$ | $2\times 10^{-3}$
Initial mesh partition | $(n_{x}, n_{y})$ | depth 1: $(16,16)$; depth 2: $(8,8)$
Training mesh dimension | $[s_{x}, s_{y}]$ | $[0, 2.0]^{2}$
Solver time per mesh update | $\tau_{\text{step}}$ | depth 1: 0.25; depth 2: 0.2
Training max simulation time | $\tau_{f}$ | depth 1: 0.75; depth 2: 0.8
Figure 12 (panels a–c, $t=1,2,3$). Global refinements by VDGN are equivariant to rotation of the propagating feature. Compare to Figure 5.
Figure 13 (panels a–c, $t=1,2,3$). Translational equivariance.

FEM settings. Hyperparameters for all environment settings are listed in Table 3. All simulations begin by applying level-1 refinement to elements whose initial error, which is available in general because the initial condition is known, is larger than $c_{\text{thres}}$. The training mesh length and width are $[s_{x}, s_{y}]$. We specify the velocity by a magnitude $u \in \mathbb{R}$ and an angle $\theta \in [0, 1.0]$, so that the velocity components are $v_{x} = u\cos(2\pi\theta)$ and $v_{y} = u\sin(2\pi\theta)$. We trained on 2D Gaussians

$f(x,y,t) = 1 + \exp\left(-w\left((x - (x_{0} + v_{x}t))^{2} + (y - (y_{0} + v_{y}t))^{2}\right)\right)\,,$   (28)

with initial conditions sampled from uniform distributions: $u \sim U[0, 1.5]$, $\theta \sim U[0, 1.0]$, $x_{0}, y_{0} \sim U[0.5, 1.5]$, and $w = 100$. At test time, anisotropic Gaussians are specified by

$f(x,y,t) = 1 + \exp\left(-\left(w_{x}(x - dx)^{2} + w_{y}(y - dy)^{2} + w_{xy}(x - dx)(y - dy)\right)\right)$   (29)–(30)
$dx = x_{0} + v_{x}t$   (31)
$dy = y_{0} + v_{y}t$   (32)

with $\theta \sim U[0, 1.0]$, $u \sim U[0, 1.5]$, $w_{x}, w_{y}, w_{xy} \sim U[20, 100]$, and $x_{0}, y_{0} \sim U[0.5, 1.5]$. Ring functions are

$f(x,y,t) = 1 + \exp\left(-w\left(\sqrt{\Delta r^{2}} - r\right)^{2}\right)$   (33)
$\Delta r^{2} = (x - (x_{0} + v_{x}t))^{2} + (y - (y_{0} + v_{y}t))^{2}$   (34)

with $r \sim U[0.1, 0.3]$, $\theta \sim U[0, 1.0]$, $u \sim U[0, 1.5]$, and $x_{0}, y_{0} \sim U[0.5, 1.5]$. Opposite traveling Gaussians have the same form as Equation 28, with $u \sim U[0, 1.5]$ and $w = 100$; the leftward-moving Gaussian has $\theta = 0.5$, $x_{0} \sim U[0.5, 1.5]$, and $y_{0} \sim U[0.3, 0.7]$, while the rightward-moving Gaussian has $\theta = 0.0$, $x_{0} \sim U[0.5, 1.5]$, and $y_{0} \sim U[1.3, 1.7]$. The star mesh test case uses a mesh from MFEM (Anderson et al., 2021, Example 8), with $\theta \sim U[0.05, 0.25]$, $u \sim U[0, 6.0]$, $w = 5$, $x_{0} = -1$, and $y_{0} = -4$.
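For reference, the following sketch evaluates the feature functions of eqs. 28–34 (isotropic Gaussian, anisotropic Gaussian, and ring) with the velocity parameterization given above; the function names and the example call are illustrative assumptions, not the solver's actual source of initial conditions.

import numpy as np

def velocity(u, theta):
    # v_x = u cos(2*pi*theta), v_y = u sin(2*pi*theta)
    return u * np.cos(2 * np.pi * theta), u * np.sin(2 * np.pi * theta)

def gaussian(x, y, t, x0, y0, vx, vy, w=100.0):                    # Eq. (28)
    return 1.0 + np.exp(-w * ((x - (x0 + vx * t))**2 + (y - (y0 + vy * t))**2))

def anisotropic_gaussian(x, y, t, x0, y0, vx, vy, wx, wy, wxy):    # Eqs. (29)-(32)
    dx, dy = x0 + vx * t, y0 + vy * t
    return 1.0 + np.exp(-(wx * (x - dx)**2 + wy * (y - dy)**2 + wxy * (x - dx) * (y - dy)))

def ring(x, y, t, x0, y0, vx, vy, r, w=100.0):                     # Eqs. (33)-(34)
    dr2 = (x - (x0 + vx * t))**2 + (y - (y0 + vy * t))**2
    return 1.0 + np.exp(-w * (np.sqrt(dr2) - r)**2)

vx, vy = velocity(u=1.0, theta=0.25)
print(gaussian(1.0, 1.0, 0.5, x0=0.5, y0=0.5, vx=vx, vy=vy))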

Figure 14 (panels a–d, $t=1,\dotsc,4$). A policy trained on quadrilateral elements can be run on triangular elements.
Table 4. Runtime in seconds of VDGN is comparable to local error threshold-based policies with the best $\theta_{r}$ selected from Table 1. Average over 100 test episodes. The first two test cases are in-distribution; the remaining test cases are generalization tests.
Test case (solver steps) | Threshold (action) | Threshold (total) | VDGN (action) | VDGN (total)
Depth 1 (375) | 0.009 | 1.18 | 0.024 | 1.96
Depth 2 (400) | 0.005 | 0.84 | 0.011 | 1.18
Triangular (1250) | 0.005 | 2.15 | 0.005 | 2.26
Depth 1 (2500) | 0.015 | 12.9 | 0.026 | 15.0
Depth 2 (2500) | 0.012 | 11.0 | 0.011 | 8.23
Anisotropic (2500) | 0.014 | 11.9 | 0.008 | 17.3
Ring (2500) | 0.014 | 12.2 | 0.008 | 18.5
Opposite (2500) | 0.021 | 18.1 | 0.008 | 21.6
Star (750) | 0.005 | 1.30 | 0.008 | 3.43

Appendix D Additional results

Symmetry. In Figure 5, the initial conditions are $x_{0} = y_{0} = 0.5$ and $v_{x} = v_{y} = 1.5$. We rotated the initial conditions by $\pi$, so that $x_{0} = y_{0} = 1.5$ and $v_{x} = v_{y} = -1.5$, and Figure 12 shows that the global refinement choices are rotated by $\pi$ accordingly. Rotational equivariance can also be seen in Figure 10, where the refinement choices for the leftward- and rightward-moving waves are reflections of each other. Translational equivariance along the vertical and horizontal axes can be seen in Figure 13.

Runtime. Table 4 shows that the runtime of trained VDGN policies at test time is on the same order of magnitude as Threshold policies, for both average time per action (for all elements) and average time per episode.

Scaling. Figure 15 shows a policy trained with only three steps per episode (375 solver steps) running for 20 steps at test time (2500 solver steps). Figures 16 and 17 show a policy trained on a $16\times 16$ mesh with only three steps per episode, tested on a $64\times 64$ mesh.

Using error estimator as observation. Table 5 shows that VDGN retains better performance than Threshold policies when the true error in agents’ observations is replaced by the estimates given by the ZZ error estimator (Zienkiewicz and Zhu, 1992).

Appendix E Other formulations and their difficulties

E.1. Single-agent Markov decision process

As mentioned in Section 8, previous works that treat AMR as a sequential decision-making problem take the single-agent viewpoint. This poses issues for producing multiple element refinements at each mesh update step.

In the global approach (Yang et al., 2021), a single agent takes the entire mesh state as input, chooses one element out of all available elements to refine (or potentially de-refine), and then the global solution is updated. To avoid computing the global solution after every refinement/de-refinement of a single element, which would be prohibitively expensive, an extension of this approach would have to use either a fixed hyperparameter or a fixed rule to determine the number of rounds of single-element selection between each solution update. Specifying this hyperparameter or rule is hard to do in advance and may erroneously exclude the true optimal policy from the policy search space.

In the local approach (Foucart et al., 2022), a single agent is “centered” on an element and makes a refine/no-op/de-refine decision for that element. At training time, because each refine/de-refine action impacts the global solution, which directly impacts the reward, the environment transition involves a global solution update every time the agent chooses an action for an element. Since this is prohibitively expensive for a larger number of elements at test time, Foucart et al. (2022) redefine the environment transition at test time so that the global solution is updated only after the agent has chosen an action for all elements. This presents the agent with different definitions of the environment transition function at train and test time.

Table 5. Mean and standard error of the efficiency of VDGN using ZZ error estimator values as observations, versus threshold-based policies, over 100 runs with uniform random ICs.
$\theta_{r}$ | Depth 1 (375 steps) | Depth 2 (400 steps)
$5\times 10^{-3}$ | 0.916 (0.007) | 0.523 (0.021)
$5\times 10^{-4}$ | 0.899 (0.002) | 0.529 (0.022)
$5\times 10^{-5}$ | 0.857 (0.003) | 0.509 (0.021)
$5\times 10^{-6}$ | 0.813 (0.004) | 0.474 (0.019)
$5\times 10^{-7}$ | 0.767 (0.004) | 0.431 (0.018)
$5\times 10^{-8}$ | 0.718 (0.005) | 0.390 (0.016)
$5\times 10^{-15}$ | 0.473 (0.006) | 0.260 (0.014)
VDGN | 0.948 (0.002) | 0.543 (0.022)
VDGN / best $\theta_{r}$ | 1.03 | 1.03

E.2. Fully-decentralized training and execution

In contrast to the approach of centralized training with decentralized execution used in this work, one may consider fully-decentralized training and execution. A reason for full decentralization is that the governing equations of most physical systems in finite element simulations are local in nature, so one may hypothesize that an element only needs to respond to local observations and learn from local rewards. However, this poses two main challenges: 1) the class of policies accessible by fully-decentralized training with local rewards may not contain the globally optimal joint policy; 2) for $h$-adaptive mesh refinement with maximum depth greater than one, one must confront the posthumous credit assignment problem: a refinement or de-refinement action may have long-term impact on the global error, but the agent responsible for that action immediately disappears upon taking it and hence does not receive future rewards.

For the second challenge in particular, we can construct a scenario in which an agent that only receives an immediate reward and disappears upon refinement (i.e., does not persist) cannot learn anticipatory refinement and hence cannot take the optimal refinement action: 1) use a moving feature such that error is minimized by the sequence of actions $a^{i}_{t} = 1$ at time $t$ and $a^{j}_{t+1} = 1$ for all $j \in \text{Desc}(i)$ at time $t+1$; 2) construct the feature so that the action $a^{i}_{t} = 1$ has no instantaneous benefit for error reduction, meaning that the instantaneous reward $r^{i}_{t}$ for the state-action pair $(o^{i}_{t}, a^{i}_{t} = 1)$ is the same as the reward for the pair $(o^{i}_{t}, a^{i}_{t} = 0)$. Since agent $i$'s trajectory terminates upon receiving reward $r^{i}_{t}$, agent $i$ cannot distinguish the value of refinement versus no-op. This problem may be overcome by letting all agents persist after refinement and receive dummy observations and local rewards while taking no actions, so that they can learn a past action's impact on future local rewards. Another way is to let the agent disappear upon refinement/de-refinement at transition $(s_{t}, a_{t}, r_{t})$ but use the final reward $r_{T}$ (e.g., upon episode termination at time $t = T$) as the reward for that transition, i.e., $(s_{t}, a_{t}, r_{T})$.

Figure 15 (panels a–r, $t=1,\dotsc,18$). VDGN generalizes to longer simulation durations, despite seeing only up to $t=3$ during training episodes.
Figure 16 (panels a–b, $t=1,2$). VDGN generalizes to larger meshes not seen in training. See Figure 17 for subsequent time steps.
Figure 17 (panels a–h, $t=3,\dotsc,10$). Continued from Figure 16. VDGN generalizes to larger meshes not seen in training.