
David J. Gardner, Lynda L. LoDestro, and Carol S. Woodward
Lawrence Livermore National Laboratory, Livermore, CA, USA
Email: gardner48, lodestro1, [email protected]

Towards the Use of Anderson Acceleration in Coupled Transport-Gyrokinetic Turbulence Simulations

David J. Gardner (ORCID 0000-0002-7993-8282)    Lynda L. LoDestro (ORCID 0000-0001-6503-4189)    Carol S. Woodward (ORCID 0000-0002-6502-8659)
Abstract

Predicting the behavior of a magnetically confined fusion plasma over long time periods requires methods that can bridge the difference between transport and turbulent time scales. The nonlinear transport solver, Tango, enables simulations of very long times, in particular to steady state, by advancing each process independently with different time step sizes and coupling them through a relaxed iteration scheme. We examine the use of Anderson Acceleration (AA) to reduce the total number of coupling iterations required by interfacing Tango with the AA implementation, including several extensions to AA, provided by the KINSOL nonlinear solver package in SUNDIALS. The ability to easily enable and adjust algorithmic options through KINSOL allows for rapid experimentation to evaluate different approaches with minimal effort. Additionally, we leverage the GPTune library to automate the optimization of algorithmic parameters within KINSOL. We show that AA can enable faster convergence in stiff and very stiff test cases without noise present and that, in all cases, including those with noisy fluxes, it increases robustness and reduces sensitivity to the choice of relaxation strength.

1 Introduction

Accurately simulating turbulent transport is a critical aspect of modeling magnetically confined fusion plasmas. Turbulent fluxes have a strong nonlinear dependence on the gradients of averaged fields and are often the primary mechanism for transporting particles and energy from the plasma interior. A major challenge in simulations of very long times, in particular steady state, is the significant difference in characteristic time scales for transport (seconds) and turbulence (microseconds) in magnetic-fusion-energy burning plasmas post2004report . This challenge is further complicated by the presence of noise in the averaged transport fluxes arising from turbulent simulations.

Evolving both processes at the same time step is prohibitively expensive; however, as the dynamics on each time scale are well separated, each process can be advanced independently at its own rate and coupled through an iteration scheme to bridge the scale gap. Turbulence model evaluations are the dominant computational expense, and the speed at which the coupling iteration converges determines the number of turbulence model evaluations. In this work, we aim to reduce the total number of coupling iterations necessary and thus lower the overall simulation cost by accelerating the convergence of the iterative method. To this end, we interface the nonlinear transport solver, Tango parker2018investigation ; parker2018bringing , with the KINSOL nonlinear solver package from SUNDIALS gardner2022enabling ; hindmarsh2005sundials to examine the use of Anderson Acceleration (AA) and extensions to AA for improving the convergence rate of the coupling iteration.

The remainder of this paper is organized as follows. In Section 2, we overview turbulent transport simulations for magnetically confined fusion plasmas and the coupling method applied. We introduce the AA algorithm and detail the various extensions we consider to improve performance in Section 3. In Section 4 we overview the Tango code for coupled simulations and its interfacing with the AA implementation from KINSOL as well as the use of GPTune to optimize parameter selection. In Section 5 we discuss results from numerical simulations with Tango in both stiff and very stiff settings along with the impact of noisy fluxes on performance. Finally, in Section 6, we provide concluding remarks and directions for future investigation.

2 Coupling Transport & Turbulence

Simulating turbulence-driven transport in toroidal magnetic-fusion-energy plasmas with near-first-principles accuracy requires solving a set of gyrokinetic equations describing the distribution of different species in a 5D phase space (3 physical- plus 2 velocity-space dimensions). 5D turbulence due to fast time-scale instabilities drives transport of averaged fields (i.e., density, momentum, and energy, which are constant on magnetic flux surfaces and thus spatially 1D in the magnetic flux coordinate) across the magnetic surfaces confining the plasma. Because the transport time scales are much slower than the turbulence time scales, a formal separation-of-scales approach is possible, permitting a multiscale computational solution which proceeds by coupling separate equation sets for the fast and slow scales shestakov2003self . For this work, we utilize a simplified model for a single species to evaluate algorithms for turbulent transport without the expense of 5D simulation or the complexities of the toroidal geometry. Consider the transport equation,

\frac{\partial p}{\partial t}+\frac{\partial q}{\partial x}=S, (1)

where p(t,x) is the plasma profile, q(p(t,x)) is the average turbulent flux, and S(t,x) includes all local sources. Because the turbulence-driven transport is stiff, an implicit time-advance is required. Discretizing (1) in space with finite differences and in time with backward Euler, we obtain the implicit system,

p^{(n+1)}+H\mathcal{D}_{x}q^{(n+1)}=p^{(n)}+HS^{(n+1)}, (2)

for a time step from t^{(n)} to t^{(n+1)} with step size H, where superscripts denote the time at which a quantity is approximated, and \mathcal{D}_{x} is the difference operator for the finite-difference method.

In Tango, we solve the implicit system (2) using the LoDestro method as detailed in crotinger1997corsica and shestakov2003self . The method is given by the iteration,

p_{k+1}-H\mathcal{D}_{x}(D_{k}\mathcal{D}_{x}p_{k+1}-c_{k}p_{k+1})=p^{(n)}+HS^{(n+1)}, (3)

where p_{k+1} is the new iterate for p^{(n+1)} and q^{(n+1)} is split into diffusive and convective terms, with coefficients

D_{k}=-\theta_{k}q(p_{k})/\mathcal{D}_{x}p_{k}\quad\mathrm{and}\quad c_{k}=(1-\theta_{k})q(p_{k})/p_{k}, (4)

respectively. The parameter \theta_{k} controls the diffusive-convective split at each mesh point. The purpose of the split is to keep D_{k} positive (assuring a parabolic linear equation) and within bounds. It is assumed that the transport physics is diffusive at the boundaries; while it is often diffusive throughout the confined plasma volume, there are important exceptions, which a general treatment must accommodate. With a purely diffusive (c_{k}=0) representation of q, profile maxima can cause violent spatial swings in D_{k} while iterating; also, converged solutions with the flux running uphill (anti-diffusion, possible in sub-volumes with global physics) are unobtainable. While Tango currently employs only the flux-splitting technique from crotinger1997corsica and shestakov2003self to handle these complications, those references provide a second technique for representing q, which worked about equally well for the problems studied there. The important point is that however q is represented, it is not a model of the physics (which is unknown, apart from the assumption at the boundaries); yet, assuming the iteration converges, the numerical q supplied from the turbulence code must be recovered, as (3) and (4) have been constructed to do here.

The value of \theta_{k} is chosen using the default strategy in Tango,

\theta_{k}=\begin{cases}0&\text{where}\quad\hat{D}_{k}<D_{\text{min}}\\ \frac{1}{2}\left(1+\frac{D_{\text{max}}-\hat{D}_{k}}{D_{\text{max}}-D_{\text{min}}}\right)&\text{where}\quad D_{\text{min}}\leq\hat{D}_{k}\leq D_{\text{max}}\;\text{and}\;|\mathcal{D}_{x}p_{k}|<10\\ \frac{1}{2}&\text{where}\quad\hat{D}_{k}>D_{\text{max}}\;\text{and}\;|\mathcal{D}_{x}p_{k}|<10\\ 1&\text{otherwise}\end{cases}, (5)

with \hat{D}_{k}=-q(p_{k})/\mathcal{D}_{x}p_{k}, D_{\text{min}}=10^{-5}, and D_{\text{max}}=10^{13}. Obtaining the new profile in (3) requires solving the linear system,

M_{k}p_{k+1}=b, (6)

within each iteration, where M_{k}=I-H\mathcal{D}_{x}(D_{k}\mathcal{D}_{x}-c_{k}) and b=p^{(n)}+HS^{(n+1)}. In practice, relaxation of the iteration is required to ensure convergence; i.e., p_{k+1}=\beta p_{k+1}+(1-\beta)p_{k}. The overall method is outlined in Algorithm 1.

Algorithm 1 LoDestro Method

Input: p_{0}, \beta, and tol       Output: p^{*}

1: for k=0,1,\ldots,k_{\rm max} do
2:     Compute the flux q(p_{k}) and split it into D_{k} and c_{k}
3:     Construct M_{k} and solve M_{k}p_{k+1}=b
4:     Relax the profile p_{k+1}=\beta p_{k+1}+(1-\beta)p_{k}
5:     if \|p_{k+1}-p_{k}\|<\text{tol} return p_{k+1} as p^{*}
6: end for
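To make the structure of Algorithm 1 concrete, the damped loop can be sketched in Python (the language Tango is written in). Here G is assumed to package steps 2-3 (flux evaluation, splitting, and the linear solve); the toy G used at the end is only a stand-in with a known fixed point, not a transport solve.

```python
import numpy as np

def lodestro_iteration(G, p0, beta, tol=1e-11, k_max=1000):
    """Damped fixed-point loop of Algorithm 1. G(p) packages steps 2-3:
    evaluate and split the flux, then build M_k and solve M_k p_{k+1} = b."""
    p = np.asarray(p0, dtype=float)
    for _ in range(k_max):
        p_new = beta * G(p) + (1.0 - beta) * p   # step 4: relaxation
        if np.linalg.norm(p_new - p) < tol:      # step 5: convergence test
            return p_new
        p = p_new
    raise RuntimeError("LoDestro iteration did not converge within k_max steps")

# toy stand-in for G with fixed point p* = 2 (not a transport solve)
p_star = lodestro_iteration(lambda p: 0.5 * p + 1.0, np.zeros(4), beta=0.6)
```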

As illustrated in Fig. 1, the coupled simulation alternates between the transport iteration and the turbulent evolution. The turbulent fluxes are obtained from a 5D global (entire confined plasma) gyrokinetic code running with an explicit step size, h\ll H. Within each transport iteration the models exchange information. The transport code provides profiles to the turbulence code which, running one long (on the fast time scale) simulation, then continues advancing with updated 1D profiles for a set time, after which it returns fluxes back to the transport code. The number of fast-scale time steps per data exchange is kept to a multiple of fluctuation periods, rather than the number needed to converge to a statistical steady state given the latest 1D profile; this provides a large savings in run time per iteration, mitigating the slow rate of fixed-point convergence. Evolving the turbulence model requires far more computational resources and expense than solving the linear system (6) in the transport iteration. Minimizing the number of iterations for convergence is, then, the critical issue for overall performance.

Figure 1: In Algorithm 1, the transport code (1D) uses a large implicit time step that requires iterating to convergence. During this iteration, the turbulence model (5D for physics runs; 1D in this paper) advances with a small explicit time step. The two models periodically exchange fluxes and profiles until the iteration has converged.

References crotinger1997corsica and shestakov2003self include applications of the LoDestro method with a two-dimensional turbulence model demonstrating speed-ups in reaching turbulent steady states. The first-ever multiscale results coupling to a global 5D gyrokinetic turbulence simulation were obtained by Tango and presented in parker2018bringing . Three-field coupling studies of current experiments are presented in Di-Siena_2022 , on the ASDEX Upgrade tokamak, and in Banon-Navarro_2023 , on ion-temperature clamping in the Wendelstein 7-X stellarator. The first of these includes a section on timings and speed-ups.

3 Anderson Acceleration

The LoDestro method can be viewed as a damped fixed-point iteration given by

p_{k+1}=\beta G(p_{k})+(1-\beta)p_{k}, (7)

where G(p) comprises steps 2 and 3 in Algorithm 1. This iteration may converge quite slowly (\beta\ll 1 may be required for stability of the iterations), and, thus, we are interested in methods for accelerating convergence. One such method is Anderson Acceleration (AA) anderson1965iterative ; kelley2018numerical , which computes an iteration update that approximately minimizes the linearized fixed-point residual over a subspace constructed from prior iterations. In a linear setting and under certain conditions, AA is essentially equivalent to the generalized minimal residual (GMRES) method walker2011anderson , and, with a sufficiently good initial iterate, AA performs no worse than a fixed-point iteration toth2015convergence .

The AA algorithm with fixed damping, \beta, and depth (subspace size), m, is given in Algorithm 2. Following walker2011anderson , the optimization problem in step 6 for the weights, \gamma_{k}\in\mathbb{R}^{m_{k}}, to compute the new iterate is presented as an unconstrained linear least-squares problem. Utilizing this formulation, rather than the constrained form of the problem, allows for an efficient solution via a QR factorization of \mathcal{F}_{k}, i.e., R_{k}\gamma_{k}=Q^{T}_{k}f_{k} where Q_{k}\in\mathbb{R}^{N\times m_{k}} has orthonormal columns, R_{k}\in\mathbb{R}^{m_{k}\times m_{k}} is an upper triangular matrix, and f_{k}\in\mathbb{R}^{N} is the nonlinear residual, G(p_{k})-p_{k}. Updating the factorization and solving the linear system as the residual history changes is typically a minimal cost compared to the function evaluation lockhart2022performance ; loffeld2016considerations and thus not a major contributor to the overall iteration run time.

Algorithm 2 Anderson Acceleration

Input: p_{0}, m_{\rm max}, k_{\rm max}, tol, and \beta       Output: p^{*}

1: Set p_{1}=\beta G(p_{0})+(1-\beta)p_{0}
2: for k=1,\ldots,k_{\rm max} do
3:     Set m_{k}=\min\{k,m_{\rm max}\}
4:     Set \mathcal{G}_{k}=[\Delta G_{k-m_{k}},\ldots,\Delta G_{k-1}] where \Delta G_{i}=G(p_{i+1})-G(p_{i})
5:     Set \mathcal{F}_{k}=[\Delta f_{k-m_{k}},\ldots,\Delta f_{k-1}] where \Delta f_{i}=f_{i+1}-f_{i} and f_{i}=G(p_{i})-p_{i}
6:     Determine \gamma_{k} that minimizes \|f_{k}-\mathcal{F}_{k}\gamma_{k}\|_{2}
7:     Set p_{k+1}=G(p_{k})-\mathcal{G}_{k}\gamma_{k}-(1-\beta)(f_{k}-Q_{k}R_{k}\gamma_{k})
8:     if \|p_{k+1}-p_{k}\|<\text{tol} return p_{k+1} as p^{*}
9: end for
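Algorithm 2 can be sketched compactly in Python. For clarity this version solves the least-squares problem with a dense lstsq call each iteration rather than KINSOL's efficiently updated QR factorization, and it replaces the Q_k R_k \gamma_k term in step 7 with the equivalent \mathcal{F}_k\gamma_k; it is an illustrative sketch, not the KINSOL implementation.

```python
import numpy as np

def anderson(G, p0, m_max=3, beta=1.0, tol=1e-10, k_max=100):
    """Anderson acceleration following Algorithm 2, using a dense
    least-squares solve in place of an updated QR factorization."""
    p = np.asarray(p0, dtype=float)
    g = G(p)
    f = g - p                             # nonlinear residual f_0
    p_new = beta * g + (1.0 - beta) * p   # step 1: damped first iterate
    dG, dF = [], []                       # columns of G_k and F_k
    for _ in range(k_max):
        if np.linalg.norm(p_new - p) < tol:
            return p_new
        g_new = G(p_new)
        f_new = g_new - p_new
        dG.append(g_new - g)              # Delta G_i = G(p_{i+1}) - G(p_i)
        dF.append(f_new - f)              # Delta f_i = f_{i+1} - f_i
        if len(dF) > m_max:               # keep at most m_max columns
            dG.pop(0)
            dF.pop(0)
        p, g, f = p_new, g_new, f_new
        Fk = np.column_stack(dF)
        Gk = np.column_stack(dG)
        gamma, *_ = np.linalg.lstsq(Fk, f, rcond=None)            # step 6
        p_new = g - Gk @ gamma - (1.0 - beta) * (f - Fk @ gamma)  # step 7
    raise RuntimeError("AA did not converge within k_max iterations")
```

On a linear fixed-point problem G(p) = Ap + b with spectral radius below one, this sketch converges rapidly, consistent with the GMRES equivalence noted above.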

In this work, we consider several extensions to AA, including delaying the start of acceleration as well as non-stationary variants with variable damping and/or depth, leading to Algorithm 3. When the initial iterate is far from the fixed point, delaying acceleration by d iterations can be advantageous in order to obtain a better initial iterate before enabling AA. Different approaches to adaptive damping in evans2020proof and chen2024non demonstrated significant improvements in convergence when compared to fixed damping values. The method in chen2024non computes an “optimal” damping value by solving a minimization problem at the cost of two additional function evaluations. To avoid extra evaluations, we use the heuristic approach from evans2020proof with the damping value

\beta_{k}=0.9-\omega_{\beta}\Gamma_{k}\quad\mathrm{with}\quad\Gamma_{k}=\sqrt{1-(\|Q^{T}f_{k}\|/\|f_{k}\|)^{2}}, (8)

where \Gamma_{k} is the AA convergence gain in the k-th iteration, Q is readily available from the QR solve in the AA optimization step, and \omega_{\beta} is a tuning parameter. In chen2022composite the adaptive damping algorithm from chen2024non is combined with a composite version of AA as a means of adapting the depth and further improves the convergence rate. A different adaptivity approach is taken in pollock2023filtering where a filtering method is applied to remove columns from the history matrix, \mathcal{F}_{k}, in order to control the condition number of the least-squares problem. This method also leads to improved convergence. In this work, we utilize a strategy similar to that employed in pollock2021anderson where the depth is determined by the magnitude of the current residual with

m_{k}=\min\{\tilde{m}_{k},m_{k-1}+1,m_{\rm max}\}\quad\mathrm{with}\quad\tilde{m}_{k}=\max\{0,\lfloor-\log_{10}(\omega_{m}\|f_{k}\|)\rfloor\}, (9)

where m_{0}=0 and \omega_{m} is a tuning parameter.
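The two heuristics (8) and (9) are cheap to evaluate; a minimal Python sketch (illustrative, not the KINSOL source) is:

```python
import numpy as np

def damping_heuristic(Q, f_k, omega_beta):
    """Damping rule (8): beta_k = 0.9 - omega_beta * Gamma_k, where the
    gain Gamma_k = sqrt(1 - (||Q^T f_k|| / ||f_k||)^2) reuses the Q
    factor already available from the AA least-squares solve."""
    ratio = np.linalg.norm(Q.T @ f_k) / np.linalg.norm(f_k)
    gain = np.sqrt(max(0.0, 1.0 - ratio ** 2))  # clip rounding below zero
    return 0.9 - omega_beta * gain

def depth_heuristic(f_k, m_prev, m_max, omega_m):
    """Depth rule (9): grow the depth as the residual norm shrinks, by
    at most one column per iteration, starting from m_0 = 0."""
    m_tilde = max(0, int(np.floor(-np.log10(omega_m * np.linalg.norm(f_k)))))
    return min(m_tilde, m_prev + 1, m_max)
```

Note that when the residual lies entirely in the range of Q (gain zero), (8) returns the maximum damping value of 0.9, and when the residual is large, (9) keeps the depth at zero, which is what automatically produces the delay behavior discussed in the results.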

Algorithm 3 Anderson Acceleration with delay and variable damping and depth

Input: p_{0}, m_{\rm max}, k_{\rm max}, tol, d, and \beta       Output: p^{*}

1: Set p_{1}=\beta G(p_{0})+(1-\beta)p_{0}
2: for k=1,\ldots,k_{\rm max} do
3:     if k\leq d then
4:         Set p_{k+1}=\beta G(p_{k})+(1-\beta)p_{k}
5:     else
6:         Select m_{k} using (9)
7:         Set \mathcal{G}_{k}=[\Delta G_{k-m_{k}},\ldots,\Delta G_{k-1}] where \Delta G_{i}=G(p_{i+1})-G(p_{i})
8:         Set \mathcal{F}_{k}=[\Delta f_{k-m_{k}},\ldots,\Delta f_{k-1}] where \Delta f_{i}=f_{i+1}-f_{i} and f_{i}=G(p_{i})-p_{i}
9:         Determine \gamma_{k} that minimizes \|f_{k}-\mathcal{F}_{k}\gamma_{k}\|_{2}
10:        Select \beta_{k} using (8)
11:        Set p_{k+1}=G(p_{k})-\mathcal{G}_{k}\gamma_{k}-(1-\beta_{k})(f_{k}-Q_{k}R_{k}\gamma_{k})
12:    end if
13:    if \|p_{k+1}-p_{k}\|<\text{tol} return p_{k+1} as p^{*}
14: end for

4 Implementation

To evaluate whether the application of AA improves the convergence rate of the LoDestro method, we utilize the Tango 1D transport solver parker2018investigation ; parker2018bringing interfaced with the KINSOL nonlinear system solver from the SUNDIALS library gardner2022enabling ; hindmarsh2005sundials . Tango is an open-source Python code that both solves the 1D transport equation (3) by way of the linear system (6) and implements the coupling Algorithm 1. The turbulent flux models available for coupling include the global gyrokinetic code GENE genecode ; goerler2011global , as well as analytic models for rapid algorithmic development. For this work we utilize one of the latter options, and note it corresponds to running the turbulence code to saturation each coupling exchange.

KINSOL provides a number of algorithms for solving nonlinear algebraic systems including an Anderson-accelerated fixed-point iteration. AA in KINSOL leverages several implementation optimizations for greater performance such as efficient QR factorization updates and solves for computing the acceleration weights fang2009two ; kresse1996efficiency ; walker2011anderson or the use of low synchronization orthogonalization methods within the QR solve to minimize the number of global reductions lockhart2022performance . For this work, we have extended KINSOL’s existing support for delay and constant damping with a fixed depth to allow for adapting the depth and/or damping. These new capabilities will be included in an upcoming SUNDIALS release. Through the interface between Tango and KINSOL, we can readily access an efficient implementation of AA and easily experiment with different variants of AA with minimal modification to existing Tango code.

The interfacing between Tango and KINSOL primarily consists of creating a class within Tango with a method to evaluate G(p) using Tango’s native functions for computing and splitting fluxes and for setting up and solving the linear system. Additional class methods call KINSOL functions to configure the solver based on command line options and solve the nonlinear system. As SUNDIALS packages are written in C, we leverage SWIG beazley1996swig to generate Python interfaces that align with the C API while allowing Tango to access data stored in SUNDIALS vectors as NumPy arrays without having to copy data between KINSOL and Tango data structures.
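The essential shape of such a wrapper class is small: the solver only needs a callable evaluating G(p). The sketch below is hypothetical; the class, method, and attribute names are illustrative and are not the actual Tango or KINSOL API.

```python
import numpy as np

class TangoFixedPointProblem:
    """Hypothetical wrapper handed to an AA solver: g() packages flux
    evaluation/splitting and the profile solve (steps 2-3 of Algorithm 1).
    Names are illustrative, not the real Tango/KINSOL interface."""

    def __init__(self, flux, solve_profile):
        self.flux = flux                    # q(p): turbulence-model call
        self.solve_profile = solve_profile  # splits q, builds M_k, solves (6)
        self.n_flux_evals = 0               # dominant cost worth tracking

    def g(self, p):
        """Evaluate G(p)."""
        self.n_flux_evals += 1
        q = self.flux(p)
        return self.solve_profile(p, q)

# toy stand-ins for Tango's native flux and profile-solve functions
problem = TangoFixedPointProblem(
    flux=lambda p: -np.gradient(p),
    solve_profile=lambda p, q: 0.5 * p + 1.0,
)
```

Tracking the flux-evaluation count directly in the wrapper mirrors the fact that turbulence-model calls, not the solver overhead, dominate the cost of each coupling iteration.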

Algorithm 3 includes a number of parameters that can have a significant impact on performance. The optimal combination of settings is unclear a priori and depends on the application and problem considered. To automate parameter selection we utilize the GPTune package liu2021gptune , a Python-based auto-tuning framework using Bayesian optimization methodologies with support for multi-task learning, multi-objective tuning, sensitivity analysis, and many other features. For this work we use the single-task learning autotuning (SLA) capability in GPTune for a single-objective function. That is, we consider one configuration of the problem (1) when optimizing the algorithm parameters to minimize the number of iterations to reach a given solution tolerance. The problem class created for interfacing with KINSOL also streamlines using GPTune, as the objective function only needs to construct the Tango problem class given a dictionary of input options, call the solve method, and return the number of solver iterations. Additionally, the driver program defines the input, output, and parameter spaces, as well as any constraints, to create the tuning problem. Once the problem is defined, the driver creates the GPTune object and runs the SLA method, which calls Tango as needed through the objective function.
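The contract between the tuner and the solver is simple: a dictionary of parameter values in, a list of objective values (here, iteration counts) out. The sketch below illustrates that contract with a toy stand-in solver and a plain sampling loop standing in for GPTune's SLA search; it deliberately does not reproduce the GPTune driver API.

```python
import numpy as np

def tango_iterations(beta, tol=1e-8, k_max=1000):
    """Toy stand-in for a Tango solve: a damped fixed point on
    G(p) = 0.5*p + 1 (fixed point p* = 2); returns the iteration count,
    which is the quantity the tuner minimizes."""
    p = 0.0
    for k in range(1, k_max + 1):
        p_new = beta * (0.5 * p + 1.0) + (1.0 - beta) * p
        if abs(p_new - p) < tol:
            return k
        p = p_new
    return k_max

def objective(point):
    """Tuner-facing objective: parameter dictionary in, objective list out."""
    return [tango_iterations(point["beta"])]

# stand-in for the SLA search loop: sample candidates, keep the best
candidates = [{"beta": b} for b in np.linspace(0.1, 1.0, 10)]
best = min(candidates, key=lambda pt: objective(pt)[0])
```

For this toy contraction the optimal damping is the largest value sampled, since the contraction factor 1 - 0.5*beta shrinks as beta grows; the real objective, of course, evaluates a full Tango solve.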

5 Numerical Results

In this section we present results applying AA to the LoDestro method for a nonlinear diffusion problem from shestakov2003self that mimics the challenges observed in turbulence-driven transport simulations. The problem is given by (1) on the interval x\in[0,1] with the analytic flux

q=-D\frac{\partial p}{\partial x}+\epsilon\quad\text{where}\quad D=\left(\frac{1}{p}\frac{\partial p}{\partial x}\right)^{r}. (10)

As briefly mentioned in previous sections, turbulent transport in tokamaks is often due to instabilities driven by gradients of the time-averaged 1D temperature and other profiles; and these instabilities often have thresholds, i.e., critical values for their onset. The resulting transport then drives the system to flatter gradients, which pushes plasma operation toward marginal stability, where the growth rates rise extremely rapidly from zero. This situation leads to very stiff transport as operation near the critical gradients is approached from above Garbet_2004 . We model this high level of stiffness by increasing the value of r in (10). The system is stiff with r=2, and we use r=10 to represent very stiff tests relevant to magnetic-fusion-energy problems.
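A small numerical check (not Tango code) makes the stiffness of (10) concrete: for fixed p, the flux scales like |dp/dx| to the power r+1, so the sensitivity of q to the gradient grows sharply with r.

```python
import numpy as np

def analytic_flux(p, dpdx, r, eps=0.0):
    """Analytic flux (10): q = -D dp/dx + eps with D = ((1/p) dp/dx)^r.
    (For the even r values used here, D stays positive for negative gradients.)"""
    D = (dpdx / p) ** r
    return -D * dpdx + eps

# stiffness illustration: q scales like |dp/dx|^(r+1) for fixed p, so a
# 10% steeper gradient changes q by 1.1^(r+1) - 1: roughly 33% for r=2
# versus roughly 185% for r=10
p, dpdx = 1.0, -2.0
ratio_stiff = analytic_flux(p, 1.1 * dpdx, 2) / analytic_flux(p, dpdx, 2)
ratio_very_stiff = analytic_flux(p, 1.1 * dpdx, 10) / analytic_flux(p, dpdx, 10)
```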

The 5D turbulence code illustrated in Fig. 1 reaches a saturated state with short spatial- and temporal-scale fluctuations. It is the electric fields involved in these fluctuations that kick particles across the magnetic surfaces, i.e., cause the transport. Even when a heat or particle flux is integrated over four dimensions, the 1D flux still has short-scale structure. We include the effects of these fluctuations on our coupling problem with the “noise” term, \epsilon, in (10). We use the same spatially correlated, temporally white Gaussian noise model studied in parker2018investigation to mimic the fluctuations observed in GENE simulations (see Figs. 4a and 5b in parker2018investigation for a comparison of the model with fluxes from GENE). Below, we first test cases without noise (\epsilon=0) before considering a problem with noisy fluxes.
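One generic way to generate such noise, smoothing white Gaussian samples with a spatial correlation kernel and drawing a fresh sample each exchange so it is white in time, can be sketched as follows. This is an illustrative construction under those assumptions, not necessarily the exact model of parker2018investigation .

```python
import numpy as np

def correlated_noise_sample(n, dx, corr_len, amplitude, rng):
    """One temporal sample of spatially correlated Gaussian noise:
    white noise smoothed by periodic convolution with a Gaussian kernel
    of width corr_len, then rescaled to the requested amplitude."""
    white = rng.standard_normal(n)
    x = np.arange(n) * dx
    kernel = np.exp(-0.5 * ((x - x[n // 2]) / corr_len) ** 2)
    kernel /= kernel.sum()
    # periodic convolution via FFT; ifftshift centers the kernel at index 0
    smooth = np.real(np.fft.ifft(np.fft.fft(white)
                                 * np.fft.fft(np.fft.ifftshift(kernel))))
    return amplitude * smooth / smooth.std()

rng = np.random.default_rng(0)
eps = correlated_noise_sample(128, 1.0 / 128, 0.05, 0.1, rng)
```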

The source term in (1) is defined as

S(x)=\begin{cases}1&\text{for}\quad 0\leq x\leq 0.1\\ 0&\text{for}\quad 0.1<x\leq 1\end{cases}, (11)

and the boundary conditions are p^{\prime}(0)=0 and p(1)=0.01. For the LoDestro method, the relaxation strength depends on the degree of stiffness, and the default value is \beta=0.6/r. AA setups with zero depth (m=0) and constant damping correspond to applying the unaccelerated LoDestro method as implemented in Tango and serve as the baseline comparison point.
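The test configuration above translates directly into code; a minimal sketch (function names illustrative):

```python
import numpy as np

def source(x):
    """Source term (11): unit heating for 0 <= x <= 0.1, zero elsewhere."""
    return np.where(x <= 0.1, 1.0, 0.0)

def default_beta(r):
    """Tango's default relaxation strength, beta = 0.6 / r, weakening
    the relaxation as the flux becomes stiffer."""
    return 0.6 / r
```

This default rule reproduces the baseline damping values used below: beta = 0.3 for the stiff case (r=2) and beta = 0.06 for the very stiff case (r=10).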

5.1 A Stiff Test Case

We first consider the stiff case (r=2) and the impact of different AA configurations on the convergence of the iteration.

Fixed Damping, Depth, and Delay

To begin, we consider AA with static damping, \beta, and depth, m, sweeping over the parameter space to better understand how each impacts performance. Fig. 2 shows a heat map for the number of iterations to reach a residual of 10^{-11} for different \beta and m values with and without a delay, d, before starting AA. From the first row (m=0, no acceleration), we see the default, \beta=0.3, is near optimal in this case, but \beta=0.4 can reduce the number of iterations by 16\%. However, increasing \beta too much leads to unphysical values or divergence (both denoted by empty cells).

(a) Without Delay
(b) Delay (d) AA by 1 iteration
Figure 2: Heat map of the number of iterations to reach a residual of 10^{-11}. Empty cells correspond to failed solves. Without a delay, m\geq 2 is needed to see a benefit in iteration counts from AA. With a one-iteration delay, AA consistently reduces the number of iterations and decreases sensitivity to the value of \beta.

Without delay, adding acceleration, m=1, gives some robustness in the choice of \beta, enabling convergence for larger values than possible without AA. However, the convergence rate is improved only for \beta far from the optimal value. Increasing the depth, m>1, yields more consistent gains with AA. Comparing the best results, AA takes approximately 14\% fewer iterations. For a non-optimal \beta, larger reductions in iteration count are realized due to the greater robustness and reduced sensitivity to the choice of \beta. Improved robustness in \beta with AA has also been observed in other multiphysics coupling applications hamilton2016assessment .

As the first iterate diverges from the final solution before steadily approaching the correct value at each subsequent iteration, introducing a small delay before starting AA significantly strengthens the performance in all aspects. Generally, the results continue to improve with increased depth but with diminishing returns, especially near the optimal \beta. Delaying AA further can improve the results slightly but also reduces robustness with larger \beta and may increase iteration counts with smaller \beta. The size of the delay necessary will depend on the quality of the initial iterate, and the best depth value is often problem dependent.

Optimized Fixed Damping, Depth, and Delay

In the prior section we manually explored the parameter space to understand the impact of different options on algorithmic performance. In this section we investigate optimizing these selections with GPTune and examine the convergence history. We consider three sets of parameters to optimize: \beta with m=0; \beta and m; and \beta, m, and d. Each set is optimized for three residual tolerances (10^{-6}, 10^{-8}, and 10^{-11}). We set the initial number of GPTune parameter-space samples to 10 and the total number of samples to 50. The convergence histories for the best parameter values from each set are shown in Fig. 3.


Figure 3: Residual history for autotuned configurations. The default (faded circles) and one setup from Fig. 2(b) (faded stars) are included for reference. The autotuned results align well with the parameter sweep in Fig. 2.

In the first few iterations there is little difference between the various options, with smaller \beta values performing slightly better. At more moderate residuals, optimizing \beta alone is one or two iterations better than the AA results. However, for values below 10^{-7} the AA setups give the best results, with the optimized parameter set edging out the best result from Fig. 2(b) by one or two iterations. The benefit of AA increases for values less than 10^{-9}, where the convergence of the unaccelerated version slows. This result indicates that AA delivers accelerated convergence to a tight tolerance. Overall, the optimized parameter sets from GPTune closely align with the best setups found with the manual sweep.

Adaptive Damping and Depth

In the previous section we saw that the best parameter configurations depend on the desired residual norm. Performance also depends significantly on whether the initial iterate is in the asymptotic convergence regime of the iterative scheme. In this section we evaluate AA with adaptive damping and/or depth to find setups that perform well across residual regimes. We again consider three sets of parameters to optimize: \beta with adaptive m; m and d with adaptive \beta; and adapting \beta and m together. We only optimize d in the fixed-m case, as the strategy for adapting m automatically finds an optimum d by initially selecting m=0. When adapting \beta and/or m, GPTune optimizes the factors \omega_{\beta} and \omega_{m} in (8) and (9), respectively. As before, each set is optimized for the same three residual tolerances, and we use the same numbers of initial and total parameter-space samples. The convergence histories for the best parameter values from each set are shown in Fig. 4, and the histories of \beta or m from the adaptive cases are given in Fig. 5.


Figure 4: Residual history for autotuned configurations with adaptive \beta and/or m. The adaptive results closely follow the autotuned fixed parameter results in Fig. 3 (faded triangles and squares included for reference).
(a) \beta=0.37 and adaptive m
(b) Adaptive \beta, m=10, and d=4
(c) Adaptive \beta and m
(d) Adaptive \beta and m
Figure 5: Residual (left axis) and parameter history (right axis) for adaptive \beta and/or m configurations. Adapting m automatically selects the AA delay, and adapting \beta alone leads to values near the max allowed. Adapting \beta and m together leads to large, frequent changes in \beta, slightly degrading performance compared to other setups.

Optimizing a fixed \beta and an adaptive m gives similar results to optimizing fixed \beta, m, and d. Fig. 5(a) shows that adaptive m delays AA six iterations and has a max m of three. Given the low sensitivity observed in Fig. 2(b), it is not surprising that, although the m and d values are slightly different, the convergence history only differs by one iteration. Optimizing an adaptive \beta with fixed m and d has improved convergence at larger residuals, and, after five iterations, the setup is within two iterations of the best results. Fig. 5(b) shows that the performance gain at larger residuals is due to using a small \beta value initially before immediately growing to almost the maximum \beta once AA starts. This behavior occurs because an optimum fixed \beta is computed by GPTune in the delay region before AA starts, as \Gamma_{k} is not available to estimate \beta. Once AA starts, the iteration switches to adaptive values and keeps \beta close to the max allowed.

Optimizing both \beta and m is the least performant of the three configurations but, due to the low sensitivity to \beta and m, the results are still relatively close (within five iterations) to the other cases. Again, adapting m automatically delays AA (five iterations) before settling on a max depth. Unlike when adapting \beta alone, there are large oscillations in \beta after the initial delay. This variation in \beta appears to be the cause of the lower overall performance and suggests that considering the gain history to reduce large, frequent changes in \beta may improve results.

Overall, in this stiff test case there are modest gains in numbers of iterations with AA compared to optimized damping values at large to moderate residual values. For residuals below 10^{-8}, AA has a larger impact as convergence of the iteration without AA slows. However, AA greatly increases robustness and decreases sensitivity to the choice of \beta, with a small delay in AA increasing these effects.

5.2 A Very Stiff Test Case

We now consider a very stiff case (r=10) relevant to magnetic-fusion-energy problems and examine the impact of AA on convergence with optimized fixed and adaptive parameter configurations, as in the prior two sections. Fig. 6 shows the convergence history for the default iteration (β=0.06), an optimized damping value (β=0.14), and three of the best performing AA configurations.


Figure 6: Residual history for autotuned setups with the default iteration (β=0.06), optimized damping (β=0.14), and various AA configurations. Markers are placed every 5 iterations. For a wide range of residuals, AA leads to reduced numbers of iterations, especially for smaller residual values.

As in the stiff case, optimizing β gives better results than the default β and, in this case, is generally the best option for larger residuals. Enabling AA with β=0.01 and m=3 consistently outperforms optimizing β alone for residuals below 10^{-4}. Again, delaying AA generally leads to better results overall and, given the larger initial residual and stronger damping applied in this case, the best AA configurations tend to have a longer delay. For residuals below 10^{-7}, configurations with fixed β, m, and d or with fixed β and adaptive m are the most efficient, with results that differ by less than six iterations. With an adaptive m, AA is automatically delayed by twelve iterations, the same as in the fixed-delay case, but allows more depth, going up to m=7. Unlike in the stiff test, the depth does not immediately ramp up after the delay but instead holds m constant for 2 to 7 iterations before increasing it. This stair-step behavior in m reflects the overall slower convergence in this case, which requires additional iterations before a residual reduction triggers an increase in m via the heuristic (9). Adapting β can also improve results but, as in the stiff case, does not generally perform as well as the other setups due to large oscillations in β. Overall, the results for this very stiff test improve on those for the stiff case, with AA producing more significant gains over a larger range of residuals.
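The stair-step growth of m can be mimicked with a simple residual-triggered schedule: the depth stays at zero (an automatic delay of AA) until the residual drops, then grows by one each time a further reduction is observed. The sketch below is a hypothetical stand-in for the heuristic (9), whose exact form is given earlier in the paper; the drop factor and function name are our assumptions.

```python
def adaptive_depth(res_norms, m_max=10, drop=0.9):
    """Hypothetical residual-triggered depth schedule (stand-in for (9)).

    Start with depth zero and grow m by one each time the residual norm
    falls below `drop` times its value at the previous depth increase,
    capped at m_max.
    """
    m, ref = 0, res_norms[0]
    depths = []
    for r in res_norms:
        if r < drop * ref and m < m_max:
            m += 1
            ref = r          # reset the trigger for the next increase
        depths.append(m)
    return depths
```

A slowly converging residual history keeps m fixed over several iterations before each increase, producing the stair-step pattern, while a stagnant history never raises the depth at all.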

5.3 Adding Noise in a Stiff Test Case

Finally, we revisit the stiff test (r=2) when accounting for fluctuations in q that mimic the noise observed with GENE, i.e., ε ≠ 0 in (10). Fig. 7(a) shows the residual history for the default setup (β=0.3) and an optimized configuration (β=0.4), with and without AA, along with a heat map for other combinations of β and m.

Figure 7: (a) Residual history and (b) heat map of the number of iterations to reach a residual of 10^{-4}, both with a delay of two iterations. AA increases robustness and reduces sensitivity to β but only lowers iteration counts away from the optimal β.


Unlike in the cases without noise, the performance with AA is similar to that without AA when using the best β value. This lack of improvement is due to the higher, noise-dependent residual floor at which the iteration stagnates, as noted in parker2018investigation for the LoDestro method and similarly observed in toth2017local when applying AA to problems with inaccurate function evaluations. While the residual floor limits the benefit of lowering iteration counts, AA still improves overall robustness and reduces sensitivity to β, which leads to faster convergence for non-optimal β values. This reduced sensitivity is significant as it shows that, with AA, less manual tuning of β is necessary for an efficient solution approach.
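The noise-dependent residual floor can be reproduced with a toy scalar model: a damped Picard iteration on g(x) = 0.5x + 1 whose evaluations carry multiplicative noise of relative size ε, loosely mimicking a noisy flux evaluation. This is an illustration under our own assumptions, not Tango's transport model or the noise form in (10).

```python
import numpy as np

def damped_picard_noisy(beta=0.4, eps=1e-3, iters=200, seed=0):
    """Damped Picard iteration on the toy map g(x) = 0.5*x + 1 whose
    evaluations carry multiplicative noise of relative size eps
    (illustration only).  Returns the history of residual norms."""
    rng = np.random.default_rng(seed)
    x = 1.0
    history = []
    for _ in range(iters):
        gx = (0.5 * x + 1.0) * (1.0 + eps * rng.standard_normal())
        f = gx - x                     # noisy fixed-point residual
        history.append(abs(f))
        x = x + beta * f               # damped update
    return history
```

With eps = 0 the residual decays geometrically toward machine precision, while with eps = 1e-3 it stalls at roughly the noise level, reproducing the stagnation floor described above.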

6 Conclusions

In this work we used the KINSOL nonlinear solver library in SUNDIALS to evaluate the potential of several variations of Anderson acceleration to improve the convergence of the LoDestro method for coupled turbulence and transport simulations of a magnetically confined fusion plasma. Combining the flexibility of KINSOL with the GPTune autotuning library enabled rapid evaluation of the parameter space to compare the impact of different options on algorithmic performance.

For all test cases, AA allows for convergence with larger relaxation parameter values and decreases sensitivity to the choice of the relaxation parameter. Compared to the optimal damping value, AA can provide a small decrease in the number of iterations for moderate residual values, with the benefit increasing for smaller target residuals, leading to a reduction of 20% or more in the number of iterations. Moreover, these benefits improve as the problem difficulty increases, going from stiff to very stiff tests; and for non-optimal damping values, the improvements are larger. Utilizing adaptive depth with AA provides a means for automatically selecting the delay and maximum allowed depth, with results generally matching the best setups with optimized fixed parameter values. We note that the use of GPTune removes the need for time-consuming hand optimization of the parameters. When noisy fluxes are considered, AA does not provide an improvement in iterations with optimal damping due to the higher residual floor parker2018investigation . However, AA still increases robustness and reduces sensitivity to the damping parameter, which lowers iteration counts for non-optimal damping. Improvements in the adaptivity strategies for damping and depth (or the alternative methods explored in chen2024non , evans2020proof , and pollock2023filtering ) could increase the benefits of AA at larger residual values and will be considered in later studies.

In future efforts we plan to assess the performance of AA with a modified analytic flux model that reflects the time dependence of the turbulence calculation as it is coupled to the transport code (see Fig. 1). We will also apply recently developed approaches for analyzing AA convergence de2022linear to better understand how the presence of noise and the addition of time dependence in the coupling impact convergence of the algorithm. Additionally, we will conduct studies to quantify parameter sensitivity in AA using GPTune. Finally, the interfaces to KINSOL will allow for testing new variants of AA as they are integrated into the library, such as alternating and composite AA chen2022composite , without modifying Tango.

Acknowledgements.
The authors wish to acknowledge the initial explorations applying Anderson acceleration in Tango by Jeff Parker, Jack G. Paulson, and H. Hunter Schwartz and thank them for their help in using Tango and interfacing with KINSOL. We would also like to thank Cody J. Balos for his work in creating Python interfaces to KINSOL. Support for this work was provided in part by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) Program through the Frameworks, Algorithms, and Scalable Technologies for Mathematics (FASTMath) Institute, under Lawrence Livermore National Laboratory subcontract B626484 and DOE award DE-SC0021354. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research and Office of Fusion Energy Sciences, Scientific Discovery through Advanced Computing (SciDAC) program. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PROC-864392.

Competing Interests The authors have no conflicts of interest to declare that are relevant to the content of this chapter.

References

  • (1) Anderson, D.G.: Iterative procedures for nonlinear integral equations. Journal of the ACM (JACM) 12(4), 547–560 (1965). DOI 10.1145/321296.321305
  • (2) Bañón Navarro, A., Di Siena, A., Velasco, J., Wilms, F., Merlo, G., Windisch, T., LoDestro, L., Parker, J., Jenko, F.: First-principles based plasma profile predictions for optimized stellarators. Nuclear Fusion 63(5), 054003 (2023). DOI 10.1088/1741-4326/acc3af
  • (3) Beazley, D.M., et al.: SWIG: An easy to use tool for integrating scripting languages with C and C++. In: Tcl/Tk Workshop, vol. 43, p. 74 (1996). DOI 10.5555/1267498.1267513
  • (4) Chen, K., Vuik, C.: Composite Anderson acceleration method with two window sizes and optimized damping. International Journal for Numerical Methods in Engineering 123(23), 5964–5985 (2022). DOI 10.1002/nme.7096
  • (5) Chen, K., Vuik, C.: Non-stationary Anderson acceleration with optimized damping. Journal of Computational and Applied Mathematics p. 116077 (2024). DOI 10.1016/j.cam.2024.116077
  • (6) Crotinger, J.A., LoDestro, L., Pearlstein, L.D., Tarditi, A., Casper, T.A., Hooper, E.B.: CORSICA: A comprehensive simulation of toroidal magnetic-fusion devices. final report to the LDRD program. Tech. rep., Lawrence Livermore National Laboratory (1997). DOI 10.2172/522508
  • (7) De Sterck, H., He, Y.: Linear asymptotic convergence of Anderson acceleration: fixed-point analysis. SIAM Journal on Matrix Analysis and Applications 43(4), 1755–1783 (2022). DOI 10.1137/21M1449579
  • (8) Di Siena, A., Bañón Navarro, A., Luda, T., Merlo, G., Bergmann, M., Leppin, L., Görler, T., Parker, J., LoDestro, L., Dannert, T., Germaschewski, K., Allen, B., Hittinger, J., Dorland, B., Hammett, G., Jenko, F., the ASDEX Upgrade Team, the EUROfusion MST1 Team: Global gyrokinetic simulations of ASDEX Upgrade up to the transport timescale with GENE–Tango. Nuclear Fusion 62(10), 106025 (2022). DOI 10.1088/1741-4326/ac8941
  • (9) Evans, C., Pollock, S., Rebholz, L.G., Xiao, M.: A proof that Anderson acceleration improves the convergence rate in linearly converging fixed-point methods (but not in those converging quadratically). SIAM Journal on Numerical Analysis 58(1), 788–810 (2020). DOI 10.1137/19M1245384
  • (10) Fang, H.r., Saad, Y.: Two classes of multisecant methods for nonlinear acceleration. Numerical Linear Algebra with Applications 16(3), 197–221 (2009). DOI 10.1002/nla.617
  • (11) Garbet, X., Mantica, P., Ryter, F., Cordey, G., Imbeaux, F., Sozzi, C., Manini, A., Asp, E., Parail, V., Wolf, R., et al.: Profile stiffness and global confinement. Plasma Physics and Controlled Fusion 46(9), 1351 (2004). DOI 10.1088/0741-3335/46/9/002
  • (12) Gardner, D.J., Reynolds, D.R., Woodward, C.S., Balos, C.J.: Enabling new flexibility in the SUNDIALS suite of nonlinear and differential/algebraic equation solvers. ACM Transactions on Mathematical Software (TOMS) 48(3), 1–24 (2022). DOI 10.1145/3539801
  • (13) GENE Developers: Gyrokinetic Electromagnetic Numerical Experiment (GENE). URL http://genecode.org/. Accessed 2024-05-06
  • (14) Goerler, T., Lapillonne, X., Brunner, S., Dannert, T., Jenko, F., Merz, F., Told, D.: The global version of the gyrokinetic turbulence code GENE. Journal of Computational Physics 230(18), 7053–7071 (2011). DOI 10.1016/j.jcp.2011.05.034
  • (15) Hamilton, S., Berrill, M., Clarno, K., Pawlowski, R., Toth, A., Kelley, C., Evans, T., Philip, B.: An assessment of coupling algorithms for nuclear reactor core physics simulations. Journal of Computational Physics 311, 241–257 (2016). DOI 10.1016/j.jcp.2016.02.012
  • (16) Hindmarsh, A.C., Brown, P.N., Grant, K.E., Lee, S.L., Serban, R., Shumaker, D.E., Woodward, C.S.: SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers. ACM Transactions on Mathematical Software (TOMS) 31(3), 363–396 (2005). DOI 10.1145/1089014.1089020
  • (17) Kelley, C.T.: Numerical methods for nonlinear equations. Acta Numerica 27, 207–287 (2018). DOI 10.1017/S0962492917000113
  • (18) Kresse, G., Furthmüller, J.: Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Computational Materials Science 6(1), 15–50 (1996). DOI 10.1016/0927-0256(96)00008-0
  • (19) Liu, Y., Sid-Lakhdar, W.M., Marques, O., Zhu, X., Meng, C., Demmel, J.W., Li, X.S.: GPTune: Multitask learning for autotuning exascale applications. In: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 234–246 (2021). DOI 10.1145/3437801.3441621
  • (20) Lockhart, S., Gardner, D.J., Woodward, C.S., Thomas, S., Olson, L.N.: Performance of low synchronization orthogonalization methods in Anderson accelerated fixed point solvers. In: Proceedings of the 2022 SIAM Conference on Parallel Processing for Scientific Computing, pp. 49–59. SIAM (2022). DOI 10.1137/1.9781611977141.5
  • (21) Loffeld, J., Woodward, C.S.: Considerations on the implementation and use of Anderson acceleration on distributed memory and GPU-based parallel computers. In: Advances in the Mathematical Sciences: Research from the 2015 Association for Women in Mathematics Symposium, pp. 417–436. Springer (2016). DOI 10.1007/978-3-319-34139-2_21
  • (22) Parker, J.B., LoDestro, L.L., Campos, A.: Investigation of a multiple-timescale turbulence-transport coupling method in the presence of random fluctuations. Plasma 1(1), 126–143 (2018). DOI 10.3390/plasma1010012
  • (23) Parker, J.B., LoDestro, L.L., Told, D., Merlo, G., Ricketson, L.F., Campos, A., Jenko, F., Hittinger, J.A.: Bringing global gyrokinetic turbulence simulations to the transport timescale using a multiscale approach. Nuclear Fusion 58(5), 054004 (2018). DOI 10.1088/1741-4326/aab5c8
  • (24) Pollock, S., Rebholz, L.G.: Anderson acceleration for contractive and noncontractive operators. IMA Journal of Numerical Analysis 41(4), 2841–2872 (2021). DOI 10.1093/imanum/draa095
  • (25) Pollock, S., Rebholz, L.G.: Filtering for Anderson acceleration. SIAM Journal on Scientific Computing 45(4), A1571–A1590 (2023). DOI 10.1137/22M1536741
  • (26) Post, D.E., Batchelor, D.B., Bramley, R.B., Cary, J.R., Cohen, R.H., Colella, P., Jardin, S.C.: Report of the fusion simulation project steering committee. Journal of Fusion Energy 23, 1–26 (2004). DOI 10.1007/s10894-004-1868-0
  • (27) Shestakov, A., Cohen, R., Crotinger, J., LoDestro, L., Tarditi, A., Xu, X.: Self-consistent modeling of turbulence and transport. Journal of Computational Physics 185(2), 399–426 (2003). DOI 10.1016/S0021-9991(02)00063-3
  • (28) Toth, A., Ellis, J.A., Evans, T., Hamilton, S., Kelley, C.T., Pawlowski, R., Slattery, S.: Local improvement results for Anderson acceleration with inaccurate function evaluations. SIAM Journal on Scientific Computing 39(5), S47–S65 (2017). DOI 10.1137/16M1080677
  • (29) Toth, A., Kelley, C.T.: Convergence analysis for Anderson acceleration. SIAM Journal on Numerical Analysis 53(2), 805–819 (2015). DOI 10.1137/130919398
  • (30) Walker, H.F., Ni, P.: Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis 49(4), 1715–1735 (2011). DOI 10.1137/10078356X