
Parallel Energy-Minimization Prolongation for Algebraic Multigrid

Carlo Janna (M3E S.r.l., via Giambellino 7, 35129 Padova, Italy, and Department ICEA, University of Padova, via Marzolo 9, 35131 Padova, Italy; e-mail [email protected])    Andrea Franceschini (corresponding author; Department ICEA, University of Padova, via Marzolo 9, 35131 Padova, Italy; e-mail [email protected])    Jacob B. Schroder (Department of Mathematics and Statistics, University of New Mexico, Albuquerque, NM 87131, USA; e-mail [email protected])    Luke Olson (Siebel Center for Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave., Urbana, IL 61801, USA; e-mail [email protected])
Abstract

Algebraic multigrid (AMG) is one of the most widely used solution techniques for linear systems of equations arising from discretized partial differential equations. The popularity of AMG stems from its potential to solve linear systems in almost linear time, that is, with O(n) complexity, where n is the problem size. This capability is crucial today, as the increasing availability of massive HPC platforms pushes for the solution of very large problems. The key to a rapidly converging AMG method is a good interplay between the smoother and the coarse-grid correction, which in turn requires an effective prolongation. From a theoretical viewpoint, the prolongation must accurately represent near kernel components and, at the same time, be bounded in the energy norm. For challenging problems, however, ensuring both requirements is not easy and is exactly the goal of this work. We propose a constrained minimization procedure aimed at reducing the prolongation energy while preserving the near kernel components in the span of interpolation. The proposed algorithm is based on previous energy minimization approaches utilizing a preconditioned restricted conjugate gradient method, but has new features and a specific focus on parallel performance and implementation. It is shown that the resulting solver, when applied to large real-world problems from various application fields, exhibits excellent convergence rates and scalability and outperforms some more traditional AMG approaches.

Keywords: Algebraic Multigrid, AMG, Preconditioning, Energy minimization, Prolongation

1 Introduction

With the increasing availability of powerful computational resources, scientific and engineering applications are becoming more demanding in terms of both memory and CPU time. For common methods used in the numerical approximation of partial differential equations (e.g., finite difference, finite volume, or finite element), the resulting system can easily grow to several millions or even billions of unknowns. The efficient solution of the associated sparse linear system of equations

(1) A\mathbf{x}=\mathbf{b},

either as a stand-alone system or as part of a nonlinear solve process, often represents a significant computational expense in the numerical application. Thus, research on sparse linear solvers continues to be a key topic for efficient simulation at large scales. One of the most popular sparse linear solvers is algebraic multigrid (AMG) [5, 6, 27] because of its potential for O(n) computational cost in the number of degrees of freedom n for many problem types.

A fast converging AMG method relies on the complementary action of relaxation (e.g., with weighted Jacobi) and coarse grid correction, which is a projection step focused on eliminating the error that is not reduced by relaxation. Even in a purely algebraic setting, the main algorithmic decisions in multigrid are often based on heuristics for elliptic problems. As a result, for more complex applications, traditional methods often break down, requiring additional techniques to improve accuracy with a careful eye on overall computational complexity.

Even with advanced AMG methods, robustness remains an open problem for a variety of applications, especially in parallel. Yet, there have been several advances in recent years that have significantly improved convergence in a range of settings. Adaptive AMG [19] and adaptive smoothed aggregation [9] are among the early attempts to assess the quality of the AMG setup phase during the setup process, with the ability to adaptively improve the interpolation operators. Later works focus on extending the adaptive ideas to more general settings [20]; in particular, Bootstrap AMG [3] further develops the idea of adaptive interpolation with least-squares interpolation coupled with locally relaxed vectors and multilevel eigenmodes. Other advanced approaches focus on specific AMG components, such as energy minimization of the interpolation operator [21, 33, 28, 26, 23], generalizing the strength of connection procedure [25, 4], or considering the nonsymmetric nature of the problem directly [24, 22].

While AMG robustness and overall convergence has improved with combinations of the advances above, the overarching challenge of controlling cost is persistent. In this paper, we make a number of related contributions with a focus on AMG effectiveness and efficiency at large scale. Our key contributions are as follows:

  • The quality and sparsity of tentative interpolation is improved through a novel utilization of sparse QR and a new process for sparsity pattern expansion that targets locally full-rank matrices for improved mode interpolation constraints;

  • We accompany the energy minimization construction of interpolation with new energy and convergence monitoring, thus limiting the total cost;

  • We apply a new preconditioning technique for the energy minimization process based on Gauss-Seidel applied to the blocks;

  • We present the non-trivial and efficient parallel implementation in detail; and

  • We demonstrate improved convergence and computational complexity with several large scale experiments.

The remainder of the paper is as follows. We begin with the basics of AMG in Section 2. In Section 3, we derive the energy minimization process based on QR factorizations and introduce a method for monitoring the reduction of energy in practice. We describe the efficient parallel implementation in Section 4 and conclude with several numerical experiments in Section 5, along with a discussion on performance.

2 Introduction to Classical AMG

The effectiveness of AMG as a solver depends on the complementary relationship between relaxation and coarse-grid correction, where the error not reduced by relaxation on the fine grid (e.g., with weighted-Jacobi or Gauss-Seidel) is accurately represented on the coarse grid, where a complementary error correction is computed. For a more in-depth introduction to AMG, see the works [10, 29]. Here, we focus our description of AMG on the coarse grid and interpolation setup, which are most relevant to the rest of the paper.

Constructing the AMG coarse grid begins with a partition of the n unknowns of A into a C-F partition of n_{f} fine nodes and n_{c} coarse nodes: \{0,\ldots,n-1\}=\mathcal{C}\cup\mathcal{F}. From this, we assume an ordering of A by F-points followed by C-points:

(2) A=\left[\begin{array}{cc}A_{ff}&A_{fc}\\ A_{fc}^{T}&A_{cc}\end{array}\right],

where, for example, A_{ff} corresponds to entries in A between two F-points. We also assume A is SPD, so that A_{cf}=A_{fc}^{T}. In classical AMG, prolongation takes the form

(3) P=\left[\begin{array}{c}W\\ I\end{array}\right],

where W must be sparse (for efficiency) and represents interpolation from the coarse grid to the fine-grid F-points.

In constructing prolongation of the form Eq. 3, there are two widely accepted guidelines, the so-called ideal [8, 35] and optimal [34, 7] forms of prolongation. Although neither is feasible in practical applications, as both lead to very expensive and dense prolongation operators, the concepts behind their definitions are valuable guides for constructing an effective P.

Ideal prolongation is constructed by starting with the above C-F partition and constructing P_{\text{id}} as

(4) P_{\text{id}}=\left[\begin{array}{c}-A_{ff}^{-1}A_{fc}\\ I\end{array}\right].

Making P_{\text{id}} the goal for interpolation is motivated by Corollary 3.4 from the theoretical work [12]. Here, the main assumption is a classical AMG framework where P is of the form in equation (3). (The other assumptions are specific choices for the map to F-points, S=[I,0], for the map to C-points, R=[0,I], and for relaxation, X=\|A\|I.) In this setting, the choice W=-A_{ff}^{-1}A_{fc} minimizes the two-grid convergence of AMG relative to the choice of P, i.e., with relaxation fixed. Motivating our later energy minimization approach, P_{\text{id}} can be viewed as having zero-energy rows, as AP_{\text{id}} is zero at all F-rows. Additionally, this classical AMG perspective likely makes the task of energy minimization easier, in that the conditioning of A_{ff} is usually superior to that of A.
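As a concrete illustration of Eq. 4, the sketch below (Python with NumPy; the function name and arguments are ours and purely illustrative) assembles P_{\text{id}} for a small dense SPD matrix given a C-F splitting. It is a didactic sketch rather than a practical kernel: the dense solve with A_{ff} is precisely what makes ideal prolongation infeasible at scale.

import numpy as np

def ideal_prolongation(A, fpts, cpts):
    # A: small dense SPD matrix; fpts, cpts: index arrays of the C-F splitting.
    # Builds P_id = [-A_ff^{-1} A_fc; I] of Eq. (4); the F-rows of A @ P_id vanish.
    Aff = A[np.ix_(fpts, fpts)]
    Afc = A[np.ix_(fpts, cpts)]
    W = -np.linalg.solve(Aff, Afc)   # dense solve: feasible only for tiny problems
    P = np.zeros((A.shape[0], len(cpts)))
    P[np.asarray(fpts)] = W
    P[np.asarray(cpts)] = np.eye(len(cpts))
    return P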

With optimal interpolation, the goal of interpolation is to capture the algebraically smoothest modes in \mathrm{span}(P), i.e., the modes left behind by relaxation. More specifically, following [7], let \lambda_{1}\leq\lambda_{2}\leq\dots\leq\lambda_{n} and v_{1},v_{2},\dots,v_{n} be the ordered eigenvalues and eigenvectors of the generalized eigenvalue problem Ax=\lambda\tilde{M}x, where \tilde{M} is the symmetrized relaxation matrix, e.g., the diagonal of A for Jacobi. (See the work [7] for more details.) Then, the two-grid convergence of AMG is minimized if

(5) \mathrm{span}(P)=\mathrm{range}(v_{1},v_{2},\dots,v_{n_{c}}).

Note that, unlike in equation (3), no assumptions on the structure of P are made. Motivating our later energy minimization approach, equation (5) indicates that \mathrm{span}(P) should capture low-energy modes relative to relaxation, which our Jacobi or Gauss-Seidel preconditioned energy minimization approach will explicitly target. Moreover, our energy minimization approach will incorporate constraints that explicitly force certain vectors to be in \mathrm{span}(P), where these vectors are chosen to represent the v_{i} with smallest eigenvalues.

The idea of energy minimization AMG with constraints has been exploited for both symmetric and non-symmetric operators in several works [21, 33, 28, 26, 23], and, though requiring more computational effort than classical interpolation formulas, often provides improved preconditioners that balance the extra cost.

3 Energy minimization prolongation

The energy minimization process combines the key aspects of ideal and optimal prolongation. To define this, we first introduce V, a basis for the near kernel of A, i.e., the lowest energy modes of A. Then, energy minimization seeks to satisfy two requirements:

  1. Range: the range of prolongation must include the near kernel V:

     (6) V\subseteq\mathrm{range}(P);

  2. Minimal: the energy of each column of P is minimized:

     (7) P=\operatorname*{argmin}_{P}\left(\operatorname{tr}(P^{T}AP)\right).

To construct a P that contains V in its range and has minimal energy, we next introduce the key components needed by most energy minimization approaches, namely

  i) a sparsity pattern for efficient application of P;

  ii) a constraint to enforce Eq. 6; and

  iii) an approximate block diagonal linear system for solving Eq. 7.

For practical use in AMG, the prolongation operator must be sparse; therefore, the construction begins by defining a sparse non-zero pattern for P. Assume that a strength of connection (SoC) matrix S is provided, where nonzero entries denote a strong coupling between degrees-of-freedom [27, 25, 4]. Next, let P_{0} be a tentative prolongation with non-zero pattern \mathcal{P}_{0}, to be used as an initial guess for P. (See Section 3.2 for our contributions regarding the construction of P_{0}, which is based on the adaptive algorithm [15].) P_{0} can be defined similarly to the tentative prolongation from smoothed aggregation AMG [31], in that P_{0} interpolates the basis V, but needs further improvement, for example, with energy minimization. We next obtain an enlarged sparsity pattern \mathcal{P} by growing \mathcal{P}_{0} to include all strongly connected neighbors up to distance k. Denoting by \overline{P} the binary matrix obtained from P by replacing its non-zeros with unit entries, this is equivalent to

(8) \overline{P}_{k}=S^{k}\overline{P}_{0},

where \mathcal{P} is the pattern of \overline{P}_{k} (see [26]).
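In code, the pattern growth of Eq. 8 amounts to repeated boolean sparse matrix products. A minimal sketch with SciPy CSR matrices follows (function and variable names are ours):

import numpy as np
import scipy.sparse as sp

def expand_pattern(S, P0, k):
    # Binary copies of S and P0, so that only the patterns propagate.
    Sb = sp.csr_matrix((np.ones_like(S.data), S.indices, S.indptr), shape=S.shape)
    Pk = sp.csr_matrix((np.ones_like(P0.data), P0.indices, P0.indptr), shape=P0.shape)
    for _ in range(k):
        Pk = Sb @ Pk          # each product adds one layer of strong neighbors
        Pk.data[:] = 1.0      # keep the matrix binary
    return Pk                 # pattern of S^k * P0, i.e., the enlarged pattern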

For a constraint condition satisfying Eq. 6, we start by splitting the near kernel basis V with the same C-F splitting as in Eq. 2:

(9) V=\left[\begin{array}{c}V_{f}\\ V_{c}\end{array}\right].

Recalling the form of interpolation Eq. 3, the near kernel requirement for P becomes

(10) W\;V_{c}=V_{f},

which is a set of n_{f} conditions on the rows of W. Denoting by w_{i}^{T} the i-th row of W (and of P) and by v_{i}^{T} the i-th row of V, condition Eq. 10 is then expressed row-wise as

(11) V_{c}^{T}w_{i}=v_{i}\qquad\forall i\in\mathcal{F}.

Using the sparsity pattern \mathcal{P}, we rewrite Eq. 11 for only the nonzeros in each row w_{i}. Letting the index set \mathcal{J}_{i} be the column indices in the i-th row of P, this becomes

(12) V_{c}(\mathcal{J}_{i},:)^{T}\overline{w_{i}}=v_{i}\qquad\forall i\in\mathcal{F},

where \overline{w_{i}}=w_{i}(\mathcal{J}_{i}) collects only the nonzeros of w_{i}. It is important to note that, for each of the n_{f} fine points, the constraints Eq. 12 are independent of each other, because each entry of W appears in only one constraint. Denote by \widetilde{p} the column vector collecting the nonzero entries of P row-wise. Similarly, denote by p the column vector containing the nonzero entries of P column-wise. By definition, \widetilde{p} and p have the same size, equal to the number of nonzeros in P. Next, we write the constraints in the following matrix form:

(13) \widetilde{B}^{T}\widetilde{p}=\widetilde{g},

where \widetilde{g} collects the v_{i} in Eq. 12 into a single vector, and where \widetilde{B}^{T} is a block diagonal matrix composed of the blocks V_{c}(\mathcal{J}_{i},:)^{T}, due to the independence of the constraints.
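For concreteness, the diagonal blocks V_{c}(\mathcal{J}_{i},:)^{T} of \widetilde{B}^{T} and the corresponding right-hand sides v_{i} can be gathered row by row. A short sketch, assuming the pattern of P is available as a SciPy CSR matrix and V, V_{c} are dense NumPy arrays (all names are ours):

def gather_constraint_blocks(P_pat, V, Vc, fpts):
    # For each fine row i, J_i holds the nonzero columns of row i of P;
    # the local block of B^T is Vc(J_i,:)^T and the local rhs is v_i (Eq. 12).
    blocks, rhs = {}, {}
    for i in fpts:
        Ji = P_pat.indices[P_pat.indptr[i]:P_pat.indptr[i + 1]]
        blocks[i] = Vc[Ji, :].T
        rhs[i] = V[i, :]
    return blocks, rhs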

Lastly, we describe the reduced linear system framework for approximating Eq. 7. Minimizing Eq. 7 is equivalent to minimizing the energy of the individual columns of P on the prescribed nonzero pattern:

(14) p_{i}=\operatorname*{argmin}_{p_{i}\in\mathcal{P}_{i}}p_{i}^{T}Ap_{i},

where p_{i} is the i-th column of P and \mathcal{P}_{i} is the sparsity pattern of column i.

Next, let \mathcal{I}_{i} be the set of nonzero row indices of the i-th column of W and let \overline{h}_{i} be the vector collecting the nonzero entries of the i-th column of W. Then the minimization in Eq. 14 defines \overline{h}_{i} through

(15) A(\mathcal{I}_{i},\mathcal{I}_{i})\overline{h}_{i}=-A(\mathcal{I}_{i},i)\qquad\forall i\in\mathcal{C},

where A(\mathcal{I}_{i},\mathcal{I}_{i}) is a square, relatively dense submatrix of A corresponding to the allowed nonzero indices \mathcal{I}_{i}, and A(\mathcal{I}_{i},i) is the vector corresponding to the i-th column of A at the allowed nonzero indices. Also in this case, each column of P satisfies Eq. 14 independently. Thus, denoting by p the column vector collecting the nonzero entries of P column-wise (i.e., a rearranged version of \widetilde{p}) and by f the vector collecting each -A(\mathcal{I}_{i},i), the minimization Eq. 14 is recast as

(16) Kp=f,

where K is block diagonal with A(\mathcal{I}_{i},\mathcal{I}_{i}) as its i-th block.
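The blocks of K are likewise simple gathers from A. A dense-A sketch follows (illustrative only; in practice K is never stored, as discussed in Section 4):

import numpy as np

def gather_K_blocks(A, col_patterns, cpts):
    # col_patterns[j] holds I_j, the allowed F-row indices of the j-th column
    # of W; cpts[j] is the global index of the j-th coarse point. The j-th
    # block of K is A(I_j, I_j) and the rhs block is -A(I_j, j), as in Eq. (15).
    K_blocks, f_blocks = [], []
    for Ij, j in zip(col_patterns, cpts):
        K_blocks.append(A[np.ix_(Ij, Ij)])
        f_blocks.append(-A[Ij, j])
    return K_blocks, f_blocks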

The two conditions, one on the range of P and one on the minimality of P, together form a constrained minimization problem, whose solution is the desired energy minimal prolongation. Casting this problem using Lagrange multipliers results in the saddle point system

(17) \left[\begin{array}{cc}K&B\\ B^{T}&0\end{array}\right]\left[\begin{array}{c}p\\ \lambda\end{array}\right]=\left[\begin{array}{c}f\\ g\end{array}\right].

The elements in Eq. 17 are the same as those defined in Eqs. 13 and 16, except that \widetilde{B}, \widetilde{g}, and \widetilde{p} are reordered following the columns of P, and \lambda is the vector of Lagrange multipliers, whose values are not needed for the purpose of setting up the prolongation. We emphasize that, with the entries of p enumerated column-wise with respect to P, K is block diagonal. Likewise, if p is enumerated following the rows of P, then B becomes block diagonal. Unfortunately, there is no ordering of p able to make both K and B block diagonal at the same time. Nevertheless, it is possible to take advantage of this underlying structure in the numerical implementation, as will be shown later. Leveraging the block structure of B is also important because, as we will see in Section 3.1, our algorithm to minimize the energy requires several applications of the orthogonal projector given as

(18) \Pi_{B}=I-B(B^{T}B)^{-1}B^{T}.

The system Eq. 17 follows closely the method from [26]. In Sections 3.2–3.4, we outline our proposed improvements to energy minimization.

3.1 Minimization through Krylov subspace methods

Following [26], energy minimization proceeds by starting with a tentative prolongation, P_{0}, that satisfies the near kernel constraints (see Eq. 10). Denoting by p_{0} the tentative prolongation in vector form, with nonzero entries collected column-wise (here the subscript 0 does not refer to a specific column of P, as p_{i} does in Eq. (14)), these constraints read

(19) B^{T}p_{0}=g.

Defining the final prolongation as the tentative p_{0} plus a correction \delta p gives

(20) p=p_{0}+\delta p.

Then, the problem is recast as finding the optimal correction \Delta P^{*}:

(21) \Delta P^{*}=\operatorname*{argmin}_{\Delta P\in\mathcal{P}}\left(\operatorname{tr}((P_{0}+\Delta P)^{T}A(P_{0}+\Delta P))\right),

subject to the constraint B^{T}\delta p=0, where \delta p is the vector form of \Delta P with nonzero entries again collected column-wise. Recalling that \Delta P has non-zero components only in W, i.e., \Delta P=[\Delta W^{T},0]^{T}, and using the C-F partition (2), we write

(22) \begin{split}\operatorname{tr}((P_{0}+\Delta P)^{T}&A(P_{0}+\Delta P))=\\ &=\underbrace{\operatorname{tr}(\Delta W^{T}A_{ff}\Delta W)+2\operatorname{tr}(W_{0}^{T}A_{ff}\Delta W)+2\operatorname{tr}(A_{fc}^{T}\Delta W)}_{\text{dependent on $\Delta P$}}\\ &+\underbrace{\operatorname{tr}(W_{0}^{T}A_{ff}W_{0})+2\operatorname{tr}(W_{0}^{T}A_{fc})+\operatorname{tr}(A_{cc})}_{\text{independent of $\Delta P$}}.\end{split}

Using the preset non-zero pattern for \Delta W, the problem is rewritten in vector form to minimize only the terms depending on \Delta P:

(23) \operatorname*{argmin}_{\delta w}(\delta w^{T}K\delta w+2w_{0}^{T}K\delta w+2f^{T}\delta w),

subject to the constraint

(24) B^{T}\delta w=0,

where K and f are defined as in Eq. 17, and \delta w and w_{0} are the vector forms of \Delta W and W_{0}, respectively, with the nonzero entries collected column-wise.

This minimization can be performed using (preconditioned) conjugate gradients by ensuring that both the initial solution and the search direction satisfy the constraint [26]. To do this, return to the orthogonal projector

(25) \Pi_{B}=I-B(B^{T}B)^{-1}B^{T},

and apply conjugate gradients to the singular system

(26) \Pi_{B}K\Pi_{B}\,\delta w=-\Pi_{B}(f+Kw_{0}),

starting from \delta w=0. Due to its block diagonal structure, it is straightforward to compute a QR decomposition of B, B=QR, and the projection simply becomes:

(27) \Pi_{B}=I-QQ^{T}.

Finally, introducing K_{\Pi}=\Pi_{B}K\Pi_{B} and \bar{f}=f+Kw_{0}, the Krylov subspace built by conjugate gradients is:

(28) \mathcal{K}_{m}=\mathrm{span}\{\Pi_{B}\bar{f},K_{\Pi}\bar{f},K_{\Pi}^{2}\bar{f},\dots,K_{\Pi}^{m}\bar{f}\}.

This is equivalent to applying the nullspace method [2] to the saddle-point system Eq. 17.
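A compact sketch of the resulting projected (here unpreconditioned) CG iteration follows, with apply_K and apply_Pi standing for the actions of K and of the projector in Eq. 27 (our names; dense vectors for clarity):

import numpy as np

def projected_cg(apply_K, apply_Pi, f, w0, maxit=20, tol=1e-8):
    # Solve Pi_B K Pi_B dw = -Pi_B (f + K w0) starting from dw = 0 (Eq. 26).
    dw = np.zeros_like(f)
    r = -apply_Pi(f + apply_K(w0))
    p = r.copy()
    rho = rho0 = r @ r
    for _ in range(maxit):
        q = apply_Pi(apply_K(p))     # projection keeps iterates in ker(B^T)
        alpha = rho / (p @ q)
        dw += alpha * p
        r -= alpha * q
        rho_new = r @ r
        if rho_new <= tol**2 * rho0:
            break
        p = r + (rho_new / rho) * p
        rho = rho_new
    return dw                        # the correction delta w of Eq. (20)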

3.2 Improved tentative interpolation P_{0} and orthogonal projection \Pi_{B} with sparsity pattern expansion and sparse QR

A crucial point for energy minimization interpolation is the availability of a tentative prolongation P_{0} that satisfies the near kernel representability constraint Eq. 10. While this is relatively straightforward for scalar diffusion equations, where V has only one column, it is not trivial for vector-valued PDEs such as elasticity. One specific difficulty is that, when forming the i-th constraint equation Eq. 12, we must ensure that V_{c}(\mathcal{J}_{i},:) is full-rank. If it is not, then no prolongation operator is able to satisfy Eq. 10 and the solution to Eq. 17 is not possible in general. We consider two possible remedies:

  1. Add strongly connected neighbors to the pattern of the i-th row of P to enlarge \mathcal{J}_{i} until V_{c}(\mathcal{J}_{i},:) is full-rank (i.e., sparsity pattern expansion); or

  2. Compute the least-squares solution of (12), as is done in [26].

A novel aspect of this work is our pursuit of sparsity pattern expansion. We find that this careful construction of the sparsity pattern, which guarantees that each constraint is exactly satisfied because V_{c}(\mathcal{J}_{i},:) is always full-rank, greatly improves performance on some problems.

To accomplish this task, we adopt a dynamic-pattern, least-squares fit (LSF) procedure that satisfies Eq. 10, or equivalently Eq. 19. For each row of W (corresponding to a fine node i), this is equivalent to satisfying the local dense system Eq. 12. For simplicity, we rewrite equation Eq. 12 by dropping the row subscript i, with \mathbb{w}=\overline{w_{i}} and \mathbb{v}=v_{i}, yielding

(29) \mathbb{B}\mathbb{w}=\mathbb{v},

where \mathbb{B}=V_{c}(\mathcal{J}_{i},:)^{T} corresponds to a diagonal block of B^{T} in Eq. 17, when the non-zero entries of P are enumerated row-wise.

Considering this generic FINE node represented by equation Eq. 29, if there is a sufficient number of COARSE node neighbors, then \mathbb{B} has more columns than rows. Hence, if we assume a full-rank \mathbb{B}, then Eq. 29 is an underdetermined system and can be solved in several ways. In order to have a sparse solution \mathbb{w}, we choose a minimal set of columns of \mathbb{B} using the max vol algorithm [16, 14] to obtain the best basis. Here, we satisfy Eq. 29 exactly. We note that a related max vol approach to computing C-F splittings is used in [7].
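We do not reproduce max vol here; as a rough stand-in, a column-pivoted QR can also select a well-conditioned square column subset of \mathbb{B}. The sketch below is a simple proxy for, not an implementation of, the max vol algorithm of [16, 14]:

import numpy as np
import scipy.linalg as sla

def select_basis_columns(Bloc):
    # Bloc: wide local block with full row rank (enough coarse neighbors).
    # Column-pivoted QR ranks the columns; the first rows(Bloc) pivots give
    # a square, well-conditioned basis, playing the role of max vol here.
    _, _, piv = sla.qr(Bloc, pivoting=True)
    return np.sort(piv[:Bloc.shape[0]])   # indices of the retained columns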

Remark 3.1.

We adopt this form of a QR factorization with max vol, which is as sparse as possible, in order to improve the complexity of our algorithm and the quality of P_{0}. While it is a relatively minor change to the algorithm's structure, we count it as a useful novelty of our efficient implementation.

If, on the contrary, the number of neighboring COARSE nodes is not sufficient (\mathbb{v}\notin\mathrm{span}(\mathbb{B})), then Eq. 29 cannot be satisfied because it is overdetermined. This may occur not only when \mathbb{B} is skinny, i.e., the number of columns is smaller than the number of rows, but more often because \mathbb{B} is rank deficient even when it has a larger number of columns, i.e., it is wide. In elasticity problems, and in particular with shell finite elements, this issue arises often in practice with standard distance-one coarsening, where some FINE nodes may occasionally remain isolated during the coarsening process. Our strategy is to gradually increase the interpolation distance for violating nodes where \mathbb{v}\notin\mathrm{span}(\mathbb{B}), thus widening the interpolatory set (i.e., adding columns to \mathbb{B}) where necessary. Algorithm 1 describes how to set up \widehat{p_{0}}, the vector form of the initial tentative prolongation \widehat{P_{0}}. (Algorithm 2, described later, will further process \widehat{P_{0}} into the final tentative prolongation P_{0}.)

Algorithm 1 Tentative Prolongation Set-Up
1:procedure PTent_SetUp(S, V, l_{max})
2:     input: S – strength of connection matrix
3:              V – near kernel modes
4:              l_{max} – maximum interpolation distance
5:     output: \widehat{p_{0}} – initial tentative prolongation
6:     for all FINE nodes i do
7:         Set l=0;
8:         while l<l_{max} do
9:              Set l=l+1;
10:              Form \mathcal{N}_{i} with the strong neighbors of i up to distance l;
11:              Select the best columns of V_{c}(\mathcal{N}_{i},:)^{T} using max vol to form \mathbb{B};
12:              Compute the least-squares solution to \mathbb{B}\mathbb{w}=\mathbb{v};
13:              if \|\mathbb{v}-\mathbb{B}\mathbb{w}\|_{2}=0 then
14:                  Assign \mathbb{w}^{T} as the i-th row of \widehat{p_{0}};
15:                  break;
16:              end if
17:         end while
18:     end for
19:end procedure
Remark 3.2.

Avoiding a skinny or rank-deficient local \mathbb{B} block is also important for the construction of the orthogonal projection \Pi_{B}=I-B(B^{T}B)^{-1}B^{T} that maps vectors of \mathbb{R}^{n} to \mathrm{Ker}(B^{T}). Note that \Pi_{B} is used not only to correct the conjugate gradients search direction, but also to ensure the initial prolongation satisfies the near kernel constraint. In general, the size of \mathbb{B} during energy minimization is larger than when constructing the tentative prolongation, because of the additional sparsity pattern expansion in (8). Thus, this “rank-deficiency” issue is ameliorated, but there are pathological cases where it has been observed in practice.

We now describe our procedure for computing the local blocks of \Pi_{B}, called \Pi_{\mathbb{B}}, and a single row of the final tentative prolongation P_{0}, called \mathbb{w}_{0}^{T}. The global procedure to build P_{0} and \Pi_{B} is then obtained by repeating this local algorithm for each FINE row. Denote by \widehat{\mathbb{w}}_{0}^{T} the starting tentative prolongation row. We use the word starting because, in general, we may receive a tentative prolongation that does not satisfy the constraint. That is, we may have:

(30) \mathbb{B}\widehat{\mathbb{w}}_{0}\neq\mathbb{v}.

Denote by n_{l} and m_{l} the dimensions of the local system, so that \mathbb{B}\in\mathbb{R}^{n_{l}\times m_{l}}. To fulfill condition Eq. 29, we must find a correction to \widehat{\mathbb{w}}_{0}, say \mathbb{\delta}, such that:

(31) \mathbb{B}\mathbb{\delta}=\mathbb{v}-\mathbb{B}\widehat{\mathbb{w}}_{0}=\mathbb{r},

and then set:

(32) \mathbb{w}_{0}=\widehat{\mathbb{w}}_{0}+\mathbb{\delta}.

To enforce condition Eq. 24 efficiently, we construct an orthonormal basis of \mathrm{range}(\mathbb{B}), say Q, that gives rise to the desired local orthogonal projector:

(33) \Pi_{\mathbb{B}}=I-QQ^{T}.

If the prolongation pattern is large enough, the vast majority of the above local problems will be such that n_{l}\geq m_{l}, with \mathbb{B} also being full-rank, i.e., \mathrm{rank}(\mathbb{B})=m_{l}. Thus, an economy-size QR decomposition is first performed on \mathbb{B}, \mathbb{B}=QR with Q\in\mathbb{R}^{n_{l}\times m_{l}} and R\in\mathbb{R}^{m_{l}\times m_{l}}, and Q is used to form the local projector. Then, through the same QR decomposition, we compute \mathbb{\delta} as the least-norm solution of the underdetermined system (31):

(34) \mathbb{\delta}=(\mathbb{B}^{T}\mathbb{B})^{-1}\mathbb{B}^{T}\mathbb{r}=(R^{T}Q^{T}QR)^{-1}R^{T}Q^{T}\mathbb{r}=R^{-1}Q^{T}\mathbb{r}.

Note that any solution to (31) would be equivalent, as the optimal choice in terms of global prolongation energy is later computed by the restricted CG algorithm. If the initial tentative prolongation arises from the LSF set-up, it should already fulfill Eq. 31 with \mathbb{r}\equiv 0. However, in the most difficult cases, even extending the interpolatory set with a large distance l_{max} is not sufficient to guarantee an exact interpolation of the near kernel for all the FINE nodes. For these FINE nodes, i.e., when n_{l}<m_{l} or \mathbb{B} is not full-rank, we compute an SVD decomposition of \mathbb{B}:

(35) \mathbb{B}=U\Sigma V^{T}.

From the diagonal of \Sigma, we determine the rank of \mathbb{B}, say k_{l}, and use the first k_{l} columns of U to form Q and thus \Pi_{\mathbb{B}}. Finally, since in this case it could be impossible to satisfy the constraint because the system is overdetermined, we use the least-squares solution to Eq. 31 to compute the correction for \widehat{\mathbb{w}}_{0}:

(36) \mathbb{\delta}=V\Sigma^{\dagger}U^{T}\mathbb{r}.

It is important to recognize that, for these specific FINE nodes, energy minimization cannot reduce the energy, because the constraint does not leave any degree of freedom. Consequently, this situation should be avoided when selecting the COARSE variables and the prolongation pattern, because a prolongation violating the near kernel constraint will likely fail to represent certain smooth modes. The pseudocode to set up \Pi_{B} and correct the initial tentative prolongation P_{0} (after Algorithm 1) is provided in Algorithm 2.

Algorithm 2 Energy Minimization Set-Up
1:procedure EMIN_SetUp(B, g, \widehat{p_{0}})
2:     input: B – block diagonal constraint matrix, as in equation (19)
3:              g – constraint right-hand side
4:              \widehat{p_{0}} – initial tentative prolongation
5:     output: \Pi_{B} – projection matrix, constructed block-wise
6:                p_{0} – final tentative prolongation
7:     for all FINE nodes i do
8:         Gather \mathbb{B}, \mathbb{v} and \widehat{\mathbb{w}}_{0} for row i, as in equation (30);
9:         Compute \mathbb{r}=\mathbb{v}-\mathbb{B}\widehat{\mathbb{w}}_{0};
10:         FAIL_QR = false;
11:         if n_{l}\geq m_{l} then
12:              Compute economy-size QR of \mathbb{B}: \mathbb{B}=QR;
13:              if rank(R)<m_{l} then
14:                  Set FAIL_QR = true;
15:              else
16:                  Compute \mathbb{\delta}=R^{-1}Q^{T}\mathbb{r};
17:              end if
18:         end if
19:         if n_{l}<m_{l} or FAIL_QR then
20:              Compute SVD of \mathbb{B}: \mathbb{B}=U\Sigma V^{T};
21:              Determine k=rank(\mathbb{B}) using the diagonal of \Sigma;
22:              Set Q=U(:,1:k);
23:              Compute \mathbb{\delta}=V\Sigma^{\dagger}U^{T}\mathbb{r};
24:         end if
25:         Compute \mathbb{w}_{0}=\widehat{\mathbb{w}}_{0}+\mathbb{\delta} and insert \mathbb{w}_{0}^{T} as row i of p_{0};
26:         Set the i-th block of \Pi_{B} as I-QQ^{T};
27:     end for
28:end procedure
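A per-row sketch of the numerical core of Algorithm 2 in Python with NumPy (the function name and the rank tolerance are ours):

import numpy as np

def emin_row_setup(Bloc, v, w0_hat, rtol=1e-12):
    # One FINE row: Bloc is the local n_l x m_l block, v the local constraint
    # rhs, and w0_hat the starting prolongation row, as in Eq. (30).
    r = v - Bloc @ w0_hat
    nl, ml = Bloc.shape
    fail_qr = True
    if nl >= ml:
        Q, R = np.linalg.qr(Bloc)                 # economy-size QR
        d = np.abs(np.diag(R))
        if d.min() > rtol * max(d.max(), 1.0):    # full rank: exact fit, Eq. (34)
            delta = np.linalg.solve(R, Q.T @ r)
            fail_qr = False
    if fail_qr:                                   # skinny or rank deficient
        U, s, Vt = np.linalg.svd(Bloc, full_matrices=False)
        k = int(np.sum(s > rtol * s.max()))       # rank from the diagonal of Sigma
        Q = U[:, :k]
        delta = Vt[:k].T @ ((Q.T @ r) / s[:k])    # least-squares fit, Eq. (36)
    w0 = w0_hat + delta                           # corrected row, Eq. (32)
    Pi_loc = np.eye(Q.shape[0]) - Q @ Q.T         # local block of Pi_B, Eq. (33)
    return w0, Pi_loc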

3.3 Improved stopping criterion and energy monitoring for CG-based energy minimization

Stopping criteria play an important role in the overall cost and effectiveness of energy minimization. Here, we introduce a measure for monitoring energy and halting the algorithm.

Since CG often converges quickly for energy minimization, it is common to fix the number of iterations in advance [21, 26]. However, for more challenging problems, several iterations may be needed, thus requiring an accurate stopping criterion. One immediate option is the relative residual, yet this may not be a close indicator of the energy. In the following, we analyze CG for a generic Ax=b; our observations extend to PCG as well.

In the CG algorithm, once the search direction p_{k} is defined at iteration k, the scalar \alpha_{k} is computed:

(37) \alpha_{k}=\frac{p_{k}^{T}r_{k}}{p_{k}^{T}Ap_{k}},

in such a way that the new approximation x_{k+1}=x_{k}+\alpha_{k}p_{k} minimizes the square of the energy norm of the error, i.e.,

(38) E_{k}=(x_{k}-h)^{T}A(x_{k}-h),

where h=A^{-1}b is the true solution. The difference in energy \Delta E_{k+1}=E_{k+1}-E_{k} between two successive iterations k and k+1 can be computed as

(39) \begin{split}\Delta E_{k+1}&=(x_{k}+\alpha_{k}p_{k})^{T}A(x_{k}+\alpha_{k}p_{k})-2b^{T}(x_{k}+\alpha_{k}p_{k})-x_{k}^{T}Ax_{k}+2b^{T}x_{k}\\ &=2\alpha_{k}x_{k}^{T}Ap_{k}+\alpha_{k}^{2}p_{k}^{T}Ap_{k}-2\alpha_{k}b^{T}p_{k}=-2\alpha_{k}r_{k}^{T}p_{k}+\alpha_{k}^{2}p_{k}^{T}Ap_{k}\\ &=\alpha_{k}(-2r_{k}^{T}p_{k}+\alpha_{k}p_{k}^{T}Ap_{k})=-\alpha_{k}(p_{k}^{T}r_{k})=-\frac{(p_{k}^{T}r_{k})^{2}}{p_{k}^{T}Ap_{k}}<0.\end{split}

From Eq. 37 and Eq. 39, it is possible to measure, at minimal cost, the energy decrease provided by the (k+1)-st iteration. Indeed, noting that \alpha_{k} is computed as the ratio between the two values \alpha_{\text{num}}=p_{k}^{T}r_{k} and \alpha_{\text{den}}=p_{k}^{T}Ap_{k}, the magnitude of the energy decrease reads

(40) \Delta E_{k+1}=\frac{\alpha_{\text{num}}^{2}}{\alpha_{\text{den}}}.

The relative value of the energy variation with respect to the initial variation (first iteration) is monitored and convergence is achieved when energy is sufficiently reduced:

(41) \frac{\Delta E_{k}}{\Delta E_{1}}\leq\tau

for a small user-defined \tau.
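Since \alpha_{\text{num}} and \alpha_{\text{den}} are already formed by CG, the monitor is essentially free. A minimal self-contained sketch for a generic SPD system follows (our names; the same test appears at lines 25-26 of Algorithm 3 below):

import numpy as np

def cg_energy_stop(A, b, tau=0.1, maxit=100):
    # Plain CG on SPD Ax = b that stops when the prospective energy decrease
    # Delta E = (p^T r)^2 / (p^T A p) falls below tau * Delta E_1 (Eq. 41).
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rho = r @ r
    dE1 = None
    for _ in range(maxit):
        Ap = A @ p
        a_num = p @ r                 # equals rho in exact arithmetic
        a_den = p @ Ap
        dE = a_num**2 / a_den         # energy decrease of Eq. (40)
        if dE1 is None:
            dE1 = dE
        if dE <= tau * dE1:           # relative energy test of Eq. (41)
            break
        x += (a_num / a_den) * p
        r -= (a_num / a_den) * Ap
        rho_new = r @ r
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x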

3.4 Improved preconditioning for CG-based energy minimization

Before introducing PCG, we present some important properties of the matrices K and B that we leverage in the design of effective preconditioners for energy minimization. In particular, we will see that, thanks to the non-zero structures of K and B, point-wise Jacobi or Gauss-Seidel iterations prove particularly effective in preconditioning the projected block \Pi_{B}K\Pi_{B}.

Let us assume that the vector form of the prolongation, p, has been obtained by collecting the non-zeroes of P row-wise, so that B is block diagonal. Denoting by Q the matrix collecting an orthonormal basis of \mathrm{range}(B) and by Z an orthonormal basis of \mathrm{ker}(B^{T}), by construction we have

(42) [Q\;\;Z][Q\;\;Z]^{T}=[Q\;\;Z]^{T}[Q\;\;Z]=I,

i.e., the matrix [Q\;\;Z] is square and orthogonal. Moreover, since B is block diagonal, both Q and Z are block diagonal and can be easily computed and stored. We note that, by construction, each column of Q and Z refers to a specific row of the prolongation and, due to the block diagonal pattern chosen for B, each column of Q and Z is non-zero only in the positions corresponding to the entries of p collecting the non-zeroes of that specific row of P, as schematically shown in Fig. 1. More precisely, using the same notation as in Eq. 12, let us define \mathcal{L}_{i} as the set of indices in p corresponding to the non-zero entries in the i-th row of P, so that:

(43) p(\mathcal{L}_{i})=P(i,\mathcal{J}_{i}),

and define \mathcal{J}_{B,i} and \mathcal{J}_{Z,i} as the sets of columns of B and Z referring to the i-th row of P. Then,

(44) \left.\begin{array}{l}B(k,\mathcal{J}_{B,i})=0\\ Q(k,\mathcal{J}_{B,i})=0\\ Z(k,\mathcal{J}_{Z,i})=0\end{array}\right\}\quad\mbox{if}\quad k\notin\mathcal{L}_{i}.
Figure 1: Non-zero pattern of the columns of B, Q and Z corresponding to a specific row of the prolongation P.

We are now ready to state two results that will be useful in explaining the choice of our preconditioners.

Theorem 3.1.

The diagonal of the projected matrix Z^{T}KZ is equal to the projection of the diagonal of K:

(45) \mathrm{diag}(Z^{T}KZ)=\mathrm{diag}(Z^{T}D_{K}Z),

where D_{K} is the matrix collecting the diagonal entries of K. Moreover, Z^{T}D_{K}Z is a diagonal matrix.

Proof 3.2.

Let us consider the block of columns of Z relative to row i, that is, Z(:,\mathcal{J}_{Z,i}), remembering that it is non-zero only for the row indices in \mathcal{L}_{i}. As a consequence, the square block H_{i} obtained by pre- and post-multiplying K by Z(:,\mathcal{J}_{Z,i}) is computed as:

(46) H_{i}(j,k)=\sum_{r\in\mathcal{L}_{i}}\left(\sum_{s\in\mathcal{L}_{i}}Z(s,k)K(r,s)\right)Z(r,j)\quad\mbox{for}\quad j,k\in\mathcal{J}_{Z,i}.

However, K(r,s) for r,s\in\mathcal{L}_{i} represents the connection between P(i,j_{r}) and P(i,j_{s}), which is non-zero only for j_{r}=j_{s}, i.e., r=s, because in K there is no connection between different columns of P. Moreover, due to Eq. 15, K(r,r)=A(i,i) for every r\in\mathcal{L}_{i}. As the columns of Z are orthonormal by construction, Z^{T}Z=I, it immediately follows that the square block H_{i} is diagonal with all its non-zero entries equal to A(i,i). The fact that Z^{T}D_{K}Z is a diagonal matrix follows from the above observation that K(\mathcal{L}_{i},\mathcal{L}_{i})=D_{K}(\mathcal{L}_{i},\mathcal{L}_{i})=A(i,i)I_{m}, with I_{m} the identity matrix of size m equal to the cardinality of \mathcal{L}_{i}.

Corollary 3.3.

The product Z^{T}D_{K}Q, where D_{K} is defined as in Theorem 3.1, is equal to the null matrix:

(47) Z^{T}D_{K}Q=0.

Proof 3.4.

Due to the block-diagonal structure of Q and Z, all the off-diagonal blocks of Z^{T}D_{K}Q are empty. Then, equation Eq. 47 follows from D_{K}(\mathcal{L}_{i},\mathcal{L}_{i})=A(i,i)I_{m} and Z(:,\mathcal{J}_{i})\perp Q(:,\mathcal{J}_{i}).

3.4.1 Preconditioned CG-based Energy Minimization

Algorithm 3 Preconditioned Conjugate Gradients for Energy Minimization
1:procedure EMIN_PCG(maxit, \tau, K, f, \Pi_{B}, M, P_{0})
2:     input: maxit – maximum iterations
3:              \tau – energy convergence tolerance
4:              K – system matrix (applied matrix-free with A)
5:              f – right-hand side f from equation (16)
6:              \Pi_{B} – projection matrix
7:              M – preconditioner
8:              P_{0} – tentative prolongation
9:     output: P – final prolongation
10:     Extract global weight vector w_{0} row-wise from P_{0};
11:     \Delta w=0;
12:     r=f-\Pi_{B}Kw_{0};
13:     for k=1,\dots,maxit do
14:         z=\Pi_{B}M^{-1}r;
15:         \gamma=r^{T}z;
16:         if k=1 then
17:              y=z;
18:         else
19:              \beta=\gamma/\gamma_{old};
20:              y=z+\beta y;
21:         end if
22:         \gamma_{old}=\gamma;
23:         \breve{y}=\Pi_{B}Ky;
24:         \alpha=\gamma/(y^{T}\breve{y});
25:         \Delta E_{k}=\gamma\alpha;
26:         if \Delta E_{k}<\tau\Delta E_{1} then return;
27:         \Delta w=\Delta w+\alpha y;
28:         r=r-\alpha\breve{y};
29:     end for
30:     w=w_{0}+\Delta w;
31:     Form final prolongation P=[W;I] with global weight vector w;
32:end procedure

Preconditioning CG can greatly improve convergence, but special care should be taken to maintain the search direction y in the space of vectors satisfying the near kernel constraint. In other words, y must satisfy \Pi_{B}y\equiv y. In [26], a Jacobi preconditioner is adopted that satisfies this requirement, but, due to the special properties of the matrix K, it is possible to compute a more effective preconditioner. Denoting by M^{-1} any approximation of K^{-1}, we use \Pi_{B}M^{-1}\Pi_{B} to precondition \Pi_{B}K\Pi_{B} in order to guarantee the constraint. The resulting PCG algorithm is outlined in Algorithm 3, where, since \Pi_{B} is a projection, we can avoid premultiplying r by \Pi_{B} (line 14), as r already satisfies the constraint.

In the remainder of this section, we focus our attention on Z^{T}KZ instead of \Pi_{B}K\Pi_{B} because, as \Pi_{B}=I-QQ^{T}=ZZ^{T}, they have the same spectrum. Our aim is to find a good preconditioner for Z^{T}KZ. Unfortunately, although K is block diagonal and several effective preconditioners can easily be built for it, Z^{T}KZ is less manageable and further approximations are needed.

By pre- and post-multiplying K by [Z\;Q]^{T} and [Z\;Q], respectively, we can write the following 2\times 2 block expression:

(48) [Z\;Q]^{T}K[Z\;Q]=\begin{bmatrix}Z^{T}KZ&Z^{T}KQ\\ Q^{T}KZ&Q^{T}KQ\end{bmatrix},

and, since we are interested in the inverse of Z^{T}KZ, we can express it as the Schur complement of the leading block of the inverse of [Z\;Q]^{T}K[Z\;Q] [32, Chapter 3]:

(49) ([Z\;Q]^{T}K[Z\;Q])^{-1}=[Z\;Q]^{T}K^{-1}[Z\;Q]=\begin{bmatrix}Z^{T}K^{-1}Z&Z^{T}K^{-1}Q\\ Q^{T}K^{-1}Z&Q^{T}K^{-1}Q\end{bmatrix},

from which it follows that:

(50) (Z^{T}KZ)^{-1}=Z^{T}K^{-1}Z-Z^{T}K^{-1}Q(Q^{T}K^{-1}Q)^{-1}Q^{T}K^{-1}Z.

When K^{-1} is approximated with the inverse of the diagonal of K, M_{J}=\mathrm{diag}(K), because of Corollary 3.3 we have that Z^{T}M_{J}^{-1}Q=0, and the expression (50) becomes:

(51) (Z^{T}KZ)^{-1}\simeq M_{1}^{-1}=Z^{T}M_{J}^{-1}Z,

which corresponds to a Jacobi preconditioning of Z^{T}KZ.

We highlight that only for Jacobi can the post-multiplication by \Pi_{B} be neglected in line 14, since M_{J}^{-1} does not introduce components along \mathrm{range}(Q). This is consistent with [26], where the Jacobi preconditioner is used and no post-multiplication by \Pi_{B} is adopted.

If a more accurate preconditioner is needed, K^{-1} can be approximated using a block-wise symmetric Gauss-Seidel (SGS) iteration, that is:

(52) K^{-1}\simeq M^{-1}_{SGS}=(L+D)^{-T}D(L+D)^{-1},

where D and L are the diagonal and the strict lower triangular part of K, respectively (K=L+D+L^{T}), which, substituted into equation (50), reads:

(53) (Z^{T}KZ)^{-1}\simeq M_{2}^{-1}=Z^{T}M^{-1}_{SGS}Z-Z^{T}M^{-1}_{SGS}Q(Q^{T}M^{-1}_{SGS}Q)^{-1}Q^{T}M^{-1}_{SGS}Z.

Since the application of (53) is still impractical due to the presence of the term (Q^{T}M^{-1}_{SGS}Q)^{-1}, we neglect the second term of the right-hand side, based on the heuristic that Z^{T}M^{-1}_{SGS}Q should be small, because Z^{T}M_{J}^{-1}Q=0. After this simplification, we obtain the final expression of the projected SGS preconditioner:

(54) (Z^{T}KZ)^{-1}\simeq M_{2}^{-1}=Z^{T}M^{-1}_{SGS}Z=Z^{T}(L+D)^{-T}D(L+D)^{-1}Z.

Note that, while M_{1}^{-1} is exactly the Jacobi preconditioner of Z^{T}KZ, M_{2}^{-1} is only an approximation of the exact SGS preconditioner of Z^{T}KZ. However, we will show in the numerical results that it is able to significantly accelerate convergence.
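Applying M_{2}^{-1} reduces to an independent symmetric Gauss-Seidel sweep on each diagonal block of K, followed by the projection (line 14 of Algorithm 3). A serial dense sketch follows (names are ours; the parallel, matrix-free variant is described in Section 4):

import numpy as np
import scipy.linalg as sla

def apply_block_sgs(K_blocks, r_blocks):
    # One application of M_SGS^{-1} = (L+D)^{-T} D (L+D)^{-1} per block of K,
    # with L and D the strict lower triangle and diagonal of each block.
    z_blocks = []
    for Kb, rb in zip(K_blocks, r_blocks):
        LD = np.tril(Kb)                               # L + D
        y = sla.solve_triangular(LD, rb, lower=True)   # (L+D)^{-1} r
        y *= np.diag(Kb)                               # D y
        z = sla.solve_triangular(LD.T, y, lower=False) # (L+D)^{-T} D (L+D)^{-1} r
        z_blocks.append(z)                             # then project with Pi_B
    return z_blocks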

4 Efficient parallel implementation

Energy minimization has historically been considered computationally expensive for the AMG setup phase in real applications. Nevertheless, a cost-effective implementation is possible, but requires special care in the parallelization of the algorithm. In this work, we build our AMG preconditioner with Chronos [15, 13], which provides the standard numerical kernels usually required in a distributed memory AMG implementation, such as the communication of ghost unknowns, coarse grid construction (e.g., computing the PMIS C-F partition [11]), sparse matrix-vector products, sparse matrix-matrix products, etc. For energy minimization, however, we developed three specific kernels that are not required in other AMG approaches but are critical for an efficient parallel implementation here: the sparse product between K and p, the application of the projection \Pi_{B}, and symmetric Gauss-Seidel with K. We do not list Jacobi preconditioning, because it simply consists of a row-wise scaling of P.

The first issue related to the product by K is that, in practical applications, K cannot be stored. In fact, if we consider a prolongation P having r non-zeroes per row on average, the number of non-zeroes per column will be approximately s\simeq\frac{n_{f}}{n_{c}}r, and the K matrix would be of size n_{c}s\times n_{c}s with about n_{c}s\,t\simeq n_{f}r\,t non-zeroes to be stored, where t is the average number of non-zeroes per row of A. Often, r\simeq t, and storing K becomes several times more expensive than storing A. For instance, in practical elasticity problems, K can be up to 20 times larger than A, making it unavoidable to proceed in a matrix-free mode for K. Fortunately, the special structure of K allows us to interpret the product Kp as a sparse matrix-matrix (SpMM) product between A and P, but with a fixed, prescribed pattern on P. This property can easily be derived from the definition of K in Eq. 16 and the vector form of the prolongation p. One advantage of prescribing a fixed pattern for P is that the amount of data exchanged in the SpMM is greatly reduced. First of all, the sparsity pattern adjacency information of P can be communicated only once before entering PCG; then, for all the successive AP products, only the entries of P are exchanged. Moreover, all the buffers to receive and send messages through the network can be allocated and set up only once at the beginning and removed at the end of the minimization process. In practice, we find that these optimizations reduce the cost of the SpMM by about 50%.
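A serial sketch of the equivalence used here: applying K to p amounts to the product AP restricted to the prescribed pattern of P (SciPy CSR inputs; our names; this glosses over the identity block at the C-points and, of course, the distributed buffer reuse described above):

def masked_spmm(A, P):
    # Compute A @ P, then keep only the entries inside P's fixed pattern;
    # this realizes the action of K on p (Eq. 16) without ever storing K.
    AP = (A @ P).tocsr()
    mask = P.copy().tocsr()
    mask.data[:] = 1.0
    return AP.multiply(mask)          # elementwise mask onto P's pattern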

By contrast, the construction and application of \Pi_{B} do not require any communication. In fact, we distribute the prolongation row-wise among processes and, since B is block diagonal, with each block corresponding to a row of P as shown in (13), we distribute the blocks of B to the process owning the respective row of P. Following this scheme, each block \mathbb{B} of B is factorized independently to obtain Q, which is efficiently processed by the BLAS routine DGEMV when applying \Pi_{B}.

Finally, we emphasize that there is no parallel bottleneck in the application of symmetric Gauss-Seidel with K. This is because K is block diagonal and, at least in principle, each block can be assigned to a different process. However, as in the Kp product, the main issue is the large size of K, which prevents explicit storage. As above, we rely on the equivalence between the Kp product and the AP product with a prescribed sparsity pattern. The symmetric Gauss-Seidel step is then performed matrix-free, similarly to the AP product, again saving a considerable amount of data exchange because it is not necessary to communicate adjacency information or reallocate communication buffers. As expected, the symmetric Gauss-Seidel application exhibits a computational cost comparable to that of the AP product.

5 Numerical experiments

In this section, we investigate the improved convergence offered by energy minimization and the resulting computational performance and parallel efficiency for real-world applications. This numerical section is divided into three parts: a detailed analysis of the prolongation energy reduction and the related AMG convergence for two medium-size matrices, a weak scalability test for an elasticity problem on a uniformly refined cube, and a comparison study for a set of large real-world matrices representing fluid dynamics and mechanical problems.

As a reference for the proposed approach, we compare with the well-known open source solver GAMG [1], a smoothed aggregation-based AMG code from PETSc. In both cases, Chronos and GAMG are used with preconditioned conjugate gradients (PCG). The GAMG set-up is tuned for each problem starting from its default parameters, as suggested by the user guide [30]. As the smoother, we have chosen the most powerful option available in Chronos and GAMG, that is, FSAI and Chebyshev-accelerated Jacobi, respectively. Whenever the default parameters are modified, we report the chosen values in the relevant section. Since all of the experiments are run in parallel, the system matrices are partitioned with ParMETIS [18] before the AMG preconditioner set-up to reduce communication overhead.

All numerical experiments have been run on the Marconi100 supercomputer located at the Italian consortium for supercomputing (CINECA). Marconi100 consists of 980 nodes based on the IBM Power9 architecture, each equipped with two 16-core IBM POWER9 AC922 @3.1 GHz processors. For each test, the number of nodes, N, is selected so that each node has approximately 640,000,000 nonzero entries; consequently, the number of nodes is problem dependent. Each node is always fully exploited by using 32 MPI tasks, i.e., each task (core) has an average load of 20,000,000 nonzero entries. The number of cores is denoted by n_{cr}. Only for the smaller cases in the weak scalability analysis are nodes partially used (i.e., with fewer than 32 MPI tasks). Even though Chronos can exploit hybrid MPI-OpenMP parallelism, for the sake of comparison, it is convenient to use just one thread, i.e., pure MPI parallelism. Moreover, for such a high load per core, we do not find that fine-grained OpenMP parallelism is of much help.

The numerical results are presented in terms of the total number of computational cores used, n_{cr}, the grid and operator complexities, C_{gr} and C_{op}, respectively, the number of iterations, n_{it}, and the setup, iteration, and total times, T_{p}, T_{s}, and T_{t}=T_{p}+T_{s}, respectively. For all the test cases, the right-hand side is a unit vector. The linear systems are solved with PCG and a zero initial guess, and convergence is achieved when the \ell_{2}-norm of the iterative residual drops by 8 orders of magnitude with respect to the \ell_{2}-norm of the right-hand side.

5.1 Analysis of the energy minimization process

We use two matrices to study prolongation energy reduction, Cube and Pflow742 [20]. While the former is quite simple, being the fourth refinement level of the linear elasticity cube used in the weak scalability study, the latter arises from a 3D simulation of the pressure-temperature field in a multilayered porous medium discretized by hexahedral finite elements. The main source of ill-conditioning here is the large contrast in the material properties of the different layers. Cube has 1,778,112 rows and 78,499,998 entries, with 44.15 entries per row on average. Pflow742 has 742,793 rows and 37,138,461 entries, for an average of 50.00 entries per row.

The overall solver performance is compared against the energy reduction for the fine-level P when using the energy minimization Algorithm 3 with either Jacobi or Gauss-Seidel as the preconditioner. Results are analyzed in terms of computational costs and times. The main algorithmic features we want to analyze are:

  • how the prolongation energy reduction affects AMG convergence; and

  • the effectiveness of the preconditioner, i.e., Jacobi or Gauss-Seidel.

The energy minimization iteration count (Algorithm 3) is denoted by n_{it}^{E}. We also report the relative energy reduction \Delta E_{k}/\Delta E_{1} that is used to monitor the restricted PCG convergence of Algorithm 3, as shown in equation (41). T_{i} is the time spent improving the prolongation, with either classical prolongation smoothing with weighted Jacobi (SMOOTHED) or the energy minimization process. Note that T_{i} is only part of T_{p}.

Table 1 shows the results for Cube. First, it can be seen that the energy minimization algorithm produces prolongation operators leading to somewhat lower complexities than the smoothed case. Moreover, energy minimization builds more effective prolongation operators overall, since the global iteration count (n_{it}) is lower. As the energy minimization iteration count increases, n_{it} decreases, while the setup time (T_{p}) increases. For this simple problem, the optimal point in terms of total time (T_{t}) is reached using 2 iterations of Jacobi (EMIN-J). Fig. 2a further shows that 2 iterations of Jacobi already reach close to the achievable energy minimum. Fig. 3a compares the cost in wall-clock seconds of each energy minimization iteration using the different preconditioners. For this case and implementation, Jacobi is more efficient.

Table 1: Analysis of energy minimization for the Cube problem.
Prolongation   n_{it}^{E}   \Delta E_{k}/\Delta E_{1}   C_{gr}   C_{op}   n_{it}   T_{p} [s]   T_{s} [s]   T_{t} [s]   T_{i} [s]
SMOOTHED       -            -                           1.075    1.648    58       52.1        13.4        65.5        6.3
EMIN-J         1            10^{0}                      1.075    1.592    54       47.6        11.2        58.8        11.3
EMIN-J         2            2\cdot10^{-1}               1.075    1.589    32       50.9        6.7         57.6        14.3
EMIN-J         4            10^{-2}                     1.075    1.589    27       57.3        5.7         63.0        20.0
EMIN-GS        1            10^{0}                      1.075    1.587    28       55.5        6.0         61.5        19.3
EMIN-GS        2            2\cdot10^{-2}               1.075    1.588    26       65.6        5.5         71.1        28.8
EMIN-GS        4            1\cdot10^{-4}               1.075    1.589    26       84.4        5.6         90.0        47.7

Figure 2: Energy reduction vs. energy minimization iterations with Jacobi and Gauss-Seidel. (a) Cube case; (b) Pflow742 case.

Figure 3: Energy reduction vs. computational cost for energy minimization preconditioned with Jacobi and Gauss-Seidel. (a) Cube case; (b) Pflow742 case.

Similar conclusions can be drawn for the Pflow742 case, as n_{it} decreases monotonically as the energy associated with the prolongation operator is reduced. As reported in Table 2, the optimal total time (T_{t}) is obtained with 4 Jacobi iterations (EMIN-J). Fig. 2b shows that the energy of the prolongation operator decreases more slowly than for Cube and that Jacobi converges significantly more slowly than Gauss-Seidel. However, as reported in Fig. 3b, the cost of Gauss-Seidel is still higher than that of Jacobi, although the performance difference is smaller than for Cube.

Table 2: Analysis of energy minimization for the Pflow742 problem.
Prolongation   n_{it}^{E}   \Delta E_{k}/\Delta E_{1}   C_{gr}   C_{op}   n_{it}   T_{p} [s]   T_{s} [s]   T_{t} [s]   T_{i} [s]
SMOOTHED       -            -                           1.061    1.465    369      23.0        29.8        52.8        2.3
EMIN-J         1            10^{0}                      1.061    1.339    377      20.4        27.3        47.7        2.9
EMIN-J         2            3\cdot10^{-1}               1.061    1.344    270      21.5        19.6        41.1        3.7
EMIN-J         4            4\cdot10^{-2}               1.062    1.352    219      23.3        15.7        39.0        5.4
EMIN-J         8            8\cdot10^{-3}               1.062    1.360    189      26.2        13.8        40.0        8.1
EMIN-GS        1            10^{0}                      1.061    1.346    276      23.0        19.9        42.9        5.3
EMIN-GS        2            9\cdot10^{-2}               1.062    1.352    221      26.2        16.0        42.2        8.2
EMIN-GS        4            6\cdot10^{-3}               1.062    1.363    184      31.7        14.3        46.0        13.8
EMIN-GS        8            7\cdot10^{-4}               1.063    1.367    183      42.9        13.8        56.7        24.8

Regarding algorithmic complexity, each energy minimization iteration with Gauss-Seidel should cost exactly twice that of an iteration with Jacobi, and, from the above tests, Gauss-Seidel is able to reduce the energy more than twice as fast as Jacobi. Thus, Gauss-Seidel should be cheaper than Jacobi. Unfortunately, this is not confirmed by our numerical experiments, likely due to a sub-optimal parallel implementation of the Gauss-Seidel preconditioner. Although the block diagonal structure of K theoretically allows for a perfectly parallel implementation, our Gauss-Seidel implementation requires more communication and synchronization stages than Jacobi, which likely explains why the Jacobi preconditioner is faster. A more cost-effective implementation of Gauss-Seidel will be the focus of future work. For now, Jacobi is chosen as our default preconditioner and is used in all subsequent cases.

Finally, we observe that a relative energy reduction of one order of magnitude generally gives the best trade-off between setup time and AMG convergence. Therefore, we use $\tau=0.1$ as the default.
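As a minimal sketch of this default stopping rule (assuming the prolongation energies $E_0, E_1, \dots$ are recorded after each minimization iteration and $\Delta E_k = E_{k-1} - E_k$; the function name is hypothetical):

```python
def emin_converged(energies, tau=0.1):
    """Stop when the latest energy drop falls below tau times the first
    drop, i.e., Delta E_k / Delta E_1 <= tau.  'energies' holds the
    prolongation energy trace(P^T A P) after each iteration, starting
    from the initial tentative prolongation."""
    if len(energies) < 3:        # need at least two drops to compare
        return False
    dE1 = energies[0] - energies[1]
    dEk = energies[-2] - energies[-1]
    return dEk <= tau * dE1
```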

5.2 Weak scalability test

Here, we carry out a weak scalability study of energy minimization AMG for linear elasticity on a unit cube at different levels of refinement. The unit cube $[0,1]^3$ is discretized by regular tetrahedral elements, the material is homogeneous, and all displacements are prevented on the square region $[0,0.125]\times[0,0.125]$ at $z=0$. The mesh sizes are chosen such that each refinement produces about twice the number of degrees of freedom of the previous mesh. The problem sizes range from 222k rows to 124M rows, with an average of 44.47 entries per row.

Two sets of AMG parameters are used with Chronos: the first set targets a constant PCG iteration count, while the second minimizes the total solution time ($T_t$). The first section of Table 3 reports the outcome of the first test. The iteration count increases very slowly, from 23 to 33, while the problem size increases by a factor of about $2^9$. However, this nearly optimal scaling comes with relatively large complexities. As a consequence, total times are also relatively large, especially the setup time ($T_p$). The time for energy minimization ($T_i$), however, scales quite well, with only a factor-of-2 difference between the first and last refinement levels.

Next, to reduce the setup and solution times, we increase the AMG strength threshold [27] to allow for lower complexities. The second section of Table 3 collects the new results. The relative trends are the same as in the previous tests, but the timings are lower (by 30% on average). The reduced complexity is thus advantageous, providing faster wall-clock times even at the expense of more iterations $n_{it}$.

Table 3: Weak scalability results for the regular cube. Three sets of results are reported: i) energy minimization-based AMG via Chronos, tuned to produce an almost constant iteration count; ii) the same, tuned to minimize the total time; and iii) PETSc's GAMG with all default parameters except $\mu$, which is chosen to reduce the overall solution time.
nrows  $n_{cr}$  $N$  $C_{gr}$  $C_{op}$  $n_{it}$  $T_p$ [s]  $T_s$ [s]  $T_t$ [s]  $T_i$ [s]
energy minimization, minimal iteration count:
222k  1  1  1.077  1.540  23  28.7  2.7  31.4  0.12
447k  2  1  1.076  1.559  24  33.0  2.9  35.9  0.12
902k  4  1  1.076  1.580  26  36.1  3.4  39.5  0.13
1,778k  8  1  1.075  1.596  27  39.0  3.6  42.7  0.13
3,675k  16  1  1.075  1.610  28  43.3  4.1  47.4  0.15
7,546k  32  1  1.075  1.620  29  53.9  5.3  59.2  0.18
15,533k  64  2  1.075  1.630  30  61.2  5.9  67.2  0.20
31,081k  128  4  1.075  1.670  30  74.9  6.5  81.4  0.22
62,391k  256  8  1.075  1.642  32  72.8  6.8  79.7  0.21
124,265k  512  16  1.075  1.646  33  88.8  7.8  96.7  0.24
energy minimization, best solution time:
222k  1  1  1.047  1.252  31  19.5  3.1  22.6  0.10
447k  2  1  1.047  1.260  31  22.6  3.3  25.9  0.11
902k  4  1  1.047  1.269  34  24.4  3.8  28.2  0.11
1,778k  8  1  1.047  1.274  34  27.4  4.0  31.4  0.12
3,675k  16  1  1.047  1.279  37  30.6  4.7  35.3  0.13
7,546k  32  1  1.047  1.283  38  37.8  6.1  43.9  0.16
15,533k  64  2  1.046  1.286  49  44.3  8.5  52.8  0.17
31,081k  128  4  1.046  1.289  41  45.0  7.5  52.5  0.18
62,391k  256  8  1.046  1.291  43  48.4  8.6  57.0  0.20
124,265k  512  16  1.046  1.292  51  52.5  10.8  63.3  0.21
GAMG ($\mu=0.01$), best solution time:
222k  1  1  N/A  1.479  58  11.19  13.72  24.92  0.24
447k  2  1  N/A  1.503  69  10.02  17.12  27.14  0.25
902k  4  1  N/A  1.526  73  11.41  19.19  30.59  0.26
1,778k  8  1  N/A  1.549  78  12.63  21.05  33.68  0.27
3,675k  16  1  N/A  1.571  82  14.83  24.86  39.68  0.30
7,546k  32  1  N/A  1.608  86  23.04  35.52  58.56  0.41
15,533k  64  2  N/A  1.618  93  27.15  39.33  66.48  0.42
31,081k  128  4  N/A  1.703  97  37.48  41.58  79.06  0.43
62,391k  256  8  N/A  1.803  98  58.98  43.37  102.35  0.44
124,265k  512  16  N/A  1.953  100  119.16  50.29  169.45  0.50

For comparison, the same set of problems is solved with GAMG from PETSc. The default values are used for all parameters, as they turned out to be already the best, with the only exception of the threshold for dropping edges in the aggregation graph ($\mu$), whose value is reported alongside the results. The third section of Table 3 shows the GAMG output. The complexities and run times are higher than for Chronos, especially for the setup stage, and the iteration counts are usually between 2 and 3 times those required by Chronos. For this test problem, reducing complexity by setting $\mu=0.0$ is not beneficial, as the increase in the iteration count cancels the setup time reduction. We also note that both the operator complexity and the setup time increase significantly with the refinement level, while energy minimization better limits growth in these quantities. Comparing two codes is fraught with difficulty, but these results do indicate that energy minimization is an efficient approach in parallel for this problem.
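For reproducibility, a configuration along these lines can be expressed, for instance, with petsc4py. The sketch below is our assumption of an equivalent setup (the matrix `A` and vectors `b`, `x` are placeholders), not the exact scripts used for these runs:

```python
from petsc4py import PETSc

# CG preconditioned with GAMG: all defaults, except the threshold for
# dropping edges in the aggregation graph (the parameter mu in the text)
PETSc.Options().setValue('pc_gamg_threshold', 0.01)

ksp = PETSc.KSP().create()
ksp.setType('cg')
ksp.getPC().setType('gamg')
ksp.setFromOptions()          # picks up pc_gamg_threshold

# with an assembled PETSc Mat A and Vecs b, x:
# ksp.setOperators(A)
# ksp.solve(b, x)
```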

5.3 Challenging real-world problems

We now examine the proposed energy minimization approach for a set of challenging real-world problems arising from discretized PDEs in both fluid dynamics and mechanics. The former class consists of problems from the discretization of the Laplace operator, such as underground fluid flow, compressible or incompressible airflow around complex geometries, or porous flow. The latter class consists of mechanical applications such as subsidence analysis, hydrocarbon recovery, gas storage (geomechanics), mesoscale simulation of composite materials (mesoscale), mechanical deformation of human tissues or organs subjected to medical interventions (biomedicine), and design and analysis of mechanical elements, e.g., cutters, gears, air-coolers (mechanical). The selected problems are not only large but also characterized by severe ill-conditioning due to mesh distortions, material heterogeneity, and anisotropy. They are listed in Table 4 with details about the size, the number of nonzeros, and the application field from which they arise.

Table 4: Matrix sizes and numbers of nonzeros for the real-world problems.
matrix  nrows  nterms  avg. nterms/row  application
guenda11m 11,452,398 512,484,300 44.75 geomechanics
agg14m 14,106,408 633,142,730 44.88 mesoscale
tripod20m 19,798,056 871,317,864 44.01 mechanical
M20 20,056,050 1,634,926,088 81.52 mechanical
wing 33,654,357 2,758,580,899 81.97 mechanical
Pflow73m 73,623,733 2,201,828,891 29.91 reservoir
c4zz134m 134,395,551 10,806,265,323 80.41 biomedicine

Table 5 reports the AMG performance on these benchmarks when using the energy minimization procedure, classical prolongation smoothing (one step of weighted Jacobi applied to the tentative prolongation), and GAMG, respectively. The overall best time for each case is marked with a '†'. As before, note that for GAMG only the threshold value for dropping edges in the aggregation graph ($\mu$) has been tuned; all other parameters keep their default values, which were already the best.

With respect to classical prolongation smoothing, the energy minimization procedure reduces the complexities, in particular $C_{op}$, the setup time $T_p$, and also the iteration count (with the only exception being Pflow73m). It is the reduced complexities that allow energy minimization to achieve the lower setup time. The overall gain in total time ranges from about 5% to 55% across the test cases.

Energy minimization also compares favorably with GAMG. GAMG provides a faster total time than energy minimization-based AMG in two cases out of seven, agg14m and M20, and a similar total time on tripod20m. The situation is reversed for the most challenging examples, where GAMG is significantly slower and is unable to solve Pflow73m. We briefly note that, unexpectedly, the ParMETIS partitioning significantly harms GAMG's effectiveness on c4zz134m; for this case, we report GAMG performance on the matrix with its native ordering and mark the test with a '*'. Typically, the GAMG setup is faster than energy minimization, but energy minimization provides a preconditioner of higher quality, which significantly reduces the total time in the most difficult cases. Fig. 4 collects these results, reporting the relative total times, with the setup and solve phases denoted by different shading.

Table 5: Results for the real-world cases using: i) energy minimization AMG; ii) classical prolongation smoothing; and iii) PETSc's GAMG. For GAMG, only $\mu$, i.e., the threshold for dropping edges in the aggregation graph, has been tuned; all other parameters keep their default values, which were already the best. '*' means that the matrix has not been partitioned with ParMETIS. '†' marks the best total time for each case.
case  $N$  $\mu$  $C_{gr}$  $C_{op}$  $n_{it}$  $T_p$ [s]  $T_s$ [s]  $T_t$ [s]
energy minimization:
guenda11m  1  N/A  1.041  1.325  987  314.0  399.0  713.0†
agg14m  1  N/A  1.042  1.322  23  66.7  7.2  74.0
tripod20m  2  N/A  1.049  1.302  104  40.3  22.2  62.6†
M20  2  N/A  1.055  1.304  111  98.0  40.2  138.0
wing  8  N/A  1.055  1.297  140  47.2  25.3  72.5†
Pflow73m  4  N/A  1.028  1.101  1169  225.0  424.0  649.0†
c4zz134m  8  N/A  1.029  1.122  154  72.7  48.8  122.0†
smoothed prolongation:
guenda11m  1  N/A  1.041  1.378  1771  307.0  750.0  1060.0
agg14m  1  N/A  1.042  1.371  48  62.5  15.5  78.1
tripod20m  2  N/A  1.048  1.336  212  34.6  47.5  82.2
M20  2  N/A  1.054  1.733  154  167.0  76.6  244.0
wing  8  N/A  1.055  1.697  301  93.5  71.6  165.1
Pflow73m  4  N/A  1.058  1.371  841  441.3  394.5  836.0
c4zz134m  8  N/A  1.028  1.199  277  79.2  98.9  178.0
GAMG:
guenda11m  1  0.00  N/A  1.524  2237  22.3  1553.9  1576.1
agg14m  1  0.00  N/A  1.557  33  20.6  32.2  52.7†
tripod20m  2  0.01  N/A  1.679  48  32.2  32.2  64.4
M20  2  0.01  N/A  1.203  60  36.7  62.4  99.1†
wing  8  0.01  N/A  1.204  250  34.0  108.4  142.4
Pflow73m  4  —  (not solved)
c4zz134m*  8  0.01  N/A  1.233  156  110.9  250.38  361.33
Figure 4: Comparison in terms of relative total time for the different preconditioners. Darker portions of the bars represent the AMG setup, while lighter portions represent the time spent in the solve phase. The relative baseline is the time taken by the energy minimization phase to build the prolongation.

6 Conclusions

This work provides evidence of the potential of a novel energy minimization procedure for constructing the prolongation operator in AMG. While the theoretical advantages of this idea are well known in the literature, the computational aspects and its practical feasibility in massively parallel software have remained open questions.

With this contribution, we have shown how the energy minimization approach can be effectively implemented in a classical AMG setting, leading to robust and cost-effective prolongation operators when compared to other approaches. We have also shown that, especially for challenging problems, this technique can lead to considerably faster AMG convergence. The prolongation energy can be minimized with several schemes; we have adopted a restricted conjugate gradient method, accelerated by suitable preconditioners. We presented and analyzed Jacobi and Gauss-Seidel preconditioners, restricting our attention to these two because of their applicability in matrix-free mode. The experiments show that Gauss-Seidel has the potential to be faster than Jacobi; however, its efficient parallel implementation is not straightforward and needs further attention.

Weak scalability has been assessed on an elasticity model problem obtained by discretizing a cube with regular tetrahedra. The proposed algorithms have been implemented in a hybrid MPI-OpenMP linear solver, and their performance has been compared to another well-recognized AMG implementation (GAMG) on a set of large and difficult problems arising from real-world applications.

In the future, we plan to further reduce setup time by investigating other preconditioning techniques, such as those described in [17], and to extend this approach to nonsymmetric problems.

References

  • [1] S. Balay, S. Abhyankar, M. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, A. Dener, V. Eijkhout, W. Gropp, et al., PETSc users manual, (2019).
  • [2] M. Benzi, G. H. Golub, and J. Liesen, Numerical solution of saddle point problems, Acta Numer., 14 (2005), pp. 1–137.
  • [3] A. Brandt, J. Brannick, K. Kahl, and I. Livshits, Bootstrap AMG, SIAM Journal on Scientific Computing, 33 (2011), pp. 612–632.
  • [4] A. Brandt, J. Brannick, K. Kahl, and I. Livshits, Algebraic distance for anisotropic diffusion problems: multilevel results, Electronic Transactions on Numerical Analysis, 44 (2015), pp. 472–496.
  • [5] A. Brandt, S. F. McCormick, and J. W. Ruge, Algebraic multigrid (AMG) for automatic multigrid solution with application to geodetic computations, tech. rep., Institute for Computational Studies, Colorado State University, 1982.
  • [6] ———, Algebraic multigrid (AMG) for sparse matrix equations, in Sparsity and Its Applications, D. J. Evans, ed., Cambridge Univ. Press, Cambridge, 1984, pp. 257–284.
  • [7] J. Brannick, F. Cao, K. Kahl, R. D. Falgout, and X. Hu, Optimal interpolation and compatible relaxation in classical algebraic multigrid, SIAM Journal on Scientific Computing, 40 (2018), pp. A1473–A1493.
  • [8] J. J. Brannick and R. D. Falgout, Compatible relaxation and coarsening in algebraic multigrid, SIAM Journal on Scientific Computing, 32 (2010), pp. 1393–1416.
  • [9] M. Brezina, R. Falgout, S. MacLachlan, T. Manteuffel, S. McCormick, and J. Ruge, Adaptive smoothed aggregation ($\alpha$SA) multigrid, SIAM Rev., 47 (2005), pp. 317–346.
  • [10] W. L. Briggs, V. E. Henson, and S. F. McCormick, A multigrid tutorial, SIAM, Philadelphia, PA, USA, 2nd ed., 2000.
  • [11] H. De Sterck, U. M. Yang, and J. J. Heys, Reducing complexity in parallel algebraic multigrid preconditioners, SIAM Journal on Matrix Analysis and Applications, 27 (2006), pp. 1019–1039.
  • [12] R. D. Falgout and P. S. Vassilevski, On generalizing the algebraic multigrid framework, SIAM J. Numer. Anal., 42 (2004), pp. 1669–1693.
  • [13] M. Frigo, G. Isotton, and C. Janna, Chronos web page. https://www.m3eweb.it/chronos, 2021.
  • [14] S. A. Goreinov, I. V. Oseledets, D. Savostyanov, E. E. Tyrtyshnikov, and N. L. Zamarashkin, How to find a good submatrix, tech. rep., Nov. 2008.
  • [15] G. Isotton, M. Frigo, N. Spiezia, and C. Janna, Chronos: a general purpose classical AMG solver for high performance computing, SIAM J. Sci. Comput., 43 (2021), pp. C335–C357.
  • [16] D. E. Knuth, Semi-optimal bases for linear dependencies, Linear and Multilinear Algebra, 17 (1985), pp. 1–4.
  • [17] ———, Solving linear systems of the form $(a+\gamma uu^{t})x=b$ by preconditioned iterative methods, Linear and Multilinear Algebra, (2022).
  • [18] Karypis Lab, ParMETIS - Parallel Graph Partitioning and Fill-reducing Matrix Ordering. http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview, 2022.
  • [19] S. MacLachlan, T. Manteuffel, and S. McCormick, Adaptive reduction-based amg, Numerical Linear Algebra with Applications, 13 (2006), pp. 599–620.
  • [20] V. A. P. Magri, A. Franceschini, and C. Janna, A novel algebraic multigrid approach based on adaptive smoothing and prolongation for ill-conditioned systems, SIAM Journal on Scientific Computing, 41 (2019), pp. A190–A219.
  • [21] J. Mandel, M. Brezina, and P. Vaněk, Energy optimization of algebraic multigrid bases, Computing, 62 (1999), pp. 205–228.
  • [22] T. A. Manteuffel, S. Münzenmaier, J. Ruge, and B. Southworth, Nonsymmetric reduction-based algebraic multigrid, SIAM Journal on Scientific Computing, 41 (2019), pp. S242–S268.
  • [23] T. A. Manteuffel, L. N. Olson, J. B. Schroder, and B. S. Southworth, A root-node-based algebraic multigrid method, SIAM J. Sci. Comput., 39 (2017), pp. S723–S756.
  • [24] T. A. Manteuffel, J. Ruge, and B. S. Southworth, Nonsymmetric algebraic multigrid based on local approximate ideal restriction ($\ell$AIR), SIAM Journal on Scientific Computing, 40 (2018), pp. A4105–A4130.
  • [25] L. N. Olson, J. Schroder, and R. S. Tuminaro, A new perspective on strength measures in algebraic multigrid, Numerical Linear Algebra with Applications, 17 (2010), pp. 713–733.
  • [26] L. N. Olson, J. B. Schroder, and R. S. Tuminaro, A general interpolation strategy for algebraic multigrid using energy minimization, SIAM J. Sci. Comput., 33 (2011), pp. 966–991.
  • [27] J. W. Ruge and K. Stüben, Algebraic multigrid (AMG), in Multigrid Methods, S. F. McCormick, ed., Frontiers Appl. Math., SIAM, Philadelphia, 1987, pp. 73–130.
  • [28] M. Sala and R. S. Tuminaro, A new Petrov-Galerkin smoothed aggregation preconditioner for nonsymmetric linear systems, SIAM J. Sci. Comput., 31 (2008), pp. 143–166.
  • [29] U. Trottenberg, C. Oosterlee, and A. Schüller, Multigrid, Academic Press, London, UK, 2001.
  • [30] UChicago Argonne, LLC and the PETSc Development Team, PCGAMG. https://petsc.org/main/docs/manualpages/PC/PCGAMG/index.html, 2022.
  • [31] P. Vaněk, J. Mandel, and M. Brezina, Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems, Computing, 56 (1996), pp. 179–196.
  • [32] P. S. Vassilevski, Multilevel block factorization preconditioners: Matrix-based analysis and algorithms for solving finite element equations, Springer Science & Business Media, 2008.
  • [33] W. L. Wan, T. F. Chan, and B. Smith, An energy-minimizing interpolation for robust multigrid methods, SIAM J. Sci. Comput., 21 (2000), pp. 1632–1649.
  • [34] J. Xu and L. Zikatanov, Algebraic multigrid methods, Acta Numerica, 26 (2017), pp. 591–721.
  • [35] X. Xu and C.-S. Zhang, On the ideal interpolation operator in algebraic multigrid methods, SIAM Journal on Numerical Analysis, 56 (2018), pp. 1693–1710.