
Parallel Energy-Minimization Prolongation for Algebraic Multigrid

Carlo Janna (M3E S.r.l., via Giambellino 7, 35129 Padova, Italy, and Department ICEA, University of Padova, via Marzolo 9, 35131 Padova, Italy; e-mail [email protected])    Andrea Franceschini (corresponding author; Department ICEA, University of Padova, via Marzolo 9, 35131 Padova, Italy; e-mail [email protected])    Jacob B. Schroder (Department of Mathematics and Statistics, University of New Mexico, Albuquerque, NM 87131, USA; e-mail [email protected])    Luke Olson (Siebel Center for Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave., Urbana, IL 61801, USA; e-mail [email protected])
Abstract

Algebraic multigrid (AMG) is one of the most widely used solution techniques for linear systems of equations arising from discretized partial differential equations. The popularity of AMG stems from its potential to solve linear systems in almost linear time, that is, with O(n) complexity, where n is the problem size. This capability is crucial today, as the increasing availability of massive HPC platforms pushes for the solution of very large problems. The key to a rapidly converging AMG method is a good interplay between the smoother and the coarse-grid correction, which in turn requires an effective prolongation. From a theoretical viewpoint, the prolongation must accurately represent near kernel components and, at the same time, be bounded in the energy norm. For challenging problems, however, ensuring both requirements is not easy and is exactly the goal of this work. We propose a constrained minimization procedure aimed at reducing the prolongation energy while preserving the near kernel components in the span of interpolation. The proposed algorithm is based on previous energy minimization approaches utilizing a preconditioned restricted conjugate gradient method, but has new features and a specific focus on parallel performance and implementation. It is shown that the resulting solver, when applied to large real-world problems from various application fields, exhibits excellent convergence rates and scalability and outperforms some more traditional AMG approaches.

Keywords: Algebraic Multigrid, AMG, Preconditioning, Energy minimization, Prolongation

1 Introduction

With the increasing availability of powerful computational resources, scientific and engineering applications are becoming more demanding in terms of both memory and CPU time. For common methods used in the numerical approximation of partial differential equations (e.g., finite difference, finite volume, or finite element), the resulting system can easily grow to several millions or even billions of unknowns. The efficient solution of the associated sparse linear system of equations

(1) A\mathbf{x}=\mathbf{b},

either as a stand-alone system or as part of a nonlinear solve process, often represents a significant computational expense in the numerical application. Thus, research on sparse linear solvers continues to be a key topic for efficient simulation at large scales. One of the most popular sparse linear solvers is algebraic multigrid (AMG) [5, 6, 27] because of its potential for O(n) computational cost in the number of degrees of freedom n for many problem types.

A fast converging AMG method relies on the complementary action of relaxation (e.g., with weighted Jacobi) and coarse grid correction, which is a projection step focused on eliminating the error that is not reduced by relaxation. Even in a purely algebraic setting, the main algorithmic decisions in multigrid are often based on heuristics for elliptic problems. As a result, for more complex applications, traditional methods often break down, requiring additional techniques to improve accuracy with a careful eye on overall computational complexity.

Even with advanced AMG methods, robustness remains an open problem for a variety of applications, especially in parallel. Yet, there have been several advances in recent years that have significantly improved convergence in a range of settings. Adaptive AMG [19] and adaptive smoothed aggregation [9] are among the early attempts to assess the quality of the AMG setup phase during the setup process, with the ability to adaptively improve the interpolation operators. Later works focus on extending the adaptive ideas to more general settings [20]; in particular, Bootstrap AMG [3] further develops the idea of adaptive interpolation with least-squares interpolation coupled with locally relaxed vectors and multilevel eigenmodes. Other advanced approaches focus on specific AMG components, such as energy minimization of the interpolation operator [21, 33, 28, 26, 23], generalizing the strength of connection procedure [25, 4], or considering the nonsymmetric nature of the problem directly [24, 22].

While AMG robustness and overall convergence has improved with combinations of the advances above, the overarching challenge of controlling cost is persistent. In this paper, we make a number of related contributions with a focus on AMG effectiveness and efficiency at large scale. Our key contributions are as follows:

  • The quality and sparsity of tentative interpolation is improved through a novel utilization of sparse QR and a new process for sparsity pattern expansion that targets locally full-rank matrices for improved mode interpolation constraints;

  • We accompany the energy minimization construction of interpolation with new energy and convergence monitoring, thus limiting the total cost;

  • We apply a new preconditioning technique for the energy minimization process based on Gauss-Seidel applied to the blocks;

  • We present the non-trivial and efficient parallel implementation in detail; and

  • We demonstrate improved convergence and computational complexity with several large scale experiments.

The remainder of the paper is as follows. We begin with the basics of AMG in Section 2. In Section 3, we derive the energy minimization process based on QR factorizations and introduce a method for monitoring the reduction of energy in practice. We describe the efficient parallel implementation in Section 4 and conclude with several numerical experiments in Section 5, along with a discussion on performance.

2 Introduction to Classical AMG

The effectiveness of AMG as a solver depends on the complementary relationship between relaxation and coarse-grid correction, where the error not reduced by relaxation on the fine grid (e.g., with weighted-Jacobi or Gauss-Seidel) is accurately represented on the coarse grid, where a complementary error correction is computed. For a more in-depth introduction to AMG, see the works [10, 29]. Here, we focus our description of AMG on the coarse grid and interpolation setup, which are most relevant to the rest of the paper.

Constructing the AMG coarse grid begins with a partition of the n unknowns of A into a C-F partition of n_{f} fine nodes and n_{c} coarse nodes: \{0,\ldots,n-1\}=\mathcal{C}\cup\mathcal{F}. From this, we assume an ordering of A by F-points followed by C-points:

(2) A=\left[\begin{array}{cc}A_{ff}&A_{fc}\\ A_{fc}^{T}&A_{cc}\end{array}\right],

where, for example, A_{ff} corresponds to entries in A between two F-points. We also assume A is SPD, so that A_{cf}=A_{fc}^{T}. In classical AMG, prolongation takes the form

(3) P=\left[\begin{array}{c}W\\ I\end{array}\right],

where W must be sparse (for efficiency) and represents interpolation from the coarse grid to the fine-grid F-points.

In constructing prolongation of the form Eq. 3, there are two widely accepted guidelines, the so-called ideal [8, 35] and optimal [34, 7] forms of prolongation. Although neither is feasible in practical applications, as both lead to very expensive and dense prolongation operators, the concepts behind their definitions are valuable guides for constructing an effective P.

Ideal prolongation is constructed by starting with the above C-F partition and constructing P_{\text{id}} as

(4) P_{\text{id}}=\left[\begin{array}{c}-A_{ff}^{-1}A_{fc}\\ I\end{array}\right].

Making P_{\text{id}} the goal for interpolation is motivated by Corollary 3.4 from the theoretical work [12]. Here, the main assumption is a classical AMG framework where P is of the form in equation (3). (The other assumptions are specific choices for the map to F-points, S=[I,0], for the map to C-points, R=[0,I], and for relaxation, X=\|A\|I.) In this setting, the choice W=-A_{ff}^{-1}A_{fc} minimizes the two-grid convergence of AMG relative to the choice of P, i.e., with relaxation fixed. Motivating our later energy minimization approach, P_{\text{id}} can be viewed as having zero-energy rows, as AP_{\text{id}} is zero at all F-rows. Additionally, this classical AMG perspective likely makes the task of energy minimization easier, in that the conditioning of A_{ff} is usually superior to that of A.
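As a concrete illustration of Eq. 4, the sketch below (Python with NumPy; the function name and arguments are ours and purely illustrative) assembles P_{\text{id}} for a small dense SPD matrix given a C-F splitting. It is a didactic sketch rather than a practical kernel: the dense solve with A_{ff} is precisely what makes ideal prolongation infeasible at scale.

import numpy as np

def ideal_prolongation(A, fpts, cpts):
    # A: small dense SPD matrix; fpts, cpts: index arrays of the C-F splitting.
    # Builds P_id = [-A_ff^{-1} A_fc; I] of Eq. (4); the F-rows of A @ P_id vanish.
    Aff = A[np.ix_(fpts, fpts)]
    Afc = A[np.ix_(fpts, cpts)]
    W = -np.linalg.solve(Aff, Afc)   # dense solve: feasible only for tiny problems
    P = np.zeros((A.shape[0], len(cpts)))
    P[np.asarray(fpts)] = W
    P[np.asarray(cpts)] = np.eye(len(cpts))
    return P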

With optimal interpolation, the goal of interpolation is to capture the algebraically smoothest modes in \mathrm{span}(P), i.e., the modes left behind by relaxation. More specifically, following [7], let \lambda_{1}\leq\lambda_{2}\leq\dots\leq\lambda_{n} and v_{1},v_{2},\dots,v_{n} be the ordered eigenvalues and eigenvectors of the generalized eigenvalue problem Ax=\lambda\tilde{M}x, where \tilde{M} is the symmetrized relaxation matrix, e.g., the diagonal of A for Jacobi. (See the work [7] for more details.) Then, the two-grid convergence of AMG is minimized if

(5) \mathrm{span}(P)=\mathrm{range}(v_{1},v_{2},\dots,v_{n_{c}}).

Note that, unlike in equation (3), no assumptions on the structure of P are made. Motivating our later energy minimization approach, equation (5) indicates that \mathrm{span}(P) should capture low-energy modes relative to relaxation, which our Jacobi or Gauss-Seidel preconditioned energy minimization approach will explicitly target. Moreover, our energy minimization approach will incorporate constraints that explicitly force certain vectors to be in \mathrm{span}(P), where these vectors are chosen to represent the v_{i} with smallest eigenvalues.

The idea of energy minimization AMG with constraints has been exploited for both symmetric and non-symmetric operators in several works [21, 33, 28, 26, 23], and, though requiring more computational effort than classical interpolation formulas, often provides improved preconditioners that balance the extra cost.

3 Energy minimization prolongation

The energy minimization process combines the key aspects of ideal and optimal prolongation. To define this, we first introduce V, a basis for the near kernel of A, i.e., the lowest energy modes of A. Then, energy minimization seeks to satisfy two requirements:

  1. Range: the range of prolongation must include the near kernel V:

     (6) V\subseteq\mathrm{range}(P);

  2. Minimal: the energy of each column of P is minimized:

     (7) P=\operatorname*{argmin}_{P}\left(\operatorname{tr}(P^{T}AP)\right).

To construct a P that contains V in its range and has minimal energy, we next introduce the key components needed by most energy minimization approaches, namely

  i) a sparsity pattern for efficient application of P;

  ii) a constraint to enforce Eq. 6; and

  iii) an approximate block diagonal linear system for solving Eq. 7.

For practical use in AMG, the prolongation operator must be sparse; therefore, the construction begins by defining a sparse non-zero pattern for P. Assume that a strength of connection (SoC) matrix S is provided, where nonzero entries denote a strong coupling between degrees-of-freedom [27, 25, 4]. Next, let P_{0} be a tentative prolongation with non-zero pattern \mathcal{P}_{0}, to be used as an initial guess for P. (See Section 3.2 for our contributions regarding the construction of P_{0}, which is based on the adaptive algorithm [15].) P_{0} can be defined similarly to the tentative prolongation from smoothed aggregation AMG [31], in that P_{0} interpolates the basis V, but needs further improvement, for example, with energy minimization. We next obtain an enlarged sparsity pattern \mathcal{P} by growing \mathcal{P}_{0} to include all strongly connected neighbors up to distance k. Denoting by \overline{P} the binary matrix obtained from P by replacing its non-zeros with unit entries, this is equivalent to

(8) \overline{P}_{k}=S^{k}\overline{P}_{0},

where \mathcal{P} is the pattern of \overline{P}_{k} (see [26]).
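In code, the pattern growth of Eq. 8 amounts to repeated boolean sparse matrix products. A minimal sketch with SciPy CSR matrices follows (function and variable names are ours):

import numpy as np
import scipy.sparse as sp

def expand_pattern(S, P0, k):
    # Binary copies of S and P0, so that only the patterns propagate.
    Sb = sp.csr_matrix((np.ones_like(S.data), S.indices, S.indptr), shape=S.shape)
    Pk = sp.csr_matrix((np.ones_like(P0.data), P0.indices, P0.indptr), shape=P0.shape)
    for _ in range(k):
        Pk = Sb @ Pk          # each product adds one layer of strong neighbors
        Pk.data[:] = 1.0      # keep the matrix binary
    return Pk                 # pattern of S^k * P0, i.e., the enlarged pattern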

For a constraint condition satisfying Eq. 6, we start by splitting the near kernel basis V with the same C-F splitting as in Eq. 2:

(9) V=\left[\begin{array}{c}V_{f}\\ V_{c}\end{array}\right].

Recalling the form of interpolation Eq. 3, the near kernel requirement for P becomes

(10) W\;V_{c}=V_{f},

which is a set of n_{f} conditions on the rows of W. Denoting by w_{i}^{T} the i-th row of W (and of P) and by v_{i}^{T} the i-th row of V, condition Eq. 10 is then expressed row-wise as

(11) V_{c}^{T}w_{i}=v_{i}\qquad\forall i\in\mathcal{F}.

Using the sparsity pattern \mathcal{P}, we rewrite Eq. 11 for only the nonzeros in each row w_{i}. Letting the index set \mathcal{J}_{i} be the column indices in the i-th row of P, this becomes

(12) V_{c}(\mathcal{J}_{i},:)^{T}\overline{w_{i}}=v_{i}\qquad\forall i\in\mathcal{F},

where \overline{w_{i}}=w_{i}(\mathcal{J}_{i}) collects only the nonzeros of w_{i}. It is important to note that, for each of the n_{f} fine points, the constraints Eq. 12 are independent of each other, because each entry of W appears in only one constraint. Denote by \widetilde{p} the column vector collecting the nonzero entries of P row-wise. Similarly, denote by p the column vector containing the nonzero entries of P column-wise. By definition, \widetilde{p} and p have the same size, equal to the number of nonzeros in P. Next, we write the constraints in the following matrix form:

(13) \widetilde{B}^{T}\widetilde{p}=\widetilde{g},

where \widetilde{g} collects the v_{i} in Eq. 12 into a single vector, and where \widetilde{B}^{T} is a block diagonal matrix composed of the blocks V_{c}(\mathcal{J}_{i},:)^{T}, due to the independence of the constraints.
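For concreteness, the diagonal blocks V_{c}(\mathcal{J}_{i},:)^{T} of \widetilde{B}^{T} and the corresponding right-hand sides v_{i} can be gathered row by row. A short sketch, assuming the pattern of P is available as a SciPy CSR matrix and V, V_{c} are dense NumPy arrays (all names are ours):

def gather_constraint_blocks(P_pat, V, Vc, fpts):
    # For each fine row i, J_i holds the nonzero columns of row i of P;
    # the local block of B^T is Vc(J_i,:)^T and the local rhs is v_i (Eq. 12).
    blocks, rhs = {}, {}
    for i in fpts:
        Ji = P_pat.indices[P_pat.indptr[i]:P_pat.indptr[i + 1]]
        blocks[i] = Vc[Ji, :].T
        rhs[i] = V[i, :]
    return blocks, rhs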

Lastly, we describe the reduced linear system framework for approximating Eq. 7. Minimizing Eq. 7 is equivalent to minimizing the energy of the individual columns of P on the prescribed nonzero pattern:

(14) p_{i}=\operatorname*{argmin}_{p_{i}\in\mathcal{P}_{i}}p_{i}^{T}Ap_{i},

where p_{i} is the i-th column of P and \mathcal{P}_{i} is the sparsity pattern of column i.

Next, let \mathcal{I}_{i} be the set of nonzero row indices of the i-th column of W and let \overline{h}_{i} be the vector collecting the nonzero entries of the i-th column of W. Then the minimization in Eq. 14 defines \overline{h}_{i} through

(15) A(\mathcal{I}_{i},\mathcal{I}_{i})\overline{h}_{i}=-A(\mathcal{I}_{i},i)\qquad\forall i\in\mathcal{C},

where A(\mathcal{I}_{i},\mathcal{I}_{i}) is a square, relatively dense submatrix of A corresponding to the allowed nonzero indices \mathcal{I}_{i}, and A(\mathcal{I}_{i},i) is the vector corresponding to the i-th column of A at the allowed nonzero indices. Also in this case, each column of P satisfies Eq. 14 independently. Thus, denoting by p the column vector collecting the nonzero entries of P column-wise (i.e., a rearranged version of \widetilde{p}) and by f the vector collecting each -A(\mathcal{I}_{i},i), the minimization Eq. 14 is recast as

(16) Kp=f,

where K is block diagonal with A(\mathcal{I}_{i},\mathcal{I}_{i}) as its i-th block.
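The blocks of K are likewise simple gathers from A. A dense-A sketch follows (illustrative only; in practice K is never stored, as discussed in Section 4):

import numpy as np

def gather_K_blocks(A, col_patterns, cpts):
    # col_patterns[j] holds I_j, the allowed F-row indices of the j-th column
    # of W; cpts[j] is the global index of the j-th coarse point. The j-th
    # block of K is A(I_j, I_j) and the rhs block is -A(I_j, j), as in Eq. (15).
    K_blocks, f_blocks = [], []
    for Ij, j in zip(col_patterns, cpts):
        K_blocks.append(A[np.ix_(Ij, Ij)])
        f_blocks.append(-A[Ij, j])
    return K_blocks, f_blocks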

The two conditions, one on the range of P and one on the minimality of P, together form a constrained minimization problem, whose solution is the desired energy minimal prolongation. Casting this problem using Lagrange multipliers results in the saddle point system

(17) \left[\begin{array}{cc}K&B\\ B^{T}&0\end{array}\right]\left[\begin{array}{c}p\\ \lambda\end{array}\right]=\left[\begin{array}{c}f\\ g\end{array}\right].

The elements in Eq. 17 are the same as those defined in Eqs. 13 and 16, except that \widetilde{B}, \widetilde{g}, and \widetilde{p} are reordered following the columns of P, and \lambda is the vector of Lagrange multipliers, whose values are not needed for the purpose of setting up the prolongation. We emphasize that, with the entries of p enumerated column-wise with respect to P, K is block diagonal. Likewise, if p is enumerated following the rows of P, then B becomes block diagonal. Unfortunately, there is no ordering of p able to make both K and B block diagonal at the same time. Nevertheless, it is possible to take advantage of this underlying structure in the numerical implementation, as will be shown later. Leveraging the block structure of B is also important because, as we will see in Section 3.1, our algorithm to minimize the energy requires several applications of the orthogonal projector given as

(18) \Pi_{B}=I-B(B^{T}B)^{-1}B^{T}.

The system Eq. 17 follows closely the method from [26]. In Sections 3.2–3.4, we outline our proposed improvements to energy minimization.

3.1 Minimization through Krylov subspace methods

Following [26], energy minimization proceeds by starting with a tentative prolongation, P_{0}, that satisfies the near kernel constraints (see Eq. 10). Denoting by p_{0} the tentative prolongation in vector form, with nonzero entries collected column-wise (here the subscript 0 does not refer to a specific column of P, as p_{i} does in Eq. (14)), these constraints read

(19) B^{T}p_{0}=g.

Defining the final prolongation as the tentative p_{0} plus a correction \delta p gives

(20) p=p_{0}+\delta p.

Then, the problem is recast as finding the optimal correction \Delta P^{*}:

(21) \Delta P^{*}=\operatorname*{argmin}_{\Delta P\in\mathcal{P}}\left(\operatorname{tr}((P_{0}+\Delta P)^{T}A(P_{0}+\Delta P))\right),

subject to the constraint B^{T}\delta p=0, where \delta p is the vector form of \Delta P with nonzero entries again collected column-wise. Recalling that \Delta P has non-zero components only in W, i.e., \Delta P=[\Delta W^{T},0]^{T}, and using the C-F partition (2), we write

(22) \begin{split}\operatorname{tr}((P_{0}+\Delta P)^{T}&A(P_{0}+\Delta P))=\\ &=\underbrace{\operatorname{tr}(\Delta W^{T}A_{ff}\Delta W)+2\operatorname{tr}(W_{0}^{T}A_{ff}\Delta W)+2\operatorname{tr}(A_{fc}^{T}\Delta W)}_{\text{dependent on $\Delta P$}}\\ &+\underbrace{\operatorname{tr}(W_{0}^{T}A_{ff}W_{0})+2\operatorname{tr}(W_{0}^{T}A_{fc})+\operatorname{tr}(A_{cc})}_{\text{independent of $\Delta P$}}.\end{split}

Using the preset non-zero pattern for \Delta W, the problem is rewritten in vector form to minimize only the terms depending on \Delta P:

(23) \operatorname*{argmin}_{\delta w}(\delta w^{T}K\delta w+2w_{0}^{T}K\delta w+2f^{T}\delta w),

subject to the constraint

(24) B^{T}\delta w=0,

where K and f are defined as in Eq. 17, and \delta w and w_{0} are the vector forms of \Delta W and W_{0}, respectively, with the nonzero entries collected column-wise.

This minimization can be performed using (preconditioned) conjugate gradients by ensuring that both the initial solution and the search direction satisfy the constraint [26]. To do this, return to the orthogonal projector

(25) \Pi_{B}=I-B(B^{T}B)^{-1}B^{T},

and apply conjugate gradients to the singular system

(26) \Pi_{B}K\Pi_{B}\,\delta w=-\Pi_{B}(f+Kw_{0}),

starting from \delta w=0. Due to its block diagonal structure, it is straightforward to compute a QR decomposition of B, B=QR, and the projection simply becomes:

(27) \Pi_{B}=I-QQ^{T}.

Finally, introducing K_{\Pi}=\Pi_{B}K\Pi_{B} and \bar{f}=f+Kw_{0}, the Krylov subspace built by conjugate gradients is:

(28) \mathcal{K}_{m}=\mathrm{span}\{\Pi_{B}\bar{f},K_{\Pi}\bar{f},K_{\Pi}^{2}\bar{f},\dots,K_{\Pi}^{m}\bar{f}\}.

This is equivalent to applying the nullspace method [2] to the saddle-point system Eq. 17.
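A compact sketch of the resulting projected (here unpreconditioned) CG iteration follows, with apply_K and apply_Pi standing for the actions of K and of the projector in Eq. 27 (our names; dense vectors for clarity):

import numpy as np

def projected_cg(apply_K, apply_Pi, f, w0, maxit=20, tol=1e-8):
    # Solve Pi_B K Pi_B dw = -Pi_B (f + K w0) starting from dw = 0 (Eq. 26).
    dw = np.zeros_like(f)
    r = -apply_Pi(f + apply_K(w0))
    p = r.copy()
    rho = rho0 = r @ r
    for _ in range(maxit):
        q = apply_Pi(apply_K(p))     # projection keeps iterates in ker(B^T)
        alpha = rho / (p @ q)
        dw += alpha * p
        r -= alpha * q
        rho_new = r @ r
        if rho_new <= tol**2 * rho0:
            break
        p = r + (rho_new / rho) * p
        rho = rho_new
    return dw                        # the correction delta w of Eq. (20)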

3.2 Improved tentative interpolation P_{0} and orthogonal projection \Pi_{B} with sparsity pattern expansion and sparse QR

A crucial point for energy minimization interpolation is the availability of a tentative prolongation P_{0} that satisfies the near kernel representability constraint Eq. 10. While this is relatively straightforward for scalar diffusion equations, where V has only one column, it is not trivial for vector-valued PDEs such as elasticity. One specific difficulty is that, when forming the i-th constraint equation Eq. 12, we must ensure that V_{c}(\mathcal{J}_{i},:) is full-rank. If it is not, then no prolongation operator is able to satisfy Eq. 10 and the solution to Eq. 17 is not possible in general. We consider two possible remedies:

  1. Add strongly connected neighbors to the pattern of the i-th row of P to enlarge \mathcal{J}_{i} until V_{c}(\mathcal{J}_{i},:) is full-rank (i.e., sparsity pattern expansion); or

  2. Compute the least-squares solution of (12), as is done in [26].

A novel aspect of this work is our pursuit of sparsity pattern expansion. We find that this careful construction of the sparsity pattern, which guarantees that each constraint is exactly satisfied because V_{c}(\mathcal{J}_{i},:) is always full-rank, greatly improves performance on some problems.

To accomplish this task, we adopt a dynamic-pattern, least-squares fit (LSF) procedure that satisfies Eq. 10, or equivalently Eq. 19. For each row of W (corresponding to a fine node i), this is equivalent to satisfying the local dense system Eq. 12. For simplicity, we rewrite equation Eq. 12 by dropping the row subscript i, with \mathbb{w}=\overline{w_{i}} and \mathbb{v}=v_{i}, yielding

(29) \mathbb{B}\mathbb{w}=\mathbb{v},

where \mathbb{B}=V_{c}(\mathcal{J}_{i},:)^{T} corresponds to a diagonal block of B^{T} in Eq. 17, when the non-zero entries of P are enumerated row-wise.

Considering this generic FINE node represented by equation Eq. 29, if there is a sufficient number of COARSE node neighbors, then \mathbb{B} has more columns than rows. Hence, if we assume a full-rank \mathbb{B}, then Eq. 29 is an underdetermined system and can be solved in several ways. In order to have a sparse solution \mathbb{w}, we choose a minimal set of columns of \mathbb{B} using the max vol algorithm [16, 14] to obtain the best basis. Here, we satisfy Eq. 29 exactly. We note that a related max vol approach to computing C-F splittings is used in [7].
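We do not reproduce max vol here; as a rough stand-in, a column-pivoted QR can also select a well-conditioned square column subset of \mathbb{B}. The sketch below is a simple proxy for, not an implementation of, the max vol algorithm of [16, 14]:

import numpy as np
import scipy.linalg as sla

def select_basis_columns(Bloc):
    # Bloc: wide local block with full row rank (enough coarse neighbors).
    # Column-pivoted QR ranks the columns; the first rows(Bloc) pivots give
    # a square, well-conditioned basis, playing the role of max vol here.
    _, _, piv = sla.qr(Bloc, pivoting=True)
    return np.sort(piv[:Bloc.shape[0]])   # indices of the retained columns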

Remark 3.1.

We adopt this form of a QR factorization with max vol, which is as sparse as possible, in order to improve the complexity of our algorithm and the quality of P_{0}. While it is a relatively minor change to the algorithm's structure, we count it as a useful novelty of our efficient implementation.

If, on the contrary, the number of neighboring COARSE nodes is not sufficient (\mathbb{v}\notin\mathrm{span}(\mathbb{B})), then Eq. 29 cannot be satisfied because it is overdetermined. This may occur not only when \mathbb{B} is skinny, i.e., the number of columns is smaller than the number of rows, but more often because \mathbb{B} is rank deficient even when it has a larger number of columns, i.e., it is wide. In elasticity problems, and in particular with shell finite elements, this issue arises often in practice with standard distance-one coarsening, where some FINE nodes may occasionally remain isolated during the coarsening process. Our strategy is to gradually increase the interpolation distance for violating nodes where \mathbb{v}\notin\mathrm{span}(\mathbb{B}), thus widening the interpolatory set (i.e., adding columns to \mathbb{B}) where necessary. Algorithm 1 describes how to set up \widehat{p_{0}}, the vector form of the initial tentative prolongation \widehat{P_{0}}. (Algorithm 2, described later, will further process \widehat{P_{0}} into the final tentative prolongation P_{0}.)

Algorithm 1 Tentative Prolongation Set-Up
1:procedure PTent_SetUp(S, V, l_{max})
2:     input: S – strength of connection matrix
3:              V – near kernel modes
4:              l_{max} – maximum interpolation distance
5:     output: \widehat{p_{0}} – initial tentative prolongation
6:     for all FINE nodes i do
7:         Set l=0;
8:         while l<l_{max} do
9:              Set l=l+1;
10:              Form \mathcal{N}_{i} with the strong neighbors of i up to distance l;
11:              Select the best columns of V_{c}(\mathcal{N}_{i},:)^{T} using max vol to form \mathbb{B};
12:              Compute the least-squares solution to \mathbb{B}\mathbb{w}=\mathbb{v};
13:              if \|\mathbb{v}-\mathbb{B}\mathbb{w}\|_{2}=0 then
14:                  Assign \mathbb{w}^{T} as the i-th row of \widehat{p_{0}};
15:                  break;
16:              end if
17:         end while
18:     end for
19:end procedure
Remark 3.2.

Avoiding a skinny or rank-deficient local \mathbb{B} block is also important for the construction of the orthogonal projection \Pi_{B}=I-B(B^{T}B)^{-1}B^{T} that maps vectors of \mathbb{R}^{n} to \mathrm{Ker}(B^{T}). Note that \Pi_{B} is used not only to correct the conjugate gradients search direction, but also to ensure the initial prolongation satisfies the near kernel constraint. In general, the size of \mathbb{B} during energy minimization is larger than when constructing the tentative prolongation, because of the additional sparsity pattern expansion in (8). Thus, this “rank-deficiency” issue is ameliorated, but there are pathological cases where it has been observed in practice.

We now describe our procedure for computing the local blocks of \Pi_{B}, called \Pi_{\mathbb{B}}, and a single row of the final tentative prolongation P_{0}, called \mathbb{w}_{0}^{T}. The global procedure to build P_{0} and \Pi_{B} is then obtained by repeating this local algorithm for each FINE row. Denote by \widehat{\mathbb{w}}_{0}^{T} the starting tentative prolongation row. We use the word starting because, in general, we may receive a tentative prolongation that does not satisfy the constraint. That is, we may have:

(30) \mathbb{B}\widehat{\mathbb{w}}_{0}\neq\mathbb{v}.

Denote by n_{l} and m_{l} the dimensions of the local system, so that \mathbb{B}\in\mathbb{R}^{n_{l}\times m_{l}}. To fulfill condition Eq. 29, we must find a correction to \widehat{\mathbb{w}}_{0}, say \mathbb{\delta}, such that:

(31) \mathbb{B}\mathbb{\delta}=\mathbb{v}-\mathbb{B}\widehat{\mathbb{w}}_{0}=\mathbb{r},

and then set:

(32) \mathbb{w}_{0}=\widehat{\mathbb{w}}_{0}+\mathbb{\delta}.

To enforce condition Eq. 24 efficiently, we construct an orthonormal basis of \mathrm{range}(\mathbb{B}), say Q, that gives rise to the desired local orthogonal projector:

(33) \Pi_{\mathbb{B}}=I-QQ^{T}.

If the prolongation pattern is large enough, the vast majority of the above local problems will be such that n_{l}\geq m_{l}, with \mathbb{B} also being full-rank, i.e., \mathrm{rank}(\mathbb{B})=m_{l}. Thus, an economy-size QR decomposition is first performed on \mathbb{B}, \mathbb{B}=QR with Q\in\mathbb{R}^{n_{l}\times m_{l}} and R\in\mathbb{R}^{m_{l}\times m_{l}}, and Q is used to form the local projector. Then, through the same QR decomposition, we compute \mathbb{\delta} as the least-norm solution of the underdetermined system (31):

(34) \mathbb{\delta}=(\mathbb{B}^{T}\mathbb{B})^{-1}\mathbb{B}^{T}\mathbb{r}=(R^{T}Q^{T}QR)^{-1}R^{T}Q^{T}\mathbb{r}=R^{-1}Q^{T}\mathbb{r}.

Note that any solution to (31) would be equivalent, as the optimal choice in terms of global prolongation energy is later computed by the restricted CG algorithm. If the initial tentative prolongation arises from the LSF set-up, it should already fulfill Eq. 31 with \mathbb{r}\equiv 0. However, in the most difficult cases, even extending the interpolatory set with a large distance l_{max} is not sufficient to guarantee an exact interpolation of the near kernel for all the FINE nodes. For these FINE nodes, i.e., when n_{l}<m_{l} or \mathbb{B} is not full-rank, we compute an SVD decomposition of \mathbb{B}:

(35) \mathbb{B}=U\Sigma V^{T}.

From the diagonal of \Sigma, we determine the rank of \mathbb{B}, say k_{l}, and use the first k_{l} columns of U to form Q and thus \Pi_{\mathbb{B}}. Finally, since in this case it could be impossible to satisfy the constraint because the system is overdetermined, we use the least-squares solution to Eq. 31 to compute the correction for \widehat{\mathbb{w}}_{0}:

(36) \mathbb{\delta}=V\Sigma^{\dagger}U^{T}\mathbb{r}.

It is important to recognize that, for these specific FINE nodes, energy minimization cannot reduce the energy, because the constraint does not leave any degree of freedom. Consequently, this situation should be avoided when selecting the COARSE variables and the prolongation pattern, because a prolongation violating the near kernel constraint will likely fail to represent certain smooth modes. The pseudocode to set up \Pi_{B} and correct the initial tentative prolongation P_{0} (after Algorithm 1) is provided in Algorithm 2.

Algorithm 2 Energy Minimization Set-Up
1:procedure EMIN_SetUp(B, g, \widehat{p_{0}})
2:     input: B – block diagonal constraint matrix, as in equation (19)
3:              g – constraint right-hand side
4:              \widehat{p_{0}} – initial tentative prolongation
5:     output: \Pi_{B} – projection matrix, constructed block-wise
6:                p_{0} – final tentative prolongation
7:     for all FINE nodes i do
8:         Gather \mathbb{B}, \mathbb{v} and \widehat{\mathbb{w}}_{0} for row i, as in equation (30);
9:         Compute \mathbb{r}=\mathbb{v}-\mathbb{B}\widehat{\mathbb{w}}_{0};
10:         FAIL_QR = false;
11:         if n_{l}\geq m_{l} then
12:              Compute economy-size QR of \mathbb{B}: \mathbb{B}=QR;
13:              if rank(R)<m_{l} then
14:                  Set FAIL_QR = true;
15:              else
16:                  Compute \mathbb{\delta}=R^{-1}Q^{T}\mathbb{r};
17:              end if
18:         end if
19:         if n_{l}<m_{l} or FAIL_QR then
20:              Compute SVD of \mathbb{B}: \mathbb{B}=U\Sigma V^{T};
21:              Determine k=rank(\mathbb{B}) using the diagonal of \Sigma;
22:              Set Q=U(:,1:k);
23:              Compute \mathbb{\delta}=V\Sigma^{\dagger}U^{T}\mathbb{r};
24:         end if
25:         Compute \mathbb{w}_{0}=\widehat{\mathbb{w}}_{0}+\mathbb{\delta} and insert \mathbb{w}_{0}^{T} as row i of p_{0};
26:         Set the i-th block of \Pi_{B} as I-QQ^{T};
27:     end for
28:end procedure
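A per-row sketch of the numerical core of Algorithm 2 in Python with NumPy (the function name and the rank tolerance are ours):

import numpy as np

def emin_row_setup(Bloc, v, w0_hat, rtol=1e-12):
    # One FINE row: Bloc is the local n_l x m_l block, v the local constraint
    # rhs, and w0_hat the starting prolongation row, as in Eq. (30).
    r = v - Bloc @ w0_hat
    nl, ml = Bloc.shape
    fail_qr = True
    if nl >= ml:
        Q, R = np.linalg.qr(Bloc)                 # economy-size QR
        d = np.abs(np.diag(R))
        if d.min() > rtol * max(d.max(), 1.0):    # full rank: exact fit, Eq. (34)
            delta = np.linalg.solve(R, Q.T @ r)
            fail_qr = False
    if fail_qr:                                   # skinny or rank deficient
        U, s, Vt = np.linalg.svd(Bloc, full_matrices=False)
        k = int(np.sum(s > rtol * s.max()))       # rank from the diagonal of Sigma
        Q = U[:, :k]
        delta = Vt[:k].T @ ((Q.T @ r) / s[:k])    # least-squares fit, Eq. (36)
    w0 = w0_hat + delta                           # corrected row, Eq. (32)
    Pi_loc = np.eye(Q.shape[0]) - Q @ Q.T         # local block of Pi_B, Eq. (33)
    return w0, Pi_loc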

3.3 Improved stopping criterion and energy monitoring for CG-based energy minimization

Stopping criteria play an important role in the overall cost and effectiveness of energy minimization. Here, we introduce a measure for monitoring energy and halting the algorithm.

Since CG often converges quickly for energy minimization, it is common to fix the number of iterations in advance [21, 26]. However, for more challenging problems, several iterations may be needed, thus requiring an accurate stopping criterion. One immediate option is the relative residual, yet this may not be a close indicator of the energy. In the following, we analyze CG for a generic Ax=b; our observations extend to PCG as well.

In the CG algorithm, once the search direction p_{k} is defined at iteration k, the scalar \alpha_{k} is computed:

(37) \alpha_{k}=\frac{p_{k}^{T}r_{k}}{p_{k}^{T}Ap_{k}},

in such a way that the new approximation x_{k+1}=x_{k}+\alpha_{k}p_{k} minimizes the square of the energy norm of the error, i.e.,

(38) E_{k}=(x_{k}-h)^{T}A(x_{k}-h),

where h=A^{-1}b is the true solution. The difference in energy \Delta E_{k+1}=E_{k+1}-E_{k} between two successive iterations k and k+1 can be computed as

(39) \begin{split}\Delta E_{k+1}&=(x_{k}+\alpha_{k}p_{k})^{T}A(x_{k}+\alpha_{k}p_{k})-2b^{T}(x_{k}+\alpha_{k}p_{k})-x_{k}^{T}Ax_{k}+2b^{T}x_{k}\\ &=2\alpha_{k}x_{k}^{T}Ap_{k}+\alpha_{k}^{2}p_{k}^{T}Ap_{k}-2\alpha_{k}b^{T}p_{k}=-2\alpha_{k}r_{k}^{T}p_{k}+\alpha_{k}^{2}p_{k}^{T}Ap_{k}\\ &=\alpha_{k}(-2r_{k}^{T}p_{k}+\alpha_{k}p_{k}^{T}Ap_{k})=-\alpha_{k}(p_{k}^{T}r_{k})=-\frac{(p_{k}^{T}r_{k})^{2}}{p_{k}^{T}Ap_{k}}<0.\end{split}

From Eq. 37 and Eq. 39, it is possible to measure, at minimal cost, the energy decrease provided by the (k+1)-st iteration. Indeed, noting that \alpha_{k} is computed as the ratio between the two values \alpha_{\text{num}}=p_{k}^{T}r_{k} and \alpha_{\text{den}}=p_{k}^{T}Ap_{k}, the magnitude of the energy decrease reads

(40) \Delta E_{k+1}=\frac{\alpha_{\text{num}}^{2}}{\alpha_{\text{den}}}.

The relative value of the energy variation with respect to the initial variation (first iteration) is monitored and convergence is achieved when energy is sufficiently reduced:

(41) \frac{\Delta E_{k}}{\Delta E_{1}}\leq\tau

for a small user-defined \tau.
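Since \alpha_{\text{num}} and \alpha_{\text{den}} are already formed by CG, the monitor is essentially free. A minimal self-contained sketch for a generic SPD system follows (our names; the same test appears at lines 25-26 of Algorithm 3 below):

import numpy as np

def cg_energy_stop(A, b, tau=0.1, maxit=100):
    # Plain CG on SPD Ax = b that stops when the prospective energy decrease
    # Delta E = (p^T r)^2 / (p^T A p) falls below tau * Delta E_1 (Eq. 41).
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rho = r @ r
    dE1 = None
    for _ in range(maxit):
        Ap = A @ p
        a_num = p @ r                 # equals rho in exact arithmetic
        a_den = p @ Ap
        dE = a_num**2 / a_den         # energy decrease of Eq. (40)
        if dE1 is None:
            dE1 = dE
        if dE <= tau * dE1:           # relative energy test of Eq. (41)
            break
        x += (a_num / a_den) * p
        r -= (a_num / a_den) * Ap
        rho_new = r @ r
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x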

3.4 Improved preconditioning for CG-based energy minimization

Before introducing PCG, we present some important properties of the matrices K and B that we leverage in the design of effective preconditioners for energy minimization. In particular, we will see that, thanks to the non-zero structures of K and B, point-wise Jacobi or Gauss-Seidel iterations prove particularly effective in preconditioning the projected block \Pi_{B}K\Pi_{B}.

Let us assume that the vector form of the prolongation, p, has been obtained by collecting the non-zeroes of P row-wise, so that B is block diagonal. Denoting by Q the matrix collecting an orthonormal basis of \mathrm{range}(B) and by Z an orthonormal basis of \mathrm{ker}(B^{T}), by construction we have

(42) [Q\;\;Z][Q\;\;Z]^{T}=[Q\;\;Z]^{T}[Q\;\;Z]=I,

i.e., the matrix [Q\;\;Z] is square and orthogonal. Moreover, since B is block diagonal, both Q and Z are block diagonal and can be easily computed and stored. We note that, by construction, each column of Q and Z refers to a specific row of the prolongation and, due to the block diagonal pattern chosen for B, each column of Q and Z is non-zero only in the positions corresponding to the entries of p collecting the non-zeroes of that specific row of P, as schematically shown in Fig. 1. More precisely, using the same notation as in Eq. 12, let us define \mathcal{L}_{i} as the set of indices in p corresponding to the non-zero entries in the i-th row of P, so that:

(43) p(\mathcal{L}_{i})=P(i,\mathcal{J}_{i}),

and define \mathcal{J}_{B,i} and \mathcal{J}_{Z,i} as the sets of columns of B and Z referring to the i-th row of P. Then,

(44) \left.\begin{array}{l}B(k,\mathcal{J}_{B,i})=0\\ Q(k,\mathcal{J}_{B,i})=0\\ Z(k,\mathcal{J}_{Z,i})=0\end{array}\right\}\quad\mbox{if}\quad k\notin\mathcal{L}_{i}.
Figure 1: Non-zero pattern of the columns of B, Q and Z corresponding to a specific row of the prolongation P.

We are now ready to state two results that will be useful in explaining the choice of our preconditioners.

Theorem 3.1.

The diagonal of the projected matrix Z^{T}KZ is equal to the projection of the diagonal of K:

(45) \mathrm{diag}(Z^{T}KZ)=\mathrm{diag}(Z^{T}D_{K}Z),

where D_{K} is the matrix collecting the diagonal entries of K. Moreover, Z^{T}D_{K}Z is a diagonal matrix.

Proof 3.2.

Let us consider the block of columns of Z relative to row i, that is, Z(:,\mathcal{J}_{Z,i}), remembering that it is non-zero only for the row indices in \mathcal{L}_{i}. As a consequence, the square block H_{i} obtained by pre- and post-multiplying K by Z(:,\mathcal{J}_{Z,i}) is computed as:

(46) H_{i}(j,k)=\sum_{r\in\mathcal{L}_{i}}\left(\sum_{s\in\mathcal{L}_{i}}Z(s,k)K(r,s)\right)Z(r,j)\quad\mbox{for}\quad j,k\in\mathcal{J}_{Z,i}.

However, K(r,s) for r,s\in\mathcal{L}_{i} represents the connection between P(i,j_{r}) and P(i,j_{s}), which is non-zero only for j_{r}=j_{s}, i.e., r=s, because in K there is no connection between different columns of P. Moreover, due to Eq. 15, K(r,r)=A(i,i) for every r\in\mathcal{L}_{i}. As the columns of Z are orthonormal by construction, Z^{T}Z=I, it immediately follows that the square block H_{i} is diagonal with all its non-zero entries equal to A(i,i). The fact that Z^{T}D_{K}Z is a diagonal matrix follows from the above observation that K(\mathcal{L}_{i},\mathcal{L}_{i})=D_{K}(\mathcal{L}_{i},\mathcal{L}_{i})=A(i,i)I_{m}, with I_{m} the identity matrix of size m equal to the cardinality of \mathcal{L}_{i}.

Corollary 3.3.

The product Z^{T}D_{K}Q, where D_{K} is defined as in Theorem 3.1, is equal to the null matrix:

(47) Z^{T}D_{K}Q=0.

Proof 3.4.

Due to the block-diagonal structure of Q and Z, all the off-diagonal blocks of Z^{T}D_{K}Q are empty. Then, equation Eq. 47 follows from D_{K}(\mathcal{L}_{i},\mathcal{L}_{i})=A(i,i)I_{m} and Z(:,\mathcal{J}_{i})\perp Q(:,\mathcal{J}_{i}).

3.4.1 Preconditioned CG-based Energy Minimization

Algorithm 3 Preconditioned Conjugate Gradients for Energy Minimization
1:procedure EMIN_PCG(maxit, \tau, K, f, \Pi_{B}, M, P_{0})
2:     input: maxit – maximum iterations
3:              \tau – energy convergence tolerance
4:              K – system matrix (applied matrix-free with A)
5:              f – right-hand side f from equation (16)
6:              \Pi_{B} – projection matrix
7:              M – preconditioner
8:              P_{0} – tentative prolongation
9:     output: P – final prolongation
10:     Extract global weight vector w_{0} row-wise from P_{0};
11:     \Delta w=0;
12:     r=f-\Pi_{B}Kw_{0};
13:     for k=1,\dots,maxit do
14:         z=\Pi_{B}M^{-1}r;
15:         \gamma=r^{T}z;
16:         if k=1 then
17:              y=z;
18:         else
19:              \beta=\gamma/\gamma_{old};
20:              y=z+\beta y;
21:         end if
22:         \gamma_{old}=\gamma;
23:         \breve{y}=\Pi_{B}Ky;
24:         \alpha=\gamma/(y^{T}\breve{y});
25:         \Delta E_{k}=\gamma\alpha;
26:         if \Delta E_{k}<\tau\Delta E_{1} then return;
27:         \Delta w=\Delta w+\alpha y;
28:         r=r-\alpha\breve{y};
29:     end for
30:     w=w_{0}+\Delta w;
31:     Form final prolongation P=[W;I] with global weight vector w;
32:end procedure

Preconditioning CG can greatly improve convergence, but special care should be taken to maintain the search direction y in the space of vectors satisfying the near kernel constraint. In other words, y must satisfy \Pi_{B}y\equiv y. In [26], a Jacobi preconditioner is adopted that satisfies this requirement, but, due to the special properties of the matrix K, it is possible to compute a more effective preconditioner. Denoting by M^{-1} any approximation of K^{-1}, we use \Pi_{B}M^{-1}\Pi_{B} to precondition \Pi_{B}K\Pi_{B} in order to guarantee the constraint. The resulting PCG algorithm is outlined in Algorithm 3, where, since \Pi_{B} is a projection, we can avoid premultiplying r by \Pi_{B} (line 14), as r already satisfies the constraint.

In the remainder of this section, we focus our attention on Z^{T}KZ instead of \Pi_{B}K\Pi_{B} because, as \Pi_{B}=I-QQ^{T}=ZZ^{T}, they have the same spectrum. Our aim is to find a good preconditioner for Z^{T}KZ. Unfortunately, although K is block diagonal and several effective preconditioners can easily be built for it, Z^{T}KZ is less manageable and further approximations are needed.

By pre- and post-multiplying K by [Z\;Q]^{T} and [Z\;Q], respectively, we can write the following 2\times 2 block expression:

(48) [Z\;Q]^{T}K[Z\;Q]=\begin{bmatrix}Z^{T}KZ&Z^{T}KQ\\ Q^{T}KZ&Q^{T}KQ\end{bmatrix},

and, since we are interested in the inverse of Z^{T}KZ, we can express it as the Schur complement of the leading block of the inverse of [Z\;Q]^{T}K[Z\;Q] [32, Chapter 3]:

(49) ([Z\;Q]^{T}K[Z\;Q])^{-1}=[Z\;Q]^{T}K^{-1}[Z\;Q]=\begin{bmatrix}Z^{T}K^{-1}Z&Z^{T}K^{-1}Q\\ Q^{T}K^{-1}Z&Q^{T}K^{-1}Q\end{bmatrix},

from which it follows that:

(50) (Z^{T}KZ)^{-1}=Z^{T}K^{-1}Z-Z^{T}K^{-1}Q(Q^{T}K^{-1}Q)^{-1}Q^{T}K^{-1}Z.

When K^{-1} is approximated with the inverse of the diagonal of K, M_{J}=\mathrm{diag}(K), because of Corollary 3.3 we have that Z^{T}M_{J}^{-1}Q=0, and the expression (50) becomes:

(51) (Z^{T}KZ)^{-1}\simeq M_{1}^{-1}=Z^{T}M_{J}^{-1}Z,

which corresponds to a Jacobi preconditioning of Z^{T}KZ.

We highlight that only for Jacobi can the post-multiplication by \Pi_{B} be neglected in line 14, since M_{J}^{-1} does not introduce components along \mathrm{range}(Q). This is consistent with [26], where the Jacobi preconditioner is used and no post-multiplication by \Pi_{B} is adopted.

If a more accurate preconditioner is needed, K^{-1} can be approximated using a block-wise symmetric Gauss-Seidel (SGS) iteration, that is:

(52) K^{-1}\simeq M^{-1}_{SGS}=(L+D)^{-T}D(L+D)^{-1},

where D and L are the diagonal and the strict lower triangular part of K, respectively (K=L+D+L^{T}), which, substituted into equation (50), reads:

(53) (Z^{T}KZ)^{-1}\simeq M_{2}^{-1}=Z^{T}M^{-1}_{SGS}Z-Z^{T}M^{-1}_{SGS}Q(Q^{T}M^{-1}_{SGS}Q)^{-1}Q^{T}M^{-1}_{SGS}Z.

Since the application of (53) is still impractical due to the presence of the term (Q^{T}M^{-1}_{SGS}Q)^{-1}, we neglect the second term of the right-hand side, based on the heuristic that Z^{T}M^{-1}_{SGS}Q should be small, because Z^{T}M_{J}^{-1}Q=0. After this simplification, we obtain the final expression of the projected SGS preconditioner:

(54) (Z^{T}KZ)^{-1}\simeq M_{2}^{-1}=Z^{T}M^{-1}_{SGS}Z=Z^{T}(L+D)^{-T}D(L+D)^{-1}Z.

Note that, while M_{1}^{-1} is exactly the Jacobi preconditioner of Z^{T}KZ, M_{2}^{-1} is only an approximation of the exact SGS preconditioner of Z^{T}KZ. However, we will show in the numerical results that it is able to significantly accelerate convergence.
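Applying M_{2}^{-1} reduces to an independent symmetric Gauss-Seidel sweep on each diagonal block of K, followed by the projection (line 14 of Algorithm 3). A serial dense sketch follows (names are ours; the parallel, matrix-free variant is described in Section 4):

import numpy as np
import scipy.linalg as sla

def apply_block_sgs(K_blocks, r_blocks):
    # One application of M_SGS^{-1} = (L+D)^{-T} D (L+D)^{-1} per block of K,
    # with L and D the strict lower triangle and diagonal of each block.
    z_blocks = []
    for Kb, rb in zip(K_blocks, r_blocks):
        LD = np.tril(Kb)                               # L + D
        y = sla.solve_triangular(LD, rb, lower=True)   # (L+D)^{-1} r
        y *= np.diag(Kb)                               # D y
        z = sla.solve_triangular(LD.T, y, lower=False) # (L+D)^{-T} D (L+D)^{-1} r
        z_blocks.append(z)                             # then project with Pi_B
    return z_blocks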

4 Efficient parallel implementation

Energy minimization has historically been considered computationally expensive for the AMG setup phase in real applications. Nevertheless, a cost-effective implementation is possible, but requires special care in the parallelization of the algorithm. In this work, we build our AMG preconditioner with Chronos [15, 13], which provides the standard numerical kernels usually required in a distributed memory AMG implementation, such as the communication of ghost unknowns, coarse grid construction (e.g., computing the PMIS C-F partition [11]), sparse matrix-vector products, sparse matrix-matrix products, etc. For energy minimization, however, we developed three specific kernels that are not required in other AMG approaches but are critical for an efficient parallel implementation here: the sparse product between K and p, the application of the projection \Pi_{B}, and symmetric Gauss-Seidel with K. We do not list Jacobi preconditioning, because it simply consists of a row-wise scaling of P.

The first issue related to the product by K is that, in practical applications, K cannot be stored. In fact, if we consider a prolongation P having r non-zeroes per row on average, the number of non-zeroes per column will be approximately s\simeq\frac{n_{f}}{n_{c}}r, and the K matrix would be of size n_{c}s\times n_{c}s with about n_{c}s\,t\simeq n_{f}r\,t non-zeroes to be stored, where t is the average number of non-zeroes per row of A. Often, r\simeq t, and storing K becomes several times more expensive than storing A. For instance, in practical elasticity problems, K can be up to 20 times larger than A, making it unavoidable to proceed in a matrix-free mode for K. Fortunately, the special structure of K allows us to interpret the product Kp as a sparse matrix-matrix (SpMM) product between A and P, but with a fixed, prescribed pattern on P. This property can easily be derived from the definition of K in Eq. 16 and the vector form of the prolongation p. One advantage of prescribing a fixed pattern for P is that the amount of data exchanged in the SpMM is greatly reduced. First of all, the sparsity pattern adjacency information of P can be communicated only once before entering PCG; then, for all the successive AP products, only the entries of P are exchanged. Moreover, all the buffers to receive and send messages through the network can be allocated and set up only once at the beginning and removed at the end of the minimization process. In practice, we find that these optimizations reduce the cost of the SpMM by about 50%.
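A serial sketch of the equivalence used here: applying K to p amounts to the product AP restricted to the prescribed pattern of P (SciPy CSR inputs; our names; this glosses over the identity block at the C-points and, of course, the distributed buffer reuse described above):

def masked_spmm(A, P):
    # Compute A @ P, then keep only the entries inside P's fixed pattern;
    # this realizes the action of K on p (Eq. 16) without ever storing K.
    AP = (A @ P).tocsr()
    mask = P.copy().tocsr()
    mask.data[:] = 1.0
    return AP.multiply(mask)          # elementwise mask onto P's pattern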

By contrast, the construction and application of \Pi_{B} do not require any communication. In fact, we distribute the prolongation row-wise among processes and, since B is block diagonal, with each block corresponding to a row of P as shown in (13), we distribute the blocks of B to the process owning the respective row of P. Following this scheme, each block \mathbb{B} of B is factorized independently to obtain Q, which is efficiently processed by the BLAS routine DGEMV when applying \Pi_{B}.

Finally, we emphasize that there is no parallel bottleneck in the application of symmetric Gauss-Seidel with K. This is because K is block diagonal and, at least in principle, each block can be assigned to a different process. However, as in the Kp product, the main issue is the large size of K, which prevents explicit storage. As above, we rely on the equivalence between the Kp product and the AP product with a prescribed sparsity pattern. The symmetric Gauss-Seidel step is then performed matrix-free, similarly to the AP product, again saving a considerable amount of data exchange because it is not necessary to communicate adjacency information or reallocate communication buffers. As expected, the symmetric Gauss-Seidel application exhibits a computational cost comparable to that of the AP product.

5 Numerical experiments

In this section, we investigate the improved convergence offered by energy minimization and the resulting computational performance and parallel efficiency for real-world applications. This numerical section is divided into three parts: a detailed analysis of the prolongation energy reduction and the related AMG convergence for two medium-size matrices, a weak scalability test for an elasticity problem on a uniformly refined cube, and a comparison study for a set of large real-world matrices representing fluid dynamics and mechanical problems.

As a reference for the proposed approach, we compare with the well-known open source solver GAMG [1], a smoothed aggregation-based AMG code from PETSc. In both cases, Chronos and GAMG are used with preconditioned conjugate gradients (PCG). The GAMG set-up is tuned for each problem starting from its default parameters, as suggested by the user guide [30]. As the smoother, we have chosen the most powerful option available in Chronos and GAMG, that is, FSAI and Chebyshev-accelerated Jacobi, respectively. Whenever the default parameters are modified, we report the chosen values in the relevant section. Since all of the experiments are run in parallel, the system matrices are partitioned with ParMETIS [18] before the AMG preconditioner set-up to reduce communication overhead.

All numerical experiments have been run on the Marconi100 supercomputer located at the Italian consortium for supercomputing (CINECA). Marconi100 consists of 980 nodes based on the IBM Power9 architecture, each equipped with two 16-core IBM POWER9 AC922 @3.1 GHz processors. For each test, the number of nodes, N, is selected so that each node has approximately 640,000,000 nonzero entries; consequently, the number of nodes is problem dependent. Each node is always fully exploited by using 32 MPI tasks, i.e., each task (core) has an average load of 20,000,000 nonzero entries. The number of cores is denoted by n_{cr}. Only for the smaller cases in the weak scalability analysis are nodes partially used (i.e., with fewer than 32 MPI tasks). Even though Chronos can exploit hybrid MPI-OpenMP parallelism, for the sake of comparison, it is convenient to use just one thread, i.e., pure MPI parallelism. Moreover, for such a high load per core, we do not find that fine-grained OpenMP parallelism is of much help.

The numerical results are presented in terms of the total number of computational cores used, n_{cr}, the grid and operator complexities, C_{gr} and C_{op}, respectively, the number of iterations, n_{it}, and the setup, iteration, and total times, T_{p}, T_{s}, and T_{t}=T_{p}+T_{s}, respectively. For all the test cases, the right-hand side is a unit vector. The linear systems are solved with PCG and a zero initial guess, and convergence is achieved when the \ell_{2}-norm of the iterative residual drops by 8 orders of magnitude with respect to the \ell_{2}-norm of the right-hand side.

5.1 Analysis of the energy minimization process

We use two matrices to study prolongation energy reduction, Cube and Pflow742 [20]. While the former is quite simple, being the fourth refinement level of the linear elasticity cube used in the weak scalability study, the latter arises from a 3D simulation of the pressure-temperature field in a multilayered porous medium discretized by hexahedral finite elements. The main source of ill-conditioning here is the large contrast in the material properties of the different layers. Cube has 1,778,112 rows and 78,499,998 entries, with 44.15 entries per row on average. Pflow742 has 742,793 rows and 37,138,461 entries, for an average of 50.00 entries per row.

The overall solver performance is compared against the energy reduction for the fine-level P when using the energy minimization Algorithm 3 with either Jacobi or Gauss-Seidel as the preconditioner. Results are analyzed in terms of computational costs and times. The main algorithmic features we want to analyze are:

  • how the prolongation energy reduction affects AMG convergence; and

  • the effectiveness of the preconditioner, i.e., Jacobi or Gauss-Seidel.

The energy minimization iteration count (Algorithm 3) is denoted by n_{it}^{E}. We also report the relative energy reduction \Delta E_{k}/\Delta E_{1} that is used to monitor the restricted PCG convergence of Algorithm 3, as shown in equation (41). T_{i} is the time spent improving the prolongation, with either classical prolongation smoothing with weighted Jacobi (SMOOTHED) or the energy minimization process. Note that T_{i} is only part of T_{p}.

Table 1 shows the results for Cube. First, it can be seen that the energy minimization algorithm produces prolongation operators leading to somewhat lower complexities than the smoothed case. Moreover, energy minimization builds more effective prolongation operators overall, since the global iteration count (n_{it}) is lower. As the energy minimization iteration count increases, n_{it} decreases, while the setup time (T_{p}) increases. For this simple problem, the optimal point in terms of total time (T_{t}) is reached using 2 iterations of Jacobi (EMIN-J). Fig. 2a further shows that 2 iterations of Jacobi already reach close to the achievable energy minimum. Fig. 3a compares the cost in wall-clock seconds of each energy minimization iteration using the different preconditioners. For this case and implementation, Jacobi is more efficient.

Table 1: Analysis of energy minimization for the Cube problem.
Prolongation   n_{it}^{E}   \Delta E_{k}/\Delta E_{1}   C_{gr}   C_{op}   n_{it}   T_{p} [s]   T_{s} [s]   T_{t} [s]   T_{i} [s]
SMOOTHED       -            -                           1.075    1.648    58       52.1        13.4        65.5        6.3
EMIN-J         1            10^{0}                      1.075    1.592    54       47.6        11.2        58.8        11.3
EMIN-J         2            2\cdot10^{-1}               1.075    1.589    32       50.9        6.7         57.6        14.3
EMIN-J         4            10^{-2}                     1.075    1.589    27       57.3        5.7         63.0        20.0
EMIN-GS        1            10^{0}                      1.075    1.587    28       55.5        6.0         61.5        19.3
EMIN-GS        2            2\cdot10^{-2}               1.075    1.588    26       65.6        5.5         71.1        28.8
EMIN-GS        4            1\cdot10^{-4}               1.075    1.589    26       84.4        5.6         90.0        47.7

Figure 2: Energy reduction vs. energy minimization iterations with Jacobi and Gauss-Seidel. (a) Cube case; (b) Pflow742 case.

Figure 3: Energy reduction vs. computational cost for energy minimization preconditioned with Jacobi and Gauss-Seidel. (a) Cube case; (b) Pflow742 case.

Similar conclusions can be drawn for the Pflow742 case, as n_{it} decreases monotonically as the energy associated with the prolongation operator is reduced. As reported in Table 2, the optimal total time (T_{t}) is obtained with 4 Jacobi iterations (EMIN-J). Fig. 2b shows that the energy of the prolongation operator decreases more slowly than for Cube and that Jacobi converges significantly more slowly than Gauss-Seidel. However, as reported in Fig. 3b, the cost of Gauss-Seidel is still higher than that of Jacobi, although the performance difference is smaller than for Cube.

Table 2: Analysis of energy minimization for the Pflow742 problem.
Prolongation   n_{it}^{E}   \Delta E_{k}/\Delta E_{1}   C_{gr}   C_{op}   n_{it}   T_{p} [s]   T_{s} [s]   T_{t} [s]   T_{i} [s]
SMOOTHED       -            -                           1.061    1.465    369      23.0        29.8        52.8        2.3
EMIN-J         1            10^{0}                      1.061    1.339    377      20.4        27.3        47.7        2.9
EMIN-J         2            3\cdot10^{-1}               1.061    1.344    270      21.5        19.6        41.1        3.7
EMIN-J         4            4\cdot10^{-2}               1.062    1.352    219      23.3        15.7        39.0        5.4
EMIN-J         8            8\cdot10^{-3}               1.062    1.360    189      26.2        13.8        40.0        8.1
EMIN-GS        1            10^{0}                      1.061    1.346    276      23.0        19.9        42.9        5.3
EMIN-GS        2            9\cdot10^{-2}               1.062    1.352    221      26.2        16.0        42.2        8.2
EMIN-GS        4            6\cdot10^{-3}               1.062    1.363    184      31.7        14.3        46.0        13.8
EMIN-GS        8            7\cdot10^{-4}               1.063    1.367    183      42.9        13.8        56.7        24.8

Regarding algorithmic complexity, each energy minimization iteration with Gauss-Seidel should cost exactly twice that of an iteration with Jacobi, and, from the above tests, Gauss-Seidel is able to reduce the energy more than twice as fast as Jacobi. Thus, Gauss-Seidel should be cheaper than Jacobi. Unfortunately, this is not confirmed by our numerical experiments, likely due to a sub-optimal parallel implementation of the Gauss-Seidel preconditioner. Although the block diagonal structure of K theoretically allows for a perfectly parallel implementation, our Gauss-Seidel implementation requires more communication and synchronization stages than Jacobi, which likely explains why the Jacobi preconditioner is faster. A more cost-effective implementation of Gauss-Seidel will be the focus of future work. For now, Jacobi is chosen as our default preconditioner and is used in all subsequent cases.

Finally, we observe that a relative energy reduction of one order of magnitude generally gives the best trade-off between setup time and AMG convergence. Therefore, we use $\tau=0.1$ as the default.
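As a minimal sketch of this default stopping rule (assuming the prolongation energies $E_0, E_1, \dots$ are recorded after each minimization iteration and $\Delta E_k = E_{k-1} - E_k$; the function name is hypothetical):

```python
def emin_converged(energies, tau=0.1):
    """Stop when the latest energy drop falls below tau times the first
    drop, i.e., Delta E_k / Delta E_1 <= tau.  'energies' holds the
    prolongation energy trace(P^T A P) after each iteration, starting
    from the initial tentative prolongation."""
    if len(energies) < 3:        # need at least two drops to compare
        return False
    dE1 = energies[0] - energies[1]
    dEk = energies[-2] - energies[-1]
    return dEk <= tau * dE1
```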

5.2 Weak scalability test

Here, we carry out a weak scalability study of energy minimization AMG for linear elasticity on a unit cube at different levels of refinement. The unit cube $[0,1]^3$ is discretized by regular tetrahedral elements, the material is homogeneous, and all displacements are prevented on the square region $[0,0.125]\times[0,0.125]$ at $z=0$. The mesh sizes are chosen such that each refinement produces about twice the number of degrees of freedom of the previous mesh. The problem sizes range from 222k rows to 124M rows, with an average of 44.47 entries per row.

Two sets of AMG parameters are used with Chronos: the first set targets a constant PCG iteration count, while the second minimizes the total solution time ($T_t$). The first section of Table 3 reports the outcome of the first test. The iteration count increases very slowly, from 23 to 33, while the problem size increases by a factor of about $2^9$. However, this nearly optimal scaling comes with relatively large complexities. As a consequence, total times are also relatively large, especially the setup time ($T_p$). The time for energy minimization ($T_i$), however, scales quite well, with only a factor-of-2 difference between the first and last refinement levels.

Next, to reduce the setup and solution times, we increase the AMG strength threshold [27] to allow for lower complexities. The second section of Table 3 collects the new results. The relative trends are the same as in the previous tests, but the timings are lower (by 30% on average). The reduced complexity is thus advantageous, providing faster wall-clock times even at the expense of more iterations $n_{it}$.

Table 3: Weak scalability results for the regular cube. Three sets of results are reported: i) energy minimization-based AMG via Chronos, tuned to produce an almost constant iteration count; ii) the same, tuned to minimize the total time; and iii) PETSc's GAMG with all default parameters except $\mu$, which is chosen to reduce the overall solution time.
nrows  $n_{cr}$  $N$  $C_{gr}$  $C_{op}$  $n_{it}$  $T_p$ [s]  $T_s$ [s]  $T_t$ [s]  $T_i$ [s]
energy minimization, minimal iteration count:
222k  1  1  1.077  1.540  23  28.7  2.7  31.4  0.12
447k  2  1  1.076  1.559  24  33.0  2.9  35.9  0.12
902k  4  1  1.076  1.580  26  36.1  3.4  39.5  0.13
1,778k  8  1  1.075  1.596  27  39.0  3.6  42.7  0.13
3,675k  16  1  1.075  1.610  28  43.3  4.1  47.4  0.15
7,546k  32  1  1.075  1.620  29  53.9  5.3  59.2  0.18
15,533k  64  2  1.075  1.630  30  61.2  5.9  67.2  0.20
31,081k  128  4  1.075  1.670  30  74.9  6.5  81.4  0.22
62,391k  256  8  1.075  1.642  32  72.8  6.8  79.7  0.21
124,265k  512  16  1.075  1.646  33  88.8  7.8  96.7  0.24
energy minimization, best solution time:
222k  1  1  1.047  1.252  31  19.5  3.1  22.6  0.10
447k  2  1  1.047  1.260  31  22.6  3.3  25.9  0.11
902k  4  1  1.047  1.269  34  24.4  3.8  28.2  0.11
1,778k  8  1  1.047  1.274  34  27.4  4.0  31.4  0.12
3,675k  16  1  1.047  1.279  37  30.6  4.7  35.3  0.13
7,546k  32  1  1.047  1.283  38  37.8  6.1  43.9  0.16
15,533k  64  2  1.046  1.286  49  44.3  8.5  52.8  0.17
31,081k  128  4  1.046  1.289  41  45.0  7.5  52.5  0.18
62,391k  256  8  1.046  1.291  43  48.4  8.6  57.0  0.20
124,265k  512  16  1.046  1.292  51  52.5  10.8  63.3  0.21
GAMG ($\mu=0.01$), best solution time:
222k  1  1  N/A  1.479  58  11.19  13.72  24.92  0.24
447k  2  1  N/A  1.503  69  10.02  17.12  27.14  0.25
902k  4  1  N/A  1.526  73  11.41  19.19  30.59  0.26
1,778k  8  1  N/A  1.549  78  12.63  21.05  33.68  0.27
3,675k  16  1  N/A  1.571  82  14.83  24.86  39.68  0.30
7,546k  32  1  N/A  1.608  86  23.04  35.52  58.56  0.41
15,533k  64  2  N/A  1.618  93  27.15  39.33  66.48  0.42
31,081k  128  4  N/A  1.703  97  37.48  41.58  79.06  0.43
62,391k  256  8  N/A  1.803  98  58.98  43.37  102.35  0.44
124,265k  512  16  N/A  1.953  100  119.16  50.29  169.45  0.50

For comparison, the same set of problems is solved with GAMG from PETSc. The default values are used for all parameters, as they turned out to be already the best, with the only exception of the threshold for dropping edges in the aggregation graph ($\mu$), whose value is reported alongside the results. The third section of Table 3 shows the GAMG output. The complexities and run times are higher than for Chronos, especially for the setup stage, and the iteration counts are usually between 2 and 3 times those required by Chronos. For this test problem, reducing complexity by setting $\mu=0.0$ is not beneficial, as the increase in the iteration count cancels the setup time reduction. We also note that both the operator complexity and the setup time increase significantly with the refinement level, while energy minimization better limits growth in these quantities. Comparing two codes is fraught with difficulty, but these results do indicate that energy minimization is an efficient approach in parallel for this problem.
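For reproducibility, a configuration along these lines can be expressed, for instance, with petsc4py. The sketch below is our assumption of an equivalent setup (the matrix `A` and vectors `b`, `x` are placeholders), not the exact scripts used for these runs:

```python
from petsc4py import PETSc

# CG preconditioned with GAMG: all defaults, except the threshold for
# dropping edges in the aggregation graph (the parameter mu in the text)
PETSc.Options().setValue('pc_gamg_threshold', 0.01)

ksp = PETSc.KSP().create()
ksp.setType('cg')
ksp.getPC().setType('gamg')
ksp.setFromOptions()          # picks up pc_gamg_threshold

# with an assembled PETSc Mat A and Vecs b, x:
# ksp.setOperators(A)
# ksp.solve(b, x)
```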

5.3 Challenging real-world problems

We now examine the proposed energy minimization approach for a set of challenging real-world problems arising from discretized PDEs in both fluid dynamics and mechanics. The former class consists of problems from the discretization of the Laplace operator, such as underground fluid flow, compressible or incompressible airflow around complex geometries, or porous flow. The latter class consists of mechanical applications such as subsidence analysis, hydrocarbon recovery, gas storage (geomechanics), mesoscale simulation of composite materials (mesoscale), mechanical deformation of human tissues or organs subjected to medical interventions (biomedicine), and design and analysis of mechanical elements, e.g., cutters, gears, air-coolers (mechanical). The selected problems are not only large but also characterized by severe ill-conditioning due to mesh distortions, material heterogeneity, and anisotropy. They are listed in Table 4 with details about the size, the number of nonzeros, and the application field from which they arise.

Table 4: Matrix sizes and numbers of nonzeros for the real-world problems.
matrix  nrows  nterms  avg. nterms/row  application
guenda11m 11,452,398 512,484,300 44.75 geomechanics
agg14m 14,106,408 633,142,730 44.88 mesoscale
tripod20m 19,798,056 871,317,864 44.01 mechanical
M20 20,056,050 1,634,926,088 81.52 mechanical
wing 33,654,357 2,758,580,899 81.97 mechanical
Pflow73m 73,623,733 2,201,828,891 29.91 reservoir
c4zz134m 134,395,551 10,806,265,323 80.41 biomedicine

Table 5 reports the AMG performance on these benchmarks when using the energy minimization procedure, classical prolongation smoothing (one step of weighted Jacobi applied to the tentative prolongation), and GAMG, respectively. The overall best time for each case is marked with a '†'. As before, note that for GAMG only the threshold value for dropping edges in the aggregation graph ($\mu$) has been tuned; all other parameters keep their default values, which were already the best.

With respect to classical prolongation smoothing, the energy minimization procedure reduces the complexities, in particular $C_{op}$, the setup time $T_p$, and also the iteration count (with the only exception being Pflow73m). It is the reduced complexities that allow energy minimization to achieve the lower setup time. The overall gain in total time ranges from about 5% to 55% across the test cases.

Energy minimization also compares favorably with GAMG. GAMG provides a faster total time than energy minimization-based AMG in two cases out of seven, agg14m and M20, and a similar total time on tripod20m. The situation is reversed for the most challenging examples, where GAMG is significantly slower and is unable to solve Pflow73m. We briefly note that, unexpectedly, the ParMETIS partitioning significantly harms GAMG's effectiveness on c4zz134m; for this case, we report GAMG performance on the matrix with its native ordering and mark the test with a '*'. Typically, the GAMG setup is faster than energy minimization, but energy minimization provides a preconditioner of higher quality, which significantly reduces the total time in the most difficult cases. Fig. 4 collects these results, reporting the relative total times, with the setup and solve phases denoted by different shading.

Table 5: Results for the real-world cases using: i) energy minimization AMG; ii) classical prolongation smoothing; and iii) PETSc's GAMG. For GAMG, only $\mu$, i.e., the threshold for dropping edges in the aggregation graph, has been tuned; all other parameters keep their default values, which were already the best. '*' means that the matrix has not been partitioned with ParMETIS. '†' marks the best total time for each case.
case  $N$  $\mu$  $C_{gr}$  $C_{op}$  $n_{it}$  $T_p$ [s]  $T_s$ [s]  $T_t$ [s]
energy minimization:
guenda11m  1  N/A  1.041  1.325  987  314.0  399.0  713.0†
agg14m  1  N/A  1.042  1.322  23  66.7  7.2  74.0
tripod20m  2  N/A  1.049  1.302  104  40.3  22.2  62.6†
M20  2  N/A  1.055  1.304  111  98.0  40.2  138.0
wing  8  N/A  1.055  1.297  140  47.2  25.3  72.5†
Pflow73m  4  N/A  1.028  1.101  1169  225.0  424.0  649.0†
c4zz134m  8  N/A  1.029  1.122  154  72.7  48.8  122.0†
smoothed prolongation:
guenda11m  1  N/A  1.041  1.378  1771  307.0  750.0  1060.0
agg14m  1  N/A  1.042  1.371  48  62.5  15.5  78.1
tripod20m  2  N/A  1.048  1.336  212  34.6  47.5  82.2
M20  2  N/A  1.054  1.733  154  167.0  76.6  244.0
wing  8  N/A  1.055  1.697  301  93.5  71.6  165.1
Pflow73m  4  N/A  1.058  1.371  841  441.3  394.5  836.0
c4zz134m  8  N/A  1.028  1.199  277  79.2  98.9  178.0
GAMG:
guenda11m  1  0.00  N/A  1.524  2237  22.3  1553.9  1576.1
agg14m  1  0.00  N/A  1.557  33  20.6  32.2  52.7†
tripod20m  2  0.01  N/A  1.679  48  32.2  32.2  64.4
M20  2  0.01  N/A  1.203  60  36.7  62.4  99.1†
wing  8  0.01  N/A  1.204  250  34.0  108.4  142.4
Pflow73m  4  —  (not solved)
c4zz134m*  8  0.01  N/A  1.233  156  110.9  250.38  361.33
Figure 4: Comparison in terms of relative total time for the different preconditioners. Darker portions of the bars represent the AMG setup, while lighter portions represent the time spent in the solve phase. The relative baseline is the time taken by the energy minimization phase to build the prolongation.

6 Conclusions

This work provides evidence of the potential of a novel energy minimization procedure for constructing the prolongation operator in AMG. While the theoretical advantages of this idea are well known in the literature, the computational aspects and its practical feasibility in massively parallel software have remained open questions.

With this contribution, we have shown how the energy minimization approach can be effectively implemented in a classical AMG setting, leading to robust and cost-effective prolongation operators when compared to other approaches. We have also shown that, especially for challenging problems, this technique can lead to considerably faster AMG convergence. The prolongation energy can be minimized with several schemes; we have adopted a restricted conjugate gradient method, accelerated by suitable preconditioners. We presented and analyzed Jacobi and Gauss-Seidel preconditioners, restricting our attention to these two because of their applicability in matrix-free mode. The experiments show that Gauss-Seidel has the potential to be faster than Jacobi; however, its efficient parallel implementation is not straightforward and needs further attention.

Weak scalability has been assessed on an elasticity model problem obtained by discretizing a cube with regular tetrahedra. The proposed algorithms have been implemented in a hybrid MPI-OpenMP linear solver, and their performance has been compared to another well-recognized AMG implementation (GAMG) on a set of large and difficult problems arising from real-world applications.

In the future, we plan to further reduce setup time by investigating other preconditioning techniques, such as those described in [17], and to extend this approach to nonsymmetric problems.

References

  • [1] S. Balay, S. Abhyankar, M. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, A. Dener, V. Eijkhout, W. Gropp, et al., PETSc users manual, (2019).
  • [2] M. Benzi, G. H. Golub, and J. Liesen, Numerical solution of saddle point problems, Acta Numer., 14 (2005), pp. 1–137.
  • [3] A. Brandt, J. Brannick, K. Kahl, and I. Livshits, Bootstrap AMG, SIAM Journal on Scientific Computing, 33 (2011), pp. 612–632.
  • [4] A. Brandt, J. Brannick, K. Kahl, and I. Livshits, Algebraic distance for anisotropic diffusion problems: multilevel results, Electronic Transactions on Numerical Analysis, 44 (2015), pp. 472–496.
  • [5] A. Brandt, S. F. McCormick, and J. W. Ruge, Algebraic multigrid (AMG) for automatic multigrid solution with application to geodetic computations, tech. rep., Institute for Computational Studies, Colorado State University, 1982.
  • [6] ———, Algebraic multigrid (AMG) for sparse matrix equations, in Sparsity and Its Applications, D. J. Evans, ed., Cambridge Univ. Press, Cambridge, 1984, pp. 257–284.
  • [7] J. Brannick, F. Cao, K. Kahl, R. D. Falgout, and X. Hu, Optimal interpolation and compatible relaxation in classical algebraic multigrid, SIAM Journal on Scientific Computing, 40 (2018), pp. A1473–A1493.
  • [8] J. J. Brannick and R. D. Falgout, Compatible relaxation and coarsening in algebraic multigrid, SIAM Journal on Scientific Computing, 32 (2010), pp. 1393–1416.
  • [9] M. Brezina, R. Falgout, S. MacLachlan, T. Manteuffel, S. McCormick, and J. Ruge, Adaptive smoothed aggregation ($\alpha$SA) multigrid, SIAM Rev., 47 (2005), pp. 317–346.
  • [10] W. L. Briggs, V. E. Henson, and S. F. McCormick, A multigrid tutorial, SIAM, Philadelphia, PA, USA, 2nd ed., 2000.
  • [11] H. De Sterck, U. M. Yang, and J. J. Heys, Reducing complexity in parallel algebraic multigrid preconditioners, SIAM Journal on Matrix Analysis and Applications, 27 (2006), pp. 1019–1039.
  • [12] R. D. Falgout and P. S. Vassilevski, On generalizing the algebraic multigrid framework, SIAM J. Numer. Anal., 42 (2004), pp. 1669–1693.
  • [13] M. Frigo, G. Isotton, and C. Janna, Chronos web page. https://www.m3eweb.it/chronos, 2021.
  • [14] S. A. Goreinov, I. V. Oseledets, D. Savostyanov, E. E. Tyrtyshnikov, and N. L. Zamarashkin, How to find a good submatrix, tech. rep., Nov. 2008.
  • [15] G. Isotton, M. Frigo, N. Spiezia, and C. Janna, Chronos: a general purpose classical AMG solver for high performance computing, SIAM J. Sci. Comput., 43 (2021), pp. C335–C357.
  • [16] D. E. Knuth, Semi-optimal bases for linear dependencies, Linear and Multilinear Algebra, 17 (1985), pp. 1–4.
  • [17] ———, Solving linear systems of the form $(a+\gamma uu^{t})x=b$ by preconditioned iterative methods, Linear and Multilinear Algebra, (2022).
  • [18] Karypis Lab, ParMETIS - Parallel Graph Partitioning and Fill-reducing Matrix Ordering. http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview, 2022.
  • [19] S. MacLachlan, T. Manteuffel, and S. McCormick, Adaptive reduction-based amg, Numerical Linear Algebra with Applications, 13 (2006), pp. 599–620.
  • [20] V. A. P. Magri, A. Franceschini, and C. Janna, A novel algebraic multigrid approach based on adaptive smoothing and prolongation for ill-conditioned systems, SIAM Journal on Scientific Computing, 41 (2019), pp. A190–A219.
  • [21] J. Mandel, M. Brezina, and P. Vaněk, Energy optimization of algebraic multigrid bases, Computing, 62 (1999), pp. 205–228.
  • [22] T. A. Manteuffel, S. Münzenmaier, J. Ruge, and B. Southworth, Nonsymmetric reduction-based algebraic multigrid, SIAM Journal on Scientific Computing, 41 (2019), pp. S242–S268.
  • [23] T. A. Manteuffel, L. N. Olson, J. B. Schroder, and B. S. Southworth, A root-node-based algebraic multigrid method, SIAM J. Sci. Comput., 39 (2017), pp. S723–S756.
  • [24] T. A. Manteuffel, J. Ruge, and B. S. Southworth, Nonsymmetric algebraic multigrid based on local approximate ideal restriction ($\ell$AIR), SIAM Journal on Scientific Computing, 40 (2018), pp. A4105–A4130.
  • [25] L. N. Olson, J. Schroder, and R. S. Tuminaro, A new perspective on strength measures in algebraic multigrid, Numerical Linear Algebra with Applications, 17 (2010), pp. 713–733.
  • [26] L. N. Olson, J. B. Schroder, and R. S. Tuminaro, A general interpolation strategy for algebraic multigrid using energy minimization, SIAM J. Sci. Comput., 33 (2011), pp. 966–991.
  • [27] J. W. Ruge and K. Stüben, Algebraic multigrid (AMG), in Multigrid Methods, S. F. McCormick, ed., Frontiers Appl. Math., SIAM, Philadelphia, 1987, pp. 73–130.
  • [28] M. Sala and R. S. Tuminaro, A new Petrov-Galerkin smoothed aggregation preconditioner for nonsymmetric linear systems, SIAM J. Sci. Comput., 31 (2008), pp. 143–166.
  • [29] U. Trottenberg, C. Oosterlee, and A. Schüller, Multigrid, Academic Press, London, UK, 2001.
  • [30] UChicago Argonne, LLC and the PETSc Development Team, PCGAMG. https://petsc.org/main/docs/manualpages/PC/PCGAMG/index.html, 2022.
  • [31] P. Vaněk, J. Mandel, and M. Brezina, Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems, Computing, 56 (1996), pp. 179–196.
  • [32] P. S. Vassilevski, Multilevel block factorization preconditioners: Matrix-based analysis and algorithms for solving finite element equations, Springer Science & Business Media, 2008.
  • [33] W. L. Wan, T. F. Chan, and B. Smith, An energy-minimizing interpolation for robust multigrid methods, SIAM J. Sci. Comput., 21 (2000), pp. 1632–1649.
  • [34] J. Xu and L. Zikatanov, Algebraic multigrid methods, Acta Numerica, 26 (2017), pp. 591–721.
  • [35] X. Xu and C.-S. Zhang, On the ideal interpolation operator in algebraic multigrid methods, SIAM Journal on Numerical Analysis, 56 (2018), pp. 1693–1710.