Parallel Energy-Minimization Prolongation for Algebraic Multigrid
Abstract
Algebraic multigrid (AMG) is one of the most widely used solution techniques for linear systems of equations arising from discretized partial differential equations. The popularity of AMG stems from its potential to solve linear systems in almost linear time, that is, with $O(n)$ complexity, where $n$ is the problem size. This capability is crucial at present, when the increasing availability of massive HPC platforms pushes for the solution of very large problems. The key to a rapidly converging AMG method is a good interplay between the smoother and the coarse-grid correction, which in turn requires the use of an effective prolongation. From a theoretical viewpoint, the prolongation must accurately represent near kernel components and, at the same time, be bounded in the energy norm. For challenging problems, however, ensuring both these requirements is not easy, and doing so is exactly the goal of this work. We propose a constrained minimization procedure aimed at reducing the prolongation energy while preserving the near kernel components in the span of interpolation. The proposed algorithm is based on previous energy minimization approaches utilizing a preconditioned restricted conjugate gradient method, but has new features and a specific focus on parallel performance and implementation. It is shown that the resulting solver, when used for large real-world problems from various application fields, exhibits excellent convergence rates and scalability and outperforms at least some more traditional AMG approaches.
Keywords: Algebraic Multigrid, AMG, Preconditioning, Energy minimization, Prolongation
1 Introduction
With the increasing availability of powerful computational resources, scientific and engineering applications are becoming more demanding in terms of both memory and CPU time. For common methods used in the numerical approximation of partial differential equations (e.g., finite difference, finite volume, or finite element), the resulting systems can easily grow to several million or even billions of unknowns. The efficient solution of the associated sparse linear system of equations

$$A\,x = b, \qquad (1)$$

either as a stand-alone system or as part of a nonlinear solve process, often represents a significant computational expense in the numerical application. Thus, research on sparse linear solvers continues to be a key topic for efficient simulation at large scales. One of the most popular sparse linear solvers is algebraic multigrid (AMG) [5, 6, 27] because of its potential for $O(n)$ computational cost in the number $n$ of degrees-of-freedom for many problem types.
A fast converging AMG method relies on the complementary action of relaxation (e.g., with weighted Jacobi) and coarse grid correction, which is a projection step focused on eliminating the error that is not reduced by relaxation. Even in a purely algebraic setting, the main algorithmic decisions in multigrid are often based on heuristics for elliptic problems. As a result, for more complex applications, traditional methods often break down, requiring additional techniques to improve accuracy with a careful eye on overall computational complexity.
Even with advanced AMG methods, robustness remains an open problem for a variety of applications, especially in parallel. Yet, there have been several advances in recent years that have significantly improved convergence in a range of settings. Adaptive AMG [19] and adaptive smoothed aggregation [9] are among early attempts to assess the quality of the AMG setup phase during the setup process, with the ability to adaptively improve the interpolation operators. Later works focus on extending the adaptive ideas to more general settings [20], and in particular, Bootstrap AMG [3] further develops the idea of adaptive interpolation with least-squares interpolation coupled with locally relaxed vectors and multilevel eigenmodes. Other advanced approaches have a focus on specific AMG components, such as energy minimization of the interpolation operator [21, 33, 28, 26, 23], generalizing the strength of connection procedure [25, 4], or by considering the nonsymmetric nature of the problem directly [24, 22].
While AMG robustness and overall convergence has improved with combinations of the advances above, the overarching challenge of controlling cost is persistent. In this paper, we make a number of related contributions with a focus on AMG effectiveness and efficiency at large scale. Our key contributions are as follows:
- The quality and sparsity of tentative interpolation is improved through a novel utilization of sparse max vol factorizations and a new process for sparsity pattern expansion that targets locally full-rank constraint matrices for improved mode interpolation constraints;
- We accompany the energy minimization construction of interpolation with new energy and convergence monitoring, thus limiting the total cost;
- We apply a new preconditioning technique for the energy minimization process based on symmetric Gauss-Seidel applied to the underlying block diagonal operator;
- We present the non-trivial and efficient parallel implementation in detail; and
- We demonstrate improved convergence and computational complexity with several large scale experiments.
The remainder of the paper is as follows. We begin with the basics of AMG in Section 2. In Section 3, we derive the energy minimization process based on QR factorizations and introduce a method for monitoring the reduction of energy in practice. Section 4 details the efficient parallel implementation. Finally, we conclude with several numerical experiments in Section 5, along with a discussion on performance.
2 Introduction to Classical AMG
The effectiveness of AMG as a solver depends on the complementary relationship between relaxation and coarse-grid correction, where the error not reduced by relaxation on the fine grid (e.g., with weighted-Jacobi or Gauss-Seidel) is accurately represented on the coarse grid, where a complementary error correction is computed. For a more in-depth introduction to AMG, see the works [10, 29]. Here, we focus our description of AMG on the coarse grid and interpolation setup, which are most relevant to the rest of the paper.
Constructing the AMG coarse grid begins with a partition of the unknowns of $A$ into a C-F partition of fine nodes $F$ and coarse nodes $C$: $\{1, \dots, n\} = F \cup C$, $F \cap C = \emptyset$. From this, we assume an ordering of $A$ by F-points followed by C-points:

$$A = \begin{bmatrix} A_{ff} & A_{fc} \\ A_{cf} & A_{cc} \end{bmatrix}, \qquad (2)$$

where, for example, $A_{ff}$ corresponds to entries in $A$ between two F-points. We also assume $A$ is SPD, so that $A_{cf} = A_{fc}^T$. In classical AMG, prolongation takes the form

$$P = \begin{bmatrix} W \\ I \end{bmatrix}, \qquad (3)$$

where $W$ must be sparse (for efficiency) and represents interpolation from the coarse grid to the fine-grid F-points.

In constructing prolongation of the form Eq. 3, there are two widely accepted guidelines, the so-called ideal [8, 35] and optimal [34, 7] forms of prolongation. Although neither is feasible in practical applications, as both lead to very expensive and dense prolongation operators, the concepts behind their definitions are valuable guides for constructing an effective $P$.

Ideal prolongation is constructed by starting with the above C-F partition and setting $W$ to

$$W_{\mathrm{ideal}} = -A_{ff}^{-1} A_{fc}. \qquad (4)$$

Making $W \approx W_{\mathrm{ideal}}$ the goal for interpolation is motivated by Corollary 3.4 from the theoretical work [12]. Here, the main assumption is a classical AMG framework where $P$ is of the form in equation (3). (The other assumptions are specific choices for the map to F-points, for the map to C-points, and for relaxation.) In this setting, the choice $W = W_{\mathrm{ideal}}$ minimizes the two-grid convergence of AMG relative to the choice of $W$, i.e., relaxation is fixed. Motivating our later energy minimization approach, $P_{\mathrm{ideal}}$ can be viewed as having zero-energy rows, as $A P_{\mathrm{ideal}}$ is zero at all F-rows. Additionally, this classical AMG perspective likely makes the task of energy minimization easier, in that the conditioning of $A_{ff}$ is usually superior to that of $A$.
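As a concrete illustration of the zero-energy-row property, the following toy sketch (our own construction, not the paper's implementation) forms $W_{\mathrm{ideal}}$ for a small 1D Laplacian with an alternating C-F splitting and verifies that the F-rows of $A P_{\mathrm{ideal}}$ vanish. The dense solve is for illustration only; forming $W_{\mathrm{ideal}}$ is impractical at scale.

```python
import numpy as np
import scipy.sparse as sp

n = 9
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")  # 1D Laplacian
C = np.arange(1, n, 2)                 # coarse points: every other node
F = np.setdiff1d(np.arange(n), C)      # fine points

A_ff = A[np.ix_(F, F)].toarray()
A_fc = A[np.ix_(F, C)].toarray()
W_ideal = -np.linalg.solve(A_ff, A_fc)          # Eq. (4)

P = np.zeros((n, len(C)))                       # classical form (3): P = [W; I]
P[F, :] = W_ideal
P[C, :] = np.eye(len(C))
print(np.linalg.norm(A[F, :] @ P))              # ~0: F-rows of A P are zero-energy
```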
With optimal interpolation, the goal of interpolation is to capture the algebraically smoothest modes in $\operatorname{span}(P)$, i.e., the modes left behind by relaxation. More specifically, following [7], let $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ and $v_1, v_2, \dots, v_n$ be the ordered eigenvalues and eigenvectors of the generalized eigenvalue problem $A v = \lambda \tilde{M} v$, where $\tilde{M}$ is the symmetrized relaxation matrix, e.g., the diagonal of $A$ for Jacobi. (See the work [7] for more details.) Then, the two-grid convergence of AMG is minimized if

$$\operatorname{range}(P) = \operatorname{span}\{v_1, \dots, v_{n_c}\}. \qquad (5)$$

Note that no assumptions on the structure of $P$ are made, as in equation (3). Motivating our later energy minimization approach, equation (5) indicates that $\operatorname{span}(P)$ should capture low-energy modes relative to relaxation, which our Jacobi or Gauss-Seidel preconditioned energy minimization approach will explicitly target. Moreover, our energy minimization approach will incorporate constraints which explicitly force certain vectors to be in $\operatorname{span}(P)$, where these vectors are chosen to represent the $v_i$ with smallest eigenvalues.
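At toy scale, criterion (5) can be realized directly with a generalized eigensolver; the sketch below (ours, with all names illustrative) selects the $n_c$ lowest modes of $A v = \lambda \tilde{M} v$, with $\tilde{M} = \operatorname{diag}(A)$, as the range of an optimal $P$.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, nc = 12, 4
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)          # SPD test matrix
Mtilde = np.diag(np.diag(A))         # symmetrized Jacobi relaxation matrix

lam, V = eigh(A, Mtilde)             # generalized eigenpairs, ascending order
P_opt = V[:, :nc]                    # Eq. (5): range(P) spans the lowest modes
print(lam[:nc])                      # the smallest generalized eigenvalues
```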
The idea of energy minimization AMG with constraints has been exploited for both symmetric and non-symmetric operators in several works [21, 33, 28, 26, 23], and, though requiring more computational effort than classical interpolation formulas, often provides improved preconditioners that balance the extra cost.
3 Energy minimization prolongation
The energy minimization process combines the key aspects of ideal and optimal prolongation. To define this, we first introduce $B$, a basis for the near kernel of $A$, i.e., the lowest energy modes of $A$. Then, energy minimization seeks to satisfy two requirements:

1. Range: the range of prolongation must include the near kernel $B$:
   $$\operatorname{span}(B) \subseteq \operatorname{range}(P); \qquad (6)$$
2. Minimal: the energy of each column $p_j$ of $P$ is minimized:
   $$\min_P\, \|p_j\|_A, \quad j = 1, \dots, n_c. \qquad (7)$$
To construct a $P$ that contains $B$ in its range and has minimal energy, we next introduce the key components needed by most energy minimization approaches, namely

i) a sparsity pattern $\mathcal{P}$ for efficient application of $P$;
ii) a constraint to enforce Eq. 6; and
iii) an approximate block diagonal linear system for solving Eq. 7.
For practical use in AMG, the prolongation operator must be sparse; therefore, construction begins by defining a sparse non-zero pattern for $P$. Assume that a strength of connection (SoC) matrix $S$ is provided, where nonzero entries denote a strong coupling between degrees-of-freedom [27, 25, 4]. Next, let $P_0$ be a tentative prolongation with non-zero pattern $\mathcal{P}_0$, to be used as an initial guess for $P$. (See Section 3.2 for our contributions regarding the construction of $P_0$, which is based on the adaptive algorithm [15].) $P_0$ can be defined similarly to the tentative prolongation from smoothed aggregation AMG [31] in that $P_0$ interpolates the basis $B$, but it needs further improvement, for example, with energy minimization. We next obtain an enlarged sparsity pattern $\mathcal{P}$ by growing $\mathcal{P}_0$ to include all strongly connected neighbors up to distance $d$. Denoting by $\bar{S}$ the binary matrix obtained from $S$ by replacing its non-zeros with unit entries, this is equivalent to

$$\mathcal{P} = \operatorname{pattern}\left(\bar{S}^{\,d}\, \bar{P}_0\right), \qquad (8)$$

where $\bar{P}_0$ is the pattern of $P_0$ (see [26]).
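A minimal sketch of the pattern-growth rule (8), assuming CSR inputs (the function name and signature are ours): binarize $S$ and $P_0$, then take the pattern of $\bar{S}^{\,d}\,\bar{P}_0$ by repeated boolean SpMM.

```python
import numpy as np
import scipy.sparse as sp

def expand_pattern(S, P0, d=1):
    """Return a 0/1 CSR matrix with the nonzero pattern of S^d @ P0, Eq. (8)."""
    Sb = (S != 0).astype(np.int32)     # \bar{S}: unit entries on the SoC pattern
    pat = (P0 != 0).astype(np.int32)   # \bar{P}_0: pattern of the tentative P0
    for _ in range(d):                 # each product adds one hop of strong neighbors
        pat = ((Sb @ pat) != 0).astype(np.int32)
    return pat.tocsr()
```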
For a constraint condition satisfying Eq. 6, we start by splitting the near kernel basis $B$ with the same C-F splitting as in Eq. 2:

$$B = \begin{bmatrix} B_f \\ B_c \end{bmatrix}. \qquad (9)$$

Recalling the form of interpolation Eq. 3, the near kernel requirement for $P$ becomes

$$W B_c = B_f, \qquad (10)$$

which is a set of conditions on the rows of $W$. By denoting $w_i$ as the $i$-th row of $W$ (and $b_i$ as the $i$-th row of $B_f$), condition Eq. 10 is then exploited row-wise as

$$w_i\, B_c = b_i, \quad i = 1, \dots, n_f. \qquad (11)$$

Using the sparsity pattern $\mathcal{P}$, we rewrite Eq. 11 for only the nonzeros in each row $w_i$. Letting the index set $\mathcal{C}_i$ be the column indices of the nonzeros in the $i$-th row of $W$, this becomes

$$\tilde{w}_i\, \tilde{B}_i = b_i, \qquad \tilde{B}_i = B_c(\mathcal{C}_i, :), \qquad (12)$$

where $\tilde{w}_i$ collects only the nonzeros of $w_i$. It is important to note that, for each of the fine points, the constraints Eq. 12 are independent of each other, because each entry of $W$ appears in only one constraint. Denote by $\mathbf{p}_r$ the column vector collecting the nonzero entries of $P$ row-wise. Similarly, we denote by $\mathbf{p}_c$ the column vector containing the nonzero entries of $P$ column-wise. By definition, $\mathbf{p}_r$ and $\mathbf{p}_c$ have the same size, equal to the number of nonzeros in $P$. Next, we write the constraints in the following matrix form:

$$\mathcal{B}^T \mathbf{p}_r = \mathbf{b}, \qquad (13)$$

where $\mathbf{b}$ collects the $b_i$ in Eq. 12 into a single vector, and where $\mathcal{B}^T$ is a block diagonal matrix composed of the $\tilde{B}_i^T$, one block per row of $P$, due to the independence of the constraints.
Lastly, we describe the reduced linear system framework for approximating Eq. 7. Minimizing Eq. 7 is equivalent to minimizing the energy of the individual columns of $P$ on the prescribed nonzero pattern:

$$\min_{p_j}\; p_j^T A\, p_j \quad \text{subject to } \operatorname{pattern}(p_j) \subseteq \mathcal{P}_j, \quad j = 1, \dots, n_c, \qquad (14)$$

where $p_j$ is the $j$-th column of $P$ and $\mathcal{P}_j$ is the sparsity pattern of column $j$.

Next, let $\mathcal{I}_j$ be the set of nonzero row indices of the $j$-th column of $P$ and let $\tilde{p}_j$ be the vector collecting the nonzero entries of the $j$-th column of $P$. Then the minimization in Eq. 14 defines each $\tilde{p}_j$ with

$$\min_{\tilde{p}_j}\; \tilde{p}_j^T A_j\, \tilde{p}_j, \qquad A_j = A(\mathcal{I}_j, \mathcal{I}_j), \qquad (15)$$

where $A_j$ is a square, relatively dense, submatrix of $A$ corresponding to the allowed nonzero indices and $\tilde{p}_j$ is a vector corresponding to the $j$-th column of $P$ at the allowed nonzero indices. Also in this case, each column of $P$ satisfies Eq. 14 independently. Thus, denoting by $\mathbf{p}_c$ the column vector collecting the nonzero entries of $P$ column-wise (i.e., a rearranged version of $\mathbf{p}_r$ that stacks the $\tilde{p}_j$), the minimization Eq. 14 is recast as

$$\min_{\mathbf{p}_c}\; \mathbf{p}_c^T \mathcal{A}\, \mathbf{p}_c, \qquad (16)$$

where $\mathcal{A} = \operatorname{blkdiag}(A_1, \dots, A_{n_c})$ is block diagonal with $A_j$ on the $j$-th block.
The two conditions, one on the range of $P$ and one on the minimality of the energy, together form a constrained minimization problem, whose solution is the desired energy minimal prolongation. Casting this problem using Lagrange multipliers results in the saddle point system

$$\begin{bmatrix} \mathcal{A} & \mathcal{B} \\ \mathcal{B}^T & 0 \end{bmatrix} \begin{bmatrix} \mathbf{p} \\ \boldsymbol{\mu} \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{b} \end{bmatrix}. \qquad (17)$$

The elements in Eq. 17 are the same as those defined in Eqs. 13 and 16, with the exception that $\mathcal{A}$, $\mathcal{B}$, and $\mathbf{b}$ are reordered following the columns of $P$ (hereafter, $\mathbf{p}$ denotes the column-wise form $\mathbf{p}_c$), and $\boldsymbol{\mu}$ is the vector of Lagrange multipliers, whose values are not needed for the purpose of setting up the prolongation. We emphasize that, with the entries of $P$ enumerated column-wise, $\mathcal{A}$ is block diagonal. Likewise, if $\mathbf{p}$ is enumerated following the rows of $P$, then $\mathcal{B}$ becomes block diagonal. Unfortunately, there is no sorting of $\mathbf{p}$ able to make both $\mathcal{A}$ and $\mathcal{B}$ block diagonal at the same time. Nevertheless, it is possible to take advantage of this underlying structure in the numerical implementation, as will be shown later. Leveraging the block structure of $\mathcal{B}$ is also important because, as we will see in Section 3.1, our algorithm to minimize energy requires several applications of the orthogonal projector given as

$$\Pi = I - \mathcal{B}\left(\mathcal{B}^T \mathcal{B}\right)^{-1} \mathcal{B}^T. \qquad (18)$$
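Because $\mathcal{B}$ is block diagonal, the projector (18) reduces to independent dense operations, one per prolongation row. A sketch under our own naming (blocks[i] is the dense local constraint block for row $i$, assumed full column rank; idx[i] holds the positions of that row's entries in the global vector):

```python
import numpy as np

def make_projector(blocks, idx):
    """Build an apply-function for Pi = I - B (B^T B)^{-1} B^T, Eq. (18)."""
    Qs = [np.linalg.qr(Bi, mode="reduced")[0] for Bi in blocks]  # B_i = Q_i R_i
    def apply_Pi(v):
        out = v.copy()
        for Qi, ii in zip(Qs, idx):
            vi = v[ii]
            out[ii] = vi - Qi @ (Qi.T @ vi)   # local (I - Q_i Q_i^T) v_i
        return out
    return apply_Pi
```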
The system Eq. 17 follows closely the method from [26]. In Sections 3.2–3.4, we outline our proposed improvements to energy minimization.
3.1 Minimization through Krylov subspace methods
Following [26], energy minimization proceeds by starting with a tentative prolongation, $P_0$, that satisfies the near kernel constraints (see Eq. 10). Denoting by $\mathbf{p}_0$ the tentative prolongation in vector form with nonzero entries collected column-wise, where we highlight that the subscript 0 does not refer to a specific column as in Eq. (14), these constraints read

$$\mathcal{B}^T \mathbf{p}_0 = \mathbf{b}. \qquad (19)$$

Defining the final prolongation as the tentative plus a correction gives

$$P = P_0 + \delta P. \qquad (20)$$

Then, the problem is recast as finding the optimal correction $\delta P$:

$$\min_{\delta P}\; \operatorname{tr}\left[(P_0 + \delta P)^T A\, (P_0 + \delta P)\right], \qquad (21)$$

subject to the constraint $\mathcal{B}^T \delta\mathbf{p} = 0$, where $\delta\mathbf{p}$ is the vector form of $\delta P$ with nonzero entries again collected column-wise. By recalling that $\delta P$ has non-zero components only in the F-rows, i.e., $\delta P = \begin{bmatrix} \delta W \\ 0 \end{bmatrix}$, and using the C-F partition (2), we write
$$\operatorname{tr}\left[(P_0 + \delta P)^T A (P_0 + \delta P)\right] = \operatorname{tr}\left[P_0^T A P_0\right] + 2\, \operatorname{tr}\left[\delta W^T (A P_0)_f\right] + \operatorname{tr}\left[\delta W^T A_{ff}\, \delta W\right], \qquad (22)$$

where $(A P_0)_f$ denotes the F-rows of $A P_0$. Using the preset non-zero pattern for $\delta P$, the problem is rewritten in vector form to minimize only the terms depending on $\delta P$:

$$\min_{\delta\mathbf{p}}\; \tfrac{1}{2}\, \delta\mathbf{p}^T \mathcal{A}\, \delta\mathbf{p} + \delta\mathbf{p}^T \mathcal{A}\, \mathbf{p}_0, \qquad (23)$$

subject to the constraint

$$\mathcal{B}^T \delta\mathbf{p} = 0, \qquad (24)$$

where $\mathcal{A}$ and $\mathcal{B}$ are defined as in Eq. 17 and $\delta\mathbf{p}$ and $\mathbf{p}_0$ are the vector forms of $\delta P$ and $P_0$, respectively, with the nonzero entries collected column-wise.
This minimization can be performed using (preconditioned) conjugate gradients by ensuring that both the initial solution and the search direction satisfy the constraint [26]. To do this, return to the orthogonal projector

$$\Pi = I - \mathcal{B}\left(\mathcal{B}^T \mathcal{B}\right)^{-1} \mathcal{B}^T \qquad (25)$$

and apply conjugate gradients to the singular system

$$\Pi \mathcal{A}\, \Pi\; \delta\mathbf{p} = -\Pi \mathcal{A}\, \mathbf{p}_0, \qquad (26)$$

starting from $\delta\mathbf{p} = 0$. Due to its block diagonal structure, it is straightforward to find a QR decomposition of $\mathcal{B}$, say $\mathcal{B} = QR$, and the projection simply becomes:

$$\Pi = I - Q Q^T. \qquad (27)$$
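Putting the pieces together, a compact sketch of the constrained minimization (26): plain CG on the projected singular system, with all names ours and Acal standing for the block diagonal operator of Eq. (16).

```python
import numpy as np

def projected_cg(Acal, apply_Pi, p0, iters=10):
    """CG on Pi Acal Pi dp = -Pi Acal p0, starting from dp = 0, Eq. (26)."""
    dp = np.zeros_like(p0)
    r = apply_Pi(-(Acal @ p0))       # projected residual
    d = r.copy()                     # search direction satisfies B^T d = 0
    rs = r @ r
    for _ in range(iters):
        Ad = apply_Pi(Acal @ d)
        alpha = rs / (d @ Ad)
        dp += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p0 + dp                   # energy-minimized prolongation, vector form
```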
3.2 Improved tentative interpolation and orthogonal projection with sparsity pattern expansion and sparse max vol factorizations
A crucial point for energy minimization interpolation is the availability of a tentative prolongation $P_0$ that satisfies the near kernel representability constraint Eq. 10. While this is relatively straightforward for scalar diffusion equations, where $B$ has only one column, it is not trivial for vector-valued PDEs such as elasticity. One specific difficulty is that, when forming the $i$-th constraint equation Eq. 12, we must ensure that $\tilde{B}_i$ is full-rank. If it is not full-rank, then no prolongation operator is able to satisfy Eq. 10 and the solution to Eq. 17 is not possible in general. We consider two possible remedies:

1. Add strongly connected neighbors to the pattern of the $i$-th row of $W$ to enlarge $\mathcal{C}_i$ until $\tilde{B}_i$ is full-rank (i.e., sparsity pattern expansion); or
2. Relax the requirement and satisfy the constraint Eq. 12 only approximately, in a least-squares sense.

A novel aspect of this work is our pursuit of sparsity pattern expansion. We find that this careful construction of the sparsity pattern, which guarantees that each constraint is exactly satisfied since $\tilde{B}_i$ is always full-rank, greatly improves performance on some problems.
To accomplish this task, we adopt a dynamic-pattern, least-squares fit (LSF) procedure that satisfies Eq. 10, or equivalently Eq. 19. For each row of $W$ (corresponding to a fine node $i$), this is equivalent to satisfying the local dense system Eq. 12. For simplicity, we rewrite equation Eq. 12 by dropping the row subscript $i$, with $B = \tilde{B}_i$, $w = \tilde{w}_i^T$, and $q = b_i^T$, yielding

$$B^T w = q, \qquad (29)$$

where $B^T$ corresponds to a diagonal block of $\mathcal{B}^T$ in Eq. 17 when the non-zero entries of $P$ are enumerated row-wise.

Considering the generic FINE node represented by equation Eq. 29, if there are a sufficient number of COARSE node neighbors, then $B^T$ has more columns than rows. Hence, if we assume a full-rank $B$, then Eq. 29 is an underdetermined system and can be solved in several ways. In order to have a sparse solution $w$, we choose a minimal set of columns of $B^T$ using the max vol algorithm [16, 14] to obtain the best basis. Here, we satisfy Eq. 29 exactly. We note that a related max vol approach to computing C-F splittings is used in [7].
Remark 3.1.

We adopt this form of a factorization with max vol, yielding a $w$ that is as sparse as possible, in order to improve the complexity of our algorithm and the quality of $P_0$. While it is a relatively minor change to the algorithm's structure, we count it as a useful novelty of our efficient implementation.
If, on the contrary, the number of neighboring COARSE nodes is not sufficient ($k < m$), then Eq. 29 cannot be satisfied exactly because it is overdetermined. This may occur not only when $B^T$ is skinny, i.e., the number of columns is smaller than the number of rows, but more often because $B^T$ is rank deficient even if it has a larger number of columns, i.e., it is wide. In elasticity problems, and in particular with shell finite elements, this issue arises often in practice with standard distance-one coarsening, where during the coarsening process some FINE nodes may occasionally remain isolated. Our strategy is to gradually increase the interpolation distance for violating nodes where $\operatorname{rank}(B^T) < m$, thus widening the interpolatory set (i.e., adding columns to $B^T$) where it is necessary. Algorithm 1 describes how to set up the vector form of the initial tentative prolongation. (Algorithm 2, described later, will further process it into the final tentative prolongation $P_0$.)
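The following sketch illustrates the widening loop of Algorithm 1 under our own naming conventions (neighbors_at_distance is a hypothetical helper returning the coarse neighbors of node i within graph distance d):

```python
import numpy as np

def widen_until_full_rank(neighbors_at_distance, Bc, i, m, max_dist=3):
    """Grow the interpolatory set of FINE node i until B^T (m x k) has rank m."""
    for d in range(1, max_dist + 1):
        cols = neighbors_at_distance(i, d)     # coarse neighbors within distance d
        B_loc = Bc[cols, :]                    # k x m local block, k = len(cols)
        if len(cols) >= m and np.linalg.matrix_rank(B_loc) == m:
            return cols, B_loc                 # Eq. (29) is solvable exactly
    return cols, B_loc                         # give up: least-squares fit (Alg. 2)
```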
Remark 3.2.

Avoiding a skinny or rank-deficient local block $B^T$ is also important for the construction of the orthogonal projection $\Pi$ that maps vectors onto $\operatorname{null}(\mathcal{B}^T)$. Note that $\Pi$ is used not only to correct the conjugate gradients search direction, but also to ensure that the initial prolongation satisfies the near kernel constraint. In general, the size of $B^T$ during energy minimization is larger than when constructing the tentative prolongation because of the additional sparsity pattern expansion in (8). Thus, this "rank-deficiency" issue is ameliorated, but there are pathological cases where it has been observed in practice.
We now describe our procedure for computing local blocks of $\Pi$, called $\Pi_l$, and a single row of the final tentative prolongation $P_0$, called $w$. The global procedure to build $\Pi$ and $P_0$ is then obtained by repeating this local algorithm for each FINE row. Denote by $w_s$ the starting tentative prolongation row. We use the word starting because, in general, we can receive a tentative prolongation that does not satisfy the constraint. That is, we may have:

$$B^T w_s \neq q. \qquad (30)$$

Denote by $m$ and $k$ the dimensions of the local system, so that $B^T \in \mathbb{R}^{m \times k}$. To fulfill condition Eq. 29, we must find a correction to $w_s$, say $\delta w$, such that:

$$B^T \delta w = q - B^T w_s, \qquad (31)$$

and then set:

$$w = w_s + \delta w. \qquad (32)$$
To enforce condition Eq. 24 efficiently, we construct an orthonormal basis of $\operatorname{null}(B^T)$, say $K$, that gives rise to the desired local orthogonal projector:

$$\Pi_l = K K^T. \qquad (33)$$

If the prolongation pattern is large enough, the vast majority of the above local problems will be such that $k \ge m$, with $B$ being also full-rank, i.e., $\operatorname{rank}(B) = m$. Thus, an economy-size QR decomposition is first performed on $B$, $B = QR$ with $Q \in \mathbb{R}^{k \times m}$ and $R \in \mathbb{R}^{m \times m}$, and is used to form the local projector as $\Pi_l = I - Q Q^T$. Then, through the same QR decomposition, we compute $\delta w$ as the least-norm solution of the underdetermined system (31):

$$\delta w = Q\, R^{-T} \left(q - B^T w_s\right). \qquad (34)$$
Note that any solution to (31) would be equivalent, as the optimal choice in terms of global prolongation energy is later computed by the restricted CG algorithm. If the initial tentative prolongation arises from the LSF set-up, it should already fulfill Eq. 31 and $\delta w = 0$. However, in the most difficult cases, even extending the interpolatory set to a large distance is not sufficient to guarantee an exact interpolation of the near kernel for all the FINE nodes. For these FINE nodes (i.e., when $k < m$ or $B$ is not full-rank) we compute an SVD decomposition of $B^T$:

$$B^T = U \Sigma V^T. \qquad (35)$$

From the diagonal of $\Sigma$, we determine the rank of $B^T$, say $r$, and use the first $r$ columns of $V$ to form $V_r$ and thus $\Pi_l = I - V_r V_r^T$. Finally, since in this case it could be impossible to satisfy the constraint because the system is overdetermined, we use the least-squares solution to Eq. 31 to compute the correction $\delta w$:

$$\delta w = V \Sigma^{+} U^T \left(q - B^T w_s\right). \qquad (36)$$
It is important to recognize that, for these specific FINE nodes, energy minimization cannot reduce the energy, because the constraint does not leave any degree of freedom. Consequently, this situation should be avoided when selecting the COARSE variables and the prolongation pattern, because a prolongation violating the near kernel constraint will likely fail in representing certain smooth modes. The pseudocode to set up and correct the initial tentative prolongation (after Algorithm 1) is provided in Algorithm 2.
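A condensed sketch of the per-row correction in Algorithm 2 (our naming): the QR-based least-norm update (34) when the local block has full row rank, and the SVD-based least-squares update (35)–(36) otherwise.

```python
import numpy as np

def correct_row(BT, ws, q, tol=1e-12):
    """Return w = ws + dw with B^T w = q satisfied exactly or in least squares."""
    m, k = BT.shape
    res = q - BT @ ws                            # constraint violation, Eq. (30)
    if k >= m and np.linalg.matrix_rank(BT) == m:
        Q, R = np.linalg.qr(BT.T)                # economy QR of B
        dw = Q @ np.linalg.solve(R.T, res)       # least-norm solution, Eq. (34)
    else:
        U, s, Vt = np.linalg.svd(BT, full_matrices=False)   # Eq. (35)
        r = int(np.sum(s > tol * s[0]))          # numerical rank of B^T
        dw = Vt[:r].T @ ((U[:, :r].T @ res) / s[:r])        # Eq. (36)
    return ws + dw
```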
3.3 Improved stopping criterion and energy monitoring for CG-based energy minimization
Stopping criteria play an important role in the overall cost and effectiveness of energy minimization. Here, we introduce a measure for monitoring energy and halting the algorithm.

Since CG often converges quickly for energy minimization, it is common to fix the number of iterations in advance [21, 26]. However, in the case of a more challenging problem, several iterations may be needed, thus requiring an accurate stopping criterion. One immediate option is to use the relative residual, yet this may not be a close indicator of energy. In the following, we analyze CG for a generic SPD system $A x = b$; however, our observations extend to PCG as well.
In the CG algorithm, once the search direction $p_k$ is defined at iteration $k$, the scalar $\alpha_k$ is computed:

$$\alpha_k = \frac{p_k^T r_k}{p_k^T A\, p_k}, \qquad (37)$$

in such a way that the new approximation $x_{k+1} = x_k + \alpha_k p_k$ minimizes the square of the energy norm of the error, i.e.,

$$\|x - x_{k+1}\|_A^2 = \min_{\alpha}\, \|x - (x_k + \alpha\, p_k)\|_A^2, \qquad (38)$$

where $x$ is the true solution. The difference in energy between two successive iterations $k$ and $k+1$ can be computed as

$$\Delta E_k = \|x - x_k\|_A^2 - \|x - x_{k+1}\|_A^2 = \alpha_k^2\; p_k^T A\, p_k. \qquad (39)$$

From Eq. 37 and Eq. 39, it is possible to measure, with minimal cost, the energy decrease provided by the $(k+1)$-st iteration. Indeed, by noting that $\alpha_k$ is computed as the ratio between the two values $p_k^T r_k$ and $p_k^T A p_k$, the energy decrease reads

$$\Delta E_k = \alpha_k\; p_k^T r_k. \qquad (40)$$

The relative value of the energy variation with respect to the initial variation (first iteration) is monitored, and convergence is achieved when the energy is sufficiently reduced:

$$\frac{\Delta E_k}{\Delta E_0} \le \epsilon, \qquad (41)$$

for a small user-defined $\epsilon$.
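The monitor costs only scalars that CG already computes. A sketch of a CG loop with the energy-based stopping test (names ours):

```python
import numpy as np

def cg_with_energy_monitor(apply_A, x0, b, eps=0.1, max_it=50):
    x, r = x0.copy(), b - apply_A(x0)
    p, rs = r.copy(), r @ r
    dE0 = None
    for _ in range(max_it):
        Ap = apply_A(p)
        alpha = rs / (p @ Ap)          # Eq. (37); note p^T r = r^T r in CG
        dE = alpha * rs                # Eq. (40): energy decrease of this step
        dE0 = dE if dE0 is None else dE0
        x += alpha * p
        r -= alpha * Ap
        if dE <= eps * dE0:            # Eq. (41): relative energy variation
            break
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```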
3.4 Improved preconditioning for CG-based energy minimization
Before introducing PCG, we present some important properties of the matrices $\mathcal{A}$ and $\mathcal{B}$ that we leverage in the design of effective preconditioners for energy minimization. In particular, we will see that, thanks to the non-zero structures of $\mathcal{A}$ and $\mathcal{B}$, point-wise Jacobi or Gauss-Seidel iterations prove particularly effective in preconditioning the projected block $K^T \mathcal{A} K$.

Let us assume that the vector form $\mathbf{p}$ of the prolongation has been obtained by collecting the non-zeroes of $P$ row-wise, so that $\mathcal{B}$ is block diagonal. Denoting by $K$ the matrix collecting an orthonormal basis of $\operatorname{null}(\mathcal{B}^T)$, and by $H$ an orthonormal basis of $\operatorname{range}(\mathcal{B})$, by construction we have

$$\begin{bmatrix} K & H \end{bmatrix}^T \begin{bmatrix} K & H \end{bmatrix} = I, \qquad (42)$$

i.e., the matrix $[K\ H]$ is square and orthogonal. Moreover, since $\mathcal{B}$ is block diagonal, both $K$ and $H$ are block diagonal and can be easily computed and stored. We note that, by construction, each column of $K$ and $H$ refers to a specific row of the prolongation and, due to the block diagonal pattern chosen for $\mathcal{B}$, each column of $K$ and $H$ is non-zero only in the positions corresponding to the entries of $\mathbf{p}$ collecting the non-zeroes of a specific row of $P$, as is schematically shown in Fig. 1. More precisely, using the same notation as in Eq. 12, let us define $\mathcal{G}_i$ as the set of indices in $\mathbf{p}$ corresponding to the non-zero entries in the $i$-th row of $P$, so that:

$$\mathbf{p}(\mathcal{G}_i) = \tilde{w}_i^T, \qquad (43)$$

and define $K_i$ and $H_i$ as the sets of columns of $K$ and $H$ referring to the $i$-th row of $P$. Then,

$$K_i(\ell, :) = 0 \quad \text{and} \quad H_i(\ell, :) = 0 \qquad \forall\, \ell \notin \mathcal{G}_i. \qquad (44)$$
[Figure 1: Schematic of the block structure of $K$ and $H$: each column block $K_i$, $H_i$ is non-zero only on the row indices $\mathcal{G}_i$ of the corresponding prolongation row.]
We are now ready to state a theorem and a corollary that will be useful in explaining the choice of our preconditioners.
Theorem 3.1.

The diagonal of the projected matrix $K^T \mathcal{A} K$ is equal to the projection of the diagonal of $\mathcal{A}$:

$$\operatorname{diag}\left(K^T \mathcal{A} K\right) = K^T D K, \qquad (45)$$

where $D$ is the matrix collecting the diagonal entries of $\mathcal{A}$. Moreover, $K^T D K$ is a diagonal matrix.
Proof 3.2.

Let us consider the block of columns of $K$ relative to row $i$, that is $K_i$, remembering that it is non-zero only for the row indices in $\mathcal{G}_i$. As a consequence, the square block obtained by pre- and post-multiplying $\mathcal{A}$ by $K_i$ is computed as:

$$K_i^T \mathcal{A}\, K_i = K_i^T\, \mathcal{A}(\mathcal{G}_i, \mathcal{G}_i)\, K_i. \qquad (46)$$

However, for $j \neq l$, the entry $\mathcal{A}(\mathcal{G}_i, \mathcal{G}_i)_{jl}$ represents the connection between two entries of row $i$ lying on different columns of $P$, and it is therefore zero, because in $\mathcal{A}$ there is no connection between different columns of $P$. Moreover, due to Eq. 15, $\mathcal{A}(\mathcal{G}_i, \mathcal{G}_i)_{jj} = a_{ii}$ for every $j$. As the columns of $K_i$ are orthonormal by construction, $K_i^T K_i = I$, it immediately follows that the square block $K_i^T \mathcal{A} K_i$ is diagonal with all its non-zero entries equal to $a_{ii}$. The fact that $K^T D K$ is a diagonal matrix follows from the above observation that $\mathcal{A}(\mathcal{G}_i, \mathcal{G}_i) = a_{ii} I = D(\mathcal{G}_i, \mathcal{G}_i)$, with $I$ the identity matrix having size equal to the cardinality of $\mathcal{G}_i$.
Corollary 3.3.

With the same notation as in Theorem 3.1,

$$K^T D\, H = 0. \qquad (47)$$

Proof 3.4.

Due to the block-diagonal structure of $K$ and $H$, all the off-diagonal blocks of $K^T D H$ are empty. Then, equation Eq. 47 follows from $D(\mathcal{G}_i, \mathcal{G}_i) = a_{ii} I$ and $K_i^T H_i = 0$.
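The two results are easy to verify numerically on a toy block structure (entirely our construction: random local blocks, with the diagonal of $\mathcal{A}$ constant within each row block, as established in the proof above):

```python
import numpy as np
from scipy.linalg import null_space, orth, block_diag

rng = np.random.default_rng(1)
sizes, m = [5, 4, 6], 2
Ks, Hs, dvals = [], [], []
for s in sizes:
    Bi = rng.standard_normal((s, m))   # local constraint block of row i
    Ks.append(null_space(Bi.T))        # K_i: orthonormal basis of null(B_i^T)
    Hs.append(orth(Bi))                # H_i: orthonormal basis of range(B_i)
    dvals += [rng.uniform(1, 2)] * s   # a_ii repeated over the row block G_i
K, H, D = block_diag(*Ks), block_diag(*Hs), np.diag(dvals)

M = K.T @ D @ K
print(np.allclose(M, np.diag(np.diag(M))))   # True: K^T D K is diagonal, Thm 3.1
print(np.allclose(K.T @ D @ H, 0))           # True: K^T D H = 0, Eq. (47)
```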
3.4.1 Preconditioned CG-based Energy Minimization
Preconditioning CG can greatly improve convergence, but special care should be taken to maintain the search direction in the space of vectors satisfying the near kernel constraint. In other words, every search direction $d$ must satisfy $\mathcal{B}^T d = 0$. In [26], a Jacobi preconditioner is adopted that satisfies this requirement but, due to the special properties of the matrix $\mathcal{A}$, it is possible to compute a more effective preconditioner. Denoting by $M^{-1}$ any approximation of $\mathcal{A}^{-1}$, we use $\Pi M^{-1} \Pi$ to precondition in order to guarantee the constraint. The resulting PCG algorithm is outlined in Algorithm 3, where, since $\Pi$ is a projection, we can avoid premultiplying by $\Pi$ (line 14), as the residual already satisfies the constraint.

In the remainder of this section, we focus our attention on $K^T \mathcal{A} K$ instead of $\Pi \mathcal{A} \Pi$ because, as $\Pi = K K^T$, they have the same spectrum. Our aim is to find a good preconditioner for $K^T \mathcal{A} K$. Unfortunately, although $\mathcal{A}$ is block diagonal and several effective preconditioners can be easily built for it, $K^T \mathcal{A} K$ is less manageable and further approximations are needed.
By pre- and post-multiplying $\mathcal{A}$ by $[K\ H]^T$ and $[K\ H]$, respectively, we can write the following block expression:

$$\begin{bmatrix} K & H \end{bmatrix}^T \mathcal{A} \begin{bmatrix} K & H \end{bmatrix} = \begin{bmatrix} K^T \mathcal{A} K & K^T \mathcal{A} H \\ H^T \mathcal{A} K & H^T \mathcal{A} H \end{bmatrix} \qquad (48)$$

and, since we are interested in the inverse of $K^T \mathcal{A} K$, we can express it through the Schur complement of the corresponding block of the inverse of Eq. 48 [32, Chapter 3]:

$$\left(K^T \mathcal{A} K\right)^{-1} = K^T \mathcal{A}^{-1} K - K^T \mathcal{A}^{-1} H \left(H^T \mathcal{A}^{-1} H\right)^{-1} H^T \mathcal{A}^{-1} K, \qquad (49)$$

from which it follows that:

$$\left(K^T \mathcal{A} K\right)^{-1} = K^T \left[\mathcal{A}^{-1} - \mathcal{A}^{-1} H \left(H^T \mathcal{A}^{-1} H\right)^{-1} H^T \mathcal{A}^{-1}\right] K. \qquad (50)$$

When $\mathcal{A}^{-1}$ is approximated with the inverse of the diagonal of $\mathcal{A}$, $D^{-1}$, because of Corollary 3.3 (which also holds for $D^{-1}$, as $D^{-1}$ is likewise constant on each block $\mathcal{G}_i$), we have that $K^T D^{-1} H = 0$, and the expression (50) becomes:

$$\left(K^T \mathcal{A} K\right)^{-1} \approx M_J^{-1} = K^T D^{-1} K, \qquad (51)$$

which corresponds to a Jacobi preconditioning of $K^T \mathcal{A} K$.
We highlight that only for Jacobi can the post-multiplication by $\Pi$ be neglected in line 14, since the Jacobi preconditioner does not introduce components along $\operatorname{range}(\mathcal{B})$. This is consistent with [26], where the Jacobi preconditioner is used and no post-multiplication by $\Pi$ is adopted.
If a more accurate preconditioner is needed, $\mathcal{A}^{-1}$ can be approximated using a block-wise symmetric Gauss-Seidel (SGS) iteration, that is:

$$\mathcal{A}^{-1} \approx M_{SGS}^{-1} = \left(D + L^T\right)^{-1} D \left(D + L\right)^{-1}, \qquad (52)$$

where $L$ is the strictly lower triangular part of $\mathcal{A}$, which substituted into equation (50) reads:

$$\left(K^T \mathcal{A} K\right)^{-1} \approx K^T M_{SGS}^{-1} K - K^T M_{SGS}^{-1} H \left(H^T M_{SGS}^{-1} H\right)^{-1} H^T M_{SGS}^{-1} K. \qquad (53)$$

Since the application of (53) is still impractical due to the presence of the term $\left(H^T M_{SGS}^{-1} H\right)^{-1}$, we neglect the second term of the right-hand side, based on the heuristic that $K^T M_{SGS}^{-1} H$ should be small, because $K^T D^{-1} H = 0$. After this simplification, we obtain the final expression of the projected SGS preconditioner:

$$M_{SGS,\Pi}^{-1} = K^T M_{SGS}^{-1} K. \qquad (54)$$

Note that while $M_J^{-1}$ is exactly the Jacobi preconditioner of $K^T \mathcal{A} K$, $M_{SGS,\Pi}^{-1}$ is only an approximation of the exact SGS preconditioner of $K^T \mathcal{A} K$. However, we will show in the numerical results that it is able to significantly accelerate convergence.
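A serial sketch of the projected SGS application (54); here Acal is an explicit CSR matrix for illustration, whereas the production kernel performs the sweep matrix-free (see Section 4):

```python
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve_triangular

def sgs_apply(Acal, r):
    """M_SGS^{-1} r = (D + L^T)^{-1} D (D + L)^{-1} r, Eq. (52)."""
    D = Acal.diagonal()
    y = spsolve_triangular(sp.tril(Acal, format="csr"), r, lower=True)
    return spsolve_triangular(sp.triu(Acal, format="csr"), D * y, lower=False)

def projected_sgs(Acal, apply_Pi, r):
    """Apply Eq. (54), sandwiching the sweep between projections."""
    return apply_Pi(sgs_apply(Acal, apply_Pi(r)))
```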
4 Efficient parallel implementation
Energy minimization has historically been considered computationally expensive for AMG setup in real applications. Nevertheless, a cost-effective implementation is possible, but it requires special care in the parallelization of the algorithm. In this work, we build our AMG preconditioner with Chronos [15, 13], which provides the standard numerical kernels usually required in a distributed memory AMG implementation, such as the communication of ghost unknowns, coarse grid construction (e.g., computing the PMIS C-F partition [11]), the sparse matrix-vector product, the sparse matrix-matrix product, etc. For energy minimization, however, we developed three specific kernels that are not required in other AMG approaches but are critical for an efficient parallel implementation here: the sparse product between $\mathcal{A}$ and $\mathbf{p}$, the application of the projection $\Pi$, and symmetric Gauss-Seidel with $\mathcal{A}$. We do not list Jacobi preconditioning, because it simply consists of a diagonal scaling.
The first issue related to the product by $\mathcal{A}$ is that, in practical applications, $\mathcal{A}$ cannot be stored. In fact, if we consider a prolongation having $c$ non-zeroes per row on average, the matrix $\mathcal{A}$ would be of size $\operatorname{nnz}(P) \times \operatorname{nnz}(P)$ with about $a \cdot \operatorname{nnz}(P)$ non-zeroes to be stored, where $a$ is the average number of non-zeroes per row of the diagonal blocks $A_j$. Often, $a \gg c$, and storing $\mathcal{A}$ becomes several times more expensive than storing $A$. For instance, in practical elasticity problems, $\mathcal{A}$ can be up to 20 times larger than $A$, making it unavoidable to proceed in matrix-free mode for $\mathcal{A}$. Fortunately, the special structure of $\mathcal{A}$ allows us to interpret the product $\mathcal{A} \mathbf{p}$ as a sparse matrix-matrix (SpMM) product between $A$ and $P$, but with a fixed, prescribed pattern on the result. This property can be easily derived from the definition of $\mathcal{A}$ in Eq. 16 and the vector form $\mathbf{p}$ of the prolongation. One advantage of prescribing a fixed pattern is that the amount of data exchanged in SpMM is greatly reduced. First of all, the sparsity pattern adjacency information of $P$ can be communicated only once before entering PCG; then, for all the successive products, only the entries of $P$ are exchanged. Moreover, all the buffers to receive and send messages through the network can be allocated and set up only once at the beginning and removed at the end of the minimization process. In practice, we find that these optimizations reduce the cost of SpMM by about 50%.
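In serial form, the kernel is equivalent to the following reference sketch (ours): compute $A P$ and keep only the prescribed pattern of $P$. The production implementation never forms the unmasked product and computes only the retained entries.

```python
import numpy as np
import scipy.sparse as sp

def fixed_pattern_spmm(A, P):
    """Reference version of the A*P product masked onto pattern(P), CSR inputs."""
    mask = (P != 0).astype(np.int8)    # prescribed, fixed sparsity pattern
    AP = (A @ P).multiply(mask)        # drop all fill outside pattern(P)
    return sp.csr_matrix(AP)
```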
By contrast, the construction and application of $\Pi$ does not require any communication. In fact, we distribute the prolongation row-wise among processes and, since $\mathcal{B}$ is block diagonal with each block corresponding to a row of $P$ as shown in (13), we distribute the blocks to the process owning the respective row of $P$. Following this scheme, each block of $\mathcal{B}$ is factorized independently to obtain the corresponding block of $Q$, which is efficiently processed by the LAPACK routine DGEMV when applying $\Pi$.
Finally, we emphasize that there is no parallel bottleneck in the application of symmetric Gauss-Seidel with $\mathcal{A}$. This is because $\mathcal{A}$ is block diagonal and, at least in principle, each block can be assigned to a different process. However, as in the product by $\mathcal{A}$, the main issue is the large size of $\mathcal{A}$, which prevents explicit storage. As above, we rely on the equivalence between the product $\mathcal{A} \mathbf{p}$ and the product $A P$ with a prescribed sparsity pattern. The symmetric Gauss-Seidel step is then performed matrix-free, similarly to the product by $\mathcal{A}$, again saving a considerable amount of data exchange because it is not necessary to communicate adjacency information or reallocate communication buffers. As expected, the symmetric Gauss-Seidel application exhibits a computational cost comparable to that of the product by $\mathcal{A}$.
5 Numerical experiments
In this section, we investigate the improved convergence offered by energy minimization and the resulting computational performance and parallel efficiency for real-world applications. This numerical section is divided into three parts: a detailed analysis on the prolongation energy reduction and related AMG convergence for two medium-size matrices, a weak scalability test for an elasticity problem on a uniformly refined cube, and a comparison study for a set of large real world matrices representing fluid dynamics and mechanical problems.
As a reference for the proposed approach, we compare with the well-known open source solver GAMG [1], a smoothed aggregation-based AMG code from PETSc. In both cases, Chronos and GAMG are used as preconditioners for the conjugate gradient (PCG) method. The GAMG set-up is tuned for each problem starting from its default parameters, as suggested by the user guide [30]. As smoother, we have chosen the most powerful option available in Chronos and GAMG, that is, FSAI and Chebyshev accelerated Jacobi, respectively. Each time such default parameters are modified, we report the chosen values in the relevant section. Since all of the experiments are run in parallel, the system matrices are partitioned with ParMETIS [18] before the AMG preconditioner set-up to reduce communication overhead.

All numerical experiments have been run on the Marconi100 supercomputer located at the Italian consortium for supercomputing (CINECA). Marconi100 consists of 980 nodes based on the IBM Power9 architecture, each equipped with two 16-core IBM POWER9 AC922 @3.1 GHz processors. For each test, the number of nodes, $n_{nodes}$, is selected so that each node has approximately 640,000,000 nonzero entries and, consequently, the number of nodes is problem dependent. Each node is always fully exploited by using 32 MPI tasks, i.e., each task (core) has an average load of 20,000,000 nonzero entries. The number of cores is denoted by $n_{cores}$. Only in the smaller cases of the weak scalability analysis are nodes partially used (i.e., with fewer than 32 MPI tasks). Even though Chronos can exploit hybrid MPI-OpenMP parallelism, for the sake of comparison, it is convenient to use just one thread, i.e., pure MPI parallelism. Moreover, for such a high load per core, we do not find that fine-grained OpenMP parallelism is of much help.

The numerical results are presented in terms of the total number of computational cores used, $n_{cores}$; the grid and operator complexities, $C_{gd}$ and $C_{op}$, respectively; the number of iterations, $n_{it}$; and the setup, solve, and total times, $T_{setup}$, $T_{sol}$, and $T_{tot} = T_{setup} + T_{sol}$, respectively. For all the test cases, the right-hand side is a unit vector. The linear systems are solved with PCG and a zero initial guess, and convergence is achieved when the 2-norm of the iterative residual drops below 8 orders of magnitude with respect to the 2-norm of the right-hand side.
5.1 Analysis of the energy minimization process
We use two matrices for studying prolongation energy reduction, Cube and Pflow742 [20]. While the former is quite simple, as it is the fourth refinement level of the linear elasticity cube used in the weak scalability study, the latter arises from a 3D simulation of the pressure-temperature field in a multilayered porous medium discretized by hexahedral finite elements. The main source of ill-conditioning here is the large contrast in the material properties of the different layers. The dimensions of Cube are 1,778,112 rows and 78,499,998 entries, with 44.15 entries per row on average. The size of Pflow742 is 742,793 rows and 37,138,461 entries, for an average of 50.00 entries per row.
The overall solver performance is compared against the energy reduction for the fine-level prolongation when using the energy minimization Algorithm 3 with either Jacobi or Gauss-Seidel as the preconditioner. Results are analyzed in terms of computational costs and times. The main algorithmic features we want to analyze are:

- how the prolongation energy reduction affects AMG convergence; and
- the effectiveness of the preconditioner, i.e., Jacobi or Gauss-Seidel.
The energy minimization iteration count (Algorithm 3) is denoted by $n_{EMIN}$. The relative energy reduction of equation (41) is used to monitor the restricted PCG convergence of Algorithm 3. $T_{EMIN}$ is the time spent to improve the prolongation, with either classical prolongation smoothing with weighted Jacobi (SMOOTHED) or the energy minimization process. Note that $T_{EMIN}$ is only part of $T_{setup}$.

Table 1 shows the results for Cube. First, it can be seen that the energy minimization algorithm produces prolongation operators which lead to somewhat lower complexities than the smoothed case. Moreover, energy minimization builds more effective prolongation operators overall, since the global iteration count ($n_{it}$) is lower. As the energy minimization iteration count increases, we can observe that $n_{it}$ decreases, while the setup time ($T_{setup}$) increases. For this simple problem, the optimal point in terms of total time ($T_{tot}$) is reached using 2 iterations of Jacobi (EMIN-J). Fig. 2a further shows that 2 iterations of Jacobi already reach close to the achievable energy minimum. Fig. 3a compares the cost in wall-clock seconds of each energy minimization iteration using different preconditioners. For this case and implementation, Jacobi is more efficient.
Table 1: Results for the Cube matrix with smoothed prolongation and energy minimization (Jacobi and Gauss-Seidel preconditioning).

| Prolongation | $n_{EMIN}$ | $C_{gd}$ | $C_{op}$ | $n_{it}$ | $T_{setup}$ [s] | $T_{sol}$ [s] | $T_{tot}$ [s] | $T_{EMIN}$ [s] |
|---|---|---|---|---|---|---|---|---|
| SMOOTHED | - | 1.075 | 1.648 | 58 | 52.1 | 13.4 | 65.5 | 6.3 |
| EMIN-J | 1 | 1.075 | 1.592 | 54 | 47.6 | 11.2 | 58.8 | 11.3 |
| EMIN-J | 2 | 1.075 | 1.589 | 32 | 50.9 | 6.7 | 57.6 | 14.3 |
| EMIN-J | 4 | 1.075 | 1.589 | 27 | 57.3 | 5.7 | 63.0 | 20.0 |
| EMIN-GS | 1 | 1.075 | 1.587 | 28 | 55.5 | 6.0 | 61.5 | 19.3 |
| EMIN-GS | 2 | 1.075 | 1.588 | 26 | 65.6 | 5.5 | 71.1 | 28.8 |
| EMIN-GS | 4 | 1.075 | 1.589 | 26 | 84.4 | 5.6 | 90.0 | 47.7 |
[Figure 2: Relative prolongation energy versus energy minimization iteration for (a) Cube and (b) Pflow742.]

[Figure 3: Wall-clock time per energy minimization iteration with Jacobi and Gauss-Seidel preconditioning for (a) Cube and (b) Pflow742.]
Similar conclusions can be drawn for the Pflow742 case, as $n_{it}$ monotonically decreases as the energy associated with the prolongation operator is reduced. As reported in Table 2, the optimal total time ($T_{tot}$) is obtained with 4 Jacobi iterations (EMIN-J). Fig. 2b shows that the energy of the prolongation operator decreases more slowly than for Cube and that Jacobi converges significantly more slowly than Gauss-Seidel. However, as reported in Fig. 3b, the cost of Gauss-Seidel is still higher than that of Jacobi, although the performance difference is smaller than for Cube.
Table 2: Results for the Pflow742 matrix with smoothed prolongation and energy minimization (Jacobi and Gauss-Seidel preconditioning).

| Prolongation | $n_{EMIN}$ | $C_{gd}$ | $C_{op}$ | $n_{it}$ | $T_{setup}$ [s] | $T_{sol}$ [s] | $T_{tot}$ [s] | $T_{EMIN}$ [s] |
|---|---|---|---|---|---|---|---|---|
| SMOOTHED | - | 1.061 | 1.465 | 369 | 23.0 | 29.8 | 52.8 | 2.3 |
| EMIN-J | 1 | 1.061 | 1.339 | 377 | 20.4 | 27.3 | 47.7 | 2.9 |
| EMIN-J | 2 | 1.061 | 1.344 | 270 | 21.5 | 19.6 | 41.1 | 3.7 |
| EMIN-J | 4 | 1.062 | 1.352 | 219 | 23.3 | 15.7 | 39.0 | 5.4 |
| EMIN-J | 8 | 1.062 | 1.360 | 189 | 26.2 | 13.8 | 40.0 | 8.1 |
| EMIN-GS | 1 | 1.061 | 1.346 | 276 | 23.0 | 19.9 | 42.9 | 5.3 |
| EMIN-GS | 2 | 1.062 | 1.352 | 221 | 26.2 | 16.0 | 42.2 | 8.2 |
| EMIN-GS | 4 | 1.062 | 1.363 | 184 | 31.7 | 14.3 | 46.0 | 13.8 |
| EMIN-GS | 8 | 1.063 | 1.367 | 183 | 42.9 | 13.8 | 56.7 | 24.8 |
Regarding algorithmic complexity, each energy minimization iteration with Gauss-Seidel should cost exactly twice as much as an iteration with Jacobi and, from the above tests, Gauss-Seidel is able to reduce the energy more than twice as fast as Jacobi. Thus, Gauss-Seidel should be cheaper than Jacobi. Unfortunately, this is not confirmed by our numerical experiments, likely due to a sub-optimal parallel implementation of the Gauss-Seidel preconditioner. Although the block diagonal structure of $\mathcal{A}$ theoretically allows for a perfectly parallel implementation, our Gauss-Seidel implementation requires more communication and synchronization stages than Jacobi, leading to results where the Jacobi preconditioner is faster. A more cost-effective implementation of Gauss-Seidel will be the focus of future work. For now, Jacobi is chosen as our default preconditioner and is used in all subsequent cases.

Finally, we observe that a relative energy reduction of one order of magnitude generally gives the best trade-off between set-up time and AMG convergence. Therefore, we use $\epsilon = 10^{-1}$ in Eq. 41 as the default.
5.2 Weak scalability test
Here, we carry out a weak scalability study of energy minimization AMG for linear elasticity on a unit cube at different levels of refinement. The unit cube is discretized by regular tetrahedral elements, the material is homogeneous, and all displacements are prevented on a square region of the boundary. The mesh sizes are chosen such that each subsequent refinement produces about twice the number of degrees of freedom of the previous mesh. The problem sizes range from 222k rows to 124M rows, with an average entries-per-row ratio of 44.47.

Two sets of AMG parameters are used with Chronos: the first set targets a constant PCG iteration count, while the second minimizes the total solution time ($T_{tot}$). The first section of Table 3 provides the outcome of the first test. The iteration count increases very slowly, from 23 to 33, while the problem size increases by a factor of about 560. However, this nearly optimal scaling comes with relatively large complexities. As a consequence, total times are also relatively large, especially the setup time ($T_{setup}$). The time for energy minimization ($T_{EMIN}$) scales quite well, however, with only a factor of 2 difference between the first and last refinement levels.

Next, to reduce the setup and solution times, we increase the AMG strength threshold [27] to allow for lower complexities. The second section of Table 3 collects the new outputs. It can be seen that the relative trends are the same as in the previous tests, but the timings are lower (by 30% on average). The reduced complexity is thus advantageous in providing faster wall-clock times, even at the expense of more iterations ($n_{it}$).
Table 3: Weak scalability on the linear elasticity cube: energy minimization with two parameter sets, and GAMG.

| solver | nrows | $n_{cores}$ | $n_{nodes}$ | $C_{gd}$ | $C_{op}$ | $n_{it}$ | $T_{setup}$ [s] | $T_{sol}$ [s] | $T_{tot}$ [s] | $T_{sol}/n_{it}$ [s] |
|---|---|---|---|---|---|---|---|---|---|---|
| energy minimization (minimal iteration count) | 222k | 1 | 1 | 1.077 | 1.540 | 23 | 28.7 | 2.7 | 31.4 | 0.12 |
| | 447k | 2 | 1 | 1.076 | 1.559 | 24 | 33.0 | 2.9 | 35.9 | 0.12 |
| | 902k | 4 | 1 | 1.076 | 1.580 | 26 | 36.1 | 3.4 | 39.5 | 0.13 |
| | 1,778k | 8 | 1 | 1.075 | 1.596 | 27 | 39.0 | 3.6 | 42.7 | 0.13 |
| | 3,675k | 16 | 1 | 1.075 | 1.610 | 28 | 43.3 | 4.1 | 47.4 | 0.15 |
| | 7,546k | 32 | 1 | 1.075 | 1.620 | 29 | 53.9 | 5.3 | 59.2 | 0.18 |
| | 15,533k | 64 | 2 | 1.075 | 1.630 | 30 | 61.2 | 5.9 | 67.2 | 0.20 |
| | 31,081k | 128 | 4 | 1.075 | 1.670 | 30 | 74.9 | 6.5 | 81.4 | 0.22 |
| | 62,391k | 256 | 8 | 1.075 | 1.642 | 32 | 72.8 | 6.8 | 79.7 | 0.21 |
| | 124,265k | 512 | 16 | 1.075 | 1.646 | 33 | 88.8 | 7.8 | 96.7 | 0.24 |
| energy minimization (best solution time) | 222k | 1 | 1 | 1.047 | 1.252 | 31 | 19.5 | 3.1 | 22.6 | 0.10 |
| | 447k | 2 | 1 | 1.047 | 1.260 | 31 | 22.6 | 3.3 | 25.9 | 0.11 |
| | 902k | 4 | 1 | 1.047 | 1.269 | 34 | 24.4 | 3.8 | 28.2 | 0.11 |
| | 1,778k | 8 | 1 | 1.047 | 1.274 | 34 | 27.4 | 4.0 | 31.4 | 0.12 |
| | 3,675k | 16 | 1 | 1.047 | 1.279 | 37 | 30.6 | 4.7 | 35.3 | 0.13 |
| | 7,546k | 32 | 1 | 1.047 | 1.283 | 38 | 37.8 | 6.1 | 43.9 | 0.16 |
| | 15,533k | 64 | 2 | 1.046 | 1.286 | 49 | 44.3 | 8.5 | 52.8 | 0.17 |
| | 31,081k | 128 | 4 | 1.046 | 1.289 | 41 | 45.0 | 7.5 | 52.5 | 0.18 |
| | 62,391k | 256 | 8 | 1.046 | 1.291 | 43 | 48.4 | 8.6 | 57.0 | 0.20 |
| | 124,265k | 512 | 16 | 1.046 | 1.292 | 51 | 52.5 | 10.8 | 63.3 | 0.21 |
| GAMG (best solution time) | 222k | 1 | 1 | N/A | 1.479 | 58 | 11.19 | 13.72 | 24.92 | 0.24 |
| | 447k | 2 | 1 | N/A | 1.503 | 69 | 10.02 | 17.12 | 27.14 | 0.25 |
| | 902k | 4 | 1 | N/A | 1.526 | 73 | 11.41 | 19.19 | 30.59 | 0.26 |
| | 1,778k | 8 | 1 | N/A | 1.549 | 78 | 12.63 | 21.05 | 33.68 | 0.27 |
| | 3,675k | 16 | 1 | N/A | 1.571 | 82 | 14.83 | 24.86 | 39.68 | 0.30 |
| | 7,546k | 32 | 1 | N/A | 1.608 | 86 | 23.04 | 35.52 | 58.56 | 0.41 |
| | 15,533k | 64 | 2 | N/A | 1.618 | 93 | 27.15 | 39.33 | 66.48 | 0.42 |
| | 31,081k | 128 | 4 | N/A | 1.703 | 97 | 37.48 | 41.58 | 79.06 | 0.43 |
| | 62,391k | 256 | 8 | N/A | 1.803 | 98 | 58.98 | 43.37 | 102.35 | 0.44 |
| | 124,265k | 512 | 16 | N/A | 1.953 | 100 | 119.16 | 50.29 | 169.45 | 0.50 |
For comparison, the same set of problems is next solved with GAMG from PETSc. The default values for all parameters are used, as they turned out to be already the best. The only exception is the threshold for dropping edges in the aggregation graph ($\theta$), whose value is reported alongside the results. The third section of Table 3 shows the output for GAMG. The complexities and run-times are higher than for Chronos, especially for the setup stage. The iteration counts are usually between 2 and 3 times larger than those required by Chronos. For this test problem, reducing complexity through a larger $\theta$ is not beneficial, as the increase in the iteration count cancels the set-up time reduction. We also note that both the operator complexity and set-up time increase significantly with the refinement level, while energy minimization better limits the growth of these quantities. Comparing two codes is fraught with difficulty, but these results do indicate that energy minimization is an efficient approach in parallel for this problem.
5.3 Challenging Real World Problems
We now examine the proposed energy minimization approach for a set of challenging real-world problems arising from discretized PDEs in both fluid dynamics and mechanics. The former class consists of problems from the discretization of Laplace-type operators, such as underground fluid flow, compressible or incompressible airflow around complex geometries, or porous media flow. The latter class consists of mechanical applications such as subsidence analysis, hydrocarbon recovery, gas storage (geomechanics), mesoscale simulation of composite materials (mesoscale), mechanical deformation of human tissues or organs subjected to medical interventions (biomedicine), and the design and analysis of mechanical elements, e.g., cutters, gears, air-coolers (mechanical). The selected problems are not only large but also characterized by severe ill-conditioning due to mesh distortions, material heterogeneity, and anisotropy. They are listed in Table 4 with details about the size, the number of nonzeros, and the application field from which they arise.
Table 4: Real-world benchmark matrices: size, number of nonzeroes, and application field.

| matrix | nrows | nterms | avg nt/row | application |
|---|---|---|---|---|
| guenda11m | 11,452,398 | 512,484,300 | 44.75 | geomechanics |
| agg14m | 14,106,408 | 633,142,730 | 44.88 | mesoscale |
| tripod20m | 19,798,056 | 871,317,864 | 44.01 | mechanical |
| M20 | 20,056,050 | 1,634,926,088 | 81.52 | mechanical |
| wing | 33,654,357 | 2,758,580,899 | 81.97 | mechanical |
| Pflow73m | 73,623,733 | 2,201,828,891 | 29.91 | reservoir |
| c4zz134m | 134,395,551 | 10,806,265,323 | 80.41 | biomedicine |
Table 5 reports the AMG performance on these benchmarks when using the energy minimization procedure, classical prolongation smoothing (one step of weighted Jacobi on the tentative prolongation), and GAMG, respectively. The overall best time is highlighted in boldface. As before, please note that for GAMG only the threshold value for dropping edges in the aggregation graph ($\theta$) has been tuned. All the other parameters are used with their default values, as they were already the best.

With respect to classical prolongation smoothing, the energy minimization procedure is able to reduce the complexities, in particular $C_{op}$, the setup time $T_{setup}$, and also the iteration count (with the only exception being Pflow73m). It is the reduced complexity that allows energy minimization to achieve the lower setup time. A significant gain in total time is obtained for all test cases.

Energy minimization also compares favorably with GAMG. GAMG provides a faster total time than energy minimization-based AMG in two cases out of seven, agg14m and M20, and a similar total time on tripod20m. The situation is reversed on the most challenging examples, where GAMG is significantly slower and is unable to solve Pflow73m. We briefly note that, unexpectedly, the ParMETIS partitioning significantly harms GAMG effectiveness on c4zz134m. For this case, we report GAMG performance on the matrix with its native ordering and mark this test with a '*'. Typically, the GAMG set-up time is lower than that of energy minimization, but energy minimization produces a preconditioner of higher quality, which significantly reduces the total time in the most difficult cases. Fig. 4 collects all these results, reporting the relative total times. The setup and solve phases are denoted by different shading.
Table 5: Performance on the real-world benchmarks with energy minimization, smoothed prolongation, and GAMG. The best total time for each case is in boldface.

| solver | case | $n_{nodes}$ | $\theta$ | $C_{gd}$ | $C_{op}$ | $n_{it}$ | $T_{setup}$ [s] | $T_{sol}$ [s] | $T_{tot}$ [s] |
|---|---|---|---|---|---|---|---|---|---|
| energy minimization | guenda11m | 1 | N/A | 1.041 | 1.325 | 987 | 314.0 | 399.0 | **713.0** |
| | agg14m | 1 | N/A | 1.042 | 1.322 | 23 | 66.7 | 7.2 | 74.0 |
| | tripod20m | 2 | N/A | 1.049 | 1.302 | 104 | 40.3 | 22.2 | **62.6** |
| | M20 | 2 | N/A | 1.055 | 1.304 | 111 | 98.0 | 40.2 | 138.0 |
| | wing | 8 | N/A | 1.055 | 1.297 | 140 | 47.2 | 25.3 | **72.5** |
| | Pflow73m | 4 | N/A | 1.028 | 1.101 | 1169 | 225.0 | 424.0 | **649.0** |
| | c4zz134m | 8 | N/A | 1.029 | 1.122 | 154 | 72.7 | 48.8 | **122.0** |
| smoothed prolongation | guenda11m | 1 | N/A | 1.041 | 1.378 | 1771 | 307.0 | 750.0 | 1060.0 |
| | agg14m | 1 | N/A | 1.042 | 1.371 | 48 | 62.5 | 15.5 | 78.1 |
| | tripod20m | 2 | N/A | 1.048 | 1.336 | 212 | 34.6 | 47.5 | 82.2 |
| | M20 | 2 | N/A | 1.054 | 1.733 | 154 | 167.0 | 76.6 | 244.0 |
| | wing | 8 | N/A | 1.055 | 1.697 | 301 | 93.5 | 71.6 | 165.1 |
| | Pflow73m | 4 | N/A | 1.058 | 1.371 | 841 | 441.3 | 394.5 | 836.0 |
| | c4zz134m | 8 | N/A | 1.028 | 1.199 | 277 | 79.2 | 98.9 | 178.0 |
| GAMG | guenda11m | 1 | 0.00 | N/A | 1.524 | 2237 | 22.3 | 1553.9 | 1576.1 |
| | agg14m | 1 | 0.00 | N/A | 1.557 | 33 | 20.6 | 32.2 | **52.7** |
| | tripod20m | 2 | 0.01 | N/A | 1.679 | 48 | 32.2 | 32.2 | 64.4 |
| | M20 | 2 | 0.01 | N/A | 1.203 | 60 | 36.7 | 62.4 | **99.1** |
| | wing | 8 | 0.01 | N/A | 1.204 | 250 | 34.0 | 108.4 | 142.4 |
| | Pflow73m | 4 | — | N/A | — | — | — | — | — |
| | c4zz134m* | 8 | 0.01 | N/A | 1.233 | 156 | 110.9 | 250.38 | 361.33 |
[Figure 4: Relative total times for the real-world benchmarks; the setup and solve phases are denoted by different shading.]
6 Conclusions
This work provides evidence of the potential of a novel energy minimization procedure for constructing the prolongation operator in AMG. While the theoretical advantages of this idea are well known in the literature, the computational aspects and its practical feasibility in a massively parallel software environment were still open questions.

With this contribution, we have highlighted how the energy minimization approach can be effectively implemented in a classical AMG setting, leading to robust and cost-effective prolongation operators when compared to other approaches. It is also shown how, especially on challenging problems, this technique can lead to considerably faster AMG convergence. The prolongation energy can be minimized with several schemes; we have adopted a restricted conjugate gradient method, accelerated by suitable preconditioners. We presented and analyzed Jacobi and Gauss-Seidel preconditioners, restricting our attention to these two because of their applicability in matrix-free mode. The experiments show that Gauss-Seidel has the potential to be faster than Jacobi; however, its efficient parallel implementation is not straightforward and needs more attention.

Weak scalability has been assessed on an elasticity model problem by discretizing a cube with regular tetrahedra. The proposed algorithms have been implemented in a hybrid MPI-OpenMP linear solver, and its performance has been compared to another well-recognized AMG implementation (GAMG) using a set of large and difficult problems arising from real-world applications.

In the future, we plan to further reduce set-up time by investigating other preconditioning techniques, such as those described in [17], and to extend this approach to non-symmetric problems as well.
References
- [1] S. Balay, S. Abhyankar, M. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, A. Dener, V. Eijkhout, W. Gropp, et al., PETSc users manual, (2019).
- [2] M. Benzi, G. H. Golub, and J. Liesen, Numerical solution of saddle point problems, Acta Numer., 14 (2005), pp. 1–137.
- [3] A. Brandt, J. Brannick, K. Kahl, and I. Livshits, Bootstrap AMG, SIAM Journal on Scientific Computing, 33 (2011), pp. 612–632.
- [4] A. Brandt, J. Brannick, K. Kahl, and I. Livshits, Algebraic distance for anisotropic diffusion problems: multilevel results, Electronic Transactions on Numerical Analysis, 44 (2015), pp. 472–496.
- [5] A. Brandt, S. F. McCormick, and J. W. Ruge, Algebraic multigrid (AMG) for automatic multigrid solution with application to geodetic computations, tech. rep., Institute for Computational Studies, Colorado State University, 1982.
- [6] A. Brandt, S. F. McCormick, and J. W. Ruge, Algebraic multigrid (AMG) for sparse matrix equations, in Sparsity and Its Applications, D. J. Evans, ed., Cambridge Univ. Press, Cambridge, 1984, pp. 257–284.
- [7] J. Brannick, F. Cao, K. Kahl, R. D. Falgout, and X. Hu, Optimal interpolation and compatible relaxation in classical algebraic multigrid, SIAM Journal on Scientific Computing, 40 (2018), pp. A1473–A1493.
- [8] J. J. Brannick and R. D. Falgout, Compatible Relaxation and Coarsening in Algebraic Multigrid, SIAM Journal on Scientific Computing, 32 (2010), pp. 1393–1416.
- [9] M. Brezina, R. Falgout, S. MacLachlan, T. Manteuffel, S. McCormick, and J. Ruge, Adaptive Smoothed Aggregation (SA) Multigrid, SIAM Rev., 47 (2005), pp. 317–346.
- [10] W. L. Briggs, V. E. Henson, and S. F. McCormick, A multigrid tutorial, SIAM, Philadelphia, PA, USA, 2nd ed., 2000.
- [11] H. De Sterck, U. M. Yang, and J. J. Heys, Reducing complexity in parallel algebraic multigrid preconditioners, SIAM Journal on Matrix Analysis and Applications, 27 (2006), pp. 1019–1039.
- [12] R. D. Falgout and P. S. Vassilevski, On generalizing the algebraic multigrid framework, SIAM J. Numer. Anal., 42 (2004), pp. 1669–1693.
- [13] M. Frigo, G. Isotton, and C. Janna, Chronos web page. https://www.m3eweb.it/chronos, 2021.
- [14] S. A. Goreinov, I. V. Oseledets, D. Savostyanov, E. E. Tyrtyshnikov, and N. L. Zamarashkin, How to find a good submatrix, tech. rep., Nov. 2008.
- [15] G. Isotton, M. Frigo, N. Spiezia, and C. Janna, Chronos: a general purpose classical AMG solver for high performance computing, SIAM J. Sci. Comput., 43 (2021), pp. C335–C357.
- [16] D. E. Knuth, Semi-optimal bases for linear dependencies, Linear and Multilinear Algebra, 17 (1985), pp. 1–4.
- [17] , Solving linear systems of the form by preconditioned iterative methods, Linear and Multilinear Algebra, (2022).
- [18] Karypis Lab, ParMETIS - Parallel Graph Partitioning and Fill-reducing Matrix Ordering. http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview, 2022.
- [19] S. MacLachlan, T. Manteuffel, and S. McCormick, Adaptive reduction-based AMG, Numerical Linear Algebra with Applications, 13 (2006), pp. 599–620.
- [20] V. A. P. Magri, A. Franceschini, and C. Janna, A novel algebraic multigrid approach based on adaptive smoothing and prolongation for ill-conditioned systems, SIAM Journal on Scientific Computing, 41 (2019), pp. A190–A219.
- [21] J. Mandel, M. Brezina, and P. Vaněk, Energy optimization of algebraic multigrid bases, Computing, 62 (1999), pp. 205–228.
- [22] T. A. Manteuffel, S. Münzenmaier, J. Ruge, and B. Southworth, Nonsymmetric reduction-based algebraic multigrid, SIAM Journal on Scientific Computing, 41 (2019), pp. S242–S268.
- [23] T. A. Manteuffel, L. N. Olson, J. B. Schroder, and B. S. Southworth, A root-node-based algebraic multigrid method, SIAM J. Sci. Comput., 39 (2017), pp. S723–S756.
- [24] T. A. Manteuffel, J. Ruge, and B. S. Southworth, Nonsymmetric algebraic multigrid based on local approximate ideal restriction ($\ell$AIR), SIAM Journal on Scientific Computing, 40 (2018), pp. A4105–A4130.
- [25] L. N. Olson, J. Schroder, and R. S. Tuminaro, A new perspective on strength measures in algebraic multigrid, Numerical Linear Algebra with Applications, 17 (2010), pp. 713–733.
- [26] L. N. Olson, J. B. Schroder, and R. S. Tuminaro, A general interpolation strategy for algebraic multigrid using energy minimization, SIAM J. Sci. Comput., 33 (2011), pp. 966–991.
- [27] J. W. Ruge and K. Stüben, Algebraic multigrid (AMG), in Multigrid Methods, S. F. McCormick, ed., Frontiers Appl. Math., SIAM, Philadelphia, 1987, pp. 73–130.
- [28] M. Sala and R. S. Tuminaro, A new Petrov-Galerkin smoothed aggregation preconditioner for nonsymmetric linear systems, SIAM J. Sci. Comput., 31 (2008), pp. 143–166.
- [29] U. Trottenberg, C. Oosterlee, and A. Schüller, Multigrid, Academic Press, London, UK, 2001.
- [30] UChicago Argonne, LLC and the PETSc Development Team, PCGAMG. https://petsc.org/main/docs/manualpages/PC/PCGAMG/index.html, 2022.
- [31] P. Vaněk, J. Mandel, and M. Brezina, Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems, Computing, 56 (1996), pp. 179–196.
- [32] P. S. Vassilevski, Multilevel block factorization preconditioners: Matrix-based analysis and algorithms for solving finite element equations, Springer Science & Business Media, 2008.
- [33] W. L. Wan, T. F. Chan, and B. Smith, An energy-minimizing interpolation for robust multigrid methods, SIAM J. Sci. Comput., 21 (2000), pp. 1632–1649.
- [34] J. Xu and L. Zikatanov, Algebraic multigrid methods, Acta Numerica, 26 (2017), pp. 591–721.
- [35] X. Xu and C.-S. Zhang, On the ideal interpolation operator in algebraic multigrid methods, SIAM Journal on Numerical Analysis, 56 (2018), pp. 1693–1710.