GPU Accelerated Newton for Taylor Series Solutions
of Polynomial Homotopies in Multiple Double Precision
Supported by the National Science Foundation under grant DMS 1854513.
Abstract
A polynomial homotopy is a family of polynomial systems, typically in one parameter. Our problem is to compute power series expansions of the coordinates of the solutions in that parameter, accurately, using multiple double arithmetic. One application of this problem is the location of the nearest singular solution in a polynomial homotopy, via the theorem of Fabry. The power series serve as input to construct Padé approximants.
Exploiting the massive parallelism of Graphics Processing Units (GPUs), capable of performing several trillion floating-point operations per second, the objective is to compensate for the cost overhead caused by arithmetic with power series in multiple double precision. The application of Newton’s method for this problem requires the evaluation and differentiation of polynomials, followed by solving a blocked lower triangular linear system. Experimental results are obtained on NVIDIA GPUs, in particular the RTX 2080, RTX 4080, P100, V100, and A100.
Code generated by the CAMPARY software is used to obtain results in double double, quad double, and octo double precision. The programs in this study are self-contained, available in a public GitHub repository under the GPL-v3.0 License.
Keywords and phrases. Graphics Processing Unit (GPU), multiple double arithmetic, Newton’s method, numerical analytic continuation, Taylor series.
1 Introduction
Many problems in science and engineering require the solving of a system of polynomial equations in several variables. Writing polynomials backwards, starting at the lowest degree terms, corresponds to power series truncated at the degree of the polynomials. Homotopy methods [3] define families of polynomial systems which connect a system that must be solved to a system with known solutions. Continuation methods [4, 29] track the solution paths defined by the homotopy.
This paper is another step in accelerating a new robust path tracking algorithm [33], applying the linearization of [9] and extending the multicore implementation of [34]. As the robust path tracker of [33] scales well to track millions of paths without error, the context of this research effort is to scale the number of equations and variables of the systems. All code used in the experiments is publicly available in the GitHub repository of PHCpack [40], released under the GNU GPL license.
With multiple double arithmetic, Taylor series developments for the solution curves defined by polynomial homotopies can be computed accurately using Newton’s method. The need for multiprecision in analytic continuation can be traced back to [16], via the direct quote “Some reflection shows that in order to get a convergent process the early vectors (early with respect to ) must be computed more accurately than the late ones,” where the italics appear as in [16]. The early vectors refer to the earlier coefficients of the series. This quote appeared in abbreviated form in the recent paper [38].
The Taylor series coefficients are input to algorithms to construct Padé approximants [7], which are related to extrapolation methods [10], [32] and approximation algorithms [37]. An example of an application to electrical engineering is the holomorphic embedding load flow method to solve power flow problems. In the convergence study of [11], results are computed with 400 digits of precision, using [26].
In this paper, the acceleration with Graphics Processing Units (GPUs) is applied to compensate for the computational overhead caused by the multiple double arithmetic needed to compute Taylor series accurately. For this computation, Newton’s method is accelerated using kernels to differentiate polynomials at power series [42] and kernels for linear algebra operations [43]. Those kernels perform very well when given inputs of the proper size. In a combined application, the dimension of the linear algebra problem matches the number of variables of the polynomials, which has an impact on the overall performance.
1.1 Problem Statement
The two main concerns are performance and convergence. For performance, the input must be sufficiently large, but still well conditioned enough to allow Newton’s method to converge. In examining the scalability, we address the first key question: how much of the overhead can be compensated by GPU acceleration? Prior work showed that teraflop performance was achieved in the acceleration of the convolutions to evaluate and differentiate polynomials in several variables at power series [42] and in the acceleration of the blocked Householder QR [43] to solve linear systems in the least squares sense in multiple double precision. The differentiation adapts the reverse mode of algorithmic differentiation [14] with power series arithmetic. The acceleration of the blocked Householder QR [8] is explained in [22], and also addressed in [6] and [45]. Additional related work on multicore and accelerated QR can be found in [5], [2], [25], [35], and [36]. The acceleration of the back substitution algorithm applies the formulas of [15], developed further using the ideas of [31].
The second question concerns the combination of various kernels in the linearization of the power series. Of the various types of kernels that are launched, which types require the most time?
1.2 Multiprecision Arithmetic
A multiple double number is an unevaluated sum of nonoverlapping doubles. The renormalization and arithmetical operations are explained in [30].
MPLAPACK [28] supports quad double arithmetic and provides implementations of arbitrary multiprecision linear algebra operations. A recent application to matrix-matrix multiplication is in [39]. In [24], the 3M and 4M methods for complex multiplication are applied in linear algebra operations on complex matrices.
In this paper, for multiple double precision, the software libraries QDlib [17] and CAMPARY [20] are applied, customized as follows. Instead of working with an array of double double numbers, two arrays of doubles are used: the first array for the most significant doubles and the second one for the least significant doubles, as this memory layout benefits memory coalescing, especially for complex quad double and octo double numbers. The GPU version of the QDlib library [27] uses the double2 and double4 types of the CUDA SDK, which promote good memory access for double doubles and quad doubles, but are no longer suited for complex quad doubles or octo doubles.
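To make the layout concrete, below is a minimal sketch of this structure-of-arrays storage for a vector of double doubles; the type and field names are illustrative, not the identifiers used in PHCpack.

```cpp
#include <cstddef>
#include <vector>

// Illustration of the storage choice: a vector of n double doubles is kept as
// two separate arrays of doubles (structure of arrays), so that adjacent
// threads later access adjacent doubles.  Names are illustrative.
struct DoubleDoubleVector
{
    std::vector<double> hi;   // most significant doubles
    std::vector<double> lo;   // least significant doubles

    explicit DoubleDoubleVector ( std::size_t n ) : hi(n, 0.0), lo(n, 0.0) {}
};
```

For quad double and octo double numbers the same idea gives four and eight arrays of doubles, respectively.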
Multiple double precision is not true multiprecision in the sense that one cannot select any number of bits for the fraction. The other drawback is the limited size of the exponents (limited to the 11 bits of the 64-bit hardware doubles), which will prohibit the computation with infinitesimal values. In the context of GPU acceleration, recent work of [19] makes an interesting comparison with double double arithmetic: “The double double arithmetic of CAMPARY performs best for the problem of matrix-vector multiplication.” Concerning quad double precision, the authors of [19] write: “the CAMPARY library is faster than our implementation; however as the precision increases the execution time of CAMPARY also increases significantly.” The advantage of multiple double arithmetic is that simple counts of the number of floating-point operations quantify the cost overhead precisely and the flops metrics for performance are directly applicable.
The goal of applying GPU acceleration to offset the cost of multiple double precision arithmetic is related to the recent trend of mixed precision [1], [18], [21].
Multiprecision is often deemed meaningless on problems with noisy input data, but the problems of interest are solution paths highly sensitive to small changes in the continuation parameter, corresponding to Taylor series with a small convergence radius. Reconditioning of the Taylor series is proposed in [44].
1.3 Numerical Condition of Taylor Series
The convergence concern is closely related to the numerical conditioning of the problem of power series solutions. Consider the following classical result, applied in [33] to detect nearby singularities.
Theorem 1.1 (the ratio theorem of Fabry [12])
If for the series $\displaystyle f(t) = \sum_{n=0}^{\infty} c_n t^n$ we have $\displaystyle \lim_{n \rightarrow \infty} \frac{c_n}{c_{n+1}} = w$, then
- $w$ is a singular point of the series, and
- it lies on the boundary of the circle of convergence of the series.
Then the radius of the disk of convergence is $|w|$.
In a numerical interpretation of this theorem, observe that the smaller the convergence radius, the larger the growth in the coefficients, because then $|c_{n+1}| \approx |c_n|/|w|$. For example, if $|w|$ is 1/2, then for sufficiently large $n$, $|c_{n+1}| \approx 2\,|c_n|$. Thus, for series of order 64, we could observe coefficients of magnitude $2^{64} \approx 1.8 \cdot 10^{19}$. Therefore, for convergence, it is best to generate examples which have a unique power series solution and use decaying coefficients, as in the series developments of exponential functions.
In this paper Newton’s method is applied to compute Taylor series. An alternative is to apply Fourier series methods as done in analytic continuation, see e.g. [13]. These methods are very sensitive to a good choice of the step size for taking samples of the function to be differentiated. As explained in [13], a smaller step size benefits the lower order coefficients, while a larger step size may be needed to compute the higher order coefficients accurately.
The main experimental result of this research is that, on systems of 1024 equations in 1024 variables, doubling the precision from quad double to octo double increases the wall clock time by much less than the cost overhead factors of the multiple double arithmetical operations would predict, because the increase is absorbed by the time occupied by the kernels. As expected, with multiple double arithmetic the arithmetic intensity of the computations increases and the GPU becomes fully occupied.
The next section contains a high level description of Newton’s method on power series, using linearizations. The setup of the test problems with the levels of precision is justified in the third section. As defined in the fourth section, Newton’s method is executed using a staggered progression of the order of the power series. To predict the performance of the accelerated code, in section five the arithmetic intensities of different kernels are computed. Section six contains the results of the computational experiments.
2 Linearized Series and Newton’s Method
Instead of working with vectors and matrices of power series, we work with series that have as coefficients vectors and matrices. For example, for series of order 4, we solve $A(t)\,\mathbf{x}(t) = \mathbf{b}(t)$, with $n$-by-$n$ matrices $A_0$, $A_1$, $A_2$, $A_3$:

$$A(t) = A_0 + A_1 t + A_2 t^2 + A_3 t^3, \quad (1)$$
$$\mathbf{x}(t) = \mathbf{x}_0 + \mathbf{x}_1 t + \mathbf{x}_2 t^2 + \mathbf{x}_3 t^3, \quad (2)$$
$$\mathbf{b}(t) = \mathbf{b}_0 + \mathbf{b}_1 t + \mathbf{b}_2 t^2 + \mathbf{b}_3 t^3. \quad (3)$$

The linearization of $A(t)\,\mathbf{x}(t) = \mathbf{b}(t)$ leads to

$$\begin{array}{lcl}
A_0 \mathbf{x}_0 & = & \mathbf{b}_0, \\
A_1 \mathbf{x}_0 + A_0 \mathbf{x}_1 & = & \mathbf{b}_1, \\
A_2 \mathbf{x}_0 + A_1 \mathbf{x}_1 + A_0 \mathbf{x}_2 & = & \mathbf{b}_2, \\
A_3 \mathbf{x}_0 + A_2 \mathbf{x}_1 + A_1 \mathbf{x}_2 + A_0 \mathbf{x}_3 & = & \mathbf{b}_3. \quad (4)
\end{array}$$
Even as least squares solutions provide an accuracy close to machine precision, in the forward substitution of the solution of the lower triangular block Toeplitz system in (4), the errors in the computed coefficient vectors $\mathbf{x}_k$ propagate through the updates of the right hand side vectors $\mathbf{b}_k$. If we lose two decimal places of accuracy in each step, then for series of order four, up to eight decimal places may be lost in the last coefficient vector $\mathbf{x}_3$. A more extensive error analysis was made in [34].
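As an illustration of this forward substitution, consider a minimal host-side sketch in plain double precision; the solve with $A_0$ is passed in as a callable (all names are illustrative, and the accelerated code of [43] replaces this solve by the QR-based least squares stages):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<std::vector<double>>;   // dense n-by-n, row major

// Forward substitution on the block lower triangular Toeplitz system (4):
// after x[k] is computed, the right hand sides of all later equations are
// updated with the products A[j-k]*x[k].
std::vector<Vec> block_forward_substitution
 ( const std::vector<Mat>& A, std::vector<Vec> b,
   const std::function<Vec(const Vec&)>& solve_with_A0 )
{
    const std::size_t order = b.size();
    const std::size_t n = b[0].size();
    std::vector<Vec> x(order);

    for(std::size_t k = 0; k < order; ++k)
    {
        x[k] = solve_with_A0(b[k]);                // A[0]*x[k] = updated b[k]
        for(std::size_t j = k+1; j < order; ++j)   // update later right hand sides
            for(std::size_t i = 0; i < n; ++i)
                for(std::size_t c = 0; c < n; ++c)
                    b[j][i] -= A[j-k][i][c]*x[k][c];   // b[j] -= A[j-k]*x[k]
    }
    return x;
}
```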
Newton’s method takes as input a system of polynomials in several variables, with power series truncated to the same degree and produces a sequence of power series. As an operator, this version of Newton’s method can be considered as turning a problem in many variables where all variables are interdependent into a sequence of power series for each separate variable, thus removing the interdependencies among the variables. A high level description of Newton’s method is shown in the pseudo code below.
Input:  $f$, system with power series coefficients;
        $\mathbf{x}$, initial leading coefficients;
        $N$, the maximum number of iterations;
        $\epsilon$, the tolerance on the accuracy.
Output: $k$, the number of iterations;
        $\mathbf{x}$, if $k \leq N$, then $\|\Delta\mathbf{x}\| \leq \epsilon$.

for $k$ from 1 to $N$ do
    evaluate and differentiate $f$ at $\mathbf{x}$
    solve the linear system for the update $\Delta\mathbf{x}$
    $\mathbf{x} := \mathbf{x} + \Delta\mathbf{x}$
    exit when $\|\Delta\mathbf{x}\| \leq \epsilon$
    report the residual $\|f(\mathbf{x})\|$
    exit when $\|f(\mathbf{x})\| \leq \epsilon$
The two computationally intensive operations are the evaluation and differentiation of $f$ and the solving of the linear system for the update $\Delta\mathbf{x}$. The residual computation is for monitoring the convergence of the method and can be omitted.
The high level description does not incorporate the staggered nature of the updates. As will be explained in section 4 below, not all coefficient vectors of the series are involved in all stages.
3 Columns of Monomials
We start by considering polynomial systems of the form $\mathbf{x}^A = \mathbf{b}(t)$, where $A$ is an exponent matrix which contains in its rows the exponent vectors of the variables $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, and the right hand side $\mathbf{b}(t)$ is a vector of power series. For example:

(5)

where the solutions $x_k(t)$, $k = 1, 2, \ldots, n$, are of the form

(6)

with constants equal to $1$, $-1$, or random values on the unit circle. These constants introduce numerical variation in the coefficients of the solutions.
The choice of exponential series as solution series prevents the series in the right hand side vector of the system from becoming large. Considering the series expansion

$$\exp(t) = \sum_{k=0}^{\infty} \frac{t^k}{k!} \quad (7)$$

leads in Table 1 to a justification for multiprecision, based on the size of the last coefficient in the truncated series. The levels of precision proceed in powers of two, according to the expected quadratic convergence of Newton’s method.
d | 1/d! | choice of precision | eps
---|---|---|---
7 | 2.0e-004 | double precision | 2.2e-16
15 | 7.7e-013 | use double doubles | 4.9e-32
23 | 3.9e-023 | use double doubles |
31 | 1.2e-034 | use quad doubles | 6.1e-64
47 | 3.9e-060 | use octo doubles | 4.6e-128
63 | 5.0e-088 | use octo doubles |
95 | 9.7e-149 | need hexa doubles | 5.3e-256
127 | 3.3e-214 | need hexa doubles |
The “need hexa doubles” in Table 1 is because accelerating the least squares solving, evaluation and differentiation at power series with hexa double arithmetic is still a work in progress. The computations in this paper are therefore limited to series of order 64. While computations of larger orders are possible, the accuracy in octo double precision is no longer guaranteed.
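For reference, the magnitudes in the second column of Table 1 are easily reproduced; a minimal sketch, assuming (as the values suggest) that this column lists $1/d!$:

```cpp
#include <cmath>
#include <cstdio>

// Last coefficient 1/d! of the truncated series of exp(t),
// computed via lgamma to avoid overflow of d! itself.
int main ( void )
{
    const int degrees[] = {7, 15, 23, 31, 47, 63, 95, 127};

    for(int d : degrees)
    {
        const double last = std::exp(-std::lgamma(d + 1.0));  // 1/d!
        std::printf("d = %3d : 1/d! = %.1e\n", d, last);
    }
    return 0;
}
```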
While one column of monomials is sufficient for convergence and scalability investigations, consider the 2-column format of monomials

(8)

for two $n$-vectors $\alpha$ and $\beta$ and two exponent matrices $A$ and $B$. With the introduction of new variables, any polynomial system can be written in this 2-column format.
For the experiments with two columns of monomials, specific lower and upper triangular matrices of ones are used. For example, for $n = 4$:

$$A = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix},
\quad
B = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \quad (9)$$
Although the product of the degrees of this system is now much larger than for the systems defined by $A$ or $B$ separately, there is still only one solution, which allows for decaying coefficients in the power series and thus for coefficients of modest growth, benefiting the numerical conditioning of the problem.
That the 2-column system, for lower and upper triangular matrices $A$ and $B$ of ones as in (9), has only one solution can be seen by considering two particular equations: those two equations can be diagonalized, so for those two particular choices of $A$ and $B$ the system is equivalent to the one-column system, which has a unique solution series.
While the 2-column system thus has the same good numerical conditioning as the one-column system, it serves as a good test on the increased cost of evaluation and differentiation. For problems with many nearby singularities, it is recommended to multiply the parameter by a damping factor to reduce the growth of the coefficients in the series, according to the numerical interpretation of Theorem 1.1.
4 Staggered Computations
In computing the coefficient vectors of the series, not all of them need to be involved in every stage. We start with the leading coefficients correct to at least half the working precision, otherwise Newton’s method may not converge. The first iteration consists of getting the leading coefficients correct to the full working precision. If Newton’s method would not converge for order zero, then there is no use in increasing the order. The order is then increased gradually; for example, the update formula for the order

(10)

optimistically hopes for quadratic convergence.

Once a coefficient vector is correct, the corresponding coefficient of the right hand side, which is obtained by evaluation, vanishes, and then the corresponding update should no longer be computed, because

(11)

This gives a criterion to stop the iterations.
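As an illustration only, the sketch below shows one plausible staggering schedule in which the number of coefficients roughly doubles per iteration, capped at the target order; the exact update formula (10) used in the code is not reproduced here.

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative staggering of the series order over the Newton iterations:
// start with only the leading coefficients and roughly double the number of
// coefficients in each iteration, never exceeding the target order d.
// This mirrors the idea behind formula (10), without claiming to reproduce it.
int main ( void )
{
    const int d = 64;      // target order of the power series
    int order = 0;         // first iteration: only the leading coefficients

    for(int k = 1; k <= 24; ++k)
    {
        std::printf("iteration %2d : work on coefficients 0..%d\n", k, order);
        if(order == d) break;
        order = std::min(2*(order + 1), d);   // optimistic doubling, capped at d
    }
    return 0;
}
```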
5 Accelerating Newton’s Method
While the blocked Householder QR is very suitable for GPU acceleration and teraflop performance is achieved already at relatively modest dimensions, it appears only at the very beginning of Newton’s method, as it is no longer needed once the QR decomposition is computed. The second part of the least squares solver, the back substitution, is needed in every stage, as are the convolutions to compute the right hand sides of the linear systems.
In [23], the Compute to Global Memory Access (CGMA) ratio is defined as the number of floating-point calculations performed by a kernel for each access to the global memory. This CGMA ratio corresponds to the more general notion of arithmetic intensity of a computation [46].
5.1 Arithmetic Intensity of Convolutions
In the computation of the arithmetic intensity of convolutions, or equivalently, the number of floating-point operations per double, it suffices to consider one monomial. For example, let $x_1 x_2 \cdots x_n$ be the monomial we evaluate and differentiate. Using the reverse mode of algorithmic differentiation, the computations are organized in three columns as follows:

$$\begin{array}{lll}
f_2 = x_1 \star x_2 & b_{n-1} = x_n \star x_{n-1} & c_2 = x_1 \star b_3 \\
f_3 = f_2 \star x_3 & b_{n-2} = b_{n-1} \star x_{n-2} & c_3 = f_2 \star b_4 \\
\quad \vdots & \quad \vdots & \quad \vdots \\
f_{n-1} = f_{n-2} \star x_{n-1} & b_2 = b_3 \star x_2 & c_{n-1} = f_{n-2} \star x_n \\
f_n = f_{n-1} \star x_n & &
\end{array} \quad (12)$$

where each $\star$ indicates a new multiplication. The value of the monomial is $f_n$, its derivative with respect to $x_1$ is $b_2$, its derivative with respect to $x_n$ is $f_{n-1}$, and its derivative with respect to $x_k$, for $1 < k < n$, is $c_k$. In the three columns we count respectively $n-1$, $n-2$, and $n-2$ multiplications, for a total of $3n-5$ multiplications, for $n$ inputs. If the inputs were doubles, then the arithmetic intensity would be

$$\frac{3n-5}{n}. \quad (13)$$
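A minimal sketch of this organization on plain doubles is given below; in the kernels of [42], every multiplication becomes a convolution of power series, and the function name is illustrative.

```cpp
#include <cstddef>
#include <vector>

// Value and all n partial derivatives of the monomial x[0]*x[1]*...*x[n-1],
// computed with 3n-5 multiplications (forward, backward, and cross products),
// as in the reverse mode of algorithmic differentiation.  Assumes n >= 2.
void product_and_gradient ( const std::vector<double>& x,
                            double& value, std::vector<double>& grad )
{
    const std::size_t n = x.size();
    std::vector<double> fwd(n), bck(n);

    fwd[0] = x[0];
    for(std::size_t k = 1; k < n; ++k)        // n-1 forward multiplications
        fwd[k] = fwd[k-1]*x[k];

    bck[n-1] = x[n-1];
    for(std::size_t k = n-1; k-- > 1; )       // n-2 backward multiplications
        bck[k] = bck[k+1]*x[k];

    value = fwd[n-1];
    grad.assign(n, 0.0);
    grad[0] = bck[1];                         // derivative w.r.t. x[0]
    grad[n-1] = fwd[n-2];                     // derivative w.r.t. x[n-1]
    for(std::size_t k = 1; k + 1 < n; ++k)    // n-2 cross multiplications
        grad[k] = fwd[k-1]*bck[k+1];
}
```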
Each input is a power series of order $d$, with $d+1$ coefficients. To avoid thread divergence, the coefficients of the second series in each product are padded with zeros. For example, for the coefficient with index $k$ of the product of two series with coefficients $a_i$ and $b_i$:

$$c_k = a_0 b_k + a_1 b_{k-1} + \cdots + a_k b_0 + a_{k+1} b_{-1} + \cdots + a_d b_{k-d}, \quad (14)$$

where coefficients with negative indices are zero. Ignoring the additions, we count $(d+1)^2$ multiplications per convolution, so we now have $(3n-5)(d+1)^2$ multiplications for $n(d+1)$ inputs. If the coefficients of the power series were doubles, then the arithmetic intensity would be

$$\frac{(3n-5)(d+1)^2}{n(d+1)} = \frac{(3n-5)(d+1)}{n}. \quad (15)$$
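A minimal CUDA sketch of one such padded convolution on plain double coefficients follows, with one thread per output coefficient; the actual kernels operate on multiple doubles split over several arrays of doubles.

```cpp
// One thread computes one coefficient of the product c = a*b of two power
// series truncated at degree deg.  The coefficients of b are padded with
// zeros for negative indices, so every thread runs the same number of
// multiplications and there is no thread divergence.
// Launch as: padded_convolution<<<1, deg+1>>>(a, b, c, deg);
__global__ void padded_convolution
 ( const double *a, const double *b, double *c, int deg )
{
    const int k = threadIdx.x;        // index of the output coefficient

    if(k <= deg)
    {
        double sum = 0.0;
        for(int i = 0; i <= deg; ++i)
        {
            const int j = k - i;
            const double bj = (j >= 0) ? b[j] : 0.0;   // zero padding
            sum = sum + a[i]*bj;
        }
        c[k] = sum;
    }
}
```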
The coefficients of the power series are multiple doubles. For double double, quad double, and octo double precision, the numbers of doubles in the inputs are respectively $2n(d+1)$, $4n(d+1)$, and $8n(d+1)$. Working with complex coefficients doubles the size of the input. Doubling the precision doubles the size of the input, but increases the arithmetical cost significantly, as illustrated by Table 2 (with data from [41]).
precision | | | | total
---|---|---|---|---
double double | 5 | 9 | 9 | 23
quad double | 99 | 164 | 73 | 336
octo double | 529 | 954 | 259 | 1742
Then the numbers of floating-point operations per double for the evaluation and differentiation of a product of power series of order $d$ are respectively, for double doubles, quad doubles, and octo doubles:

$$\frac{23}{2} \cdot \frac{(3n-5)(d+1)}{n}, \qquad \frac{336}{4} \cdot \frac{(3n-5)(d+1)}{n}, \qquad \frac{1742}{8} \cdot \frac{(3n-5)(d+1)}{n}, \quad (16)$$

where the corresponding multiplication factors $23/2$, $336/4$, and $1742/8$ evaluate respectively to 11.5, 84, and 217.75.
With each doubling of the precision, the arithmetic intensity increases: by a factor of 11.5 from double to double double, by a factor of $84/11.5 \approx 7.3$ from double double to quad double, and by a factor of $217.75/84 \approx 2.6$ from quad double to octo double. As the evaluation and differentiation is needed at every stage of Newton’s method, the high numbers of floating-point operations per double are promising indicators for the success of GPU acceleration.
5.2 Complex Vectorized Arithmetic
The analysis in the previous section concerned real numbers. On complex data, the arithmetic intensity increases and, moreover, there is more parallelism.
Writing the real and imaginary parts of $a$ as $a_{\rm re}$ and $a_{\rm im}$, respectively, and applying the same notation to $b$, the product $c$ of $a$ with $b$ is then executed as

$$c_{\rm re} = a_{\rm re} \star b_{\rm re} - a_{\rm im} \star b_{\rm im}, \qquad c_{\rm im} = a_{\rm im} \star b_{\rm re} + a_{\rm re} \star b_{\rm im}. \quad (17)$$

All $\star$’s in (17) can be executed simultaneously first, with the $+$ and $-$ executed next. The formulas in (17) correspond to what is called the 4M method in [24].
The vectorization of the complex arithmetic introduces more parallelism. Alternatively, the computation can be arranged such that one block of threads collaborates on the real and imaginary parts of one product. This alternative is beneficial for vectors shorter than the typical warp size of 32 threads.
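A minimal CUDA sketch of the 4M arrangement on vectors stored as separate real and imaginary arrays (names are illustrative):

```cpp
// 4M complex multiplication on vectors of length n, with the real and
// imaginary parts stored in separate arrays: the four real multiplications
// of (17) are independent of each other and are followed by one subtraction
// and one addition per entry.  One thread per entry.
__global__ void complex_mul_4m
 ( const double *are, const double *aim, const double *bre, const double *bim,
   double *cre, double *cim, int n )
{
    const int i = blockIdx.x*blockDim.x + threadIdx.x;

    if(i < n)
    {
        const double rere = are[i]*bre[i];   // four independent products
        const double imim = aim[i]*bim[i];
        const double reim = are[i]*bim[i];
        const double imre = aim[i]*bre[i];
        cre[i] = rere - imim;                // combine afterwards
        cim[i] = imre + reim;
    }
}
```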
5.3 Accelerated Least Squares
With the QR decomposition $A = QR$, solving $A\mathbf{x} = \mathbf{b}$ in the least squares sense is reduced to $R\mathbf{x} = Q^H \mathbf{b}$: the multiplication of $Q^H$ with $\mathbf{b}$, followed by a back substitution.
Assuming the leading coefficient vector of the series has an accuracy of at least half the working precision, the QR decomposition happens only once, in the first stage of Newton’s method, at a cost of $O(n^3)$. While every stage involves the solution of an upper triangular system, the cost of computing $Q^H \mathbf{b}$ and of the back substitution is both $O(n^2)$. Even if the acceleration of the QR decomposition works better than the acceleration of the $O(n^2)$ stages, the factor $n$ in the cost overhead of QR makes that the QR decomposition still dominates all $Q^H \mathbf{b}$ computations and all back substitutions, as $n$ equals 1024 and the number of stages is capped at 24.
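For illustration, a minimal host-side sketch of the two $O(n^2)$ stages executed in every stage of Newton’s method, given a precomputed QR decomposition; this is a plain, square, double precision sketch, not the blocked multiple double kernels of [43].

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cnum = std::complex<double>;
using cvec = std::vector<cnum>;
using cmat = std::vector<cvec>;   // dense n-by-n, row major

// The two O(n^2) stages of each least squares solve, after the O(n^3) QR
// decomposition has been computed once: multiply Q^H with b, then apply
// back substitution on the upper triangular R.
cvec solve_with_qr ( const cmat& Q, const cmat& R, const cvec& b )
{
    const std::size_t n = b.size();
    cvec y(n, cnum(0.0)), x(n, cnum(0.0));

    for(std::size_t i = 0; i < n; ++i)        // y = Q^H * b
        for(std::size_t j = 0; j < n; ++j)
            y[i] += std::conj(Q[j][i])*b[j];

    for(std::size_t i = n; i-- > 0; )         // back substitution R*x = y
    {
        cnum s = y[i];
        for(std::size_t j = i+1; j < n; ++j) s -= R[i][j]*x[j];
        x[i] = s/R[i][i];
    }
    return x;
}
```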
5.4 Accelerated Updates and Residuals
The updates of the right hand side vectors involve many different matrices. For example, following (4), the updates to $\mathbf{b}_3$ happen as $\mathbf{b}_3 := \mathbf{b}_3 - A_3 \mathbf{x}_0$, $\mathbf{b}_3 := \mathbf{b}_3 - A_2 \mathbf{x}_1$, $\mathbf{b}_3 := \mathbf{b}_3 - A_1 \mathbf{x}_2$, each time with different matrices $A_3$, $A_2$, and $A_1$, which cannot all remain in the main memory of the device. Even as the cost of these computations is $O(n^2)$, we may expect the updates to occupy a significant portion of the total execution times.
The same arguments apply to the computation of the residuals, when measuring the accuracy of the computed updates. However, one could significantly reduce the cost by selecting only one or a few random equations instead of computing the residuals for all equations.
For comparison with the arithmetic intensities of the convolutions, consider the matrix-vector product for an $n$-by-$n$ matrix. Executed on doubles, $n^2$ multiplications are performed on $n^2 + n$ doubles. The ratio improves on power series of order $d$, as convolutions with padding take $(d+1)^2$ multiplications, while the size of the data is multiplied by $d+1$. Restricting to multiplications, the arithmetic intensity then is

$$\frac{n^2 (d+1)^2}{(n^2 + n)(d+1)} = \frac{n(d+1)}{n+1}. \quad (18)$$

Compared to the arithmetic intensity of convolutions, the leading terms of the numerator of (15) divided by the denominator evaluate to $3(d+1)$. So the arithmetic intensity of convolutions is three times larger than that of matrix-vector products.
6 Staging Multiple Doubles
This section details the implementation of the multiple double arithmetic.
6.1 Inlining of Arithmetical Kernels
The obvious storage of a multiple double is as an array of many doubles, stored at consecutive locations. But this storage is unfavorable for coalesced memory accesses when working with vectors of multiple doubles. Therefore, all arithmetical operations on multiple doubles are defined by functions which take as input arguments not arrays of doubles, but sequences of doubles. For example, the function to add two double doubles has prototype
void ddf_add ( double a_hi, double a_lo, double b_hi, double b_lo, double *c_hi, double *c_lo );
where the high doubles of the numbers $a$, $b$, and $c$ are respectively given as a_hi, b_hi, c_hi and their low doubles as a_lo, b_lo, c_lo, for the addition $c = a + b$. The GPU counterpart to the ddf_add above has the same arguments but starts with
__device__ __forceinline__ void ddg_add
forcing the inlining of the kernels. All kernels (in the .cu files) are compiled with nvcc -O3.
Every supported precision level has dedicated functions, once defined as regular C functions (for correctness comparisons) and once defined as kernel functions to be executed on the device.
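For illustration, a textbook variant of such an inlined addition in the same calling style is sketched below, based on the classical two-sum; it is not necessarily the exact implementation in the repository.

```cpp
// Error-free transformation of the sum of two doubles (Knuth's two-sum).
__device__ __forceinline__ void two_sum
 ( double a, double b, double *s, double *e )
{
    *s = a + b;
    const double bb = *s - a;
    *e = (a - (*s - bb)) + (b - bb);
}

// Double double addition in the same calling style as ddg_add: high and low
// parts are passed as separate doubles.  A textbook sketch, given only to
// illustrate the calling convention, not copied from the repository.
__device__ __forceinline__ void ddg_add_sketch
 ( double a_hi, double a_lo, double b_hi, double b_lo,
   double *c_hi, double *c_lo )
{
    double s, e;

    two_sum(a_hi, b_hi, &s, &e);   // exact sum of the high parts
    e = e + a_lo + b_lo;           // accumulate the low parts into the error
    two_sum(s, e, c_hi, c_lo);     // renormalize into (c_hi, c_lo)
}
```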
6.2 Shared Memory and Registers
A vector of double doubles is stored in the main memory of the device as two vectors, one vector with the high doubles and another vector with the low doubles. All threads in one block load the high and low doubles into vectors in memory shared by all threads in the block. As adjacent threads read adjacent memory locations, this staging of the multiple doubles benefits memory coalescing.
The inlined arithmetical kernels are applied to numbers in registers. Before executing the arithmetical operations, the data is copied from shared memory into local variables, variables that are local to each thread in a block. The additional instruction cycles allow the latencies of the data transfers to be hidden.
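A minimal CUDA sketch of this staging pattern is shown below; it is illustrative only, as the real kernels fuse the staging with the arithmetic of the previous sections.

```cpp
// Coalesced staging of a vector of double doubles, stored as two arrays of
// doubles, into shared memory, followed by work on registers.  Launch with
// 2*blockDim.x*sizeof(double) bytes of dynamic shared memory.
__global__ void staged_copy
 ( const double *v_hi, const double *v_lo, double *w_hi, double *w_lo )
{
    extern __shared__ double shared[];      // dynamic shared memory
    double *s_hi = shared;                  // high doubles of the block
    double *s_lo = shared + blockDim.x;     // low doubles of the block
    const int i = blockIdx.x*blockDim.x + threadIdx.x;

    s_hi[threadIdx.x] = v_hi[i];            // adjacent threads read
    s_lo[threadIdx.x] = v_lo[i];            // adjacent memory locations
    __syncthreads();

    double x_hi = s_hi[threadIdx.x];        // copy into registers local
    double x_lo = s_lo[threadIdx.x];        // to each thread
    // ... inlined multiple double arithmetic would operate on x_hi, x_lo ...

    w_hi[i] = x_hi;                         // coalesced write back
    w_lo[i] = x_lo;
}
```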
7 Computational Results
The computational experiments attempt to answer the following three questions. Classifying the kernels into three categories (convolutions, least squares, updates and residuals), which type of kernel occupies the largest portion of the overall execution time? For which order of the series do we reach teraflop performance? What happens to the wall clock time when the precision is doubled?
The baseline for the performance comparisons is the relative theoretical performance of the GPUs. For example, if one GPU is theoretically twice as fast as another, then we may expect that computations on the faster GPU end in half the time of the other.
7.1 Graphics Processing Units
The code was developed for the “Volta” V100 NVIDIA GPU, and tested on the “Pascal” P100 and on the RTX 2080 and RTX 4080 NVIDIA GPUs. The “Ampere” A100 was added most recently. Table 3 lists the main characteristics of the GPUs.
GPU | CUDA | #MP | #cores | GHz | host CPU | RAM | GHz
---|---|---|---|---|---|---|---
Pascal P100 | 6.0 | 56 | 3584 | 1.33 | Intel E5-2699 | 256 GB | 2.20
Volta V100 | 7.0 | 80 | 5120 | 1.91 | Intel W2123 | 32 GB | 3.60
Ampere A100 | 8.0 | 108 | 6912 | 1.41 | Intel 5318Y | 256 GB | 2.10
RTX 2080 | 7.5 | 46 | 2944 | 1.10 | Intel i9-9880H | 32 GB | 2.30
RTX 4080 | 8.9 | 58 | 7424 | 2.10 | Intel i9-13900HX | 16 GB | 2.20
The double precision peak performance of the P100 is 4.7 TFLOPS. At 7.8 TFLOPS, the V100 is 1.66 times faster than the P100. The theoretical peak performance of the A100 is 8.7 TFLOPS. To evaluate the algorithms, compare the ratios of the wall clock times on the P100 over V100 with the factor 1.66. For every kernel, the number of arithmetical operations is accumulated. The total number of double precision operations is computed using the cost overhead multipliers.
The double precision peak performance of the RTX 2080 Max-Q design is 201.5 GFLOPS, compared with 761.5 GFLOPS for the more recent RTX 4080. Both the RTX 2080 and the RTX 4080 are housed in laptop computers.
The operating system of the host of the P100 and V100 is CentOS 7. For the A100, the host runs Rocky Linux. The code on the host is compiled with g++ 4.8.5 (Red Hat), with optimization flag '-O3', using version 8.0 of the CUDA Software Development Kit. The RTX 2080 and RTX 4080 are housed in laptops running Windows 11. The code on the host of the RTX 2080 is compiled with the 2019 Community Visual Studio compiler, using CUDA 10.2. On the host of the RTX 4080, the compiler is the 2022 Community Visual Studio version, using CUDA 12.1. On Windows, the optimization flag for the compiler is '-O2'.
7.2 Proportions of Kernel Times
Table 4 lists six different types of kernels.
name | legend
---|---
convolutions | evaluation and differentiation
qr | Householder QR
qhb | multiplication of $Q^H$ with the right hand side
bs | back substitutions
updates | updates of the right hand side vectors
residuals | residual computations
Visualizing the data in Table 5, Figure 1 shows the percentages of the kernels on one column of monomials defined by a triangular exponent matrix of dimension 1024 to compute series of order 64 in octo double precision, done on the V100.
kernel | one column (seconds) | two columns (seconds)
---|---|---
convolution | 121.386 | 244.535
Householder QR | 24.451 | 24.123
kernel for $Q^H \mathbf{b}$ | 5.849 | 6.139
back substitution | 17.053 | 17.884
updates | 111.474 | 124.080
residuals | 125.122 | 137.963
total kernel times | 405.334 | 554.723
wall clock time | 1129.794 | 1808.480
The largest portion of the time goes to the residual computations, for all equations in the system. The residuals are important to measure the convergence and must be computed in multiple double precision. One optimization could be to select at random one or a couple of equations and compute the residuals for those selected equations instead of for all equations. Figure 2 shows that for a 2-column monomial system, the time spent on convolutions dominates.
7.3 Performance of Convolutions
Figure 1 shows that the convolutions occupy a substantial part. For what orders of the series do we observe teraflop performance? Consider Table 6.
order | P100 (GFLOPS) | V100 (GFLOPS) | A100 (GFLOPS)
---|---|---|---
1 | 8.041 | 28.997 | 39.284
2 | 16.191 | 59.820 | 95.770
3 | 23.748 | 90.003 | 147.440
5 | 29.277 | 149.894 | 252.502
8 | 62.747 | 240.760 | 411.946
12 | 94.035 | 360.816 | 623.255
18 | 140.918 | 540.572 | 938.634
27 | 211.261 | 810.645 | 1412.459
41 | 351.994 | 1045.032 | 1601.007
62 | 535.136 | 1569.347 | 2472.086
64 | 554.654 | 1658.382 | 2568.016
Figure 3 visualizes the data in Table 6: the performance of the evaluation and differentiation on one column of monomials, defined by a triangular exponent matrix of ones, in octo double precision, for increasing orders of the series. Teraflop performance is observed after order 40 on the V100, whereas on the P100 only half a teraflop is reached at order 64. On the A100, teraflop performance is already observed at order 27.
In the implementation, in the convolution of two power series each thread is responsible for one coefficient of the result. Threads are launched in blocks of the size that match the number of coefficients. An implementation better suited for series of lower order would employ a finer granularity and have several threads collaborate to compute one coefficient of a convolution of two series.
As the V100 is theoretically 1.66 times faster than the P100, multiplying 554.654 (on the last row of Table 6) by 1.66 yields 920.726, which is less than the observed 1658.382 on the V100. Thus, for higher levels of precision (octo double), newer GPUs scale better than older ones. The observed 1.7 teraflop performance on the V100 is only 22% of the theoretical peak performance. The performance improves on series of higher order.
7.4 Doubling the Precisions
To investigate how much of the cost overhead can be compensated by the acceleration, consider the wall clock times and the elapsed times spent by all kernels when the precision is doubled.
times (seconds) | D | 2D | 4D | 8D
---|---|---|---|---
P100 kernel times | 10.4 | 44.5 | 204.4 | 1263.3
P100 wall clock | 418.5 | 695.3 | 969.9 | 3073.6
V100 kernel times | 6.2 | 22.6 | 146.4 | 405.3
V100 wall clock | 277.9 | 475.3 | 834.7 | 1129.8
Visualizing Table 7, Figure 4 shows the 2-logarithms of the times of 24 steps with Newton’s method on one column of monomials defined by a triangular exponent matrix of ones of dimension 1024, on the V100. Doubling the precision less than doubles the wall clock time, even though the time spent by all kernels increases by more than a factor of two. The reason is that the problems become more computationally intensive: the amount of data doubles, but the amount of arithmetical computations on average is multiplied by factors larger than two, as explained in section 5.1; see also Table 2.
7.5 Experiments on the RTX 2080 and 4080
times (seconds) | D | 2D | 4D | 8D
---|---|---|---|---
kernel times | 1.4 | 16.6 | 122.4 | 380.1
wall clock | 35.7 | 80.0 | 225.0 | 474.8
On the problem of applying the blocked Householder QR on real matrices of octo double numbers, the RTX 4080 reaches a performance of 413.2 GFLOPS on dimension 1,024 (8 blocks of size 128), and 719.5 GFLOPS (very close to the theoretical peak performance) is reached on dimension 2,048.
The last experiments concern the RTX 2080, on 16 steps with Newton’s method on one column of monomials defined by a triangular exponent matrix of ones of dimension 512. The results are summarized in Figure 5, visualizing the data in Table 8.
Table 9 compares the performance of the newer RTX 4080 versus the older RTX 2080, illustrating the scalability, on one column of 512 monomials, on series of order 64. Although the RTX 4080 is at least twice as powerful as the RTX 2080, the order 64 is unfavorable for the RTX 4080, which has 128 cores per multiprocessor. Relative to the P100 and V100, Figure 5 looks better than the corresponding Figure 4 because the RTX 2080 is already well occupied at a column of 512 monomials.
stage | RTX 2080 time | RTX 2080 gigaflops | RTX 4080 time | RTX 4080 gigaflops
---|---|---|---|---
QR | 28.5 | 100.5 | 18.3 | 156.5
Qb | 18.6 | 26.3 | 14.9 | 33.0
BS | 18.1 | 16.3 | 16.4 | 17.2
ED | 27.3 | 110.0 | 15.8 | 190.1
8 Conclusions
Although several optimizations in the code will further improve the performance, this implementation offers a promising first step towards a scalable nonlinear solver based on results from numerical analytic continuation.
Using decaying coefficients of power series expansions, octo double precision suffices for series of order 64. Teraflop performance of the evaluation and differentiation is already attained at order 40 on the V100. The convolutions to evaluate and differentiate at power series remain a significant portion of all computational work. For two columns of monomials which can encode general polynomial systems, the computational effort to evaluate and differentiate dominates.
Doubling precisions less than doubles the wall clock times because the computations are then compute bound and thus well suited for acceleration by graphics processing units. Extending the acceleration beyond octo double precision on GPUs is a future direction.
Acknowledgements. Preliminary versions of this study were presented at the 2023 Joint Mathematical Meetings, in the AMS Special session on Polynomial Systems, Homotopy Continuation and Applications (organized by Tim Duff and Maggie Regan); and in the minisymposium on Co-Design for Heterogeneous System Architectures (organized by Navamita Ray and Edgar Leon) at the SIAM CSE 2023 meeting. The author thanks the organizers of those meetings.
References
- [1] A. Abdelfattah, H. Anzt, E. G. Boman, E. Carson, T. Cojean, J. Dongarra, A. Fox, M. Gates, N. J. Higham, X. S. Li, J. Loe, P. Luszczek, S. Pranesh, S. Rajamanickam, T. Ribizel, B. F. Smith, K. Swirydowicz, S. Thomas, S. Tomov, Y. M. Tsai, and U. M. Yang. A survey of numerical linear algebra methods utilizing mixed-precision arithmetic. International Journal of High Performance Computing Applications, 35(4):344–369, 2021.
- [2] E. Agullo, C. Augonnet, J. Dongarra, and M. Faverge. QR factorization on a multicore node enhanced with multiple GPU accelerators. In 2011 IEEE International Parallel and Distributed Processing Symposium, pages 932–943. IEEE, 2011.
- [3] J. C. Alexander and J. A. Yorke. The homotopy continuation method: Numerically implementable topological procedures. Transactions of the American Mathematical Society, 242:271–284, 1978.
- [4] E. L. Allgower and K. Georg. Introduction to Numerical Continuation Methods, volume 45 of Classics in Applied Mathematics. SIAM, 2003.
- [5] M. Anderson, G. Ballard, J. Demmel, and K. Kreutzer. Communication-avoiding QR decomposition for GPUs. In 2011 IEEE International Parallel and Distributed Processing Symposium, pages 48–58. IEEE, 2011.
- [6] M. Baboulin, J. Dongarra, and S. Tomov. Some issues in dense linear algebra for multicore and special purpose architectures. Technical Report UT-CS-08-200, University of Tennessee, 2008.
- [7] G. A. Baker, Jr. and P. Graves-Morris. Padé Approximants. Cambridge University Press, 1996.
- [8] C. Bischof and C. F. Van Loan. The WY representation for products of Householder matrices. SIAM J. Sci. Stat. Comput., 8(1):s1–s13, 1987.
- [9] N. Bliss and J. Verschelde. The method of Gauss-Newton to compute power series solutions of polynomial homotopies. Linear Algebra and its Applications, 542:569–588, 2018.
- [10] C. Brezinski and M. Redivo Zaglia. Extrapolation Methods, volume 2 of Studies in Computational Mathematics. North-Holland, 1991.
- [11] A. Dronamraju, S. Li, Q. Li, Y. Li, D. Tylavsky, D. Shi, and Z. Wang. Implications of Stahl’s theorems to holomorphic embedding part II: numerical convergence. CSEE Journal of Power and Energy Systems, 7(4):773–784, 2021.
- [12] E. Fabry. Sur les points singuliers d’une fonction donnée par son développement en série et l’impossibilité du prolongement analytique dans des cas très généraux. Annales scientifiques de l’École Normale Supérieure, 13:367–399, 1896.
- [13] B. Fornberg. Numerical differentiation of analytic functions. ACM Transactions on Mathematical Software, 7(4):512–526, 1981.
- [14] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, second edition, 2008.
- [15] D. H. Heller. A survey of parallel algorithms in numerical linear algebra. SIAM Review, 20(4):740–777, 1978.
- [16] P. Henrici. An algorithm for analytic continuation. SIAM Journal on Numerical Analysis, 3(1):67–78, 1966.
- [17] Y. Hida, X. S. Li, and D. H. Bailey. Algorithms for quad-double precision floating point arithmetic. In 15th IEEE Symposium on Computer Arithmetic (Arith-15 2001), pages 155–162. IEEE Computer Society, 2001.
- [18] N. J. Higham and T. Mary. Mixed precision algorithms in numerical linear algebra. Acta Numerica, pages 347–414, 2022.
- [19] K. Isupov and V. Knyazkov. Multiple-precision matrix-vector multiplication on graphics processing units. Program Systems: Theory and Applications, 11(3):62–84, 2020.
- [20] M. Joldes, J.-M. Muller, V. Popescu, and W. Tucker. CAMPARY: CUDA multiple precision arithmetic library and applications. In Mathematical Software – ICMS 2016, the 5th International Conference on Mathematical Software, pages 232–240. Springer-Verlag, 2016.
- [21] C. T. Kelley. Newton’s method in mixed precision. SIAM Review, 64(1):191–211, 2022.
- [22] A. Kerr, D. Campbell, and M. Richards. QR decomposition on GPUs. In D. Kaeli and M. Leeser, editors, Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units (GPGPU’09), pages 71–78. ACM, 2009.
- [23] D. B. Kirk and W. W. Hwu. Programming Massively Parallel Processors. A Hands-on Approach. Morgan Kaufmann, 2010.
- [24] T. Kouya. Acceleration of complex matrix multiplication using arbitrary precision floating-point arithmetic. arXiv:2307.06072v1 [math.NA] 12 Jul 2023.
- [25] J. Kurzak, R. Nath, P. Du, and J. Dongarra. An implementation of the tile QR factorization for a GPU and multiple GPUs. In Applied Parallel and Scientific Computing, 10th International Conference, PARA 2010, volume 7134 of Lecture Notes in Computer Science, pages 248–257. Springer-Verlag, 2012.
- [26] Advanpix LLC. Multiprecision computing toolbox for MATLAB. www.advanpix.com.
- [27] M. Lu, B. He, and Q. Luo. Supporting extended precision on graphics processors. In Proceedings of the Sixth International Workshop on Data Management on New Hardware (DaMoN 2010), pages 19–26, 2010.
- [28] M. Nakata. MPLAPACK version 2.0.1 user manual. arXiv:2109.13406v2 [cs.MS] 12 Sep 2022.
- [29] A. Morgan. Solving Polynomial Systems Using Continuation for Engineering and Scientific Problems, volume 57 of Classics in Applied Mathematics. SIAM, 2009.
- [30] J.-M. Muller, N. Brunie, F. de Dinechin, C.-P. Jeannerod, M. Joldes, V. Lefèvre, G. Melquiond, N. Revol, and S. Torres. Handbook of Floating-Point Arithmetic. Springer-Verlag, second edition, 2018.
- [31] W. Nasri and Z. Mahjoub. Optimal parallelization of a recursive algorithm for triangular matrix inversion on MIMD computers. Parallel Computing, 27:1767–1782, 2001.
- [32] A. Sidi. Practical Extrapolation Methods. Theory and Applications, volume 10 of Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2003.
- [33] S. Telen, M. Van Barel, and J. Verschelde. A robust numerical path tracking algorithm for polynomial homotopy continuation. SIAM Journal on Scientific Computing, 42(6):A3610–A3637, 2020.
- [34] S. Telen, M. Van Barel, and J. Verschelde. Robust numerical tracking of one path of a polynomial homotopy on parallel shared memory computers. In Proceedings of the 22nd International Workshop on Computer Algebra in Scientific Computing (CASC 2020), volume 12291 of Lecture Notes in Computer Science, pages 563–582. Springer-Verlag, 2020.
- [35] S. Tomov, J. Dongarra, and M. Baboulin. Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Computing, 36(5):232–240, 2010.
- [36] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra. Dense linear algebra solvers for multicore with GPU accelerators. In The 2010 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 1–8. IEEE, 2010.
- [37] L. N. Trefethen. Approximation Theory and Approximation Practice. SIAM, 2013.
- [38] L. N. Trefethen. Quantifying the ill-conditioning of analytic continuation. BIT Numerical Mathematics, 60(4):901–915, 2020.
- [39] T. Utsugiri and T. Kouya. Acceleration of multiple precision matrix multiplication using Ozaki scheme. arXiv:2301.09960v2 [math.NA] 25 Jan 2023.
- [40] J. Verschelde. Algorithm 795: PHCpack: A general-purpose solver for polynomial systems by homotopy continuation. ACM Trans. Math. Softw., 25(2):251–276, 1999.
- [41] J. Verschelde. Parallel software to offset the cost of higher precision. ACM SIGAda Ada Letters, 40(2):59–64, 2020.
- [42] J. Verschelde. Accelerated polynomial evaluation and differentiation at power series in multiple double precision. In The 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 740–749. IEEE, 2021.
- [43] J. Verschelde. Least squares on GPUs in multiple double precision. In The 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 828–837. IEEE, 2022.
- [44] J. Verschelde and K. Viswanathan. Locating the closest singularity in a polynomial homotopy. In Proceedings of the 24th International Workshop on Computer Algebra in Scientific Computing (CASC 2022), volume 13366 of Lecture Notes in Computer Science, pages 333–352. Springer-Verlag, 2022.
- [45] V. Volkov and J. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing. IEEE Press, 2008. Article No. 31.
- [46] S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.