
Extracting the Potential of Emerging Hardware Accelerators for Symmetric Eigenvalue Decomposition

Hansheng Wang, Lu Shi School of Computer Science and Engineering
University of Electronic Science and Technology of China
wanghansheng, [email protected]
Zhekai Duan Department of Computer Science
University of Edinburgh
[email protected]
Panruo Wu Department of Computer Science
University of Houston
[email protected]
 and  Liwei Guo, Shaoshuai Zhang School of Computer Science and Engineering
University of Electronic Science and Technology of China
lwg, [email protected]
(Date: July 2019)
Abstract.

Benefiting from the advancement of hardware accelerators such as GPUs, deep neural networks and scientific computing applications can achieve superior performance. Recently, the computing capacity of emerging hardware accelerators has increased rapidly, while memory bandwidth has not kept pace with this growth. The widening gap between compute and memory turns many conventional algorithms from compute-bound into memory-bound. Symmetric eigenvalue decomposition (EVD), a critical operation in scientific computing, deep learning training, and inference algorithms, exhibits suboptimal performance, achieving less than 3% hardware utilization on the H100 GPU. In this paper, we analyze the features of emerging hardware accelerators to identify the bottlenecks inherent in conventional EVD algorithms. To improve EVD performance, we propose several algorithmic optimizations that alleviate the memory-bound problem and better exploit the rich computing capacity and parallelism of emerging hardware accelerators. Experimentally, our method delivers significant speedups on tridiagonalization, the main workload that accounts for over 90% of the elapsed time of EVD: compared to the SOTA cuSOLVER tridiagonalization, it achieves up to 10.1x, 7.5x, and 2.3x improvements on H100, A100, and RTX 4090 GPUs, respectively. The end-to-end EVD solver is also up to 4.1x faster than cuSOLVER.

1. Introduction

Driven by the need to train deep neural networks, particularly large language models containing billions of parameters, hardware accelerators such as GPUs and NPUs have evolved rapidly. For instance, compared to Nvidia’s P100 GPU released in 2016, the H100 GPU achieves over 50x and 14x speedup in FP16 and FP64 precision, respectively. Leveraging the immense parallelism and computational capacity of these advanced hardware accelerators, various applications, including scientific computing (Tomov et al., 2011), can harness the power of GPUs to achieve performance gains that are orders of magnitude greater than those attainable with the latest generations of CPUs (Anderson et al., 1999).

However, the increase in memory bandwidth has not kept pace with the surge in computational capacity. For example, the P100 GPU offers 732 GB/s of memory bandwidth, while the H100 GPU provides 3430 GB/s. Despite the computational capacity growing by a factor of over 14x, the memory bandwidth has increased by less than 5x. This disparity presents significant challenges in designing hardware-accelerated algorithms, as some conventional algorithms may shift from being compute-bound to memory-bound (Figure 3 illustrates this issue). In this paper, we use symmetric eigenvalue decomposition (EVD), a crucial problem in scientific computing and machine learning, to thoroughly explore these challenges.

Symmetric EVD is a fundamental matrix computation in numerical linear algebra with applications spanning quantum chemistry (Rösch et al., 2005), quantum mechanics and physics (Grimes et al., 1987; Probert, 2011), and numerous machine learning and signal processing tasks (Shah et al., 2019; Zhang et al., 2020b; Gupta et al., 2018). Computing EVD typically involves two steps: 1) tridiagonalization, which converts a given symmetric matrix A to a tridiagonal matrix T (O(n^3) complexity); and 2) an iterative method such as the QR algorithm (Watkins, 1982) to compute the eigenvalues (O(n^2) complexity). However, the state-of-the-art cuSOLVER (https://docs.nvidia.com/cuda/cusolver/) implementation of tridiagonalization (cuSolverDnSytrd) on GPUs achieves less than 3% of the peak performance on the H100 GPU (1.8 TFLOPs out of 67 TFLOPs). Even the hybrid MAGMA (Tomov et al., 2011) implementation, which employs a 2-stage tridiagonalization process (band reduction and bulge chasing) (Haidar et al., 2011), fails to utilize more than 7% of the computing capacity.

These observations prompt two critical questions: what are the reasons behind this suboptimal performance, and how can it be improved? In this paper, we address both questions by proposing an optimized EVD solver designed for emerging hardware accelerators. Our experimental results demonstrate significant performance improvements over existing solutions, providing valuable insights into the efficient utilization of emerging hardware accelerators for matrix computations.

The contributions of this paper are:

  • We analyze the features and performance behaviours of emerging hardware accelerators, and we use the SOTA EVD solver as an example to reveal the bottlenecks and difficulties of designing algorithms on this hardware.

  • We identify that, on emerging hardware accelerators, the SOTA band reduction process in tridiagonalization is memory-bound. To solve this problem, we propose a new band reduction algorithm that significantly alleviates this inefficiency.

  • We show that the consensus that the bulge chasing process in tridiagonalization is hard to accelerate on hardware accelerators does not hold. By leveraging the parallelism of hardware accelerators, we obtain a 7.9x speedup over conventional CPU-based implementations.

  • We implement and optimize the tridiagonalization, which is the main workload of the EVD solver. Our experiments show that, compared to cuSOLVER and the hybrid CPU-GPU MAGMA implementation, the proposed tridiagonalization achieves up to 10.1x and 6.0x speedups, and the end-to-end EVD speedup is 4.1x and 3.7x, respectively.

The rest of the paper is organized as follows: Section 2 explains the basic concepts of direct and two-stage tridiagonalization, as well as the bulge chasing process. Section 3 analyzes the hardware features and algorithmic bottlenecks, and presents our motivations. Section 4 illustrates the proposed methods. Section 5 presents the implementation details and Section 6 evaluates our implementations. Section 7 introduces related work regarding EVD on modern computer architectures. Finally, Section 8 draws conclusions and discusses future work.

2. Background

To better illustrate the proposed method, we briefly review the direct and 2-stage tridiagonalization processes. This provides a clearer understanding of the fundamental concepts and the improvements our method introduces.

2.1. Tridiagonalization

The tridiagonalization process is usually the pre-step to eigenvalue decomposition. The goal of tridiagonalization can be expressed as follows:

A = Q^{-1} × T × Q,

where Q is an orthogonal matrix and T is a tridiagonal matrix. The conventional tridiagonalization (Dongarra et al., 1989) uses Householder reflections to eliminate, in each column, all elements except the diagonal and subdiagonal entries. Unfortunately, on modern computer architectures, this tridiagonalization is not efficient, because it contains too many BLAS2 operations (such as the symmetric matrix-vector multiplication, symv), which are inefficient and take over 90% of the total elapsed time. To solve this issue, 2-stage tridiagonalization was proposed; it achieves better utilization of hardware resources.

2.2. 2-stage Tridiagonalization

The 2-stage tridiagonalization consists of two successive processes. The first is called successive band reduction (SBR), which reduces the matrix to a band form: A = Q^{-1} × B × Q, where Q is an orthogonal matrix and B is a band matrix with bandwidth b. The second stage is bulge chasing, which converts the band matrix B from the first stage to a tridiagonal matrix T; see Figure 1 for an intuitive illustration. The 2-stage tridiagonalization is mathematically equivalent to direct tridiagonalization, but it exposes more BLAS3 operations in SBR. Although bulge chasing contains many BLAS2 operations, its time complexity is much lower than that of SBR, so it does not take much time to perform. Therefore, the 2-stage approach has proven to be more efficient on multi-core architectures (Haidar et al., 2011; Gates et al., 2018).

2.2.1. Successive Band Reduction

Figure 1. The 2-stage tridiagonalization process

The SBR process iteratively performs QR factorization and trailing matrix updates using the symmetric rank-2k update (syr2k) routine, as illustrated in the left part of Figure 5. The QR factorization finds the orthogonal transformation that eliminates the off-band entries while forming the matrices Z and Y required by the trailing matrix update. Subsequently, the rest of the matrix is updated from both sides with the Z and Y matrices using syr2k: A = A - ZY^T - YZ^T.

Compared to direct tridiagonalization, SBR is more efficient because it converts the column-by-column Householder transformations into QR factorizations, which involve more BLAS3 operations. Although an additional bulge chasing process is required, the overall tridiagonalization performance is still improved because the time complexity of bulge chasing is only O(nb^2).
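As a concrete illustration of this step, the trailing-matrix update A = A - ZY^T - YZ^T maps directly onto a single BLAS syr2k call. The sketch below shows how one such update could be issued through cuBLAS for an m × m trailing block and nb-column factors Z and Y; the function and variable names are ours, and the code is illustrative rather than the implementation used in this paper.

#include <cublas_v2.h>

// Hedged sketch: one SBR trailing-matrix update A <- A - Z*Y^T - Y*Z^T
// issued through cuBLAS. dA, dZ, dY are device pointers in column-major
// layout; m is the trailing-matrix size and nb the panel width. Names are
// illustrative.
void trailing_update_syr2k(cublasHandle_t handle,
                           double* dA, int lda,        // m x m trailing block
                           const double* dZ, int ldz,  // m x nb factor Z
                           const double* dY, int ldy,  // m x nb factor Y
                           int m, int nb) {
  const double alpha = -1.0, beta = 1.0;
  // syr2k computes C <- alpha*(Z*Y^T + Y*Z^T) + beta*C and only touches the
  // selected triangle, which is all a symmetric solver needs to maintain.
  cublasDsyr2k(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
               m, nb, &alpha, dZ, ldz, dY, ldy, &beta, dA, lda);
}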

2.2.2. Bulge Chasing

Figure 2. One sweep of the bulge chasing process

The bulge chasing process reduces a band form matrix to a tridiagonal or bidiagonal matrix (Gant, [n.d.]). In essence, the steps of bulge chasing are quite similar to tridiagonalization, as both iteratively apply Householder transformations to the trailing matrix. The key difference is that the input matrix for bulge chasing is a band matrix, allowing the computations to exploit the band structure and reduce the number of mathematical operations.

The steps of one sweep of bulge chasing are shown in Figure 2. The process iteratively finds Householder vectors and applies the Householder transformations to chase the bulge until the off-diagonal elements of one column are eliminated. In Figure 2, the orange columns denote the search for Householder vectors to eliminate elements in the current column. Once the Householder vector v is formed, we use H(v) = I - 2vv^T/(v^Tv) to update the green blocks G from both the left and right sides, i.e., G = H(v)^{-1} G H(v). The pink blocks P are updated from the left side only, i.e., P = H(v)^{-1} P, thereby creating a bulge denoted by the blue blocks. To chase down the bulge, we find the Householder vectors of the first column of the bulge and repeat the aforementioned steps until the bulge is swept down to the last column.

The above steps describe one sweep of the bulge chasing process. To fully reduce the band matrix to a tridiagonal matrix, n-2 sweeps are required if the given matrix is of size n × n. The bulge chasing remains memory-bound because b is typically much smaller than n. Consequently, even in recent research from 2018, Gates et al. (Gates et al., 2018) assert that "as this stage (bulge chasing) has limited parallelism, is close to memory bandwidth limited, and is already optimized for the CPU caches, it would not benefit much, if any, from an accelerator-based implementation" (Section 4.2). Therefore, all bulge chasing implementations are deployed on CPUs, as seen in the sb2st routine of LAPACK (Anderson et al., 1999), PLASMA (Dongarra et al., 2019), and MAGMA (Tomov et al., 2011).
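For intuition about the cost of each transformation above, applying H(v) = I - 2vv^T/(v^Tv) from the left reduces to one dot product and one rank-1 update per touched column. The following serial sketch (our own illustration, unrelated to the GPU kernels introduced later) spells out this arithmetic.

// Minimal sketch of the left-side Householder update P <- H(v) * P with
// H(v) = I - 2 v v^T / (v^T v). P is m x ncols, column-major with leading
// dimension ldp; purely illustrative serial code, not a GPU kernel.
void apply_householder_left(const double* v, int m,
                            double* P, int ldp, int ncols) {
  double vtv = 0.0;
  for (int i = 0; i < m; ++i) vtv += v[i] * v[i];
  if (vtv == 0.0) return;               // H(v) = I for a zero vector
  for (int j = 0; j < ncols; ++j) {
    double* p = P + (size_t)j * ldp;
    double dot = 0.0;
    for (int i = 0; i < m; ++i) dot += v[i] * p[i];   // v^T * P(:,j)
    double coef = 2.0 * dot / vtv;
    for (int i = 0; i < m; ++i) p[i] -= coef * v[i];  // rank-1 update
  }
}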

2.3. Iterative Methods

After obtaining the tridiagonal matrix, iterative methods are used to compute the eigenvalues. The most popular iterative methods for computing eigenvalues and eigenvectors are the QR algorithm (Watkins, 1982) and divide and conquer (D&C) (Tisseur and Dongarra, 1999), both of which are included in numerical linear algebra libraries such as cuSOLVER. The primary difference between these methods is their computational complexity: if only eigenvalues are needed, the QR algorithm requires O(n^2) operations, whereas D&C requires O(n^3) because it also generates eigenvectors during the iterations; in exchange, D&C offers better parallelism than the QR algorithm.

3. Motivation

In this section, we motivate this paper by examining the hardware features and studying the bottlenecks. By understanding both, we can propose targeted optimizations to improve the efficiency of EVD solvers, especially the tridiagonalization process.

3.1. The Features of Emerging Hardware Accelerators

The emerging hardware accelerators are designed to handle the computational intensity of DNNs, especially Transformer-based DNNs (Vaswani et al., 2017), which typically involve a substantial amount of matrix multiplications (GEMMs). For instance, Nvidia's latest GPUs feature Tensor Cores that excel at low- or mixed-precision GEMMs, offering nearly 1 PFLOPs of peak performance for FP16 GEMMs on the H100 GPU. Additionally, high-precision (FP64) GEMMs can also be executed on Tensor Cores, delivering performance comparable to FP32 GEMMs.

While the computing capacity of hardware continues to increase rapidly, the memory bandwidth is also improving, but at a slower pace. This discrepancy leads to a widening memory-compute gap.

Figure 3. The FP64 precision roofline model of 4 GPU generations

To better understand the difficulty of fully utilizing the performance of emerging hardware, we employ the roofline model (Williams et al., 2009) and present Figure 3 to illustrate the performance differences across several generations of Nvidia GPUs. As depicted, achieving peak performance on the latest GPUs, such as the H100 and A100, requires higher data intensity. Conventional numerical linear algebra algorithms, such as EVD solvers built on blocking algorithms, do not reach the requisite data intensity to fully exploit the computational power of these advanced GPUs, which means that compute-bound algorithms can turn memory-bound on new hardware. To understand the reasons behind this shift, we use the EVD solver as an example and examine its performance bottlenecks in the following section.
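As a back-of-the-envelope check of Figure 3, the ridge point of the roofline model is simply the peak FLOP rate divided by the memory bandwidth: with the figures quoted in this paper, the H100 needs roughly 67 TFLOPs / 3430 GB/s ≈ 20 FP64 operations per byte (about 156 per double) to become compute-bound, whereas the P100 needs only around 6 per byte. The tiny program below reproduces this arithmetic; the P100 FP64 peak is inferred from the roughly 14x H100/P100 ratio mentioned in Section 1 and is therefore approximate.

// Back-of-the-envelope ridge-point computation for the roofline model.
// Peak FP64 rates and bandwidths are the figures quoted in this paper;
// the P100 FP64 peak is inferred from the ~14x H100/P100 ratio, so it is
// approximate.
#include <cstdio>

int main() {
  struct Gpu { const char* name; double tflops_fp64; double gbps; };
  Gpu gpus[] = {
    {"P100", 67.0 / 14.0, 732.0},   // approximate FP64 peak
    {"H100", 67.0,        3430.0},
  };
  for (const Gpu& g : gpus) {
    double flops_per_byte = (g.tflops_fp64 * 1e12) / (g.gbps * 1e9);
    printf("%s: ridge point ~= %.1f FP64 flops per byte (%.0f per double)\n",
           g.name, flops_per_byte, flops_per_byte * 8.0);
  }
  return 0;
}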

3.2. Performance Bottleneck

Extensive literature indicates that two-stage tridiagonalization outperforms direct tridiagonalization, even when forming eigenvectors (Haidar et al., 2011; Ltaief et al., 2012; Luszczek et al., 2011). Thus, we will not delve deeply into the comparative analysis of these two approaches, but rather, we will focus on the bottlenecks of the state-of-the-art 2-stage tridiagonalization algorithm.

3.2.1. Band Reduction Bottleneck

As mentioned before, the first stage of 2-stage tridiagonalization is SBR, which reduces a symmetric matrix to band form. This involves designating a bandwidth b, followed by QR factorization on a tall and skinny matrix of size n × b, and syr2k operations of size n × n × b. Despite increasing the proportion of BLAS3 operations compared to direct tridiagonalization, this method suffers from a critical performance limitation: the bandwidth b is strictly equal to the block size nb. This necessitates a delicate balance between the first and second stages of tridiagonalization, as a larger b results in slower bulge chasing, while a smaller b leads to slower band reduction. Currently, MAGMA defaults to a bandwidth of 128 or 256 (although we find that b = 64 gives the best performance on the H100 and A100 GPUs) to optimize performance; however, the syr2k operation cannot achieve peak hardware performance due to the tall and skinny matrix shape. Connected to the aforementioned roofline model (Figure 3), the current algorithmic design of SBR cannot provide enough data intensity and results in a memory-bound implementation. To fully utilize hardware accelerators, it is necessary to figure out how to provide larger and more square syr2k operations.

Quantitatively, Table 1 presents the state-of-the-art cuBLAS syr2k performance (also used by MAGMA) for various n and k on an H100 GPU. The data reveal that only larger values of k (corresponding to the bandwidth b) deliver satisfactory performance. However, to balance the performance between SBR and bulge chasing, the bandwidth b is typically set to 128 or 256, as larger b significantly slows down bulge chasing. Consequently, the overall performance of SBR remains suboptimal and memory-bound, even with an increased proportion of BLAS3 operations.

k n=4096 n=8192 n=16384 n=32768
16 0.09 0.43 1.60 3.58
32 0.18 0.86 3.20 7.02
64 0.38 1.71 6.19 12.78
128 0.76 3.39 11.47 21.05
256 1.42 6.41 18.83 30.13
512 2.29 11.57 27.58 38.31
1024 5.77 18.91 35.23 42.86
2048 8.54 27.21 40.82 45.36
4096 13.72 34.59 43.65 45.54
Table 1. The performance of SYR2K on the H100 GPU with different input sizes (n and k) in TFLOPs
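For reference, throughput numbers of the kind reported in Table 1 can be obtained by timing cublasDsyr2k with CUDA events and converting the elapsed time to TFLOPs using the ~2kn^2 flop count of syr2k. The sketch below shows one way to do this; it assumes the device buffers are allocated and initialized elsewhere and omits error checking.

// Hedged sketch of measuring syr2k throughput: time cublasDsyr2k with CUDA
// events and convert to TFLOPs using the ~2*k*n^2 flop count. dA, dB, dC are
// column-major device buffers prepared by the caller.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

double time_dsyr2k(cublasHandle_t h, int n, int k,
                   const double* dA, const double* dB, double* dC) {
  const double alpha = 1.0, beta = 0.0;
  cudaEvent_t start, stop;
  cudaEventCreate(&start); cudaEventCreate(&stop);
  cudaEventRecord(start);
  cublasDsyr2k(h, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
               n, k, &alpha, dA, n, dB, n, &beta, dC, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  double tflops = 2.0 * k * (double)n * n / (ms * 1e-3) / 1e12;
  printf("n=%d k=%d: %.2f TFLOPs\n", n, k, tflops);
  cudaEventDestroy(start); cudaEventDestroy(stop);
  return tflops;
}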

3.2.2. Bulge Chasing Bottleneck

The bulge chasing process, which comprises numerous BLAS1/2 operations, is memory-bound and difficult to parallelize even on older hardware accelerators. As a result, it does not exhibit optimal performance on GPUs (Haidar et al., 2011; Gates et al., 2018). Our experimental evaluation (Figure 4) demonstrates that for a matrix of size 65536 × 65536, the bulge chasing process consumes over one-fourth of the tridiagonalization elapsed time, representing a significant performance bottleneck.

Figure 4. The MAGMA 2-stage tridiagonalization performance given a matrix A ∈ R^{65536×65536} with different bandwidths b on the H100 GPU

3.3. Opportunities

In this section, we have discussed the hardware features of the latest GPUs: memory bandwidth scales much more slowly than compute capacity as GPUs evolve, resulting in a widening gap between the two. This gap necessitates higher data intensity to achieve peak performance, complicating algorithm design. Based on our analysis of the performance bottlenecks, conventional EVD algorithms, which were efficient on previous computer architectures, fail to cope with the larger memory-compute gap and the rich parallelism of emerging hardware, leading to memory-bound implementations.

As a result, these features and bottlenecks leave us opportunities to outperform the SOTA algorithms, which are memory-bound on emerging hardware accelerators. For the EVD solver, given a matrix of size 50K × 50K on the H100 GPU, the tridiagonalization process takes over 95% (78.5 seconds out of 80.4 seconds) of the cuSOLVER runtime and 60% (43.5 seconds out of 73.5 seconds) of the MAGMA runtime, respectively. Thus, the EVD solver's performance will improve significantly if we optimize tridiagonalization well by exploiting the computing capacity and parallelism of the new hardware. In the following sections, we propose a new tridiagonalization method that increases data intensity and addresses the memory-bound problem.

4. Methods

Given the aforementioned bottlenecks, it is clear that although 2-stage tridiagonalization exhibits better data locality than direct tridiagonalization, the speedup is only around 1.6x (Figure 10(a)). The primary challenge lies in designing efficient SBR and bulge chasing on emerging hardware accelerators.

4.1. Detached Band Reduction

Each iteration in the conventional band reduction algorithm involves a tall and skinny QR (TSQR) factorization and trailing matrix updates using the syr2k routine. There is extensive literature on performing fast TSQR on GPUs (Zhang et al., 2020a; Anderson et al., 2011; Ballard et al., 2014), which we can leverage directly. With the TSQR algorithm, the critical path is the trailing matrix update. As shown in Table 1, typical bandwidth selections of 128 or 256 achieve less than 30% of peak performance on the H100 GPU. Therefore, significantly improving syr2k’s performance would yield substantial gains in band reduction.

One straightforward method is to increase the bandwidth b. From Table 1, only a k value of 1024 or larger in the syr2k operation achieves satisfactory trailing matrix update performance. However, increasing the bandwidth indiscriminately degrades the bulge chasing performance, as seen in Figure 4, where even b = 128 significantly increases the bulge chasing time.

Figure 5. The differences between conventional and detached Band Reduction

To address this issue, we propose a new band reduction algorithm called Detached Band Reduction (DBR); the key idea is to decouple the bandwidth b from the block size nb (Figure 5). For a matrix A ∈ R^{n×n}, with bandwidth b and block size nb, the algorithm is illustrated in Algorithm 1. If b equals nb, DBR degrades to SBR.

Algorithm 1 Detached Band Reduction
Require: A symmetric matrix A ∈ R^{n×n}, bandwidth b and blocksize nb, where b ≤ nb
Ensure: A is reduced to a band form matrix with bandwidth b
1:  for i = 1 : nb : n do
2:     for j = i : b : nb do
3:        [W, Y, R] ← QR(A_panel)
4:        if j + b < nb then
5:           Z ← AW - (1/2) Y W^T A W
6:           A ← A - ZY^T - YZ^T   % Only the needed panel is updated
7:        end if
8:     end for
9:     Accumulate matrices Y and Z
10:    A ← A - ZY^T - YZ^T   % Trailing matrix update
11: end for
12: Compute the orthogonal matrix Q if needed

DBR's primary advantage over SBR is the decoupling of the block size nb from the bandwidth b, offering greater flexibility in selecting and adjusting these parameters. In SBR, the typical choice is b = nb = 64, whereas DBR allows configurations like b = 32 and nb = 1024. This algorithmic optimization provides two significant performance benefits: 1) enabling much larger k values in trailing matrix updates using the syr2k routine; 2) allowing a smaller bandwidth b to reduce the time complexity of the subsequent bulge chasing process. These advantages lead to substantial performance improvements in band reduction, even with smaller bandwidths.

4.2. Bulge Chasing Using Hardware Accelerators

It is widely accepted that the two-stage tridiagonalization is faster on modern architectures because it converts many BLAS2 operations into BLAS3 operations. However, previous research asserts that the bulge chasing process, being limited in parallelism and close to memory bandwidth limits, would not benefit significantly from an accelerator-based implementation, as it is already optimized for CPU caches (Section 4.2 in (Gates et al., 2018)). This explains why existing 2-stage tridiagonalization/bidiagonalization algorithms use CPUs for bulge chasing (Luszczek et al., 2011; Gates et al., 2018).

In this section, we refute these claims by demonstrating that the bulge chasing process, despite being memory-bound, still exhibits enough parallelism to benefit significantly from hardware accelerators. This parallelism is evident in two aspects: 1) inter-kernel parallelism (e.g., pipelining); 2) intra-kernel parallelism (e.g., Householder transformations).

Figure 6. An illustration of the pipelining between successive sweeps in the bulge chasing process

The inter-kernel parallelism is enabled because different sweeps lack data dependencies, allowing for pipelining. To illustrate this pipeline, we categorize operations into three types: Householder vector generation, matrix update from both sides, and matrix update from the left side (creating a bulge), denoted by different colors in Figure 6. We observe that when the i-th sweep completes three cycles, the (i+1)-th sweep can safely start without any data dependency. This allows one thread block to handle one sweep while other thread blocks wait and execute sequentially.

Algorithm 2 Parallel bulge chasing on GPU
Require: A band form matrix A ∈ R^{n×n} with bandwidth b << n
Ensure: A is reduced to a tridiagonal matrix
1:  volatile int gCom[n] = {0};
2:  % The sweeps can be executed in parallel
3:  for i = 1 : 1 : n do
4:     opCol = i   % i is the beginning column index of the i-th sweep
5:     while opCol < n do
6:        while (1 != i) && (opCol + 2*b > gCom[i-1]) do
7:           continue
8:        end while
9:        Compute Householder vector v_i
10:       H(v_i) ← I - 2 v_i v_i^T / (v_i^T v_i)
11:       A_g ← H(v_i)^{-1} A_g H(v_i)   % Update matrix from both sides (green block in Figure 2)
12:       A_p ← H(v_i)^{-1} A_p   % Update matrix from left side (pink block in Figure 2) and create a bulge (blue block in Figure 2)
13:       for j = i+1 : b : n do
14:          Compute Householder vector v_j
15:          H(v_j) ← I - 2 v_j v_j^T / (v_j^T v_j)
16:          A_g ← H(v_j)^{-1} A_g H(v_j)
17:          A_p ← H(v_j)^{-1} A_p
18:       end for
19:       opCol = opCol + b   % update the progress position of the i-th sweep
20:       gCom[i] = opCol   % update the sync variable
21:    end while
22: end for
23: Compute the orthogonal matrix Q if needed

The intra-kernel parallelism is primarily leveraged in Householder vector generation and Householder transformations. By launching multiple threads, the transformation can be computed in parallel to achieve satisfactory performance. See Algorithm 2 for details.

Despite being memory-bound rather than compute-bound, bulge chasing can significantly benefit from emerging hardware accelerators, such as the H100, A100, and RTX 4090 GPUs, by fully exploiting the parallelism between different sweeps and within CUDA kernels. The accelerator-based bulge chasing is nearly 8.0x faster than the CPU-based implementation (Figure 9).

5. Implementation and Optimization

While our proposed methods can accelerate the EVD on GPUs, effective implementation and technical optimization are crucial to avoid wasting GPU resources. In this section, we discuss our kernel-level design strategy, including optimizations for DBR, BLAS3 routines, and bulge chasing.

5.1. DBR Implementation

As shown in Algorithm 1, the first step is panel QR factorization. Extensive research exists on implementing TSQR on GPUs, so we will not delve into it here; see (Zhang et al., 2020a; Ballard et al., 2014) for high-performance TSQR implementations and the reconstruction of the WY representation.

The subsequent step is the panel update between b and nb (Line 6 in Algorithm 1). As with syr2k performance (Table 1), the shapes and sizes of GEMMs significantly affect performance. Unfortunately, the GEMMs in panel updates typically have one dimension equal to b, resulting in suboptimal performance. To address this, we use a recursive strategy to enlarge these dimensions.

For illustration, consider b = 32 and nb = 256 in DBR. Non-recursive panel updates generate 7 GEMMs with k = 32, while recursive panel updates produce 4 GEMMs with k = 32, 2 GEMMs with k = 64, and 1 GEMM with k = 128, as enumerated by the sketch below. This modification does not increase the total computation but converts GEMMs into more square shapes, increasing FLOPs efficiency.
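The recursive schedule follows a binary-tree pattern over the nb/b sub-panels: sibling sub-panels are updated against each other with k = b, merged pairs with k = 2b, and so on. The short program below (our own illustration of this counting argument, not part of the implementation) enumerates the resulting GEMM widths.

// Illustration of the recursive panel-update schedule: within a block of
// width nb split into nb/b sub-panels, updates are merged pairwise like a
// binary tree, so the GEMM inner dimension k doubles at every level.
// With b = 32 and nb = 256 this prints 4 GEMMs with k=32, 2 with k=64,
// and 1 with k=128, matching the example in the text.
#include <cstdio>

void print_recursive_update_widths(int b, int nb) {
  for (int k = b; 2 * k <= nb; k *= 2) {
    int count = nb / (2 * k);   // number of sibling pairs at this level
    printf("%d GEMM(s) with k = %d\n", count, k);
  }
}

int main() {
  print_recursive_update_widths(32, 256);
  return 0;
}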

5.2. Symmetric Rank-2k Update Optimization

Although algorithmic design can speed up trailing matrix updates using cuBLAS Dsyr2k routines, our experimental results show that cuBLAS routines still consume excessive time (over 90% of DBR time). Therefore, optimizing syr2k is crucial.

Testing cuBLAS Dsyr2k on an H100 GPU reveals that it achieves less than 60% (36 TFLOPs) of the H100 FP64 peak. Additionally, a severe bug reduces Dsyr2k performance to 4 TFLOPs when n ≥ 49152.

To improve performance, we propose a recursive formulation that maximizes the size of internal GEMMs:

(1) \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} = \begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix} \begin{bmatrix} B_{11}^{T} & B_{21}^{T} \end{bmatrix} + \begin{bmatrix} B_{11} \\ B_{21} \end{bmatrix} \begin{bmatrix} A_{11}^{T} & A_{21}^{T} \end{bmatrix}

This results in:

SYR2K: C_{11} = A_{11} B_{11}^{T} + B_{11} A_{11}^{T}
GEMM: C_{21} = C_{12}^{T} = A_{21} B_{11}^{T} + B_{21} A_{11}^{T}
SYR2K: C_{22} = A_{21} B_{21}^{T} + B_{21} A_{21}^{T}

The original syr2k decomposes into two sub-syr2k operations and one large GEMM. Recursive decomposition improves hardware utilization by creating larger GEMMs. However, pure recursion reduces parallelism. Therefore, we adapt the recursive idea into a parallel version, ensuring sub-syr2k operations run in parallel. Figure 7 illustrates the iterative approach for computing syr2k. In the first iteration, diagonal blocks are computed in parallel (using cuBLAS batched GEMM), followed by iterative computation of off-diagonal blocks until the largest GEMM is processed. See Algorithm 3 for details.

Figure 7. The steps of computing syr2k
Algorithm 3 Recursive-like Symmetric Rank-2k Update with blocksize nb

function [C] = syr2k(A, B)
  [n, k] = size(A);
  % BatchedGEMM((m,n,k), A, B, offset)
  % 1st iteration
  C1 += BatchedGEMM((nb,nb,k), A, B', nb*(1+lda));
  C1 += BatchedGEMM((nb,nb,k), B, A', nb*(1+lda));
  i = 1;
  while (n/nb/i/2 >= 1)
    % 2nd to n-th iteration
    Ci += BatchedGEMM((i*nb,i*nb,k), A+i*nb, B', 2*i*nb*(1+lda));
    Ci += BatchedGEMM((i*nb,i*nb,k), B+i*nb, A', 2*i*nb*(1+lda));
    i = i*2;
  end
  C = concat(C1, C2, ..., Cn);
end
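Under the blocking of Algorithm 3, the first iteration computes all nb × nb diagonal blocks of C, which maps naturally onto strided batched GEMM; later iterations become single, increasingly square GEMMs for the off-diagonal blocks. The sketch below expresses only that first iteration with cublasDgemmStridedBatched, assuming column-major storage and n divisible by nb; it is a simplified illustration, not the full routine.

// Hedged sketch of the first iteration of Algorithm 3: all nb x nb diagonal
// blocks of C += A*B^T + B*A^T are computed by one strided batched GEMM call
// plus its symmetric counterpart. Assumes column-major storage and n
// divisible by nb; purely illustrative.
#include <cublas_v2.h>

void syr2k_diag_blocks(cublasHandle_t h, int n, int k, int nb,
                       const double* dA, const double* dB, double* dC,
                       int lda, int ldb, int ldc) {
  const double one = 1.0;
  const long long strideAB = nb;                        // next row block of A/B
  const long long strideC  = (long long)nb * (ldc + 1); // next diagonal block
  const int batch = n / nb;
  // C_ii += A_i * B_i^T for all diagonal blocks i
  cublasDgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_T, nb, nb, k,
                            &one, dA, lda, strideAB, dB, ldb, strideAB,
                            &one, dC, ldc, strideC, batch);
  // C_ii += B_i * A_i^T (symmetric counterpart)
  cublasDgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_T, nb, nb, k,
                            &one, dB, ldb, strideAB, dA, lda, strideAB,
                            &one, dC, ldc, strideC, batch);
}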

The speedup over the cuBLAS Dsyr2k routine is shown in Figure 8. We compared square and tall-and-skinny SYR2K, and the experiments reveal that we outperform cuBLAS across various sizes and shapes.

(a) The SYR2K performance comparison between the proposed method and cuBLAS; the inputs A and B are square with size n × n
(b) The SYR2K performance comparison between the proposed method and cuBLAS with tall and skinny inputs
Figure 8. The performance comparison between the proposed SYR2K and cuBLAS SYR2K with different shapes and sizes of input matrices on the H100 GPU

5.3. Bulge Chasing Kernel

From our previous analysis and Algorithm 2, the design strategy for bulge chasing is straightforward: each kernel handles one sweep.

Within the kernel, multiple threads perform the Householder transformations from both the left and right sides, with the number of threads determined by the bandwidth b. Given that bulge chasing is memory-bound, it is crucial to hide the data movement between shared and global memory. We achieve this by maintaining two shared memory blocks: one for online computations and another for data movement. Once the computation block completes its tasks, its role switches to data movement, while the other block takes over computations. This simple pipelining allows some data movements to be overlapped, as sketched below.
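A minimal sketch of this double-buffering idea is given below, assuming the band is processed in fixed-size tiles; process_tile and TILE are illustrative placeholders, and the loads are plain copies where the real kernel would use asynchronous copies and the actual bulge-chasing arithmetic.

// Minimal double-buffering skeleton for a memory-bound kernel: while one
// shared-memory tile is being processed, the next tile is staged into the
// other buffer. TILE and process_tile() stand in for the bulge-chasing work.
#define TILE 256

__device__ void process_tile(double* tile) {
  (void)tile;   // placeholder for the Householder updates
}

__global__ void double_buffered_sweep(const double* __restrict__ band,
                                      int num_tiles) {
  __shared__ double buf[2][TILE];
  int cur = 0;
  // preload the first tile
  for (int t = threadIdx.x; t < TILE; t += blockDim.x)
    buf[cur][t] = band[t];
  __syncthreads();
  for (int i = 0; i < num_tiles; ++i) {
    int nxt = cur ^ 1;
    // stage the next tile while the current one is processed
    if (i + 1 < num_tiles)
      for (int t = threadIdx.x; t < TILE; t += blockDim.x)
        buf[nxt][t] = band[(size_t)(i + 1) * TILE + t];
    process_tile(buf[cur]);
    __syncthreads();   // both the compute and the staging must be finished
    cur = nxt;
  }
}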

A critical issue between different sweeps (kernels) is synchronization due to data dependencies between successive sweeps. Specifically, the (i+1)-th sweep can only commence after the third bulge elimination in the i-th sweep. Nvidia's cooperative groups facilitate synchronization between thread blocks. However, for this to function correctly, the number of launched blocks must not exceed the device limit, or the cooperative launch returns an error indicating "too many blocks in cooperative launch cudaLaunchCooperativeKernel".

In our implementation, to maximize parallelism, the bulge chasing process launches a large number of thread blocks, particularly for large matrices. Therefore, using a cooperative group is impractical in this scenario. To resolve the synchronization issue, we manually set up locks between adjacent sweeps to avoid conflicts. The i-th sweep shares a lock flag with the (i+1)-th sweep to indicate the current column being processed. After the (i+1)-th sweep eliminates a bulge, it waits until there is no conflict as indicated by the lock. Similarly, the (i+2)-th sweep monitors the lock flag shared with the (i+1)-th sweep.
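A hedged sketch of this per-sweep locking (corresponding to the gCom progress array in Algorithm 2) is shown below; memory ordering is simplified, and a production kernel would pair the volatile accesses with appropriate fences or atomics.

// Sketch of the manual inter-sweep synchronization from Algorithm 2:
// sweep i spins until sweep i-1 has progressed at least 2*b columns ahead,
// then publishes its own progress.
__device__ void wait_for_previous_sweep(volatile int* gCom, int sweep,
                                        int opCol, int b) {
  if (sweep == 0) return;                      // the first sweep never waits
  while (opCol + 2 * b > gCom[sweep - 1]) { }  // spin until it is safe to go
}

__device__ void publish_progress(volatile int* gCom, int sweep, int opCol) {
  __threadfence();          // make prior global writes visible first
  gCom[sweep] = opCol;      // let sweep+1 proceed past this column
}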

Our proposed bulge chasing implementation and optimization take full advantage of hardware accelerators, as shown by the performance comparison with MAGMA in Figure 9. The proposed implementation achieves 8.0x speedup compared to the CPU-based version.

Figure 9. The bulge chasing performance comparison between MAGMA and our proposed implementation with different matrix sizes and bandwidths

5.4. Tuning

Like conventional SBR, tuning parameters such as b and nb is essential to achieve optimal performance on different hardware. The bandwidth b is crucial for balancing bulge chasing and trailing matrix updates. In our DBR, nb must also be carefully selected to ensure peak performance during trailing matrix updates. For instance, on an A100 GPU, setting nb = 512 is sufficient for high-performance syr2k operations, and increasing nb further would result in slower panel updates. Conversely, on an H100 GPU, syr2k operations require nb ≥ 2048, so we set nb = 2048. In general, we set the block size nb = k, where k is the inner dimension at which syr2k yields the best performance for large matrices.

Although we have adapted the panel updates using a recursive formulation, the bandwidth b does not significantly impact DBR performance, as many tall and skinny GEMMs are converted into relatively square GEMMs. However, GEMMs with one small dimension still exist and are non-negligible. Additionally, bulge chasing performance is closely related to the bandwidth b, with smaller b being preferred. Thus, it is important to find a balance between DBR and bulge chasing through tuning. Fortunately, DBR is not highly sensitive to tuning, and even suboptimal settings can deliver satisfactory performance. See Table 2 for a performance reference.

b \ nb 128 256 512 1024 2048 4096 BC
16 31.9 26.3 23.8 23.4 23.7 25.4 4.6
32 24.0 18.8 16.4 15.6 15.5 16.2 4.0
64 20.2 15.0 12.6 11.7 11.4 11.6 8.2
Table 2. The elapsed time in seconds of DBR and bulge chasing (BC, the last column) on the H100 GPU with matrix size 65536 × 65536

5.5. Summary

In this section, we depict our implementations and optimizations of the tridiagonalization process, including DBR, BLAS3 operations, and bulge chasing. We’ve observed significant acceleration compared to MAGMA on the H100 GPU.

The final step of the EVD solver is computing the eigenvalues using iterative methods such as QR algorithm. As previously discussed, the iterative method is not a bottleneck on GPUs. For example, cuSOLVER’s well-optimized divide and conquer only takes about 3% of the time in the Dsyevd routine. We acknowledge that outperforming cuSOLVER’s implementation in this regard is unlikely. However, due to the tremendous speedup in tridiagonalization, even a sub-optimal divide and conquer implementation can still result in substantial acceleration of the full eigenvalue decomposition.
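For context, the cuSOLVER baseline referred to here is the Dsyevd routine (tridiagonalization followed by divide and conquer). The sketch below shows how the eigenvalue-only baseline is typically invoked through the public cuSOLVER API, with error handling, data initialization, and cleanup of intermediate failures omitted; it is a reference usage example, not part of our solver.

// Hedged sketch of invoking the cuSOLVER Dsyevd baseline (eigenvalues only):
// workspace query followed by the factorization call. Error checking and
// data initialization are omitted for brevity.
#include <cusolverDn.h>
#include <cuda_runtime.h>

void cusolver_eigenvalues(int n, double* dA /* n x n, device */,
                          double* dW /* n eigenvalues, device */) {
  cusolverDnHandle_t handle;
  cusolverDnCreate(&handle);
  int lwork = 0;
  int* dInfo = nullptr;
  double* dWork = nullptr;
  cudaMalloc(&dInfo, sizeof(int));
  cusolverDnDsyevd_bufferSize(handle, CUSOLVER_EIG_MODE_NOVECTOR,
                              CUBLAS_FILL_MODE_LOWER, n, dA, n, dW, &lwork);
  cudaMalloc(&dWork, sizeof(double) * lwork);
  cusolverDnDsyevd(handle, CUSOLVER_EIG_MODE_NOVECTOR, CUBLAS_FILL_MODE_LOWER,
                   n, dA, n, dW, dWork, lwork, dInfo);
  cudaFree(dWork); cudaFree(dInfo);
  cusolverDnDestroy(handle);
}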

Another observation is that our proposed methods are also beneficial on emerging GPU architectures such as the RTX 4090 series, which have limited FP64 computing capacity (only 1/64 of the FP32 peak performance). Our methods can take advantage of new techniques that leverage INT8 Tensor Cores to perform FP64 GEMMs (Ootomo et al., 2024). Performance results are shown in the experiments section.

6. Experimental Evaluation

We conducted experiments on a system running a 5.4.0-99-generic Linux operating system with NVIDIA H100-SMX, A100-PCIe, and RTX 4090 GPUs. The CUDA version used is 12.3, which includes a C++ compiler and the cuBLAS and cuSOLVER libraries. This section will primarily showcase the performance of the proposed tridiagonalization and the entire eigenvalue decomposition on various GPU architectures.

(a) The tridiagonalization performance on H100 GPU
(b) The tridiagonalization performance on A100 GPU
(c) The tridiagonalization performance on RTX 4090 GPU
Figure 10. Tridiagonalization performance comparison among cuSOLVER, MAGMA, and our proposed method for different matrix sizes. The numbers on top of the bars denote TFLOPs, with the peak FP64 performance of the H100, A100, and RTX 4090 GPUs being 67 TFLOPs, 19.5 TFLOPs, and 1.25 TFLOPs, respectively
Figure 11. The EVD solver comparison between MAGMA, cuSOLVER, and our proposed EVD solver on the H100 GPU

Evaluating our algorithmic design requires comparing performance against cuSOLVER and MAGMA across various matrix sizes and GPU architectures. In previous sections, we validated our sub-modules on the H100 GPU. However, this alone is insufficient to fully demonstrate the proposed algorithm's capabilities. Here, we extend our evaluation to the A100 GPU, which exhibits a lower ratio of computing capacity to memory bandwidth. Additionally, we demonstrate that our algorithm can leverage INT8 Tensor Cores (Ootomo et al., 2024) to produce FP64 precision results on consumer-grade GPUs such as the RTX 4090, which offers limited support for FP64 precision.

Figure 10 gives the performance comparison of tridiagonalization among cuSOLVER, MAGMA, and our proposed method on the H100, A100, and RTX 4090 GPUs. The experimental results reveal that our method outperforms cuSOLVER (the SOTA GPU-only implementation) and MAGMA (the SOTA two-stage tridiagonalization on a CPU-GPU hybrid architecture) across a range of matrix sizes and GPU architectures. This performance enhancement is attributed to our method’s higher data intensity, which leads to better utilization of hardware accelerators.

In terms of TFLOPs, our proposed method achieves up to 29% and 46% of peak FP64 performance on the H100 and A100 GPUs, respectively. In contrast, cuSOLVER only reaches 2.8% and 6.2% of peak performance on these GPUs. While MAGMA, benefiting from its two-stage tridiagonalization algorithm, is slightly faster than cuSOLVER for large matrices, it remains inefficient on hardware accelerators, utilizing only approximately 4.7% and 21.5% of peak performance. On the RTX 4090 GPU, leveraging the power of INT8 Tensor Cores, our method surpasses the FP64 performance limit (1.25 TFLOPs) imposed by the hardware, a feat unachievable by cuSOLVER and MAGMA, which cannot exploit Tensor Cores.

We also evaluate the end-to-end eigenvalue decomposition (EVD) using the divide-and-conquer method. The implementation of iterative methods involves more engineering effort than algorithmic innovation, making it challenging to outperform cuSOLVER, as Nvidia's researchers possess deep expertise in optimizing for Nvidia's hardware accelerators. Nonetheless, our significantly faster tridiagonalization step contributes substantially to end-to-end acceleration. Despite our suboptimal divide-and-conquer implementation, we achieve approximately a 4.1x speedup on the EVD solver when only eigenvalues are required (cuSOLVER cannot handle larger matrices due to memory limitations), as detailed in Figure 11. If we had access to cuSOLVER's divide-and-conquer implementation, we would anticipate even greater speedups. MAGMA's performance is hindered by its iterative method, as it employs the QR algorithm on the CPU, which has limited parallelism. Thus, MAGMA only outperforms cuSOLVER for very large matrices, whereas our proposed method surpasses cuSOLVER already at smaller sizes, despite our less efficient divide-and-conquer algorithm.

7. Related Work

7.1. The Development of Modern Hardware Accelerators

The rapid development of hardware accelerators began with Nvidia's release of the P100 GPU in 2016, which supported half-precision computations (Turner et al., 2018). Driven by the increasing demand for training larger neural networks, the Tesla V100 GPU (Martineau et al., 2018) introduced Tensor Cores (Markidis et al., 2018), enabling exceptionally fast fused GEMMs. High-precision computations have similarly benefited from these innovations: on the A100 (Choquette et al., 2021) and H100 GPUs (Choquette, 2023), FP64 GEMMs can also utilize Tensor Cores, achieving performance levels equivalent to FP32 computations. For scientific computing requiring high precision, AMD's recently introduced MI300x (Smith et al., 2024) offers the highest FP64 computing capacity, with a peak performance of 163.4 TFLOPs, almost three times that of the H100 GPU.

7.2. Eigenvalue Decomposition on Modern Computer Architectures

7.2.1. Tridiagonalization

On current computer architectures, tridiagonalization is typically regarded as a ’preconditioner’ to reduce the computational time of iterative methods. Tridiagonalization is commonly performed using Householder transformations (Golub and Van Loan, 2013). To enhance execution efficiency on modern high-performance architectures, the WY representation technique (Bischof and Van Loan, 1987; Schreiber and Van Loan, 1989) is often applied during the transformation process. To further improve data locality, two-stage tridiagonalization (Haidar et al., 2011) is frequently employed for larger matrices. This approach first reduces the matrix to a band form (SBR) and then reduces the band form to a tridiagonal matrix using bulge chasing. This method has been demonstrated to be highly efficient on multi-core architectures (Ltaief et al., 2012; Luszczek et al., 2011; Gates et al., 2018).

7.2.2. Iterative Methods

It is well known that expressing the roots of general high-order polynomials in radicals is impossible, meaning that a direct EVD solver does not exist. Consequently, eigenvalues must be computed using iterative methods such as the QR algorithm (Watkins, 1982), divide and conquer (Gu and Eisenstat, 1995), and Jacobi iterations (Golub and Van der Vorst, 2000). Among these, the divide and conquer method is particularly popular due to its superior parallelism and efficiency in computing eigenvectors. For computing eigenvalues alone, the QR algorithm is often the best choice. These methods are implemented in most linear algebra packages, including LAPACK (Anderson et al., 1999), Eigen (Guennebaud et al., 2010), MAGMA (Tomov et al., 2011), and cuSOLVER.

A flexible method is bisection (Dautov et al., 1994), which aims to find a subset of eigenvalues, such as the largest or smallest 100, or all eigenvalues within an interval [a, b]. In 2004, the MRRR (Multiple Relatively Robust Representations) method (Dhillon and Parlett, 2004) was introduced. It aims to compute accurately orthogonal eigenvectors without requiring the expensive reorthogonalization process, which has a worst-case complexity of O(n^3).

Another class of iterative methods is based on polar decomposition that links EVD and SVD. The QDWH-eig (QR-based dynamically weighted Halley eigenvalue decomposition)(Nakatsukasa and Higham, 2013) method uses QR factorization to compute the polar decomposition, followed by factorizing the derived orthonormal matrix using an iterative subspace method. In 2016, a GPU implementation(Sukkari et al., 2016) of QDWH-eig and QDWH-SVD was proposed. Another method for computing polar decomposition, called scaled Newton (Byers and Xu, 2008), involves fewer mathematical operations than QDWH. However, it relies heavily on the backward stable inversion of a matrix.

Recently, there has been growing interest in randomized linear algebra (Martinsson and Tropp, [n.d.]), particularly randomized subspace iteration for computing low-rank approximate eigenvalue or singular value decomposition. Two notable algorithms in this field are randomized subspace iteration (Gu, 2015) and randomized block Lanczos (Yuan et al., 2018). These algorithms have proven efficient in real-world applications, especially on modern high-performance architectures (Shah et al., 2019; Zhang et al., 2020b). However, these methods typically involve multiplying a randomly generated matrix, which limits their applicability to scenarios where accuracy is not critically sensitive.

8. Conclusion and Future Work

In this paper, we discuss the features of emerging hardware accelerators, namely that the gap between memory bandwidth and computing capacity has widened compared to older architectures. This disparity presents challenges in designing algorithms that fully utilize the computing capacity, as conventional algorithms may become memory-bound. Consequently, the performance of EVD solvers on emerging GPU architectures has been suboptimal, achieving only 1.8 TFLOPs out of a possible 67 TFLOPs on the H100 GPU.

To harness the potential of emerging hardware accelerators, we use the EVD solver as an example of how algorithmic design can leverage the computing capacity and parallelism of new hardware. In short, we propose a new band reduction algorithm named detached band reduction, which enables much higher syr2k performance, and we use a recursive-like algorithm for syr2k to further improve performance. For the bulge chasing process in tridiagonalization, we provide a GPU-based implementation that better utilizes the parallelism of new hardware. Experimental evaluations reveal that our proposed method achieves up to a 10.1x speedup on the H100 GPU and can benefit from INT8 Tensor Cores on consumer-level GPUs. However, the profiling in this paper is still limited in that we only compare TFLOPs; providing more hardware-level details, such as memory footprint, is left for future work.

However, the EVD solver is only a small portion of numerical linear algebra. Other problems, such as SVD solvers and one-sided matrix factorizations, may also benefit from the proposed ideas. Therefore, we will try to extend our approach to other numerical linear algebra and matrix computation problems, including LU, Cholesky, and QR factorization. Using INT8 Tensor Cores to accelerate these problems on hardware with limited FP64 computing capacity is also appealing. Additionally, while this paper focuses on implementations using a single hardware accelerator, scaling these problems on emerging clusters is another interesting topic for future exploration. Finally, since we target a high-precision EVD solver in this paper, exploring high-performance EVD in low precision using Tensor Cores is another direction, as the deep learning community may prefer half-precision yet stable EVD solvers.

References

  • Anderson et al. (1999) Edward Anderson, Zhaojun Bai, Christian Bischof, L Susan Blackford, James Demmel, Jack Dongarra, Jeremy Du Croz, Anne Greenbaum, Sven Hammarling, Alan McKenney, et al. 1999. LAPACK Users’ guide. SIAM.
  • Anderson et al. (2011) Michael Anderson, Grey Ballard, James Demmel, and Kurt Keutzer. 2011. Communication-avoiding QR decomposition for GPUs. In 2011 IEEE International Parallel & Distributed Processing Symposium. IEEE, 48–58.
  • Ballard et al. (2014) Grey Ballard, James Demmel, Laura Grigori, Mathias Jacquelin, Hong Diep Nguyen, and Edgar Solomonik. 2014. Reconstructing Householder vectors from tall-skinny QR. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 1159–1170.
  • Bischof and Van Loan (1987) Christian Bischof and Charles Van Loan. 1987. The WY representation for products of Householder matrices. SIAM J. Sci. Statist. Comput. 8, 1 (1987), s2–s13.
  • Byers and Xu (2008) Ralph Byers and Hongguo Xu. 2008. A new scaling for Newton’s iteration for the polar decomposition and its backward stability. SIAM J. Matrix Anal. Appl. 30, 2 (2008), 822–843.
  • Choquette (2023) Jack Choquette. 2023. Nvidia hopper h100 gpu: Scaling performance. IEEE Micro (2023).
  • Choquette et al. (2021) Jack Choquette, Wishwesh Gandhi, Olivier Giroux, Nick Stam, and Ronny Krashinsky. 2021. Nvidia a100 tensor core gpu: Performance and innovation. IEEE Micro 41, 2 (2021), 29–35.
  • Dautov et al. (1994) RZ Dautov, AD Lyashko, and SI Solov’ev. 1994. The bisection method for symmetrie eigenvalue problems with a parameter entering nonlinearly. (1994).
  • Dhillon and Parlett (2004) Inderjit S Dhillon and Beresford N Parlett. 2004. Multiple representations to compute orthogonal eigenvectors of symmetric tridiagonal matrices. Linear Algebra Appl. 387 (2004), 1–28.
  • Dongarra et al. (2019) Jack Dongarra, Mark Gates, Azzam Haidar, Jakub Kurzak, Piotr Luszczek, Panruo Wu, Ichitaro Yamazaki, Asim YarKhan, Maksims Abalenkovs, Negin Bagherpour, et al. 2019. PLASMA: Parallel linear algebra software for multicore using OpenMP. ACM Transactions on Mathematical Software (TOMS) 45, 2 (2019), 1–35.
  • Dongarra et al. (1989) Jack J Dongarra, Danny C Sorensen, and Sven J Hammarling. 1989. Block reduction of matrices to condensed forms for eigenvalue computations. J. Comput. Appl. Math. 27, 1-2 (1989), 215–227.
  • Gant ([n.d.]) Sebastian Gant. [n.d.]. Chasing the Bulge. ([n. d.]).
  • Gates et al. (2018) Mark Gates, Stanimire Tomov, and Jack Dongarra. 2018. Accelerating the SVD two stage bidiagonal reduction and divide and conquer using GPUs. Parallel Comput. 74 (2018), 3–18.
  • Golub and Van der Vorst (2000) Gene H Golub and Henk A Van der Vorst. 2000. Eigenvalue computation in the 20th century. J. Comput. Appl. Math. 123, 1-2 (2000), 35–65.
  • Golub and Van Loan (2013) Gene H Golub and Charles F Van Loan. 2013. Matrix computations. JHU press.
  • Grimes et al. (1987) Roger Grimes, Henry Krakauer, John Lewis, Horst Simon, and Su-Hai Wei. 1987. The solution of large dense generalized eigenvalue problems on the Cray X-MP/24 with SSD. J. Comput. Phys. 69, 2 (1987), 471–481.
  • Gu (2015) Ming Gu. 2015. Subspace iteration randomization and singular value problems. SIAM Journal on Scientific Computing 37, 3 (2015), A1139–A1173.
  • Gu and Eisenstat (1995) Ming Gu and Stanley C Eisenstat. 1995. A divide-and-conquer algorithm for the symmetric tridiagonal eigenproblem. SIAM J. Matrix Anal. Appl. 16, 1 (1995), 172–191.
  • Guennebaud et al. (2010) Gaël Guennebaud, Benoit Jacob, et al. 2010. Eigen. URL: http://eigen.tuxfamily.org 3, 1 (2010).
  • Gupta et al. (2018) Vineet Gupta, Tomer Koren, and Yoram Singer. 2018. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning. PMLR, 1842–1850.
  • Haidar et al. (2011) Azzam Haidar, Hatem Ltaief, and Jack Dongarra. 2011. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. 1–11.
  • Ltaief et al. (2012) Hatem Ltaief, Piotr Luszczek, Azzam Haidar, and Jack Dongarra. 2012. Solving the generalized symmetric eigenvalue problem using tile algorithms on multicore architectures. In Applications, Tools and Techniques on the Road to Exascale Computing. IOS Press, 397–404.
  • Luszczek et al. (2011) Piotr Luszczek, Hatem Ltaief, and Jack Dongarra. 2011. Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. In 2011 IEEE International Parallel & Distributed Processing Symposium. IEEE, 944–955.
  • Markidis et al. (2018) Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. 2018. Nvidia tensor core programmability, performance & precision. In 2018 IEEE international parallel and distributed processing symposium workshops (IPDPSW). IEEE, 522–531.
  • Martineau et al. (2018) Matt Martineau, Patrick Atkinson, and Simon McIntosh-Smith. 2018. Benchmarking the nvidia v100 gpu and tensor cores. In European Conference on Parallel Processing. Springer, 444–455.
  • Martinsson and Tropp ([n.d.]) PG Martinsson and JA Tropp. [n.d.]. Randomized numerical linear algebra: foundations & algorithms (2020). arXiv preprint arXiv:2002.01387 ([n. d.]).
  • Nakatsukasa and Higham (2013) Yuji Nakatsukasa and Nicholas J Higham. 2013. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM Journal on Scientific Computing 35, 3 (2013), A1325–A1349.
  • Ootomo et al. (2024) Hiroyuki Ootomo, Katsuhisa Ozaki, and Rio Yokota. 2024. DGEMM on integer matrix multiplication unit. The International Journal of High Performance Computing Applications (2024), 10943420241239588.
  • Probert (2011) Matt Probert. 2011. Electronic Structure: Basic Theory and Practical Methods, by Richard M. Martin: Scope: graduate level textbook. Level: theoretical materials scientists/condensed matter physicists/computational chemists.
  • Rösch et al. (2005) Notker Rösch, Sven Krüger, Vladimir A Nasluzov, and Alexei V Matveev. 2005. ParaGauss: The density functional program paragauss for complex systems in chemistry. In High Performance Computing in Science and Engineering, Garching 2004. Springer, 285–296.
  • Schreiber and Van Loan (1989) Robert Schreiber and Charles Van Loan. 1989. A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Statist. Comput. 10, 1 (1989), 53–57.
  • Shah et al. (2019) Ruchi Shah, Shaoshuai Zhang, Ying Lin, and Panruo Wu. 2019. xSVM: Scalable distributed kernel support vector machine training. In 2019 IEEE International Conference on Big Data (Big Data). IEEE, 155–164.
  • Smith et al. (2024) Alan Smith, Eric Chapman, Chintan Patel, Raja Swaminathan, John Wuu, Tyrone Huang, Wonjun Jung, Alexander Kaganov, Hugh McIntyre, and Ramon Mangaser. 2024. 11.1 AMD InstinctTM MI300 Series Modular Chiplet Package–HPC and AI Accelerator for Exa-Class Systems. In 2024 IEEE International Solid-State Circuits Conference (ISSCC), Vol. 67. IEEE, 490–492.
  • Sukkari et al. (2016) Dalal Sukkari, Hatem Ltaief, and David Keyes. 2016. A high performance QDWH-SVD solver using hardware accelerators. ACM Transactions on Mathematical Software (TOMS) 43, 1 (2016), 1–25.
  • Tisseur and Dongarra (1999) Françoise Tisseur and Jack Dongarra. 1999. A parallel divide and conquer algorithm for the symmetric eigenvalue problem on distributed memory architectures. SIAM Journal on Scientific Computing 20, 6 (1999), 2223–2236.
  • Tomov et al. (2011) Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. 2011. MAGMA Users’ Guide. ICL, UTK (November 2009) (2011).
  • Turner et al. (2018) Dave Turner, Dan Andresen, Kyle Hutson, and Adam Tygart. 2018. Application performance on the newest processors and GPUs. In Proceedings of the Practice and Experience on Advanced Research Computing. 1–7.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Watkins (1982) David S Watkins. 1982. Understanding the QR algorithm. SIAM review 24, 4 (1982), 427–440.
  • Williams et al. (2009) Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (2009), 65–76.
  • Yuan et al. (2018) Qiaochu Yuan, Ming Gu, and Bo Li. 2018. Superlinear convergence of randomized block lanczos algorithm. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 1404–1409.
  • Zhang et al. (2020a) Shaoshuai Zhang, Elaheh Baharlouei, and Panruo Wu. 2020a. High accuracy matrix computations on neural engines: A study of qr factorization and its applications. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing. 17–28.
  • Zhang et al. (2020b) Shaoshuai Zhang, Ruchi Shah, and Panruo Wu. 2020b. TensorSVM: accelerating kernel machines with tensor engine. In Proceedings of the 34th ACM International Conference on Supercomputing. 1–11.