MPLAPACK version 2.0.1 user manual
Abstract
MPLAPACK (formerly MPACK) is a multiple-precision version of LAPACK
(https://www.netlib.org/lapack/).
MPLAPACK version 2.0.1 is based on LAPACK version 3.9.1 and translated from Fortran 90 to C++ using FABLE, a Fortran to C++ source-to-source conversion tool (https://github.com/cctbx/cctbx_project/tree/master/fable/).
MPLAPACK version 2.0.1 provides real and complex versions of MPBLAS, and the real and complex versions of MPLAPACK support all LAPACK features: solvers for systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, singular value problems, and the related matrix factorizations, except for the mixed-precision routines. MPLAPACK defines an API for numerical linear algebra similar to LAPACK, so it is easy to port legacy C/C++ numerical codes to MPLAPACK. MPLAPACK supports binary64, binary128, FP80 (extended double), MPFR, GMP, and the QD library (double-double and quad-double). Users can choose MPFR or GMP for arbitrary-precision calculations, or double-double or quad-double for fast 32- or 64-decimal-digit calculations. We can consider the binary64 version as the C++ version of LAPACK.
Moreover, it comes with an OpenMP-accelerated version of some MPBLAS routines and CUDA versions (A100 and V100 support) of the double-double Rgemm and Rsyrk. The peak performance of the OpenMP version is almost proportional to the number of cores, and the performance of the CUDA version is impressive, approximately 400-600 GFlops. MPLAPACK is available at GitHub (https://github.com/nakatamaho/mplapack/) under the 2-clause BSD license.
1 Release note for version 2.0.1
• Version 2.0.1 supports all the complex LAPACK functions.
• Version 2.0.1 supports Rectangular Full Packed (RFP) Format.
• Version 2.0.1 comes with an acceleration of double-double Rgemm and Rsyrk routines on NVIDIA Tesla V100 and A100.
• Many small bug fixes and improvements in usability.
• The quality assurance results and benchmark results on Intel CPU and Arm CPU are available.
• Mixed precision version is not supported yet.
• Version 2.0.1 was released on 2022-09-12.
2 Introduction
Numerical linear algebra aims to solve mathematical problems, such as simultaneous linear equations, eigenvalue problems, and least-squares problems, using finite-precision arithmetic on a computer [1]. Many problems in various fields, such as natural science, social science, and engineering, can be formulated as numerical linear algebra problems. Therefore, its importance is enormous.
The standard numerical linear algebra packages are BLAS [2] and LAPACK [3]. The BLAS library defines how computers should perform vector, matrix-vector, and matrix-matrix operations in FORTRAN77; almost all other vector and matrix arithmetic libraries are compatible with it or have very similar interfaces. LAPACK solves linear problems such as linear equations, singular value problems, eigenvalue problems, and least-squares fitting problems, using BLAS as a building block.
The main focus in numerical linear algebra has been on the speed and the size of problems solved at approximately 8 or 16 decimal digits (binary32 or binary64) using highly optimized BLAS and LAPACK [4, 5, 6, 7, 8, 9]. The use of numbers with higher precision than binary64 is not common.
However, some problems in numerical linear algebra require higher-precision operations. In particular, when we solve ill-conditioned problems, perform large-scale simulations, or compute matrix inversions, we usually need higher-precision arithmetic [10, 11, 12].
Solving semidefinite programming (SDP) problems is another example requiring multiple-precision computation. Solving an SDP using binary64 usually yields values accurate to at most eight decimal digits. Numerical error accumulates because the condition number of the matrices at the optimal solution usually becomes infinite. As a result, Cholesky factorization fails as the approximate solution approaches the optimal solution in the primal-dual interior-point method [13]. Therefore, we require multiple-precision calculations if we need more than eight decimal digits of accuracy in the optimal solution. For this reason, we developed SDPA-GMP [14, 15, 16], the GNU MP version of the semidefinite programming solver based on SDPA [16], one of the fastest SDP solvers.
We have been developing MPLAPACK as a drop-in replacement for BLAS and LAPACK in SDPA since SDPA performs Cholesky decomposition and solves symmetric eigenvalue problems via 50 BLAS and LAPACK routines [17].
The features of MPLAPACK are:
• Provides an Application Programming Interface (API) for numerical linear algebra, similar to LAPACK and BLAS.
• Like BLAS and LAPACK, optimized versions of MPBLAS and MPLAPACK can be implemented; we provide a simple OpenMP version of some MPBLAS routines and two CUDA routines (the double-double Rgemm and Rsyrk) as a proof of concept.
• LAPACK and BLAS are completely rewritten in C++ using FABLE and f2c.
• C-style programming; we do not introduce new matrix and vector classes.
• Supports seven floating-point formats through precision-independent programming: binary64 (16 decimal digits), binary128 (33 decimal digits), FP80 (extended double; 19 decimal digits), double-double (32 decimal digits), quad-double (64 decimal digits), GMP, and MPFR (arbitrary precision; the default is 153 decimal digits).
• MPLAPACK 2.0.1 is based on LAPACK 3.9.1.
• Version 2.0.1 supports the real and complex versions of all BLAS and LAPACK functions: simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, singular value problems, related matrix factorizations, and the Rectangular Full Packed format, except for the mixed-precision routines.
• Reliability: we extended the original test programs to handle multiple-precision numbers, and most MPLAPACK routines have passed the tests.
• Runs on Linux/Windows/Mac.
• Released at https://github.com/nakatamaho/mplapack/ under the 2-clause BSD license.
Unless otherwise noted, this paper gives examples with Ubuntu 20.04 amd64 inside Docker as the reference environment.
The rest of the paper is organized as follows: Section 3 describes the supported CPUs, OSes, and compilers. Section 4 describes how to install MPLAPACK. Section 5 describes the supported floating-point formats. Section 6 illustrates the LAPACK and BLAS naming conventions and the available MPBLAS and MPLAPACK routines. Section 7 describes how to use MPBLAS and MPLAPACK; we use the Docker environment for these examples to avoid over-complicating the notation with different environments. Section 8 describes how we tested MPBLAS and MPLAPACK. Section 9 describes how we rewrote the Fortran 90 code in C++ and extended it to multiple-precision versions. Section 10 describes benchmarks of the MPBLAS routines. Section 11 describes the history, and Section 12 describes related work. Finally, Section 13 describes future plans for MPLAPACK.
3 Supported CPUs, OSes, and compilers
Only 64-bit CPUs are supported. The following OSes are supported:
• CentOS 7 (amd64, aarch64)
• CentOS 8 (amd64, aarch64)
• Ubuntu 22.04 (amd64, aarch64)
• Ubuntu 20.04 (amd64, aarch64)
• Ubuntu 18.04 (amd64)
• Windows 10 (amd64)
• macOS (Intel)
We support the following compilers:
• GCC (GNU Compiler Collection) 9 and later
• Intel oneAPI (the -fp-model precise option is required to compile the dd and qd versions)
Note that we use GCC from MacPorts on macOS, which is NOT the default compiler; however, we need to use the default compiler, Apple Clang, to compile GMP [18].
We support the following GPUs to accelerate matrix-matrix multiplication. See section 6 for CUDA-enabled routines.
• NVIDIA A100
• NVIDIA V100
We use the mingw64 and Wine64 environments on Linux to build and test the Windows version. On different configurations, MPLAPACK may build and work without problems. We welcome reports or patches from the community.
4 Installation
4.1 Using Docker (recommended for Linux environment), a simple demo
The easiest way to install MPLAPACK is to build and use it inside Docker [19]. The following commands build MPLAPACK inside Docker on Ubuntu amd64 (also known as x86_64) or aarch64 (also known as arm64); we use the Ubuntu 20.04 amd64 Docker environment as the reference environment for the examples in this paper.
$ git clone https://github.com/nakatamaho/mplapack/
$ cd mplapack
$ /usr/bin/time docker build -t mplapack:ubuntu2004 \
    -f Dockerfile_ubuntu20.04 . 2>&1 | tee log.ubuntu2004
It will take a while, depending on the CPU and OS. For example, the Docker build took 35 minutes on a Ryzen 3970X (3.7GHz, 32 cores) Ubuntu 20.04 machine, 1 hour 50 minutes on a Xeon E5-2623 v3 (3.0GHz, 2 CPUs, 8 cores) Ubuntu 20.04 machine, and 14 hours 40 minutes on a Raspberry Pi 4 (Cortex-A72, 1.5GHz, four cores) Ubuntu 20.04 machine. Furthermore, the build of MPLAPACK on a Mac mini (2018, Core i5-8500B, 3.0GHz, six cores) took about 3.5 hours.
A simple demo (matrix-matrix multiplication in binary128) can be run as follows:
$ docker run -it mplapack:ubuntu20.04 /bin/bash docker@2cb33bf4c36f:~$ ls MPLAPACK mplapack-2.0.1 mplapack-2.0.1.tar.xz docker@2cb33bf4c36f:~ cd MPLAPACK/share/examples/mpblas/ docker@2cb33bf4c36f:~/MPLAPACK/share/examples/mpblas$ make -f Makefile.linux c++ -c -O2 -fopenmp -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \ -I/home/docker/MPLAPACK/include/qd Rgemm_mpfr.cpp ... docker@2cb33bf4c36f:~/MPLAPACK/share/examples/mpblas$ ./Rgemm__Float128 # Rgemm demo... a =[ [ +1.00000000000000000000000000000000000e+00, +8.00000000000000000000000000000000000e+00, +3.00000000000000000000000000000000000e+00]; [ +2.00000000000000000000000 ... ans =[ [ +2.10000000000000000000000000000000000e+01, -1.92000000000000000000000000000000000e+02, +2.28000000000000000000000000000000000e+02]; [ -6.40000000000000000000000000000000000e+01, -1.46000000000000000000000000000000000e+02, +2.66000000000000000000000000000000000e+02]; [ +2.10000000000000000000000000000000000e+02, +3.61000000000000000000000000000000000e+02, -3.80000000000000000000000000000000000e+01] ] #please check by Matlab or Octave following and ans above alpha * a * b + beta * c
If it fails, it is a bug; please report the problem via a GitHub issue.
In Table 1, we list the corresponding Dockerfiles for CPUs, GPUs, OSes, and compilers. When the build is finished, all the files are installed under /home/docker/MPLAPACK_CUDA for the CUDA version, /home/docker/MPLAPACK_MINGW for the Windows version, /home/docker/MPLAPACK_INTELONEAPI for the Intel oneAPI version, and /home/docker/MPLAPACK otherwise. To build for other environments (other OSes, GPUs, and CPUs), choose the appropriate Dockerfile listed in Table 1.
Docker filename | CPU | OS | Compiler |
---|---|---|---|
Dockerfile_CentOS7 | amd64 | CentOS 7 | GCC |
Dockerfile_CentOS7_AArch64 | aarch64 only | CentOS 7 | GCC |
Dockerfile_CentOS8 | all CPUs | CentOS 8 | GCC |
Dockerfile_ubuntu18.04 | amd64 only | Ubuntu 18.04 | GCC |
Dockerfile_ubuntu20.04 | all CPUs | Ubuntu 20.04 | GCC |
Dockerfile_ubuntu20.04_inteloneapi | amd64 only | Ubuntu 20.04 | Intel oneAPI |
Dockerfile_ubuntu20.04_mingw64 | amd64 only | Ubuntu 20.04 | GCC |
Dockerfile_ubuntu20.04_cuda | NVIDIA A100, V100 | Ubuntu 20.04 | GCC |
Dockerfile_ubuntu22.04 | all CPUs | Ubuntu 22.04 | GCC |
Dockerfile_debian_bullseye | all CPUs | Debian (bullseye) | GCC |
4.2 Compiling from the source
We list prerequisites for compiling from the source in Table 2; users can satisfy these prerequisites using MacPorts on macOS. Homebrew may be used as an alternative.
$ sudo port install gcc10 coreutils git ccache
$ wget https://github.com/nakatamaho/mplapack/releases/download/v2.0.1/mplapack-2.0.1.tar.xz
$ tar xvf mplapack-2.0.1.tar.xz
$ cd mplapack-2.0.1
$ CXX="g++-mp-10" ; export CXX
$ CC="gcc-mp-10" ; export CC
$ FC="gfortran-mp-10"; export FC
$ ./configure --prefix=/usr/local --enable-gmp=yes --enable-mpfr=yes \
  --enable-_Float128=yes --enable-qd=yes --enable-dd=yes --enable-double=yes \
  --enable-_Float64x=yes --enable-test=yes
...
$ make -j6
...
$ sudo make install
Package | Version |
---|---|
GCC | 9 or later |
gmake | 4.3 or later |
git | 2.33.0 or later |
autotools | 2.71 |
automake | 1.16 |
GNU sed | 4.1 or later |
5 Supported Floating point formats
We support binary64, FP80 (extended double), binary128, double-double and quad-double (QD library), GMP, and MPFR floating-point arithmetic in MPLAPACK; the corresponding C++ floating-point types are summarized in Table 3.
Binary64 [4] is the so-called double precision: the number of significant digits in decimal is about 16, and CPUs perform arithmetic processing in hardware. As a result, the CPU can perform binary64 operations very fast.
FP80 [20] is the so-called extended double precision: the number of significant digits in decimal is about 19, and Intel and AMD CPUs perform the arithmetic in hardware. ISO defines the C real floating type _Float64x [21]; thus, we always use _Float64x as the type for FP80 numbers in MPLAPACK. However, there is no SIMD support, and due to its architecture, FP80 operations are more than ten times slower than binary64 operations. Besides, FP80 was removed in IEEE 754-2008 [4].
Binary128, sometimes called quadruple precision, has been defined since IEEE 754-2008 [4], and this format has approximately 33 significant decimal digits. Usually, binary128 arithmetic is done in software, so operations are slow. IBM z processors have been the only commercial platform supporting quadruple precision in hardware [22]. Some architectures, such as AArch64, SPARC, RISC-V 64, and MIPS64, define instructions or types for quadruple-precision arithmetic, but these processors emulate binary128 in software. ISO defines the C real floating type _Float128 [21], but it is not yet part of the C++ standard. GCC has provided the __float128 type for binary128 since GCC 4.6, and Intel oneAPI provides binary128 as _Quad. We always use _Float128 as the type for binary128 numbers in MPLAPACK. Depending on the environment, this type may be the same as long double or __float128; we typedef the appropriate type to _Float128.
The double-double format combines two binary64 numbers into one number and has approximately 32 significant decimal digits; the quad-double format combines four binary64 numbers into one number and has approximately 64 significant decimal digits. We use the QD library [23] to support double-double and quad-double precision. Double-double and quad-double arithmetic are based on Knuth's and Dekker's algorithms [24, 25], which evaluate the addition and multiplication of two binary64 numbers exactly; from these, the addition and multiplication of double-double numbers can be defined. The advantage of these formats is speed: since all arithmetic is done in binary64 and accelerated by hardware, calculation is approximately ten times faster than software-implemented binary128. The disadvantage is that programs written assuming IEEE 754 behavior might not work with these formats. Historically, with IBM XL compilers for PowerPC and GCC targeting PowerPC and Power Mac, "long double" has been equivalent to double-double [26]. However, other environments do not always support such a "long double." Therefore, we use the QD library for MPLAPACK, and we guess that is why Hida et al. developed the library.
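As an illustration of the underlying idea (this is a generic sketch of the error-free transformations, not the QD library's actual code), Knuth's TwoSum and an FMA-based TwoProd recover the exact rounding error of a binary64 addition and multiplication; double-double arithmetic is built on top of such operations:

#include <cmath> // std::fma

// Error-free transformations: s + err equals a + b (resp. a * b) exactly.
void two_sum(double a, double b, double &s, double &err) {
    s = a + b;
    double v = s - a;
    err = (a - (s - v)) + (b - v); // Knuth's branch-free TwoSum
}

void two_prod(double a, double b, double &p, double &err) {
    p = a * b;
    err = std::fma(a, b, -p); // exact multiplication error via fused multiply-add
}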
GMP [27] is a C library for arbitrary-precision arithmetic, operating on signed integers, rational numbers, and floating-point numbers. We can perform calculations to any accuracy, as long as the machine's resources allow. GMP comes with a C++ binding, and we use its mpf_class as an arbitrary-precision floating-point number. In addition, we support complex numbers by providing the mpc_class class.
MPFR [28] is a C library for multiple-precision floating-point computations with correct rounding, based on GMP. Unlike GMP, MPFR does not come with a C++ binding, so we use mpreal [29] (more precisely, the final LGPL version with our customizations) as an arbitrary-precision floating-point number. We support complex numbers via the mpcomplex class, using MPC, a library for multiple-precision complex arithmetic with correct rounding [30]. Both libraries provide almost the same functionality, but MPFR and MPC are smooth extensions of IEEE 754 and additionally support elementary and special functions, including the trigonometric functions necessary for cosine-sine decomposition. Therefore, we will drop GMP support in the future.
long double is no longer supported in MPLAPACK since version 1.0 because the meaning of long double differs confusingly among CPUs and OSes. E.g., on Intel CPUs, long double is equivalent to _Float64x on Linux; however, on Windows, long double is equivalent to double. Similar confusion happens in the AArch64 environment: the official Arm ABI defines long double to be binary128 [31], but the Apple ABI overrides long double to double on the OS side. On IBM PowerPC and Power Macs, long double has been double-double, as described above. When long double transitions to __float128 on PowerPC in the future, double-double will still be available as __ibm128 [26], a GNU extension.
Library or format | type name | accuracy in decimal digits |
---|---|---|
GMP | mpf_class, mpc_class | 154 (default) and arbitrary |
MPFR | mpreal, mpcomplex | 154 (default) and arbitrary |
double-double | dd_real, dd_complex | 32 |
quad-double | qd_real, qd_complex | 64 |
binary64 | double, std::complex<double> | 16 |
extended double | _Float64x, std::complex<_Float64x> | 19 |
binary128 | _Float128, std::complex<_Float128> | 33 |
integer | mplapackint | (32bit on Win, 64bit on Linux and macOS) |
6 LAPACK and BLAS Routine Naming Conventions and available routines
BLAS and LAPACK prefix routine names with "s" and "d" for single- and double-precision real numbers and with "c" and "z" for single- and double-precision complex numbers. FORTRAN77 and Fortran 90 are not case-sensitive in function and subroutine names, while C++ is case-sensitive in function names.
The prefix for routine names in MPLAPACK is an uppercase "R" for real numbers and "C" for complex numbers; all other letters in the routine names are lowercase. We do not distinguish function names by floating-point class (e.g., mpf_class, _Float128); we use the same function name for different floating-point classes to take advantage of function overloading (e.g., no matter which floating-point class is adopted, the program calls Raxpy appropriately). For routines that take no floating-point arguments or only integers, the floating-point class is appended to the routine name; in addition, we add the prefix "M" or insert "M" after "i."
For example,
• daxpy, zaxpy → Raxpy, Caxpy
• dgemm, zgemm → Rgemm, Cgemm
• dsterf, dsyev → Rsterf, Rsyev
• dzabs1, dzasum → RCabs1, RCasum
• lsame → Mlsame_mpfr, Mlsame_gmp, Mlsame__Float128, etc.
• dlamch → Rlamch_mpfr, Rlamch_gmp, Rlamch__Float128, etc.
• ilaenv → iMlaenv_mpfr, iMlaenv_gmp, iMlaenv__Float128, etc.
In table 4, we show all supported MPBLAS routines. The prototype definitions of these routines can be found in the following headers.
• /home/docker/MPLAPACK/include/mplapack/mpblas__Float128.h
• /home/docker/MPLAPACK/include/mplapack/mpblas__Float64x.h
• /home/docker/MPLAPACK/include/mplapack/mpblas_dd.h
• /home/docker/MPLAPACK/include/mplapack/mpblas_double.h
• /home/docker/MPLAPACK/include/mplapack/mpblas_gmp.h
• /home/docker/MPLAPACK/include/mplapack/mpblas_mpfr.h
• /home/docker/MPLAPACK/include/mplapack/mpblas_qd.h
A simple OpenMP version of MPBLAS is available. In table 5, we show all OpenMP accelerated MPBLAS routines. In table 6, we show all CUDA accelerated MPBLAS (double-double) routines.
In table 7, we show all supported MPLAPACK real driver routines. In table 9, we show all supported MPLAPACK real computational routines. In table 8, we show all supported MPLAPACK complex driver routines. In table 10, we show all supported MPLAPACK complex computational routines.
We do not list them here, but there are also many MPLAPACK auxiliary routines. We usually use the MPLAPACK driver routines and the MPBLAS routines directly; the driver routines, in turn, implicitly use the computational and auxiliary routines.
The prototype definitions of the MPLAPACK routines can be found in the following headers.
• /home/docker/MPLAPACK/include/mplapack/mplapack__Float128.h
• /home/docker/MPLAPACK/include/mplapack/mplapack__Float64x.h
• /home/docker/MPLAPACK/include/mplapack/mplapack_dd.h
• /home/docker/MPLAPACK/include/mplapack/mplapack_double.h
• /home/docker/MPLAPACK/include/mplapack/mplapack_gmp.h
• /home/docker/MPLAPACK/include/mplapack/mplapack_mpfr.h
• /home/docker/MPLAPACK/include/mplapack/mplapack_qd.h
Crotg | Cscal | Rrotg | Rrot | Rrotm | CRrot | Cswap | Rswap | CRscal | Rscal |
Ccopy | Rcopy | Caxpy | Raxpy | Rdot | Cdotc | Cdotu | RCnrm2 | Rnrm2 | Rasum |
iCasum | iRamax | RCabs1 | Mlsame | Mxerbla | |||||
Cgemv | Rgemv | Cgbmv | Rgbmv | Chemv | Chbmv | Chpmv | Rsymv | Rsbmv | Rspmv
Ctrmv | Rtrmv | Ctbmv | Ctpmv | Rtpmv | Ctrsv | Rtrsv | Ctbsv | Rtbsv | Ctpsv |
Rger | Cgeru | Cgerc | Cher | Chpr | Cher2 | Chpr2 | Rsyr | Rspr | Rsyr2 |
Rspr2 | |||||||||
Cgemm | Rgemm | Csymm | Rsymm | Chemm | Csyrk | Rsyrk | Cherk | Csyr2k | Rsyr2k |
Cher2k | Ctrmm | Rtrmm | Ctrsm | Rtrsm |
Raxpy | Rcopy | Rdot | Rgemm |
Rgemm | Rsyrk |
Rgesv | Rgesvx | Rgbsv | Rgbsvx | Rgtsv | Rgtsvx | Rposv | Rposvx | Rppsv | Rppsvx |
Rpbsv | Rpbsvx | Rptsv | Rptsvx | Rsysv | Rsysvx | Rspsv | Rspsvx | |
Rgels | Rgelsy | Rgelss | Rgelsd | ||||||
Rgglse | Rggglm | ||||||||
Rsyev | Rsyevd | Rsyevx | Rsyevr | Rspev | Rspevd | Rspevx | Rsbev | Rsbevd | Rsbevx |
Rstev | Rstevd | Rstevx | Rstevr | Rgees | Rgeesx | Rgeev | Rgeevx | Rgesvd | Rgesdd |
Rsygv | Rsygvd | Rsygvx | Rspgv | Rspgvd | Rspgvx | Rsbgv | Rsbgvd | Rsbgvx | Rgges
Rggesx | Rggev | Rggevx | Rggsvd |
Cgesv | Cgesvx | Cgbsv | Cgbsvx | Cgtsv | Cgtsvx | Cposv | Cposvx | Cppsv | Cppsvx |
Cpbsv | Cpbsvx | Cptsv | Cptsvx | Chesv | Chesvx | Csysv | Csysvx | Chpsv | Chpsvx |
Cspsv | Cspsvx | ||||||||
Cgels | Cgelsy | Cgelss | Cgelsd | ||||||
Cgglse | Cggglm | ||||||||
Cheev | Cheevd | Cheevx | Cheevr | Chpev | Chpevd | Chpevx | Chbev | Chbevd | Chbevx |
Cgees | Cgeesx | Cgeev | Cgeevx | Cgesvd | Cgesdd | ||||
Chegv | Chegvd | Chegvx | Chpgv | Chpgvd | Chpgvx | Chbgv | Chbgvd | Chbgvx | Cgges |
Cggesx | Cggev | Cggevx | Cggsvd |
Rgetrf | Rgetrs | Rgecon | Rgerfs | Rgetri | Rgeequ | Rgbtrf | Rgbtrs | Rgbcon | Rgbrfs |
Rgbequ | Rgttrf | Rgttrs | Rgtcon | Rgtrfs | Rpotrf | Rpotrs | Rpocon | Rporfs | Rpotri |
Rpoequ | Rpptrf | Rpptrs | Rppcon | Rpprfs | Rpptri | Rppequ | Rpbtrf | Rpbtrs | Rpbcon |
Rpbrfs | Rpbequ | Rpttrf | Rpttrs | Rptcon | Rptrfs | Rsytrf | Rsytrs | Rsycon | Rsyrfs |
Rsytri | Rsptrf | Rsptrs | Rspcon | Rsprfs | Rsptri | Rtrtrs | Rtrcon | Rtrrfs | Rtrtri |
Rtptrs | Rtpcon | Rtprfs | Rtptri | Rtbtrs | Rtbcon | Rtbrfs | |||
Rgeqp3 | Rgeqrf | Rorgqr | Rormqr | Rgelqf | Rorglq | Rormlq | Rgeqlf | Rorgql | Rormql |
Rgerqf | Rorgrq | Rormrq | Rtzrzf | Rormrz | |||||
Rsytrd | Rsptrd | Rsbtrd | Rorgtr | Rormtr | Ropgtr | Ropmtr | Rsteqr | Rsterf | Rstedc |
Rstegr | Rstebz | Rstein | Rpteqr | ||||||
Rgehrd | Rgebal | Rgebak | Rorghr | Rormhr | Rhseqr | Rhsein | Rtrevc | Rtrexc | Rtrsyl |
Rtrsna | Rtrsen | ||||||||
Rgebrd | Rgbbrd | Rorgbr | Rormbr | Rbdsqr | Rbdsdc | ||||
Rsygst | Rspgst | Rpbstf | Rsbgst | ||||||
Rgghrd | Rggbal | Rggbak | Rhgeqz | Rtgevc | Rtgexc | Rtgsyl | Rtgsna | Rtgsen | |
Rggsvp | Rtgsja |
Cgetrf | Cgetrs | Cgecon | Cgerfs | Cgetri | Cgeequ | Cgbtrf | Cgbtrs | Cgbcon | Cgbrfs |
Cgbequ | Cgttrf | Cgttrs | Cgtcon | Cgtrfs | Cpotrf | Cpotrs | Cpocon | Cporfs | Cpotri |
Cpoequ | Cpptrf | Cpptrs | Cppcon | Cpprfs | Cpptri | Cppequ | Cpbtrf | Cpbtrs | Cpbcon |
Cpbrfs | Cpbequ | Cpttrf | Cpttrs | Cptcon | Cptrfs | Chetrf | Chetrs | Checon | Cherfs |
Chetri | Csytrf | Csytrs | Csycon | Csyrfs | Csytri | Chptrf | Chptrs | Chpcon | Chprfs |
Chptri | Csptrf | Csptrs | Cspcon | Csprfs | Csptri | Ctrtrs | Ctrcon | Ctrrfs | Ctrtri |
Ctptrs | Ctpcon | Ctprfs | Ctptri | Ctbtrs | Ctbcon | Ctbrfs | |||
Cgeqp3 | Cgeqrf | Cungqr | Cunmqr | Cgelqf | Cunglq | Cunmlq | Cgeqlf | Cungql | Cunmql |
Cgerqf | Cungrq | Cunmrq | Ctzrzf | Cunmrz | |||||
Chetrd | Chptrd | Chbtrd | Cungtr | Cunmtr | Cupgtr | Cupmtr | Csteqr | Cstedc | Cstegr |
Cstein | Cpteqr | ||||||||
Cgehrd | Cgebal | Cgebak | Cunghr | Cunmhr | Chseqr | Chsein | Ctrevc | Ctrexc | Ctrsyl |
Ctrsna | Ctrsen | ||||||||
Cgebrd | Cgbbrd | Cungbr | Cunmbr | Cbdsqr | |||||
Chegst | Chpgst | Cpbstf | Chbgst | ||||||
Cgghrd | Cggbal | Cggbak | Chgeqz | Ctgevc | Ctgexc | Ctgsyl | Ctgsna | Ctgsen | |
Cggsvp | Ctgsja |
7 How to use MPBLAS and MPLAPACK
7.1 Multiprecision Types
We use seven kinds of floating-point formats; the multiple-precision types are listed in Table 3. For GMP, we use the built-in mpf_class for the real type; for the complex type, we developed mpc_class. For MPFR, we use a modified version of mpfrc++ (the final LGPL version) so that mpreal can be treated like a double or float type; for the complex type, we developed mpcomplex using MPFR and MPC. For double-double and quad-double, we use dd_real and qd_real, respectively; for the complex types, we developed dd_complex and qd_complex.
For extended double, we use _Float64x in all Intel and AMD environments. For binary128, we use _Float128 on all OSes and CPUs. For the complex versions of double, _Float64x, and _Float128, we use the standard C++ complex implementation. Using __float80, __float128, or long double directly is strongly discouraged, as they are not tested and not portable.
Also, we define mplapackint as a 64-bit signed integer for MPLAPACK so that it can address the whole memory space regardless of which data model the environment employs (LLP64 or LP64). On Windows, however, mplapackint is still a 32-bit signed integer.
7.2 A general pitfall when assigning floating-point numbers to multiple-precision numbers
Suppose we assign a value to a multiple-precision number alpha:
alpha = 1.2;
Such code is problematic: GCC first converts "1.2" into a double-precision number and then assigns it to alpha, so compiling this code causes the exact number to be rounded to double precision. Thus, when we print this number to 64 decimal digits using the default precision of GMP in MPBLAS (512 bits, about 153 decimal digits), alpha becomes
-1.1999999999999999555910790149937383830547332763671875000000000000e+00
This may be an undesired result. To avoid this behavior, we should write the code as follows:
alpha = "1.2";
Alternatively, use the constructor explicitly as follows:
alpha = mpf_class("1.2");
Then, the output becomes as desired.
-1.2000000000000000000000000000000000000000000000000000000000000000e+00
For GMP, to assign a complex number to beta, we should write:
beta = mpc_class(mpf_class("1.2"), mpf_class("1.2"));
For real numbers, we verified that all precisions have a string constructor except for _Float128 and _Float64x, so the following works:
alpha = "1.2";
For complex numbers, we have to use the real-type constructors explicitly, except for _Float128 and _Float64x, as follows:
beta = dd_complex(dd_real("1.2"), dd_real("1.2"));
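As a minimal, self-contained sketch of this pitfall (using GMP's standard gmpxx C++ binding and gmp_printf; the file name is arbitrary):

#include <gmpxx.h>
#include <cstdio>

int main() {
    mpf_set_default_prec(512);   // MPBLAS default: 512 bits, about 153 decimal digits
    mpf_class a = 1.2;           // "1.2" is rounded to binary64 first, then widened
    mpf_class b("1.2");          // parsed directly at 512 bits
    gmp_printf("a = %.64Fe\n", a.get_mpf_t());
    gmp_printf("b = %.64Fe\n", b.get_mpf_t());
    return 0;
}

Compiling with, e.g., g++ pitfall.cpp -lgmpxx -lgmp and running it shows the ...555910790149937383830547332763671875 tail for a, while b is exact to the printed digits.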
7.3 How to use MPBLAS
This subsection describes the basic usage of MPBLAS, OpenMP accelerated MPBLAS and CUDA accelerated MPBLAS.
7.3.1 How to use reference MPBLAS
The API of MPBLAS is very similar to the original BLAS and CBLAS. However, unlike CBLAS, we always use a one-dimensional array as a column-major type matrix in MPBLAS.
Here, we show how to use MPBLAS with two examples: Rgemm and Cgemm; other routines can be used similarly. First, we show how to use Rgemm, which corresponds to DGEMM of BLAS.
The following is the prototype of the MPFR version of multiple-precision matrix-matrix multiplication (Rgemm).
void Rgemm(const char *transa, const char *transb, mplapackint const m, mplapackint const n, mplapackint const k, mpreal const alpha, mpreal *a, mplapackint const lda, mpreal *b, mplapackint const ldb, mpreal const beta, mpreal*c, mplapackint const ldc);
Moreover, the following is the definition part of the original DGEMM:
      SUBROUTINE DGEMM(TRANSA,TRANSB,M,N,K,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
*     .. Scalar Arguments ..
      DOUBLE PRECISION ALPHA,BETA
      INTEGER K,LDA,LDB,LDC,M,N
      CHARACTER TRANSA,TRANSB
*     ..
*     .. Array Arguments ..
      DOUBLE PRECISION A(LDA,*),B(LDB,*),C(LDC,*)
There is a clear correspondence between variables in the C++ prototype of Rgemm and variables in the header of DGEMM.
• CHARACTER → const char *
• INTEGER → mplapackint
• DOUBLE PRECISION A(LDA, *) → mpreal *a
We can see such correspondences for other MPBLAS and MPLAPACK routines.
Then, let us see how we multiply matrices using the MPFR version of Rgemm. Rgemm computes

C := alpha A B + beta C,

where A, B, and C are matrices and alpha and beta are scalars. Let us choose

A = [[1, 8, 3], [0, 10, 8], [9, -5, -1]], B = [[9, 8, 3], [3, -11, 0], [-8, 6, 1]], C = [[3, 3, 0], [8, 4, 8], [6, 1, -2]],

and alpha = 3 and beta = -2. The answer is

alpha A B + beta C = [[21, -192, 18], [-118, -194, 8], [210, 361, 82]].

The program list is the following.
First, we must include mpblas_mpfr.h to use MPFR. The printmat function prints a matrix in Octave/Matlab format (lines 4 to 22 of the listing). We set a 256-bit fraction for mpfrc++ in lines 28 to 29, so the MPFR reals have an accuracy of about 77 decimal digits (256 log10 2 ≈ 77). We allocate each matrix as a one-dimensional array (lines 31 to 33). Then we set the matrices A, B, and C, writing the elements row by row into the (column-major) arrays (lines 36 to 47). The matrix-matrix multiplication Rgemm is called in line 58. One can type in the listing and save it as Rgemm_mpfr.cpp, or find it in the /home/docker/mplapack/examples/mpblas/ directory. Then, one can compile it in the Docker environment as follows:
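The full listing is not reproduced in this excerpt; the following is a minimal sketch of an equivalent program (the line numbers mentioned above refer to the original Rgemm_mpfr.cpp, not to this sketch, and a simple printing loop stands in for printmat):

#include <mpblas_mpfr.h>
#include <cstdio>

int main() {
    mpreal::set_default_prec(256);   // 256-bit fraction, about 77 decimal digits
    mplapackint n = 3;
    mpreal *A = new mpreal[n * n], *B = new mpreal[n * n], *C = new mpreal[n * n];
    // Column-major storage: element (i, j) is stored at index i + j * n.
    double Aval[] = {1, 0, 9, 8, 10, -5, 3, 8, -1};
    double Bval[] = {9, 3, -8, 8, -11, 6, 3, 0, 1};
    double Cval[] = {3, 8, 6, 3, 4, 1, 0, 8, -2};
    for (int i = 0; i < n * n; i++) { A[i] = Aval[i]; B[i] = Bval[i]; C[i] = Cval[i]; }
    mpreal alpha = 3.0, beta = -2.0;
    Rgemm("n", "n", n, n, n, alpha, A, n, B, n, beta, C, n);   // C := alpha*A*B + beta*C
    for (int i = 0; i < n; i++) {                              // print the result row by row
        for (int j = 0; j < n; j++) printf("%10.3e ", (double)C[i + j * n]);
        printf("\n");
    }
    delete[] C; delete[] B; delete[] A;
    return 0;
}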
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \ Rgemm_mpfr.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \ -lmpblas_mpfr -lgmp -lmpfr -lmpc
If the compilation is done successfully, one can run as follows:
$ ./a.out # Rgemm demo... A =[ [ 1.00e+00, 8.00e+00, 3.00e+00]; [ 0.00e+00, 1.00e+01, 8.00e+00]; [ 9.00e+00, -5.00e+00, -1.00e+00] ] B =[ [ 9.00e+00, 8.00e+00, 3.00e+00]; [ 3.00e+00, -1.10e+01, 0.00e+00]; [ -8.00e+00, 6.00e+00, 1.00e+00] ] C =[ [ 3.00e+00, 3.00e+00, 0.00e+00]; [ 8.00e+00, 4.00e+00, 8.00e+00]; [ 6.00e+00, 1.00e+00, -2.00e+00] ] alpha = 3.000e+00 beta = -2.000e+00 ans =[ [ 2.10e+01, -1.92e+02, 1.80e+01]; [ -1.18e+02, -1.94e+02, 8.00e+00]; [ 2.10e+02, 3.61e+02, 8.20e+01] ] #please check by Matlab or Octave following and ans above alpha * A * B + beta * C
One can check the result by comparing it with the output of Octave.
$ ./a.out | octave octave: X11 DISPLAY environment variable not set octave: disabling GUI features A = 1 8 3 0 10 8 9 -5 -1 B = 9 8 3 3 -11 0 -8 6 1 C = 3 3 0 8 4 8 6 1 -2 alpha = 3 beta = -2 ans = 21 -192 18 -118 -194 8 210 361 82 ans = 21 -192 18 -118 -194 8 210 361 82
In this case, we see that the results below the two "ans =" lines are the same; the Rgemm result is correct up to 16 decimal digits.
Let us see how we can multiply matrices using _Float128 (binary128). The list is the following.
We do not show the output since the output is almost the same as the MPFR version.
The _Float128 version of the program is almost the same as the MPFR version; however, the printnum part is a bit more complicated, since there are at least three kinds of binary128 support depending on the CPU and OS: (i) GCC supports __float128 only through libquadmath (Windows, macOS), (ii) GCC supports _Float128 directly and there is libc support as well (Linux amd64), and (iii) long double is already binary128 and no special support is necessary (AArch64). For (i), we internally define ___MPLAPACK_WANT_LIBQUADMATH___; for (ii), we define ___MPLAPACK__FLOAT128_ONLY___; and for (iii), we define ___MPLAPACK_LONGDOUBLE_IS_BINARY128___. Users can use these internal definitions to write portable programs.
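As a sketch of how these macros can be used (an illustration, not the exact printnum from the example program; strfromf128 is a glibc extension and may require the feature-test macro shown):

#define __STDC_WANT_IEC_60559_TYPES_EXT__ 1
#include <cstdio>
#include <cstdlib>                 // strfromf128 on glibc (case ii)
#include <mpblas__Float128.h>      // provides the _Float128 typedef on all platforms
#if defined ___MPLAPACK_WANT_LIBQUADMATH___
#include <quadmath.h>              // quadmath_snprintf (case i)
#endif

void printnum(_Float128 x) {
    char buf[128];
#if defined ___MPLAPACK_WANT_LIBQUADMATH___
    quadmath_snprintf(buf, sizeof(buf), "%.35Qe", x);      // (i) libquadmath formatting
#elif defined ___MPLAPACK_LONGDOUBLE_IS_BINARY128___
    snprintf(buf, sizeof(buf), "%.35Le", (long double)x);  // (iii) long double is binary128
#else
    strfromf128(buf, sizeof(buf), "%.35e", x);             // (ii) glibc's _Float128 support
#endif
    printf("%s", buf);
}

int main() { printnum(1.0); printf("\n"); return 0; }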
As a summary, we show how we compile and run binary64, FP80, binary128, double-double, quad-double, GMP and MPFR versions of Rgemm demo programs in /home/docker/mplapack/examples/mpblas as follows:
• binary64 (double) version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rgemm_double.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \
    -L/home/docker/MPLAPACK/lib -lmpblas_double
$ ./a.out
• FP80 (extended double) version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rgemm__Float64x.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \
    -L/home/docker/MPLAPACK/lib -lmpblas__Float64x
$ ./a.out
• binary128 version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rgemm__Float128.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \
    -L/home/docker/MPLAPACK/lib -lmpblas__Float128
$ ./a.out
or (on macOS, mingw64 and CentOS7 amd64)
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rgemm__Float128.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \
    -L/home/docker/MPLAPACK/lib -lmpblas__Float128 -lquadmath
$ ./a.out
• double-double version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rgemm_dd.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \
    -L/home/docker/MPLAPACK/lib -lmpblas_dd -lqd
$ ./a.out
• quad-double version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rgemm_qd.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \
    -L/home/docker/MPLAPACK/lib -lmpblas_qd -lqd
$ ./a.out
• GMP version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rgemm_gmp.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \
    -L/home/docker/MPLAPACK/lib -lmpblas_gmp -lgmpxx -lgmp
$ ./a.out
• MPFR version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rgemm_mpfr.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \
    -L/home/docker/MPLAPACK/lib -lmpblas_mpfr -lmpfr -lmpc -lgmp
$ ./a.out
Next, we show how to use Cgemm, the complex version of gemm. This is not just an example of how to use Cgemm, but also an example of how to input floating-point numbers.
The following is the prototype of the GMP version of complex matrix-matrix multiplication (Cgemm).
void Cgemm(const char *transa, const char *transb, mplapackint const m, mplapackint const n, mplapackint const k, mpc_class const alpha, mpc_class *a, mplapackint const lda, mpc_class *b, mplapackint const ldb, mpc_class const beta, mpc_class *c, mplapackint const ldc);
The following is the definition part of the original ZGEMM:
*     SUBROUTINE ZGEMM(TRANSA,TRANSB,M,N,K,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
*
*     .. Scalar Arguments ..
*     COMPLEX*16 ALPHA,BETA
*     INTEGER K,LDA,LDB,LDC,M,N
*     CHARACTER TRANSA,TRANSB
*     ..
*     .. Array Arguments ..
*     COMPLEX*16 A(LDA,*),B(LDB,*),C(LDC,*)
There is a clear correspondence between the variables in the C++ prototype of Cgemm and those in the header of ZGEMM, as in Rgemm's case.
• CHARACTER → const char *
• INTEGER → mplapackint
• COMPLEX*16 A(LDA, *) → mpc_class *a
Let us see how we can multiply complex matrices using GMP:

c := alpha a b + beta c,

where

a = [[1-1i, 8+2.2i, -10i], [2, 10, 8.1+2.2i], [-9+3i, -5+3i, -1]],
b = [[9, 8-0.01i, 3+1.001i], [3-8i, -11+0.1i, 8+0.00001i], [-8+1i, 6, 1.1+1i]],
c = [[3+1i, -3+9.99i, -9-11i], [8-1i, 4+4.44i, 8+9i], [6, -1, -2+1i]],

and alpha = 3-1.2i, and beta = -2-2i. The answer is:

alpha a b + beta c = [[194.120-39.920i, -324.402-191.934i, 235.524-39.798i], [-182.400-259.700i, -118.304+80.140i, 295.157-107.686i], [-114.000+289.800i, -79.102+1.694i, -179.720+156.296i]].   (1)
Following is the program list for multiplying complex matrices using GMP.
First, we must include mpblas_gmp.h to use GMP. The printmat function prints a matrix in Octave/Matlab format (lines 7 to 46 of the listing). We use the default GMP precision of 153 decimal digits (512 bits). We allocate each matrix as a one-dimensional array (lines 51 to 54) and use mpc_class for complex values. Then we set the matrices a, b, and c, writing the elements row by row into the (column-major) arrays (lines 36 to 47). The matrix-matrix multiplication Cgemm is called in line 58. One can type in the listing and save it as Cgemm_gmp.cpp, or find it in the /home/docker/mplapack/examples/mpblas/ directory. Then, one can compile it in the Docker environment as follows:
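The full Cgemm_gmp.cpp listing is likewise not reproduced here; the following minimal sketch (with small illustrative values of its own, not the matrices above) shows the calling pattern, building every complex entry from decimal strings as recommended in Section 7.2. It can be compiled with the same command shown next.

#include <mpblas_gmp.h>

int main() {
    mplapackint n = 2;
    mpc_class *a = new mpc_class[n * n], *b = new mpc_class[n * n], *c = new mpc_class[n * n];
    // Column-major storage; entries are built from exact decimal strings.
    a[0] = mpc_class(mpf_class("1"), mpf_class("-1"));   // a(1,1) = 1 - i
    a[1] = mpc_class(mpf_class("2"), mpf_class("0"));    // a(2,1) = 2
    a[2] = mpc_class(mpf_class("8"), mpf_class("2.2"));  // a(1,2) = 8 + 2.2i
    a[3] = mpc_class(mpf_class("0"), mpf_class("-10"));  // a(2,2) = -10i
    for (int i = 0; i < n * n; i++) {
        b[i] = mpc_class(mpf_class("1"), mpf_class("0"));
        c[i] = mpc_class(mpf_class("0"), mpf_class("0"));
    }
    mpc_class alpha = mpc_class(mpf_class("3.0"), mpf_class("-1.2"));
    mpc_class beta  = mpc_class(mpf_class("-2.0"), mpf_class("-2.0"));
    Cgemm("n", "n", n, n, n, alpha, a, n, b, n, beta, c, n);  // c := alpha*a*b + beta*c
    delete[] c; delete[] b; delete[] a;
    return 0;
}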
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \ Cgemm_gmp.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \ -lmpblas_gmp -lgmp
Then, you can run as follows:
$ ./a.out # Cgemm demo... a =[ [ +1.0000000000000000000000000000000000000000000000000000000000000000e+00-1.0000000000000000000000000000000000000000000000000000000000000000e+00i, +8.0000000000000000000000000000000000000000000000000000000000000000e+00+2.2000000000000001776356839400250464677810668945312500000000000000e+00i, +0.0000000000000000000000000000000000000000000000000000000000000000e+00-1.0000000000000000000000000000000000000000000000000000000000000000e+01i]; [ +2.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +1.0000000000000000000000000000000000000000000000000000000000000000e+01+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +8.0999999999999996447286321199499070644378662109375000000000000000e+00+2.2000000000000001776356839400250464677810668945312500000000000000e+00i]; [ -9.0000000000000000000000000000000000000000000000000000000000000000e+00+3.0000000000000000000000000000000000000000000000000000000000000000e+00i, -5.0000000000000000000000000000000000000000000000000000000000000000e+00+3.0000000000000000000000000000000000000000000000000000000000000000e+00i, -1.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i] ] b =[ [ +9.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +8.0000000000000000000000000000000000000000000000000000000000000000e+00-1.0000000000000000208166817117216851329430937767028808593750000000e-02i, +3.0000000000000000000000000000000000000000000000000000000000000000e+00+1.0009999999999998898658759571844711899757385253906250000000000000e+00i]; [ +3.0000000000000000000000000000000000000000000000000000000000000000e+00-8.0000000000000000000000000000000000000000000000000000000000000000e+00i, -1.1000000000000000000000000000000000000000000000000000000000000000e+01+1.0000000000000000555111512312578270211815834045410156250000000000e-01i, +8.0000000000000000000000000000000000000000000000000000000000000000e+00+1.0000000000000000818030539140313095458623138256371021270751953125e-05i]; [ -8.0000000000000000000000000000000000000000000000000000000000000000e+00+1.0000000000000000000000000000000000000000000000000000000000000000e+00i, +6.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +1.1000000000000000888178419700125232338905334472656250000000000000e+00+1.0000000000000000000000000000000000000000000000000000000000000000e+00i] ] c =[ [ +3.0000000000000000000000000000000000000000000000000000000000000000e+00+1.0000000000000000000000000000000000000000000000000000000000000000e+00i, -3.0000000000000000000000000000000000000000000000000000000000000000e+00+9.9900000000000002131628207280300557613372802734375000000000000000e+00i, -9.0000000000000000000000000000000000000000000000000000000000000000e+00-1.1000000000000000000000000000000000000000000000000000000000000000e+01i]; [ +8.0000000000000000000000000000000000000000000000000000000000000000e+00-1.0000000000000000000000000000000000000000000000000000000000000000e+00i, +4.0000000000000000000000000000000000000000000000000000000000000000e+00+4.4400000000000003907985046680551022291183471679687500000000000000e+00i, +8.0000000000000000000000000000000000000000000000000000000000000000e+00+9.0000000000000000000000000000000000000000000000000000000000000000e+00i]; [ 
+6.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, -1.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, -2.0000000000000000000000000000000000000000000000000000000000000000e+00+1.0000000000000000000000000000000000000000000000000000000000000000e+00i] ] alpha = +3.0000000000000000000000000000000000000000000000000000000000000000e+00-1.1999999999999999555910790149937383830547332763671875000000000000e+00i beta = -2.0000000000000000000000000000000000000000000000000000000000000000e+00-2.0000000000000000000000000000000000000000000000000000000000000000e+00i ans =[ [ +1.9412000000000000429878355134860610085447466221730208764814304152e+02-3.9919999999999997415400798672635510784913059003625879561714777377e+01i, -3.2440199999999999789655757975737052220306876582624715712208640779e+02-1.9193400000000000968240765342187614821595591023640492250541356027e+02i, +2.3552422999999999997259326805967153749369165591598754371335859085e+02-3.9797933599999995135557075099860818051923306637592336735756250096e+01i]; [ -1.8239999999999999016342400182111297296053289861415149007714347172e+02-2.5970000000000000937028232783632108284631549553475262574071520758e+02i, -1.1830400000000000489791540658757190133827036314312018406665329523e+02+8.0140000000000003122918590392487108980814301859549537116009666066e+01i, +2.9515651999999999967519852739003531813778936267996503629002521457e+02-1.0768569999999999661823612607559887657445872205333970782644834681e+02i]; [ -1.1400000000000000333066907387546962127089500427246093750000000000e+02+2.8979999999999999715782905695959925651550292968750000000000000000e+02i, -7.9101999999999999661257077399056923910599691859337230298536113865e+01+1.6939999999999989081927997958132429788216147581137981441043024206e+00i, -1.7971994999999999910681342224060695849998775453588962690525666214e+02+1.5629648599999999952610174635322244325683964957536896708915619612e+02i] ] #please check by Matlab or Octave following and ans above alpha * a * b + beta * c
This output is hard to read; passing it to Octave gives more readable results.
$ ./a.out a = 1.00000 - 1.00000i 8.00000 + 2.20000i 0.00000 - 10.00000i 2.00000 + 0.00000i 10.00000 + 0.00000i 8.10000 + 2.20000i -9.00000 + 3.00000i -5.00000 + 3.00000i -1.00000 + 0.00000i b = 9.0000 + 0.0000i 8.0000 - 0.0100i 3.0000 + 1.0010i 3.0000 - 8.0000i -11.0000 + 0.1000i 8.0000 + 0.0000i -8.0000 + 1.0000i 6.0000 + 0.0000i 1.1000 + 1.0000i c = 3.0000 + 1.0000i -3.0000 + 9.9900i -9.0000 - 11.0000i 8.0000 - 1.0000i 4.0000 + 4.4400i 8.0000 + 9.0000i 6.0000 + 0.0000i -1.0000 + 0.0000i -2.0000 + 1.0000i alpha = 3.0000 - 1.2000i beta = -2 - 2i ans = 194.120 - 39.920i -324.402 - 191.934i 235.524 - 39.798i -182.400 - 259.700i -118.304 + 80.140i 295.157 - 107.686i -114.000 + 289.800i -79.102 + 1.694i -179.720 + 156.296i ans = 194.120 - 39.920i -324.402 - 191.934i 235.524 - 39.798i -182.400 - 259.700i -118.304 + 80.140i 295.157 - 107.686i -114.000 + 289.800i -79.102 + 1.694i -179.720 + 156.296i
One can see that the input and output look wrong in the raw output. Nevertheless, the output is correct up to 153 decimal digits.
Let us explain the reason. We input many values, but here let us look at alpha. The GMP version of MPBLAS outputs the following:
alpha = +3.0000000000000000000000000000000000000000000000000000000000000000e+00-1.1999999999999999555910790149937383830547332763671875000000000000e+00i
The imaginary part is correct only up to 16 decimal digits. Again, this is correct behavior. First, the compiler reads alpha = mpc_class(3.0, -1.2); and converts the literal -1.2 into a double-precision number. We cannot represent -1.2 exactly in binary, since its binary representation is an infinitely repeating fraction (1.2 in binary is 1.001100110011..., with 0011 repeating). Then alpha is initialized by mpc_class from this rounded value. Thus, the input and output look wrong. We can work around this behavior by enclosing the numbers in mpf_class("...") to indicate exact decimal numbers up to the specified precision, as follows:
alpha = mpc_class(mpf_class("3.0"), mpf_class("-1.2"));
Then, the alpha becomes as follows:
alpha = +3.0000000000000000000000000000000000000000000000000000000000000000e+00-1.2000000000000000000000000000000000000000000000000000000000000000e+00i
Even though "1.2" now looks exact, there is still a tiny difference from the mathematically rigorous "1.2", since the stored value is the closest binary approximation at the specified precision.
Finally, we show the result of a program that fixes this truncation problem:
$ ./a.out # Cgemm demo... a =[ [ +1.0000000000000000000000000000000000000000000000000000000000000000e+00-1.0000000000000000000000000000000000000000000000000000000000000000e+00i, +8.0000000000000000000000000000000000000000000000000000000000000000e+00+2.2000000000000000000000000000000000000000000000000000000000000000e+00i, +0.0000000000000000000000000000000000000000000000000000000000000000e+00-1.0000000000000000000000000000000000000000000000000000000000000000e+01i]; [ +2.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +1.0000000000000000000000000000000000000000000000000000000000000000e+01+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +8.1000000000000000000000000000000000000000000000000000000000000000e+00+2.2000000000000000000000000000000000000000000000000000000000000000e+00i]; [ -9.0000000000000000000000000000000000000000000000000000000000000000e+00+3.0000000000000000000000000000000000000000000000000000000000000000e+00i, -5.0000000000000000000000000000000000000000000000000000000000000000e+00+3.0000000000000000000000000000000000000000000000000000000000000000e+00i, -1.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i] ] b =[ [ +9.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +8.0000000000000000000000000000000000000000000000000000000000000000e+00-1.0000000000000000000000000000000000000000000000000000000000000000e-02i, +3.0000000000000000000000000000000000000000000000000000000000000000e+00+1.0010000000000000000000000000000000000000000000000000000000000000e+00i]; [ +3.0000000000000000000000000000000000000000000000000000000000000000e+00-8.0000000000000000000000000000000000000000000000000000000000000000e+00i, -1.1000000000000000000000000000000000000000000000000000000000000000e+01+1.0000000000000000000000000000000000000000000000000000000000000000e-01i, +8.0000000000000000000000000000000000000000000000000000000000000000e+00+1.0000000000000000000000000000000000000000000000000000000000000000e-05i]; [ -8.0000000000000000000000000000000000000000000000000000000000000000e+00+1.0000000000000000000000000000000000000000000000000000000000000000e+00i, +6.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +1.1000000000000000000000000000000000000000000000000000000000000000e+00+1.0000000000000000000000000000000000000000000000000000000000000000e+00i] ] c =[ [ +3.0000000000000000000000000000000000000000000000000000000000000000e+00+1.0000000000000000000000000000000000000000000000000000000000000000e+00i, -3.0000000000000000000000000000000000000000000000000000000000000000e+00+9.9900000000000000000000000000000000000000000000000000000000000000e+00i, -9.0000000000000000000000000000000000000000000000000000000000000000e+00-1.1000000000000000000000000000000000000000000000000000000000000000e+01i]; [ +8.0000000000000000000000000000000000000000000000000000000000000000e+00-1.0000000000000000000000000000000000000000000000000000000000000000e+00i, +4.0000000000000000000000000000000000000000000000000000000000000000e+00+4.4400000000000000000000000000000000000000000000000000000000000000e+00i +8.0000000000000000000000000000000000000000000000000000000000000000e+00+9.0000000000000000000000000000000000000000000000000000000000000000e+00i]; [ 
+6.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, -1.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i -2.0000000000000000000000000000000000000000000000000000000000000000e+00+1.0000000000000000000000000000000000000000000000000000000000000000e+00i] ] alpha = +3.0000000000000000000000000000000000000000000000000000000000000000e+00-1.2000000000000000000000000000000000000000000000000000000000000000e+00i beta = -2.0000000000000000000000000000000000000000000000000000000000000000e+00-2.0000000000000000000000000000000000000000000000000000000000000000e+00i beta = -2.0000000000000000000000000000000000000000000000000000000000000000e+00-2.0000000000000000000000000000000000000000000000000000000000000000e+00i ans =[ [ +1.9412000000000000000000000000000000000000000000000000000000000000e+02-3.9920000000000000000000000000000000000000000000000000000000000000e+01i, -3.2440200000000000000000000000000000000000000000000000000000000000e+02-1.9193400000000000000000000000000000000000000000000000000000000000e+02i, +2.3552423000000000000000000000000000000000000000000000000000000000e+02-3.9797933600000000000000000000000000000000000000000000000000000000e+01i]; [ -1.8240000000000000000000000000000000000000000000000000000000000000e+02-2.5970000000000000000000000000000000000000000000000000000000000000e+02i, -1.1830400000000000000000000000000000000000000000000000000000000000e+02+8.0140000000000000000000000000000000000000000000000000000000000000e+01i, +2.9515652000000000000000000000000000000000000000000000000000000000e+02-1.0768570000000000000000000000000000000000000000000000000000000000e+02i]; [ -1.1400000000000000000000000000000000000000000000000000000000000000e+02+2.8980000000000000000000000000000000000000000000000000000000000000e+02i, -7.9102000000000000000000000000000000000000000000000000000000000000e+01+1.6940000000000000000000000000000000000000000000000000000000000000e+00i, -1.7971995000000000000000000000000000000000000000000000000000000000e+02+1.5629648600000000000000000000000000000000000000000000000000000000e+02i] ] #please check by Matlab or Octave following and ans above alpha * a * b + beta * c
Unfortunately, such a workaround does not always exist. For example, _Float128 and _Float64x do not provide such a string-to-floating-point conversion, so we have to write our own code to handle high-precision inputs in such cases.
In any case, the calculations are correct up to specified precisions.
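One possible approach on platforms where GCC's libquadmath is available (an illustration, not part of MPLAPACK itself) is to parse the decimal string directly into binary128 with strtoflt128, avoiding the binary64 round trip; GCC also accepts binary128 literals with the Q suffix (e.g., 1.2Q) as a GNU extension:

#include <cstdio>
#include <quadmath.h>   // strtoflt128, quadmath_snprintf (link with -lquadmath)

int main() {
    __float128 alpha = strtoflt128("1.2", NULL);  // correctly rounded binary128 value of 1.2
    char buf[128];
    quadmath_snprintf(buf, sizeof(buf), "%.33Qe", alpha);
    printf("alpha = %s\n", buf);
    return 0;
}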
7.3.2 OpenMP accelerated MPBLAS
We provide a reference implementation and a simple OpenMP-accelerated version of MPBLAS, shown in Table 5. Even though the number of optimized routines is small, the acceleration of Rgemm is significant, as it has a large impact on performance. To link against the optimized version, we add the "_opt" suffix to the MPBLAS library name as follows:
• -lmpblas_mpfr → -lmpblas_mpfr_opt
• -lmpblas_gmp → -lmpblas_gmp_opt
• -lmpblas_double → -lmpblas_double_opt
• -lmpblas__Float128 → -lmpblas__Float128_opt
• -lmpblas_dd → -lmpblas_dd_opt
• -lmpblas_qd → -lmpblas_qd_opt
• -lmpblas__Float64x → -lmpblas__Float64x_opt
Besides, we must add the “-fopenmp” flag when compiling and linking.
We can compile the demo program for the MPFR version and link it against the optimized version of MPBLAS.
$ g++ -fopenmp -O2 -I/home/docker/MPLAPACK/include \ -I/home/docker/MPLAPACK/include/mplapack \ Rgemm_mpfr.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \ -lmpblas_mpfr_opt -lmpfr -lmpc -lgmp $ ./a.out ...
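The number of threads used by the OpenMP-accelerated routines can be controlled with the standard OpenMP environment variable, for example:
$ OMP_NUM_THREADS=16 ./a.out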
Performance comparison is presented in Section 10.
7.3.3 CUDA accelerated MPBLAS
The CUDA routines are usable on the Ampere and Volta generations of NVIDIA Tesla GPUs; we provide double-double versions of Rgemm and Rsyrk in CUDA [32]. First, we build the CUDA version on Docker as follows:
$ docker build -f Dockerfile_ubuntu20.04_cuda -t mplapack:ubuntu2004_cuda .
In this Dockerfile, MPLAPACK is installed under /home/docker/MPLAPACK_CUDA. To run a shell in the container with GPUs enabled:
$ docker run --gpus all -it mplapack:ubuntu2004_cuda /bin/bash
Finally, to link against the CUDA version, we compile the program as follows:
$ g++ -O2 -I/home/docker/MPLAPACK_CUDA/include \ -I/home/docker/MPLAPACK_CUDA/include/mplapack Rgemm_dd.cpp \ -Wl,--rpath=/home/docker/MPLAPACK_CUDA/lib -L/home/docker/MPLAPACK_CUDA/lib \ -lmpblas_dd_cuda -lmpblas_dd -lqd \ -L/usr/local/cuda/lib64/ -lcudart $ ./a.out ...
As described in [32], using GPUs significantly improves the performance of the double-double versions of Rgemm and Rsyrk; Rsyrk is provided for the first time. Detailed benchmarks are given in Section 10; for large matrices, we obtained 450-600 GFlops using an A100 or V100.
7.4 How to use MPLAPACK
In this subsection, we describe the basic usage of MPLAPACK with examples. The API of MPLAPACK is very similar to the original LAPACK. We always use a one-dimensional array as a column-major matrix in MPLAPACK. Unfortunately, we do not have an interface like LAPACKE at the moment, so we cannot specify row- or column-major order in MPLAPACK.
7.4.1 Eigenvalues and eigenvectors of a real symmetric matrix (Rsyev)
Let us show how we diagonalize the real symmetric matrix

A = [[5, 4, 1, 1], [4, 5, 1, 1], [1, 1, 4, 2], [1, 1, 2, 4]]

using GMP. The eigenvalues are 1, 2, 5, and 10, and the corresponding eigenvectors are

(1, -1, 0, 0)/sqrt(2), (0, 0, -1, 1)/sqrt(2), (-1, -1, 2, 2)/sqrt(10), and (2, 2, 1, 1)/sqrt(10), respectively [33].
The following is the prototype of the GMP version of symmetric matrix diagonalization (Rsyev).
void Rsyev(const char *jobz, const char *uplo, mplapackint const n, mpf_class *a, mplapackint const lda, mpf_class *w, mpf_class *work, mplapackint const lwork, mplapackint &info);
The following is the definition part of the original DSYEV:
      SUBROUTINE dsyev( JOBZ, UPLO, N, A, LDA, W, WORK, LWORK, INFO )
*
*  -- LAPACK driver routine --
*  -- LAPACK is a software package provided by Univ. of Tennessee,    --
*  -- Univ. of California Berkeley, Univ. of Colorado Denver and NAG Ltd..--
*
*     .. Scalar Arguments ..
      CHARACTER          JOBZ, UPLO
      INTEGER            INFO, LDA, LWORK, N
*     ..
*     .. Array Arguments ..
      DOUBLE PRECISION   A( LDA, * ), W( * ), WORK( * )
*     ..
*
There are clear correspondences between variables in the C++ prototype of Rsyev and the header of DSYEV.
• CHARACTER → const char *
• INTEGER → mplapackint
• DOUBLE PRECISION A(LDA, *) → mpf_class *a
• DOUBLE PRECISION W(*) → mpf_class *w
• DOUBLE PRECISION WORK(*) → mpf_class *work
The list is the following.
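The full listing is not reproduced in this excerpt; the following is a minimal sketch of an equivalent program, assuming the Rsyev prototype above (the workspace size 3n-1 is the documented minimum of the corresponding LAPACK routine dsyev):

#include <mpblas_gmp.h>
#include <mplapack_gmp.h>
#include <cstdio>

int main() {
    mpf_set_default_prec(512);                    // about 153 decimal digits
    mplapackint n = 4, lwork = 3 * n - 1, info;
    mpf_class *a = new mpf_class[n * n];
    mpf_class *w = new mpf_class[n];
    mpf_class *work = new mpf_class[lwork];
    // The symmetric matrix A from the example, stored column-major
    // (since A is symmetric, row-major order gives the same array).
    double Aval[] = {5, 4, 1, 1, 4, 5, 1, 1, 1, 1, 4, 2, 1, 1, 2, 4};
    for (int i = 0; i < n * n; i++) a[i] = Aval[i];
    Rsyev("V", "U", n, a, n, w, work, lwork, info);   // eigenvalues in w, eigenvectors in a
    for (int i = 0; i < n; i++)
        gmp_printf("w[%d] = %.16Fe\n", i, w[i].get_mpf_t());  // expect 1, 2, 5, 10
    delete[] work; delete[] w; delete[] a;
    return 0;
}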
One can type in the listing and save it as Rsyev_test_gmp.cpp, or find it in the
/home/docker/mplapack/examples/mplapack/03_SymmetricEigenproblems directory. Then, one can compile it in the Docker environment as follows:
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \ Rsyev_test_gmp.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \ -lmplapack_gmp -lmpblas_gmp -lgmpxx -lgmp
Finally, you can run as follows:
$ ./a.out A =[ [ +5.0000000000000000e+00, +4.0000000000000000e+00, +1.0000000000000000e+00, +1.0000000000000000e+00]; [ +4.0000000000000000e+00, +5.0000000000000000e+00, +1.0000000000000000e+00, +1.0000000000000000e+00];] #eigenvalues w =[ [ +1.0000000000000000e+00]; [ +2.0000000000000000e+00]; [ +5.0000000000000000e+00]; [ +1.0000000000000000e+01] ] #eigenvecs U =[ [ +7.0710678118654752e-01, -1.3882760712710340e-155, -3.1622776601683793e-01, +6.3245553203367587e-01]; [ -7.0710678118654752e-01, -1.0249769098010974e-155, -3.1622776601683793e-01, +6.3245553203367587e-01] #you can check eigenvalues using octave/Matlab by: eig(A) #you can check eigenvectors using octave/Matlab by: U’*A*U
One can check the result by comparing it with the output of Octave.
$./a.out | octave octave: X11 DISPLAY environment variable not set octave: disabling GUI features A = 5 4 1 1 4 5 1 1 1 1 4 2 1 1 2 4 w = 1 2 5 10 U = 0.70711 -0.00000 -0.31623 0.63246 -0.70711 -0.00000 -0.31623 0.63246 0.00000 -0.70711 0.63246 0.31623 0.00000 0.70711 0.63246 0.31623 ans = 1.00000 2.00000 5.00000 10.00000 ans = 1.0000e+00 -2.8639e-155 2.6715e-17 -5.3429e-17 -2.6070e-155 2.0000e+00 -4.1633e-18 -2.0817e-18 -6.7450e-17 1.7912e-16 5.0000e+00 8.5557e-17 -3.5824e-16 -1.3490e-16 1.1229e-16 1.0000e+01
As a summary, we show how we compile and run binary64, FP80, binary128, double-double, quad-double, GMP, and MPFR versions of Rsyev demo programs in
/home/docker/mplapack/examples/mplapack/03_SymmetricEigenproblems as follows:
-
•
binary64 (double) version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \ Rsyev_test_double.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \ -L/home/docker/MPLAPACK/lib -lmplapack_double -lmpblas_double $ ./a.out
-
•
FP80 (extended double) version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \ Rsyev_test__Float64x.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \ -L/home/docker/MPLAPACK/lib -lmplapack__Float64x -lmpblas__Float64x $ ./a.out
-
•
binary128 version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \ Rsyev_test__Float128.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \ -L/home/docker/MPLAPACK/lib \ -lmplapack__Float128 -lmpblas__Float128 $ ./a.out
or (on macOS, mingw64 and CentOS7 amd64)
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rsyev_test__Float128.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \
    -L/home/docker/MPLAPACK/lib \
    -lmplapack__Float128 -lmpblas__Float128 -lquadmath
$ ./a.out
-
•
double-double version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rsyev_test_dd.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \
    -lmplapack_dd -lmpblas_dd -lqd
$ ./a.out
-
•
quad-double version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rsyev_test_qd.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \
    -lmplapack_qd -lmpblas_qd -lqd
$ ./a.out
-
•
GMP version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rsyev_test_gmp.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \
    -lmplapack_gmp -lmpblas_gmp -lgmpxx -lgmp
$ ./a.out
-
•
MPFR version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rsyev_test_mpfr.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \
    -lmplapack_mpfr -lmpblas_mpfr -lmpfr -lmpc -lgmp
$ ./a.out
7.4.2 Eigenvalues and eigenvectors of a complex Hermitian matrix (Cheev)
Let us show how we diagonalize the complex Hermitian matrix
$$A = \begin{pmatrix} 2 & -i & 0 \\ i & 2 & 0 \\ 0 & 0 & 3 \end{pmatrix}$$
using GMP. The eigenvalues are $\lambda = 1, 3, 3$. An eigenvector for $\lambda = 1$ is $(-i, -1, 0)^T/\sqrt{2}$, and eigenvectors for $\lambda = 3$ are $(-i, 1, 0)^T/\sqrt{2}$ and $(0, 0, 1)^T$.
The following is the prototype definition of the GMP version of the diagonalization routine for a complex Hermitian matrix (Cheev):
void Cheev(const char *jobz, const char *uplo, mplapackint const n, mpc_class *a, mplapackint const lda, mpf_class *w, mpc_class *work, mplapackint const lwork, mpf_class *rwork, mplapackint &info);
And the following is the definition of the original ZHEEV:
*     SUBROUTINE ZHEEV( JOBZ, UPLO, N, A, LDA, W, WORK, LWORK, RWORK,
*                       INFO )
*
*     .. Scalar Arguments ..
*     CHARACTER          JOBZ, UPLO
*     INTEGER            INFO, LDA, LWORK, N
*     ..
*     .. Array Arguments ..
*     DOUBLE PRECISION   RWORK( * ), W( * )
*     COMPLEX*16         A( LDA, * ), WORK( * )
There are clear correspondences between variables in the C++ prototype of Cheev and the header of ZHEEV.
-
•
CHARACTER → const char *
-
•
INTEGER → mplapackint
-
•
COMPLEX*16 A(LDA, *) → mpc_class *a
-
•
COMPLEX*16 WORK(*) → mpc_class *work
-
•
DOUBLE PRECISION W(*) → mpf_class *w
-
•
DOUBLE PRECISION RWORK(*) → mpf_class *rwork
A sample program is the following:
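A minimal sketch, assuming the mpblas_gmp.h / mplapack_gmp.h headers, that mpc_class can be constructed from an (mpf_class real, mpf_class imaginary) pair, and the ZHEEV-style minimum workspace sizes (lwork ≥ 2n−1, rwork of length 3n−2); the Cheev_test_gmp.cpp shipped with MPLAPACK may differ in detail:

#include <mpblas_gmp.h>
#include <mplapack_gmp.h>
#include <gmpxx.h>
#include <iostream>

int main() {
    mpf_set_default_prec(512); // 512-bit GMP floats
    mplapackint n = 3, lwork = 2 * 3 - 1, info;
    // Hermitian test matrix A = [[2, -i, 0], [i, 2, 0], [0, 0, 3]] in column-major order
    // (each source line below is one column of A).
    mpc_class a[] = {
        mpc_class(mpf_class(2), mpf_class(0)),  mpc_class(mpf_class(0), mpf_class(1)),  mpc_class(mpf_class(0), mpf_class(0)),
        mpc_class(mpf_class(0), mpf_class(-1)), mpc_class(mpf_class(2), mpf_class(0)),  mpc_class(mpf_class(0), mpf_class(0)),
        mpc_class(mpf_class(0), mpf_class(0)),  mpc_class(mpf_class(0), mpf_class(0)),  mpc_class(mpf_class(3), mpf_class(0))};
    mpf_class w[3], rwork[7]; // rwork needs at least 3n - 2 entries
    mpc_class work[5];        // lwork >= 2n - 1
    // jobz = "V": eigenvectors overwrite a; the (real) eigenvalues are returned in w.
    Cheev("V", "U", n, a, n, w, work, lwork, rwork, info);
    for (int i = 0; i < n; i++)
        std::cout << "w[" << i << "] = " << w[i] << std::endl;
    return 0;
}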
One can type in the program and save it as Cheev_test_gmp.cpp, or find it in the
/home/docker/mplapack/examples/mplapack/03_SymmetricEigenproblems directory. Then, one can compile it in the Docker environment as follows:
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Cheev_test_gmp.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \
    -lmplapack_gmp -lmpblas_gmp -lgmpxx -lgmp
Finally, you can run as follows:
A =[ [ +2.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +0.0000000000000000000000000000000000000000000000000000000000000000e+00-1.0000000000000000000000000000000000000000000000000000000000000000e+00i, +0.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i]; [ +0.0000000000000000000000000000000000000000000000000000000000000000e+00+1.0000000000000000000000000000000000000000000000000000000000000000e+00i, +2.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +0.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i]; [ +0.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +0.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +3.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i] ] #eigenvalues w =[ [ +1.0000000000000000000000000000000000000000000000000000000000000000e+00]; [ +3.0000000000000000000000000000000000000000000000000000000000000000e+00]; [ +3.0000000000000000000000000000000000000000000000000000000000000000e+00] ] #eigenvecs U =[ [ +0.0000000000000000000000000000000000000000000000000000000000000000e+00-7.0710678118654752440084436210484903928483593768847403658833986900e-01i, +0.0000000000000000000000000000000000000000000000000000000000000000e+00-7.0710678118654752440084436210484903928483593768847403658833986900e-01i, +0.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i]; [ -7.0710678118654752440084436210484903928483593768847403658833986900e-01+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +7.0710678118654752440084436210484903928483593768847403658833986900e-01+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +0.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i]; [ +0.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +0.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +1.0000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i] ] #you can check eigenvalues using octave/Matlab by: eig(A) #you can check eigenvectors using octave/Matlab by: U’*A*U
One can check the result by comparing it with the output of Octave.
$ ./a.out | octave octave: X11 DISPLAY environment variable not set octave: disabling GUI features A = 2 + 0i 0 - 1i 0 + 0i 0 + 1i 2 + 0i 0 + 0i 0 + 0i 0 + 0i 3 + 0i w = 1 3 3 U = 0.00000 - 0.70711i 0.00000 - 0.70711i 0.00000 + 0.00000i -0.70711 + 0.00000i 0.70711 + 0.00000i 0.00000 + 0.00000i 0.00000 + 0.00000i 0.00000 + 0.00000i 1.00000 + 0.00000i ans = 1 3 3 ans = 1.00000 0.00000 0.00000 0.00000 3.00000 0.00000 0.00000 0.00000 3.00000
In summary, we show how to compile and run the binary64, FP80, binary128, double-double, quad-double, GMP, and MPFR versions of the Cheev demo programs in
/home/docker/mplapack/examples/mplapack/03_SymmetricEigenproblems as follows:
-
•
binary64 (double) version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Cheev_test_double.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \
    -L/home/docker/MPLAPACK/lib -lmplapack_double -lmpblas_double
$ ./a.out
-
•
FP80 (extended double) version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Cheev_test__Float64x.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \
    -L/home/docker/MPLAPACK/lib -lmplapack__Float64x -lmpblas__Float64x
$ ./a.out
-
•
binary128 version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Cheev_test__Float128.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \
    -L/home/docker/MPLAPACK/lib \
    -lmplapack__Float128 -lmpblas__Float128
$ ./a.out
or (on macOS, mingw64 and CentOS7 amd64)
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Cheev_test__Float128.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \
    -L/home/docker/MPLAPACK/lib \
    -lmplapack__Float128 -lmpblas__Float128 -lquadmath
$ ./a.out
-
•
double-double version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Cheev_test_dd.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \
    -lmplapack_dd -lmpblas_dd -lqd
$ ./a.out
-
•
quad-double version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Cheev_test_qd.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \
    -lmplapack_qd -lmpblas_qd -lqd
$ ./a.out
-
•
GMP version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Cheev_test_gmp.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \
    -lmplapack_gmp -lmpblas_gmp -lgmpxx -lgmp
$ ./a.out
-
•
MPFR version
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Cheev_test_mpfr.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \
    -lmplapack_mpfr -lmpblas_mpfr -lmpfr -lmpc -lgmp
$ ./a.out
7.4.3 Eigenvalue problem of a real non-symmetric matrix (Rgees)
The following example shows how to solve a real non-symmetric eigenvalue problem. For example, let $A$ be the four-by-four matrix
$$A = \begin{pmatrix} -2 & 2 & 2 & 2 \\ -3 & 3 & 2 & 2 \\ -2 & 0 & 4 & 2 \\ -1 & 0 & 0 & 5 \end{pmatrix}$$
with eigenvalues $1$, $2$, $3$, $4$ [33]. First, we solve it with the dd_real class using Rgees. This routine computes, for an $n \times n$ real nonsymmetric matrix $A$, the eigenvalues, the real Schur form $T$, and, optionally, the matrix of Schur vectors $Z$. This gives the Schur factorization $A = Z T Z^T$. $T$ overwrites the matrix $A$. The prototype definition of Rgees is the following:
void Rgees(const char *jobvs, const char *sort, bool (*select)(dd_real, dd_real), mplapackint const n, dd_real *a, mplapackint const lda, mplapackint &sdim, dd_real *wr, dd_real *wi, dd_real *vs, mplapackint const ldvs, dd_real *work, mplapackint const lwork, bool *bwork, mplapackint &info);
The corresponding LAPACK routine is DGEES. The first lines of DGEES are as follows:
*     SUBROUTINE DGEES( JOBVS, SORT, SELECT, N, A, LDA, SDIM, WR, WI,
*                       VS, LDVS, WORK, LWORK, BWORK, INFO )
*
*     .. Scalar Arguments ..
*     CHARACTER          JOBVS, SORT
*     INTEGER            INFO, LDA, LDVS, LWORK, N, SDIM
*     ..
*     .. Array Arguments ..
*     LOGICAL            BWORK( * )
*     DOUBLE PRECISION   A( LDA, * ), VS( LDVS, * ), WI( * ), WORK( * ),
*    $                   WR( * )
*     ..
*     .. Function Arguments ..
*     LOGICAL            SELECT
*     EXTERNAL           SELECT
A sample program is the following:
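A minimal sketch, assuming the mpblas_dd.h / mplapack_dd.h headers; since sort = "N", the select callback and the bwork array are not referenced, so null pointers are passed (the Rgees_test_dd.cpp shipped with MPLAPACK may differ in detail):

#include <mpblas_dd.h>
#include <mplapack_dd.h>
#include <qd/dd_real.h>
#include <iostream>

int main() {
    mplapackint n = 4, lwork = 3 * 4, sdim = 0, info;
    // Test matrix in column-major order (each source line is one column);
    // its eigenvalues are 1, 2, 3, 4.
    dd_real a[] = {-2.0, -3.0, -2.0, -1.0,
                    2.0,  3.0,  0.0,  0.0,
                    2.0,  2.0,  4.0,  0.0,
                    2.0,  2.0,  2.0,  5.0};
    dd_real wr[4], wi[4], vs[16], work[12];
    // jobvs = "V": compute Schur vectors; sort = "N": no eigenvalue ordering.
    Rgees("V", "N", nullptr, n, a, n, sdim, wr, wi, vs, n, work, lwork, nullptr, info);
    for (int i = 0; i < n; i++)
        std::cout << "w_" << i + 1 << " = " << wr[i] << " + " << wi[i] << "i" << std::endl;
    std::cout << "info = " << info << std::endl;
    return 0;
}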
One can compile this source code by:
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rgees_test_dd.cpp -L/home/docker/MPLAPACK/lib -lmplapack_dd -lmpblas_dd -lqd
The output of the executable is the following:
$ LD_LIBRARY_PATH=/home/docker/MPLAPACK/lib ./a.out # octave check a =[ [ -2.0000000000000000e+00, +2.0000000000000000e+00, +2.0000000000000000e+00, +2.0000000000000000e+00]; [ -3.0000000000000000e+00, +3.0000000000000000e+00, +2.0000000000000000e+00, +2.0000000000000000e+00]; [ -2.0000000000000000e+00, +0.0000000000000000e+00, +4.0000000000000000e+00, +2.0000000000000000e+00]; [ -1.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00, +5.0000000000000000e+00] ] vs =[ [ -7.3029674334022148e-01, -6.8313005106397323e-01, +0.0000000000000000e+00, +0.0000000000000000e+00]; [ -5.4772255750516611e-01, +5.8554004376911991e-01, +5.9761430466719682e-01, -2.7760873845656105e-33]; [ -3.6514837167011074e-01, +3.9036002917941327e-01, -7.1713716560063618e-01, -4.4721359549995794e-01]; [ -1.8257418583505537e-01, +1.9518001458970664e-01, -3.5856858280031809e-01, +8.9442719099991588e-01] ] t =[ [ +1.0000000000000000e+00, -6.9487922897230340e+00, +2.5313275267375116e+00, -1.9595917942265425e+00]; [ +0.0000000000000000e+00, +2.0000000000000000e+00, -1.3063945294843617e+00, +7.8558440484957257e-01]; [ +0.0000000000000000e+00, +0.0000000000000000e+00, +3.0000000000000000e+00, -1.0690449676496975e+00]; [ +0.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00, +4.0000000000000000e+00] ] vs*t*vs’ eig(a) w_1 = +1.0000000000000000e+00 +0.0000000000000000e+00i w_2 = +2.0000000000000000e+00 +0.0000000000000000e+00i w_3 = +3.0000000000000000e+00 +0.0000000000000000e+00i w_4 = +4.0000000000000000e+00 +0.0000000000000000e+00i
We see that all the eigenvalues were calculated correctly. Moreover, you can check the Schur matrix with Octave.
$ LD_LIBRARY_PATH=/home/docker/MPLAPACK/lib ./a.out | octave octave: X11 DISPLAY environment variable not set octave: disabling GUI features a = -2 2 2 2 -3 3 2 2 -2 0 4 2 -1 0 0 5 vs = -0.73030 -0.68313 0.00000 0.00000 -0.54772 0.58554 0.59761 -0.00000 -0.36515 0.39036 -0.71714 -0.44721 -0.18257 0.19518 -0.35857 0.89443 t = 1.00000 -6.94879 2.53133 -1.95959 0.00000 2.00000 -1.30639 0.78558 0.00000 0.00000 3.00000 -1.06904 0.00000 0.00000 0.00000 4.00000 ans = -2.0000e+00 2.0000e+00 2.0000e+00 2.0000e+00 -3.0000e+00 3.0000e+00 2.0000e+00 2.0000e+00 -2.0000e+00 5.1394e-16 4.0000e+00 2.0000e+00 -1.0000e+00 2.5697e-16 -4.5393e-17 5.0000e+00 ans = 1.00000 2.00000 3.00000 4.00000 w_1 = 1 w_2 = 2 w_3 = 3 w_4 = 4
We can see that we correctly calculated all the eigenvalues, the Schur form, and the Schur vectors.
7.4.4 Eigenvalue problem of a complex non-symmetric matrix (Cgees)
The following example shows how to solve a complex non-symmetric eigenvalue problem. For example, let $A$ be the four-by-four matrix
$$A = \begin{pmatrix} 5+9i & 5+5i & -6-6i & -7-7i \\ 3+3i & 6+10i & -5-5i & -6-6i \\ 2+2i & 3+3i & -1+3i & -5-5i \\ 1+i & 2+2i & -3-3i & 4i \end{pmatrix}$$
with eigenvalues $1+5i$, $2+6i$, $3+7i$, $4+8i$ [33]. First, we solve it with the dd_complex class using Cgees. This routine computes, for an $n \times n$ complex nonsymmetric matrix $A$, the eigenvalues, the Schur form $T$, and, optionally, the matrix of Schur vectors $Z$. This gives the Schur factorization $A = Z T Z^H$. $T$ overwrites the matrix $A$. The prototype definition of Cgees is the following:
void Cgees(const char *jobvs, const char *sort, bool (*select)(dd_complex), mplapackint const n, dd_complex *a, mplapackint const lda, mplapackint &sdim, dd_complex *w, dd_complex *vs, mplapackint const ldvs, dd_complex *work, mplapackint const lwork, dd_real *rwork, bool *bwork, mplapackint &info);
The corresponding LAPACK routine is ZGEES. The first lines of ZGEES are as follows:
*     SUBROUTINE ZGEES( JOBVS, SORT, SELECT, N, A, LDA, SDIM, W, VS,
*                       LDVS, WORK, LWORK, RWORK, BWORK, INFO )
*
*     .. Scalar Arguments ..
*     CHARACTER          JOBVS, SORT
*     INTEGER            INFO, LDA, LDVS, LWORK, N, SDIM
*     ..
*     .. Array Arguments ..
*     LOGICAL            BWORK( * )
*     DOUBLE PRECISION   RWORK( * )
*     COMPLEX*16         A( LDA, * ), VS( LDVS, * ), W( * ), WORK( * )
*     ..
*     .. Function Arguments ..
*     LOGICAL            SELECT
*     EXTERNAL           SELECT
A sample program is the following:
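A minimal sketch, assuming the mpblas_dd.h / mplapack_dd.h headers and that dd_complex can be constructed from a (real, imaginary) pair and exposes real()/imag() accessors; the Cgees_test_dd.cpp shipped with MPLAPACK may differ in detail:

#include <mpblas_dd.h>
#include <mplapack_dd.h>
#include <iostream>

int main() {
    mplapackint n = 4, lwork = 2 * 4, sdim = 0, info;
    // Gregory-Karney test matrix in column-major order (each source line is one column);
    // eigenvalues are 1+5i, 2+6i, 3+7i, 4+8i.
    dd_complex a[] = {
        dd_complex(5.0, 9.0),   dd_complex(3.0, 3.0),   dd_complex(2.0, 2.0),  dd_complex(1.0, 1.0),
        dd_complex(5.0, 5.0),   dd_complex(6.0, 10.0),  dd_complex(3.0, 3.0),  dd_complex(2.0, 2.0),
        dd_complex(-6.0, -6.0), dd_complex(-5.0, -5.0), dd_complex(-1.0, 3.0), dd_complex(-3.0, -3.0),
        dd_complex(-7.0, -7.0), dd_complex(-6.0, -6.0), dd_complex(-5.0, -5.0), dd_complex(0.0, 4.0)};
    dd_complex w[4], vs[16], work[8];
    dd_real rwork[4];
    // jobvs = "V": compute Schur vectors; sort = "N": select and bwork are not referenced.
    Cgees("V", "N", nullptr, n, a, n, sdim, w, vs, n, work, lwork, rwork, nullptr, info);
    for (int i = 0; i < n; i++)
        std::cout << "w_" << i + 1 << " = " << w[i].real() << " + " << w[i].imag() << "i" << std::endl;
    return 0;
}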
One can compile this source code by:
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Cgees_test_dd.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \
    -lmplapack_dd -lmpblas_dd -lqd
The output of the executable is the following:
# Ex. 6.5 p. 116, Collection of Matrices for Testing Computational Algorithms, Robert T. Gregory, David L. Karney # octave check split_long_rows(0) a =[ [ +5.0000000000000000e+00+9.0000000000000000e+00i, +5.0000000000000000e+00+5.0000000000000000e+00i, -6.0000000000000000e+00-6.0000000000000000e+00i, -7.0000000000000000e+00-7.0000000000000000e+00i]; [ +3.0000000000000000e+00+3.0000000000000000e+00i, +6.0000000000000000e+00+1.0000000000000000e+01i, -5.0000000000000000e+00-5.0000000000000000e+00i, -6.0000000000000000e+00-6.0000000000000000e+00i]; [ +2.0000000000000000e+00+2.0000000000000000e+00i, +3.0000000000000000e+00+3.0000000000000000e+00i, -1.0000000000000000e+00+3.0000000000000000e+00i, -5.0000000000000000e+00-5.0000000000000000e+00i]; [ +1.0000000000000000e+00+1.0000000000000000e+00i, +2.0000000000000000e+00+2.0000000000000000e+00i, -3.0000000000000000e+00-3.0000000000000000e+00i, +0.0000000000000000e+00+4.0000000000000000e+00i] ] w =[ +2.0000000000000000e+00+6.0000000000000000e+00i, +4.0000000000000000e+00+8.0000000000000000e+00i, +3.0000000000000000e+00+7.0000000000000000e+00i, +1.0000000000000000e+00+5.0000000000000000e+00i] vs =[ [ +3.7428970594742688e-01-5.2577170701090183e-02i, -1.1134324587883832e-01+4.9471763536387269e-01i, +3.8355427599552469e-01-6.7296813993349557e-01i, +0.0000000000000000e+00+0.0000000000000000e+00i]; [ +7.4857941189485376e-01-1.0515434140218037e-01i, +3.7114415292946107e-02-1.6490587845462423e-01i, -1.2785142533184156e-01+2.2432271331116519e-01i, +4.2849243249895099e-01-3.8694646738853330e-01i]; [ +3.7428970594742688e-01-5.2577170701090183e-02i, -1.1134324587883832e-01+4.9471763536387269e-01i, -2.5570285066368313e-01+4.4864542662233038e-01i, -4.2849243249895099e-01+3.8694646738853330e-01i]; [ +3.7428970594742688e-01-5.2577170701090183e-02i, +1.4845766117178443e-01-6.5962351381849692e-01i, +1.2785142533184156e-01-2.2432271331116519e-01i, -4.2849243249895099e-01+3.8694646738853330e-01i] ] t =[ [ +2.0000000000000000e+00+6.0000000000000000e+00i, -4.6106496752391337e+00+2.0837249271958309e+00i, +4.9279443181496018e+00-6.3330804742348119e-01i, +1.9999717105990234e+01+3.8559640218114672e+00i]; [ +0.0000000000000000e+00+0.0000000000000000e+00i, +4.0000000000000000e+00+8.0000000000000000e+00i, -2.8979419726911526e-01-5.4495110256015637e-01i, -1.0301880205614183e+00-5.9845553818543799e+00i]; [ +0.0000000000000000e+00+0.0000000000000000e+00i, +0.0000000000000000e+00+0.0000000000000000e+00i, +3.0000000000000000e+00+7.0000000000000000e+00i, +2.2784477280226518e+00+4.5175962580412624e+00i]; [ +0.0000000000000000e+00+0.0000000000000000e+00i, +0.0000000000000000e+00+0.0000000000000000e+00i, +0.0000000000000000e+00+0.0000000000000000e+00i, +1.0000000000000000e+00+5.0000000000000000e+00i] ] vs*t*vs’ eig(a)
We see that all the eigenvalues were calculated correctly. Moreover, you can check the Schur matrix with Octave.
octave: X11 DISPLAY environment variable not set octave: disabling GUI features a = 5 + 9i 5 + 5i -6 - 6i -7 - 7i 3 + 3i 6 + 10i -5 - 5i -6 - 6i 2 + 2i 3 + 3i -1 + 3i -5 - 5i 1 + 1i 2 + 2i -3 - 3i 0 + 4i w = 2 + 6i 4 + 8i 3 + 7i 1 + 5i vs = 0.37429 - 0.05258i -0.11134 + 0.49472i 0.38355 - 0.67297i 0.00000 + 0.00000i 0.74858 - 0.10515i 0.03711 - 0.16491i -0.12785 + 0.22432i 0.42849 - 0.38695i 0.37429 - 0.05258i -0.11134 + 0.49472i -0.25570 + 0.44865i -0.42849 + 0.38695i 0.37429 - 0.05258i 0.14846 - 0.65962i 0.12785 - 0.22432i -0.42849 + 0.38695i t = 2.00000 + 6.00000i -4.61065 + 2.08372i 4.92794 - 0.63331i 19.99972 + 3.85596i 0.00000 + 0.00000i 4.00000 + 8.00000i -0.28979 - 0.54495i -1.03019 - 5.98456i 0.00000 + 0.00000i 0.00000 + 0.00000i 3.00000 + 7.00000i 2.27845 + 4.51760i 0.00000 + 0.00000i 0.00000 + 0.00000i 0.00000 + 0.00000i 1.00000 + 5.00000i ans = 5.00000 + 9.00000i 5.00000 + 5.00000i -6.00000 - 6.00000i -7.00000 - 7.00000i 3.00000 + 3.00000i 6.00000 + 10.00000i -5.00000 - 5.00000i -6.00000 - 6.00000i 2.00000 + 2.00000i 3.00000 + 3.00000i -1.00000 + 3.00000i -5.00000 - 5.00000i 1.00000 + 1.00000i 2.00000 + 2.00000i -3.00000 - 3.00000i -0.00000 + 4.00000i ans = 2.0000 + 6.0000i 4.0000 + 8.0000i 3.0000 + 7.0000i 1.0000 + 5.0000i
We can see that we correctly calculated all the eigenvalues, the Schur form, and the Schur vectors.
7.4.5 Eigenvalues and eigenvectors of a real non-symmetric matrix (Rgeev)
Another example shows how to solve a real non-symmetric eigenvalue problem with eigenvectors. Rgeev computes the eigenvalues and, optionally, the left and/or right eigenvectors for an $n \times n$ real nonsymmetric matrix $A$. For example, let $A$ be the four-by-four matrix
$$A = \begin{pmatrix} 4 & -5 & 0 & 3 \\ 0 & 4 & -3 & -5 \\ 5 & -3 & 4 & 0 \\ 3 & 0 & 5 & 4 \end{pmatrix}.$$
Then the eigenvalues are $\lambda_1 = 12$, $\lambda_2 = 1+5i$, $\lambda_3 = 1-5i$, and $\lambda_4 = 2$, and the corresponding right eigenvectors are
$$v_1 = \tfrac{1}{2}(-1, 1, -1, -1)^T,\quad v_2 = \tfrac{1}{2}(i, 1, 1, -i)^T,\quad v_3 = \overline{v_2},\quad v_4 = \tfrac{1}{2}(1, 1, -1, 1)^T.$$
The left eigenvectors are
$$u_1 = v_1,\quad u_2 = v_2,\quad u_3 = v_3,\quad u_4 = v_4.$$
Thus $A v_j = \lambda_j v_j$ and $u_j^H A = \lambda_j u_j^H$ for $j = 1, \dots, 4$ (each eigenvector is determined only up to a scalar factor).
Let us solve this eigenvalue problem with left and right eigenvectors using the quad-double (qd) version of MPLAPACK. The prototype definition of Rgeev is the following:
void Rgeev(const char *jobvl, const char *jobvr, mplapackint const n, qd_real *a, mplapackint const lda, qd_real *wr, qd_real *wi, qd_real *vl, mplapackint const ldvl, qd_real *vr, mplapackint const ldvr, qd_real *work, mplapackint const lwork, mplapackint &info);
The corresponding LAPACK routine is DGEEV. The first lines of DGEEV are as follows:
      SUBROUTINE dgeev( JOBVL, JOBVR, N, A, LDA, WR, WI, VL, LDVL, VR,
     $                  LDVR, WORK, LWORK, INFO )
      implicit none
*
*  -- LAPACK driver routine --
*  -- LAPACK is a software package provided by Univ. of Tennessee,    --
*  -- Univ. of California Berkeley, Univ. of Colorado Denver and NAG Ltd..--
*
*     .. Scalar Arguments ..
      CHARACTER          JOBVL, JOBVR
      INTEGER            INFO, LDA, LDVL, LDVR, LWORK, N
*     ..
*     .. Array Arguments ..
      DOUBLE PRECISION   A( LDA, * ), VL( LDVL, * ), VR( LDVR, * ),
     $                   wi( * ), work( * ), wr( * )
A sample program is the following:
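A minimal sketch, assuming the mpblas_qd.h / mplapack_qd.h headers and the DGEEV-style minimum workspace of 4n when both eigenvector sets are requested; the Rgeev_test_qd.cpp shipped with MPLAPACK may differ in detail:

#include <mpblas_qd.h>
#include <mplapack_qd.h>
#include <qd/qd_real.h>
#include <iostream>

int main() {
    mplapackint n = 4, lwork = 4 * 4, info;
    // Test matrix in column-major order (each source line is one column);
    // eigenvalues are 12, 1+5i, 1-5i, 2.
    qd_real a[] = {4.0,  0.0,  5.0, 3.0,
                  -5.0,  4.0, -3.0, 0.0,
                   0.0, -3.0,  4.0, 5.0,
                   3.0, -5.0,  0.0, 4.0};
    qd_real wr[4], wi[4], vl[16], vr[16], work[16];
    // jobvl = jobvr = "V": compute both left and right eigenvectors.
    Rgeev("V", "V", n, a, n, wr, wi, vl, n, vr, n, work, lwork, info);
    for (int i = 0; i < n; i++)
        std::cout << "w_" << i + 1 << " = " << wr[i] << " + " << wi[i] << "i" << std::endl;
    return 0;
}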
You can find this file as /home/docker/mplapack/examples/mplapack/04_NonsymmetricEigenproblems/Rgeev_test_qd.cpp, and compile it by:
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rgeev_test_qd.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \
    -lmplapack_qd -lmpblas_qd -lqd
The output of the executable is the following:
$ ./a.out # octave check split_long_rows(0) a =[ [ +4.0000000000000000e+00, -5.0000000000000000e+00, +0.0000000000000000e+00, +3.0000000000000000e+00]; [ +0.0000000000000000e+00, +4.0000000000000000e+00, -3.0000000000000000e+00, -5.0000000000000000e+00]; [ +5.0000000000000000e+00, -3.0000000000000000e+00, +4.0000000000000000e+00, +0.0000000000000000e+00]; [ +3.0000000000000000e+00, +0.0000000000000000e+00, +5.0000000000000000e+00, +4.0000000000000000e+00] ] # right vectors vr_1 =[ -5.0000000000000000e-01, +5.0000000000000000e-01, -5.0000000000000000e-01, -5.0000000000000000e-01]; vr_2 =[ +2.8486703237279396e-65-5.0000000000000000e-01i, +5.0000000000000000e-01+0.0000000000000000e+00i, +5.0000000000000000e-01+0.0000000000000000e+00i, -9.0207893584718088e-65+5.0000000000000000e-01i ]; vr_3 =[ +2.8486703237279396e-65+5.0000000000000000e-01i, +5.0000000000000000e-01+0.0000000000000000e+00i, +5.0000000000000000e-01+0.0000000000000000e+00i, -9.0207893584718088e-65-5.0000000000000000e-01i ]; vr_4 =[ +5.0000000000000000e-01, +5.0000000000000000e-01, -5.0000000000000000e-01, +5.0000000000000000e-01]; # left vectors vl_1 =[ -5.0000000000000000e-01, +5.0000000000000000e-01, -5.0000000000000000e-01, -5.0000000000000000e-01]; vl_2 =[ -4.7477838728798994e-65-5.0000000000000000e-01i, +5.0000000000000000e-01+0.0000000000000000e+00i, +5.0000000000000000e-01-1.7092021942367638e-64i, -3.4184043884735275e-64+5.0000000000000000e-01i ]; vl_3 =[ -4.7477838728798994e-65+5.0000000000000000e-01i, +5.0000000000000000e-01+0.0000000000000000e+00i, +5.0000000000000000e-01+1.7092021942367638e-64i, -3.4184043884735275e-64-5.0000000000000000e-01i ]; vl_4 =[ +5.0000000000000000e-01, +5.0000000000000000e-01, -5.0000000000000000e-01, +5.0000000000000000e-01]; w_1 = +1.2000000000000000e+01 +0.0000000000000000e+00i w_2 = +1.0000000000000000e+00 +5.0000000000000000e+00i w_3 = +1.0000000000000000e+00 -5.0000000000000000e+00i w_4 = +2.0000000000000000e+00 +0.0000000000000000e+00i ...
This can be confirmed by passing the output of the terminal to Octave, as shown below.
$ ./a.out | octave a = 4 -5 0 3 0 4 -3 -5 5 -3 4 0 3 0 5 4 w_1 = 12 w_2 = 1 + 5i w_3 = 1 - 5i w_4 = 2 a * vr_1 ans = -6 6 -6 -6 w_1 * vr_1 ans = -6 6 -6 -6 vr_1 ans = -0.50000 0.50000 -0.50000 -0.50000 a * vr_2 ans = -2.50000 + 0.50000i 0.50000 + 2.50000i 0.50000 + 2.50000i 2.50000 - 0.50000i w_2 * vr_2 ans = -2.50000 + 0.50000i 0.50000 + 2.50000i 0.50000 + 2.50000i 2.50000 - 0.50000i vr_2 ans = 0.00000 + 0.50000i 0.50000 - 0.00000i 0.50000 - 0.00000i -0.00000 - 0.50000i ...
We only show the first part of the output since the whole output is lengthy. w_1, w_2, w_3 and w_4 are the eigenvalues. vr_1 and vr_2 are right eigenvectors. You will see that you obtained the correct eigenvalues and eigenvectors.
7.4.6 Eigenvalues and eigenvectors of a complex non-symmetric matrix (Cgeev)
Another example shows how to solve a complex non-symmetric eigenvalue problem with eigenvectors. Cgeev computes the eigenvalues and, optionally, the left and/or right eigenvectors for an $n \times n$ complex nonsymmetric matrix $A$. For example, let $A$ be the same four-by-four matrix as in the Cgees example above, with eigenvalues $1+5i$, $2+6i$, $3+7i$, and $4+8i$.
Let us solve this eigenvalue problem with left and right eigenvectors using the quad-double (qd) version of MPLAPACK. The prototype definition of Cgeev is the following:
void Cgeev(const char *jobvl, const char *jobvr, mplapackint const n, qd_complex *a, mplapackint const lda, qd_complex *w, qd_complex *vl, mplapackint const ldvl, qd_complex *vr, mplapackint const ldvr, qd_complex *work, mplapackint const lwork, qd_real *rwork, mplapackint &info);
The corresponding LAPACK routine is ZGEEV. The first lines of ZGEEV are as follows:
*     SUBROUTINE ZGEEV( JOBVL, JOBVR, N, A, LDA, W, VL, LDVL, VR, LDVR,
*                       WORK, LWORK, RWORK, INFO )
*
*     .. Scalar Arguments ..
*     CHARACTER          JOBVL, JOBVR
*     INTEGER            INFO, LDA, LDVL, LDVR, LWORK, N
*     ..
*     .. Array Arguments ..
*     DOUBLE PRECISION   RWORK( * )
*     COMPLEX*16         A( LDA, * ), VL( LDVL, * ), VR( LDVR, * ),
*    $                   W( * ), WORK( * )
A sample program is the following:
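A minimal sketch, assuming the mpblas_qd.h / mplapack_qd.h headers and that qd_complex can be constructed from a (real, imaginary) pair and exposes real()/imag(); the Cgeev_test_qd.cpp shipped with MPLAPACK may differ in detail:

#include <mpblas_qd.h>
#include <mplapack_qd.h>
#include <iostream>

int main() {
    mplapackint n = 4, lwork = 2 * 4, info;
    // Gregory-Karney test matrix in column-major order (each source line is one column);
    // eigenvalues are 1+5i, 2+6i, 3+7i, 4+8i.
    qd_complex a[] = {
        qd_complex(5.0, 9.0),   qd_complex(3.0, 3.0),   qd_complex(2.0, 2.0),  qd_complex(1.0, 1.0),
        qd_complex(5.0, 5.0),   qd_complex(6.0, 10.0),  qd_complex(3.0, 3.0),  qd_complex(2.0, 2.0),
        qd_complex(-6.0, -6.0), qd_complex(-5.0, -5.0), qd_complex(-1.0, 3.0), qd_complex(-3.0, -3.0),
        qd_complex(-7.0, -7.0), qd_complex(-6.0, -6.0), qd_complex(-5.0, -5.0), qd_complex(0.0, 4.0)};
    qd_complex w[4], vl[16], vr[16], work[8];
    qd_real rwork[8]; // ZGEEV-style rwork needs 2n entries
    // jobvl = jobvr = "V": compute both left and right eigenvectors.
    Cgeev("V", "V", n, a, n, w, vl, n, vr, n, work, lwork, rwork, info);
    for (int i = 0; i < n; i++)
        std::cout << "w_" << i + 1 << " = " << w[i].real() << " + " << w[i].imag() << "i" << std::endl;
    return 0;
}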
You can find this file as /home/docker/mplapack/examples/mplapack/04_NonsymmetricEigenproblems/Cgeev_test_qd.cpp, and compile it by:
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Cgeev_test_qd.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \
    -lmplapack_qd -lmpblas_qd -lqd
The output of the executable is the following:
# Ex. 6.5 p. 116, Collection of Matrices for Testing Computational Algorithms, Robert T. Gregory, David L. Karney # octave check split_long_rows(0) a =[ [ +5.0000000000000000e+00+9.0000000000000000e+00i, +5.0000000000000000e+00+5.0000000000000000e+00i, -6.0000000000000000e+00-6.0000000000000000e+00i, -7.0000000000000000e+00-7.0000000000000000e+00i]; [+3.0000000000000000e+00+3.0000000000000000e+00i, +6.0000000000000000e+00+1.0000000000000000e+01i, -5.0000000000000000e+00-5.0000000000000000e+00i, -6.0000000000000000e+00-6.0000000000000000e+00i]; [ +2.0000000000000000e+00+2.0000000000000000e+00i, +3.0000000000000000e+00+3.0000000000000000e+00i, -1.0000000000000000e+00+3.0000000000000000e+00i, -5.0000000000000000e+00-5.0000000000000000e+00i]; [ +1.0000000000000000e+00+1.0000000000000000e+00i, +2.0000000000000000e+00+2.0000000000000000e+00i, -3.0000000000000000e+00-3.0000000000000000e+00i, +0.0000000000000000e+00+4.0000000000000000e+00i] ] lambda =[ +2.0000000000000000e+00+6.0000000000000000e+00i, +4.0000000000000000e+00+8.0000000000000000e+00i, +3.0000000000000000e+00+7.0000000000000000e+00i, +1.0000000000000000e+00+5.0000000000000000e+00i] vr =[ [ +3.7796447300922723e-01-1.5808636614229790e-64i, +5.7735026918962576e-01+0.0000000000000000e+00i, +5.7735026918962576e-01+1.6142465167791658e-64i, +7.5592894601845445e-01+0.0000000000000000e+00i]; [ +7.5592894601845445e-01+0.0000000000000000e+00i, +5.7735026918962576e-01+1.2462932666309736e-65i, +5.7735026918962576e-01+0.0000000000000000e+00i, +3.7796447300922723e-01+6.7655920188538566e-65i]; [ +3.7796447300922723e-01-1.2425840604802862e-66i, +5.7735026918962576e-01+1.0949576556829268e-64i, -2.1515093228017323e-65-6.2460902724518248e-64i, +3.7796447300922723e-01-1.1275986698089761e-65i]; [ +3.7796447300922723e-01-5.0333927464828308e-65i, -4.9136820461357707e-65-1.1498598272944756e-64i, +5.7735026918962576e-01+5.4614351362721592e-64i, +3.7796447300922723e-01+5.6379933490448805e-65i] ]
This can be confirmed by passing the output of the terminal to Octave, as shown below.
$ ./a.out | octave a = 5 + 9i 5 + 5i -6 - 6i -7 - 7i 3 + 3i 6 + 10i -5 - 5i -6 - 6i 2 + 2i 3 + 3i -1 + 3i -5 - 5i 1 + 1i 2 + 2i -3 - 3i 0 + 4i lambda = 2 + 6i 4 + 8i 3 + 7i 1 + 5i vr = 0.37796 - 0.00000i 0.57735 + 0.00000i 0.57735 + 0.00000i 0.75593 + 0.00000i 0.75593 + 0.00000i 0.57735 + 0.00000i 0.57735 + 0.00000i 0.37796 + 0.00000i 0.37796 - 0.00000i 0.57735 + 0.00000i -0.00000 - 0.00000i 0.37796 - 0.00000i 0.37796 - 0.00000i -0.00000 - 0.00000i 0.57735 + 0.00000i 0.37796 + 0.00000i
lambda is the list of eigenvalues, and vr is the list of the corresponding right eigenvectors. Apart from the fact that each column of vr is normalized to unit length, you will see that you obtained the correct eigenvalues and eigenvectors.
7.4.7 Singular value decomposition of a real matrix (Rgesvd)
We show an example of how to solve a singular value problem [34]. Let $A$ be as follows:
$$A = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 \\ 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 & 0 \end{pmatrix}.$$
Then a singular value decomposition of this matrix is given by $A = U \Sigma V^T$ with
$$U = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & 0 & 1 & 0 \end{pmatrix},\qquad \Sigma = \begin{pmatrix} 3 & 0 & 0 & 0 & 0 \\ 0 & \sqrt{5} & 0 & 0 & 0 \\ 0 & 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix},$$
and
$$V^T = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ \sqrt{0.2} & 0 & 0 & 0 & \sqrt{0.8} \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ -\sqrt{0.8} & 0 & 0 & 0 & \sqrt{0.2} \end{pmatrix}.$$
Let us solve the singular value problem for the matrix $A$ using the double-double (dd) version of MPLAPACK. The prototype definition of Rgesvd is the following:
void Rgesvd(const char *jobu, const char *jobvt, mplapackint const m, mplapackint const n, dd_real *a, mplapackint const lda, dd_real *s, dd_real *u, mplapackint const ldu, dd_real *vt, mplapackint const ldvt, dd_real *work, mplapackint const lwork, mplapackint &info);
The corresponding LAPACK routine is DGESVD. Some lines of DGESVD are as follows:
      SUBROUTINE dgesvd( JOBU, JOBVT, M, N, A, LDA, S, U, LDU,
     $                   VT, LDVT, WORK, LWORK, INFO )
*
*  -- LAPACK driver routine --
*  -- LAPACK is a software package provided by Univ. of Tennessee,    --
*  -- Univ. of California Berkeley, Univ. of Colorado Denver and NAG Ltd..--
*
*     .. Scalar Arguments ..
      CHARACTER          JOBU, JOBVT
      INTEGER            INFO, LDA, LDU, LDVT, LWORK, M, N
*     ..
*     .. Array Arguments ..
      DOUBLE PRECISION   A( LDA, * ), S( * ), U( LDU, * ),
     $                   vt( ldvt, * ), work( * )
*     ..
A sample program for dd_real is the following:
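A minimal sketch, assuming the mpblas_dd.h / mplapack_dd.h headers and a workspace comfortably above the DGESVD-style minimum; the Rgesvd_test_dd.cpp shipped with MPLAPACK may differ in detail:

#include <mpblas_dd.h>
#include <mplapack_dd.h>
#include <qd/dd_real.h>
#include <iostream>

int main() {
    mplapackint m = 4, n = 5, lwork = 50, info;
    // 4 x 5 test matrix in column-major order (lda = m, each source line is one column);
    // its singular values are 3, sqrt(5), 2, 0.
    dd_real a[] = {1.0, 0.0, 0.0, 0.0,
                   0.0, 0.0, 0.0, 2.0,
                   0.0, 3.0, 0.0, 0.0,
                   0.0, 0.0, 0.0, 0.0,
                   2.0, 0.0, 0.0, 0.0};
    dd_real s[4], u[16], vt[25], work[50];
    // jobu = jobvt = "A": compute all columns of U and all rows of V^T.
    Rgesvd("A", "A", m, n, a, m, s, u, m, vt, n, work, lwork, info);
    for (int i = 0; i < 4; i++)
        std::cout << "s[" << i << "] = " << s[i] << std::endl;
    return 0;
}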
You can find the corresponding example files for the various multiple-precision versions as Rgesvd_test__Float128.cpp, Rgesvd_test_dd.cpp, Rgesvd_test_gmp.cpp, Rgesvd_test_qd.cpp, Rgesvd_test__Float64x.cpp, Rgesvd_test_double.cpp, and Rgesvd_test_mpfr.cpp at /home/docker/mplapack/examples/mplapack/05_SingularValueDecomposition/.
In the dd_real case, you can compile it by:
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rgesvd_test_dd.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \
    -lmplapack_dd -lmpblas_dd -lqd
The output of the executable is the following:
$ ./a.out # octave check a =[ [ +1.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00, +2.0000000000000000e+00]; [ +0.0000000000000000e+00, +0.0000000000000000e+00, +3.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00]; [ +0.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00]; [ +0.0000000000000000e+00, +2.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00] ] s=[ +3.0000000000000000e+00, +2.2360679774997897e+00, +2.0000000000000000e+00, +0.0000000000000000e+00] padding=zeros(4, 5-4) u =[ [ +0.0000000000000000e+00, +1.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00]; [ +1.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00]; [ +0.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00, -1.0000000000000000e+00]; [ +0.0000000000000000e+00, +0.0000000000000000e+00, +1.0000000000000000e+00, +0.0000000000000000e+00] ] vt =[ [ +0.0000000000000000e+00, +0.0000000000000000e+00, +1.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00]; [ +4.4721359549995794e-01, +0.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00, +8.9442719099991588e-01]; [ +0.0000000000000000e+00, +1.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00]; [ +0.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00, +1.0000000000000000e+00, +0.0000000000000000e+00]; [ -8.9442719099991588e-01, +0.0000000000000000e+00, +0.0000000000000000e+00, +0.0000000000000000e+00, +4.4721359549995794e-01] ] svd(a) sigma=[diag(s) padding] sigma u * sigma * vt
For better readability, you can pass the output to Octave. The output is the following:
$ ./a.out | octave octave: X11 DISPLAY environment variable not set octave: disabling GUI features a = 1 0 0 0 2 0 0 3 0 0 0 0 0 0 0 0 2 0 0 0 s = 3.00000 2.23607 2.00000 0.00000 padding = 0 0 0 0 u = 0 1 0 0 1 0 0 0 0 0 0 -1 0 0 1 0 vt = 0.00000 0.00000 1.00000 0.00000 0.00000 0.44721 0.00000 0.00000 0.00000 0.89443 0.00000 1.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 1.00000 0.00000 -0.89443 0.00000 0.00000 0.00000 0.44721 ans = 3.00000 2.23607 2.00000 0.00000 sigma = 3.00000 0.00000 0.00000 0.00000 0.00000 0.00000 2.23607 0.00000 0.00000 0.00000 0.00000 0.00000 2.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 sigma = 3.00000 0.00000 0.00000 0.00000 0.00000 0.00000 2.23607 0.00000 0.00000 0.00000 0.00000 0.00000 2.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ans = 1 0 0 0 2 0 0 3 0 0 0 0 0 0 0 0 2 0 0 0
You will see that Rgesvd, the singular value decomposition solver, solved the problem correctly.
7.4.8 Singular value decomposition of a complex matrix (Cgesvd)
We show an example of how to solve a singular value problem for a complex matrix. Let $A$ be as follows:
$$A = \begin{pmatrix} 0.9-i & 20-2.25i & 1.75-0.5i & 0.5i \\ 8-2.25i & -0.25 & 1.25-0.25i & -3.75 \\ -1.75 & -80+1.25i & 1.5 & 30+2.25i \\ 3+0.25i & 1.75 & 2.25i & -0.25-80i \end{pmatrix}.$$
Then a singular value decomposition of this matrix is given by $A = U \Sigma V^H$, where $U$ and $V$ are unitary and the singular values are approximately $\sigma \approx (97.516, 68.106, 8.478, 2.150)$, as shown in the program output below.
Let us solve the singular value problem for the matrix $A$ using the double-double (dd) version of MPLAPACK. The prototype definition of Cgesvd is the following:
void Cgesvd(const char *jobu, const char *jobvt, mplapackint const m, mplapackint const n, dd_complex *a, mplapackint const lda, dd_real *s, dd_complex *u, mplapackint const ldu, dd_complex *vt, mplapackint const ldvt, dd_complex *work, mplapackint const lwork, dd_real *rwork, mplapackint &info);
The corresponding LAPACK routine is ZGESVD. Some lines of ZGESVD are as follows:
*     SUBROUTINE ZGESVD( JOBU, JOBVT, M, N, A, LDA, S, U, LDU, VT, LDVT,
*                        WORK, LWORK, RWORK, INFO )
*
*     .. Scalar Arguments ..
*     CHARACTER          JOBU, JOBVT
*     INTEGER            INFO, LDA, LDU, LDVT, LWORK, M, N
*     ..
*     .. Array Arguments ..
*     DOUBLE PRECISION   RWORK( * ), S( * )
*     COMPLEX*16         A( LDA, * ), U( LDU, * ), VT( LDVT, * ),
*    $                   WORK( * )
A sample program for the double-double (dd) version is the following:
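A minimal sketch, assuming the mpblas_dd.h / mplapack_dd.h headers, that dd_complex can be constructed from a (real, imaginary) pair, and a workspace comfortably above the ZGESVD-style minimum; the Cgesvd_test_dd.cpp shipped with MPLAPACK may differ in detail:

#include <mpblas_dd.h>
#include <mplapack_dd.h>
#include <iostream>

int main() {
    mplapackint m = 4, n = 4, lwork = 50, info;
    // 4 x 4 complex test matrix in column-major order (lda = m, each source line is one column).
    dd_complex a[] = {
        dd_complex(0.9, -1.0),   dd_complex(8.0, -2.25),  dd_complex(-1.75, 0.0),  dd_complex(3.0, 0.25),
        dd_complex(20.0, -2.25), dd_complex(-0.25, 0.0),  dd_complex(-80.0, 1.25), dd_complex(1.75, 0.0),
        dd_complex(1.75, -0.5),  dd_complex(1.25, -0.25), dd_complex(1.5, 0.0),    dd_complex(0.0, 2.25),
        dd_complex(0.0, 0.5),    dd_complex(-3.75, 0.0),  dd_complex(30.0, 2.25),  dd_complex(-0.25, -80.0)};
    dd_real s[4], rwork[20]; // rwork needs at least 5 * min(m, n) entries
    dd_complex u[16], vt[16], work[50];
    // jobu = jobvt = "A": compute all columns of U and all rows of V^H.
    Cgesvd("A", "A", m, n, a, m, s, u, m, vt, n, work, lwork, rwork, info);
    for (int i = 0; i < 4; i++)
        std::cout << "s[" << i << "] = " << s[i] << std::endl;
    return 0;
}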
You can find the corresponding example files for the various multiple-precision versions as Cgesvd_test__Float128.cpp, Cgesvd_test_dd.cpp, Cgesvd_test_gmp.cpp, Cgesvd_test_qd.cpp, Cgesvd_test__Float64x.cpp, Cgesvd_test_double.cpp, and Cgesvd_test_mpfr.cpp at /home/docker/mplapack/examples/mplapack/05_SingularValueDecomposition/.
In the dd case, you can compile it by:
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Cgesvd_test_dd.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib -L/home/docker/MPLAPACK/lib \
    -lmplapack_dd -lmpblas_dd -lqd
The output of the executable is the following:
$ ./a.out # octave check split_long_rows(0) a =[ [ +9.0000000000000002220446049250313080847263336181640625000000000000e-01-1.0000000000000000000000000000000000000000000000000000000000000000e+00i, +2.0000000000000000000000000000000000000000000000000000000000000000e+01-2.2500000000000000000000000000000000000000000000000000000000000000e+00i, +1.7500000000000000000000000000000000000000000000000000000000000000e+00-5.0000000000000000000000000000000000000000000000000000000000000000e-01i, +0.0000000000000000000000000000000000000000000000000000000000000000e+00+5.0000000000000000000000000000000000000000000000000000000000000000e-01i]; [ +8.0000000000000000000000000000000000000000000000000000000000000000e+00-2.2500000000000000000000000000000000000000000000000000000000000000e+00i, -2.5000000000000000000000000000000000000000000000000000000000000000e-01+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +1.2500000000000000000000000000000000000000000000000000000000000000e+00-2.5000000000000000000000000000000000000000000000000000000000000000e-01i, -3.7500000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i]; [ -1.7500000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, -8.0000000000000000000000000000000000000000000000000000000000000000e+01+1.2500000000000000000000000000000000000000000000000000000000000000e+00i, +1.5000000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +3.0000000000000000000000000000000000000000000000000000000000000000e+01+2.2500000000000000000000000000000000000000000000000000000000000000e+00i]; [ +3.0000000000000000000000000000000000000000000000000000000000000000e+00+2.5000000000000000000000000000000000000000000000000000000000000000e-01i, +1.7500000000000000000000000000000000000000000000000000000000000000e+00+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +0.0000000000000000000000000000000000000000000000000000000000000000e+00+2.2500000000000000000000000000000000000000000000000000000000000000e+00i, -2.5000000000000000000000000000000000000000000000000000000000000000e-01-8.0000000000000000000000000000000000000000000000000000000000000000e+01i] ] s=[ +9.7516494435839724342537934011744964189236290378865784643347960368e+01, +6.8106277354635765125630927080704091707844321939430382469272564138e+01, +8.4781583569581929193257165114938017510166457266377499488969146190e+00, +2.1503787370153202482459981470702526329803293729092259494247941795e+00] u =[ [ -9.1031344230065289625000603692895258254109199563673856613631987783e-02+1.0419550007707547081892003210016871725842634797582791410249228936e-01i, +6.0624015933791684880101970656199134272860397091935547317571739868e-02+2.1008219841477778090011021923228197915008235604381081943292731807e-01i, -9.9978469925849530955490215397374086462581292970446108678666366672e-02+1.5618105933067197645819512610919051343301177207836333131951543727e-01i, +8.7355245250617254255021322838492243550830273310262636819188476464e-01-3.6819109944526785319692942345862467313117969007310991253602368805e-01i]; [ -1.9857057918844668508194923848264672674827541149575776877467762232e-02+2.1261406272041457850596651639122912053764705903056249828047829994e-02i, 
-9.1661574892754928734045271369121793400923378513876808480540487939e-03-3.8982298938279813017463111582481151532286050557011081307982359715e-02i, -9.4715781555964404355021035921467887158437034362335955910113579662e-01+2.5513564301397109620885117891155720411045767100304996091477259914e-01i, -1.7622605699795972492191618082958299743466785925908687025400420878e-01-6.5515377611625650642778502070271900697881487135131290739710052841e-02i]; [+5.7212865453775911876209433088888378678015704682520653614966501059e-01-5.2679881119508889212570067087606726528213619071689150491978561913e-01i, -1.0280453437606294917501451825749672199318907430352403992267112353e-01-5.7069659877524894683413175387466614439142123832093715941506060498e-01i, -2.7491348747212516867749761285866090287970541616576302889303235643e-02+3.5503263645948459662712416388082136298992694820774724140988646759e-02i, +2.2832147261394758144289270273679539897363368839194944166497880608e-01-6.8858435912252448915976238595384806375580376776718454679051942621e-02i]; [ -4.7277573051874758750884260023823098397257693984940907812279232275e-01-3.8941644263977541772805124956744968110703693808087656417153595879e-01i, +7.4509533337138732928486540830547081945080989089341576817124012507e-01-2.4322032105460738462953739377692881294892930945862711758541460737e-01i, +1.3722197565686791725123297720384344172881445692984491516542700100e-03+3.7303074581742350398412376125544143984966926055469528482835254700e-02i, +3.4919914776892904123087051825643190789294671627841834161144819628e-02+8.8892967326921520048953580964935184955357406567111076952094885644e-02i] ] vt =[ [ -2.9838283275437904677981078252061374697499430617316993098822847353e-02+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, -5.0561959990428189803987557498007050028508558598803909581266186879e-01-4.3706492603990562617863033812586628591798259998022451509139092210e-01i, -2.6614296590750835969117035558529485216857552931590475544750949781e-03-4.4298774864078094877565754305104919310123158810553060779232926373e-03i, +4.8583199843346362348703182426050056752793902330663066922401758006e-01+5.6247071901149899446413169974532320544220661587531798620308731958e-01i]; [ +3.2496992488762597907964387515542319276603362918117020319359346521e-02+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +1.4032474124032111114640285436525182953802654632942948451711455109e-01-7.2983568501622474240687954437213659452798801125293414002260123392e-01i, -1.0309089626661974290949324966638353418737248637627508120396185556e-02+3.2090624256563363535957212538268213661898668484938870563587187365e-02i, +2.2086896019824552483954657586346671754727421017002976965842749182e-01-6.2982029363839933799110165363797820437639004016585224523835087573e-01i]; [ -9.8322380454928627593357750706161049590724043016230250792627017777e-01+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, +1.5557746179521022745472470501153401319301822933632620168302167001e-02-1.1119064935323447671186965031169351281860527363722541382850349940e-02i, -1.7198182390092257164272666175190298478482096240869947698066647614e-01-4.1946034125215249890874708936693176256949905794670062763938078790e-02i, -1.1737785317001863645752020621778756342758817320563642174518560436e-02-3.7819042308536459610100091608526996733397032526690882424269244982e-02i]; [ -1.7698749249025177994139125324872686755256471092406618305781220382e-01+0.0000000000000000000000000000000000000000000000000000000000000000e+00i, 
+2.4579174853446943521565353394166564952277757025400163843805728908e-02+1.4483036290672481703529854195308447371511910768799389582350077765e-03i, +9.5397148644507469609163061186547701562956565817170822244991006671e-01+2.3966308226794491229599443416109686062123262637446982923420847550e-01i, +2.3855098594784970246426131859255500566423607296986688880428268896e-02-3.7202254072076200340788215932734883131231597648245161542298777760e-04i] ] svd(a) sigma=[diag(s)] sigma u * sigma * vt
For better readability, you can pass the output to Octave. The output is the following:
$ ./a.out | octave octave: X11 DISPLAY environment variable not set octave: disabling GUI features a = 0.90000 - 1.00000i 20.00000 - 2.25000i 1.75000 - 0.50000i 0.00000 + 0.50000i 8.00000 - 2.25000i -0.25000 + 0.00000i 1.25000 - 0.25000i -3.75000 + 0.00000i -1.75000 + 0.00000i -80.00000 + 1.25000i 1.50000 + 0.00000i 30.00000 + 2.25000i 3.00000 + 0.25000i 1.75000 + 0.00000i 0.00000 + 2.25000i -0.25000 - 80.00000i s = 97.5165 68.1063 8.4782 2.1504 u = -0.091031 + 0.104196i 0.060624 + 0.210082i -0.099978 + 0.156181i 0.873552 - 0.368191i -0.019857 + 0.021261i -0.009166 - 0.038982i -0.947158 + 0.255136i -0.176226 - 0.065515i 0.572129 - 0.526799i -0.102805 - 0.570697i -0.027491 + 0.035503i 0.228321 - 0.068858i -0.472776 - 0.389416i 0.745095 - 0.243220i 0.001372 + 0.037303i 0.034920 + 0.088893i vt = -0.02984 + 0.00000i -0.50562 - 0.43706i -0.00266 - 0.00443i 0.48583 + 0.56247i 0.03250 + 0.00000i 0.14032 - 0.72984i -0.01031 + 0.03209i 0.22087 - 0.62982i -0.98322 + 0.00000i 0.01556 - 0.01112i -0.17198 - 0.04195i -0.01174 - 0.03782i -0.17699 + 0.00000i 0.02458 + 0.00145i 0.95397 + 0.23966i 0.02386 - 0.00037i ans = 97.5165 68.1063 8.4782 2.1504 sigma = 97.51649 0.00000 0.00000 0.00000 0.00000 68.10628 0.00000 0.00000 0.00000 0.00000 8.47816 0.00000 0.00000 0.00000 0.00000 2.15038 sigma = 97.51649 0.00000 0.00000 0.00000 0.00000 68.10628 0.00000 0.00000 0.00000 0.00000 8.47816 0.00000 0.00000 0.00000 0.00000 2.15038 ans = 9.0000e-01 - 1.0000e+00i 2.0000e+01 - 2.2500e+00i 1.7500e+00 - 5.0000e-01i -3.0693e-16 + 5.0000e-01i 8.0000e+00 - 2.2500e+00i -2.5000e-01 - 8.9405e-17i 1.2500e+00 - 2.5000e-01i -3.7500e+00 + 2.1264e-16i -1.7500e+00 + 2.1460e-17i -8.0000e+01 + 1.2500e+00i 1.5000e+00 + 2.9835e-17i 3.0000e+01 + 2.2500e+00i 3.0000e+00 + 2.5000e-01i 1.7500e+00 - 2.3781e-15i -2.0181e-17 + 2.2500e+00i -2.5000e-01 - 8.0000e+01i
You will see that Cgesvd, the singular value decomposition solver, solved the problem correctly.
7.5 Examples provided for MPLAPACK
There are more examples in /home/docker/mplapack/examples/mplapack/ as follows:
-
•
00_LinearEquations/Cgetri_test_*.cpp … samples of inversion of complex matrices
-
•
00_LinearEquations/Rgesv_test_*.cpp … samples of solving linear equations with real matrices
-
•
00_LinearEquations/Rgetri_Hilbert_*.cpp … samples of inversion of Hilbert matrices of various orders.
-
•
00_LinearEquations/Rgetri_test*.cpp … samples of inversion of real matrices.
-
•
03_SymmetricEigenproblems/Cheev_test_*.cpp … samples of diagonalization of Hermitian matrices.
-
•
03_SymmetricEigenproblems/Rsyev_Frank_*.cpp … samples of diagonalization of real Frank matrices.
-
•
03_SymmetricEigenproblems/Rsyev_test_*.cpp … samples of diagonalization of real matrices.
-
•
03_SymmetricEigenproblems/Rsyevd_DingDong_*.cpp … samples of diagonalization of Ding Dong matrices.
-
•
03_SymmetricEigenproblems/Rsyevd_Frank_*.cpp … samples of diagonalization of real Frank matrices using divide-and-conquer driver.
-
•
03_SymmetricEigenproblems/Rsyevr_Frank_*.cpp … samples of diagonalization of real Frank matrices using relatively robust representation driver.
-
•
04_NonsymmetricEigenproblems/Cgees_test*_*.cpp … samples of solving non-symmetric complex eigenvalue problems (Schur form)
-
•
04_NonsymmetricEigenproblems/Cgeev_NPR_*.cpp … samples of solving non-symmetric complex eigenvalue problems with eigenvectors; the GMP version is not available because GMP provides no high-precision cosine.
-
•
04_NonsymmetricEigenproblems/Cgeev_test*_*.cpp … samples of solving non-symmetric complex eigenvalue problems with eigenvectors.
-
•
04_NonsymmetricEigenproblems/Rgees_Grcar_*.cpp … samples of solving non-symmetric eigenvalue problems using the Grcar matrix (Schur form)
-
•
04_NonsymmetricEigenproblems/Rgees_readfromfile_*.cpp … samples of solving non-symmetric eigenvalue problems using a matrix read from a file (Schur form)
-
•
04_NonsymmetricEigenproblems/Rgees_test_*.cpp … samples of solving simple non-symmetric eigenvalue problems (Schur form)
-
•
04_NonsymmetricEigenproblems/Rgeev_Frank_*.cpp … samples of solving non-symmetric eigenvalue problems for the Frank matrix
-
•
04_NonsymmetricEigenproblems/Rgeev_random_highcond_*.cpp … samples of solving non-symmetric eigenvalue problems for highly conditioned random matrices
-
•
04_NonsymmetricEigenproblems/Rgeev_readfromfile_*.cpp … samples of solving non-symmetric eigenvalue problems using a matrix read from a file
-
•
04_NonsymmetricEigenproblems/Rgeev_test_*.cpp … samples of solving simple non-symmetric eigenvalue problems
-
•
05_SingularValueDecomposition/Cgesvd_test_*.cpp … samples of simple singular value decomposition problems
-
•
05_SingularValueDecomposition/Rgesvd_random_highcond_*.cpp … samples of solving singular value decomposition problems for matrices with high condition numbers
-
•
05_SingularValueDecomposition/Rgesvd_readfromfile_*.cpp … samples of singular value decomposition problems using a matrix read from a file
-
•
05_SingularValueDecomposition/Rgesvd_test_*.cpp … samples of simple singular value decomposition problems
7.6 Accuracy of each precision, Rlamch values, and specifying precision at runtime
For MPLAPACK, the Rlamch routines are vital since they return precision-specific constants. These values control convergence behavior in, for example, eigenvalue and singular value problems. Thus, the binary128 version of the MPLAPACK routines cannot be expected to work correctly if we use the Rlamch values for binary64, and vice versa. In table 11, we show the Rlamch values for MPFR. The default precision for MPFR is 512 bits, i.e., approximately 154 decimal digits. In table 12, we show the Rlamch values for MPFR with MPLAPACK_MPFR_PRECISION=65536, i.e., approximately 19728 decimal digits.
In table 14, we show the Rlamch values for GMP. The default precision for GMP is also 512 bits, approximately 154 decimal digits. In table 16, we show the Rlamch values for GMP with MPLAPACK_GMP_PRECISION=1024, i.e., approximately 308 decimal digits. In table 17, we show the Rlamch values for _Float128; in table 18, for _Float64x; in table 19, for double; and in table 20, for dd_real, where we intentionally reduce the overflow-related values so that sqrt does not fail. In table 21, we show the Rlamch values for qd_real.
We can change the default precision at runtime for GMP and MPFR by setting the environment variables MPLAPACK_GMP_PRECISION and MPLAPACK_MPFR_PRECISION, respectively.
Let us show an example by inverting the Hilbert matrix. Rgetri_Hilbert_mpfr inverts the Hilbert matrix of order $n$ for a series of increasing $n$ via LU factorization. Here a is the Hilbert matrix, ainv is the inverted matrix, and InfnormL is the left residual $\|\mathrm{ainv}\cdot a - I\|_\infty$.
The output is the following:
$ cd /home/docker/mplapack/examples/mplapack/00_LinearEquations
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rgetri_Hilbert_mpfr.cpp -Wl,--rpath=/home/docker/MPLAPACK/lib \
    -L/home/docker/MPLAPACK/lib -lmplapack_mpfr -lmpblas_mpfr -lmpfr -lmpc -lgmp
$ ./a.out | grep InfnormL
InfnormL:(ainv * a - I)=+0.0000000000000000e+00
InfnormL:(ainv * a - I)=+0.0000000000000000e+00
InfnormL:(ainv * a - I)=+4.7733380679681323e-153
InfnormL:(ainv * a - I)=+7.6373409087490117e-152
InfnormL:(ainv * a - I)=+2.4439490907996837e-150
InfnormL:(ainv * a - I)=+1.1730955635838482e-148
InfnormL:(ainv * a - I)=+2.1897783853565166e-147
InfnormL:(ainv * a - I)=+6.0062492855493028e-146
InfnormL:(ainv * a - I)=+2.7228330094490173e-144
InfnormL:(ainv * a - I)=+8.2005323578699814e-143
...
$ MPLAPACK_MPFR_PRECISION=65536 ./a.out | grep InfnormL
InfnormL:(ainv * a - I)=+0.0000000000000000e+00
InfnormL:(ainv * a - I)=+0.0000000000000000e+00
InfnormL:(ainv * a - I)=+3.9929525776415436e-19728
InfnormL:(ainv * a - I)=+5.1109792993811758e-19726
InfnormL:(ainv * a - I)=+3.2710267516039525e-19724
InfnormL:(ainv * a - I)=+5.2336428025663240e-19723
InfnormL:(ainv * a - I)=+8.3738284841061184e-19722
InfnormL:(ainv * a - I)=+5.0242970904636710e-19720
InfnormL:(ainv * a - I)=+1.0718500459655832e-19718
...
In the first case, we calculate with 512-bit precision, i.e., approximately 154 decimal digits, and in the second case with 65536-bit precision, i.e., approximately 19728 decimal digits. The results show that we can specify at runtime how many bits are used for the multiple-precision calculations.
Rlamch E: Epsilon +7.4583407312002067432909653154629338373764715346004068942715183332e-155
Rlamch S: Safe minimum +9.5302596195518042928646791269306980816607845022369591518200950104e-323228497
Rlamch B: Base +2.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch P: Precision +1.4916681462400413486581930630925867674752943069200813788543036666e-154
Rlamch N: Number of digits in mantissa +5.1200000000000000000000000000000000000000000000000000000000000000e+02
Rlamch R: Rounding mode +1.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch M: Minimum exponent: -1.0737418230000000000000000000000000000000000000000000000000000000e+09
Rlamch U: Underflow threshold +9.5302596195518042928646791269306980816607845022369591518200950104e-323228497
Rlamch L: Largest exponent +1.0737418220000000000000000000000000000000000000000000000000000000e+09
Rlamch O: Overflow threshold +2.0985787164673876924043581168838390706380979654733526277866462257e+323228496
Rlamch -: Reciprocal of safe minimum +1.0492893582336938462021790584419195353190489827366763138933231129e+323228496
Rlamch E: Epsilon +4.9911907220519294656590574792132451973746770423207674161425040336e-19729
Rlamch S: Safe minimum +9.5302596195518042928646791269306980816607845022369591518200950104e-323228497
Rlamch B: Base +2.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch P: Precision +9.9823814441038589313181149584264903947493540846415348322850080672e-19729
Rlamch N: Number of digits in mantissa +6.5536000000000000000000000000000000000000000000000000000000000000e+04
Rlamch R: Rounding mode +1.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch M: Minimum exponent: -1.0737418230000000000000000000000000000000000000000000000000000000e+09
Rlamch U: Underflow threshold +9.5302596195518042928646791269306980816607845022369591518200950104e-323228497
Rlamch L: Largest exponent +1.0737418220000000000000000000000000000000000000000000000000000000e+09
Rlamch O: Overflow threshold +2.0985787164673876924043581168838390706380979654733526277866462257e+323228496
Rlamch -: Reciprocal of safe minimum +1.0492893582336938462021790584419195353190489827366763138933231129e+323228496
Rlamch E: Epsilon +7.4583407312002067432909653154629338373764715346004068942715183332e-155
Rlamch S: Safe minimum +1.7019382623481672278259575819240965611355119939659392498165297957e-1388255822130839283
Rlamch B: Base +2.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch P: Precision +1.4916681462400413486581930630925867674752943069200813788543036666e-154
Rlamch N: Number of digits in mantissa +5.1200000000000000000000000000000000000000000000000000000000000000e+02
Rlamch R: Rounding mode +1.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch M: Minimum exponent: -4.6116860184273879030000000000000000000000000000000000000000000000e+18
Rlamch U: Underflow threshold +1.7019382623481672278259575819240965611355119939659392498165297957e-1388255822130839283
Rlamch L: Largest exponent +0.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch O: Overflow threshold +5.8756537891115875909369119988784425899385163927454983083337796065e+1388255822130839282
Rlamch -: Reciprocal of safe minimum +5.8756537891115875909369119988784425899385163927454983083337796065e+1388255822130839282
Rlamch E: Epsilon +7.4583407312002067432909653154629338373764715346004068942715183332e-155
Rlamch S: Safe minimum +4.7651298097759021464323395634653490408303922511184795759100475052e-323228497
Rlamch B: Base +2.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch P: Precision +1.4916681462400413486581930630925867674752943069200813788543036666e-154
Rlamch N: Number of digits in mantissa +5.1200000000000000000000000000000000000000000000000000000000000000e+02
Rlamch R: Rounding mode +1.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch M: Minimum exponent: -1.0737418230000000000000000000000000000000000000000000000000000000e+09
Rlamch U: Underflow threshold +4.7651298097759021464323395634653490408303922511184795759100475052e-323228497
Rlamch L: Largest exponent +0.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch O: Overflow threshold +2.0985787164673876924043581168838390706380979654733526277866462257e+323228496
Rlamch -: Reciprocal of safe minimum +2.0985787164673876924043581168838390706380979654733526277866462257e+323228496
Rlamch E: Epsilon +5.5626846462680034577255817933310101605480399511558295763833185422e-309
Rlamch S: Safe minimum +1.7019382623481672278259575819240965611355119939659392498165297957e-1388255822130839283
Rlamch B: Base +2.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch P: Precision +1.1125369292536006915451163586662020321096079902311659152766637084e-308
Rlamch N: Number of digits in mantissa +1.0240000000000000000000000000000000000000000000000000000000000000e+03
Rlamch R: Rounding mode +1.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch M: Minimum exponent: -4.6116860184273879030000000000000000000000000000000000000000000000e+18
Rlamch U: Underflow threshold +1.7019382623481672278259575819240965611355119939659392498165297957e-1388255822130839283
Rlamch L: Largest exponent +0.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch O: Overflow threshold +5.8756537891115875909369119988784425899385163927454983083337796065e+1388255822130839282
Rlamch -: Reciprocal of safe minimum +5.8756537891115875909369119988784425899385163927454983083337796065e+1388255822130839282
Rlamch Z: dummy (error) +0.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch E: Epsilon +5.5626846462680034577255817933310101605480399511558295763833185422e-309
Rlamch S: Safe minimum +4.7651298097759021464323395634653490408303922511184795759100475052e-323228497
Rlamch B: Base +2.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch P: Precision +1.1125369292536006915451163586662020321096079902311659152766637084e-308
Rlamch N: Number of digits in mantissa +1.0240000000000000000000000000000000000000000000000000000000000000e+03
Rlamch R: Rounding mode +1.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch M: Minimum exponent: -1.0737418230000000000000000000000000000000000000000000000000000000e+09
Rlamch U: Underflow threshold +4.7651298097759021464323395634653490408303922511184795759100475052e-323228497
Rlamch L: Largest exponent +0.0000000000000000000000000000000000000000000000000000000000000000e+00
Rlamch O: Overflow threshold +2.0985787164673876924043581168838390706380979654733526277866462257e+323228496
Rlamch -: Reciprocal of safe minimum +2.0985787164673876924043581168838390706380979654733526277866462257e+323228496
Rlamch E: Epsilon +1.9259299443872358530559779425849273185381e-34
Rlamch S: Safe minimum +3.3621031431120935062626778173217526025981e-4932
Rlamch B: Base +2.0000000000000000000000000000000000000000e+00
Rlamch P: Precision +3.8518598887744717061119558851698546370762e-34
Rlamch N: Number of digits in mantissa +1.1300000000000000000000000000000000000000e+02
Rlamch R: Rounding mode +1.0000000000000000000000000000000000000000e+00
Rlamch M: Minimum exponent: -1.6381000000000000000000000000000000000000e+04
Rlamch U: Underflow threshold +3.3621031431120935062626778173217526025981e-4932
Rlamch L: Largest exponent +1.6384000000000000000000000000000000000000e+04
Rlamch O: Overflow threshold +1.1897314953572317650857593266280070161965e+4932
Rlamch -: Reciprocal of safe minimum +2.9743287383930794127143983165700178269086e+4931
Rlamch E: Epsilon +5.421010862427522170037e-20 Rlamch S: Safe minimum +3.362103143112093506263e-4932 Rlamch B: Base +2.000000000000000000000e+00 Rlamch P: Precision +1.084202172485504434007e-19 Rlamch N: Number of digits in mantissa +6.400000000000000000000e+01 Rlamch R: Rounding mode +1.000000000000000000000e+00 Rlamch M: Minimum exponent: -1.638100000000000000000e+04 Rlamch U: Underflow threshold +3.362103143112093506263e-4932 Rlamch L: Largest exponent +1.638400000000000000000e+04 Rlamch O: Overflow threshold +1.189731495357231765021e+4932 Rlamch -: Reciprocal of safe minimum +2.974328738393079412714e+4931
Rlamch E: Epsilon +1.1102230246251565e-16 Rlamch S: Safe minimum +2.2250738585072014e-308 Rlamch B: Base +2.0000000000000000e+00 Rlamch P: Precision +2.2204460492503131e-16 Rlamch N: Number of digits in mantissa +5.3000000000000000e+01 Rlamch R: Rounding mode +1.0000000000000000e+00 Rlamch M: Minimum exponent: -1.0210000000000000e+03 Rlamch U: Underflow threshold +2.2250738585072014e-308 Rlamch L: Largest exponent +1.0240000000000000e+03 Rlamch O: Overflow threshold +1.7976931348623157e+308 Rlamch -: Reciprocal of safe minimum +4.4942328371557898e+307
Rlamch E: Epsilon +4.93038065763131995214781514484568e-32 Rlamch S: Safe minimum +2.00416836000897277799610805134985e-292 Rlamch B: Base +2.00000000000000000000000000000000e+00 Rlamch P: Precision +9.86076131526263990429563028969136e-32 Rlamch N: Number of digits in mantissa +1.06000000000000000000000000000000e+02 Rlamch R: Rounding mode +1.00000000000000000000000000000000e+00 Rlamch M: Minimum exponent: -9.68000000000000000000000000000001e+02 Rlamch U: Underflow threshold +2.00416836000897277799610805134985e-292 Rlamch L: Largest exponent +1.02400000000000000000000000000000e+03 Rlamch O: Overflow threshold +1.79769313486231580793728971405328e+308 Rlamch -: Reciprocal of safe minimum +4.98960077383679952914093178259285e+291
Rlamch E: Epsilon +1.2154326714572500565324311366323150942261000827598106963711353150e-63 Rlamch S: Safe minimum +1.6259745436952323153369025467109371916500580246095533056887340823e-260 Rlamch B: Base +2.0000000000000000000000000000000000000000000000000000000000000000e+00 Rlamch P: Precision +2.4308653429145001130648622732646301884522001655196213927422706300e-63 Rlamch N: Number of digits in mantissa +2.1200000000000000000000000000000000000000000000000000000000000000e+02 Rlamch R: Rounding mode +1.0000000000000000000000000000000000000000000000000000000000000000e+00 Rlamch M: Minimum exponent: -8.6200000000000000000000000000000000000000000000000000000000000000e+02 Rlamch U: Underflow threshold +1.6259745436952323153369025467109371916500580246095533056887340823e-260 Rlamch L: Largest exponent +1.0240000000000000000000000000000000000000000000000000000000000000e+03 Rlamch O: Overflow threshold +1.7976931348623158079372897140530286112296785259868571699620068630e+308 Rlamch -: Reciprocal of safe minimum +6.1501577861568104283923723841611832207865934590357532972465351809e+259
7.7 A basic strategy to use MPBLAS and MPLAPACK
Unfortunately, we do not provide concise manuals for MPBLAS and MPLAPACK, since the API conversion from BLAS and LAPACK is trivial and BLAS and LAPACK are very well documented. If you want to use a specific routine, first consult the LAPACK Users’ Guide [3] or https://www.netlib.org/lapack/lug/.
The following is the basic strategy for using MPLAPACK routines.
-
1.
See https://www.netlib.org/lapack/lug/node25.html to find the name of the driver routine.
-
2.
For example, if you want to compute the generalized singular values and vectors of a matrix pair, you will find the routine name DGGSVD in https://www.netlib.org/lapack/lug/node36.html.
-
3.
Find the definition of DGGSVD by downloading LAPACK and extracting it to find dggsvd.f, or look it up online at https://www.netlib.org/lapack/explore-html/dd/db4/dggsvd_8f.html.
-
4.
Look for the Rggsvd prototype definition in MPLAPACK and change the variable types appropriately to fit MPLAPACK, as in the sketch below.
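As a concrete illustration of step 4, the following is a minimal sketch of how a legacy double-precision BLAS/LAPACK call is typically rewritten for MPLAPACK, using Rgemm of the MPFR build rather than the full Rggsvd interface. The header name mpblas_mpfr.h, the mpreal type, and the mplapackint integer type follow the example programs shipped with MPLAPACK; the matrix contents are arbitrary.

// Minimal sketch: translating a double-precision DGEMM call into MPLAPACK's
// Rgemm (MPFR build). Not a complete application, only the calling pattern.
#include <mpblas_mpfr.h>

int main() {
    mplapackint n = 3;               // Fortran INTEGER arguments become mplapackint
    mpreal *A = new mpreal[n * n];   // DOUBLE PRECISION arrays become mpreal arrays
    mpreal *B = new mpreal[n * n];
    mpreal *C = new mpreal[n * n];
    for (mplapackint i = 0; i < n * n; i++) {
        A[i] = i + 1;                // fill with some values
        B[i] = 1.0 / (i + 1);
        C[i] = 0.0;
    }
    mpreal alpha = 1.0, beta = 0.0;
    // Fortran: CALL DGEMM('N', 'N', N, N, N, ALPHA, A, N, B, N, BETA, C, N)
    Rgemm("n", "n", n, n, n, alpha, A, n, B, n, beta, C, n);
    delete[] C; delete[] B; delete[] A;
    return 0;
}

The same pattern applies to driver routines such as Rggsvd: take the prototype from the installed MPLAPACK headers, replace double with the multiple-precision type of the chosen build, and replace the Fortran INTEGER arguments with mplapackint.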
7.8 How to link MPBLAS and MPLAPACK statically
In this subsection, we show how to link MPBLAS and MPLAPACK statically. For MPBLAS, we need to specify the full path of the library name and enclose it with -Wl,--whole-archive and --no-whole-archive. Note that -Wl,--whole-archive is necessary because MPBLAS calls some initialization functions at runtime; otherwise, MPBLAS will not work correctly. For example, when one builds Rgemm_mpfr statically, one should type the following:
$ cd /home/docker/MPLAPACK/share/examples/mpblas
$ g++ -fopenmp -O2 -I/home/docker/MPLAPACK/include \
    -I/home/docker/MPLAPACK/include/mplapack Rgemm_mpfr.cpp \
    -Wl,--whole-archive,/home/docker/MPLAPACK/lib/libmpblas_mpfr.a,--no-whole-archive \
    -L/home/docker/MPLAPACK/lib -lmpfr -lmpc -lgmp
$ ldd ./a.out
    linux-vdso.so.1 (0x00007fff1dd12000)
    libmpfr.so.6 => /usr/lib/x86_64-linux-gnu/libmpfr.so.6 (0x00007fc36388a000)
    libmpc.so.3 => /usr/lib/x86_64-linux-gnu/libmpc.so.3 (0x00007fc363672000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fc363458000)
    libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fc36343d000)
    libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc36341a000)
    libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007fc363228000)
    libgmp.so.10 => /usr/lib/x86_64-linux-gnu/libgmp.so.10 (0x00007fc3631a2000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fc3639d4000)
    libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007fc363053000)
$ ./a.out
...
For MPLAPACK, we also need to specify the full path of the library name, but enclosing it with -Wl,--whole-archive and --no-whole-archive is not necessary. For example, to link MPBLAS and MPLAPACK statically, we compile as follows:
$ cd /home/docker/MPLAPACK/share/examples/mplapack/00_LinearEquations
$ g++ -O2 -I/home/docker/MPLAPACK/include -I/home/docker/MPLAPACK/include/mplapack \
    Rgetri_Hilbert_mpfr.cpp \
    /home/docker/MPLAPACK/lib/libmplapack_mpfr.a \
    -Wl,--whole-archive,/home/docker/MPLAPACK/lib/libmpblas_mpfr.a,--no-whole-archive \
    -L/home/docker/MPLAPACK/lib -lmpfr -lmpc -lgmp
$ ldd ./a.out
    linux-vdso.so.1 (0x00007ffd64db1000)
    libmpfr.so.6 => /usr/lib/x86_64-linux-gnu/libmpfr.so.6 (0x00007fe59c609000)
    libmpc.so.3 => /usr/lib/x86_64-linux-gnu/libmpc.so.3 (0x00007fe59c3f1000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fe59c1d7000)
    libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fe59c1bc000)
    libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007fe59bfca000)
    libgmp.so.10 => /usr/lib/x86_64-linux-gnu/libgmp.so.10 (0x00007fe59bf46000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fe59c75f000)
    libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007fe59bdf5000)
$ ./a.out | grep InfnromL
...
7.9 How to use MPBLAS and MPLAPACK on macOS
This section explains how to configure and build MPLAPACK on macOS using MacPorts. We assume the home directory is /Volumes/Users/maho. You may build MPLAPACK on macOS as follows:
$ sudo port install gcc10 gsed
$ rm -rf /Volumes/Users/maho/tmp /Volumes/Users/maho/MPLAPACK
$ mkdir /Volumes/Users/maho/tmp
$ cd /Volumes/Users/maho/tmp
$ wget https://github.com/nakatamaho/mplapack/releases/download/v2.0.1/mplapack-2.0.1.tar.xz
$ tar xvf mplapack-2.0.1.tar.xz
$ cd mplapack-2.0.1
$ CXX="g++-mp-10" ; export CXX
$ CC="gcc-mp-10" ; export CC
$ FC="gfortran-mp-10"; export FC
$ ./configure --prefix=$HOME/MPLAPACK --enable-gmp=yes --enable-mpfr=yes \
    --enable-_Float128=yes --enable-qd=yes --enable-dd=yes --enable-double=yes \
    --enable-test=yes --enable-_Float64x=yes
$ make -j4
$ make install
$ cd
$ rm -rf /Volumes/Users/maho/tmp
You can test the installation, and you can check how the libraries are linked with otool -L:
$ cd /Volumes/Users/maho/MPLAPACK/share/examples/mpblas
$ g++-mp-10 -c -O2 -fopenmp -I/Volumes/Users/maho/MPLAPACK/include \
    -I/Volumes/Users/maho/MPLAPACK/include/mplapack Rgemm_mpfr.cpp
$ g++-mp-10 -o Rgemm_mpfr Rgemm_mpfr.o -L/Volumes/Users/maho/MPLAPACK/lib \
    -lmpblas_mpfr -lmpfr -lmpc -lgmp
$ ./Rgemm_mpfr
...
$ otool -L Rgemm_mpfr
Rgemm_mpfr:
    /Volumes/Users/maho/MPLAPACK/lib/libmpblas_mpfr.0.dylib (compatibility version 1.0.0, current version 1.0.0)
    /Volumes/Users/maho/MPLAPACK/lib/libmpfr.6.dylib (compatibility version 8.0.0, current version 8.0.0)
    /Volumes/Users/maho/MPLAPACK/lib/libmpc.3.dylib (compatibility version 6.0.0, current version 6.1.0)
    /Volumes/Users/maho/MPLAPACK/lib/libgmp.10.dylib (compatibility version 15.0.0, current version 15.1.0)
    @rpath/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.30.0)
    @rpath/libgcc_s.1.1.dylib (compatibility version 1.0.0, current version 1.1.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.100.5)
We do not have to pass -Wl,--rpath=/Volumes/Users/maho/docker/MPLAPACK/lib to the linker. However, the library path is hard-coded, so you cannot move the top directory of MPLAPACK to another place.
7.9.1 Static linking
We do not support building a static binary for macOS, since making a static binary on macOS is complicated.
7.10 How to use MPBLAS and MPLAPACK on Windows (mingw64)
7.10.1 Dynamic linking
By default, the mingw64 system links dynamically to make an executable. For example, to make Rgemm_mpfr.exe, type as follows:
$ cd /home/docker/MPLAPACK_MINGW/share/examples/mpblas/
$ x86_64-w64-mingw32-g++ -c -O2 -fopenmp -I/home/docker/MPLAPACK_MINGW/include \
    -I/home/docker/MPLAPACK_MINGW/include/mplapack Rgemm_mpfr.cpp
$ x86_64-w64-mingw32-g++ -o Rgemm_mpfr.exe Rgemm_mpfr.o -L/home/docker/MPLAPACK_MINGW/lib \
    -lmpblas_mpfr -lmpfr -lmpc -lgmp
$ wine64 ./Rgemm_mpfr.exe
...
For example, to make Rgees_test_gmp.exe, type as follows:
$ cd /home/docker/MPLAPACK_MINGW/share/examples/mplapack/04_NonsymmetricEigenproblems
$ x86_64-w64-mingw32-g++ -c -O2 -fopenmp -I/home/docker/MPLAPACK_MINGW/include \
    -I/home/docker/MPLAPACK_MINGW/include/mplapack Rgees_test_gmp.cpp
$ x86_64-w64-mingw32-g++ -O2 -fopenmp -o Rgees_test_gmp.exe \
    Rgees_test_gmp.o -L/home/docker/MPLAPACK_MINGW/lib -lmplapack_gmp -lmpblas_gmp -lgmp
$ wine64 ./Rgees_test_gmp.exe
...
You can check which DLLs the resulting executable depends on as follows:
$ cd /home/docker/MPLAPACK_MINGW/share/examples/mpblas/
$ x86_64-w64-mingw32-g++ -c -O2 -fopenmp -I/home/docker/MPLAPACK_MINGW/include \
    -I/home/docker/MPLAPACK_MINGW/include/mplapack Rgemm_mpfr.cpp
$ x86_64-w64-mingw32-g++ -o Rgemm_mpfr.exe Rgemm_mpfr.o -L/home/docker/MPLAPACK_MINGW/lib \
    -lmpblas_mpfr -lmpfr -lmpc -lgmp
$ x86_64-w64-mingw32-objdump -p Rgemm_mpfr.exe | grep 'DLL Name'
    DLL Name: libmpfr-6.dll
    DLL Name: libmpblas_mpfr-0.dll
    DLL Name: KERNEL32.dll
    DLL Name: msvcrt.dll
    DLL Name: libgcc_s_seh-1.dll
    DLL Name: libstdc++-6.dll
To run Rgemm_mpfr.exe on other computers, we must first install the mingw64 runtime and then copy the DLLs in /home/docker/MPLAPACK_MINGW/bin to the current directory or somewhere under C:\Program Files, such as C:\Program Files\MPLAPACK.
7.10.2 Static link for mingw64 (Windows)
To link MPBLAS statically on the mingw64 system, we should add
"-Wl,--whole-archive,/home/docker/MPLAPACK_MINGW/lib/libmpblas_mpfr.a,--no-whole-archive"
when linking. This is required because MPBLAS initializes some values at runtime. If you do not specify this option and simply link against
"/home/docker/MPLAPACK_MINGW/lib/libmpblas_mpfr.a", MPBLAS will not work properly.
The following example shows how we link Rgemm_mpfr.exe against libmpblas_mpfr.a, which is a static library:
$ cd /home/docker/MPLAPACK_MINGW/share/examples/mpblas/
$ x86_64-w64-mingw32-g++ -c -O2 -fopenmp -I/home/docker/MPLAPACK_MINGW/include \
    -I/home/docker/MPLAPACK_MINGW/include/mplapack Rgemm_mpfr.cpp
$ x86_64-w64-mingw32-g++ -o Rgemm_mpfr.exe Rgemm_mpfr.o -L/home/docker/MPLAPACK_MINGW/lib \
    -static-libgcc -static-libstdc++ \
    -Wl,--whole-archive,/home/docker/MPLAPACK_MINGW/lib/libmpblas_mpfr.a,--no-whole-archive \
    -lmpfr -lmpc -lgmp
However, the resultant file Rgemm_mpfr.exe is not a completely static executable.
$ x86_64-w64-mingw32-objdump -p Rgemm_mpfr.exe | grep 'DLL Name'
    DLL Name: libmpc-3.dll
    DLL Name: libmpfr-6.dll
    DLL Name: KERNEL32.dll
    DLL Name: msvcrt.dll
Moreover, this executable implicitly depends on
/usr/lib/gcc/x86_64-w64-mingw32/9.3-win32/libgcc_s_seh-1.dll.
Therefore, to run Rgemm_mpfr.exe on other computers, we must copy the contents of /home/docker/MPLAPACK_MINGW/bin and
/usr/lib/gcc/x86_64-w64-mingw32/9.3-win32/libgcc_s_seh-1.dll
to the current directory or somewhere under C:\Program Files, such as C:\Program Files\MPLAPACK.
Currently, it is not possible to build completely static executables using MPLAPACK. However, we plan to address this issue in the next version.
8 Quality assurance of MPBLAS and MPLAPACK
8.1 Testing MPBLAS routines
We performed quality assurance of MPBLAS as follows.
-
1.
We believe there are no bugs in the reference BLAS [2].
-
2.
We input random binary64 values for matrices, vectors, and scalars into a BLAS routine and into the MPFR version of the corresponding MPBLAS routine, for various sizes of matrices, vectors, and leading dimensions (some combinations intentionally make BLAS and MPBLAS fail with error codes).
-
3.
If the difference between the two results is within tolerance, we regard the quality as assured for the MPFR version of the MPBLAS routine.
-
4.
We also check the error codes reported by xerbla and Mxerbla and confirm that they are invariably the same for each routine.
-
5.
We do the same checks for all routines in Table 4.
-
6.
If there are no errors, we regard the MPFR version of MPBLAS as bug-free.
-
7.
Then, we input random values of a multiple-precision type (e.g., _Float128) for matrices, vectors, and scalars into that version of MPBLAS and into the MPFR version of MPBLAS, again for various sizes of matrices, vectors, and leading dimensions.
-
8.
If the difference between the two results is within tolerance, we regard the quality as assured for the multiple-precision type version of MPBLAS.
Since MPBLAS routines only consist of algebraic operations, this method ensures that all routines are correctly implemented.
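The essence of this comparison, expressed as a minimal sketch rather than the actual test program: run a double-precision reference computation and the multiple-precision routine on the same random input, then check the maximum elementwise difference against a tolerance. A plain triple loop stands in for the reference BLAS dgemm call to keep the sketch self-contained; the variable names and the use of drand48 are illustrative.

// Minimal sketch of the comparison idea behind the MPBLAS tests (MPFR build).
#include <cstdlib>
#include <iostream>
#include <mpblas_mpfr.h>

int main() {
    mplapackint n = 50;
    double *a = new double[n * n], *b = new double[n * n], *c = new double[n * n];
    mpreal *A = new mpreal[n * n], *B = new mpreal[n * n], *C = new mpreal[n * n];
    for (mplapackint i = 0; i < n * n; i++) {
        a[i] = drand48();  b[i] = drand48();  c[i] = 0.0;   // random binary64 input
        A[i] = a[i];       B[i] = b[i];       C[i] = 0.0;   // same values in multiple precision
    }
    // reference result in binary64 (stand-in for the reference BLAS dgemm call)
    for (mplapackint j = 0; j < n; j++)
        for (mplapackint k = 0; k < n; k++)
            for (mplapackint i = 0; i < n; i++)
                c[i + j * n] += a[i + k * n] * b[k + j * n];
    // multiple-precision result via MPBLAS
    Rgemm("n", "n", n, n, n, mpreal(1.0), A, n, B, n, mpreal(0.0), C, n);
    // maximum elementwise difference; the test passes if it is below a tolerance
    mpreal maxerror = 0.0;
    for (mplapackint i = 0; i < n * n; i++) {
        mpreal diff = abs(C[i] - c[i]);
        if (diff > maxerror) maxerror = diff;
    }
    std::cout << "maxerror: " << maxerror << std::endl;
    delete[] C; delete[] B; delete[] A; delete[] c; delete[] b; delete[] a;
    return 0;
}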
Test programs and input files are installed at
/home/docker/MPLAPACK/lib/*/mplapack/test/compare/.
For example, to check the MPFR version of Rgemm, run the following:
$ cd /home/docker/MPLAPACK/lib/x86_64-pc-linux-gnu/mplapack/test/compare/mpfr
$ ./Rgemm.test.mpfr
*** Testing Rgemm start ***
maxerror: +0.0000000000000000e+00
maxerror: +8.8494802737544305e-17
maxerror: +5.2155910699049989e-16
maxerror: +5.2155910699049989e-16
...
maxerror: +5.7344511452715043e-16
*** Testing Rgemm successful ***
There are 256 error-check programs for each precision.
8.2 Testing MPLAPACK routines and driver routines
We converted LAPACK’s original test programs to C++ and extended them to each multiple-precision version. As a result, there are two test suites: one for the linear equation solvers, and one for eigenvalue problems, singular value decomposition, and least-squares fitting. The former is in the “lin” directory, and the latter is in the “eig” directory.
In MPLAPACK, all testing programs are installed at /home/docker/MPLAPACK/lib/*/test/lin/ and
/home/docker/MPLAPACK/lib/*/test/eig/. To test all MPLAPACK routines and drivers on Linux, you type the following:
$ cd /home/docker/MPLAPACK/lib/x86_64-pc-linux-gnu/mplapack/test/lin
$ bash test_lin_all.sh
$ less log.*
$ cd /home/docker/MPLAPACK/lib/x86_64-pc-linux-gnu/mplapack/test/eig
$ bash test_eig_all.sh
$ less log.*
On Windows (mingw64, using wine64):
$ cd /home/docker/MPLAPACK_MINGW/lib/x86_64-w64-mingw32/mplapack/test/lin
$ bash test_lin_all_mingw.sh
$ less log.*
$ cd /home/docker/MPLAPACK_MINGW/lib/x86_64-w64-mingw32/mplapack/test/eig
$ bash test_eig_all_mingw.sh
$ less log.*
On macOS (x86_64):
$ cd /Volumes/Users/maho/MPLAPACK/lib/x86_64-apple-darwin20.6.0/mplapack/test/lin
$ bash test_lin_all.sh
$ less log.*
$ cd /Volumes/Users/maho/MPLAPACK/lib/x86_64-apple-darwin20.6.0/mplapack/test/eig
$ bash test_eig_all.sh
$ less log.*
It may take a day to run through all the tests. The MPLAPACK libraries can be regarded as reliable when there are few or no failures.
You can find complete log files at https://github.com/nakatamaho/mplapack/tree/master/mplapack/test/lin/results
and https://github.com/nakatamaho/mplapack/tree/master/mplapack/test/eig/results for several platforms.
Note that we only tested the default precision (512 bits) for GMP and MPFR. For other precisions, users should run the checks themselves.
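If you want to check a different precision yourself, you can change the working precision at runtime before calling any MPBLAS/MPLAPACK routine. The following minimal sketch assumes the MPFR build and the set_default_prec interfaces of the underlying mpreal and GMP libraries; the value 1024 is just an example.

// Minimal sketch: changing the working precision at runtime (MPFR build).
#include <mpblas_mpfr.h>   // the REAL type of this build is mpreal

int main() {
    mpreal::set_default_prec(1024);   // mpreal variables created afterwards use 1024 bits
    // For the GMP build, one would call mpf_set_default_prec(1024) instead.
    mpreal a = 1.0, b = 3.0;
    mpreal c = a / b;                 // carried out with a 1024-bit mantissa
    return 0;
}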
8.2.1 Background
Until version 0.9.4, we used the same method for quality assurance of MPBLAS and MPLAPACK. Since version 2.0.1, we have ported lapack-3.9.1/TESTING/LIN and lapack-3.9.1/TESTING/EIG to all precisions using FABLE. The original test programs use the FORMAT statement extensively. Interestingly, FABLE and its Fortran EMulation (FEM) library translate these FORMAT statements almost perfectly, except for printing multiple-precision floating-point numbers; thus our modifications to the testing routines were minimal. Then we extended the programs to support multiple-precision arithmetic so that we can verify that our libraries are accurate to more than 16 significant decimal digits.
8.2.2 Details of testing real linear MPLAPACK routines and drivers
First, we review how to test LAPACK linear routines:
$ cd /home/docker/mplapack/external/lapack/work/internal/lapack-3.9.1/TESTING/LIN/ $ ./xlintstd < ../dtest.in Tests of the DOUBLE PRECISION LAPACK routines LAPACK VERSION 3.9.1 The following parameter values will be used: M : 0 1 2 3 5 10 50 N : 0 1 2 3 5 10 50 NRHS: 1 2 15 NB : 1 3 3 3 20 NX : 1 0 5 9 1 RANK: 30 50 90 Routines pass computational tests if test ratio is less than 30.00 Relative machine underflow is taken to be 0.222507-307 Relative machine overflow is taken to be 0.179769+309 Relative machine precision is taken to be 0.111022D-15 DGE routines passed the tests of the error exits All tests for DGE routines passed the threshold ( 3653 tests run) DGE drivers passed the tests of the error exits All tests for DGE drivers passed the threshold ( 5748 tests run) DGB routines passed the tests of the error exits ...
We omit the details of how we test the linear routines, but ./xlintstd tests all the linear routines used in LAPACK. Details can be found in the literature [35].
The corresponding ported testing programs for the real linear routines, installed in
/home/docker/MPLAPACK/lib/*/mplapack/test/lin, are the following:
-
•
xlintstR_mpfr
-
•
xlintstR_double
-
•
xlintstR__Float64x
-
•
xlintstR__Float128
-
•
xlintstR_dd
-
•
xlintstR_gmp
-
•
xlintstR_qd
-
•
xlintstrfR__Float128
-
•
xlintstrfR__Float64x
-
•
xlintstrfR_dd
-
•
xlintstrfR_double
-
•
xlintstrfR_gmp
-
•
xlintstrfR_mpfr
-
•
xlintstrfR_qd
where we add the suffix “R” + “mplib” to xlintst. The test input files are Rtest.in (ordinary matrices) and Rtest_rfp.in (matrices in Rectangular Full Packed format). The input files can be used for all precisions.
We tested real MPLAPACK routines for MPFR as follows:
$ cd ~/MPLAPACK/lib/x86_64-pc-linux-gnu/mplapack/test/lin/ $ ./xlintstR_mpfr < Rtest.in Tests of the Multiple precision version of LAPACK MPLAPACK VERSION 2.0.1 Based on the original LAPACK VERSION 3.9.1 The following parameter values will be used: M : 0 1 2 3 5 10 50 N : 0 1 2 3 5 10 50 NRHS : 1 2 15 NB : 1 3 3 3 20 NX : 1 0 5 9 1 RANK : 30 50 90 Routines pass computational tests if test ratio is less than 30.00 Relative machine underflow is taken to be : +9.5302596195518043e-323228497 Relative machine overflow is taken to be : +2.0985787164673877e+323228496 Relative machine precision is taken to be : +7.4583407312002067e-155 RGE routines passed the tests of the error exits All tests for RGE routines passed the threshold ( 3653 tests run) RGE drivers passed the tests of the error exits All tests for RGE drivers passed the threshold ( 5748 tests run) RGB routines passed the tests of the error exits All tests for RGB routines passed the threshold ( 28938 tests run) RGB drivers passed the tests of the error exits All tests for RGB drivers passed the threshold ( 36567 tests run) RGT routines passed the tests of the error exits All tests for RGT routines passed the threshold ( 2694 tests run) RGT drivers passed the tests of the error exits All tests for RGT drivers passed the threshold ( 2033 tests run) RPO routines passed the tests of the error exits All tests for RPO routines passed the threshold ( 1628 tests run) RPO drivers passed the tests of the error exits All tests for RPO drivers passed the threshold ( 1910 tests run) RPS routines passed the tests of the error exits All tests for RPS routines passed the threshold ( 150 tests run) RPP routines passed the tests of the error exits ...
The file Rtest.in is the same as dtest.in of LAPACK, except for the first comment line. For the other precisions, we performed the tests similarly and verified that all tests passed.
The RFP (Rectangular Full Packed) versions can be tested as follows, taking the dd version as an example:
$ ./xlintstrfR_dd < Rtest_rfp.in Tests of the Multiple precision version of LAPACK MPLAPACK VERSION 2.0.1 Based on the original LAPACK VERSION 3.9.1 The following parameter values will be used: N : 0 1 2 3 5 6 10 11 50 NRHS : 1 2 15 M : 1 2 3 4 5 6 7 8 9 Routines pass computational tests if test ratio is less than 30.00 Relative machine underflow is taken to be : 2.0041683600089728e-292 Relative machine overflow is taken to be : 1.7976931348623158e+308 Relative machine precision is taken to be : 4.9303806576313200e-32 MULTIPLE PRECISION RFP routines passed the tests of the error exits All tests for RPF drivers passed the threshold ( 2304 tests run) All tests for Rlansf auxiliary routine passed the threshold ( 384 tests run) All tests for the RFP conversion routines passed ( 72 tests run) All tests for Rtfsm auxiliary routine passed the threshold ( 7776 tests run) All tests for Rsfrk auxiliary routine passed the threshold ( 2592 tests run) End of tests Total time used = 15 seconds
Since we use random numbers without a fixed random seed, we may occasionally obtain failures like the following:
$ ./xlintstR__Float128 < Rtest.in Tests of the Multiple precision version of LAPACK MPLAPACK VERSION 2.0.1 Based on the original LAPACK VERSION 3.9.1 The following parameter values will be used: M : 0 1 2 3 5 10 50 N : 0 1 2 3 5 10 50 NRHS : 1 2 15 NB : 1 3 3 3 20 NX : 1 0 5 9 1 RANK : 30 50 90 Routines pass computational tests if test ratio is less than 30.00 Relative machine underflow is taken to be : +3.3621031431120935e-4932 Relative machine overflow is taken to be : +1.1897314953572318e+4932 Relative machine precision is taken to be : +1.9259299443872359e-34 RGE routines passed the tests of the error exits All tests for RGE routines passed the threshold ( 3653 tests run) RGE drivers passed the tests of the error exits All tests for RGE drivers passed the threshold ( 5748 tests run) RGB routines passed the tests of the error exits All tests for RGB routines passed the threshold ( 28938 tests run) RGB drivers passed the tests of the error exits RGB drivers: General band matrices Matrix types: 1. Random, CNDNUM = 2 5. Random, CNDNUM = sqrt(0.1/EPS) 2. First column zero 6. Random, CNDNUM = 0.1/EPS 3. Last column zero 7. Scaled near underflow 4. Last n/2 columns zero 8. Scaled near overflow Test ratios: 1: norm( L * U - A ) / ( N * norm(A) * EPS ) 2: norm( B - A * X ) / ( norm(A) * norm(X) * EPS ) 3: norm( X - XACT ) / ( norm(XACT) * CNDNUM * EPS ) 4: norm( X - XACT ) / ( norm(XACT) * (error bound) ) 5: (backward error) / EPS 6: RCOND * CNDNUM - 1.0 7: abs( WORK(1) - RPVGRW ) / ( max( WORK(1), RPVGRW ) * EPS ) Messages: Rgbsvx( ’ ’,’ ’, 50, 12, 12,...), type 6, test(5)=+4.5401871787002674e+01 Rgbsvx( ’ ’,’ ’, 50, 12, 12,...), type 6, test(5)=+4.5401871787002674e+01 Rgbsvx( ’ ’,’ ’, 50, 12, 12,...), EQUED=’ ’, type 6, test(5)=+8.2754819479199405e+01 Rgbsvx( ’ ’,’ ’, 50, 12, 12,...), EQUED=’ ’, type 6, test(5)=+8.2754819479199405e+01 Rgbsvx( ’ ’,’ ’, 50, 12, 12,...), EQUED=’ ’, type 6, test(5)=+4.5401871787002674e+01 Rgbsvx( ’ ’,’ ’, 50, 12, 12,...), EQUED=’ ’, type 6, test(5)=+4.5401871787002674e+01 RGB drivers: 6 out of 36567 tests failed to pass the threshold RGT routines passed the tests of the error exits
However, this result is not fatal since errors are only slightly above the threshold.
8.2.3 Details of testing complex linear MPLAPACK routines and drivers
The corresponding ported testing programs for the complex linear routines, installed in
/home/docker/MPLAPACK/lib/*/mplapack/test/lin, are the following:
-
•
xlintstC_mpfr
-
•
xlintstC_double
-
•
xlintstC__Float64x
-
•
xlintstC__Float128
-
•
xlintstC_dd
-
•
xlintstC_gmp
-
•
xlintstC_qd
-
•
xlintstrfC__Float128
-
•
xlintstrfC__Float64x
-
•
xlintstrfC_dd
-
•
xlintstrfC_double
-
•
xlintstrfC_gmp
-
•
xlintstrfC_mpfr
-
•
xlintstrfC_qd
where we add the suffix “C” + “mplib” to xlintst. The test input files are Ctest.in (ordinary matrices) and Ctest_rfp.in (matrices in Rectangular Full Packed format). The input files can be used for all precisions.
We test complex MPLAPACK routines for the GMP version as follows:
$ cd ~/MPLAPACK/lib/x86_64-pc-linux-gnu/mplapack/test/lin/ $ ./xlintstC_gmp < Ctest.in Tests of the Multiple precision version of LAPACK MPLAPACK VERSION 2.0.1 Based on the original LAPACK VERSION 3.9.1 The following parameter values will be used: M : 0 1 2 3 5 10 50 N : 0 1 2 3 5 10 50 NRHS : 1 2 15 NB : 1 3 3 3 20 NX : 1 0 5 9 1 RANK : 30 50 90 Routines pass computational tests if test ratio is less than 60.00 Relative machine underflow is taken to be : +1.7019382623481672e-1388255822130839283 Relative machine overflow is taken to be : +5.8756537891115876e+1388255822130839282 Relative machine precision is taken to be : +7.4583407312002067e-155 CGE routines passed the tests of the error exits All tests for CGE routines passed the threshold ( 3653 tests run) CGE drivers passed the tests of the error exits ... All tests for CXQ routines passed the threshold ( 1482 tests run) CTQ routines passed the tests of the error exits All tests for CTQ routines passed the threshold ( 510 tests run) CTS routines passed the tests of the error exits All tests for CTS routines passed the threshold ( 10800 tests run) CHH routines passed the tests of the error exits All tests for CHH routines passed the threshold ( 15900 tests run) End of tests Total time used = 2066 seconds
We can test the RFP version, for example, the qd version, as follows:
$ ./xlintstrfC_qd < Ctest_rfp.in Tests of the Multiple precision version of LAPACK MPLAPACK VERSION 2.0.1 Based on original LAPACK VERSION 3.9.1 The following parameter values will be used: N : 0 1 2 3 5 6 10 11 50 NRHS : 1 2 15 M : 1 2 3 4 5 6 7 8 9 Routines pass computational tests if test ratio is less than 30.00 Relative machine underflow is taken to be : 1.6259745436952323e-260 Relative machine overflow is taken to be : 1.7976931348623158e+308 Relative machine precision is taken to be : 1.2154326714572501e-63 MULTIPLE PRECISION COMPLEX RFP routines passed the tests of the error exits All tests for ZPF drivers passed the threshold ( 2304 tests run) All tests for Clanhf auxiliary routine passed the threshold ( 384 tests run) All tests for the RFP conversion routines passed ( 72 tests run) All tests for Ctfsm auxiliary routine passed the threshold ( 7776 tests run) All tests for Chfrk auxiliary routine passed the threshold ( 2592 tests run) End of tests Total time used = 134 seconds
8.2.4 Details of testing other (EIG) MPLAPACK routines and drivers
Next, we explain how to test MPLAPACK EIG.
There are 20 input files for testing eigenvalue problems, singular value problems, least-squares fitting problems, and other related routines. For MPLAPACK, we use the following input files:
Cbak.in Cbal.in Cbal_double.in Cbb.in Cec.in Ced.in Cgbak.in Cgbal.in Cgd.in Cgg.in Csb.in Csg.in Rbak.in Rbal.in Rbal_double.in Rbb.in Rec.in Red.in Rgbak.in Rgbal.in Rgd.in Rgg.in Rsb.in Rsg.in csd.in glm.in gqr.in gsv.in lse.in nep.in se2.in sep.in svd.in
We modified the original input files so that each matrix row is not broken across lines, which allows the testing programs to read the data correctly in C++. The original input files for the balancing test depend on the precision: two of them, Rbal_double.in and Cbal_double.in, are designed for double precision and are identical to dbal.in and zbal.in. The final test matrices in Rbal_double.in and Cbal_double.in are tricky, so we omitted the final matrix from each input to obtain Rbal.in and Cbal.in.
Recall that we test LAPACK EIG as follows:
$ cd /home/docker/mplapack/external/lapack/work/internal/lapack-3.9.1/TESTING/EIG/ $ ./xeigtstd < ../nep.in Tests of the Nonsymmetric Eigenvalue Problem routines LAPACK VERSION 3.9.1 The following parameter values will be used: M: 0 1 2 3 5 10 16 N: 0 1 2 3 5 10 16 NB: 1 3 3 3 20 NBMIN: 2 2 2 2 2 NX: 1 0 5 9 1 INMIN: 11 12 11 15 11 INWIN: 2 3 5 3 2 INIBL: 0 5 7 3 200 ISHFTS: 1 2 4 2 1 IACC22: 0 1 2 0 1 Relative machine underflow is taken to be 0.222507-307 Relative machine overflow is taken to be 0.179769+309 Relative machine precision is taken to be 0.111022D-15 Routines pass computational tests if test ratio is less than 20.00 DHS routines passed the tests of the error exits ( 66 tests done) ...
The corresponding ported testing programs for the eigenvalue and singular value problem solver routines are in /home/docker/mplapack/mplapack/test/eig, and we list them as follows:
-
•
mpfr/xeigtstC_mpfr
-
•
mpfr/xeigtstR_mpfr
-
•
_Float128/xeigtstC__Float128
-
•
_Float128/xeigtstR__Float128
-
•
_Float64x/xeigtstC__Float64x
-
•
_Float64x/xeigtstR__Float64x
-
•
dd/xeigtstC_dd
-
•
dd/xeigtstR_dd
-
•
double/xeigtstC_double
-
•
double/xeigtstR_double
-
•
gmp/xeigtstC_gmp
-
•
gmp/xeigtstR_gmp
-
•
qd/xeigtstC_qd
-
•
qd/xeigtstR_qd
where we add the suffix “R” + “mplib” or “C” + “mplib” to xeigtst for the real and complex versions, respectively.
We test MPLAPACK routines for MPFR as follows:
$ cd /home/docker/mplapack/mplapack/test/eig/mpfr $ ./xeigtstR_mpfr < ../svd.in Tests of the Singular Value Decomposition routines Tests of the Multiple precision version of LAPACK MPLAPACK VERSION 2.0.1 Based on original LAPACK VERSION 3.9.1 The following parameter values will be used: M: 0 0 0 1 1 1 2 2 3 3 3 10 10 16 16 30 30 40 40 N: 0 1 3 0 1 2 0 1 0 1 3 10 16 10 16 30 40 30 40 NB: 1 3 3 3 20 NBMIN: 2 2 2 2 2 NX: 1 0 5 9 1 NS: 2 0 2 2 2 Relative machine underflow is taken to be+9.5302596195518043e-323228497 Relative machine overflow is taken to be+2.0985787164673877e+323228496 Relative machine precision is taken to be+7.4583407312002067e-155 Routines pass computational tests if test ratio is less than+5.0000000000000000e+01 DBD routines passed the tests of the error exits ( 55 tests done) Rgesvd passed the tests of the error exits ( 8 tests done) Rgesdd passed the tests of the error exits ( 6 tests done) Rgejsv passed the tests of the error exits ( 11 tests done) Rgesvdx passed the tests of the error exits ( 12 tests done) Rgesvdq passed the tests of the error exits ( 11 tests done) SVD: NB = 1, NBMIN = 2, NX = 1, NRHS = 2 All tests for DBD routines passed the threshold ( 10260 tests run) ....
We performed tests with all 20 testing files for all real precision versions. Almost all the tests passed for every input file. In the GMP version, gmp/xeigtstR_gmp, some failures occur that are related to precision loss in arithmetic; we would have to raise the error threshold by a factor of 10 or 100, and we do not consider such precision losses serious since we can raise the precision at runtime. We do not perform the test for csd.in in the GMP version, because GMP does not provide trigonometric functions; MPLAPACK stops when users try to use the cosine-sine decomposition. We do not fix these failures and plan to drop the GMP version and integrate it into the MPFR version in MPLAPACK version 3.0.0. Besides, some failures for svd.in are observed using xeigtstR_dd, xeigtstC_dd, and xeigtstC_qd. We do not know the root cause of these errors; however, we suspect they might be related to underflow.
In summary, MPLAPACK routines for all real and complex precisions are reliable.
9 Fortran90 to C++ Conversion
Until MPLAPACK 0.9.3, we used f2c to convert FORTRAN77 routines to C (until version 3.1.1, LAPACK required only a FORTRAN77 compiler). Unfortunately, the resulting source code was tough for a human to read; we replaced many tokens using sed and finally converted them into user-friendly, easy-to-read C++ programs by hand.
We simply use REAL, COMPLEX, INTEGER, LOGICAL, and BOOL to support all floating-point types. These types are abstract, and a typedef specifies the actual floating-point type (e.g., typedef mpf_class REAL for GMP; see mplapack/include/mplapack.h for details).
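The following is a minimal sketch of this abstraction; the actual definitions live in mplapack/include/mplapack.h and the per-library headers, and the guard macro names used here are illustrative only.

// Minimal sketch of how the abstract types map to concrete floating-point
// types per build (macro names are illustrative, not the real configuration macros).
#if defined(BUILD_WITH_GMP)
#include <gmpxx.h>
typedef mpf_class REAL;        // GMP build
#elif defined(BUILD_WITH_MPFR)
#include <mpreal.h>
typedef mpfr::mpreal REAL;     // MPFR build
#else
typedef double REAL;           // plain binary64 build
#endif

typedef long int INTEGER;      // Fortran INTEGER
typedef bool LOGICAL;          // Fortran LOGICAL

// Every routine is then written once against the abstract types, e.g.
// void Rscal(INTEGER n, REAL alpha, REAL *x, INTEGER incx);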
The main differences between Fortran and C++ are: (i) Fortran array indices start from one, (ii) a loop index may be incremented or decremented depending on its context, (iii) the treatment of two-dimensional arrays, and (iv) Fortran always passes arguments by reference when calling subroutines, whereas C++ can pass by value or by reference. For (i), we keep the same array index and decrement it when accessing the array; for (ii), we judge the direction of each loop by hand; for (iii), we flatten two-dimensional arrays using the leading dimension, translating a(i,j) to a[(i-1) + (j-1) * lda]; and for (iv), we change to call by value where possible.
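A minimal sketch of rules (i) and (iii): a Fortran loop over a column-major array with leading dimension lda becomes the flat-index C++ loop below. The function and variable names are illustrative, not taken from MPLAPACK.

// Fortran:
//   DO J = 1, N
//     DO I = 1, M
//       A(I, J) = ALPHA * A(I, J)
//     END DO
//   END DO
// becomes, keeping the 1-based loop indices and decrementing on access:
void scale_matrix(long m, long n, double alpha, double *a, long lda) {
    for (long j = 1; j <= n; j++) {
        for (long i = 1; i <= m; i++) {
            a[(i - 1) + (j - 1) * lda] = alpha * a[(i - 1) + (j - 1) * lda];
        }
    }
}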
Since MPLAPACK 1.0.0, we have used FABLE [36] to convert all BLAS and LAPACK routines. In the original paper, the authors translated dsyev.f without modifications. However, the resulting source code also depends on the Fortran EMulation (FEM) library, which is unnecessary for the MPBLAS and MPLAPACK parts, and its treatment of arrays is a bit unnatural. To make the resulting C++ code more natural and readable, we further patched cout.py and post-processed the output with sed. Then, we modified the source code by hand. For MPBLAS, the conversion is almost automatic. For MPLAPACK, we corrected many minor syntax errors, interpretations of numbers, casts, and other issues by hand, but many routines compile without modifications.
A more important part is the translation of TESTING/LIN and TESTING/EIG from Fortran 90 to C++. Thanks to FEM, the Fortran WRITE statements can be used directly without modifications. According to the ChangeLog, it took approximately three weeks just to compile the programs in
TESTING/LIN/xlintst{C,R}_mpfr and TESTING/EIG/xeigtst{C,R}_mpfr for MPFR, although thorough checking of the results was also required; it took four months for the real version and another four months for the complex version.
10 Benchmarks
We show some benchmarks for MPBLAS and MPLAPACK. We do not perform any optimization for the reference implementations. Nevertheless, we provide a simple OpenMP-accelerated version of MPBLAS and a CUDA-accelerated version of the double-double MPBLAS routines (Rgemm and Rsyrk). They are shown in Table 5 and Table 6.
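As an illustration of what "simple OpenMP parallelized" means in the following subsections, here is a minimal sketch of a triple-loop matrix-matrix multiply with the outer loop parallelized by an OpenMP pragma. It is not the actual MPLAPACK kernel; the function name is illustrative, REAL stands for whichever multiple-precision type the build selects, and C is assumed to have been scaled by beta beforehand.

// Minimal sketch of a simple OpenMP-parallelized C = A * B + C (column major).
// Compile with -fopenmp; not the actual MPLAPACK Rgemm implementation.
template <typename REAL>
void simple_omp_gemm(long m, long n, long k,
                     const REAL *a, long lda,
                     const REAL *b, long ldb,
                     REAL *c, long ldc) {
#pragma omp parallel for
    for (long j = 0; j < n; j++) {          // each thread handles whole columns of C
        for (long l = 0; l < k; l++) {
            REAL blj = b[l + j * ldb];
            for (long i = 0; i < m; i++) {
                c[i + j * ldc] += a[i + l * lda] * blj;
            }
        }
    }
}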
In Tables 22 and 23, we show the CPUs (or GPUs), OSes, machine names, compilers, and their abbreviations used to take the benchmarks. For A, D, E, and F, we took the benchmarks inside Docker, and for B, C, and G, we ran on the native OS.
Abbrev. | CPU or GPU or Machine name | Memory |
---|---|---|
A | AMD Ryzen 3970X (1 CPU, 32 cores) | DDR4 ECC 3200 MT/s |
B | Intel Xeon E5 2623 (2 CPUs, 8 cores) | DDR4 ECC 2133 MT/s |
C | Intel Core i5-8500B (1 CPU, 6 cores) | DDR4 w/o ECC 2667 MT/s |
D | same as A | same as A |
E | NVIDIA A100 80GB PCIe | HBM2e 1935GB/s |
F | NVIDIA V100 40GB PCIe | HBM2 900GB/s |
G | Raspberry Pi4, ARM Cortex A72 (1 CPU, 4 cores) | LPDDR4-3200 SDRAM 4GB |
Abbrev. | Docker filename | OS | Compiler/driver |
---|---|---|---|
A | Dockerfile_ubuntu20.04 | Ubuntu 20.04 | GCC 9.3 |
B | - | Ubuntu 20.04 | GCC 9.3 |
C | - | macOS Big Sur | GCC 10.4 |
D | Dockerfile_ubuntu20.04_inteloneapi | Ubuntu 20.04 | Intel oneAPI 2021.6.0 |
E | Dockerfile_ubuntu20.04_cuda | Ubuntu 20.04 | Nvidia 11.7 |
F | - | Ubuntu 20.04 | Nvidia 11.7 |
G | Dockerfile_ubuntu20.04 | Ubuntu 20.04 | GCC 9.3 |
10.1 Rgemm benchmarks
10.1.1 Rgemm on AMD Ryzen 3970X
In Figure 1, we show the result of Rgemm performance for _Float128, _Float64x and double-double, and in Figure 2 we show the result of Rgemm performance for MPFR, GMP and quad-double on AMD Ryzen 3970X.
The peak performances of the reference Rgemms of _Float128, _Float64x and double-double are 64.1 MFlops, 658 MFlops, 245 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rgemms of _Float128, _Float64x and double-double are 2450 MFlops, 19020 MFlops, 8582 MFlops, respectively.
The peak performances of the reference Rgemms of MPFR 512bit, GMP 512bit and quad-double are 11.5 MFlops, 19.5 MFlops, 19.8 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 389 MFlops, 670 MFlops, 783 MFlops, respectively.


10.1.2 Rgemm on Intel Xeon E5 2623
In Figure 3, we show the result of Rgemm performance for _Float128, _Float64x and double-double, and in Figure 4 we show the result of Rgemm performance for MPFR, GMP and quad-double on Intel Xeon E5 2623.
The peak performances of the reference Rgemms of _Float128, _Float64x and double-double are 51.0 MFlops, 820 MFlops, 195 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rgemms of _Float128, _Float64x and double-double are 389 MFlops, 7633 MFlops, 1464 MFlops, respectively.
The peak performances of the reference Rgemms of MPFR 512bit, GMP 512bit and quad-double are 8.3 MFlops, 13.2 MFlops, 15.5 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 56.7 MFlops, 92.9 MFlops, 130 MFlops, respectively.


10.1.3 Rgemm on Intel Core i5-8500B
In Figure 5, we show the result of Rgemm performance for _Float128, _Float64x and double-double, and in Figure 6 we show the result of Rgemm performance for MPFR, GMP and quad-double on Intel Core i5-8500B.
The peak performances of the reference Rgemms of _Float128, _Float64x and double-double are 60.0 MFlops, 1268 MFlops, 282 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rgemms of _Float128, _Float64x and double-double are 324 MFlops, 6777 MFlops, 1568 MFlops, respectively.
The peak performances of the reference Rgemms of MPFR 512bit, GMP 512bit and quad-double are 6.1 MFlops, 13.8 MFlops, 16.4 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 27.6 MFlops, 71.6 MFlops, 89.2 MFlops, respectively.


10.1.4 Rgemm on Raspberry Pi4 ARM Cortex A72
In Figure 7, we show the result of Rgemm performance for _Float128 and double-double, and in Figure 8 we show the result of Rgemm performance for MPFR, GMP and quad-double on Raspberry Pi4 ARM Cortex A72.
The peak performances of the reference Rgemms of _Float128 and double-double are 16.5 MFlops and 52.4 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rgemms of _Float128 and double-double are 65.7 MFlops and 206 MFlops, respectively.
The peak performances of the reference Rgemms of MPFR 512bit, GMP 512bit and quad-double are 2.4 MFlops, 3.1 MFlops, 5.6 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 9.3 MFlops, 12.0 MFlops, 21.8 MFlops, respectively.


10.1.5 Rgemm on NVIDIA A100 80GB PCIe
In Figure 9, we show the result of Rgemm performance for double-double on NVIDIA A100 80GB PCIe. The peak performance of double-double on NVIDIA A100 80GB PCIe without host-GPU transfer was 592 GFlops. Unfortunately, our hardware configuration is not equipped with an appropriate cooling system, so we could not take a stable benchmark.

10.1.6 Rgemm on NVIDIA V100 40GB PCIe
In Figure 10, we show the result of Rgemm performance for double-double on NVIDIA V100 40GB PCIe. The peak performance of double-double on NVIDIA V100 40GB PCIe was 435 GFlops without host-GPU transfer and 318 GFlops with host-GPU transfer.

10.2 Rsyrk benchmarks
10.2.1 Rsyrk on AMD Ryzen 3970X
In Figure 11, we show the result of Rsyrk performance for _Float128, _Float64x and double-double, and in Figure 12 we show the
result of Rsyrk performance for MPFR, GMP and quad-double on AMD Ryzen 3970X.
The peak performances of the reference Rsyrks of _Float128, _Float64x and double-double are 62.2 MFlops, 655 MFlops, 246 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rsyrks of _Float128, _Float64x and double-double are 62.3 MFlops, 654 MFlops, 245 MFlops, respectively.
The peak performances of the reference Rsyrks of MPFR 512bit, GMP 512bit and quad-double are 11.4 MFlops, 19.1 MFlops, 18.6 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 11.4 MFlops, 19.3 MFlops, 18.7 MFlops, respectively.


10.2.2 Rsyrk on Intel Xeon E5 2623
In Figure 13, we show the result of Rsyrk performance for _Float128, _Float64x and double-double, and in Figure 14 we show the
result of Rsyrk performance for MPFR, GMP and quad-double on Intel Xeon E5 2623.
The peak performances of the reference Rsyrks of _Float128, _Float64x and double-double are 50.1 MFlops, 835 MFlops, 192 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rsyrks of _Float128, _Float64x and double-double are 50.2 MFlops, 925 MFlops, 192 MFlops, respectively.
The peak performances of the reference Rsyrks of MPFR 512bit, GMP 512bit and quad-double are 8.3 MFlops, 12.8 MFlops, 14.9 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 8.1 MFlops, 12.8 MFlops, 14.9 MFlops, respectively.


10.2.3 Rsyrk on Intel Core i5-8500B
In Figure 15, we show the result of Rsyrk performance for _Float128, _Float64x and double-double, and in Figure 16 we show the result of Rsyrk performance for MPFR, GMP and quad-double on Intel Core i5-8500B.
The peak performances of the reference Rsyrks of _Float128, _Float64x and double-double are 57.8 MFlops, 1276 MFlops, 276 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rsyrks of _Float128, _Float64x and double-double are 57.4 MFlops, 1280 MFlops, 283 MFlops, respectively.
The peak performances of the reference Rsyrks of MPFR 512bit, GMP 512bit and quad-double are 6.5 MFlops, 14.0 MFlops, 17.4 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 6.5 MFlops, 14.6 MFlops, 16.0 MFlops, respectively.


10.2.4 Rsyrk on Raspberry Pi4 ARM Cortex A72
In Figure 17, we show the result of Rsyrk performance for _Float128 and double-double, and in Figure 18 we show the result of Rsyrk performance for MPFR, GMP and quad-double on Raspberry Pi4 ARM Cortex A72.
The peak performances of the reference Rsyrks of _Float128 and double-double are 16.3 MFlops and 51.4 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rsyrks of _Float128 and double-double are 16.3 MFlops and 51.4 MFlops, respectively.
The peak performances of the reference Rsyrks of MPFR 512bit, GMP 512bit and quad-double are 2.4 MFlops, 3.2 MFlops, 5.4 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 2.3 MFlops, 3.1 MFlops, 5.4 MFlops, respectively.


10.2.5 Rsyrk on NVIDIA V100
In Figure 19, we show the result of Rsyrk performance for double-double on NVIDIA V100 40GB PCIe. The peak performance of double-double on NVIDIA V100 40GB PCIe was 428 GFlops without host-GPU transfer and 277 GFlops with host-GPU transfer.

10.3 Rgemv benchmarks
10.3.1 Rgemv on AMD Ryzen 3970X
In Figure 20, we show the result of Rgemv performance for _Float128, _Float64x and double-double, and in Figure 21 we show
the result of Rgemv performance for MPFR, GMP and quad-double on AMD Ryzen 3970X.
The peak performances of the reference Rgemvs of _Float128, _Float64x and double-double are 63.6 MFlops, 669 MFlops, 247 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rgemvs of _Float128, _Float64x and double-double are 63.3 MFlops, 674 MFlops, 250 MFlops, respectively.
The peak performances of the reference Rgemvs of MPFR 512bit, GMP 512bit and quad-double are 11.4 MFlops, 19.4 MFlops, 19.6 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 11.3 MFlops, 19.8 MFlops, 19.8 MFlops, respectively.


10.3.2 Rgemv on Intel Xeon E5 2623
In Figure 22, we show the result of Rgemv performance for _Float128, _Float64x and double-double, and in Figure 23 we show
the result of Rgemv performance for MPFR, GMP and quad-double on Intel Xeon E5 2623.
The peak performances of the reference Rgemvs of _Float128, _Float64x and double-double are 52.0 MFlops, 859 MFlops, 194 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rgemvs of _Float128, _Float64x and double-double are 51.8 MFlops, 859 MFlops, 194 MFlops, respectively.
The peak performances of the reference Rgemvs of MPFR 512bit, GMP 512bit and quad-double are 8.3 MFlops, 13.3 MFlops, 15.5 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 8.3 MFlops, 13.2 MFlops, 15.4 MFlops, respectively.


10.3.3 Rgemv on Intel Core i5-8500B
In Figure 24, we show the result of Rgemv performance for _Float128, _Float64x and double-double, and in Figure 25 we show
the result of Rgemv performance for MPFR, GMP and quad-double on Intel Core i5-8500B.
The peak performances of the reference Rgemvs of _Float128, _Float64x and double-double are 59.1 MFlops, 1266 MFlops, 288 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rgemvs of _Float128, _Float64x and double-double are 59.0 MFlops, 1272 MFlops, 283 MFlops, respectively.
The peak performances of the reference Rgemvs of MPFR 512bit, GMP 512bit and quad-double are 6.7 MFlops, 13.8 MFlops, 16.9 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 6.7 MFlops, 13.7 MFlops, 16.6 MFlops, respectively.


10.3.4 Rgemv on Raspberry Pi4 ARM Cortex A72
In Figure 26, we show the result of Rgemv performance for _Float128 and double-double, and in Figure 27 we show the result of Rgemv performance for MPFR, GMP and quad-double on Raspberry Pi4 ARM Cortex A72.
The peak performances of the reference Rgemvs of _Float128 and double-double are 16.6 MFlops and 52.1 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rgemvs of _Float128 and double-double are 16.6 MFlops and 51.5 MFlops, respectively.
The peak performances of the reference Rgemvs of MPFR 512bit, GMP 512bit and quad-double are 2.3 MFlops, 3.1 MFlops, 5.6 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 2.4 MFlops, 3.1 MFlops, 5.6 MFlops, respectively.


10.4 Rgetrf benchmarks
10.4.1 Rgetrf on AMD Ryzen 3970X
In Figure 28, we show the result of Rgetrf performance for _Float128, _Float64x and double-double, and in Figure 29 we show the result of Rgetrf performance for MPFR, GMP and quad-double on AMD Ryzen 3970X.
The peak performances of the reference Rgetrfs of _Float128, _Float64x and double-double are 48.5 MFlops, 645 MFlops, 242 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rgetrfs of _Float128, _Float64x and double-double are 567 MFlops, 4352 MFlops, 2355 MFlops, respectively.
The peak performances of the reference Rgetrfs of MPFR 512bit, GMP 512bit and quad-double are 11.0 MFlops, 18.4 MFlops, 17.9 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 106 MFlops, 208 MFlops, 232 MFlops, respectively.


10.4.2 Rgetrf on Intel Xeon E5 2623
In Figure 30, we show the result of Rgetrf performance for _Float128, _Float64x and double-double, and in Figure 31 we show the result of Rgetrf performance for MPFR, GMP and quad-double on Intel Xeon E5 2623.
The peak performances of the reference Rgetrfs of _Float128, _Float64x and double-double are 39.8 MFlops, 802 MFlops, 193 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rgetrfs of _Float128, _Float64x and double-double are 224 MFlops, 3604 MFlops, 885 MFlops, respectively.
The peak performances of the reference Rgetrfs of MPFR 512bit, GMP 512bit and quad-double are 7.9 MFlops, 12.5 MFlops, 14.6 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 38.4 MFlops, 64.3 MFlops, 88.1 MFlops, respectively.


10.4.3 Rgetrf on Intel Core i5-8500B
In Figure 32, we show the result of Rgetrf performance for _Float128, _Float64x and double-double, and in Figure 33 we show the result of Rgetrf performance for MPFR, GMP and quad-double on Intel Core i5-8500B.
The peak performances of the reference Rgetrfs of _Float128, _Float64x and double-double are 44.8 MFlops, 1222 MFlops, 281 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rgetrfs of _Float128, _Float64x and double-double are 195 MFlops, 4223 MFlops, 1142 MFlops, respectively.
The peak performances of the reference Rgetrfs of MPFR 512bit, GMP 512bit and quad-double are 6.1 MFlops, 13.0 MFlops, 15.7 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 19.1 MFlops, 51.3 MFlops, 68.1 MFlops, respectively.


10.4.4 Rgetrf on Raspberry Pi4 ARM Cortex A72
In Figure 34, we show the result of Rgetrf performance for _Float128 and double-double, and in Figure 35 we show the result of Rgetrf performance for MPFR, GMP and quad-double on Raspberry Pi4 ARM Cortex A72.
The peak performances of the reference Rgetrfs of _Float128 and double-double are 13.9 MFlops and 50.5 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rgetrfs of _Float128 and double-double are 48.3 MFlops and 173 MFlops, respectively.
The peak performances of the reference Rgetrfs of MPFR 512bit, GMP 512bit and quad-double are 2.3 MFlops, 2.9 MFlops, 5.2 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 7.8 MFlops, 10.0 MFlops, 18.1 MFlops, respectively.


10.5 Rpotrf benchmarks
10.5.1 Rpotrf on AMD Ryzen 3970X
In Figure 36, we show the result of Rpotrf performance for _Float128, _Float64x and double-double, and in Figure 37 we show the result of Rpotrf performance for MPFR, GMP and quad-double on AMD Ryzen 3970X.
The peak performances of the reference Rpotrfs of _Float128, _Float64x and double-double are 54.9 MFlops, 1049 MFlops, 246 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rpotrfs of _Float128, _Float64x and double-double are 238 MFlops, 4307 MFlops, 985 MFlops, respectively.
The peak performances of the reference Rpotrfs of MPFR 512bit, GMP 512bit and quad-double are 10.5 MFlops, 18.9 MFlops, 18.6 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 43.2 MFlops, 87.8 MFlops, 84.6 MFlops, respectively.


10.5.2 Rpotrf on Intel Xeon E5 2623
In Figure 38, we show the result of Rpotrf performance for _Float128, _Float64x and double-double, and in Figure 39 we show the result of Rpotrf performance for MPFR, GMP and quad-double on Intel Xeon E5 2623.
The peak performances of the reference Rpotrfs of _Float128, _Float64x and double-double are 43.8 MFlops, 1865 MFlops, 199 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rpotrfs of _Float128, _Float64x and double-double are 134 MFlops, 4317 MFlops, 534 MFlops, respectively.
The peak performances of the reference Rpotrfs of MPFR 512bit, GMP 512bit and quad-double are 7.8 MFlops, 13.0 MFlops, 14.8 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 23.6 MFlops, 41.7 MFlops, 49.2 MFlops, respectively.


10.5.3 Rpotrf on Intel Core i5-8500B
In Figure 40, we show the result of Rpotrf performance for _Float128, _Float64x and double-double, and in Figure 41 we show the result of Rpotrf performance for MPFR, GMP and quad-double on Intel Core i5-8500B.
The peak performances of the reference Rpotrfs of _Float128, _Float64x and double-double are 50.5 MFlops, 1972 MFlops, 176 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rpotrfs of _Float128, _Float64x and double-double are 150 MFlops, 5660 MFlops, 526 MFlops, respectively.
The peak performances of the reference Rpotrfs of MPFR 512bit, GMP 512bit and quad-double are 5.9 MFlops, 13.1 MFlops, 15.9 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 13.7 MFlops, 36.0 MFlops, 46.8 MFlops, respectively.


10.5.4 Rpotrf on Raspberry Pi4 ARM Cortex A72
In Figure 42, we show the result of Rpotrf performance for _Float128 and double-double, and in Figure 43 we show the result of Rpotrf performance for MPFR, GMP and quad-double on Raspberry Pi4 ARM Cortex A72.
The peak performances of the reference Rpotrfs of _Float128 and double-double are 15.0 MFlops and 49.4 MFlops, respectively.
The peak performances of simple OpenMP parallelized Rpotrfs of _Float128 and double-double are 38.3 MFlops and 124 MFlops, respectively.
The peak performances of the reference Rpotrfs of MPFR 512bit, GMP 512bit and quad-double are 2.3 MFlops, 3.0 MFlops, 5.0 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 5.6 MFlops, 7.5 MFlops, 12.8 MFlops, respectively.


10.6 Raxpy benchmarks
10.6.1 Raxpy on AMD Ryzen 3970X
In Figure 44, we show the result of Raxpy performance for _Float128, _Float64x and double-double, and in Figure 45 we show the result of Raxpy performance for MPFR, GMP and quad-double on AMD Ryzen 3970X.
The peak performances of the reference Raxpys of _Float128, _Float64x and double-double are 60.9 MFlops, 733 MFlops, 258 MFlops, respectively.
The peak performances of simple OpenMP parallelized Raxpys of _Float128, _Float64x and double-double are 1250 MFlops, 4079 MFlops, 3188 MFlops, respectively.
The peak performances of the reference Raxpys of MPFR 512bit, GMP 512bit and quad-double are 11.7 MFlops, 20.9 MFlops, 20.3 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 404 MFlops, 683 MFlops, 764 MFlops, respectively.


10.6.2 Raxpy on Intel Xeon E5 2623
In Figure 46, we show the result of Raxpy performance for _Float128, _Float64x and double-double, and in Figure 47 we show the result of Raxpy performance for MPFR, GMP and quad-double on Intel Xeon E5 2623.
The peak performances of the reference Raxpys of _Float128, _Float64x and double-double are 50.3 MFlops, 866 MFlops, 195 MFlops, respectively.
The peak performances of simple OpenMP parallelized Raxpys of _Float128, _Float64x and double-double are 255 MFlops, 3333 MFlops, 1030 MFlops, respectively.
The peak performances of the reference Raxpys of MPFR 512bit, GMP 512bit and quad-double are 8.6 MFlops, 14.4 MFlops, 15.7 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 43.0 MFlops, 73.0 MFlops, 98.7 MFlops, respectively.


10.6.3 Raxpy on Intel Core i5-8500B
In Figure 48, we show the result of Raxpy performance for _Float128, _Float64x and double-double, and in Figure 49 we show the result of Raxpy performance for MPFR, GMP and quad-double on Intel Core i5-8500B.
The peak performances of the reference Raxpys of _Float128, _Float64x and double-double are 58.2 MFlops, 1303 MFlops, 308 MFlops, respectively.
The peak performances of simple OpenMP parallelized Raxpys of _Float128, _Float64x and double-double are 309 MFlops, 3146 MFlops, 1316 MFlops, respectively.
The peak performances of the reference Raxpys of MPFR 512bit, GMP 512bit and quad-double are 6.6 MFlops, 14.7 MFlops, 17.1 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 32.8 MFlops, 79.4 MFlops, 95.5 MFlops, respectively.


10.6.4 Raxpy on Raspberry Pi4 ARM Cortex A72
In Figure 50, we show the result of Raxpy performance for _Float128 and double-double, and in Figure 51 we show the result of Raxpy performance for MPFR, GMP and quad-double on Raspberry Pi4 ARM Cortex A72.
The peak performances of the reference Raxpys of _Float128 and double-double are 16.5 MFlops and 53.5 MFlops, respectively.
The peak performances of simple OpenMP parallelized Raxpys of _Float128 and double-double are 64.7 MFlops and 187 MFlops, respectively.
The peak performances of the reference Raxpys of MPFR 512bit, GMP 512bit and quad-double are 2.5 MFlops, 3.3 MFlops, 5.6 MFlops, respectively.
The peak performances of simple OpenMP parallelized MPFR 512bit, GMP 512bit and quad-double are 9.5 MFlops, 12.7 MFlops, 22.1 MFlops, respectively.


10.7 Rdot benchmarks
10.7.1 Rdot on AMD Ryzen 3970X
Figure 52 shows the Rdot performance for _Float128, _Float64x, and double-double, and Figure 53 shows the Rdot performance for MPFR, GMP, and quad-double on the AMD Ryzen 3970X.
The peak performances of the reference Rdot for _Float128, _Float64x, and double-double are 54.1 MFlops, 1077 MFlops, and 207 MFlops, respectively.
The peak performances of the simple OpenMP-parallelized Rdot for _Float128, _Float64x, and double-double are 1085 MFlops, 3560 MFlops, and 2866 MFlops, respectively.
The peak performances of the reference Rdot for MPFR 512-bit, GMP 512-bit, and quad-double are 8.5 MFlops, 19.8 MFlops, and 19.4 MFlops, respectively.
The peak performances of the simple OpenMP-parallelized Rdot for MPFR 512-bit, GMP 512-bit, and quad-double are 415 MFlops, 643 MFlops, and 760 MFlops, respectively.
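Unlike Raxpy, Rdot requires a reduction over the partial products. A minimal sketch of a simple OpenMP parallelization using per-thread partial sums is shown below; again, T is a placeholder for a multiple-precision type and this is not the actual MPBLAS kernel. Note that the parallel summation order differs from that of the sequential reference, so the last bits of the result may differ slightly.

// Minimal sketch of an OpenMP-parallelized dot product using per-thread
// partial sums; T is a placeholder for a multiple-precision type.
#include <omp.h>
#include <vector>

template <typename T>
T rdot_omp(long n, const T *x, const T *y) {
    int nthreads = omp_get_max_threads();
    std::vector<T> partial(nthreads, T(0));
#pragma omp parallel
    {
        T local(0);                  // per-thread accumulator
#pragma omp for
        for (long i = 0; i < n; i++)
            local = local + x[i] * y[i];
        partial[omp_get_thread_num()] = local;
    }
    T sum(0);                        // combine the partial sums sequentially
    for (int t = 0; t < nthreads; t++)
        sum = sum + partial[t];
    return sum;
}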


10.7.2 Rdot on Intel Xeon E5 2623
Figure 54 shows the Rdot performance for _Float128, _Float64x, and double-double, and Figure 55 shows the Rdot performance for MPFR, GMP, and quad-double on the Intel Xeon E5 2623.
The peak performances of the reference Rdot for _Float128, _Float64x, and double-double are 44.1 MFlops, 1977 MFlops, and 171 MFlops, respectively.
The peak performances of the simple OpenMP-parallelized Rdot for _Float128, _Float64x, and double-double are 269 MFlops, 3623 MFlops, and 833 MFlops, respectively.
The peak performances of the reference Rdot for MPFR 512-bit, GMP 512-bit, and quad-double are 6.6 MFlops, 13.9 MFlops, and 14.5 MFlops, respectively.
The peak performances of the simple OpenMP-parallelized Rdot for MPFR 512-bit, GMP 512-bit, and quad-double are 42.6 MFlops, 69.0 MFlops, and 83.3 MFlops, respectively.


10.7.3 Rdot on Intel Core i5-8500B
Figure 56 shows the Rdot performance for _Float128, _Float64x, and double-double, and Figure 57 shows the Rdot performance for MPFR, GMP, and quad-double on the Intel Core i5-8500B.
The peak performances of the reference Rdot for _Float128, _Float64x, and double-double are 52.0 MFlops, 2258 MFlops, and 206 MFlops, respectively.
The peak performances of the simple OpenMP-parallelized Rdot for _Float128, _Float64x, and double-double are 320 MFlops, 2949 MFlops, and 886 MFlops, respectively.
The peak performances of the reference Rdot for MPFR 512-bit, GMP 512-bit, and quad-double are 4.5 MFlops, 10.9 MFlops, and 16.2 MFlops, respectively.
The peak performances of the simple OpenMP-parallelized Rdot for MPFR 512-bit, GMP 512-bit, and quad-double are 35.3 MFlops, 77.5 MFlops, and 94.7 MFlops, respectively.


10.7.4 Rdot on Raspberry Pi4 ARM Cortex A72
Figure 58 shows the Rdot performance for _Float128 and double-double, and Figure 59 shows the Rdot performance for MPFR, GMP, and quad-double on the Raspberry Pi4 ARM Cortex A72.
The peak performances of the reference Rdot for _Float128 and double-double are 15.0 MFlops and 53.8 MFlops, respectively.
The peak performances of the simple OpenMP-parallelized Rdot for _Float128 and double-double are 65.2 MFlops and 194 MFlops, respectively.
The peak performances of the reference Rdot for MPFR 512-bit, GMP 512-bit, and quad-double are 1.9 MFlops, 3.2 MFlops, and 5.2 MFlops, respectively.
The peak performances of the simple OpenMP-parallelized Rdot for MPFR 512-bit, GMP 512-bit, and quad-double are 9.7 MFlops, 12.5 MFlops, and 22.2 MFlops, respectively.


10.8 Raw data for the benchmarks
The raw data for the benchmarks can be found at https://github.com/nakatamaho/mplapack/tree/master/benchmark/results/2022.
10.9 How to take benchmarks
Users can run the performance benchmarks on their own machines:
$ cd /home/docker/MPLAPACK/lib/*/mplapack/benchmark
$ bash -x benchall.sh
11 History
M.N. started developing MPLAPACK as a by-product of an arbitrarily accurate semidefinite programming solver, SDPA-GMP, around 2006 [14]. The first MPLAPACK (formerly MPACK) version, 0.0.1, was released on 2008-7-15. On 2009-2-5, we released SDPA-GMP 7.1.2, and MPLAPACK supports SDPA-GMP, SDPA-QD, and SDPA-DD [15, 17, 37, 38]. On 2010-01-13, we supported Windows via mingw32. On 2010-5-21, we supported MPFR. On 2012-10-13, we released a fast implementation of a double-double version of Rgemm for the NVIDIA C2050 [32, 39]. On 2017-3-29, we moved the web site from http://mplapack.sourceforge.net/ to https://github.com/nakatamaho/mplapack/. On 2021-4-1, we renamed our project to MPLAPACK. On 2021-4-11, we supported AArch64 and released 0.9.4. On 2021-10-1, we released 1.0.0, a major improvement: all real subroutines are supported (except for the RFP format and mixed-precision versions). In 2021, we released version 1.0.1, which fixed double-double arithmetic for Intel oneAPI. On 2022-07-26, we released 2.0.0, another major improvement: all routines, including those for the RFP matrix format, are now supported; only the mixed-precision routines are not yet supported. Since this release, we also provide all test results for amd64 (aka x86_64) (Linux/Mac/Win) and Arm64 (Linux) [40, 41]. We also fixed minor issues such as linking problems. On 2022-09-12, we released 2.0.1. It comes with A100 and V100 support for the double-double versions of Rgemm and Rsyrk [32, 39]; their peak performance is impressive, approximately 400-600 GFlops [42].
12 Related works
We can relate our work to two main directions: (i) acceleration of multiple-precision extensions of BLAS (especially on GPUs) and (ii) development of multiple-precision versions of linear algebra packages.
For (i), in 2009, Mukunoki et al. [43] implemented a double-double version of matrix-matrix multiplication kernels for the NVIDIA Tesla C1060 and evaluated its performance. In 2010, Mukunoki et al. [44] implemented the quadruple-precision Basic Linear Algebra Subprograms (BLAS) functions AXPY, GEMV, and GEMM on graphics processing units (GPUs) and evaluated their performance. On an NVIDIA Tesla C1060, their BLAS functions are approximately 30 times faster than the existing quadruple-precision BLAS on an Intel Core i7 920.
In 2011, Nakasato [45] implemented optimized dense matrix multiplication kernels for the AMD Cypress GPU in various precisions. The results include double-double precision, attaining a peak performance of 30 GFlops.
In 2012, Nakata et al. [39] implemented a double-double version of Rgemm on the NVIDIA C2050. Theirs is a complete Rgemm implementation, so it can be used in real applications: it handles all matrix sizes and supports the transpose of each matrix. They attained 16.1 GFlops with CPU-GPU transfer and 26.4 GFlops (25.7 GFlops with CPU-GPU transfer included). Moreover, they applied it to a semidefinite programming solver and attained a tenfold acceleration compared to a CPU-only implementation.
In 2012, Yamada et al. developed the QPBLAS package for CPUs [46], as well as the QPBLAS-GPU package for GPUs in 2013 [47]. These correspond to the complete double-double part of MPBLAS, accelerated for CPUs and GPUs, respectively. Note that the memory layout differs from that of MPBLAS, and the packages are written in Fortran 90.
In 2014 and 2015, Kouya [48, 49] implemented optimized matrix-matrix multiplications using the Strassen and Winograd algorithms in MPFR, double-double, and quad-double precision and applied them to LU factorization. The Strassen and Winograd versions are twice as fast as a simple blocking algorithm running on one core in double-double and quad-double precision. Furthermore, there is no significant performance difference between the Strassen and Winograd versions; however, with MPFR, the Winograd version was always faster. He also found a significant loss of accuracy when performing LU factorization using the Strassen and Winograd versions.
In 2016, Joldes et al. implemented CAMPARY (CUDA Multiple Precision Arithmetic Library) and applied it to semidefinite programming [50]. However, it was a prototype implementation and somewhat slower than our CPU implementation [15]. In 2017, Joldes et al. implemented an improved version of CAMPARY [51]. The significant result of this paper is the improved Rgemm performance on GPUs for various multiple-precision formats: 1.6 GFlops for triple-double precision, 976 MFlops for quadruple precision, 660 MFlops for quintuple-double, 453 MFlops for sextuple-double, and 200 MFlops for octuple-double.
In 2019, Hishinuma et al. implemented Rgemm kernels for double-double precision on the MIMD-type accelerator PEZY-SC2 [52]. Their Rgemm implementation attained 75% of the peak performance of the PEZY-SC2. This is 20 times faster than an Intel Xeon E5-2618L v3, even including the communication time between the host CPU and the PEZY-SC2.
In 2020, Isupov [53] benchmarked several level-1 multiple-precision routines for MPFR, ARPREC, MPDECIMAL, MPACK, GARPREC, CUMP, and MPRES-BLAS. MPRES-BLAS is an ongoing effort to develop a multiple-precision version of BLAS, like MPBLAS, on GPUs [54]. MPRES-BLAS defines a new arbitrary-precision type suited for GPUs and is the fastest among CAMPARY [50], CUMP [55], and GARPREC [56] when the fraction ranges from 424 bits to 848 bits.
In 2021, Kouya [57] accelerated the dd_real, qd_real, and triple-double precision versions of Rgemm with AVX2 using the Strassen algorithm [58].
For (ii), Kouya has long developed C libraries for multiple-precision computation using GMP and MPFR, including matrix operations [59]. He calls them BNCpack and MPIBNCpack. We can go back at least to 2003, when MPIBNCpack version 0.1 was released [60]. In 2011, version 0.7 was released; it can solve linear equations and eigenvalue problems and provides many multiple-precision functions that are not directly related to linear algebra. Since version 0.8 (2013-03-11), BNCpack and MPIBNCpack have been integrated [59, 61].
Saito developed ZKCM [62], a C++ library for multi-precision matrix computation for quantum computer simulation. He implemented matrix inversion, singular-value decomposition of a general matrix, the diagonalization of a Hermitian matrix, and other operations.
Arb is a C library for arbitrary-precision ball arithmetic developed by Johansson [63]. It uses ball arithmetic to track numerical errors automatically. It can perform a wide range of matrix operations, e.g., LU factorization, inversion, and eigenvalue problems, but it is not an extension of LAPACK to higher precision since the employed arithmetic is different.
The Multiprecision Computing Toolbox by Advanpix is a MATLAB toolbox that can perform many arbitrary-precision matrix operations, such as the diagonalization of real and complex non-symmetric matrices and singular value decomposition, with outstanding performance. They seem to employ an approach similar to ours for converting LAPACK to multiple-precision versions, but the details are not open, and it does not replace LAPACK [64]. The Symbolic Math Toolbox (SMT) of MATLAB also provides such functionality [65].
RalphAS has been developing GenericSchur.jl [66], which computes the Schur decomposition of matrices with generic floating-point element types in Julia. However, it is not a replacement for the LAPACK library.
Johansson et al. have also been developing mpmath [67], a Python library that can handle linear algebra (linear system solving, LU factorization, matrix inversion, matrix norms, matrix exponentials/logarithms/square roots, eigenvalues, singular values, and QR factorization); it is not a replacement for LAPACK.
Mathematica [68] provides multiple-precision solvers for eigenvalue problems of non-symmetric matrices, singular value decomposition, and other problems.
13 Future plans
For version 3.0, we are planning (i) to add more optimized subroutines to MPBLAS and MPLAPACK, (ii) Python and Octave integration, and (iii) to drop the GMP version. Finally, for version 4.0, we plan to add a template version so that we can use _Float16 and other precisions more transparently. Our development status can be found at https://github.com/nakatamaho/mplapack#mplapack-release-process.
Acknowledgement
First, we are grateful to the LAPACK team for making an excellent library. This work was supported by the Special Postdoctoral Researchers’ Program of RIKEN (2008, 2009), Grant-in-Aid for Scientific Research (B) 21300017 from the Japan Society for the Promotion of Science (2009, 2010, 2011), Microsoft Research CORE6 (2010), the Japan Society for the Promotion of Science (JSPS KAKENHI Grant no. 18H03206), and TIS Inc.
The author would like to thank Dr. Imamura Toshiyuki, Dr. Nakasato Naohito, Dr. Fujisawa Katsuki, Mr. Umeda Kouzo, Dr. Kouya Tomonori, Dr. Takahashi Daisuke, Dr. Goto Kazushige, Dr. Himeno Ryutaro, Dr. Hishinuma Toshiaki, Dr. Katagiri Takahiro, Dr. Ogita Takeshi, Dr. Kashiwagi Masahide, Dr. Yuasa Fukuko, Dr. Ishikawa Tadashi, Dr. Geshi Masaaki, Dr. Mukunoki Daichi, and Mr. Minato Yuichiro for their warm encouragement.
References
- [1] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996.
- [2] An updated set of basic linear algebra subprograms (blas). ACM Trans. Math. Softw., 28(2):135–151, June 2002.
- [3] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.
- [4] Ieee standard for floating-point arithmetic. IEEE Std 754-2008, pages 1–70, 2008.
- [5] Kazushige Goto and Robert A. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw., 34(3), May 2008.
- [6] Zhang Xianyi, Wang Qian, and Zhang Yunquan. Model-driven level 3 blas performance optimization on loongson 3a processor. In 2012 IEEE 18th International Conference on Parallel and Distributed Systems, pages 684–691, 2012.
- [7] Jack Dongarra, Vladimir Getov, and Kevin Walsh. The 30th anniversary of the supercomputing conference: Bringing the future closer–supercomputing history and the immortality of now. Computer, 51(10):74–85, 2018.
- [8] NVIDIA. Basic linear algebra on nvidia gpus, 2021. https://developer.nvidia.com/cublas.
- [9] Stanimire Tomov, Jack Dongarra, and Marc Baboulin. Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Computing, 36(5-6):232–240, June 2010.
- [10] D.H. Bailey, R. Barrio, and J.M. Borwein. High-precision computation: Mathematical physics and dynamics. Applied Mathematics and Computation, 218(20):10106–10121, 2012.
- [11] David H. Bailey and Jonathan M. Borwein. High-precision arithmetic in mathematical physics. Mathematics, 3(2):337–367, 2015.
- [12] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2002.
- [13] Lieven Vandenberghe and Stephen Boyd. Semidefinite programming. SIAM Rev., 38(1):49–95, March 1996.
- [14] Maho Nakata, Bastiaan J. Braams, Katsuki Fujisawa, Mituhiro Fukuda, Jerome K. Percus, Makoto Yamashita, and Zhengji Zhao. Variational calculation of second-order reduced density matrices by strong n-representability conditions and an accurate semidefinite programming solver. The Journal of Chemical Physics, 128(16):164113, 2008.
- [15] Maho Nakata. A numerical evaluation of highly accurate multiple-precision arithmetic version of semidefinite programming solver: Sdpa-gmp, -qd and -dd. In 2010 IEEE International Symposium on Computer-Aided Control System Design, pages 29–34, 2010.
- [16] Makoto Yamashita, Katsuki Fujisawa, Mituhiro Fukuda, Kazuhiro Kobayashi, Kazuhide Nakata, and Maho Nakata. Latest Developments in the Family for Solving Large-Scale SDPs, pages 687–713. Springer US, Boston, MA, 2012.
- [17] Nakata, Maho. SDPA-GMP, retrieved September 23, 2021. https://github.com/nakatamaho/sdpa-gmp/.
- [18] GMP failure on Mac OS X 10.5, retrieved August 29, 2022. https://gmplib.org/list-archives/gmp-bugs/2008-February/000930.html.
- [19] Dirk Merkel. Docker: lightweight linux containers for consistent development and deployment. Linux journal, 2014(239):2, 2014.
- [20] Ieee standard for binary floating-point arithmetic. ANSI/IEEE Std 754-1985, pages 1–20, 1985.
- [21] ISO/IEC TS 18661-3:2015 Information Technology - Programming languages, their environments, and system software interfaces - Floating-point extensions for C - Part 3: Interchange and extended types. the International Organization for Standardization, Chemin de Blandonnet 8 CP 401 1214 Vernier, Geneva Switzerland, 2015.
- [22] Cedric Lichtenau, Steven Carlough, and Silvia Melitta Mueller. Quad precision floating point on the ibm z13. In 2016 IEEE 23rd Symposium on Computer Arithmetic (ARITH), pages 87–94, 2016.
- [23] Yozo Hida, Xiaoye S. Li, and David H. Bailey. Quad-double arithmetic: Algorithms, implementation, and application. Technical Report LBNL-46996, Lawrence Berkeley National Laboratory, 2000.
- [24] Donald E. Knuth. The Art of Computer Programming, Volume 2 (3rd Ed.): Seminumerical Algorithms. Addison-Wesley Longman Publishing Co., Inc., USA, 1997.
- [25] T.J. Dekker. A floating-point technique for extending the available precision. Numerische Mathematik, 18:224–242, 1971/72.
- [26] 6.12 Additional Floating Types, retrieved September 7, 2022. https://gcc.gnu.org/onlinedocs/gcc/Floating-Types.html.
- [27] Torbjörn Granlund and the GMP development team. GNU MP: The GNU Multiple Precision Arithmetic Library, 5.0.5 edition, 2012. http://gmplib.org/.
- [28] Laurent Fousse, Guillaume Hanrot, Vincent Lefèvre, Patrick Pélissier, and Paul Zimmermann. Mpfr: A multiple-precision binary floating-point library with correct rounding. ACM Trans. Math. Softw., 33(2):Article 13, June 2007.
- [29] MPFR C++, retrieved September 10, 2022. http://www.holoborodko.com/pavel/mpfr/.
- [30] Andreas Enge, Mickaël Gastineau, Philippe Théveny, and Paul Zimmermann. mpc — A library for multiprecision complex arithmetic with exact rounding. INRIA, 1.1.0 edition, January 2018. http://mpc.multiprecision.org/.
- [31] 8 Arm C AND C++ Language Mappings, retrieved September 10, 2022. https://github.com/ARM-software/abi-aa/blob/2bcab1e3b22d55170c563c3c7940134089176746/aapcs64/aapcs64.rst#811arithmetic-types.
- [32] Maho Nakata, Yasuyoshi Takao, Shigeho Noda, and Ryutaro Himeno. A fast implementation of matrix-matrix product in double-double precision on nvidia c2050 and application to semidefinite programming. In 2012 Third International Conference on Networking and Computing, pages 68–75, 2012.
- [33] R. T. Gregory and D. Karney. A collection of matrices for testing computational algorithms. 1969.
- [34] Singular value decomposition, retrieved September 24, 2021. https://en.wikipedia.org/wiki/Singular_value_decomposition.
- [35] Susan Blackford and Jack Dongarra. Installation guide for LAPACK. LAPACK Working Note 41, March 1992. UT-CS-92-151, March, 1992.
- [36] Ralf W. Grosse-Kunstleve, Thomas C. Terwilliger, Nicholas K. Sauter, and Paul D. Adams. Automatic Fortran to C++ conversion with FABLE. Source Code for Biology and Medicine, 7:5, 2012. https://doi.org/10.1186/1751-0473-7-5.
- [37] Nakata, Maho. SDPA-QD, retrieved September 23, 2021. https://github.com/nakatamaho/sdpa-qd/.
- [38] Nakata, Maho. SDPA-DD, retrieved September 23, 2021. https://github.com/nakatamaho/sdpa-dd/.
- [39] Maho Nakata. Poster: Mpack 0.7.0: Multiple precision version of blas and lapack. In 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pages 1353–1353, 2012.
- [40] Linear equation test results of MPLAPACK, retrieved September 9, 2022. https://github.com/nakatamaho/mplapack/tree/master/mplapack/test/lin/results.
- [41] Eigenproblem test results of MPLAPACK, retrieved September 9, 2022. https://github.com/nakatamaho/mplapack/tree/master/mplapack/test/eig/results.
- [42] Benchmark results of MPLAPACK and MPBLAS, retrieved September 9, 2022. https://github.com/nakatamaho/mplapack/tree/master/benchmark/results/2022.
- [43] Mukunoki Daichi and Takahashi Daisuke. Implementation and evaluation of quadruple precision blas on gpu (in japanese). In IPSJ SIG Technical Report, volume 137, pages 1–6, 2009.
- [44] Daichi Mukunoki and Daisuke Takahashi. Implementation and evaluation of quadruple precision blas functions on gpus. In Kristján Jónasson, editor, Applied Parallel and Scientific Computing, pages 249–259, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
- [45] N. Nakasato. A fast gemm implementation on the cypress gpu. SIGMETRICS Perform. Evaluation Rev., 38:50–55, 2011.
- [46] Susumu Yamada, Takuya Ina, Narimasa Sasa, Yasuhiro Idomura, Masahiko Machida, and Toshiyuki Imamura. Quadruple-precision blas using bailey’s arithmetic with fma instruction: its performance and applications. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 1418–1425, 2017.
- [47] Japan Atomic Energy Agency. Quadruple Precision BLAS Routines for GPU QPBLAS-GPU Ver.1.0 User’s Manual, July 2013.
- [48] Tomonori Kouya. Accelerated multiple precision matrix multiplication using strassen’s algorithm and winograd’s variant. JSIAM Letters, 6:81–84, 2014.
- [49] Tomonori Kouya. Performance evaluation of multiple precision matrix multiplications using parallelized strassen and winograd algorithms. JSIAM Letters, 8:21–24, 2016.
- [50] Mioara Joldes, Jean-Michel Muller, Valentina Popescu, and Warwick Tucker. Campary: Cuda multiple precision arithmetic library and applications. In Gert-Martin Greuel, Thorsten Koch, Peter Paule, and Andrew Sommese, editors, Mathematical Software – ICMS 2016, pages 232–240, Cham, 2016. Springer International Publishing.
- [51] Mioara Joldes, Jean-Michel Muller, and Valentina Popescu. Implementation and performance evaluation of an extended precision floating-point arithmetic library for high-accuracy semidefinite programming. In 2017 IEEE 24th Symposium on Computer Arithmetic (ARITH), pages 27–34, 2017.
- [52] Toshiaki Hishinuma and Maho Nakata. pzqd: Pezy-sc2 acceleration of double-double precision arithmetic library for high-precision blas. In International Conference on Computational & Experimental Engineering and Sciences, pages 717–736. Springer, 2019.
- [53] Konstantin Isupov. Performance data of multiple-precision scalar and vector blas operations on cpu and gpu. Data in Brief, 30:105506, 2020.
- [54] Konstantin Isupov and Vladimir Knyazkov. Multiple-precision blas library for graphics processing units. In Vladimir Voevodin and Sergey Sobolev, editors, Supercomputing, pages 37–49, Cham, 2020. Springer International Publishing.
- [55] Takato Nakayama and D. Takahashi. Implementation of multiple-precision floating-point arithmetic library for gpu computing. In Proc. 23rd IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2011), pages 343–349, 2011.
- [56] Mian Lu, Bingsheng He, and Qiong Luo. Supporting extended precision on graphics processors. In Proceedings of the Sixth International Workshop on Data Management on New Hardware, DaMoN ’10, pages 19–26, New York, NY, USA, 2010. ACM.
- [57] Tomonori Kouya. Acceleration of multiple precision matrix multiplication based on multi-component floating-point arithmetic using avx2. In Osvaldo Gervasi, Beniamino Murgante, Sanjay Misra, Chiara Garau, Ivan Blečić, David Taniar, Bernady O. Apduhan, Ana Maria A. C. Rocha, Eufemia Tarantino, and Carmelo Maria Torre, editors, Computational Science and Its Applications – ICCSA 2021, pages 202–217, Cham, 2021. Springer International Publishing.
- [58] V. STRASSEN. Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969.
- [59] Tomonori Kouya. BNCpack 0.7, September 2011. http://na-inet.jp/na/bnc/.
- [60] Tomonori Kouya. MPIBNCpack 0.1, September 2003. https://na-inet.jp/na/bnc/mpibncpack.pdf.
- [61] Tomonori Kouya. Tuning technique for multiple precision dense matrix multiplication using prediction of computational time, 2017.
- [62] Akira SaiToh. Zkcm: A c++ library for multiprecision matrix computation with applications in quantum information. Computer Physics Communications, 184(8):2005–2020, 2013.
- [63] Fredrik Johansson. Arb: Efficient arbitrary-precision midpoint-radius interval arithmetic, 2016. https://arblib.org/.
- [64] Advanpix. Multiprecision Computing Toolbox for MATLAB, retrieved September 23, 2021. https://www.advanpix.com/.
- [65] The MathWorks, Inc. Symbolic Math Toolbox. Natick, Massachusetts, United States, 2019.
- [66] RalphAS. Schur decomposition of matrices with generic floating-point element types in Julia, retrieved September 23, 2021. https://github.com/RalphAS/GenericSchur.jl.
- [67] Fredrik Johansson et al. mpmath: a Python library for arbitrary-precision floating-point arithmetic (version 0.18), December 2013. http://mpmath.org/.
- [68] Wolfram Research, Inc. Mathematica, Version 12.3.1. Champaign, IL, 2021.