
Vedran Novaković (ORCID: 0000-0003-2964-9674), independent researcher, Vankina ulica 15, HR-10020 Zagreb, Croatia

Arithmetical enhancements of the Kogbetliantz method for the SVD of order two

Abstract

An enhanced Kogbetliantz method for the singular value decomposition (SVD) of general matrices of order two is proposed. The method consists of three phases: an almost exact prescaling, which can also benefit the LAPACK xLASV2 routine for the SVD of upper triangular $2\times 2$ matrices, a highly relatively accurate triangularization in the absence of underflows, and an alternative procedure for computing the SVD of triangular matrices that employs the correctly rounded $\mathop{\mathrm{hypot}}$ function. A heuristic for improving numerical orthogonality of the left singular vectors is also presented and tested on a wide spectrum of random input matrices. On upper triangular matrices under test, the proposed method, unlike xLASV2, finds both singular values with high relative accuracy as long as the input elements are within a safe range that is almost as wide as the entire normal range. On general matrices of order two, the method's safe range, for which the smaller singular values remain accurate, is about half the width of the normal range.

keywords: singular value decomposition, the Kogbetliantz method, matrices of order two, roundoff analysis

MSC Classification: 65F15, 15A18, 65Y99

1 Introduction

The singular value decomposition (SVD) of a square matrix of order two is a widely used numerical tool. In LAPACK [1] alone, its xLASV2 routine for the SVD of real upper triangular $2\times 2$ matrices is a building block for the QZ algorithm [2] for the generalized eigenvalue problem $Ax=\lambda Bx$, with $A$ and $B$ real and square, and for the SVD of real bidiagonal matrices by the implicit QR algorithm [3]. Also, the oldest method for the SVD of square matrices that is still in use was developed by Kogbetliantz [4], based on the SVD of order two, and as such is the primary motivation for this research.

This work explores how to compute the SVD of a general matrix of order two indirectly, by a careful scaling, a highly relatively accurate triangularization if the matrix indeed contains no zeros, and an alternative triangular SVD method, since the straightforward formulas for general matrices are challenging to evaluate stably.

Let $G$ be a square real matrix of order $n$. The SVD of $G$ is a decomposition $G=U\Sigma V^{T}$, where $U$ and $V$ are orthogonal\footnote{If $G$ is complex, $U$ and $V$ are unitary and $G=U\Sigma V^{\ast}$, but this case is only briefly dealt with here.} matrices of order $n$ of the left and the right singular vectors of $G$, respectively, and $\Sigma=\mathop{\mathrm{diag}}(\sigma_{1},\ldots,\sigma_{n})$ is a diagonal matrix of its singular values, such that $\sigma_{i}\geq\sigma_{j}\geq 0$ for all $i$ and $j$ with $1\leq i<j\leq n$.

In step $k$ of the Kogbetliantz SVD method, a pivot submatrix of order two (or several of them, not sharing indices with each other, if the method is parallel) is found according to the chosen pivot strategy in the iteration matrix $G_{k}$, its SVD is computed, and $U_{k}$, $V_{k}$, and $G_{k}$ are updated by the transformation matrices $\mathsf{U}_{k}$ and/or $\mathsf{V}_{k}$, leaving zeros in the off-diagonal positions $(j_{k},i_{k})$ and $(i_{k},j_{k})$ of $G_{k+1}$, as in

\begin{gathered}G_{0}=\mathop{\mathrm{preprocess}}(G),\quad U_{0}=\mathop{\mathrm{preprocess}}(I),\quad V_{0}=\mathop{\mathrm{preprocess}}(I);\\ G_{k+1}=\mathsf{U}_{k}^{T}G_{k}\mathsf{V}_{k},\quad U_{k+1}=U_{k}\mathsf{U}_{k},\quad V_{k+1}=V_{k}\mathsf{V}_{k},\qquad k\geq 0;\\ \mathop{\mathrm{convergence}}(k=K)\implies U\approx U_{K},\quad V\approx V_{K},\quad\sigma_{i}\approx g_{ii}^{(K)},\quad 1\leq i\leq n.\end{gathered} \tag{1}

The left and the right singular vectors of the pivot matrix are embedded into identities to get $\mathsf{U}_{k}$ and $\mathsf{V}_{k}$, respectively, with the index mapping from matrices of order two to $\mathsf{U}_{k}$ and $\mathsf{V}_{k}$ being $(1,1)\mapsto(i_{k},i_{k})$, $(2,1)\mapsto(j_{k},i_{k})$, $(1,2)\mapsto(i_{k},j_{k})$, $(2,2)\mapsto(j_{k},j_{k})$, where $1\leq i_{k}<j_{k}\leq n$ are the pivot indices. The process is repeated until convergence, i.e., until for some $k=K$ the off-diagonal norm of $G_{K}$ falls below a certain threshold.

If $G$ has $m$ rows and $n$ columns, $m>n$, it should be preprocessed [5] to a square matrix $G_{0}$ by a factorization of the URV [6] type (e.g., the QR factorization with column pivoting [7]). Then, $U_{0}^{T}GV_{0}=G_{0}$, where $G_{0}$ is triangular of order $n$ and $U_{0}$ is orthogonal of order $m$. If $m<n$, then the SVD of $G^{T}$ can be computed instead.

In all iterations it would be beneficial to have the pivot matrix $\widehat{G}_{k}$ triangular, since its SVD can be computed with high relative accuracy under mild assumptions [8]. This is, however, not possible with the time-consuming but simple, quadratically convergent [9, Remark 6] pivot strategy that chooses the pivot with the largest off-diagonal norm $\sqrt{|g_{j_{k}i_{k}}^{(k)}|^{2}+|g_{i_{k}j_{k}}^{(k)}|^{2}}$, but it is possible, if $G_{0}$ is triangular, with certain sequential cyclic (i.e., periodic) strategies [5, 10] like the row-cyclic and column-cyclic ones, and even with some parallel ones, after further preprocessing of $G_{0}$ into a suitable “butterfly” form [11].

Although the row-cyclic and column-cyclic strategies ensure global [10] and asymptotically quadratic [9, 12] convergence of the method, as well as its high relative accuracy [13], the method's sequential variants remain slow on modern hardware, while preprocessing $G$ to $G_{0}$ (in the butterfly form or not) can only be partially parallelized.

This work is a part of a broader effort [14, 15] to investigate if a fast and accurate (in practice if not in theory) variant of the Kogbetliantz method could be developed, that would be entirely parallel and would function on general square matrices without expensive preprocessing, with full pivots $\widehat{G}_{k}^{[\ell]}$, $1\leq\ell\leq\mathsf{n}\leq\lfloor n/2\rfloor$, that are independently diagonalized, and with $\mathsf{n}$ ensuing concurrent updates of $U_{k}$, $\mathsf{n}$ of $V_{k}$, and $\mathsf{n}$ from each side of $G_{k}$ in a parallel step. This way the employed parallel pivot strategy does not have to be cyclic. A promising candidate is the dynamic ordering [16, 17, 15].

The proposed Kogbetliantz SVD of order two supports a wider exponent range of the elements of a triangular input matrix, for which both singular values are computed with high relative accuracy, than xLASV2, although the latter is slightly more accurate when a comparison is possible. Matrices of the singular vectors obtained by the proposed method are noticeably more numerically orthogonal. With respect to [14, 15] and a general matrix $G$ of order two, the following enhancements have been implemented:

  1. The structure of $G$ is exploited to the utmost extent, so the triangularization and a stronger scaling are employed only when $G$ has no zeros, thus preserving accuracy.

  2. The triangularization of $G$ by a special URV factorization is tweaked so that high relative accuracy of each computed element is provable when no underflow occurs.

  3. The SVD procedure for triangular matrices utilizes the latest advances in computing the correctly rounded functions, so the pertinent formulas from [14, 15] are altered.

  4. The left singular vectors are computed by a heuristic when the triangularization is involved, by composing the two plane rotations—one from the URV factorization, and the other from the triangular SVD—into one without the matrix multiplication.

High relative accuracy of the singular values of $G$ is observed, but not proved, when the range of the elements of $G$ is narrower than about half of the entire normal range.

This paper is organized as follows. In Section 2 the floating-point environment and the required operations are described, and some auxiliary results regarding them are proved. Section 3 presents the proposed SVD method. In Section 4 the numerical testing results are shown. Section 5 concludes the paper with comments on future work.

2 Floating-point considerations

Let $x$ be a real, infinite, or undefined (Not-a-Number) value: $x\in\mathbb{R}\cup\{-\infty,+\infty,\mathrm{NaN}\}$. Its floating-point representation is denoted by $\mathop{\mathrm{fl}}(x)$ and is obtained by rounding $x$ to a value of the chosen standard floating-point datatype using the rounding mode in effect, here assumed to be to nearest (ties to even). If the result is normal, $\mathop{\mathrm{fl}}(x)=x(1+\epsilon)$, where $|\epsilon|\leq\varepsilon=2^{-p}$ and $p$ is the number of bits in the significand of a floating-point value. In LAPACK's terms, $\varepsilon_{\text{x}}=\text{xLAMCH('E')}$, where x is S or D. Thus, $p_{\text{x}}=24$ or $53$, and $\varepsilon_{\text{x}}=2^{-24}$ or $2^{-53}$, for single (S) or double (D) precision, respectively. The gradual underflow mode, allowing subnormal inputs and outputs, has to be enabled (e.g., on Intel-compatible architectures the Denormals-Are-Zero and Flush-To-Zero modes have to be turned off). Trapping on floating-point exceptions has to be disabled (the default non-stop handling from [18, Sect. 7.1]).

Possible discretization errors in input data are ignored. Input matrices in floating-point are thus considered exact, and are, for simplicity, required to have finite elements.

The Fused Multiply-and-Add (FMA) function, $\mathop{\mathrm{fma}}(a,b,c)=\mathop{\mathrm{fl}}(a\cdot b+c)$, is required. Conceptually, the exact value of $a\cdot b+c$ is correctly rounded. Also, the hypotenuse function, $\mathop{\mathrm{hypot}}(a,b)=\mathop{\mathrm{fl}}(\sqrt{a^{2}+b^{2}})$, is assumed to be correctly rounded (unless stated otherwise), as recommended by the standard [18, Sect. 9.2], but unlike\footnote{https://members.loria.fr/PZimmermann/papers/accuracy.pdf} many current implementations of the routines hypotf and hypot. Such a function (see also [19]) never overflows when the rounded result should be finite, it is zero if and only if $|a|=|b|=0$, and it is symmetric and monotone. The CORE-MATH library [20] provides an open-source implementation\footnote{https://core-math.gitlabpages.inria.fr} of some of the optional correctly rounded single and double precision mathematical C functions (e.g., cr_hypotf and cr_hypot).

Radix two is assumed for floating-point values. Scaling of a value $x$ by $2^{s}$, where $s\in\mathbb{Z}$, is exact if the result is normal. Only non-normal results can lose precision. Let, for $x=\pm 0$, $e_{x}=0$ and $f_{x}=0$, and for a finite non-zero $x$ let the exponent be $e_{x}=\lfloor\lg|x|\rfloor$ and the “mantissa” $f_{x}$ satisfy $1\leq|f_{x}|<2$, such that $x=\mathop{\mathrm{fl}}(2^{e_{x}}f_{x})$. Also, let $f_{x}=x$ for $x=\pm\infty$, while $e_{x}=0$. Note that $f_{x}$ is normal even for subnormal $x$. Keep in mind that the frexp routine represents a finite non-zero $x$ with $e_{x}^{\prime}=e_{x}+1$ and $f_{x}^{\prime}=f_{x}/2$.

Let $\mu$ be the smallest and $\nu$ the largest positive finite normal value. Then, in the notation just introduced, $e_{\mu}=\lg\mu=-126$ or $-1022$, and $e_{\nu}=\lfloor\lg\nu\rfloor=127$ or $1023$, for single or double precision, respectively. Lemma 2.1 can now be stated using this notation.

Lemma 2.1.

Assume that $e_{\nu}-p\geq 1$, with rounding to nearest. Then,

\mathop{\mathrm{fl}}(\nu+1)=\nu=\mathop{\mathrm{hypot}}(\nu,1). \tag{2}
Proof.

By the assumption, $\nu\geq 2^{p+1}+2^{p}+\cdots+2^{2}$, since $\nu=2^{e_{\nu}}(1.1\cdots 1)_{2}$, with $p$ ones. The bit in the last place thus represents a value of at least four. Adding one to $\nu$ would require rounding of the exact value $\nu+1=2^{e_{\nu}}\cdot(1.1\cdots 10\cdots 01)_{2}$ to $p$ bits of significand. The number of zeros is $e_{\nu}-p\geq 1$. Rounding to nearest in such a case is equivalent to truncating the trailing $e_{\nu}-p+1$ bits, starting from the leftmost zero, giving the result $\mathop{\mathrm{fl}}(\nu+1)=2^{e_{\nu}}(1.1\cdots 1)_{2}=\nu$. This proves the first equality in (2).

For the second equality in (2), note that $(\nu+1)^{2}=\nu^{2}+1+2\nu\geq\nu^{2}+1\geq\nu^{2}$ since $\nu>0$. By taking the square roots, it follows that $\nu+1\geq\sqrt{\nu^{2}+1}\geq\nu$, and therefore

\nu=\mathop{\mathrm{fl}}(\nu+1)\geq\mathop{\mathrm{fl}}(\sqrt{\nu^{2}+1})=\mathop{\mathrm{hypot}}(\nu,1)\geq\mathop{\mathrm{hypot}}(\nu,0)=\mathop{\mathrm{fl}}(\nu)=\nu,

since $\mathop{\mathrm{fl}}$ and $\mathop{\mathrm{hypot}}$ are monotone operations in all arguments. ∎

The claims of Lemma 2.1 and its corollaries below were used, and their proofs partially sketched, in, e.g., [21]. They are expanded and clarified here for completeness.

An underlined expression denotes a computed floating-point approximation of the exact value of that expression. Given $\tan(2\phi)$, for $0\leq|\phi|\leq\pi/4$, $\tan\phi$ and $\underline{\tan\phi}$ are

\tan\phi=\frac{\tan(2\phi)}{1+\sqrt{\tan^{2}(2\phi)+1}},\qquad\underline{\tan\phi}=\mathop{\mathrm{fl}}\left(\frac{\mathop{\mathrm{fl}}(\tan(2\phi))}{\mathop{\mathrm{fl}}(1+\mathop{\mathrm{hypot}}(\mathop{\mathrm{fl}}(\tan(2\phi)),1))}\right), \tag{3}

if $\tan(2\phi)$ and $\mathop{\mathrm{fl}}(\tan(2\phi))$ are finite, or $\mathop{\mathrm{sign}}(\tan(2\phi))$ and $\mathop{\mathrm{sign}}(\mathop{\mathrm{fl}}(\tan(2\phi)))$ otherwise.

Corollary 2.2.

Let $\tan(2\phi)$ be given, such that $\mathop{\mathrm{fl}}(\tan(2\phi))$ is finite. Then, under the assumptions of Lemma 2.1, $\underline{\tan\phi}$ from (3) satisfies $0\leq|\underline{\tan\phi}|\leq 1$.

Proof.

For $|\mathop{\mathrm{fl}}(\tan(2\phi))|=\nu$, due to Lemma 2.1, $\mathop{\mathrm{hypot}}(\mathop{\mathrm{fl}}(\tan(2\phi)),1)=\nu$, and so the denominator in (3) is $\mathop{\mathrm{fl}}(1+\nu)=\nu$. The numerator is always at most as large in magnitude as the denominator. Thus, $0\leq|\underline{\tan\phi}|\leq 1$, as claimed. ∎

Corollary 2.3.

Let $\tan\phi$ be given, for $|\phi|\leq\pi/2$. Then, $\sec\phi=1/\cos\phi$ can be approximated as $\underline{\sec\phi}=\mathop{\mathrm{hypot}}(\mathop{\mathrm{fl}}(\tan\phi),1)$. If $\tan\phi=\mathop{\mathrm{fl}}(\tan\phi)$, then $\underline{\sec\phi}=\mathop{\mathrm{fl}}(\sec\phi)$. When the assumptions of Lemma 2.1 hold and $\mathop{\mathrm{fl}}(\tan\phi)$ is finite, so is $\underline{\sec\phi}$.

Proof.

The approximation relation follows from the definition of $\mathop{\mathrm{hypot}}$ and from $\sec\phi=\sqrt{\tan^{2}\phi+1}$, while its finiteness for a finite $\mathop{\mathrm{fl}}(\tan\phi)$ follows from Lemma 2.1, since $|\mathop{\mathrm{fl}}(\tan\phi)|\leq\nu$ implies $\mathop{\mathrm{hypot}}(\mathop{\mathrm{fl}}(\tan\phi),1)\leq\nu$. ∎

For any $w\in\mathbb{R}$, let $\mathbf{w}=(e_{w},f_{w})=2^{e_{w}}f_{w}$ and $\mathop{\mathrm{fl}}(\mathbf{w})=\mathop{\mathrm{fl}}(2^{e_{w}}f_{w})\approx w$. Even $w$ such that $|w|>\nu$ or $0<|w|<\check{\mu}$, where $\check{\mu}$ is the smallest positive non-zero floating-point value, can be represented with a finite $e_{w}$ and a normalized $f_{w}$, though $\mathop{\mathrm{fl}}(\mathbf{w})$ is not finite or non-zero, respectively. The closest double precision approximation of $\mathbf{w}$ is $\underline{w}=\text{scalbn}(f_{w},e_{w})$, with a possible underflow or overflow, and similarly for single precision (using scalbnf). A similar definition could be made with $e_{w}^{\prime}$ and $f_{w}^{\prime}$ instead.

An overflow-avoiding addition and an underflow-avoiding subtraction of positive finite values $x$ and $y$, resulting in such exponent-“mantissa” pairs, can be defined as

x\oplus y=\begin{cases}(e_{z},f_{z}),&\text{if $z=\mathop{\mathrm{fl}}(x+y)\leq\nu$},\\ (e_{z}+1,f_{z}),&\text{otherwise, with $z=\mathop{\mathrm{fl}}(2^{-1}x+2^{-1}y)$},\end{cases} \tag{4}

and, assuming $x\neq y$ (for $x=y$ let $x\ominus y=(0,0)$ directly),

x\ominus y=\begin{cases}(e_{z},f_{z}),&\text{if $|z|\geq\mu$, with $z=\mathop{\mathrm{fl}}(x-y)$},\\ (e_{z}-c,f_{z}),&\text{otherwise, with $z=\mathop{\mathrm{fl}}(2^{c}x-2^{c}y)$},\end{cases} \tag{5}

where $c=e_{\mu}+p-1-\min\{e_{x},e_{y}\}$. In (4), $z\leq\nu$ in both cases, since $\mathop{\mathrm{fl}}(\nu/2+\nu/2)=\nu$.

Lemma 2.4.

In the second case in (5), $c>0$ and $\mu\leq|z|\leq\nu$, if $e_{\nu}\geq e_{\mu}+2p-1$.

Proof.

Assume $x>y$ (else, swap $x$ and $y$, and change the sign of $z$). Then, $e_{y}\leq e_{x}$ and $y=2^{e_{y}}(1.\mathtt{y}_{-1}\cdots\mathtt{y}_{1-p})_{2}$. The rightmost bit $\mathtt{y}_{1-p}$ multiplies the value $w=2^{e_{y}+1-p}$.

If $e_{x}-e_{y}\geq p+1$ then $x$ is normal and, due to rounding to nearest, $\mathop{\mathrm{fl}}(x-y)=x$. Therefore, assume that $e_{x}-e_{y}\leq p$. If $w\geq\mu=2^{e_{\mu}}$, then $x-y\geq\mu$ as well, since $x-y\geq w$. Thus, $\mathop{\mathrm{fl}}(x-y)\geq w\geq\mu$, so assume that $w<\mu$, i.e., $e_{y}+1-p<e_{\mu}$.

It now suffices to upscale $x$ and $y$ to $x^{\prime\prime}=2^{c}x$ and $y^{\prime\prime}=2^{c}y$, for some $c\in\mathbb{N}$, to ensure $e_{y}^{\prime\prime}=e_{y}+c\geq e_{\mu}+p-1$. Any $c\geq e_{\mu}+p-1-e_{y}$ that will not overflow $x^{\prime\prime}$ will do, so the smallest one is chosen. Note that $e_{x}^{\prime\prime}=e_{x}+c=e_{x}+e_{\mu}+p-e_{y}-1$. Since $e_{x}-e_{y}\leq p$, by the Lemma's assumption it holds that $e_{x}^{\prime\prime}\leq e_{\mu}+2p-1\leq e_{\nu}$. ∎

Several arithmetic operations on $(e,f)$-pairs can be defined (see also [22]), such as

\begin{gathered}|\mathbf{x}|=(e_{x},|f_{x}|),\qquad-\mathbf{x}=(e_{x},-f_{x}),\qquad 2^{\varsigma}\odot\mathbf{x}=(\varsigma+e_{x},f_{x}),\\ 1\oslash\mathbf{y}=(e_{z}-e_{y},f_{z}),\quad z=\mathop{\mathrm{fl}}(1/f_{y}),\end{gathered} \tag{6}

which are unary operations. The binary multiplication and division are defined as

\begin{gathered}\mathbf{x}\odot\mathbf{y}=(e_{x}+e_{y}+e_{z},f_{z}),\quad z=\mathop{\mathrm{fl}}(f_{x}\cdot f_{y}),\\ \mathbf{x}\oslash\mathbf{y}=(e_{x}-e_{y}+e_{z},f_{z}),\quad z=\mathop{\mathrm{fl}}(f_{x}/f_{y}),\end{gathered} \tag{7}

and the relation $\prec$, which compares the represented values in the $<$ sense, is given as

\begin{aligned}\mathbf{x}\prec\mathbf{y}\iff{}&(\mathop{\mathrm{sign}}(f_{x})<\mathop{\mathrm{sign}}(f_{y}))\\ {}\vee{}&((\mathop{\mathrm{sign}}(f_{x})=\mathop{\mathrm{sign}}(f_{y}))\wedge((e_{x}<e_{y})\vee((e_{x}=e_{y})\wedge(f_{x}<f_{y})))).\end{aligned} \tag{8}

Let, for any $G$ of order $n$, where INT_MIN is the smallest representable integer,

e_{G}=\max_{1\leq i,j\leq n}e_{ij},\quad e_{ij}=\max\{\left\lfloor\lg g_{ij}\right\rfloor,\text{INT\_MIN}\}. \tag{9}

A prescaling of $G$ as $\underline{G^{\prime}}=2^{s}G$, which avoids overflows, and underflows if possible, in the course of computing the SVD of $\underline{G^{\prime}}$ (and thus of $\underline{G}\approx 2^{-s}\underline{G^{\prime}}$), is defined by $s$ such that

e_{G}=\text{INT\_MIN}\implies s=0,\qquad e_{G}>\text{INT\_MIN}\implies s=e_{\nu}-e_{G}-\mathfrak{s},\quad\mathfrak{s}\geq 0, \tag{10}

where $\mathfrak{s}=0$ for $n=1$. For $n=2$, $\mathfrak{s}$ is chosen such that certain intermediate results while computing the SVD of $\underline{G^{\prime}}$ cannot overflow, as explained in [14, 15] and Section 3, but the final singular values are represented in the $(e,f)$ form, and are immune from overflow and underflow as long as they are not converted to simple floating-point values. If $s\geq 0$, the result of such a prescaling is exact. Otherwise, some elements of $\underline{G^{\prime}}$ might be computed inexactly due to underflow. If the elements of $G$ satisfy

g_{ij}\neq 0\implies\mu\leq|g_{ij}|\leq\nu/2^{\mathfrak{s}},\quad 1\leq i,j\leq n, \tag{11}

then $s\geq 0$, and $g_{ij}^{\prime}=0$ or $\mu\leq|g_{ij}^{\prime}|\leq\nu$, i.e., the elements of $G^{\prime}$ are zero or normal.

3 The SVD of general matrices of order two

This section presents a Kogbetliantz-like procedure for computing the singular values of $G$ when $n=2$, and the matrices of the left ($U$) and the right ($V$) singular vectors.

In general, $U$ is a product of permutations (denoted by $P$ with subscripts and including $I$), sign matrices (denoted by $S$ with subscripts) with each diagonal element being either $1$ or $-1$ while the rest are zeros, and plane rotations by the angles $\vartheta$ and $\varphi$. If $U_{\vartheta}$ is not generated, the notation changes from $U_{\varphi}$ to $U_{\phi}$. Likewise, $V$ is a product of permutations, a sign matrix, and a plane rotation by the angle $\psi$, where

U_{\vartheta}=\begin{bmatrix}\cos{\vartheta}&-\sin{\vartheta}\\ \sin{\vartheta}&\hphantom{-}\cos{\vartheta}\end{bmatrix},\qquad U_{\varphi}=\begin{bmatrix}\cos{\varphi}&-\sin{\varphi}\\ \sin{\varphi}&\hphantom{-}\cos{\varphi}\end{bmatrix},\qquad V_{\psi}=\begin{bmatrix}\cos{\psi}&-\sin{\psi}\\ \sin{\psi}&\hphantom{-}\cos{\psi}\end{bmatrix}. \tag{12}

Depending on its pattern of zeros, a matrix of order two falls into one of the 16 types $\mathfrak{t}$ shown in (13), where $\circ=0$ and $\bullet\neq 0$. Some types are permutationally equivalent to others, which is denoted by $\mathfrak{t}_{1}\cong\mathfrak{t}_{2}$, and means that a $\mathfrak{t}_{1}$-matrix can be pre-multiplied and/or post-multiplied by permutations to be transformed into a $\mathfrak{t}_{2}$-matrix, and vice versa, keeping the number of zeros intact. Each $\mathfrak{t}\neq 0$ has its associated scale type $\mathfrak{s}$.

\begin{gathered}\begin{matrix}0&1&2&3&4&5&6&7\\ \begin{bmatrix}\circ&\circ\\ \circ&\circ\end{bmatrix}&\begin{bmatrix}\bullet&\circ\\ \circ&\circ\end{bmatrix}&\begin{bmatrix}\circ&\circ\\ \bullet&\circ\end{bmatrix}&\begin{bmatrix}\bullet&\circ\\ \bullet&\circ\end{bmatrix}&\begin{bmatrix}\circ&\bullet\\ \circ&\circ\end{bmatrix}&\begin{bmatrix}\bullet&\bullet\\ \circ&\circ\end{bmatrix}&\begin{bmatrix}\circ&\bullet\\ \bullet&\circ\end{bmatrix}&\begin{bmatrix}\bullet&\bullet\\ \bullet&\circ\end{bmatrix}\\ \begin{bmatrix}\circ&\circ\\ \circ&\bullet\end{bmatrix}&\begin{bmatrix}\bullet&\circ\\ \circ&\bullet\end{bmatrix}&\begin{bmatrix}\circ&\circ\\ \bullet&\bullet\end{bmatrix}&\begin{bmatrix}\bullet&\circ\\ \bullet&\bullet\end{bmatrix}&\begin{bmatrix}\circ&\bullet\\ \circ&\bullet\end{bmatrix}&\begin{bmatrix}\bullet&\bullet\\ \circ&\bullet\end{bmatrix}&\begin{bmatrix}\circ&\bullet\\ \bullet&\bullet\end{bmatrix}&\begin{bmatrix}\bullet&\bullet\\ \bullet&\bullet\end{bmatrix}\\ 8&9&10&11&12&13&14&15\end{matrix}\quad\begin{matrix}\mathfrak{t}&\mathfrak{s}\\ 0,1,2,4,6,8\cong\mathbf{9}&0\\ 12\cong\mathbf{3};\quad 10\cong\mathbf{5}&1\\ 7,11,14\cong\mathbf{13}&1\\ \mathbf{15}&2\end{matrix}\end{gathered} \tag{13}

For $\mathfrak{s}=0$, there is one equivalence class of matrix types, represented by $\mathfrak{t}=9$. For $\mathfrak{s}=1$, there are three classes, represented by $\mathfrak{t}=3$, $\mathfrak{t}=5$, and $\mathfrak{t}=13$, while for $\mathfrak{s}=2$ there is one class, $\mathfrak{t}=15$. The SVD computation for the first three classes is straightforward, while for the fourth and the fifth classes it is more involved. A matrix of any type, except $\mathfrak{t}=15$, can be permuted into an upper triangular one. If a matrix so obtained is well scaled, its SVD can alternatively be computed by xLASV2. However, xLASV2 does not accept general matrices (i.e., $\mathfrak{t}=15$), unlike the proposed method, which is a modification of [15] when $J=I$, and consists of the following three phases:

  1. For $G$ determine $\mathfrak{t}$, $\mathfrak{s}$, and $s$ to obtain $\underline{G^{\prime}}$. Handle the simple cases of $\mathfrak{t}$ separately.

  2. If $\mathfrak{t}\cong 13$ or $\mathfrak{t}=15$, factorize $\underline{G^{\prime}}$ as $U_{+}RV_{+}$, such that $U_{+}$ and $V_{+}$ are orthogonal, and $R$ is upper triangular, with $\min\{r_{11},r_{12},r_{22}\}>0$ and all $r_{ij}$ finite, $1\leq i\leq j\leq 2$.

  3. From the SVD of $\underline{R}$ assemble the SVD of $\underline{G^{\prime}}$. Optionally backscale $\underline{\Sigma^{\prime}}$ by $2^{-s}$.

The phases 1, 2, and 3 are described in Sections 3.1, 3.2, and 3.3, respectively.

3.1 Prescaling of the matrix and the simple cases ($\mathfrak{t}\cong 3,5,9$)

Matrices with $\mathfrak{t}\cong 9$ do not have to be scaled, but only permuted into the $\mathfrak{t}=0$, $\mathfrak{t}=1$, or $\mathfrak{t}=9$ (where the first diagonal element is not smaller in magnitude than the second one) form, according to their number of non-zeros, with at most one permutation from the left and at most one from the right hand side. Then, the rows of $P_{U}^{T}GP_{V}$ are multiplied by the signs of their diagonal elements, to obtain $\sigma_{1}=|g_{11}|$ and $\sigma_{2}=|g_{22}|$, while $U=P_{U}S$ and $V=P_{V}$. The error-free SVD computation is thus completed.

Note that the signs might have been taken out of the columns instead of the rows, and the sign matrix $S$ would then have been incorporated into $V$ instead. The structure of the left and the right singular vector matrices is therefore not uniquely determined.

Be aware that $\mathfrak{t}$, determined before the prescaling (to compute $\mathfrak{s}$ and $s$), may differ from the type $\mathfrak{t}^{\prime}$ that would be found afterwards. If, e.g., $\mathfrak{t}\ncong 9$ and $G$ contains, among others, $\nu$ and $\check{\mu}$ as elements, the element(s) $\check{\mu}$ will vanish after the prescaling since $s<0$ (from (10), due to $\mathfrak{s}\geq 1$), so $\mathfrak{t}^{\prime}<\mathfrak{t}$ and the zero pattern of $\underline{G^{\prime}}$ has to be re-examined.

A $\mathfrak{t}\cong 3$ or $\mathfrak{t}\cong 5$ matrix is scaled by $2^{s}$. The columns (resp., rows) of a $\mathfrak{t}^{\prime}=12$ (resp., $\mathfrak{t}^{\prime}=10$) matrix are swapped, to bring it to the $\mathfrak{t}^{\prime\prime}=3$ (resp., $\mathfrak{t}^{\prime\prime}=5$) form. Then, the non-zero elements are made positive by multiplying the rows (resp., columns) by their signs. Next, the rows (resp., columns) are swapped if required, to make the upper left element the largest in magnitude. The sign-extracting and magnitude-ordering operations may be swapped or combined. The resulting matrix $G^{\prime\prime}$ undergoes the QR (resp., RQ) factorization by a single Givens rotation $U_{\theta}^{T}$ (resp., $V_{\theta}$), determined by $\tan\theta$ (consequently, by $\cos\theta$ and $\sin\theta$) as in (12), with $\theta$ substituted for $\vartheta$ (resp., $\psi$), where

\mathfrak{t}^{\prime\prime}=3\implies\tan\theta=g_{21}^{\prime\prime}/g_{11}^{\prime\prime},\qquad\mathfrak{t}^{\prime\prime}=5\implies\tan\theta=g_{12}^{\prime\prime}/g_{11}^{\prime\prime}. \tag{14}

By construction, $0<\tan\theta\leq 1$ and $0\leq\underline{\tan\theta}\leq 1$. The upper left element is not transformed, but explicitly set to hold the Frobenius norm of the whole non-zero column (resp., row), as $\underline{g_{11}^{\prime\prime\prime}}=\mathop{\mathrm{hypot}}(\underline{g_{11}^{\prime\prime}},\underline{g_{21}^{\prime\prime}})$ (resp., $\underline{g_{11}^{\prime\prime\prime}}=\mathop{\mathrm{hypot}}(\underline{g_{11}^{\prime\prime}},\underline{g_{12}^{\prime\prime}})$), while the other non-zero element is zeroed out. Thus, to avoid overflow of $\underline{g_{11}^{\prime\prime\prime}}$ it is sufficient to ensure that $\underline{g_{11}^{\prime\prime}}\ll\nu/\sqrt{2}$, which $\mathfrak{s}=1$ achieves. The SVD is given by $U=S_{U}P_{U}U_{\theta}$ and $V=P_{V}$ for $\mathfrak{t}^{\prime}\cong 3$, and by $U=P_{U}$ and $V=S_{V}P_{V}V_{\theta}$ for $\mathfrak{t}^{\prime}\cong 5$. The scaled singular values are $\sigma_{1}^{\prime}=g_{11}^{\prime\prime\prime}$ and $\sigma_{2}^{\prime}=0$ in both cases, and $\underline{\sigma_{1}^{\prime}}$ cannot overflow ($\underline{\sigma_{1}}=2^{-s}\underline{\sigma_{1}^{\prime}}$ can).

If no inexact underflow occurs while scaling $G$ to $G^{\prime}$, then $\underline{\sigma_{1}^{\prime}}=\sigma_{1}^{\prime}(1+\epsilon_{1}^{\prime})$, where $|\epsilon_{1}^{\prime}|\leq\varepsilon$. With the same assumption, $\underline{\tan\theta}\geq\mu$ implies $\underline{\tan\theta}=\tan\theta(1+\epsilon_{\theta})$, where $|\epsilon_{\theta}|\leq\varepsilon$. The resulting Givens rotation can be represented and applied as one of

\underline{U_{\theta}^{T}}=\begin{bmatrix}1&\underline{\tan\theta}\\ -\underline{\tan\theta}&1\end{bmatrix}/\underline{\sec\theta},\qquad\underline{V_{\theta}}=\begin{bmatrix}1&-\underline{\tan\theta}\\ \underline{\tan\theta}&1\end{bmatrix}/\underline{\sec\theta}, \tag{15}

which avoids computing $\underline{\cos\theta}$ and $\underline{\sin\theta}$ explicitly. Lemma 3.1 bounds the error in $\underline{\sec\theta}$.

Lemma 3.1.

Let $\underline{\sec\theta}$ from (15) be computed as $\mathop{\mathrm{hypot}}(\underline{\tan\theta},1)$ for $\underline{\tan\theta}\geq\mu$. Then,

\underline{\sec\theta}=\delta_{\theta}^{\prime}\sec\theta,\quad\sqrt{((1-\varepsilon)^{2}+1)/2}\,(1-\varepsilon)\leq\delta_{\theta}^{\prime}\leq\sqrt{((1+\varepsilon)^{2}+1)/2}\,(1+\varepsilon). \tag{16}
Proof.

Let δθ=(1+ϵθ)\delta_{\theta}=(1+\epsilon_{\theta}). Then tanθ¯μ\underline{\tan\theta}\geq\mu implies (tanθ¯)2=δθ2tan2θ(\underline{\tan\theta})^{2}=\delta_{\theta}^{2}\tan^{2}\theta, and

1ε=δθδθδθ+=1+ε.1-\varepsilon=\delta_{\theta}^{-}\leq\delta_{\theta}\leq\delta_{\theta}^{+}=1+\varepsilon.

Express (tanθ¯)2+1=δθ2tan2θ+1(\underline{\tan\theta})^{2}+1=\delta_{\theta}^{2}\tan^{2}\theta+1 as (tan2θ+1)(1+ϵθ)(\tan^{2}\theta+1)(1+\epsilon_{\theta}^{\prime}), from which it follows

ϵθ=tan2θtan2θ+1(δθ21),0<tan2θtan2θ+112.\epsilon_{\theta}^{\prime}=\frac{\tan^{2}\theta}{\tan^{2}\theta+1}(\delta_{\theta}^{2}-1),\qquad 0<\frac{\tan^{2}\theta}{\tan^{2}\theta+1}\leq\frac{1}{2}.

By adding unity to both sides of the equation for ϵθ\epsilon_{\theta}^{\prime} and taking the maximal value of the first factor on its right hand side, while accounting for the bounds of δθ\delta_{\theta}, it holds

((δθ)2+1)/21+ϵθ((δθ+)2+1)/2.((\delta_{\theta}^{-})^{2}+1)/2\leq 1+\epsilon_{\theta}^{\prime}\leq((\delta_{\theta}^{+})^{2}+1)/2.

Since secθ¯=hypot(tanθ¯,1)=(tanθ¯)2+1(1+ϵ)=(tan2θ+1)(1+ϵθ)(1+ϵ)\underline{\sec\theta}=\mathop{\mathrm{hypot}}(\underline{\tan\theta},1)=\sqrt{(\underline{\tan\theta})^{2}+1}(1+\epsilon_{\sqrt{\hbox{}}})=\sqrt{(\tan^{2}\theta+1)(1+\epsilon_{\theta}^{\prime})}(1+\epsilon_{\sqrt{\hbox{}}}), where |ϵ|ε|\epsilon_{\sqrt{\hbox{}}}|\leq\varepsilon, factorizing the last square root into a product of square roots gives

secθ¯=secθ1+ϵθ(1+ϵ).\underline{\sec\theta}=\sec\theta\sqrt{1+\epsilon_{\theta}^{\prime}}(1+\epsilon_{\sqrt{\hbox{}}}).

The proof is concluded by denoting the error factor on the right hand side by δθ\delta_{\theta}^{\prime}. ∎

This proof and the following one use several techniques from [21, Theorem 1]. Since the matrix that UθTU_{\theta}^{T} or VθV_{\theta} is applied to contains one zero and one ±1\pm 1 in each row and column, UU or VV has ±cosθ¯\pm\underline{\cos\theta} and ±sinθ¯\pm\underline{\sin\theta} in each row and column, computed implicitly, for which Lemma 3.2 gives error bounds.

Lemma 3.2.

Let cosθ¯\underline{\cos\theta} and sinθ¯\underline{\sin\theta} result from applying (15) with tanθ¯μ\underline{\tan\theta}\geq\mu. Then,

cosθ¯=δθ′′cosθ,δθ′′=(1+ϵ/)/δθ,sinθ¯=δθ′′′sinθ,δθ′′′=(1+ϵ/)δθ/δθ,\underline{\cos\theta}=\delta_{\theta}^{\prime\prime}\cos\theta,\quad\delta_{\theta}^{\prime\prime}=(1+\epsilon_{/})/\delta_{\theta}^{\prime},\qquad\underline{\sin\theta}=\delta_{\theta}^{\prime\prime\prime}\sin\theta,\quad\delta_{\theta}^{\prime\prime\prime}=(1+\epsilon_{/}^{\prime})\delta_{\theta}/\delta_{\theta}^{\prime}, (17)

where max{|ϵ/|,|ϵ/|}ε\max\{|\epsilon_{/}|,|\epsilon_{/}^{\prime}|\}\leq\varepsilon and δθ′′\delta_{\theta}^{\prime\prime} and δθ′′′\delta_{\theta}^{\prime\prime\prime} can be bounded below and above in terms of ε\varepsilon only. Let δθ\delta_{\theta}^{\prime-} and δθ+\delta_{\theta}^{\prime+} be the lower and the upper bounds for δθ\delta_{\theta}^{\prime} from (16). Then,

(1ε)/δθ+δθ′′(1+ε)/δθ,(1ε)δθ/δθ+δθ′′′(1+ε)δθ+/δθ.(1-\varepsilon)/\delta_{\theta}^{\prime+}\leq\delta_{\theta}^{\prime\prime}\leq(1+\varepsilon)/\delta_{\theta}^{\prime-},\qquad(1-\varepsilon)\delta_{\theta}^{-}/\delta_{\theta}^{\prime+}\leq\delta_{\theta}^{\prime\prime\prime}\leq(1+\varepsilon)\delta_{\theta}^{+}/\delta_{\theta}^{\prime-}. (18)
Proof.

The claims follow from (16), cosθ¯=fl(1/secθ¯)\underline{\cos\theta}=\mathop{\mathrm{fl}}(1/\underline{\sec\theta}), and sinθ¯=fl(tanθ¯/secθ¯)\underline{\sin\theta}=\mathop{\mathrm{fl}}(\underline{\tan\theta}/\underline{\sec\theta}). ∎

If, in (14), tanθ¯<μ\underline{\tan\theta}<\mu, then secθ¯=1\underline{\sec\theta}=1 since 1secθ¯fl(1+μ2)fl(1+μ)=11\leq\underline{\sec\theta}\leq\mathop{\mathrm{fl}}(\sqrt{1+\mu^{2}})\leq\mathop{\mathrm{fl}}(1+\mu)=1, and the relative error in secθ¯\underline{\sec\theta} is below ε\varepsilon for any standard floating-point datatype. Thus, even though tanθ¯\underline{\tan\theta} can be relatively inaccurate, (16) holds for all secθ¯\underline{\sec\theta}. Also, cosθ¯\underline{\cos\theta} is always relatively accurate, but sinθ¯\underline{\sin\theta} might not be if tanθ¯<μ\underline{\tan\theta}<\mu, when sinθ¯=tanθ¯\underline{\sin\theta}=\underline{\tan\theta}.
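To make the use of (15) and (17) concrete, the sketch below (function name ours) evaluates secθ\sec\theta, cosθ\cos\theta, and sinθ\sin\theta from tanθ\tan\theta alone, assuming a correctly rounded hypot\mathop{\mathrm{hypot}}, as Python’s math.hypot is on common platforms; a minimal illustration, not the paper’s implementation.

```python
import math

def rotation_from_tan(tan_theta):
    # Given 0 <= tan(theta) <= 1, evaluate the functions of theta used in
    # (15) and (17) without forming theta itself.
    sec = math.hypot(tan_theta, 1.0)  # Lemma 3.1: relative error within (16)
    cos = 1.0 / sec                   # fl(1 / sec(theta))
    sin = tan_theta / sec             # fl(tan(theta) / sec(theta))
    return sec, cos, sin

sec, cos, sin = rotation_from_tan(0.75)
assert abs(cos * cos + sin * sin - 1.0) <= 4e-16  # numerically orthogonal
```

With tanθ=0.75\tan\theta=0.75 all intermediate results happen to be exact (sec = 1.25, cos = 0.8, sin = 0.6), which makes the identity check trivial; for general inputs the errors are bounded by Lemmas 3.1 and 3.2.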

3.2 A (pivoted) URVURV factorization of order two (𝔱13,15\mathfrak{t}^{\prime}\cong 13,15)

If, after the prescaling, 𝔱13\mathfrak{t}^{\prime}\cong 13 or 𝔱=15\mathfrak{t}^{\prime}=15, G¯\underline{G^{\prime}} is transformed into an upper triangular matrix RR with all elements non-negative, i.e., a special URV factorization of G¯\underline{G^{\prime}} is computed. Section 3.2.1 deals with the 𝔱13\mathfrak{t}^{\prime}\cong 13 case, and Section 3.2.2 with the 𝔱=15\mathfrak{t}^{\prime}=15 case.

3.2.1 An error-free transformation from 𝔱13\mathfrak{t}^{\prime}\cong 13 to 𝔱′′=13\mathfrak{t}^{\prime\prime}=13 form

A triangular or anti-triangular matrix is first permuted into an upper triangular one, G′′G^{\prime\prime}. Its first row is then multiplied by the sign of g11′′g_{11}^{\prime\prime}, which might change the sign of g12′′g_{12}^{\prime\prime}. The second column is multiplied by the new sign of g12′′g_{12}^{\prime\prime}, which might change the sign of g22′′g_{22}^{\prime\prime}. The second row is finally multiplied by the new sign of g22′′g_{22}^{\prime\prime}, which completes the construction of RR.

The transformations U+TU_{+}^{T} and V+V_{+}, such that R=U+TGV+R=U_{+}^{T}G^{\prime}V_{+}, can be expressed as U+=PUS11S22=U+¯U_{+}=P_{U}S_{11}S_{22}=\underline{U_{+}} and V+=PVS12=V+¯V_{+}=P_{V}S_{12}=\underline{V_{+}}, and are exact, as is R¯\underline{R} if G¯\underline{G^{\prime}} is.
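The sign manipulations of this subsection are error-free and can be sketched in a few lines of Python (function name ours); the three recorded signs correspond to the diagonal scalings S11S_{11}, S12S_{12}, and S22S_{22} from which U+U_{+} and V+V_{+} are assembled.

```python
def nonnegative_triangle(G):
    # Error-free sign transformation of Section 3.2.1: make the upper
    # triangle of an upper triangular 2x2 G non-negative, recording the
    # three signs (row 1, column 2, row 2) applied along the way.
    sgn = lambda x: -1.0 if x < 0.0 else 1.0
    (g11, g12), (_, g22) = G
    s1 = sgn(g11)                  # multiply row 1 by sign(g11)
    g11, g12 = s1 * g11, s1 * g12
    s2 = sgn(g12)                  # multiply column 2 by the new sign of g12
    g12, g22 = s2 * g12, s2 * g22
    s3 = sgn(g22)                  # multiply row 2 by the new sign of g22
    g22 = s3 * g22
    return (s1, s2, s3), [[g11, g12], [0.0, g22]]

signs, R = nonnegative_triangle([[-3.0, 2.0], [0.0, -5.0]])
assert R == [[3.0, 2.0], [0.0, 5.0]]
```

Since only signs are flipped, no rounding occurs, matching the exactness claim above.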

3.2.2 A fully pivoted URV when 𝔱=15\mathfrak{t}^{\prime}=15

In all previous cases, a sequence of error-free transformations would bring GG^{\prime} into an upper triangular G′′G^{\prime\prime}, of which xLASV2 can compute the SVD. However, a matrix without zeros either has to be preprocessed into such a form, in the spirit of [14, 15], or its SVD has to be computed by more complicated and numerically less stable formulas, which follow from the annihilation requirement for the off-diagonal matrix elements as

tan(2ϕ)2=g11g21+g12g22g112+g122g212g222,tanψ=g12+g22tanϕg11+g21tanϕ.\frac{\tan(2\phi)}{2}=\frac{g_{11}g_{21}+g_{12}g_{22}}{g_{11}^{2}+g_{12}^{2}-g_{21}^{2}-g_{22}^{2}},\qquad\tan\psi=\frac{g_{12}+g_{22}\tan\phi}{g_{11}+g_{21}\tan\phi}. (19)

A sketched derivation of (19) can be found in Section 1 of the supplementary material.

Opting for the first approach, compute the Frobenius norms of the columns of GG^{\prime}, as w1w_{1} and w2w_{2}. Due to the prescaling, w1¯=hypot(g11¯,g21¯)\underline{w_{1}}=\mathop{\mathrm{hypot}}(\underline{g_{11}^{\prime}},\underline{g_{21}^{\prime}}) and w2¯=hypot(g12¯,g22¯)\underline{w_{2}}=\mathop{\mathrm{hypot}}(\underline{g_{12}^{\prime}},\underline{g_{22}^{\prime}}) cannot overflow. If w1<w2w_{1}<w_{2}, swap the columns and their norms (so that w1w_{1}^{\prime} would be the norm of the new first column of G~\widetilde{G}^{\prime}, and w2w_{2}^{\prime} the norm of the second one). Multiply each row by the sign of its new first element to get G′′G^{\prime\prime}. Swap the rows if g11′′<g21′′g_{11}^{\prime\prime}<g_{21}^{\prime\prime} to get the fully pivoted G′′′G^{\prime\prime\prime}, while the norms remain unchanged. Note that g11′′′g21′′′>0g_{11}^{\prime\prime\prime}\geq g_{21}^{\prime\prime\prime}>0.

Now the QR factorization of G′′′G^{\prime\prime\prime} is computed as UϑTG′′′=R′′U_{\vartheta}^{T}G^{\prime\prime\prime}=R^{\prime\prime}. Then, r21′′=0r_{21}^{\prime\prime}=0, and

r11′′=w1,tanϑ=g21′′′g11′′′,r12′′=g12′′′+g22′′′tanϑsecϑ,r22′′=g22′′′g12′′′tanϑsecϑ.r_{11}^{\prime\prime}=w_{1}^{\prime},\quad\tan\vartheta=\frac{g_{21}^{\prime\prime\prime}}{g_{11}^{\prime\prime\prime}},\quad r_{12}^{\prime\prime}=\frac{g_{12}^{\prime\prime\prime}+g_{22}^{\prime\prime\prime}\tan\vartheta}{\sec\vartheta},\quad r_{22}^{\prime\prime}=\frac{g_{22}^{\prime\prime\prime}-g_{12}^{\prime\prime\prime}\tan\vartheta}{\sec\vartheta}. (20)

All properties of the functions of θ\theta from Section 3.1 also hold for the functions of ϑ\vartheta. The prescaling of GG causes the elements of R′′R^{\prime\prime} to be at most ν/(22)\nu/(2\sqrt{2}) in magnitude.
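A plain-arithmetic Python sketch of the QR step (20) (function name ours; the fma of the actual method is replaced here by an ordinary multiply-add, and w1w_{1}^{\prime} is computed by hypot\mathop{\mathrm{hypot}}):

```python
import math

def qr2(g11, g12, g21, g22):
    # Givens QR factorization (20) of the fully pivoted G''' with
    # g11 >= g21 > 0; r21 is zero by construction.
    t = g21 / g11                  # tan(vartheta) <= 1 due to the pivoting
    sec = math.hypot(t, 1.0)       # sec(vartheta), as in Lemma 3.1
    r11 = math.hypot(g11, g21)     # w1', the norm of the first column
    r12 = (g12 + g22 * t) / sec    # an fma would be used here
    r22 = (g22 - g12 * t) / sec    # the subtraction prone to cancellation
    return r11, r12, r22

# with G''' = [[4, 1], [3, 2]]: tan = 0.75, sec = 1.25, and all results exact
assert qr2(4.0, 1.0, 3.0, 2.0) == (5.0, 2.0, 1.0)
```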

If r12′′¯=0\underline{r_{12}^{\prime\prime}}=0, then 𝔱′′=9\mathfrak{t}^{\prime\prime}=9, and if r22′′¯=0\underline{r_{22}^{\prime\prime}}=0, then 𝔱′′=5\mathfrak{t}^{\prime\prime}=5. In either case R′′¯\underline{R^{\prime\prime}} is processed further as in Section 3.1, while accounting for the already applied transformations. Else, the second column of R′′R^{\prime\prime} is multiplied by the sign of r12′′r_{12}^{\prime\prime} to obtain RR^{\prime}. The second row of RR^{\prime} is multiplied by the sign of r22r_{22}^{\prime} to finalize RR, in which the upper triangle is positive. It is evident how to construct U+U_{+} and V+=V+¯V_{+}=\underline{V_{+}} such that R=U+TGV+R=U_{+}^{T}G^{\prime}V_{+}, since

U+T=S22TUϑTPUTS1T,S1=diag(sign(g~11),sign(g~21)),V+=PVS12.U_{+}^{T}=S_{22}^{T}U_{\vartheta}^{T}P_{U}^{T}S_{1}^{T},\quad S_{1}=\mathop{\mathrm{diag}}(\mathop{\mathrm{sign}}(\tilde{g}_{11}^{\prime}),\mathop{\mathrm{sign}}(\tilde{g}_{21}^{\prime})),\qquad V_{+}=P_{V}S_{12}. (21)

However, U+T¯\underline{U_{+}^{T}} is not explicitly formed, as explained in Section 3.3. Now 𝔱′′=13\mathfrak{t}^{\prime\prime}=13 for R¯\underline{R}.

Lemma 3.3 bounds the relative errors in (some of) the elements of R¯\underline{R}, when possible.

Lemma 3.3.

Assume that no inexact underflow occurs at any stage of the above computation, leading from GG to R¯\underline{R}. Then, r11¯=δ0r11\underline{r_{11}}=\delta_{0}r_{11}, where

1ε=δ0δ0δ0+=1+ε.1-\varepsilon=\delta_{0}^{-}\leq\delta_{0}\leq\delta_{0}^{+}=1+\varepsilon. (22)

If in tanϑ¯=δϑtanϑ\underline{\tan\vartheta}=\delta_{\vartheta}\tan\vartheta holds δϑ=1\delta_{\vartheta}=1, then r12¯=δ1r12\underline{r_{12}}=\delta_{1}^{\prime}r_{12} and r22¯=δ1′′r22\underline{r_{22}}=\delta_{1}^{\prime\prime}r_{22}, where

(1ε)2((1+ε)2+1)/2(1+ε)=δ1δ1,δ1′′δ1+=(1+ε)2((1ε)2+1)/2(1ε).\frac{(1-\varepsilon)^{2}}{\sqrt{((1+\varepsilon)^{2}+1)/2}(1+\varepsilon)}=\delta_{1}^{-}\leq\delta_{1}^{\prime},\delta_{1}^{\prime\prime}\leq\delta_{1}^{+}=\frac{(1+\varepsilon)^{2}}{\sqrt{((1-\varepsilon)^{2}+1)/2}(1-\varepsilon)}. (23)

Else, if g12′′′¯\underline{g_{12}^{\prime\prime\prime}} and g22′′′¯\underline{g_{22}^{\prime\prime\prime}} are of the same sign, then r12¯=δ2r12\underline{r_{12}}=\delta_{2}^{\prime}r_{12}, and if they are of the opposite signs, then r22¯=δ2′′r22\underline{r_{22}}=\delta_{2}^{\prime\prime}r_{22}, where, with 1ε=δϑδϑδϑ+=1+ε1-\varepsilon=\delta_{\vartheta}^{-}\leq\delta_{\vartheta}\leq\delta_{\vartheta}^{+}=1+\varepsilon and δϑ1\delta_{\vartheta}\neq 1,

(1ε)3((1+ε)2+1)/2(1+ε)=δ2<δ2,δ2′′<δ2+=(1+ε)3((1ε)2+1)/2(1ε).\frac{(1-\varepsilon)^{3}}{\sqrt{((1+\varepsilon)^{2}+1)/2}(1+\varepsilon)}=\delta_{2}^{-}<\delta_{2}^{\prime},\delta_{2}^{\prime\prime}<\delta_{2}^{+}=\frac{(1+\varepsilon)^{3}}{\sqrt{((1-\varepsilon)^{2}+1)/2}(1-\varepsilon)}. (24)
Proof.

Eq. (22) follows from the correct rounding of hypot\mathop{\mathrm{hypot}} in the computation of w1¯\underline{w_{1}^{\prime}}.

To prove (23) and (24), solve x±yδϑtanϑ=(x±ytanϑ)(1+ϵ±)x\pm y\delta_{\vartheta}\tan\vartheta=(x\pm y\tan\vartheta)(1+\epsilon_{\pm}) for ϵ±\epsilon_{\pm} with xy0xy\neq 0. After expanding and rearranging the terms, it follows that

ϵ±=±ytanϑx±ytanϑ(δϑ1),δϑ=1+ϵϑ,|ϵϑ|ε.\epsilon_{\pm}=\frac{\pm y\tan\vartheta}{x\pm y\tan\vartheta}(\delta_{\vartheta}-1),\qquad\delta_{\vartheta}=1+\epsilon_{\vartheta},\quad|\epsilon_{\vartheta}|\leq\varepsilon. (25)

If xx and yy are of the same sign and the addition operation is chosen, the first factor on the first right hand side in (25) is above zero and below unity, so |ϵ±|<|δϑ1||\epsilon_{\pm}|<|\delta_{\vartheta}-1| and

δϑ=δ±<δ±<δ±+=δϑ+,δ±=1+ϵ±.\delta_{\vartheta}^{-}=\delta_{\pm}^{-}<\delta_{\pm}<\delta_{\pm}^{+}=\delta_{\vartheta}^{+},\qquad\delta_{\pm}=1+\epsilon_{\pm}. (26)

The same holds if xx and yy are of the opposite signs and the subtraction is taken instead. Specifically, from (20), the bound (26) holds for x=g12′′′¯x^{\prime}=\underline{g_{12}^{\prime\prime\prime}} and y=g22′′′¯y^{\prime}=\underline{g_{22}^{\prime\prime\prime}} of the same sign, with ±=+\pm^{\prime}=+, and for x′′=g22′′′¯x^{\prime\prime}=\underline{g_{22}^{\prime\prime\prime}} and y′′=g12′′′¯y^{\prime\prime}=\underline{g_{12}^{\prime\prime\prime}} of the opposite signs, with ±′′=\pm^{\prime\prime}=-.

With |ϵfma|ε|\epsilon_{\mathop{\mathrm{fma}}}|\leq\varepsilon and δfma=1+ϵfma\delta_{\mathop{\mathrm{fma}}}=1+\epsilon_{\mathop{\mathrm{fma}}}, from the definition of δ±\delta_{\pm} it follows that

fma(±y,tanϑ¯,x)=(x±ytanϑ¯)δfma=(x±ytanϑ)δfmaδ±.\mathop{\mathrm{fma}}(\pm y,\underline{\tan\vartheta},x)=(x\pm y\underline{\tan\vartheta})\delta_{\mathop{\mathrm{fma}}}=(x\pm y\tan\vartheta)\delta_{\mathop{\mathrm{fma}}}\delta_{\pm}. (27)

Due to (16) and (27), with ϑ\vartheta instead of θ\theta and with δ/′′=(1+ϵ/′′)\delta_{/}^{\prime\prime}=(1+\epsilon_{/}^{\prime\prime}) where |ϵ/′′|ε|\epsilon_{/}^{\prime\prime}|\leq\varepsilon, it holds

fl(fma(±!y!,tanϑ¯,x!)secϑ¯)=x!±!y!tanϑsecϑδfmaδ/′′δϑδ±=r?2′′¯δ1!δ±=r?2′′¯δ2!,\mathop{\mathrm{fl}}\left(\frac{\mathop{\mathrm{fma}}(\pm^{!}y^{!},\underline{\tan\vartheta},x^{!})}{\underline{\sec\vartheta}}\right)=\frac{x^{!}\pm^{!}y^{!}\tan\vartheta}{\sec\vartheta}\cdot\frac{\delta_{\mathop{\mathrm{fma}}}\delta_{/}^{\prime\prime}}{\delta_{\vartheta}^{\prime}}\cdot\delta_{\pm}=\underline{r_{?2}^{\prime\prime}}\cdot\delta_{1}^{!}\cdot\delta_{\pm}=\underline{r_{?2}^{\prime\prime}}\delta_{2}^{!}, (28)

where !=!=^{\prime} for ?=1?=1 and !=′′!=^{\prime\prime} for ?=2?=2. Now (23) follows from bounding δ1!=δfmaδ/′′/δϑ\delta_{1}^{!}=\delta_{\mathop{\mathrm{fma}}}\delta_{/}^{\prime\prime}/\delta_{\vartheta}^{\prime} below and above using (16). Since δ2!=δ1!δ±\delta_{2}^{!}=\delta_{1}^{!}\delta_{\pm} in (28), (24) is a consequence of (23). ∎

Lemma 3.3 thus shows that high relative accuracy of all elements of R¯\underline{R} is achieved if no underflow has occurred at any stage and tanϑ¯\underline{\tan\vartheta} has been computed exactly. If tanϑ¯\underline{\tan\vartheta} is inexact, high relative accuracy is guaranteed for r11¯\underline{r_{11}} and exactly one of r12¯\underline{r_{12}} and r22¯\underline{r_{22}}. If it is also desired for the remaining element, which is transformed by an essential subtraction and is thus prone to cancellation, one possibility is to compute it by an expression equivalent to (20), but with tanϑ\tan\vartheta expanded to its definition g21′′′/g11′′′g_{21}^{\prime\prime\prime}/g_{11}^{\prime\prime\prime}, as in

r12′′=(g12′′′g11′′′+g22′′′g21′′′)/(g11′′′secϑ),r22′′=(g22′′′g11′′′g12′′′g21′′′)/(g11′′′secϑ),r_{12}^{\prime\prime}=(g_{12}^{\prime\prime\prime}g_{11}^{\prime\prime\prime}+g_{22}^{\prime\prime\prime}g_{21}^{\prime\prime\prime})/(g_{11}^{\prime\prime\prime}\sec\vartheta),\quad r_{22}^{\prime\prime}=(g_{22}^{\prime\prime\prime}g_{11}^{\prime\prime\prime}-g_{12}^{\prime\prime\prime}g_{21}^{\prime\prime\prime})/(g_{11}^{\prime\prime\prime}\sec\vartheta), (29)

after prescaling the numerator and the denominator in (29) by the largest power of two that avoids overflow of both. A floating-point primitive of the form ab±cda\cdot b\pm c\cdot d with a single rounding [23] can give the correctly rounded numerator, but it has to be emulated in software on most platforms at present [24]. For high relative accuracy of r12′′¯\underline{r_{12}^{\prime\prime}} or r22′′¯\underline{r_{22}^{\prime\prime}}, the absence of inexact underflows is required, except in the prescaling.

Alternatively, the numerators in (29) can be calculated by Kahan’s algorithm for determinants of order two [25], but an overflow-avoiding prescaling is still necessary. It is thus easier to resort to computing (29) in a wider and more precise datatype as

r12′′¯\displaystyle\underline{r_{12}^{\prime\prime}} =fl(fl(FL(FL(g12′′′g11′′′+g22′′′g21′′′)/g11′′′))/secϑ¯),\displaystyle=\mathop{\mathrm{fl}}(\mathop{\mathrm{fl}}(\mathop{\mathrm{FL}}(\mathop{\mathrm{FL}}(g_{12}^{\prime\prime\prime}g_{11}^{\prime\prime\prime}+g_{22}^{\prime\prime\prime}g_{21}^{\prime\prime\prime})/g_{11}^{\prime\prime\prime}))/\underline{\sec\vartheta}), (30)
r22′′¯\displaystyle\underline{r_{22}^{\prime\prime}} =fl(fl(FL(FL(g22′′′g11′′′g12′′′g21′′′)/g11′′′))/secϑ¯).\displaystyle=\mathop{\mathrm{fl}}(\mathop{\mathrm{fl}}(\mathop{\mathrm{FL}}(\mathop{\mathrm{FL}}(g_{22}^{\prime\prime\prime}g_{11}^{\prime\prime\prime}-g_{12}^{\prime\prime\prime}g_{21}^{\prime\prime\prime})/g_{11}^{\prime\prime\prime}))/\underline{\sec\vartheta}).

First, in (30) it is assumed that no underflow has occurred so far, so G′′′¯=G′′′\underline{G^{\prime\prime\prime}}=G^{\prime\prime\prime}. Second, a product of two floating-point values requires 2p2p bits of the significand to be represented exactly if the factors’ significands are encoded with pp bits each. Thus, for single precision, 4848 bits are needed, which is less than the 5353 bits available in double precision. Similarly, for double precision, 106106 bits are needed, which is less than the 113113 bits of a quadruple precision significand. Therefore, every product in (30) is exact if computed using a more precise standard datatype T. The characteristic values of T are the underflow and overflow thresholds μT\mu_{\text{{T}}} and νT\nu_{\text{{T}}}, and εT=2pT\varepsilon_{\text{{T}}}=2^{-p_{\text{{T}}}}. Third, let FL(x)\mathop{\mathrm{FL}}(x) round an infinitely precise result of xx to the nearest value in T. Since all addends in (30) are exact and well above the underflow threshold μT\mu_{\text{{T}}} in magnitude, the rounded result of the addition or the subtraction is relatively accurate, with the error factor (1+ϵ±)(1+\epsilon_{\pm}^{\prime}), |ϵ±|εT|\epsilon_{\pm}^{\prime}|\leq\varepsilon_{\text{{T}}}. This holds even if the result is zero, but since the transformed matrix would then be processed according to its new structure, assume that the result is normal in T. The ensuing division can neither overflow nor underflow in T. Now the quotient rounded by FL\mathop{\mathrm{FL}} is rounded again, by fl\mathop{\mathrm{fl}}, back to the working datatype. This operation can underflow, as can the following division by secϑ¯\underline{\sec\vartheta}. If neither does, the resulting transformed element is relatively accurate. This outlines the proof of Theorem 3.4.

Theorem 3.4.

Assume that no underflow occurs at any stage of the computation leading from GG to R¯\underline{R}. Then, r11¯=δ0r11\underline{r_{11}}=\delta_{0}r_{11}, where for δ0\delta_{0} holds (22). If tanϑ¯\underline{\tan\vartheta} is exact, then r12¯=δ1r12\underline{r_{12}}=\delta_{1}^{\prime}r_{12} and r22¯=δ1′′r22\underline{r_{22}}=\delta_{1}^{\prime\prime}r_{22}, where δ1\delta_{1}^{\prime} and δ1′′\delta_{1}^{\prime\prime} are as in (23). Else, if g12′′′¯\underline{g_{12}^{\prime\prime\prime}} and g22′′′¯\underline{g_{22}^{\prime\prime\prime}} are of the same sign, then r12¯=δ2r12\underline{r_{12}}=\delta_{2}^{\prime}r_{12} and r22¯=δ3r22\underline{r_{22}}=\delta_{3}^{\prime}r_{22}. If they are of the opposite signs, then r22¯=δ2′′r22\underline{r_{22}}=\delta_{2}^{\prime\prime}r_{22} and r12¯=δ3′′r12\underline{r_{12}}=\delta_{3}^{\prime\prime}r_{12}, where δ2\delta_{2}^{\prime} and δ2′′\delta_{2}^{\prime\prime} are as in (24), while δ3\delta_{3}^{\prime} and δ3′′\delta_{3}^{\prime\prime} come from evaluating their corresponding matrix elements as in (30) and are bounded as

(1εT)2(1ε)2((1+ε)2+1)/2(1+ε)=δ3δ3,δ3′′δ3+=(1+εT)2(1+ε)2((1ε)2+1)/2(1ε).\frac{(1-\varepsilon_{\text{{T}}})^{2}(1-\varepsilon)^{2}}{\sqrt{((1+\varepsilon)^{2}+1)/2}(1+\varepsilon)}=\delta_{3}^{-}\leq\delta_{3}^{\prime},\delta_{3}^{\prime\prime}\leq\delta_{3}^{+}=\frac{(1+\varepsilon_{\text{{T}}})^{2}(1+\varepsilon)^{2}}{\sqrt{((1-\varepsilon)^{2}+1)/2}(1-\varepsilon)}. (31)
Proof.

It remains to prove (31), since the other relations follow from Lemma 3.3.

Each element of G′′′G^{\prime\prime\prime} is at most ν/4\nu/4 and at least μ\mu in magnitude. Therefore, a difference (in essence) of their exact products cannot exceed ν2/16<νT\nu^{2}/16<\nu_{\text{{T}}} in magnitude in a standard T. At least one element is above ν/8\nu/8 in magnitude due to the prescaling, so this difference, if not exactly zero, is above εμν/8μT\varepsilon\mu\nu/8\gg\mu_{\text{{T}}} in magnitude. Thus, the quotient of this difference and g11′′′g_{11}^{\prime\prime\prime} is above εμ(1εT)/2>μT\varepsilon\mu(1-\varepsilon_{\text{{T}}})/2>\mu_{\text{{T}}} and, due to the prescaling and pivoting, not above ν/2νT\nu/2\ll\nu_{\text{{T}}} in magnitude. For (30) it therefore holds

+¯=FL(+)=+(1+ϵ+),+=g12′′′g11′′′+g22′′′g21′′′,|ϵ+|εT,¯=FL()=(1+ϵ),=g22′′′g11′′′g12′′′g21′′′,|ϵ|εT,FL(±¯/g11′′′)=(±/g11′′′)(1+ϵ±)(1+ϵ/′′′),|ϵ/′′′|εT.\begin{gathered}\begin{aligned} \underline{\star_{+}}&=\mathop{\mathrm{FL}}(\star_{+})=\star_{+}(1+\epsilon_{+}^{\prime}),\quad\star_{+}=g_{12}^{\prime\prime\prime}g_{11}^{\prime\prime\prime}+g_{22}^{\prime\prime\prime}g_{21}^{\prime\prime\prime},\quad|\epsilon_{+}^{\prime}|\leq\varepsilon_{\text{{T}}},\\ \underline{\star_{-}}&=\mathop{\mathrm{FL}}(\star_{-})=\star_{-}(1+\epsilon_{-}^{\prime}),\quad\star_{-}=g_{22}^{\prime\prime\prime}g_{11}^{\prime\prime\prime}-g_{12}^{\prime\prime\prime}g_{21}^{\prime\prime\prime},\quad|\epsilon_{-}^{\prime}|\leq\varepsilon_{\text{{T}}},\end{aligned}\\ \mathop{\mathrm{FL}}(\underline{\star_{\pm}}/g_{11}^{\prime\prime\prime})=(\star_{\pm}/g_{11}^{\prime\prime\prime})(1+\epsilon_{\pm}^{\prime})(1+\epsilon_{/}^{\prime\prime\prime}),\quad|\epsilon_{/}^{\prime\prime\prime}|\leq\varepsilon_{\text{{T}}}.\end{gathered} (32)

The quotient FL(±¯/g11′′′)\mathop{\mathrm{FL}}(\underline{\star_{\pm}}/g_{11}^{\prime\prime\prime}), converted to the working precision with fl(FL(±¯/g11′′′))\mathop{\mathrm{fl}}(\mathop{\mathrm{FL}}(\underline{\star_{\pm}}/g_{11}^{\prime\prime\prime})), is possibly not correctly rounded from the value of ±¯/g11′′′\underline{\star_{\pm}}/g_{11}^{\prime\prime\prime} due to the double rounding. Under the assumption that no underflow in the working precision occurs, it follows that

fl(FL(±¯/g11′′′))=(±/g11′′′)(1+ϵ±)(1+ϵ/′′′)(1+ϵfl)=(±/g11′′′)δ±,|ϵfl|ε,\mathop{\mathrm{fl}}(\mathop{\mathrm{FL}}(\underline{\star_{\pm}}/g_{11}^{\prime\prime\prime}))=(\star_{\pm}/g_{11}^{\prime\prime\prime})(1+\epsilon_{\pm}^{\prime})(1+\epsilon_{/}^{\prime\prime\prime})(1+\epsilon_{\mathop{\mathrm{fl}}})=(\star_{\pm}/g_{11}^{\prime\prime\prime})\delta_{\pm}^{\prime},\quad|\epsilon_{\mathop{\mathrm{fl}}}|\leq\varepsilon, (33)

with δ±=(1+ϵ±)(1+ϵ/′′′)(1+ϵfl)\delta_{\pm}^{\prime}=(1+\epsilon_{\pm}^{\prime})(1+\epsilon_{/}^{\prime\prime\prime})(1+\epsilon_{\mathop{\mathrm{fl}}}). For the final division by secϑ¯\underline{\sec\vartheta}, due to (16), holds

fl(fl(FL(±¯/g11′′′))secϑ¯)=±g11′′′secϑδ±δϑδ/′′′′,δ/′′′′=(1+ϵ/′′′′),|ϵ/′′′′|ε.\mathop{\mathrm{fl}}\left(\frac{\mathop{\mathrm{fl}}(\mathop{\mathrm{FL}}(\underline{\star_{\pm}}/g_{11}^{\prime\prime\prime}))}{\underline{\sec\vartheta}}\right)=\frac{\star_{\pm}}{g_{11}^{\prime\prime\prime}\sec\vartheta}\frac{\delta_{\pm}^{\prime}}{\delta_{\vartheta}^{\prime}}\delta_{/}^{\prime\prime\prime\prime},\quad\delta_{/}^{\prime\prime\prime\prime}=(1+\epsilon_{/}^{\prime\prime\prime\prime}),\quad|\epsilon_{/}^{\prime\prime\prime\prime}|\leq\varepsilon. (34)

By fixing the sign in the ±\pm subscript in (32), (33), and (34), let δ3=δδ/′′′′/δϑ\delta_{3}^{\prime}=\delta_{-}^{\prime}\delta_{/}^{\prime\prime\prime\prime}/\delta_{\vartheta}^{\prime} and δ3′′=δ+δ/′′′′/δϑ\delta_{3}^{\prime\prime}=\delta_{+}^{\prime}\delta_{/}^{\prime\prime\prime\prime}/\delta_{\vartheta}^{\prime}. Now bound δ3\delta_{3}^{\prime} and δ3′′\delta_{3}^{\prime\prime} below by δ3\delta_{3}^{-} and above by δ3+\delta_{3}^{+}, by combining the appropriate lower and upper bounds for δ±\delta_{\pm}^{\prime}, δϑ\delta_{\vartheta}^{\prime} from Lemma 3.1, and δ/′′′′\delta_{/}^{\prime\prime\prime\prime}, where

(1εT)2(1ε)(1ε)((1+ε)2+1)/2(1+ε)=δ3δ3,δ3′′δ3+=(1+εT)2(1+ε)(1+ε)((1ε)2+1)/2(1ε),\frac{(1-\varepsilon_{\text{{T}}})^{2}(1-\varepsilon)\cdot(1-\varepsilon)}{\sqrt{((1+\varepsilon)^{2}+1)/2}(1+\varepsilon)}=\delta_{3}^{-}\leq\delta_{3}^{\prime},\delta_{3}^{\prime\prime}\leq\delta_{3}^{+}=\frac{(1+\varepsilon_{\text{{T}}})^{2}(1+\varepsilon)\cdot(1+\varepsilon)}{\sqrt{((1-\varepsilon)^{2}+1)/2}(1-\varepsilon)},

which proves (31): the lower bound δ3\delta_{3}^{-} follows by minimizing the numerator and maximizing the denominator in the expressions for δ3\delta_{3}^{\prime} and δ3′′\delta_{3}^{\prime\prime}, and the upper bound δ3+\delta_{3}^{+} vice versa, as in the proof of Lemma 3.3. ∎
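Taking binary32 as the working datatype and binary64 as T, the mixed-precision evaluation (30) can be sketched with NumPy as follows (names ours); every product is exact in T because 224=48532\cdot 24=48\leq 53 significand bits.

```python
import numpy as np

def r12_r22_wider(g11, g12, g21, g22, sec):
    # Sketch of (30): binary32 as the working datatype, binary64 as T.
    # The products below are exact in T, so FL() rounds only the
    # sum/difference and the quotient; fl() is the final rounding back
    # to binary32 (a double rounding overall, as discussed above).
    T = np.float64
    num_p = T(g12) * T(g11) + T(g22) * T(g21)  # FL(g12*g11 + g22*g21)
    num_m = T(g22) * T(g11) - T(g12) * T(g21)  # FL(g22*g11 - g12*g21)
    r12 = np.float32(np.float32(num_p / T(g11)) / sec)
    r22 = np.float32(np.float32(num_m / T(g11)) / sec)
    return r12, r22

g = [np.float32(x) for x in (4.0, 1.0, 3.0, 2.0)]  # a pivoted G'''
r12, r22 = r12_r22_wider(*g, np.float32(1.25))     # sec(vartheta) = 1.25
assert (float(r12), float(r22)) == (2.0, 1.0)
```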

Therefore, it is possible to compute R¯\underline{R} with high relative accuracy in the absence of underflows, in the working precision only, but it is easier to employ two multiplications, one addition or subtraction, and one division in a wider, more precise datatype. Table 1 shows by how many ε\varepsilons the lower and the upper bounds for δ1\delta_{1}, δ2\delta_{2}, and δ3\delta_{3} from (23), (24), and (31), respectively, differ from unity. The quantities in the table’s header were computed symbolically as algebraic expressions by substituting ε\varepsilon and εT\varepsilon_{\text{{T}}} with their defining powers of two, then approximated numerically with pTp_{\text{{T}}} decimal digits, and rounded upwards to nine digits after the decimal point, by a Wolfram Language script (the relerr.wls file in the code supplement, executed by the Wolfram Engine 14.0.0 for macOS (Intel)).

Table 1: Lower and upper bounds for δ1\delta_{1}, δ2\delta_{2}, and δ3\delta_{3} in single and double precision.
\topruleprecision (1δ1)/ε(1-\delta_{1}^{-})/\varepsilon (δ1+1)/ε(\delta_{1}^{+}-1)/\varepsilon (1δ2)/ε(1-\delta_{2}^{-})/\varepsilon (δ2+1)/ε(\delta_{2}^{+}-1)/\varepsilon (1δ3)/ε(1-\delta_{3}^{-})/\varepsilon (δ3+1)/ε(\delta_{3}^{+}-1)/\varepsilon
\midrulesingle 3.499999665 3.500000336 4.499999457 4.500000544 3.499999669 3.500000340
double 3.500000000 3.500000001 4.500000000 4.500000001 3.500000000 3.500000001
\botrule
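The double-precision row of Table 1 for δ1\delta_{1} can be spot-checked without the Wolfram Language; a sketch with Python’s decimal module, assuming only the definitions in (23):

```python
from decimal import Decimal, getcontext

getcontext().prec = 80                    # plenty of guard digits
one, eps = Decimal(1), Decimal(2) ** -53  # eps for double precision
# delta_1^- and delta_1^+ as defined in (23)
d1m = (one - eps) ** 2 / ((((one + eps) ** 2 + one) / 2).sqrt() * (one + eps))
d1p = (one + eps) ** 2 / ((((one - eps) ** 2 + one) / 2).sqrt() * (one - eps))
lo = (one - d1m) / eps                    # ~3.5, cf. the double row of Table 1
hi = (d1p - one) / eps                    # ~3.5 as well, up to O(eps)
assert abs(lo - Decimal("3.5")) < Decimal("1e-8")
assert abs(hi - Decimal("3.5")) < Decimal("1e-8")
```

The first-order expansion δ113.5ε\delta_{1}^{-}\approx 1-3.5\varepsilon explains why both quotients agree with 3.53.5 to nine digits in Table 1.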

3.3 The SVD of RR and GG

For R¯\underline{R}, 𝔱′′=13\mathfrak{t}^{\prime\prime}=13 now holds. If r11¯<r22¯\underline{r_{11}}<\underline{r_{22}}, the diagonal elements of R¯\underline{R} have to be swapped, similarly to xLASV2. This is done by symmetrically permuting R¯T\underline{R}^{T} with the exchange permutation P~=[0110]\widetilde{P}=\left[\begin{smallmatrix}0&1\\ 1&0\end{smallmatrix}\right]. Multiplying R~=P~TRTP~=U~Σ~V~T\widetilde{R}=\widetilde{P}^{T}R^{T}\widetilde{P}=\widetilde{U}\widetilde{\Sigma}\widetilde{V}^{T} by P~\widetilde{P} from the left and P~T\widetilde{P}^{T} from the right gives

RT=P~U~Σ~V~TP~T,R=P~V~Σ~U~TP~T=UˇΣˇVˇT,R^{T}=\widetilde{P}\widetilde{U}\widetilde{\Sigma}\widetilde{V}^{T}\widetilde{P}^{T},\qquad R=\widetilde{P}\widetilde{V}\widetilde{\Sigma}\widetilde{U}^{T}\widetilde{P}^{T}=\check{U}\check{\Sigma}\check{V}^{T},

where Uˇ=P~V~\check{U}=\widetilde{P}\widetilde{V}, Σˇ=Σ~\check{\Sigma}=\widetilde{\Sigma}, and Vˇ=P~U~\check{V}=\widetilde{P}\widetilde{U}. Therefore, having applied the permutation P~\widetilde{P},

U~=U~φP𝝈~,V~=V~ψP𝝈~,Uˇ=Uˇ𝝋P𝝈~,Vˇ=Vˇ𝝍P𝝈~,Uˇ𝝋=P~V~ψ=[sinψcosψcosψsinψ]=[sinψcosψcosψsinψ]S2=[cos𝝋sin𝝋sin𝝋cos𝝋]S2,Vˇ𝝍=P~U~φ=[sinφcosφcosφsinφ]=[sinφcosφcosφsinφ]S2=[cos𝝍sin𝝍sin𝝍cos𝝍]S2,S2=[1001],cos𝝋=sinψ,sin𝝋=cosψ,tan𝝋=1/tanψ,cos𝝍=sinφ,sin𝝍=cosφ,tan𝝍=1/tanφ,\begin{gathered}\widetilde{U}=\widetilde{U}_{\varphi}P_{\tilde{\bm{\sigma}}},\quad\widetilde{V}=\widetilde{V}_{\psi}P_{\tilde{\bm{\sigma}}},\qquad\check{U}=\check{U}_{\bm{\varphi}}P_{\tilde{\bm{\sigma}}},\quad\check{V}=\check{V}_{\bm{\psi}}P_{\tilde{\bm{\sigma}}},\\ \begin{aligned} \check{U}_{\bm{\varphi}}&=\widetilde{P}\widetilde{V}_{\psi}=\begin{bmatrix}\sin\psi&\hphantom{-}\cos\psi\\ \cos\psi&-\sin\psi\end{bmatrix}=\begin{bmatrix}\sin\psi&-\cos\psi\\ \cos\psi&\hphantom{-}\sin\psi\end{bmatrix}S_{2}=\begin{bmatrix}\cos{\bm{\varphi}}&-\sin{\bm{\varphi}}\\ \sin{\bm{\varphi}}&\hphantom{-}\cos{\bm{\varphi}}\end{bmatrix}S_{2},\\ \check{V}_{\bm{\psi}}&=\widetilde{P}\widetilde{U}_{\varphi}=\begin{bmatrix}\sin\varphi&\hphantom{-}\cos\varphi\\ \cos\varphi&-\sin\varphi\end{bmatrix}=\begin{bmatrix}\sin\varphi&-\cos\varphi\\ \cos\varphi&\hphantom{-}\sin\varphi\end{bmatrix}S_{2}=\begin{bmatrix}\cos{\bm{\psi}}&-\sin{\bm{\psi}}\\ \sin{\bm{\psi}}&\hphantom{-}\cos{\bm{\psi}}\end{bmatrix}S_{2},\end{aligned}\\ S_{2}=\begin{bmatrix}1&\hphantom{-}0\\ 0&-1\end{bmatrix},\qquad\begin{gathered}\cos{\bm{\varphi}}=\sin\psi,\quad\sin{\bm{\varphi}}=\cos\psi,\quad\tan{\bm{\varphi}}=1/\tan\psi,\\ \cos{\bm{\psi}}=\sin\varphi,\quad\sin{\bm{\psi}}=\cos\varphi,\quad\tan{\bm{\psi}}=1/\tan\varphi,\end{gathered}\end{gathered} (35)

where U~φ\widetilde{U}_{\varphi}, V~ψ\widetilde{V}_{\psi}, and P𝝈~P_{\tilde{\bm{\sigma}}} come from the SVD of R~\widetilde{R} in Section 3.3.1. If r11¯r22¯\underline{r_{11}}\geq\underline{r_{22}} then R¯~=R¯\underline{\widetilde{R}}=\underline{R}, Uˇ=U~\check{U}=\widetilde{U}, Vˇ=V~\check{V}=\widetilde{V}, (35) is not used, and let S2=IS_{2}=I, 𝝋=φ\bm{\varphi}=\varphi, and 𝝍=ψ\bm{\psi}=\psi.

Assume that no inexact underflow occurs in the computation of R¯\underline{R}. If the initial matrix GG was triangular, let δ11=δ12=δ22=1\delta_{11}=\delta_{12}=\delta_{22}=1. Else, let δ11\delta_{11}, δ12\delta_{12}, and δ22\delta_{22} stand for the error factors of the elements of R¯~\underline{\widetilde{R}} that correspond to those from Theorem 3.4, i.e.,

R~=[r~11r~120r~22],R¯~=[r~11¯r~12¯0r~22¯]=[r~11δ11r~12δ120r~22δ22],r~11¯r~22¯.\widetilde{R}=\begin{bmatrix}\tilde{r}_{11}&\tilde{r}_{12}\\ 0&\tilde{r}_{22}\end{bmatrix},\qquad\underline{\widetilde{R}}=\begin{bmatrix}\underline{\tilde{r}_{11}}&\underline{\tilde{r}_{12}}\\ 0&\underline{\tilde{r}_{22}}\end{bmatrix}=\begin{bmatrix}\tilde{r}_{11}\delta_{11}&\tilde{r}_{12}\delta_{12}\\ 0&\tilde{r}_{22}\delta_{22}\end{bmatrix},\quad\underline{\tilde{r}_{11}}\geq\underline{\tilde{r}_{22}}. (36)

It remains to compute the SVD of R¯~\underline{\widetilde{R}} by an alternative to xLASV2, as described in Section 3.3.1, and to assemble the SVD of GG, as explained in Section 3.3.2.

3.3.1 The SVD of R~\widetilde{R}

The key observation in this part is that the traditional [5, Eq. (4.12)] formula for tan(2φ)/2\tan(2\varphi)/2, involving the squares of the elements of R~\widetilde{R}, can be simplified to an expression that does not require any explicit squaring if the hypotenuse calculation is considered a basic arithmetic operation. The following two expressions for tan(2φ)\tan(2\varphi) are equivalent,

tan(2φ)=2r~12r~22r~112+r~122r~222=2r~12r~22(hr~22)(h+r~22),h=r~112+r~122.\tan(2\varphi)=\frac{2\tilde{r}_{12}\tilde{r}_{22}}{\tilde{r}_{11}^{2}+\tilde{r}_{12}^{2}-\tilde{r}_{22}^{2}}=\frac{2\tilde{r}_{12}\tilde{r}_{22}}{(h-\tilde{r}_{22})(h+\tilde{r}_{22})},\quad h=\sqrt{\tilde{r}_{11}^{2}+\tilde{r}_{12}^{2}}. (37)

With h¯=hypot(r~11¯,r~12¯)\underline{h}=\mathop{\mathrm{hypot}}(\underline{\tilde{r}_{11}},\underline{\tilde{r}_{12}}), let 𝐬=h¯r~22¯\mathbf{s}=\underline{h}\oplus\underline{\tilde{r}_{22}} and 𝐝=h¯r~22¯\mathbf{d}=\underline{h}\ominus\underline{\tilde{r}_{22}} be the sum and the difference of h¯\underline{h} and r~22¯\underline{\tilde{r}_{22}} as in (4) and (5), respectively, for h¯>r~22¯\underline{h}>\underline{\tilde{r}_{22}} (with the prescaling as employed, \ominus can be replaced by the subtraction d=hr~22d=h-\tilde{r}_{22} and 𝐝=(ed,fd)\mathbf{d}=(e_{d},f_{d})). From (36) and since 0<r~ijν/(22)0<\tilde{r}_{ij}\leq\nu/(2\sqrt{2}) for 1ij21\leq i\leq j\leq 2, it holds 0<r~22¯r~11¯h¯ν0<\underline{\tilde{r}_{22}}\leq\underline{\tilde{r}_{11}}\leq\underline{h}\leq\nu. Thus (37) can be re-written using (6) and (7), with 𝐫~12=(er~12¯,fr~12¯)\tilde{\mathbf{r}}_{12}=(e_{\underline{\tilde{r}_{12}}},f_{\underline{\tilde{r}_{12}}}) and 𝐫~22=(er~22¯,fr~22¯)\tilde{\mathbf{r}}_{22}=(e_{\underline{\tilde{r}_{22}}},f_{\underline{\tilde{r}_{22}}}), as

h¯=r~22¯tan(2φ)¯=,h¯>r~22¯tan(2φ)¯=fl((2𝐫~12𝐫~22)(𝐝𝐬)),\underline{h}=\underline{\tilde{r}_{22}}\implies\underline{\tan(2\varphi)}=\infty,\quad\underline{h}>\underline{\tilde{r}_{22}}\implies\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\odot\tilde{\mathbf{r}}_{12}\odot\tilde{\mathbf{r}}_{22})\oslash(\mathbf{d}\odot\mathbf{s})), (38)

where the computation’s precision is unchanged, but the exponent range is widened.

In (19), the denominator of the expression for tan(2ϕ)\tan(2\phi), 𝖽=g112+g122g212g222\mathsf{d}=g_{11}^{2}+g_{12}^{2}-g_{21}^{2}-g_{22}^{2}, can also be computed using hypot\mathop{\mathrm{hypot}}, without explicitly squaring any matrix element, as

𝖽=(g112+g122g212+g222)(g112+g122+g212+g222).\mathsf{d}=\left(\sqrt{g_{11}^{2}+g_{12}^{2}}-\sqrt{g_{21}^{2}+g_{22}^{2}}\right)\left(\sqrt{g_{11}^{2}+g_{12}^{2}}+\sqrt{g_{21}^{2}+g_{22}^{2}}\right).

Only if 𝔱13\mathfrak{t}^{\prime}\cong 13 can it happen that fl(r~11¯/r~12¯)<ε\mathop{\mathrm{fl}}(\underline{\tilde{r}_{11}}/\underline{\tilde{r}_{12}})<\varepsilon. In the first denominator in (37), r~112\tilde{r}_{11}^{2} and r~222\tilde{r}_{22}^{2} then have a negligible effect on r~122\tilde{r}_{12}^{2}, so the expression for tan(2φ)¯\underline{\tan(2\varphi)} can be simplified, as in xLASV2, to the same formula that the case r~11¯=r~22¯\underline{\tilde{r}_{11}}=\underline{\tilde{r}_{22}} would imply,

tan(2φ)¯=fl((2r~22¯)/r~12¯).\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\underline{\tilde{r}_{22}})/\underline{\tilde{r}_{12}}). (39)

Let 𝐫~11=(er~11¯,fr~11¯)\tilde{\mathbf{r}}_{11}=(e_{\underline{\tilde{r}_{11}}},f_{\underline{\tilde{r}_{11}}}). If r~12¯=r~22¯\underline{\tilde{r}_{12}}=\underline{\tilde{r}_{22}}, (37) can be simplified by explicit squaring to

tan(2φ)¯=fl((2𝐫~12𝐫~22)(𝐫~11𝐫~11)).\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\odot\tilde{\mathbf{r}}_{12}\odot\tilde{\mathbf{r}}_{22})\oslash(\tilde{\mathbf{r}}_{11}\odot\tilde{\mathbf{r}}_{11})). (40)

Both (39) and (40) admit a simple roundoff analysis. However, (38) does not, due to a subtraction of potentially inexact values of similar magnitude when computing 𝐝\mathbf{d}. Section 4 shows, with high probability through exhaustive testing, that (38) does not cause excessive relative errors in the singular values for 𝔱13\mathfrak{t}^{\prime}\cong 13, nor for 𝔱=15\mathfrak{t}^{\prime}=15 if the range of the exponents of the elements of input matrices is limited in width to about (eνeμ)/2(e_{\nu}-e_{\mu})/2. If hypot\mathop{\mathrm{hypot}} is not correctly rounded, the procedure from [14, 15] for computing tan(2φ)¯\underline{\tan(2\varphi)} without squaring the input values can be adopted instead of (38), as shown in Algorithm 1, but still without theoretical relative error bounds.

Algorithm 1 Computation of the functions of φ\varphi from R¯~\underline{\widetilde{R}}.
1:  if r~11¯=r~22¯\underline{\tilde{r}_{11}}=\underline{\tilde{r}_{22}} then
2:     tan(2φ)¯=fl((2r~22¯)/r~12¯)\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\underline{\tilde{r}_{22}})/\underline{\tilde{r}_{12}}) /​/ (39)
3:  else if r~12¯=r~22¯\underline{\tilde{r}_{12}}=\underline{\tilde{r}_{22}} then
4:     tan(2φ)¯=fl((2𝐫~12𝐫~22)(𝐫~11𝐫~11))\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\odot\tilde{\mathbf{r}}_{12}\odot\tilde{\mathbf{r}}_{22})\oslash(\tilde{\mathbf{r}}_{11}\odot\tilde{\mathbf{r}}_{11})) /​/ (40)
5:  else if fl(r~11¯/r~12¯)<ε\mathop{\mathrm{fl}}(\underline{\tilde{r}_{11}}/\underline{\tilde{r}_{12}})<\varepsilon then /​/ only if 𝔱13\mathfrak{t}^{\prime}\cong 13
6:     tan(2φ)¯=fl((2r~22¯)/r~12¯)\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\underline{\tilde{r}_{22}})/\underline{\tilde{r}_{12}}) /​/ (39)
7:  else if hypot\mathop{\mathrm{hypot}} is not correctly rounded then /​/ [14, 15]
8:     if r~11¯>r~12¯\underline{\tilde{r}_{11}}>\underline{\tilde{r}_{12}} then
9:        x¯=fl(r~12¯/r~11¯);y¯=fl(r~22¯/r~11¯)\underline{x}=\mathop{\mathrm{fl}}(\underline{\tilde{r}_{12}}/\underline{\tilde{r}_{11}});\quad\underline{y}=\mathop{\mathrm{fl}}(\underline{\tilde{r}_{22}}/\underline{\tilde{r}_{11}})
10:        tan(2φ)¯=fl(fl((2x¯)y¯)/max(fma(fl(x¯y¯),fl(x¯+y¯),1),0))\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}(\mathop{\mathrm{fl}}((2\underline{x})\underline{y})/\max(\mathop{\mathrm{fma}}(\mathop{\mathrm{fl}}(\underline{x}-\underline{y}),\mathop{\mathrm{fl}}(\underline{x}+\underline{y}),1),0))
11:     else /​/ r~11¯r~12¯\underline{\tilde{r}_{11}}\leq\underline{\tilde{r}_{12}}
12:        x¯=fl(r~11¯/r~12¯);y¯=fl(r~22¯/r~12¯)\underline{x}=\mathop{\mathrm{fl}}(\underline{\tilde{r}_{11}}/\underline{\tilde{r}_{12}});\quad\underline{y}=\mathop{\mathrm{fl}}(\underline{\tilde{r}_{22}}/\underline{\tilde{r}_{12}})
13:        tan(2φ)¯=fl((2y¯)/max(fma(fl(x¯y¯),fl(x¯+y¯),1),0))\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\underline{y})/\max(\mathop{\mathrm{fma}}(\mathop{\mathrm{fl}}(\underline{x}-\underline{y}),\mathop{\mathrm{fl}}(\underline{x}+\underline{y}),1),0))
14:     end if
15:  else /​/ the general case
16:     h¯=hypot(r~11¯,r~12¯)\underline{h}=\mathop{\mathrm{hypot}}(\underline{\tilde{r}_{11}},\underline{\tilde{r}_{12}})
17:     if h¯=r~22¯\underline{h}=\underline{\tilde{r}_{22}} then
18:        tan(2φ)¯=\underline{\tan(2\varphi)}=\infty /​/ (38)
19:     else /​/ h¯>r~22¯\underline{h}>\underline{\tilde{r}_{22}}
20:        𝐬=h¯r~22¯;𝐝=h¯r~22¯;tan(2φ)¯=fl((2𝐫~12𝐫~22)(𝐝𝐬))\mathbf{s}=\underline{h}\oplus\underline{\tilde{r}_{22}};\quad\mathbf{d}=\underline{h}\ominus\underline{\tilde{r}_{22}};\quad\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\odot\tilde{\mathbf{r}}_{12}\odot\tilde{\mathbf{r}}_{22})\oslash(\mathbf{d}\odot\mathbf{s})) /​/ (38)
21:     end if
22:  end if
23:  if tan(2φ)¯=\underline{\tan(2\varphi)}=\infty then
24:     tanφ¯=1\underline{\tan\varphi}=1
25:  else /​/ the general case
26:     tanφ¯=fl(tan(2φ)¯/fl(1+hypot(tan(2φ)¯,1)))\underline{\tan\varphi}=\mathop{\mathrm{fl}}(\underline{\tan(2\varphi)}/\mathop{\mathrm{fl}}(1+\mathop{\mathrm{hypot}}(\underline{\tan(2\varphi)},1))) /​/ (3)
27:  end if
28:  secφ¯=hypot(tanφ¯,1);cosφ¯=fl(1/secφ¯);sinφ¯=fl(tanφ¯/secφ¯)\underline{\sec\varphi}=\mathop{\mathrm{hypot}}(\underline{\tan\varphi},1);\quad\underline{\cos\varphi}=\mathop{\mathrm{fl}}(1/\underline{\sec\varphi});\quad\underline{\sin\varphi}=\mathop{\mathrm{fl}}(\underline{\tan\varphi}/\underline{\sec\varphi})

All cases of Algorithm 1 lead to 0tanφ¯10\leq\underline{\tan\varphi}\leq 1. From tanφ¯\underline{\tan\varphi} follow secφ¯\underline{\sec\varphi}, as well as cosφ¯\underline{\cos\varphi} and sinφ¯\underline{\sin\varphi} when explicitly required, which completely determines U~φ¯\underline{\widetilde{U}_{\varphi}}.

To determine V~ψ¯\underline{\widetilde{V}_{\psi}}, tanψ\tan\psi is obtained from tanφ\tan\varphi (see [5, Eq. (4.10)]) as

tanψ=(r~12+r~22tanφ)/r~11,tanψ¯=fl(t¯/r~11¯),t¯=fma(r~22¯,tanφ¯,r~12¯).\tan\psi=(\tilde{r}_{12}+\tilde{r}_{22}\tan\varphi)/\tilde{r}_{11},\qquad\underline{\tan\psi}=\mathop{\mathrm{fl}}(\underline{t}/\underline{\tilde{r}_{11}}),\quad\underline{t}=\mathop{\mathrm{fma}}(\underline{\tilde{r}_{22}},\underline{\tan\varphi},\underline{\tilde{r}_{12}}). (41)

Let 𝐬𝐞𝐜φ=(esecφ¯,fsecφ¯)\mathop{\mathbf{sec}}\varphi=(e_{\underline{\sec\varphi}},f_{\underline{\sec\varphi}}). If tanψ¯\underline{\tan\psi} is finite (e.g., when 𝔱=15\mathfrak{t}^{\prime}=15, due to the pivoting [14, Theorem 1]), so is secψ¯\underline{\sec\psi}. Then, let 𝐬𝐞𝐜ψ=(esecψ¯,fsecψ¯)\mathop{\mathbf{sec}}\psi=(e_{\underline{\sec\psi}},f_{\underline{\sec\psi}}). By fixing the evaluation order for reproducibility, the singular values 𝝈~1′′\tilde{\bm{\sigma}}_{1}^{\prime\prime} and 𝝈~2′′\tilde{\bm{\sigma}}_{2}^{\prime\prime} of R¯~\underline{\widetilde{R}} are computed [8, 15] as

𝐬ψφ=𝐬𝐞𝐜φ𝐬𝐞𝐜ψ,𝝈~2′′=𝐫~22𝐬ψφ,𝝈~1′′=𝐫~11𝐬ψφ.\mathbf{s}_{\psi}^{\varphi}=\mathop{\mathbf{sec}}\varphi\oslash\mathop{\mathbf{sec}}\psi,\quad\tilde{\bm{\sigma}}_{2}^{\prime\prime}=\tilde{\mathbf{r}}_{22}\odot\mathbf{s}_{\psi}^{\varphi},\quad\tilde{\bm{\sigma}}_{1}^{\prime\prime}=\tilde{\mathbf{r}}_{11}\oslash\mathbf{s}_{\psi}^{\varphi}. (42)

If tanψ¯\underline{\tan\psi} overflows due to a small r~11¯\underline{\tilde{r}_{11}} (the prescaling ensures that t¯\underline{t} is always finite), let 𝐭=(et¯,ft¯)\mathbf{t}=(e_{\underline{t}},f_{\underline{t}}). In this case, similar to the one handled by xLASV2 when fl(r~11¯/r~12¯)<ε\mathop{\mathrm{fl}}(\underline{\tilde{r}_{11}}/\underline{\tilde{r}_{12}})<\varepsilon, it holds that secψtanψ\sec\psi\gtrapprox\tan\psi, so cosψ1/tanψ\cos\psi\lessapprox 1/\tan\psi. To confine subnormal values to outputs only, let

𝐜𝐨𝐬ψ=𝐫~11𝐭,cosψ¯=fl(𝐜𝐨𝐬ψ),sinψ¯=1.\mathop{\mathbf{cos}}\psi=\tilde{\mathbf{r}}_{11}\oslash\mathbf{t},\quad\underline{\cos\psi}=\mathop{\mathrm{fl}}(\mathop{\mathbf{cos}}\psi),\quad\underline{\sin\psi}=1. (43)

By substituting 1/𝐜𝐨𝐬ψ𝐭𝐫~111/\mathop{\mathbf{cos}}\psi\approx\mathbf{t}\oslash\tilde{\mathbf{r}}_{11} from (43) for 𝐬𝐞𝐜ψ\mathop{\mathbf{sec}}\psi in (42), simplifying the results, and fixing the evaluation order, the singular values of R¯~\underline{\widetilde{R}} in this case are obtained as

𝝈~1′′=𝐭𝐬𝐞𝐜φ,𝝈~2′′=𝐫~22(𝐬𝐞𝐜φ𝐜𝐨𝐬ψ).\tilde{\bm{\sigma}}_{1}^{\prime\prime}=\mathbf{t}\oslash\mathop{\mathbf{sec}}\varphi,\quad\tilde{\bm{\sigma}}_{2}^{\prime\prime}=\tilde{\mathbf{r}}_{22}\odot(\mathop{\mathbf{sec}}\varphi\odot\mathop{\mathbf{cos}}\psi). (44)

From (41), tanψ>ν\tan\psi>\nu implies tanφtan(2φ)/2r~22/r~12r~11/r~12<1/(ν1)\tan\varphi\lessapprox\tan(2\varphi)/2\lessapprox\tilde{r}_{22}/\tilde{r}_{12}\leq\tilde{r}_{11}/\tilde{r}_{12}<1/(\nu-1), so secφ1\sec\varphi\gtrapprox 1. Therefore, 𝐬𝐞𝐜φ\mathop{\mathbf{sec}}\varphi may be eliminated from (44), as is done in xLASV2.

The SVD of R¯~\underline{\widetilde{R}} has thus been computed (without explicitly forming U~φ¯\underline{\widetilde{U}_{\varphi}} and V~ψ¯\underline{\widetilde{V}_{\psi}}) as

R¯~\displaystyle\underline{\widetilde{R}} [cosφ¯sinφ¯sinφ¯cosφ¯]P𝝈~P𝝈~T[𝝈~1′′00𝝈~2′′]P𝝈~P𝝈~T[cosψ¯sinψ¯sinψ¯cosψ¯]\displaystyle\approx\begin{bmatrix}\underline{\cos\varphi}&-\underline{\sin\varphi}\\ \underline{\sin\varphi}&\hphantom{-}\underline{\cos\varphi}\end{bmatrix}P_{\tilde{\bm{\sigma}}}P_{\tilde{\bm{\sigma}}}^{T}\begin{bmatrix}\tilde{\bm{\sigma}}_{1}^{\prime\prime}&0\\ 0&\tilde{\bm{\sigma}}_{2}^{\prime\prime}\end{bmatrix}P_{\tilde{\bm{\sigma}}}P_{\tilde{\bm{\sigma}}}^{T}\begin{bmatrix}\hphantom{-}\underline{\cos\psi}&\underline{\sin\psi}\\ -\underline{\sin\psi}&\underline{\cos\psi}\end{bmatrix} (45)
=(U~φ¯P𝝈~)(P𝝈~TΣ~𝝈~′′¯P𝝈~)(P𝝈~TV~ψT¯)=U¯~Σ~𝝈~¯V~T¯.\displaystyle=(\underline{\widetilde{U}_{\varphi}}P_{\tilde{\bm{\sigma}}})(P_{\tilde{\bm{\sigma}}}^{T}\underline{\widetilde{\Sigma}_{\tilde{\bm{\sigma}}}^{\prime\prime}}P_{\tilde{\bm{\sigma}}})(P_{\tilde{\bm{\sigma}}}^{T}\underline{\widetilde{V}_{\psi}^{T}})=\underline{\widetilde{U}}\underline{\widetilde{\Sigma}_{\tilde{\bm{\sigma}}}^{\prime}}\underline{\widetilde{V}^{T}}.

If 𝝈~1′′𝝈~2′′\tilde{\bm{\sigma}}_{1}^{\prime\prime}\prec\tilde{\bm{\sigma}}_{2}^{\prime\prime}, then 𝝈~1=𝝈~2′′\tilde{\bm{\sigma}}_{1}^{\prime}=\tilde{\bm{\sigma}}_{2}^{\prime\prime}, 𝝈~2=𝝈~1′′\tilde{\bm{\sigma}}_{2}^{\prime}=\tilde{\bm{\sigma}}_{1}^{\prime\prime}, and P𝝈~=[0110]P_{\tilde{\bm{\sigma}}}=\left[\begin{smallmatrix}0&1\\ 1&0\end{smallmatrix}\right], else 𝝈~i=𝝈~i′′\tilde{\bm{\sigma}}_{i}^{\prime}=\tilde{\bm{\sigma}}_{i}^{\prime\prime} and P𝝈~=IP_{\tilde{\bm{\sigma}}}=I, as presented in Algorithm 2. If r11¯<r22¯\underline{r_{11}}<\underline{r_{22}} then V¯ˇ\underline{\check{V}} should be formed as in (35), and U¯ˇ\underline{\check{U}} as well if 𝔱13\mathfrak{t}^{\prime}\cong 13. Else, if r11¯r22¯\underline{r_{11}}\geq\underline{r_{22}}, then V¯ˇ=V¯~\underline{\check{V}}=\underline{\widetilde{V}}, and (only implicitly for 𝔱=15\mathfrak{t}^{\prime}=15) U¯ˇ=U¯~\underline{\check{U}}=\underline{\widetilde{U}}.

Algorithm 2 Computation of the functions of ψ\psi and the singular values of R¯~\underline{\widetilde{R}}.
1:  t¯=fma(r~22¯,tanφ¯,r~12¯);tanψ¯=fl(t¯/r~11¯)\underline{t}=\mathop{\mathrm{fma}}(\underline{\tilde{r}_{22}},\underline{\tan\varphi},\underline{\tilde{r}_{12}});\quad\underline{\tan\psi}=\mathop{\mathrm{fl}}(\underline{t}/\underline{\tilde{r}_{11}}) /​/ (41)
2:  if tanψ¯=\underline{\tan\psi}=\infty then /​/ only if 𝔱13\mathfrak{t}^{\prime}\cong 13
3:     𝐭=(et¯,ft¯);𝐜𝐨𝐬ψ=𝐫~11𝐭;cosψ¯=fl(𝐜𝐨𝐬ψ);sinψ¯=1\mathbf{t}=(e_{\underline{t}},f_{\underline{t}});\quad\mathop{\mathbf{cos}}\psi=\tilde{\mathbf{r}}_{11}\oslash\mathbf{t};\quad\underline{\cos\psi}=\mathop{\mathrm{fl}}(\mathop{\mathbf{cos}}\psi);\quad\underline{\sin\psi}=1 /​/ (43)
4:     𝝈~1′′=𝐭;𝝈~2′′=𝐫~22𝐜𝐨𝐬ψ\tilde{\bm{\sigma}}_{1}^{\prime\prime}=\mathbf{t};\quad\tilde{\bm{\sigma}}_{2}^{\prime\prime}=\tilde{\mathbf{r}}_{22}\odot\mathop{\mathbf{cos}}\psi /​/ (44)
5:  else /​/ the general case
6:     secψ¯=hypot(tanψ¯,1);cosψ¯=fl(1/secψ¯);sinψ¯=fl(tanψ¯/secψ¯)\underline{\sec\psi}=\mathop{\mathrm{hypot}}(\underline{\tan\psi},1);\quad\underline{\cos\psi}=\mathop{\mathrm{fl}}(1/\underline{\sec\psi});\quad\underline{\sin\psi}=\mathop{\mathrm{fl}}(\underline{\tan\psi}/\underline{\sec\psi})
7:     𝐬𝐞𝐜φ=(esecφ¯,fsecφ¯);𝐬𝐞𝐜ψ=(esecψ¯,fsecψ¯);𝐬ψφ=𝐬𝐞𝐜φ𝐬𝐞𝐜ψ\mathop{\mathbf{sec}}\varphi=(e_{\underline{\sec\varphi}},f_{\underline{\sec\varphi}});\quad\mathop{\mathbf{sec}}\psi=(e_{\underline{\sec\psi}},f_{\underline{\sec\psi}});\quad\mathbf{s}_{\psi}^{\varphi}=\mathop{\mathbf{sec}}\varphi\oslash\mathop{\mathbf{sec}}\psi
8:     𝝈~1′′=𝐫~11𝐬ψφ;𝝈~2′′=𝐫~22𝐬ψφ\tilde{\bm{\sigma}}_{1}^{\prime\prime}=\tilde{\mathbf{r}}_{11}\oslash\mathbf{s}_{\psi}^{\varphi};\quad\tilde{\bm{\sigma}}_{2}^{\prime\prime}=\tilde{\mathbf{r}}_{22}\odot\mathbf{s}_{\psi}^{\varphi} /​/ (42)
9:  end if
10:  if 𝝈~1′′𝝈~2′′\tilde{\bm{\sigma}}_{1}^{\prime\prime}\prec\tilde{\bm{\sigma}}_{2}^{\prime\prime} then /​/ (8)
11:     𝝈~1=𝝈~2′′;𝝈~2=𝝈~1′′;P𝝈~=[0110]\tilde{\bm{\sigma}}_{1}^{\prime}=\tilde{\bm{\sigma}}_{2}^{\prime\prime};\quad\tilde{\bm{\sigma}}_{2}^{\prime}=\tilde{\bm{\sigma}}_{1}^{\prime\prime};\quad P_{\tilde{\bm{\sigma}}}=\left[\begin{smallmatrix}0&1\\ 1&0\end{smallmatrix}\right]
12:  else /​/ the general case
13:     𝝈~1=𝝈~1′′;𝝈~2=𝝈~2′′;P𝝈~=[1001]\tilde{\bm{\sigma}}_{1}^{\prime}=\tilde{\bm{\sigma}}_{1}^{\prime\prime};\quad\tilde{\bm{\sigma}}_{2}^{\prime}=\tilde{\bm{\sigma}}_{2}^{\prime\prime};\quad P_{\tilde{\bm{\sigma}}}=\left[\begin{smallmatrix}1&0\\ 0&1\end{smallmatrix}\right]
14:  end if

3.3.2 The SVD of GG

The approximate backscaled singular values of GG are 𝝈i=2s𝝈~i\bm{\sigma}_{i}=2^{-s}\odot\tilde{\bm{\sigma}}_{i}^{\prime}. They should remain in the exponent-“mantissa” form if possible, to avoid overflows and underflows.

Recall that, for 𝔱=15\mathfrak{t}^{\prime}=15, U~φT¯\underline{\widetilde{U}_{\varphi}^{T}} and the QR rotation UϑT¯\underline{U_{\vartheta}^{T}} have not been explicitly formed. The reason is that U^T¯=Uˇ𝝋T¯U+T¯\underline{\widehat{U}^{T}}=\underline{\check{U}_{\bm{\varphi}}^{T}}\underline{U_{+}^{T}}, where U+T¯\underline{U_{+}^{T}} is constructed from UϑT¯\underline{U_{\vartheta}^{T}} as in (21), requires a matrix-matrix multiplication that can and sporadically will degrade the numerical orthogonality of U¯^\underline{\widehat{U}}. On its own, such a problem is expected and can be tolerated, but if the left singular vectors of a pivot submatrix are applied to a pair of pivot rows of a large iteration matrix, many times throughout the Kogbetliantz process (1), it is imperative to make the vectors as orthogonal as possible, and thus try not to destroy the singular values of the iteration matrix. In the following, U¯^\underline{\widehat{U}} is generated from a single tanϕ\tan{\bm{\phi}}, where tanϕ\tan{\bm{\phi}} is a function of the already computed tan𝝋\tan{\bm{\varphi}} and tanϑ\tan{\vartheta}.

If 𝔱13\mathfrak{t}^{\prime}\cong 13, let U¯^=U+U¯ˇ\underline{\widehat{U}}=U_{+}\underline{\check{U}}, where U+U_{+} comes from Section 3.2.1. Else, due to (35), if S22T=IS_{22}^{T}=I in (21), the product Uˇ𝝋TUϑT=U𝝋+ϑT\check{U}_{\bm{\varphi}}^{T}U_{\vartheta}^{T}=U_{\bm{\varphi}+\vartheta}^{T} can be written in terms of 𝝋+ϑ\bm{\varphi}+\vartheta as

U𝝋+ϑT=S2T[cos𝝋sin𝝋sin𝝋cos𝝋][cosϑsinϑsinϑcosϑ]=S2T[cos(𝝋+ϑ)sin(𝝋+ϑ)sin(𝝋+ϑ)cos(𝝋+ϑ)].U_{\bm{\varphi}+\vartheta}^{T}=S_{2}^{T}\begin{bmatrix}\hphantom{-}\cos{\bm{\varphi}}&\sin{\bm{\varphi}}\\ -\sin{\bm{\varphi}}&\cos{\bm{\varphi}}\end{bmatrix}\begin{bmatrix}\hphantom{-}\cos\vartheta&\sin\vartheta\\ -\sin\vartheta&\cos\vartheta\end{bmatrix}=S_{2}^{T}\begin{bmatrix}\hphantom{-}\cos(\bm{\varphi}+\vartheta)&\sin(\bm{\varphi}+\vartheta)\\ -\sin(\bm{\varphi}+\vartheta)&\cos(\bm{\varphi}+\vartheta)\end{bmatrix}. (46)

If S22T=[1001]S_{22}^{T}=\left[\begin{smallmatrix}1&\hphantom{-}0\\ 0&-1\end{smallmatrix}\right], U𝝋ϑT=Uˇ𝝋TS22TUϑTU_{\bm{\varphi}-\vartheta}^{T}=\check{U}_{\bm{\varphi}}^{T}S_{22}^{T}U_{\vartheta}^{T} can be written in terms of 𝝋ϑ\bm{\varphi}-\vartheta as

U𝝋ϑT=S2T[cos(𝝋ϑ)sin(𝝋ϑ)sin(𝝋ϑ)cos(𝝋ϑ)]=S2T[cos(𝝋ϑ)sin(𝝋ϑ)sin(𝝋ϑ)cos(𝝋ϑ)]S2,U_{\bm{\varphi}-\vartheta}^{T}=S_{2}^{T}\begin{bmatrix}\hphantom{-}\cos(\bm{\varphi}-\vartheta)&-\sin(\bm{\varphi}-\vartheta)\\ -\sin(\bm{\varphi}-\vartheta)&-\cos(\bm{\varphi}-\vartheta)\end{bmatrix}=S_{2}^{T}\begin{bmatrix}\hphantom{-}\cos(\bm{\varphi}-\vartheta)&\sin(\bm{\varphi}-\vartheta)\\ -\sin(\bm{\varphi}-\vartheta)&\cos(\bm{\varphi}-\vartheta)\end{bmatrix}S_{2}, (47)

which is obtained by multiplying the matrices UφT[1001]UϑTU_{\varphi}^{T}\left[\begin{smallmatrix}1&\hphantom{-}0\\ 0&-1\end{smallmatrix}\right]U_{\vartheta}^{T} and simplifying the result using the trigonometric identities for the (co)sine of the difference of the angles 𝝋\bm{\varphi} and ϑ\vartheta. The middle matrix factor represents a possible sign change of r22r_{22}^{\prime} as in Section 3.2.2. The matrices defined in (46) and (47) are determined by tan(𝝋+ϑ)\tan(\bm{\varphi}+\vartheta) and tan(𝝋ϑ)\tan(\bm{\varphi}-\vartheta), respectively, where these tangents follow from the already computed ones as

tan(𝝋+ϑ)=tan𝝋+tanϑ1tan𝝋tanϑ,tan(𝝋ϑ)=tan𝝋tanϑ1+tan𝝋tanϑ.\tan(\bm{\varphi}+\vartheta)=\frac{\tan\bm{\varphi}+\tan\vartheta}{1-\tan{\bm{\varphi}}\tan\vartheta},\qquad\tan(\bm{\varphi}-\vartheta)=\frac{\tan\bm{\varphi}-\tan\vartheta}{1+\tan{\bm{\varphi}}\tan\vartheta}. (48)

Finally, from (21) and (35), using either (46) or (47), the SVD of GG is completed as

U^T¯=U𝝋±ϑT¯PUTS1T,U¯=U¯^P𝝈~,V¯=V¯ˇP𝝈~.\underline{\widehat{U}^{T}}=\underline{U_{\bm{\varphi}\pm\vartheta}^{T}}P_{U}^{T}S_{1}^{T},\quad\underline{U}=\underline{\widehat{U}}P_{\tilde{\bm{\sigma}}},\qquad\underline{V}=\underline{\check{V}}P_{\tilde{\bm{\sigma}}}. (49)

For (49), PUTS1TP_{U}^{T}S_{1}^{T} from (21) is explicitly built and stored. It contains exactly one ±1\pm 1 element in each row and column, while the other is zero. Its multiplication by U𝝋±ϑT¯\underline{U_{\bm{\varphi}\pm\vartheta}^{T}} is thus performed error-free. The tangents computed as in (48) might be relatively inaccurate in theory, but the transformations they define via the cosines and the sines from either (46) or (47) are numerically orthogonal in practice, as shown in Section 4.

This heuristic might become irrelevant if the ab+cdab+cd floating-point operation with a single rounding [23] becomes supported in hardware. Then, each element of Uˇ𝝋T¯UϑT¯\underline{\check{U}_{\bm{\varphi}}^{T}}\underline{U_{\vartheta}^{T}} (a product of two 2×22\times 2 matrices) can be formed with one such operation. It remains to be seen if the multiplication approach improves accuracy of the computed left singular vectors without spoiling their orthogonality, compared to the proposed heuristic.

From the method’s construction, it follows that if the side (left or right) on which the signs are extracted while preparing RR is fixed (see Section 3.1) and whenever the assumptions on the arithmetic hold, the SVD of GG as proposed here is bitwise reproducible for any GG with finite elements. Also, the method does not produce any infinite or undefined element in its outputs UU, VV, and (conditionally, as described) Σ\Sigma.

3.4 A complex input matrix

If GG is purely imaginary, ±iG\pm\mathrm{i}G is real. Else, if GG has at least one complex element, the proposed real method is altered, as detailed in [14, 15], in the following ways:

  1. 1.

    To make the element 0gij=|gij|eiαij0\neq g_{ij}=|g_{ij}|\mathrm{e}^{\mathrm{i}\alpha_{ij}} real and positive, its row or column is multiplied by eiαij\mathrm{e}^{-\mathrm{i}\alpha_{ij}} (which goes into a sign matrix), and the element is replaced by its absolute value. To avoid overflow, let 𝔰=𝔰+1\mathfrak{s}_{\mathbb{C}}=\mathfrak{s}+1 in (13). The exponents of each component (real and imaginary) of every element are considered in (9).

  2. 2.

    U+T¯\underline{U_{+}^{T}} is explicitly constructed in (21), and Uˇ𝝋T¯U+T¯\underline{\check{U}_{\bm{\varphi}}^{T}}\underline{U_{+}^{T}} is formed by a real-complex matrix multiplication. The correctly rounded ab+cdab+cd operation [23] would be helpful here. Merging Uˇ𝝋T¯UϑT¯\underline{\check{U}_{\bm{\varphi}}^{T}}\underline{U_{\vartheta}^{T}} as in (46) or (47) remains a possibility if S22S_{22} happens to be real.

  3. 3.

    Since (30) is no longer directly applicable for ensuring stability, no computation is performed in a wider datatype. Reproducibility of the whole method is conditional upon reproducibility of the complex multiplication and the absolute value (hypot\mathop{\mathrm{hypot}}).

Once R¯\underline{R} is obtained, the algorithms from Section 3.3 work unchanged.

4 Numerical testing

Numerical testing was performed on machines with a 64-core Intel Xeon Phi 7210 CPU, a 64-bit Linux, and the Intel oneAPI Base and HPC toolkits, version 2024.1.

Let the LAPACK’s xLASV2 routine be denoted by 𝙻\mathtt{L}. The Kogbetliantz SVD in the same datatype is denoted by 𝙺\mathtt{K}. Unless such information is clear from the context, let the results’ designators carry a subscript 𝙺\mathtt{K} or 𝙻\mathtt{L} in the following figures, depending on the routine that computed them, and also a superscript \circ or \bullet, signifying how the input matrices were generated. All inputs were random. Those denoted by \circ had their elements generated as Fortran’s pseudorandom numbers not above unity in magnitude, and those symbolized by \bullet had their elements’ magnitudes in the “safe” range [μ,ν/4][\mu,\nu/4], as defined by (11), to avoid overflows with 𝙻\mathtt{L} and underflows due to the prescaling in 𝙺\mathtt{K}. The latter random numbers were provided by the CPU’s rdrand instructions. If not kept, the \bullet inputs are thus not reproducible, unlike the \circ ones if the seed is preserved.

All relative error measures were computed in quadruple precision from data in the working (single or double) precision. The unknown exact singular values of the input matrices were approximated by the Kogbetliantz SVD method adapted to quadruple precision (with a hypot\mathop{\mathrm{hypot}} operation that might not have been correctly rounded).

With GG given and U¯\underline{U}, Σ¯\underline{\Sigma}, V¯\underline{V} computed, let the relative SVD residual be defined as

reG=GU¯Σ¯VT¯F/GF,\mathop{\mathrm{re}}G=\|G-\underline{U}\underline{\Sigma}\underline{V^{T}}\|_{F}/\|G\|_{F}, (50)

the maximal relative error in the computed singular values σi¯\underline{\sigma_{i}} (with σi\sigma_{i} being exact) as

reσi=|σiσi¯|/σi,1i2,σi=0σi¯=0reσi=0,\mathop{\mathrm{re}}\sigma_{i}=|\sigma_{i}-\underline{\sigma_{i}}|/\sigma_{i},\quad 1\leq i\leq 2,\qquad\sigma_{i}=0\wedge\underline{\sigma_{i}}=0\implies\mathop{\mathrm{re}}\sigma_{i}=0, (51)

and the departure from orthogonality in the Frobenius norm for matrices of the left and right singular vectors (what can be seen as the relative error with respect to II) as

reU=U¯TU¯IF,reV=V¯TV¯IF.\mathop{\mathrm{re}}U=\|\underline{U}^{T}\underline{U}-I\|_{F},\qquad\mathop{\mathrm{re}}V=\|\underline{V}^{T}\underline{V}-I\|_{F}. (52)

Every datapoint in the figures shows the maximum of a particular relative error measure over a batch of input matrices, where each batch (run) contained 2302^{30} matrices.

Figure 1 covers the case of upper triangular input matrices, which can be processed by both 𝙺\mathtt{K} and 𝙻\mathtt{L}, and the measures (50) and (52). Numerical orthogonality of the singular vectors computed by 𝙺\mathtt{K} is noticeably better in the worst case than that of the vectors obtained by 𝙻\mathtt{L}. Also, the relative SVD residuals are slightly better, in both the \bullet and the \circ runs.

Figure 1: Numerical orthogonality of the singular vectors and the relative SVD residuals with 𝙺\mathtt{K} and 𝙻\mathtt{L} on random upper triangular double precision matrices.

Figure 2 shows the relative errors in the singular values (51) of the same matrices from Figure 1. The unity mark for re𝙻σ2\mathop{\mathrm{re}_{\mathtt{L}}}\sigma_{2}^{\bullet} indicates that, in the \bullet case, 𝙻\mathtt{L} can cause the relative errors in the smaller singular values σ2¯\underline{\sigma_{2}} to be so high that their maximum was unity in all runs, so it cannot be displayed in Figure 2; the most likely cause is underflow to zero of σ2¯\underline{\sigma_{2}^{\bullet}} when the “exact” σ2>0\sigma_{2}^{\bullet}>0 in (51). However, when 𝙻\mathtt{L} managed to compute the smaller singular values accurately in the \circ case, the maximum of their relative errors was a bit smaller than the one from 𝙺\mathtt{K}, the cause of which is worth exploring. The same holds for the larger singular values, which were computed accurately by both 𝙻\mathtt{L} and 𝙺\mathtt{K}.

Figure 2: The relative errors in the singular values with 𝙺\mathtt{K} and 𝙻\mathtt{L} on random upper triangular double precision matrices, with maxκ29.45101229\max\kappa_{2}^{\bullet}\lessapprox 9.45\cdot 10^{1229} and maxκ27.321010\max\kappa_{2}^{\circ}\lessapprox 7.32\cdot 10^{10}.

To put maxκ29.45101229\max\kappa_{2}^{\bullet}\lessapprox 9.45\cdot 10^{1229}, for which 𝙺\mathtt{K} still accurately computed all singular values (in the exponent-“mantissa” form, and thus not underflowing), into perspective, the highest possible condition number for triangular matrices in the \bullet case can be estimated by recalling that Algorithms 1 and 2 were also performed in quadruple precision (to get σ1\sigma_{1}, σ2\sigma_{2}, and so κ2\kappa_{2}), where μ\mu and ν\nu of double precision, as well as ν/μ\nu/\mu, are within the normal range. Then, tanφ\tan\varphi can be made small and tanψ\tan\psi huge by, e.g.,

G=[μν/40μ]tan(2φ)=8μν2μν<tanφ4μνtanψν4μ.G=\begin{bmatrix}\mu&\nu/4\\ 0&\mu\end{bmatrix}\implies\tan(2\varphi)=\frac{8\mu}{\nu}\implies\frac{2\mu}{\nu}<\tan\varphi\lessapprox\frac{4\mu}{\nu}\implies\tan\psi\gtrapprox\frac{\nu}{4\mu}.

Therefore, the condition number of GG is a cubic expression in ν/μ\nu/\mu, since, from (42),

σ2=μsecφsecψ4μ2ν,σ1=ν4secψsecφν216μ,κ2=σ1σ2ν364μ3.\sigma_{2}=\mu\frac{\sec\varphi}{\sec\psi}\approx\frac{4\mu^{2}}{\nu},\quad\sigma_{1}=\frac{\nu}{4}\frac{\sec\psi}{\sec\varphi}\approx\frac{\nu^{2}}{16\mu},\qquad\kappa_{2}=\frac{\sigma_{1}}{\sigma_{2}}\approx\frac{\nu^{3}}{64\mu^{3}}.
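This extreme case can be reproduced in a small C sketch, with DBL_MIN and DBL_MAX standing in for μ\mu and ν\nu: the directly evaluated σ2\sigma_{2} flushes to zero, while an exponent-“mantissa” pair (f,e)(f,e) with σ2=f2e\sigma_{2}=f\cdot 2^{e} retains it, which is the point of the method's representation (the helper name is illustrative):

```c
#include <math.h>
#include <float.h>

/* For G = [mu, nu/4; 0, mu], sigma_2 ~ 4*mu^2/nu underflows to zero in
   double precision, whereas tracking the binary exponent separately
   keeps sigma_2 = f * 2^e representable.  Illustrative only. */
void sigma2_example(double *direct, double *f, int *e) {
    const double mu = DBL_MIN, nu = DBL_MAX;
    *direct = 4.0 * mu * (mu / nu);     /* mu/nu underflows to zero */
    int emu, enu;
    const double fmu = frexp(mu, &emu); /* mu = fmu * 2^emu, fmu in [0.5,1) */
    const double fnu = frexp(nu, &enu);
    *f = 4.0 * fmu * (fmu / fnu);       /* fractions stay near one */
    *e = 2 * emu - enu;                 /* exponents add and subtract */
}
```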

Figure 3 focuses on 𝙺\mathtt{K} and general input matrices, with all their elements random. Inaccuracy of the smaller singular values in the \bullet case motivated the search for safe exponent ranges of the elements of input matrices that should preserve accuracy of σ2¯\underline{\sigma_{2}} from 𝙻\mathtt{L} for 𝔱=13\mathfrak{t}=13 and from 𝙺\mathtt{K} for 𝔱=15\mathfrak{t}=15. For that, the range of random values was restricted, and only those outputs xx from rdrand for which |x|[2ςμ,ν/4]|x|\in[2^{\varsigma}\mu,\nu/4] were accepted, where ς\varsigma was a positive integer parameter independently chosen for each run.

Figure 3: Numerical orthogonality of the singular vectors, the relative SVD residuals, and the relative errors in the singular values with 𝙺\mathtt{K} on random double precision matrices, with maxκ21.4210616\max\kappa_{2}^{\bullet}\lessapprox 1.42\cdot 10^{616} and maxκ22.841010\max\kappa_{2}^{\circ}\lessapprox 2.84\cdot 10^{10}.

Figure 4 shows the results of this search for 𝙺\mathtt{K} and 𝙻\mathtt{L}. Approximately half-way through the entire normal exponent range, the relative errors in the smaller singular values stabilize to a single-digit multiple of ε\varepsilon. Thus, when the exponents of the elements of GG satisfy

max1i,j2egijmin1i,j2egij<(eνeμ)/2\max_{1\leq i,j\leq 2}{e_{g_{ij}}}-\min_{1\leq i,j\leq 2}{e_{g_{ij}}}<(e_{\nu}-e_{\mu})/2

(ignoring the exponent of 0) it might be expected that 𝙺\mathtt{K} computes σ2¯\underline{\sigma_{2}} accurately, while 𝙻\mathtt{L} should additionally be safeguarded by its user from the elements too close to μ\mu. The proposed prescaling, but with 𝔰𝙻=𝔰+1\mathfrak{s}_{\mathtt{L}}=\mathfrak{s}+1 (or more), might be applied to GG before 𝙻\mathtt{L}.

Figure 4: The observed decay of the relative errors in the smaller singular values by narrowing of the exponent range of the elements of double precision input matrices, where κ2𝙺\kappa_{2}^{\mathtt{K}} falls from 1.59103291.59\cdot 10^{329} for ς=953\varsigma=953 to 2.33103082.33\cdot 10^{308} for ς=1027\varsigma=1027, and κ2𝙻\kappa_{2}^{\mathtt{L}} from 2.23106562.23\cdot 10^{656} to 4.93106114.93\cdot 10^{611}.

A timing comparison of xLASV2 and the proposed method was not performed since the correctly rounded routines are still in development. By construction the proposed method is more computationally complex than xLASV2, so it is expected to be slower.

An unoptimized OpenMP-parallel implementation of the Kogbetliantz SVD for GG of order n>2n>2, with the scaling of GG in the spirit of [22] but stronger (accounting for the two-sided transformations of GG) and with the modified modulus pivot strategy [26], was run with 64 threads spread across the CPU cores, a deterministic reduction procedure, and OMP_DYNAMIC=FALSE. On the highly conditioned input matrices from [22], it showed up to 10% speedup over DGESVJ [27, 28], the one-sided Jacobi SVD routine without preconditioning from the threaded Intel MKL library, for large enough nn (up to 53765376). The left singular vectors from the former were a bit more orthogonal than those from the latter, while the opposite was true for the right singular vectors. The singular values from DGESVJ were less than an order of magnitude more accurate.

5 Conclusions and future work

The proposed Kogbetliantz method for the SVD of order two computed highly numerically orthogonal singular vectors in all tests. The larger singular values were relatively accurate up to a few ε\varepsilon in all tests, and the smaller ones were when the input matrices were triangular, or, for the general (without zeros) input matrices, if the range of their elements was narrower than or about half of the width of the range of normal values.

The constituent phases of the method can be used on their own. The prescaling might help xLASV2 when its inputs are small. The highly accurate triangularization might instead be combined with xLASV2, as an alternative method for general matrices. And the proposed SVD of triangular matrices demonstrates some of the benefits of the more complex correctly rounded operations (hypot\mathop{\mathrm{hypot}}), although their potential uses go beyond this method.

High relative accuracy for tan(2φ)\tan(2\varphi) from (19) might be achieved, barring underflow, if the four-way fused dot product operation ab+cd+ef+ghab+cd+ef+gh, DOT4, with a single rounding of the exact value [29], becomes available in hardware. Then the denominator of the expression for tan(2φ)\tan(2\varphi) in (19) could be computed, even without scaling if in a wider datatype, by the DOT4, and the numerator by the DOT2 (ab+cdab+cd) operation.

The proposed heuristic for improving orthogonality of the left singular vectors might be helpful in other cases when two plane rotations have to be composed into one and the tangents of their angles are known. It already brings a slight advantage to the Kogbetliantz SVD of order nn with respect to the one-sided Jacobi SVD in this regard.

With a proper vectorization, and by removing all redundancies from the preliminary implementation, it might be feasible to speed up the Kogbetliantz SVD of order n further, since adding more threads is beneficial as long as their number does not exceed n.

Supplementary information. The document sm.pdf supplements this paper with further remarks on methods for larger matrices and with the single precision testing results.

Acknowledgments. The author would like to thank Dean Singer for his material support and Vjeran Hari for fruitful discussions.

Declarations

Funding. This work was supported in part by the Croatian Science Foundation under the now-expired project IP–2014–09–3670 “Matrix Factorizations and Block Diagonalization Algorithms” (MFBDA), in the form of unlimited compute time granted for the testing.

Competing interests. The author has no relevant competing interests to declare.

Code availability. The code is available in the https://github.com/venovako/KogAcc repository and in the supporting https://github.com/venovako/libpvn repository.

References

  • Anderson et al. [1999] Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide, 3rd edn. Software, Environments and Tools. SIAM, Philadelphia, PA, USA (1999). https://doi.org/10.1137/1.9780898719604
  • Moler and Stewart [1973] Moler, C.B., Stewart, G.W.: An algorithm for generalized matrix eigenvalue problems. SIAM J. Numer. Anal. 10(2), 241–256 (1973) https://doi.org/10.1137/0710024
  • Demmel and Kahan [1990] Demmel, J., Kahan, W.: Accurate singular values of bidiagonal matrices. SIAM J. Sci. Statist. Comput. 11(5), 873–912 (1990) https://doi.org/10.1137/0911052
  • Kogbetliantz [1955] Kogbetliantz, E.G.: Solution of linear equations by diagonalization of coefficients matrix. Quart. Appl. Math. 13(2), 123–132 (1955) https://doi.org/10.1090/qam/88795
  • Charlier et al. [1987] Charlier, J.P., Vanbegin, M., Van Dooren, P.: On efficient implementations of Kogbetliantz’s algorithm for computing the singular value decomposition. Numer. Math. 52(3), 279–300 (1987) https://doi.org/10.1007/BF01398880
  • Stewart [1992] Stewart, G.W.: An updating algorithm for subspace tracking. IEEE Trans. Signal Process. 40(6), 1535–1541 (1992) https://doi.org/10.1109/78.139256
  • Quintana-Ortí et al. [1998] Quintana-Ortí, G., Sun, X., Bischof, C.H.: A BLAS-3 version of the QR factorization with column pivoting. SIAM J. Sci. Comp. 19(5), 1486–1494 (1998) https://doi.org/10.1137/S1064827595296732
  • Hari and Matejaš [2009] Hari, V., Matejaš, J.: Accuracy of two SVD algorithms for 2×2 triangular matrices. Appl. Math. Comput. 210(1), 232–257 (2009) https://doi.org/10.1016/j.amc.2008.12.086
  • Charlier and Van Dooren [1987] Charlier, J.-P., Van Dooren, P.: On Kogbetliantz’s SVD algorithm in the presence of clusters. Linear Algebra Appl. 95, 135–160 (1987) https://doi.org/10.1016/0024-3795(87)90031-0
  • Hari and Veselić [1987] Hari, V., Veselić, K.: On Jacobi methods for singular value decompositions. SIAM J. Sci. Statist. Comput. 8(5), 741–754 (1987) https://doi.org/10.1137/0908064
  • Hari and Zadelj-Martić [2007] Hari, V., Zadelj-Martić, V.: Parallelizing the Kogbetliantz method: A first attempt. J. Numer. Anal. Ind. Appl. Math. 2(1–2), 49–66 (2007)
  • Hari [1991] Hari, V.: On sharp quadratic convergence bounds for the serial Jacobi methods. Numer. Math. 60(1), 375–406 (1991) https://doi.org/10.1007/BF01385728
  • Matejaš and Hari [2015] Matejaš, J., Hari, V.: On high relative accuracy of the Kogbetliantz method. Linear Algebra Appl. 464, 100–129 (2015) https://doi.org/10.1016/j.laa.2014.02.024
  • Novaković [2020] Novaković, V.: Batched computation of the singular value decompositions of order two by the AVX-512 vectorization. Parallel Process. Lett. 30(4), Article 2050015, 1–23 (2020) https://doi.org/10.1142/S0129626420500152
  • Novaković and Singer [2022] Novaković, V., Singer, S.: A Kogbetliantz-type algorithm for the hyperbolic SVD. Numer. Algorithms 90(2), 523–561 (2022) https://doi.org/10.1007/s11075-021-01197-4
  • Bečka et al. [2002] Bečka, M., Okša, G., Vajteršic, M.: Dynamic ordering for a parallel block-Jacobi SVD algorithm. Parallel Comp. 28(2), 243–262 (2002) https://doi.org/10.1016/S0167-8191(01)00138-7
  • Okša et al. [2022] Okša, G., Yamamoto, Y., Vajteršic, M.: Convergence to singular triplets in the two-sided block-Jacobi SVD algorithm with dynamic ordering. SIAM J. Matrix Anal. Appl. 43(3), 1238–1262 (2022) https://doi.org/10.1137/21M1411895
  • IEEE Computer Society [2019] IEEE Computer Society: 754-2019 - IEEE Standard for Floating-Point Arithmetic, (2019). https://doi.org/10.1109/IEEESTD.2019.8766229
  • Borges [2020] Borges, C.F.: Algorithm 1014: An improved algorithm for hypot(x,y). ACM Trans. Math. Softw. 47(1), Article 9, 1–12 (2020) https://doi.org/10.1145/3428446
  • Sibidanov et al. [2022] Sibidanov, A., Zimmermann, P., Glondu, S.: The CORE-MATH project. In: 2022 IEEE 29th Symposium on Computer Arithmetic (ARITH), pp. 26–34 (2022). https://doi.org/10.1109/ARITH54963.2022.00014
  • Novaković [2024] Novaković, V.: Accurate complex Jacobi rotations. J. Comput. Appl. Math. 450, 116003 (2024) https://doi.org/10.1016/j.cam.2024.116003
  • Novaković [2023] Novaković, V.: Vectorization of a thread-parallel Jacobi singular value decomposition method. SIAM J. Sci. Comput. 45(3), C73–C100 (2023) https://doi.org/10.1137/22M1478847
  • Lauter [2017] Lauter, C.: An efficient software implementation of correctly rounded operations extending FMA: a+b+c and a×b+c×d. In: 2017 51st Asilomar Conference on Signals, Systems, and Computers, pp. 452–456 (2017). https://doi.org/10.1109/ACSSC.2017.8335379
  • Hubrecht et al. [2024] Hubrecht, T., Jeannerod, C.-P., Muller, J.-M.: Useful applications of correctly-rounded operators of the form ab+cd+e. In: 2024 IEEE 31st Symposium on Computer Arithmetic (ARITH), pp. 32–39 (2024). https://doi.org/10.1109/ARITH61463.2024.00015
  • Jeannerod et al. [2013] Jeannerod, C.-P., Louvet, N., Muller, J.-M.: Further analysis of Kahan’s algorithm for the accurate computation of 2×2 determinants. Math. Comp. 82(284), 2245–2264 (2013) https://doi.org/10.1090/S0025-5718-2013-02679-8
  • Novaković and Singer [2011] Novaković, V., Singer, S.: A GPU-based hyperbolic SVD algorithm. BIT 51(4), 1009–1030 (2011) https://doi.org/10.1007/s10543-011-0333-5
  • Drmač [1997] Drmač, Z.: Implementation of Jacobi rotations for accurate singular value computation in floating point arithmetic. SIAM J. Sci. Comput. 18(4), 1200–1222 (1997) https://doi.org/10.1137/S1064827594265095
  • Drmač and Veselić [2008] Drmač, Z., Veselić, K.: New fast and accurate Jacobi SVD algorithm. II. SIAM J. Matrix Anal. Appl. 29(4), 1343–1362 (2008) https://doi.org/10.1137/05063920X
  • Lutz et al. [2024] Lutz, D.R., Saini, A., Kroes, M., Elmer, T., Valsaraju, H.: Fused FP8 4-way dot product with scaling and FP32 accumulation. In: 2024 IEEE 31st Symposium on Computer Arithmetic (ARITH), pp. 40–47 (2024). https://doi.org/10.1109/ARITH61463.2024.00016