
Vedran Novaković (ORCID: 0000-0003-2964-9674), independent researcher, Vankina ulica 15, HR-10020 Zagreb, Croatia

Arithmetical enhancements of the Kogbetliantz method for the SVD of order two

Abstract

An enhanced Kogbetliantz method for the singular value decomposition (SVD) of general matrices of order two is proposed. The method consists of three phases: an almost exact prescaling, which can also benefit the LAPACK xLASV2 routine for the SVD of upper triangular $2\times 2$ matrices, a highly relatively accurate triangularization in the absence of underflows, and an alternative procedure for computing the SVD of triangular matrices that employs the correctly rounded $\mathop{\mathrm{hypot}}$ function. A heuristic for improving numerical orthogonality of the left singular vectors is also presented and tested on a wide spectrum of random input matrices. On upper triangular matrices under test, the proposed method, unlike xLASV2, finds both singular values with high relative accuracy as long as the input elements are within a safe range that is almost as wide as the entire normal range. On general matrices of order two, the method's safe range, for which the smaller singular values remain accurate, is about half the width of the normal range.

keywords: singular value decomposition, the Kogbetliantz method, matrices of order two, roundoff analysis

MSC Classification: 65F15, 15A18, 65Y99

1 Introduction

The singular value decomposition (SVD) of a square matrix of order two is a widely used numerical tool. In LAPACK [1] alone, its xLASV2 routine for the SVD of real upper triangular $2\times 2$ matrices is a building block for the QZ algorithm [2] for the generalized eigenvalue problem $Ax=\lambda Bx$, with $A$ and $B$ real and square, and for the SVD of real bidiagonal matrices by the implicit QR algorithm [3]. Also, the oldest method for the SVD of square matrices that is still in use was developed by Kogbetliantz [4], based on the SVD of order two, and as such is the primary motivation for this research.

This work explores how to compute the SVD of a general matrix of order two indirectly, by a careful scaling, a highly relatively accurate triangularization if the matrix indeed contains no zeros, and an alternative triangular SVD method, since the straightforward formulas for general matrices are challenging to evaluate stably.

Let $G$ be a square real matrix of order $n$. The SVD of $G$ is a decomposition $G=U\Sigma V^{T}$, where $U$ and $V$ are orthogonal\footnote{If $G$ is complex, $U$ and $V$ are unitary and $G=U\Sigma V^{\ast}$, but this case is only briefly dealt with here.} matrices of order $n$ of the left and the right singular vectors of $G$, respectively, and $\Sigma=\mathop{\mathrm{diag}}(\sigma_{1},\ldots,\sigma_{n})$ is a diagonal matrix of its singular values, such that $\sigma_{i}\geq\sigma_{j}\geq 0$ for all $i$ and $j$ with $1\leq i<j\leq n$.

In step $k$ of the Kogbetliantz SVD method, a pivot submatrix of order two (or several of them, not sharing indices with each other, if the method is parallel) is found according to the chosen pivot strategy in the iteration matrix $G_{k}$, its SVD is computed, and $U_{k}$, $V_{k}$, and $G_{k}$ are updated by the transformation matrices $\mathsf{U}_{k}$ and/or $\mathsf{V}_{k}$, leaving zeros in the off-diagonal positions $(j_{k},i_{k})$ and $(i_{k},j_{k})$ of $G_{k+1}$, as in

\begin{gathered}G_{0}=\mathop{\mathrm{preprocess}}(G),\quad U_{0}=\mathop{\mathrm{preprocess}}(I),\quad V_{0}=\mathop{\mathrm{preprocess}}(I);\\ G_{k+1}=\mathsf{U}_{k}^{T}G_{k}\mathsf{V}_{k},\quad U_{k+1}=U_{k}\mathsf{U}_{k},\quad V_{k+1}=V_{k}\mathsf{V}_{k},\qquad k\geq 0;\\ \mathop{\mathrm{convergence}}(k=K)\implies U\approx U_{K},\quad V\approx V_{K},\quad\sigma_{i}\approx g_{ii}^{(K)},\quad 1\leq i\leq n.\end{gathered} \tag{1}

The left and the right singular vectors of the pivot matrix are embedded into identities to get $\mathsf{U}_{k}$ and $\mathsf{V}_{k}$, respectively, with the index mapping from matrices of order two to $\mathsf{U}_{k}$ and $\mathsf{V}_{k}$ being $(1,1)\mapsto(i_{k},i_{k})$, $(2,1)\mapsto(j_{k},i_{k})$, $(1,2)\mapsto(i_{k},j_{k})$, $(2,2)\mapsto(j_{k},j_{k})$, where $1\leq i_{k}<j_{k}\leq n$ are the pivot indices. The process is repeated until convergence, i.e., until for some $k=K$ the off-diagonal norm of $G_{K}$ falls below a certain threshold.

If $G$ has $m$ rows and $n$ columns, $m>n$, it should be preprocessed [5] to a square matrix $G_{0}$ by a factorization of the URV [6] type (e.g., the QR factorization with column pivoting [7]). Then, $U_{0}^{T}GV_{0}=G_{0}$, where $G_{0}$ is triangular of order $n$ and $U_{0}$ is orthogonal of order $m$. If $m<n$, then the SVD of $G^{T}$ can be computed instead.

In all iterations it would be beneficial to have the pivot matrix $\widehat{G}_{k}$ triangular, since its SVD can be computed with high relative accuracy under mild assumptions [8]. This is, however, not possible with the time-consuming but simple, quadratically convergent [9, Remark 6] pivot strategy that chooses the pivot with the largest off-diagonal norm $\sqrt{|g_{j_{k}i_{k}}^{(k)}|^{2}+|g_{i_{k}j_{k}}^{(k)}|^{2}}$, but it is possible, if $G_{0}$ is triangular, with certain sequential cyclic (i.e., periodic) strategies [5, 10] like the row-cyclic and column-cyclic ones, and even with some parallel ones, after further preprocessing of $G_{0}$ into a suitable “butterfly” form [11].

Although the row-cyclic and column-cyclic strategies ensure global [10] and asymptotically quadratic [9, 12] convergence of the method, as well as its high relative accuracy [13], the method's sequential variants remain slow on modern hardware, while preprocessing $G$ to $G_{0}$ (in the butterfly form or not) can only be partially parallelized.

This work is a part of a broader effort [14, 15] to investigate if a fast and accurate (in practice if not in theory) variant of the Kogbetliantz method could be developed, that would be entirely parallel and would function on general square matrices without expensive preprocessing, with full pivots $\widehat{G}_{k}^{[\ell]}$, $1\leq\ell\leq\mathsf{n}\leq\lfloor n/2\rfloor$, that are independently diagonalized, and with $\mathsf{n}$ ensuing concurrent updates of $U_{k}$, $\mathsf{n}$ of $V_{k}$, and $\mathsf{n}$ from each side of $G_{k}$ in a parallel step. This way the employed parallel pivot strategy does not have to be cyclic. A promising candidate is the dynamic ordering [16, 17, 15].

The proposed Kogbetliantz SVD of order two supports a wider exponent range of the elements of a triangular input matrix, for which both singular values are computed with high relative accuracy, than xLASV2, although the latter is slightly more accurate when a comparison is possible. Matrices of the singular vectors obtained by the proposed method are noticeably more numerically orthogonal. With respect to [14, 15] and a general matrix $G$ of order two, the following enhancements have been implemented:

  1. The structure of $G$ is exploited to the utmost extent, so the triangularization and a stronger scaling are employed only when $G$ has no zeros, thus preserving accuracy.

  2. The triangularization of $G$ by a special URV factorization is tweaked so that high relative accuracy of each computed element is provable when no underflow occurs.

  3. The SVD procedure for triangular matrices utilizes the latest advances in computing the correctly rounded functions, so the pertinent formulas from [14, 15] are altered.

  4. The left singular vectors are computed by a heuristic when the triangularization is involved, by composing the two plane rotations—one from the URV factorization, and the other from the triangular SVD—into one without the matrix multiplication.

High relative accuracy of the singular values of $G$ is observed, but not proved, when the range of the elements of $G$ is narrower than about half of the entire normal range.

This paper is organized as follows. In Section 2 the floating-point environment and the required operations are described, and some auxiliary results regarding them are proved. Section 3 presents the proposed SVD method. In Section 4 the numerical testing results are shown. Section 5 concludes the paper with comments on future work.

2 Floating-point considerations

Let $x$ be a real, infinite, or undefined (Not-a-Number) value: $x\in\mathbb{R}\cup\{-\infty,+\infty,\mathrm{NaN}\}$. Its floating-point representation is denoted by $\mathop{\mathrm{fl}}(x)$ and is obtained by rounding $x$ to a value of the chosen standard floating-point datatype using the rounding mode in effect, here assumed to be to nearest (ties to even). If the result is normal, $\mathop{\mathrm{fl}}(x)=x(1+\epsilon)$, where $|\epsilon|\leq\varepsilon=2^{-p}$ and $p$ is the number of bits in the significand of a floating-point value. In LAPACK's terms, $\varepsilon_{\text{x}}=\text{xLAMCH('E')}$, where x is S or D. Thus, $p_{\text{x}}=24$ or $53$, and $\varepsilon_{\text{x}}=2^{-24}$ or $2^{-53}$, for single (S) or double (D) precision, respectively. The gradual underflow mode, allowing subnormal inputs and outputs, has to be enabled (e.g., on Intel-compatible architectures the Denormals-Are-Zero and Flush-To-Zero modes have to be turned off). Trapping on floating-point exceptions has to be disabled (the default non-stop handling from [18, Sect. 7.1]).

Possible discretization errors in input data are ignored. Input matrices in floating-point are thus considered exact, and are, for simplicity, required to have finite elements.

The Fused Multiply-and-Add (FMA) function, $\mathop{\mathrm{fma}}(a,b,c)=\mathop{\mathrm{fl}}(a\cdot b+c)$, is required. Conceptually, the exact value of $a\cdot b+c$ is correctly rounded. Also, the hypotenuse function, $\mathop{\mathrm{hypot}}(a,b)=\mathop{\mathrm{fl}}(\sqrt{a^{2}+b^{2}})$, is assumed to be correctly rounded (unless stated otherwise), as recommended by the standard [18, Sect. 9.2], but unlike\footnote{https://members.loria.fr/PZimmermann/papers/accuracy.pdf} many current implementations of the routines hypotf and hypot. Such a function (see also [19]) never overflows when the rounded result should be finite, it is zero if and only if $|a|=|b|=0$, and it is symmetric and monotone. The CORE-MATH library [20] provides an open-source implementation\footnote{https://core-math.gitlabpages.inria.fr} of some of the optional correctly rounded single and double precision mathematical C functions (e.g., cr_hypotf and cr_hypot).

Radix two is assumed for floating-point values. Scaling of a value $x$ by $2^{s}$, where $s\in\mathbb{Z}$, is exact if the result is normal. Only non-normal results can lose precision. Let, for $x=\pm 0$, $e_{x}=0$ and $f_{x}=0$, and for a finite non-zero $x$ let the exponent be $e_{x}=\lfloor\lg|x|\rfloor$ and the “mantissa” $f_{x}$ satisfy $1\leq|f_{x}|<2$, such that $x=\mathop{\mathrm{fl}}(2^{e_{x}}f_{x})$. Also, let $f_{x}=x$ for $x=\pm\infty$, while $e_{x}=0$. Note that $f_{x}$ is normal even for subnormal $x$. Keep in mind that the frexp routine represents a finite non-zero $x$ with $e_{x}^{\prime}=e_{x}+1$ and $f_{x}^{\prime}=f_{x}/2$.

Let $\mu$ be the smallest and $\nu$ the largest positive finite normal value. Then, in the notation just introduced, $e_{\mu}=\lg\mu=-126$ or $-1022$, and $e_{\nu}=\lfloor\lg\nu\rfloor=127$ or $1023$, for single or double precision, respectively. Lemma 2.1 can now be stated using this notation.

Lemma 2.1.

Assume that $e_{\nu}-p\geq 1$, with rounding to nearest. Then,

\mathop{\mathrm{fl}}(\nu+1)=\nu=\mathop{\mathrm{hypot}}(\nu,1). \tag{2}
Proof.

By the assumption, $\nu\geq 2^{p+1}+2^{p}+\cdots+2^{2}$, since $\nu=2^{e_{\nu}}(1.1\cdots 1)_{2}$, with $p$ ones. The bit in the last place thus represents a value of at least four. Adding one to $\nu$ would require rounding of the exact value $\nu+1=2^{e_{\nu}}\cdot(1.1\cdots 10\cdots 01)_{2}$ to $p$ bits of significand. The number of zeros is $e_{\nu}-p\geq 1$. Rounding to nearest in such a case is equivalent to truncating the trailing $e_{\nu}-p+1$ bits, starting from the leftmost zero, giving the result $\mathop{\mathrm{fl}}(\nu+1)=2^{e_{\nu}}(1.1\cdots 1)_{2}=\nu$. This proves the first equality in (2).

For the second equality in (2), note that $(\nu+1)^{2}=\nu^{2}+1+2\nu\geq\nu^{2}+1\geq\nu^{2}$ since $\nu>0$. By taking the square roots, it follows that $\nu+1\geq\sqrt{\nu^{2}+1}\geq\nu$, and therefore

\nu=\mathop{\mathrm{fl}}(\nu+1)\geq\mathop{\mathrm{fl}}(\sqrt{\nu^{2}+1})=\mathop{\mathrm{hypot}}(\nu,1)\geq\mathop{\mathrm{hypot}}(\nu,0)=\mathop{\mathrm{fl}}(\nu)=\nu,

since $\mathop{\mathrm{fl}}$ and $\mathop{\mathrm{hypot}}$ are monotone operations in all arguments. ∎

The claims of Lemma 2.1 and its corollaries below were used, and their proofs partially sketched, in, e.g., [21]. They are expanded and clarified here for completeness.

An underlined expression denotes a computed floating-point approximation of the exact value of that expression. Given $\tan(2\phi)$, for $0\leq|\phi|\leq\pi/4$, $\tan\phi$ and $\underline{\tan\phi}$ are

\tan\phi=\frac{\tan(2\phi)}{1+\sqrt{\tan^{2}(2\phi)+1}},\qquad\underline{\tan\phi}=\mathop{\mathrm{fl}}\left(\frac{\mathop{\mathrm{fl}}(\tan(2\phi))}{\mathop{\mathrm{fl}}(1+\mathop{\mathrm{hypot}}(\mathop{\mathrm{fl}}(\tan(2\phi)),1))}\right), \tag{3}

if $\tan(2\phi)$ and $\mathop{\mathrm{fl}}(\tan(2\phi))$ are finite, or $\mathop{\mathrm{sign}}(\tan(2\phi))$ and $\mathop{\mathrm{sign}}(\mathop{\mathrm{fl}}(\tan(2\phi)))$ otherwise.

Corollary 2.2.

Let $\tan(2\phi)$ be given, such that $\mathop{\mathrm{fl}}(\tan(2\phi))$ is finite. Then, under the assumptions of Lemma 2.1, $\underline{\tan\phi}$ from (3) satisfies $0\leq|\underline{\tan\phi}|\leq 1$.

Proof.

For $|\mathop{\mathrm{fl}}(\tan(2\phi))|=\nu$, due to Lemma 2.1, $\mathop{\mathrm{hypot}}(\mathop{\mathrm{fl}}(\tan(2\phi)),1)=\nu$, and so the denominator in (3) is $\mathop{\mathrm{fl}}(1+\nu)=\nu$. The numerator is always at most as large in magnitude as the denominator. Thus, $0\leq|\underline{\tan\phi}|\leq 1$, as claimed. ∎

Corollary 2.3.

Let $\tan\phi$ be given, for $|\phi|\leq\pi/2$. Then, $\sec\phi=1/\cos\phi$ can be approximated as $\underline{\sec\phi}=\mathop{\mathrm{hypot}}(\mathop{\mathrm{fl}}(\tan\phi),1)$. If $\tan\phi=\mathop{\mathrm{fl}}(\tan\phi)$, then $\underline{\sec\phi}=\mathop{\mathrm{fl}}(\sec\phi)$. When the assumptions of Lemma 2.1 hold and $\mathop{\mathrm{fl}}(\tan\phi)$ is finite, so is $\underline{\sec\phi}$.

Proof.

The approximation relation follows from the definition of $\mathop{\mathrm{hypot}}$ and from $\sec\phi=\sqrt{\tan^{2}\phi+1}$, while its finiteness for a finite $\mathop{\mathrm{fl}}(\tan\phi)$ follows from Lemma 2.1, since $|\mathop{\mathrm{fl}}(\tan\phi)|\leq\nu$ implies $\mathop{\mathrm{hypot}}(\mathop{\mathrm{fl}}(\tan\phi),1)\leq\nu$. ∎

For any $w\in\mathbb{R}$, let $\mathbf{w}=(e_{w},f_{w})=2^{e_{w}}f_{w}$ and $\mathop{\mathrm{fl}}(\mathbf{w})=\mathop{\mathrm{fl}}(2^{e_{w}}f_{w})\approx w$. Even $w$ such that $|w|>\nu$ or $0<|w|<\check{\mu}$, where $\check{\mu}$ is the smallest positive non-zero floating-point value, can be represented with a finite $e_{w}$ and a normalized $f_{w}$, though $\mathop{\mathrm{fl}}(\mathbf{w})$ is not finite or non-zero, respectively. The closest double precision approximation of $\mathbf{w}$ is $\underline{w}=\text{scalbn}(f_{w},e_{w})$, with a possible underflow or overflow, and similarly for single precision (using scalbnf). A similar definition could be made with $e_{w}^{\prime}$ and $f_{w}^{\prime}$ instead.

An overflow-avoiding addition and an underflow-avoiding subtraction of positive finite values $x$ and $y$, resulting in such exponent-“mantissa” pairs, can be defined as

x\oplus y=\begin{cases}(e_{z},f_{z}),&\text{if $z=\mathop{\mathrm{fl}}(x+y)\leq\nu$},\\ (e_{z}+1,f_{z}),&\text{otherwise, with $z=\mathop{\mathrm{fl}}(2^{-1}x+2^{-1}y)$},\end{cases} \tag{4}

and, assuming $x\neq y$ (for $x=y$ let $x\ominus y=(0,0)$ directly),

x\ominus y=\begin{cases}(e_{z},f_{z}),&\text{if $|z|\geq\mu$, with $z=\mathop{\mathrm{fl}}(x-y)$},\\ (e_{z}-c,f_{z}),&\text{otherwise, with $z=\mathop{\mathrm{fl}}(2^{c}x-2^{c}y)$},\end{cases} \tag{5}

where $c=e_{\mu}+p-1-\min\{e_{x},e_{y}\}$. In (4), $z\leq\nu$ in both cases, since $\mathop{\mathrm{fl}}(\nu/2+\nu/2)=\nu$.

Lemma 2.4.

In the second case in (5), $c>0$ and $\mu\leq|z|\leq\nu$, if $e_{\nu}\geq e_{\mu}+2p-1$.

Proof.

Assume $x>y$ (else, swap $x$ and $y$, and change the sign of $z$). Then, $e_{y}\leq e_{x}$ and $y=2^{e_{y}}(1.\mathtt{y}_{-1}\cdots\mathtt{y}_{1-p})_{2}$. The rightmost bit $\mathtt{y}_{1-p}$ multiplies the value $w=2^{e_{y}+1-p}$.

If $e_{x}-e_{y}\geq p+1$ then $x$ is normal and, due to rounding to nearest, $\mathop{\mathrm{fl}}(x-y)=x$. Therefore, assume that $e_{x}-e_{y}\leq p$. If $w\geq\mu=2^{e_{\mu}}$, then $x-y\geq\mu$ as well, since $x-y\geq w$. Thus, $\mathop{\mathrm{fl}}(x-y)\geq w\geq\mu$, so assume that $w<\mu$, i.e., $e_{y}+1-p<e_{\mu}$.

It now suffices to upscale $x$ and $y$ to $x^{\prime\prime}=2^{c}x$ and $y^{\prime\prime}=2^{c}y$, for some $c\in\mathbb{N}$, to ensure $e_{y}^{\prime\prime}=e_{y}+c\geq e_{\mu}+p-1$. Any $c\geq e_{\mu}+p-1-e_{y}$ that will not overflow $x^{\prime\prime}$ will do, so the smallest one is chosen. Note that $e_{x}^{\prime\prime}=e_{x}+c=e_{x}+e_{\mu}+p-e_{y}-1$. Since $e_{x}-e_{y}\leq p$, by the Lemma's assumption it holds that $e_{x}^{\prime\prime}\leq e_{\mu}+2p-1\leq e_{\nu}$. ∎

Several arithmetic operations on $(e,f)$-pairs can be defined (see also [22]), such as

\begin{gathered}|\mathbf{x}|=(e_{x},|f_{x}|),\qquad-\mathbf{x}=(e_{x},-f_{x}),\qquad 2^{\varsigma}\odot\mathbf{x}=(\varsigma+e_{x},f_{x}),\\ 1\oslash\mathbf{y}=(e_{z}-e_{y},f_{z}),\quad z=\mathop{\mathrm{fl}}(1/f_{y}),\end{gathered} \tag{6}

which are unary operations. The binary multiplication and division are defined as

\begin{gathered}\mathbf{x}\odot\mathbf{y}=(e_{x}+e_{y}+e_{z},f_{z}),\quad z=\mathop{\mathrm{fl}}(f_{x}\cdot f_{y}),\\ \mathbf{x}\oslash\mathbf{y}=(e_{x}-e_{y}+e_{z},f_{z}),\quad z=\mathop{\mathrm{fl}}(f_{x}/f_{y}),\end{gathered} \tag{7}

and the relation $\prec$, which compares the represented values in the $<$ sense, is given as

\begin{aligned}\mathbf{x}\prec\mathbf{y}\iff{}&(\mathop{\mathrm{sign}}(f_{x})<\mathop{\mathrm{sign}}(f_{y}))\\ {}\vee{}&((\mathop{\mathrm{sign}}(f_{x})=\mathop{\mathrm{sign}}(f_{y}))\wedge((e_{x}<e_{y})\vee((e_{x}=e_{y})\wedge(f_{x}<f_{y})))).\end{aligned} \tag{8}

Let, for any $G$ of order $n$, where INT_MIN is the smallest representable integer,

e_{G}=\max_{1\leq i,j\leq n}e_{ij},\quad e_{ij}=\max\{\left\lfloor\lg g_{ij}\right\rfloor,\text{INT\_MIN}\}. \tag{9}

A prescaling of $G$ as $\underline{G^{\prime}}=2^{s}G$, which avoids overflows, and underflows if possible, in the course of computing the SVD of $\underline{G^{\prime}}$ (and thus of $\underline{G}\approx 2^{-s}\underline{G^{\prime}}$), is defined by $s$ such that

e_{G}=\text{INT\_MIN}\implies s=0,\qquad e_{G}>\text{INT\_MIN}\implies s=e_{\nu}-e_{G}-\mathfrak{s},\quad\mathfrak{s}\geq 0, \tag{10}

where $\mathfrak{s}=0$ for $n=1$. For $n=2$, $\mathfrak{s}$ is chosen such that certain intermediate results while computing the SVD of $\underline{G^{\prime}}$ cannot overflow, as explained in [14, 15] and Section 3, but the final singular values are represented in the $(e,f)$ form, and are immune from overflow and underflow as long as they are not converted to simple floating-point values. If $s\geq 0$, the result of such a prescaling is exact. Otherwise, some elements of $\underline{G^{\prime}}$ might be computed inexactly due to underflow. If the elements of $G$ satisfy

g_{ij}\neq 0\implies\mu\leq|g_{ij}|\leq\nu/2^{\mathfrak{s}},\quad 1\leq i,j\leq n, \tag{11}

then $s\geq 0$, and $g_{ij}^{\prime}=0$ or $\mu\leq|g_{ij}^{\prime}|\leq\nu$, i.e., the elements of $G^{\prime}$ are zero or normal.

3 The SVD of general matrices of order two

This section presents a Kogbetliantz-like procedure for computing the singular values of $G$ when $n=2$, and the matrices of the left ($U$) and the right ($V$) singular vectors.

In general, $U$ is a product of permutations (denoted by $P$ with subscripts and including $I$), sign matrices (denoted by $S$ with subscripts) with each diagonal element being either $1$ or $-1$ while the rest are zeros, and plane rotations by the angles $\vartheta$ and $\varphi$. If $U_{\vartheta}$ is not generated, the notation changes from $U_{\varphi}$ to $U_{\phi}$. Likewise, $V$ is a product of permutations, a sign matrix, and a plane rotation by the angle $\psi$, where

U_{\vartheta}=\begin{bmatrix}\cos{\vartheta}&-\sin{\vartheta}\\ \sin{\vartheta}&\hphantom{-}\cos{\vartheta}\end{bmatrix},\qquad U_{\varphi}=\begin{bmatrix}\cos{\varphi}&-\sin{\varphi}\\ \sin{\varphi}&\hphantom{-}\cos{\varphi}\end{bmatrix},\qquad V_{\psi}=\begin{bmatrix}\cos{\psi}&-\sin{\psi}\\ \sin{\psi}&\hphantom{-}\cos{\psi}\end{bmatrix}. \tag{12}

Depending on its pattern of zeros, a matrix of order two falls into one of the 16 types $\mathfrak{t}$ shown in (13), where $\circ=0$ and $\bullet\neq 0$. Some types are permutationally equivalent to others, which is denoted by $\mathfrak{t}_{1}\cong\mathfrak{t}_{2}$, and means that a $\mathfrak{t}_{1}$-matrix can be pre-multiplied and/or post-multiplied by permutations to be transformed into a $\mathfrak{t}_{2}$-matrix, and vice versa, keeping the number of zeros intact. Each $\mathfrak{t}\neq 0$ has its associated scale type $\mathfrak{s}$.

\begin{gathered}\begin{matrix}0&1&2&3&4&5&6&7\\ \begin{bmatrix}\circ&\circ\\ \circ&\circ\end{bmatrix}&\begin{bmatrix}\bullet&\circ\\ \circ&\circ\end{bmatrix}&\begin{bmatrix}\circ&\circ\\ \bullet&\circ\end{bmatrix}&\begin{bmatrix}\bullet&\circ\\ \bullet&\circ\end{bmatrix}&\begin{bmatrix}\circ&\bullet\\ \circ&\circ\end{bmatrix}&\begin{bmatrix}\bullet&\bullet\\ \circ&\circ\end{bmatrix}&\begin{bmatrix}\circ&\bullet\\ \bullet&\circ\end{bmatrix}&\begin{bmatrix}\bullet&\bullet\\ \bullet&\circ\end{bmatrix}\\ \begin{bmatrix}\circ&\circ\\ \circ&\bullet\end{bmatrix}&\begin{bmatrix}\bullet&\circ\\ \circ&\bullet\end{bmatrix}&\begin{bmatrix}\circ&\circ\\ \bullet&\bullet\end{bmatrix}&\begin{bmatrix}\bullet&\circ\\ \bullet&\bullet\end{bmatrix}&\begin{bmatrix}\circ&\bullet\\ \circ&\bullet\end{bmatrix}&\begin{bmatrix}\bullet&\bullet\\ \circ&\bullet\end{bmatrix}&\begin{bmatrix}\circ&\bullet\\ \bullet&\bullet\end{bmatrix}&\begin{bmatrix}\bullet&\bullet\\ \bullet&\bullet\end{bmatrix}\\ 8&9&10&11&12&13&14&15\end{matrix}\quad\begin{matrix}\mathfrak{t}&\mathfrak{s}\\ 0,1,2,4,6,8\cong\mathbf{9}&0\\ 12\cong\mathbf{3};\quad 10\cong\mathbf{5}&1\\ 7,11,14\cong\mathbf{13}&1\\ \mathbf{15}&2\end{matrix}\end{gathered} \tag{13}

For $\mathfrak{s}=0$, there is one equivalence class of matrix types, represented by $\mathfrak{t}=9$. For $\mathfrak{s}=1$, there are three classes, represented by $\mathfrak{t}=3$, $\mathfrak{t}=5$, and $\mathfrak{t}=13$, while for $\mathfrak{s}=2$ there is one class, $\mathfrak{t}=15$. The SVD computation for the first three classes is straightforward, while for the fourth and the fifth classes it is more involved. A matrix of any type, except $\mathfrak{t}=15$, can be permuted into an upper triangular one. If a matrix so obtained is well scaled, its SVD can alternatively be computed by xLASV2. However, xLASV2 does not accept general matrices (i.e., $\mathfrak{t}=15$), unlike the proposed method, which is a modification of [15] when $J=I$, and consists of the following three phases:

  1. For $G$ determine $\mathfrak{t}$, $\mathfrak{s}$, and $s$ to obtain $\underline{G^{\prime}}$. Handle the simple cases of $\mathfrak{t}$ separately.

  2. If $\mathfrak{t}\cong 13$ or $\mathfrak{t}=15$, factorize $\underline{G^{\prime}}$ as $U_{+}RV_{+}$, such that $U_{+}$ and $V_{+}$ are orthogonal, and $R$ is upper triangular, with $\min\{r_{11},r_{12},r_{22}\}>0$ and all $r_{ij}$ finite, $1\leq i\leq j\leq 2$.

  3. From the SVD of $\underline{R}$ assemble the SVD of $\underline{G^{\prime}}$. Optionally backscale $\underline{\Sigma^{\prime}}$ by $2^{-s}$.

The phases 1, 2, and 3 are described in Sections 3.1, 3.2, and 3.3, respectively.

3.1 Prescaling of the matrix and the simple cases ($\mathfrak{t}\cong 3,5,9$)

Matrices with $\mathfrak{t}\cong 9$ do not have to be scaled, but only permuted into the $\mathfrak{t}=0$, $\mathfrak{t}=1$, or $\mathfrak{t}=9$ (where the first diagonal element is not smaller in magnitude than the second one) form, according to their number of non-zeros, with at most one permutation from the left and at most one from the right hand side. Then, the rows of $P_{U}^{T}GP_{V}$ are multiplied by the signs of their diagonal elements, to obtain $\sigma_{1}=|g_{11}|$ and $\sigma_{2}=|g_{22}|$, while $U=P_{U}S$ and $V=P_{V}$. The error-free SVD computation is thus completed.

Note that the signs might have been taken out of the columns instead of the rows, and the sign matrix $S$ would then have been incorporated into $V$ instead. The structure of the left and the right singular vector matrices is therefore not uniquely determined.

Be aware that $\mathfrak{t}$, determined before the prescaling (to compute $\mathfrak{s}$ and $s$), may differ from the type $\mathfrak{t}^{\prime}$ that would be found afterwards. If, e.g., $\mathfrak{t}\ncong 9$ and $G$ contains, among others, $\nu$ and $\check{\mu}$ as elements, the element(s) $\check{\mu}$ will vanish after the prescaling since $s<0$ (from (10), due to $\mathfrak{s}\geq 1$), so $\mathfrak{t}^{\prime}<\mathfrak{t}$ and the zero pattern of $\underline{G^{\prime}}$ has to be re-examined.

A $\mathfrak{t}\cong 3$ or $\mathfrak{t}\cong 5$ matrix is scaled by $2^{s}$. The columns (resp., rows) of a $\mathfrak{t}^{\prime}=12$ (resp., $\mathfrak{t}^{\prime}=10$) matrix are swapped, to bring it to the $\mathfrak{t}^{\prime\prime}=3$ (resp., $\mathfrak{t}^{\prime\prime}=5$) form. Then, the non-zero elements are made positive by multiplying the rows (resp., columns) by their signs. Next, the rows (resp., columns) are swapped if required, to make the upper left element the largest in magnitude. The sign-extracting and magnitude-ordering operations may be swapped or combined. The resulting matrix $G^{\prime\prime}$ undergoes the QR (resp., RQ) factorization by a single Givens rotation $U_{\theta}^{T}$ (resp., $V_{\theta}$), determined by $\tan\theta$ (consequently, by $\cos\theta$ and $\sin\theta$) as in (12), with $\theta$ substituted for $\vartheta$ (resp., $\psi$), where

\mathfrak{t}^{\prime\prime}=3\implies\tan\theta=g_{21}^{\prime\prime}/g_{11}^{\prime\prime},\qquad\mathfrak{t}^{\prime\prime}=5\implies\tan\theta=g_{12}^{\prime\prime}/g_{11}^{\prime\prime}. \tag{14}

By construction, $0<\tan\theta\leq 1$ and $0\leq\underline{\tan\theta}\leq 1$. The upper left element is not transformed, but explicitly set to hold the Frobenius norm of the whole non-zero column (resp., row), as $\underline{g_{11}^{\prime\prime\prime}}=\mathop{\mathrm{hypot}}(\underline{g_{11}^{\prime\prime}},\underline{g_{21}^{\prime\prime}})$ (resp., $\underline{g_{11}^{\prime\prime\prime}}=\mathop{\mathrm{hypot}}(\underline{g_{11}^{\prime\prime}},\underline{g_{12}^{\prime\prime}})$), while the other non-zero element is zeroed out. Thus, to avoid overflow of $\underline{g_{11}^{\prime\prime\prime}}$ it is sufficient to ensure that $\underline{g_{11}^{\prime\prime}}\ll\nu/\sqrt{2}$, which $\mathfrak{s}=1$ achieves. The SVD is given by $U=S_{U}P_{U}U_{\theta}$ and $V=P_{V}$ for $\mathfrak{t}^{\prime}\cong 3$, and by $U=P_{U}$ and $V=S_{V}P_{V}V_{\theta}$ for $\mathfrak{t}^{\prime}\cong 5$. The scaled singular values are $\sigma_{1}^{\prime}=g_{11}^{\prime\prime\prime}$ and $\sigma_{2}^{\prime}=0$ in both cases, and $\underline{\sigma_{1}^{\prime}}$ cannot overflow ($\underline{\sigma_{1}}=2^{-s}\underline{\sigma_{1}^{\prime}}$ can).

If no inexact underflow occurs while scaling $G$ to $G^{\prime}$, then $\underline{\sigma_{1}^{\prime}}=\sigma_{1}^{\prime}(1+\epsilon_{1}^{\prime})$, where $|\epsilon_{1}^{\prime}|\leq\varepsilon$. With the same assumption, $\underline{\tan\theta}\geq\mu$ implies $\underline{\tan\theta}=\tan\theta(1+\epsilon_{\theta})$, where $|\epsilon_{\theta}|\leq\varepsilon$. The resulting Givens rotation can be represented and applied as one of

\underline{U_{\theta}^{T}}=\begin{bmatrix}1&\underline{\tan\theta}\\ -\underline{\tan\theta}&1\end{bmatrix}/\underline{\sec\theta},\qquad\underline{V_{\theta}}=\begin{bmatrix}1&-\underline{\tan\theta}\\ \underline{\tan\theta}&1\end{bmatrix}/\underline{\sec\theta}, \tag{15}

which avoids computing $\underline{\cos\theta}$ and $\underline{\sin\theta}$ explicitly. Lemma 3.1 bounds the error in $\underline{\sec\theta}$.

Lemma 3.1.

Let $\underline{\sec\theta}$ from (15) be computed as $\mathop{\mathrm{hypot}}(\underline{\tan\theta},1)$ for $\underline{\tan\theta}\geq\mu$. Then,

\underline{\sec\theta}=\delta_{\theta}^{\prime}\sec\theta,\quad\sqrt{((1-\varepsilon)^{2}+1)/2}\,(1-\varepsilon)\leq\delta_{\theta}^{\prime}\leq\sqrt{((1+\varepsilon)^{2}+1)/2}\,(1+\varepsilon). \tag{16}
Proof.

Let δθ=(1+ϵθ)\delta_{\theta}=(1+\epsilon_{\theta}). Then tanθ¯μ\underline{\tan\theta}\geq\mu implies (tanθ¯)2=δθ2tan2θ(\underline{\tan\theta})^{2}=\delta_{\theta}^{2}\tan^{2}\theta, and

1ε=δθδθδθ+=1+ε.1-\varepsilon=\delta_{\theta}^{-}\leq\delta_{\theta}\leq\delta_{\theta}^{+}=1+\varepsilon.

Express (tanθ¯)2+1=δθ2tan2θ+1(\underline{\tan\theta})^{2}+1=\delta_{\theta}^{2}\tan^{2}\theta+1 as (tan2θ+1)(1+ϵθ)(\tan^{2}\theta+1)(1+\epsilon_{\theta}^{\prime}), from which it follows

ϵθ=tan2θtan2θ+1(δθ21),0<tan2θtan2θ+112.\epsilon_{\theta}^{\prime}=\frac{\tan^{2}\theta}{\tan^{2}\theta+1}(\delta_{\theta}^{2}-1),\qquad 0<\frac{\tan^{2}\theta}{\tan^{2}\theta+1}\leq\frac{1}{2}.

By adding unity to both sides of the equation for ϵθ\epsilon_{\theta}^{\prime} and taking the maximal value of the first factor on its right hand side, while accounting for the bounds of δθ\delta_{\theta}, it holds

((δθ)2+1)/21+ϵθ((δθ+)2+1)/2.((\delta_{\theta}^{-})^{2}+1)/2\leq 1+\epsilon_{\theta}^{\prime}\leq((\delta_{\theta}^{+})^{2}+1)/2.

Since secθ¯=hypot(tanθ¯,1)=(tanθ¯)2+1(1+ϵ)=(tan2θ+1)(1+ϵθ)(1+ϵ)\underline{\sec\theta}=\mathop{\mathrm{hypot}}(\underline{\tan\theta},1)=\sqrt{(\underline{\tan\theta})^{2}+1}(1+\epsilon_{\sqrt{\hbox{}}})=\sqrt{(\tan^{2}\theta+1)(1+\epsilon_{\theta}^{\prime})}(1+\epsilon_{\sqrt{\hbox{}}}), where |ϵ|ε|\epsilon_{\sqrt{\hbox{}}}|\leq\varepsilon, factorizing the last square root into a product of square roots gives

secθ¯=secθ1+ϵθ(1+ϵ).\underline{\sec\theta}=\sec\theta\sqrt{1+\epsilon_{\theta}^{\prime}}(1+\epsilon_{\sqrt{\hbox{}}}).

The proof is concluded by denoting the error factor on the right hand side by δθ\delta_{\theta}^{\prime}. ∎

This proof and the following one use several techniques from [21, Theorem 1]. Since the matrix that UθTU_{\theta}^{T} or VθV_{\theta} is applied to contains one zero and one ±1\pm 1 in each row and column, UU or VV has ±cosθ¯\pm\underline{\cos\theta} and ±sinθ¯\pm\underline{\sin\theta} in each row and column, computed implicitly, for which Lemma 3.2 gives error bounds.

Lemma 3.2.

Let cosθ¯\underline{\cos\theta} and sinθ¯\underline{\sin\theta} result from applying (15) with tanθ¯μ\underline{\tan\theta}\geq\mu. Then,

cosθ¯=δθ′′cosθ,δθ′′=(1+ϵ/)/δθ,sinθ¯=δθ′′′sinθ,δθ′′′=(1+ϵ/)δθ/δθ,\underline{\cos\theta}=\delta_{\theta}^{\prime\prime}\cos\theta,\quad\delta_{\theta}^{\prime\prime}=(1+\epsilon_{/})/\delta_{\theta}^{\prime},\qquad\underline{\sin\theta}=\delta_{\theta}^{\prime\prime\prime}\sin\theta,\quad\delta_{\theta}^{\prime\prime\prime}=(1+\epsilon_{/}^{\prime})\delta_{\theta}/\delta_{\theta}^{\prime}, (17)

where max{|ϵ/|,|ϵ/|}ε\max\{|\epsilon_{/}|,|\epsilon_{/}^{\prime}|\}\leq\varepsilon and δθ′′\delta_{\theta}^{\prime\prime} and δθ′′′\delta_{\theta}^{\prime\prime\prime} can be bounded below and above in terms of ε\varepsilon only. Let δθ\delta_{\theta}^{\prime-} and δθ+\delta_{\theta}^{\prime+} be the lower and the upper bounds for δθ\delta_{\theta}^{\prime} from (16). Then,

(1ε)/δθ+δθ′′(1+ε)/δθ,(1ε)δθ/δθ+δθ′′′(1+ε)δθ+/δθ.(1-\varepsilon)/\delta_{\theta}^{\prime+}\leq\delta_{\theta}^{\prime\prime}\leq(1+\varepsilon)/\delta_{\theta}^{\prime-},\qquad(1-\varepsilon)\delta_{\theta}^{-}/\delta_{\theta}^{\prime+}\leq\delta_{\theta}^{\prime\prime\prime}\leq(1+\varepsilon)\delta_{\theta}^{+}/\delta_{\theta}^{\prime-}. (18)
Proof.

The claims follow from (16), cosθ¯=fl(1/secθ¯)\underline{\cos\theta}=\mathop{\mathrm{fl}}(1/\underline{\sec\theta}), and sinθ¯=fl(tanθ¯/secθ¯)\underline{\sin\theta}=\mathop{\mathrm{fl}}(\underline{\tan\theta}/\underline{\sec\theta}). ∎

If, in (14), tanθ¯<μ\underline{\tan\theta}<\mu, then secθ¯=1\underline{\sec\theta}=1 since 1secθ¯fl(1+μ2)fl(1+μ)=11\leq\underline{\sec\theta}\leq\mathop{\mathrm{fl}}(\sqrt{1+\mu^{2}})\leq\mathop{\mathrm{fl}}(1+\mu)=1, and the relative error in secθ¯\underline{\sec\theta} is below ε\varepsilon for any standard floating-point datatype. Thus, even though tanθ¯\underline{\tan\theta} can be relatively inaccurate, (16) holds for all secθ¯\underline{\sec\theta}. Also, cosθ¯\underline{\cos\theta} is always relatively accurate, but sinθ¯\underline{\sin\theta} might not be if tanθ¯<μ\underline{\tan\theta}<\mu, when sinθ¯=tanθ¯\underline{\sin\theta}=\underline{\tan\theta}.
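To make the use of (15) and (17) concrete, the sketch below (function name ours) evaluates secθ\sec\theta, cosθ\cos\theta, and sinθ\sin\theta from tanθ\tan\theta alone, assuming a correctly rounded hypot\mathop{\mathrm{hypot}}, as Python’s math.hypot is on common platforms; a minimal illustration, not the paper’s implementation.

```python
import math

def rotation_from_tan(tan_theta):
    # Given 0 <= tan(theta) <= 1, evaluate the functions of theta used in
    # (15) and (17) without forming theta itself.
    sec = math.hypot(tan_theta, 1.0)  # Lemma 3.1: relative error within (16)
    cos = 1.0 / sec                   # fl(1 / sec(theta))
    sin = tan_theta / sec             # fl(tan(theta) / sec(theta))
    return sec, cos, sin

sec, cos, sin = rotation_from_tan(0.75)
assert abs(cos * cos + sin * sin - 1.0) <= 4e-16  # numerically orthogonal
```

With tanθ=0.75\tan\theta=0.75 all intermediate results happen to be exact (sec = 1.25, cos = 0.8, sin = 0.6), which makes the identity check trivial; for general inputs the errors are bounded by Lemmas 3.1 and 3.2.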

3.2 A (pivoted) URVURV factorization of order two (𝔱13,15\mathfrak{t}^{\prime}\cong 13,15)

If, after the prescaling, 𝔱13\mathfrak{t}^{\prime}\cong 13 or 𝔱=15\mathfrak{t}^{\prime}=15, G¯\underline{G^{\prime}} is transformed into an upper triangular matrix RR with all elements non-negative, i.e., a special URV factorization of G¯\underline{G^{\prime}} is computed. Section 3.2.1 deals with the 𝔱13\mathfrak{t}^{\prime}\cong 13 case, and Section 3.2.2 with the 𝔱=15\mathfrak{t}^{\prime}=15 case.

3.2.1 An error-free transformation from 𝔱13\mathfrak{t}^{\prime}\cong 13 to 𝔱′′=13\mathfrak{t}^{\prime\prime}=13 form

A triangular or anti-triangular matrix is first permuted into an upper triangular one, G′′G^{\prime\prime}. Its first row is then multiplied by the sign of g11′′g_{11}^{\prime\prime}, which might change the sign of g12′′g_{12}^{\prime\prime}. The second column is multiplied by the new sign of g12′′g_{12}^{\prime\prime}, which might change the sign of g22′′g_{22}^{\prime\prime}. The second row is finally multiplied by the new sign of g22′′g_{22}^{\prime\prime}, which completes the construction of RR.

The transformations U+TU_{+}^{T} and V+V_{+}, such that R=U+TGV+R=U_{+}^{T}G^{\prime}V_{+}, can be expressed as U+=PUS11S22=U+¯U_{+}=P_{U}S_{11}S_{22}=\underline{U_{+}} and V+=PVS12=V+¯V_{+}=P_{V}S_{12}=\underline{V_{+}}, and are exact, as is R¯\underline{R} if G¯\underline{G^{\prime}} is.
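The sign manipulations of this subsection are error-free and can be sketched in a few lines of Python (function name ours); the three recorded signs correspond to the diagonal scalings S11S_{11}, S12S_{12}, and S22S_{22} from which U+U_{+} and V+V_{+} are assembled.

```python
def nonnegative_triangle(G):
    # Error-free sign transformation of Section 3.2.1: make the upper
    # triangle of an upper triangular 2x2 G non-negative, recording the
    # three signs (row 1, column 2, row 2) applied along the way.
    sgn = lambda x: -1.0 if x < 0.0 else 1.0
    (g11, g12), (_, g22) = G
    s1 = sgn(g11)                  # multiply row 1 by sign(g11)
    g11, g12 = s1 * g11, s1 * g12
    s2 = sgn(g12)                  # multiply column 2 by the new sign of g12
    g12, g22 = s2 * g12, s2 * g22
    s3 = sgn(g22)                  # multiply row 2 by the new sign of g22
    g22 = s3 * g22
    return (s1, s2, s3), [[g11, g12], [0.0, g22]]

signs, R = nonnegative_triangle([[-3.0, 2.0], [0.0, -5.0]])
assert R == [[3.0, 2.0], [0.0, 5.0]]
```

Since only signs are flipped, no rounding occurs, matching the exactness claim above.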

3.2.2 A fully pivoted URV when 𝔱=15\mathfrak{t}^{\prime}=15

In all previous cases, a sequence of error-free transformations would bring GG^{\prime} into an upper triangular G′′G^{\prime\prime}, of which xLASV2 can compute the SVD. However, a matrix without zeros either has to be preprocessed into such a form, in the spirit of [14, 15], or its SVD has to be computed by more complicated and numerically less stable formulas, which follow from the annihilation requirement for the off-diagonal matrix elements as

tan(2ϕ)2=g11g21+g12g22g112+g122g212g222,tanψ=g12+g22tanϕg11+g21tanϕ.\frac{\tan(2\phi)}{2}=\frac{g_{11}g_{21}+g_{12}g_{22}}{g_{11}^{2}+g_{12}^{2}-g_{21}^{2}-g_{22}^{2}},\qquad\tan\psi=\frac{g_{12}+g_{22}\tan\phi}{g_{11}+g_{21}\tan\phi}. (19)

A sketched derivation of (19) can be found in Section 1 of the supplementary material.

Opting for the first approach, compute the Frobenius norms of the columns of GG^{\prime}, as w1w_{1} and w2w_{2}. Due to the prescaling, w1¯=hypot(g11¯,g21¯)\underline{w_{1}}=\mathop{\mathrm{hypot}}(\underline{g_{11}^{\prime}},\underline{g_{21}^{\prime}}) and w2¯=hypot(g12¯,g22¯)\underline{w_{2}}=\mathop{\mathrm{hypot}}(\underline{g_{12}^{\prime}},\underline{g_{22}^{\prime}}) cannot overflow. If w1<w2w_{1}<w_{2}, swap the columns and their norms (so that w1w_{1}^{\prime} would be the norm of the new first column of G~\widetilde{G}^{\prime}, and w2w_{2}^{\prime} the norm of the second one). Multiply each row by the sign of its new first element to get G′′G^{\prime\prime}. Swap the rows if g11′′<g21′′g_{11}^{\prime\prime}<g_{21}^{\prime\prime} to get the fully pivoted G′′′G^{\prime\prime\prime}, while the norms remain unchanged. Note that g11′′′g21′′′>0g_{11}^{\prime\prime\prime}\geq g_{21}^{\prime\prime\prime}>0.

Now the QR factorization of G′′′G^{\prime\prime\prime} is computed as UϑTG′′′=R′′U_{\vartheta}^{T}G^{\prime\prime\prime}=R^{\prime\prime}. Then, r21′′=0r_{21}^{\prime\prime}=0, and

r11′′=w1,tanϑ=g21′′′g11′′′,r12′′=g12′′′+g22′′′tanϑsecϑ,r22′′=g22′′′g12′′′tanϑsecϑ.r_{11}^{\prime\prime}=w_{1}^{\prime},\quad\tan\vartheta=\frac{g_{21}^{\prime\prime\prime}}{g_{11}^{\prime\prime\prime}},\quad r_{12}^{\prime\prime}=\frac{g_{12}^{\prime\prime\prime}+g_{22}^{\prime\prime\prime}\tan\vartheta}{\sec\vartheta},\quad r_{22}^{\prime\prime}=\frac{g_{22}^{\prime\prime\prime}-g_{12}^{\prime\prime\prime}\tan\vartheta}{\sec\vartheta}. (20)

All properties of the functions of θ\theta from Section 3.1 also hold for the functions of ϑ\vartheta. The prescaling of GG causes the elements of R′′R^{\prime\prime} to be at most ν/(22)\nu/(2\sqrt{2}) in magnitude.
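A plain-arithmetic Python sketch of the QR step (20) (function name ours; the fma of the actual method is replaced here by an ordinary multiply-add, and w1w_{1}^{\prime} is computed by hypot\mathop{\mathrm{hypot}}):

```python
import math

def qr2(g11, g12, g21, g22):
    # Givens QR factorization (20) of the fully pivoted G''' with
    # g11 >= g21 > 0; r21 is zero by construction.
    t = g21 / g11                  # tan(vartheta) <= 1 due to the pivoting
    sec = math.hypot(t, 1.0)       # sec(vartheta), as in Lemma 3.1
    r11 = math.hypot(g11, g21)     # w1', the norm of the first column
    r12 = (g12 + g22 * t) / sec    # an fma would be used here
    r22 = (g22 - g12 * t) / sec    # the subtraction prone to cancellation
    return r11, r12, r22

# with G''' = [[4, 1], [3, 2]]: tan = 0.75, sec = 1.25, and all results exact
assert qr2(4.0, 1.0, 3.0, 2.0) == (5.0, 2.0, 1.0)
```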

If r12′′¯=0\underline{r_{12}^{\prime\prime}}=0, then 𝔱′′=9\mathfrak{t}^{\prime\prime}=9, and if r22′′¯=0\underline{r_{22}^{\prime\prime}}=0, then 𝔱′′=5\mathfrak{t}^{\prime\prime}=5. In either case R′′¯\underline{R^{\prime\prime}} is processed further as in Section 3.1, while accounting for the already applied transformations. Else, the second column of R′′R^{\prime\prime} is multiplied by the sign of r12′′r_{12}^{\prime\prime} to obtain RR^{\prime}. The second row of RR^{\prime} is multiplied by the sign of r22r_{22}^{\prime} to finalize RR, in which the upper triangle is positive. It is evident how to construct U+U_{+} and V+=V+¯V_{+}=\underline{V_{+}} such that R=U+TGV+R=U_{+}^{T}G^{\prime}V_{+}, since

U+T=S22TUϑTPUTS1T,S1=diag(sign(g~11),sign(g~21)),V+=PVS12.U_{+}^{T}=S_{22}^{T}U_{\vartheta}^{T}P_{U}^{T}S_{1}^{T},\quad S_{1}=\mathop{\mathrm{diag}}(\mathop{\mathrm{sign}}(\tilde{g}_{11}^{\prime}),\mathop{\mathrm{sign}}(\tilde{g}_{21}^{\prime})),\qquad V_{+}=P_{V}S_{12}. (21)

However, U+T¯\underline{U_{+}^{T}} is not explicitly formed, as explained in Section 3.3. Now 𝔱′′=13\mathfrak{t}^{\prime\prime}=13 for R¯\underline{R}.

Lemma 3.3 bounds the relative errors in (some of) the elements of R¯\underline{R}, when possible.

Lemma 3.3.

Assume that no inexact underflow occurs at any stage of the above computation, leading from GG to R¯\underline{R}. Then, r11¯=δ0r11\underline{r_{11}}=\delta_{0}r_{11}, where

1ε=δ0δ0δ0+=1+ε.1-\varepsilon=\delta_{0}^{-}\leq\delta_{0}\leq\delta_{0}^{+}=1+\varepsilon. (22)

If in tanϑ¯=δϑtanϑ\underline{\tan\vartheta}=\delta_{\vartheta}\tan\vartheta holds δϑ=1\delta_{\vartheta}=1, then r12¯=δ1r12\underline{r_{12}}=\delta_{1}^{\prime}r_{12} and r22¯=δ1′′r22\underline{r_{22}}=\delta_{1}^{\prime\prime}r_{22}, where

(1ε)2((1+ε)2+1)/2(1+ε)=δ1δ1,δ1′′δ1+=(1+ε)2((1ε)2+1)/2(1ε).\frac{(1-\varepsilon)^{2}}{\sqrt{((1+\varepsilon)^{2}+1)/2}(1+\varepsilon)}=\delta_{1}^{-}\leq\delta_{1}^{\prime},\delta_{1}^{\prime\prime}\leq\delta_{1}^{+}=\frac{(1+\varepsilon)^{2}}{\sqrt{((1-\varepsilon)^{2}+1)/2}(1-\varepsilon)}. (23)

Else, if g12′′′¯\underline{g_{12}^{\prime\prime\prime}} and g22′′′¯\underline{g_{22}^{\prime\prime\prime}} are of the same sign, then r12¯=δ2r12\underline{r_{12}}=\delta_{2}^{\prime}r_{12}, and if they are of the opposite signs, then r22¯=δ2′′r22\underline{r_{22}}=\delta_{2}^{\prime\prime}r_{22}, where, with 1ε=δϑδϑδϑ+=1+ε1-\varepsilon=\delta_{\vartheta}^{-}\leq\delta_{\vartheta}\leq\delta_{\vartheta}^{+}=1+\varepsilon and δϑ1\delta_{\vartheta}\neq 1,

(1ε)3((1+ε)2+1)/2(1+ε)=δ2<δ2,δ2′′<δ2+=(1+ε)3((1ε)2+1)/2(1ε).\frac{(1-\varepsilon)^{3}}{\sqrt{((1+\varepsilon)^{2}+1)/2}(1+\varepsilon)}=\delta_{2}^{-}<\delta_{2}^{\prime},\delta_{2}^{\prime\prime}<\delta_{2}^{+}=\frac{(1+\varepsilon)^{3}}{\sqrt{((1-\varepsilon)^{2}+1)/2}(1-\varepsilon)}. (24)
Proof.

Eq. (22) follows from the correct rounding of hypot\mathop{\mathrm{hypot}} in the computation of w1¯\underline{w_{1}^{\prime}}.

To prove (23) and (24), solve x±yδϑtanϑ=(x±ytanϑ)(1+ϵ±)x\pm y\delta_{\vartheta}\tan\vartheta=(x\pm y\tan\vartheta)(1+\epsilon_{\pm}) for ϵ±\epsilon_{\pm} with xy0xy\neq 0. After expanding and rearranging the terms, it follows that

ϵ±=±ytanϑx±ytanϑ(δϑ1),δϑ=1+ϵϑ,|ϵϑ|ε.\epsilon_{\pm}=\frac{\pm y\tan\vartheta}{x\pm y\tan\vartheta}(\delta_{\vartheta}-1),\qquad\delta_{\vartheta}=1+\epsilon_{\vartheta},\quad|\epsilon_{\vartheta}|\leq\varepsilon. (25)

If xx and yy are of the same sign and the addition operation is chosen, the first factor on the first right hand side in (25) is above zero and below unity, so |ϵ±|<|δϑ1||\epsilon_{\pm}|<|\delta_{\vartheta}-1| and

δϑ=δ±<δ±<δ±+=δϑ+,δ±=1+ϵ±.\delta_{\vartheta}^{-}=\delta_{\pm}^{-}<\delta_{\pm}<\delta_{\pm}^{+}=\delta_{\vartheta}^{+},\qquad\delta_{\pm}=1+\epsilon_{\pm}. (26)

The same holds if xx and yy are of the opposite signs and the subtraction is taken instead. Specifically, from (20), the bound (26) holds for x=g12′′′¯x^{\prime}=\underline{g_{12}^{\prime\prime\prime}} and y=g22′′′¯y^{\prime}=\underline{g_{22}^{\prime\prime\prime}} of the same sign, with ±=+\pm^{\prime}=+, and for x′′=g22′′′¯x^{\prime\prime}=\underline{g_{22}^{\prime\prime\prime}} and y′′=g12′′′¯y^{\prime\prime}=\underline{g_{12}^{\prime\prime\prime}} of the opposite signs, with ±′′=\pm^{\prime\prime}=-.

With |ϵfma|ε|\epsilon_{\mathop{\mathrm{fma}}}|\leq\varepsilon and δfma=1+ϵfma\delta_{\mathop{\mathrm{fma}}}=1+\epsilon_{\mathop{\mathrm{fma}}}, from the definition of δ±\delta_{\pm} it follows that

fma(±y,tanϑ¯,x)=(x±ytanϑ¯)δfma=(x±ytanϑ)δfmaδ±.\mathop{\mathrm{fma}}(\pm y,\underline{\tan\vartheta},x)=(x\pm y\underline{\tan\vartheta})\delta_{\mathop{\mathrm{fma}}}=(x\pm y\tan\vartheta)\delta_{\mathop{\mathrm{fma}}}\delta_{\pm}. (27)

Due to (16) and (27), with ϑ\vartheta instead of θ\theta and with δ/′′=(1+ϵ/′′)\delta_{/}^{\prime\prime}=(1+\epsilon_{/}^{\prime\prime}) where |ϵ/′′|ε|\epsilon_{/}^{\prime\prime}|\leq\varepsilon, it holds

fl(fma(±!y!,tanϑ¯,x!)secϑ¯)=x!±!y!tanϑsecϑδfmaδ/′′δϑδ±=r?2′′¯δ1!δ±=r?2′′¯δ2!,\mathop{\mathrm{fl}}\left(\frac{\mathop{\mathrm{fma}}(\pm^{!}y^{!},\underline{\tan\vartheta},x^{!})}{\underline{\sec\vartheta}}\right)=\frac{x^{!}\pm^{!}y^{!}\tan\vartheta}{\sec\vartheta}\cdot\frac{\delta_{\mathop{\mathrm{fma}}}\delta_{/}^{\prime\prime}}{\delta_{\vartheta}^{\prime}}\cdot\delta_{\pm}=\underline{r_{?2}^{\prime\prime}}\cdot\delta_{1}^{!}\cdot\delta_{\pm}=\underline{r_{?2}^{\prime\prime}}\delta_{2}^{!}, (28)

where !=!=^{\prime} for ?=1?=1 and !=′′!=^{\prime\prime} for ?=2?=2. Now (23) follows from bounding δ1!=δfmaδ/′′/δϑ\delta_{1}^{!}=\delta_{\mathop{\mathrm{fma}}}\delta_{/}^{\prime\prime}/\delta_{\vartheta}^{\prime} below and above using (16). Since δ2!=δ1!δ±\delta_{2}^{!}=\delta_{1}^{!}\delta_{\pm} in (28), (24) is a consequence of (23). ∎

Lemma 3.3 thus shows that high relative accuracy of all elements of R¯\underline{R} is achieved if no underflow has occurred at any stage and tanϑ¯\underline{\tan\vartheta} has been computed exactly. If tanϑ¯\underline{\tan\vartheta} is inexact, high relative accuracy is guaranteed for r11¯\underline{r_{11}} and exactly one of r12¯\underline{r_{12}} and r22¯\underline{r_{22}}. If it is also desired for the remaining element, which is transformed by an essential subtraction and is thus prone to cancellation, one possibility is to compute it by an expression equivalent to (20), but with tanϑ\tan\vartheta expanded to its definition g21′′′/g11′′′g_{21}^{\prime\prime\prime}/g_{11}^{\prime\prime\prime}, as in

r12′′=(g12′′′g11′′′+g22′′′g21′′′)/(g11′′′secϑ),r22′′=(g22′′′g11′′′g12′′′g21′′′)/(g11′′′secϑ),r_{12}^{\prime\prime}=(g_{12}^{\prime\prime\prime}g_{11}^{\prime\prime\prime}+g_{22}^{\prime\prime\prime}g_{21}^{\prime\prime\prime})/(g_{11}^{\prime\prime\prime}\sec\vartheta),\quad r_{22}^{\prime\prime}=(g_{22}^{\prime\prime\prime}g_{11}^{\prime\prime\prime}-g_{12}^{\prime\prime\prime}g_{21}^{\prime\prime\prime})/(g_{11}^{\prime\prime\prime}\sec\vartheta), (29)

after prescaling the numerator and the denominator in (29) by the largest power of two that avoids overflow of both. A floating-point primitive of the form ab±cda\cdot b\pm c\cdot d with a single rounding [23] can give the correctly rounded numerator, but it has to be emulated in software on most platforms at present [24]. For high relative accuracy of r12′′¯\underline{r_{12}^{\prime\prime}} or r22′′¯\underline{r_{22}^{\prime\prime}}, the absence of inexact underflows is required, except in the prescaling.

Alternatively, the numerators in (29) can be calculated by Kahan’s algorithm for determinants of order two [25], but an overflow-avoiding prescaling is still necessary. It is thus easier to resort to computing (29) in a wider and more precise datatype as

r12′′¯\displaystyle\underline{r_{12}^{\prime\prime}} =fl(fl(FL(FL(g12′′′g11′′′+g22′′′g21′′′)/g11′′′))/secϑ¯),\displaystyle=\mathop{\mathrm{fl}}(\mathop{\mathrm{fl}}(\mathop{\mathrm{FL}}(\mathop{\mathrm{FL}}(g_{12}^{\prime\prime\prime}g_{11}^{\prime\prime\prime}+g_{22}^{\prime\prime\prime}g_{21}^{\prime\prime\prime})/g_{11}^{\prime\prime\prime}))/\underline{\sec\vartheta}), (30)
r22′′¯\displaystyle\underline{r_{22}^{\prime\prime}} =fl(fl(FL(FL(g22′′′g11′′′g12′′′g21′′′)/g11′′′))/secϑ¯).\displaystyle=\mathop{\mathrm{fl}}(\mathop{\mathrm{fl}}(\mathop{\mathrm{FL}}(\mathop{\mathrm{FL}}(g_{22}^{\prime\prime\prime}g_{11}^{\prime\prime\prime}-g_{12}^{\prime\prime\prime}g_{21}^{\prime\prime\prime})/g_{11}^{\prime\prime\prime}))/\underline{\sec\vartheta}).

First, in (30) it is assumed that no underflow has occurred so far, so G′′′¯=G′′′\underline{G^{\prime\prime\prime}}=G^{\prime\prime\prime}. Second, a product of two floating-point values requires 2p2p bits of the significand to be represented exactly if the factors’ significands are encoded with pp bits each. Thus, for single precision, 4848 bits are needed, which is less than the 5353 bits available in double precision. Similarly, for double precision, 106106 bits are needed, which is less than the 113113 bits of a quadruple precision significand. Therefore, every product in (30) is exact if computed using a more precise standard datatype T. The characteristic values of T are the underflow and overflow thresholds μT\mu_{\text{{T}}} and νT\nu_{\text{{T}}}, and εT=2pT\varepsilon_{\text{{T}}}=2^{-p_{\text{{T}}}}. Third, let FL(x)\mathop{\mathrm{FL}}(x) round an infinitely precise result of xx to the nearest value in T. Since all addends in (30) are exact and well above the underflow threshold μT\mu_{\text{{T}}} in magnitude, the rounded result of the addition or the subtraction is relatively accurate, with the error factor (1+ϵ±)(1+\epsilon_{\pm}^{\prime}), |ϵ±|εT|\epsilon_{\pm}^{\prime}|\leq\varepsilon_{\text{{T}}}. This holds even if the result is zero, but since the transformed matrix would then be processed according to its new structure, assume that the result is normal in T. The ensuing division can neither overflow nor underflow in T. Now the quotient rounded by FL\mathop{\mathrm{FL}} is rounded again, by fl\mathop{\mathrm{fl}}, back to the working datatype. This operation can underflow, as can the following division by secϑ¯\underline{\sec\vartheta}. If neither does, the resulting transformed element is relatively accurate. This outlines the proof of Theorem 3.4.

Theorem 3.4.

Assume that no underflow occurs at any stage of the computation leading from GG to R¯\underline{R}. Then, r11¯=δ0r11\underline{r_{11}}=\delta_{0}r_{11}, where for δ0\delta_{0} holds (22). If tanϑ¯\underline{\tan\vartheta} is exact, then r12¯=δ1r12\underline{r_{12}}=\delta_{1}^{\prime}r_{12} and r22¯=δ1′′r22\underline{r_{22}}=\delta_{1}^{\prime\prime}r_{22}, where δ1\delta_{1}^{\prime} and δ1′′\delta_{1}^{\prime\prime} are as in (23). Else, if g12′′′¯\underline{g_{12}^{\prime\prime\prime}} and g22′′′¯\underline{g_{22}^{\prime\prime\prime}} are of the same sign, then r12¯=δ2r12\underline{r_{12}}=\delta_{2}^{\prime}r_{12} and r22¯=δ3r22\underline{r_{22}}=\delta_{3}^{\prime}r_{22}. If they are of the opposite signs, then r22¯=δ2′′r22\underline{r_{22}}=\delta_{2}^{\prime\prime}r_{22} and r12¯=δ3′′r12\underline{r_{12}}=\delta_{3}^{\prime\prime}r_{12}, where δ2\delta_{2}^{\prime} and δ2′′\delta_{2}^{\prime\prime} are as in (24), while δ3\delta_{3}^{\prime} and δ3′′\delta_{3}^{\prime\prime} come from evaluating their corresponding matrix elements as in (30) and are bounded as

(1εT)2(1ε)2((1+ε)2+1)/2(1+ε)=δ3δ3,δ3′′δ3+=(1+εT)2(1+ε)2((1ε)2+1)/2(1ε).\frac{(1-\varepsilon_{\text{{T}}})^{2}(1-\varepsilon)^{2}}{\sqrt{((1+\varepsilon)^{2}+1)/2}(1+\varepsilon)}=\delta_{3}^{-}\leq\delta_{3}^{\prime},\delta_{3}^{\prime\prime}\leq\delta_{3}^{+}=\frac{(1+\varepsilon_{\text{{T}}})^{2}(1+\varepsilon)^{2}}{\sqrt{((1-\varepsilon)^{2}+1)/2}(1-\varepsilon)}. (31)
Proof.

It remains to prove (31), since the other relations follow from Lemma 3.3.

Each element of G′′′G^{\prime\prime\prime} is at most ν/4\nu/4 and at least μ\mu in magnitude. Therefore, a difference (in essence) of their exact products cannot exceed ν2/16<νT\nu^{2}/16<\nu_{\text{{T}}} in magnitude in a standard T. At least one element is above ν/8\nu/8 in magnitude due to the prescaling, so this difference, if not exactly zero, is above εμν/8μT\varepsilon\mu\nu/8\gg\mu_{\text{{T}}} in magnitude. Thus, the quotient of this difference and g11′′′g_{11}^{\prime\prime\prime} is above εμ(1εT)/2>μT\varepsilon\mu(1-\varepsilon_{\text{{T}}})/2>\mu_{\text{{T}}} and, due to the prescaling and pivoting, not above ν/2νT\nu/2\ll\nu_{\text{{T}}} in magnitude. For (30) it therefore holds

+¯=FL(+)=+(1+ϵ+),+=g12′′′g11′′′+g22′′′g21′′′,|ϵ+|εT,¯=FL()=(1+ϵ),=g22′′′g11′′′g12′′′g21′′′,|ϵ|εT,FL(±¯/g11′′′)=(±/g11′′′)(1+ϵ±)(1+ϵ/′′′),|ϵ/′′′|εT.\begin{gathered}\begin{aligned} \underline{\star_{+}}&=\mathop{\mathrm{FL}}(\star_{+})=\star_{+}(1+\epsilon_{+}^{\prime}),\quad\star_{+}=g_{12}^{\prime\prime\prime}g_{11}^{\prime\prime\prime}+g_{22}^{\prime\prime\prime}g_{21}^{\prime\prime\prime},\quad|\epsilon_{+}^{\prime}|\leq\varepsilon_{\text{{T}}},\\ \underline{\star_{-}}&=\mathop{\mathrm{FL}}(\star_{-})=\star_{-}(1+\epsilon_{-}^{\prime}),\quad\star_{-}=g_{22}^{\prime\prime\prime}g_{11}^{\prime\prime\prime}-g_{12}^{\prime\prime\prime}g_{21}^{\prime\prime\prime},\quad|\epsilon_{-}^{\prime}|\leq\varepsilon_{\text{{T}}},\end{aligned}\\ \mathop{\mathrm{FL}}(\underline{\star_{\pm}}/g_{11}^{\prime\prime\prime})=(\star_{\pm}/g_{11}^{\prime\prime\prime})(1+\epsilon_{\pm}^{\prime})(1+\epsilon_{/}^{\prime\prime\prime}),\quad|\epsilon_{/}^{\prime\prime\prime}|\leq\varepsilon_{\text{{T}}}.\end{gathered} (32)

The quotient FL(±¯/g11′′′)\mathop{\mathrm{FL}}(\underline{\star_{\pm}}/g_{11}^{\prime\prime\prime}), converted to the working precision with fl(FL(±¯/g11′′′))\mathop{\mathrm{fl}}(\mathop{\mathrm{FL}}(\underline{\star_{\pm}}/g_{11}^{\prime\prime\prime})), is possibly not correctly rounded from the value of ±¯/g11′′′\underline{\star_{\pm}}/g_{11}^{\prime\prime\prime} due to the double rounding. Under the assumption that no underflow in the working precision occurs, it follows that

fl(FL(±¯/g11′′′))=(±/g11′′′)(1+ϵ±)(1+ϵ/′′′)(1+ϵfl)=(±/g11′′′)δ±,|ϵfl|ε,\mathop{\mathrm{fl}}(\mathop{\mathrm{FL}}(\underline{\star_{\pm}}/g_{11}^{\prime\prime\prime}))=(\star_{\pm}/g_{11}^{\prime\prime\prime})(1+\epsilon_{\pm}^{\prime})(1+\epsilon_{/}^{\prime\prime\prime})(1+\epsilon_{\mathop{\mathrm{fl}}})=(\star_{\pm}/g_{11}^{\prime\prime\prime})\delta_{\pm}^{\prime},\quad|\epsilon_{\mathop{\mathrm{fl}}}|\leq\varepsilon, (33)

with δ±=(1+ϵ±)(1+ϵ/′′′)(1+ϵfl)\delta_{\pm}^{\prime}=(1+\epsilon_{\pm}^{\prime})(1+\epsilon_{/}^{\prime\prime\prime})(1+\epsilon_{\mathop{\mathrm{fl}}}). For the final division by secϑ¯\underline{\sec\vartheta}, due to (16), holds

fl(fl(FL(±¯/g11′′′))secϑ¯)=±g11′′′secϑδ±δϑδ/′′′′,δ/′′′′=(1+ϵ/′′′′),|ϵ/′′′′|ε.\mathop{\mathrm{fl}}\left(\frac{\mathop{\mathrm{fl}}(\mathop{\mathrm{FL}}(\underline{\star_{\pm}}/g_{11}^{\prime\prime\prime}))}{\underline{\sec\vartheta}}\right)=\frac{\star_{\pm}}{g_{11}^{\prime\prime\prime}\sec\vartheta}\frac{\delta_{\pm}^{\prime}}{\delta_{\vartheta}^{\prime}}\delta_{/}^{\prime\prime\prime\prime},\quad\delta_{/}^{\prime\prime\prime\prime}=(1+\epsilon_{/}^{\prime\prime\prime\prime}),\quad|\epsilon_{/}^{\prime\prime\prime\prime}|\leq\varepsilon. (34)

By fixing the sign in the ±\pm subscript in (32), (33), and (34), let δ3=δδ/′′′′/δϑ\delta_{3}^{\prime}=\delta_{-}^{\prime}\delta_{/}^{\prime\prime\prime\prime}/\delta_{\vartheta}^{\prime} and δ3′′=δ+δ/′′′′/δϑ\delta_{3}^{\prime\prime}=\delta_{+}^{\prime}\delta_{/}^{\prime\prime\prime\prime}/\delta_{\vartheta}^{\prime}. Now bound δ3\delta_{3}^{\prime} and δ3′′\delta_{3}^{\prime\prime} below by δ3\delta_{3}^{-} and above by δ3+\delta_{3}^{+}, by combining the appropriate lower and upper bounds for δ±\delta_{\pm}^{\prime}, δϑ\delta_{\vartheta}^{\prime} from Lemma 3.1, and δ/′′′′\delta_{/}^{\prime\prime\prime\prime}, where

(1εT)2(1ε)(1ε)((1+ε)2+1)/2(1+ε)=δ3δ3,δ3′′δ3+=(1+εT)2(1+ε)(1+ε)((1ε)2+1)/2(1ε),\frac{(1-\varepsilon_{\text{{T}}})^{2}(1-\varepsilon)\cdot(1-\varepsilon)}{\sqrt{((1+\varepsilon)^{2}+1)/2}(1+\varepsilon)}=\delta_{3}^{-}\leq\delta_{3}^{\prime},\delta_{3}^{\prime\prime}\leq\delta_{3}^{+}=\frac{(1+\varepsilon_{\text{{T}}})^{2}(1+\varepsilon)\cdot(1+\varepsilon)}{\sqrt{((1-\varepsilon)^{2}+1)/2}(1-\varepsilon)},

which proves (31): the lower bound δ3\delta_{3}^{-} follows by minimizing the numerator and maximizing the denominator in the expressions for δ3\delta_{3}^{\prime} and δ3′′\delta_{3}^{\prime\prime}, and the upper bound δ3+\delta_{3}^{+} vice versa, as in the proof of Lemma 3.3. ∎
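Taking binary32 as the working datatype and binary64 as T, the mixed-precision evaluation (30) can be sketched with NumPy as follows (names ours); every product is exact in T because 224=48532\cdot 24=48\leq 53 significand bits.

```python
import numpy as np

def r12_r22_wider(g11, g12, g21, g22, sec):
    # Sketch of (30): binary32 as the working datatype, binary64 as T.
    # The products below are exact in T, so FL() rounds only the
    # sum/difference and the quotient; fl() is the final rounding back
    # to binary32 (a double rounding overall, as discussed above).
    T = np.float64
    num_p = T(g12) * T(g11) + T(g22) * T(g21)  # FL(g12*g11 + g22*g21)
    num_m = T(g22) * T(g11) - T(g12) * T(g21)  # FL(g22*g11 - g12*g21)
    r12 = np.float32(np.float32(num_p / T(g11)) / sec)
    r22 = np.float32(np.float32(num_m / T(g11)) / sec)
    return r12, r22

g = [np.float32(x) for x in (4.0, 1.0, 3.0, 2.0)]  # a pivoted G'''
r12, r22 = r12_r22_wider(*g, np.float32(1.25))     # sec(vartheta) = 1.25
assert (float(r12), float(r22)) == (2.0, 1.0)
```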

Therefore, it is possible to compute R¯\underline{R} with high relative accuracy in the absence of underflows, in the working precision only, but it is easier to employ two multiplications, one addition or subtraction, and one division in a wider, more precise datatype. Table 1 shows by how many ε\varepsilons the lower and the upper bounds for δ1\delta_{1}, δ2\delta_{2}, and δ3\delta_{3} from (23), (24), and (31), respectively, differ from unity. The quantities in the table’s header were computed symbolically as algebraic expressions by substituting ε\varepsilon and εT\varepsilon_{\text{{T}}} with their defining powers of two, then approximated numerically with pTp_{\text{{T}}} decimal digits, and rounded upwards to nine digits after the decimal point, by a Wolfram Language script (the relerr.wls file in the code supplement, executed by the Wolfram Engine 14.0.0 for macOS (Intel)).

Table 1: Lower and upper bounds for δ1\delta_{1}, δ2\delta_{2}, and δ3\delta_{3} in single and double precision.
\topruleprecision (1δ1)/ε(1-\delta_{1}^{-})/\varepsilon (δ1+1)/ε(\delta_{1}^{+}-1)/\varepsilon (1δ2)/ε(1-\delta_{2}^{-})/\varepsilon (δ2+1)/ε(\delta_{2}^{+}-1)/\varepsilon (1δ3)/ε(1-\delta_{3}^{-})/\varepsilon (δ3+1)/ε(\delta_{3}^{+}-1)/\varepsilon
\midrulesingle 3.499999665 3.500000336 4.499999457 4.500000544 3.499999669 3.500000340
double 3.500000000 3.500000001 4.500000000 4.500000001 3.500000000 3.500000001
\botrule
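The double-precision row of Table 1 for δ1\delta_{1} can be spot-checked without the Wolfram Language; a sketch with Python’s decimal module, assuming only the definitions in (23):

```python
from decimal import Decimal, getcontext

getcontext().prec = 80                    # plenty of guard digits
one, eps = Decimal(1), Decimal(2) ** -53  # eps for double precision
# delta_1^- and delta_1^+ as defined in (23)
d1m = (one - eps) ** 2 / ((((one + eps) ** 2 + one) / 2).sqrt() * (one + eps))
d1p = (one + eps) ** 2 / ((((one - eps) ** 2 + one) / 2).sqrt() * (one - eps))
lo = (one - d1m) / eps                    # ~3.5, cf. the double row of Table 1
hi = (d1p - one) / eps                    # ~3.5 as well, up to O(eps)
assert abs(lo - Decimal("3.5")) < Decimal("1e-8")
assert abs(hi - Decimal("3.5")) < Decimal("1e-8")
```

The first-order expansion δ113.5ε\delta_{1}^{-}\approx 1-3.5\varepsilon explains why both quotients agree with 3.53.5 to nine digits in Table 1.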

3.3 The SVD of RR and GG

For R¯\underline{R}, 𝔱′′=13\mathfrak{t}^{\prime\prime}=13 now holds. If r11¯<r22¯\underline{r_{11}}<\underline{r_{22}}, the diagonal elements of R¯\underline{R} have to be swapped, similarly to xLASV2. This is done by symmetrically permuting R¯T\underline{R}^{T} with the exchange permutation P~=[0110]\widetilde{P}=\left[\begin{smallmatrix}0&1\\ 1&0\end{smallmatrix}\right]. Multiplying R~=P~TRTP~=U~Σ~V~T\widetilde{R}=\widetilde{P}^{T}R^{T}\widetilde{P}=\widetilde{U}\widetilde{\Sigma}\widetilde{V}^{T} by P~\widetilde{P} from the left and P~T\widetilde{P}^{T} from the right gives

RT=P~U~Σ~V~TP~T,R=P~V~Σ~U~TP~T=UˇΣˇVˇT,R^{T}=\widetilde{P}\widetilde{U}\widetilde{\Sigma}\widetilde{V}^{T}\widetilde{P}^{T},\qquad R=\widetilde{P}\widetilde{V}\widetilde{\Sigma}\widetilde{U}^{T}\widetilde{P}^{T}=\check{U}\check{\Sigma}\check{V}^{T},

where Uˇ=P~V~\check{U}=\widetilde{P}\widetilde{V}, Σˇ=Σ~\check{\Sigma}=\widetilde{\Sigma}, and Vˇ=P~U~\check{V}=\widetilde{P}\widetilde{U}. Therefore, having applied the permutation P~\widetilde{P},

U~=U~φP𝝈~,V~=V~ψP𝝈~,Uˇ=Uˇ𝝋P𝝈~,Vˇ=Vˇ𝝍P𝝈~,Uˇ𝝋=P~V~ψ=[sinψcosψcosψsinψ]=[sinψcosψcosψsinψ]S2=[cos𝝋sin𝝋sin𝝋cos𝝋]S2,Vˇ𝝍=P~U~φ=[sinφcosφcosφsinφ]=[sinφcosφcosφsinφ]S2=[cos𝝍sin𝝍sin𝝍cos𝝍]S2,S2=[1001],cos𝝋=sinψ,sin𝝋=cosψ,tan𝝋=1/tanψ,cos𝝍=sinφ,sin𝝍=cosφ,tan𝝍=1/tanφ,\begin{gathered}\widetilde{U}=\widetilde{U}_{\varphi}P_{\tilde{\bm{\sigma}}},\quad\widetilde{V}=\widetilde{V}_{\psi}P_{\tilde{\bm{\sigma}}},\qquad\check{U}=\check{U}_{\bm{\varphi}}P_{\tilde{\bm{\sigma}}},\quad\check{V}=\check{V}_{\bm{\psi}}P_{\tilde{\bm{\sigma}}},\\ \begin{aligned} \check{U}_{\bm{\varphi}}&=\widetilde{P}\widetilde{V}_{\psi}=\begin{bmatrix}\sin\psi&\hphantom{-}\cos\psi\\ \cos\psi&-\sin\psi\end{bmatrix}=\begin{bmatrix}\sin\psi&-\cos\psi\\ \cos\psi&\hphantom{-}\sin\psi\end{bmatrix}S_{2}=\begin{bmatrix}\cos{\bm{\varphi}}&-\sin{\bm{\varphi}}\\ \sin{\bm{\varphi}}&\hphantom{-}\cos{\bm{\varphi}}\end{bmatrix}S_{2},\\ \check{V}_{\bm{\psi}}&=\widetilde{P}\widetilde{U}_{\varphi}=\begin{bmatrix}\sin\varphi&\hphantom{-}\cos\varphi\\ \cos\varphi&-\sin\varphi\end{bmatrix}=\begin{bmatrix}\sin\varphi&-\cos\varphi\\ \cos\varphi&\hphantom{-}\sin\varphi\end{bmatrix}S_{2}=\begin{bmatrix}\cos{\bm{\psi}}&-\sin{\bm{\psi}}\\ \sin{\bm{\psi}}&\hphantom{-}\cos{\bm{\psi}}\end{bmatrix}S_{2},\end{aligned}\\ S_{2}=\begin{bmatrix}1&\hphantom{-}0\\ 0&-1\end{bmatrix},\qquad\begin{gathered}\cos{\bm{\varphi}}=\sin\psi,\quad\sin{\bm{\varphi}}=\cos\psi,\quad\tan{\bm{\varphi}}=1/\tan\psi,\\ \cos{\bm{\psi}}=\sin\varphi,\quad\sin{\bm{\psi}}=\cos\varphi,\quad\tan{\bm{\psi}}=1/\tan\varphi,\end{gathered}\end{gathered} (35)

where U~φ\widetilde{U}_{\varphi}, V~ψ\widetilde{V}_{\psi}, and P𝝈~P_{\tilde{\bm{\sigma}}} come from the SVD of R~\widetilde{R} in Section 3.3.1. If r11¯r22¯\underline{r_{11}}\geq\underline{r_{22}} then R¯~=R¯\underline{\widetilde{R}}=\underline{R}, Uˇ=U~\check{U}=\widetilde{U}, Vˇ=V~\check{V}=\widetilde{V}, (35) is not used, and let S2=IS_{2}=I, 𝝋=φ\bm{\varphi}=\varphi, and 𝝍=ψ\bm{\psi}=\psi.

Assume that no inexact underflow occurs in the computation of R¯\underline{R}. If the initial matrix GG was triangular, let δ11=δ12=δ22=1\delta_{11}=\delta_{12}=\delta_{22}=1. Else, let δ11\delta_{11}, δ12\delta_{12}, and δ22\delta_{22} stand for the error factors of the elements of R¯~\underline{\widetilde{R}} that correspond to those from Theorem 3.4, i.e.,

R~=[r~11r~120r~22],R¯~=[r~11¯r~12¯0r~22¯]=[r~11δ11r~12δ120r~22δ22],r~11¯r~22¯.\widetilde{R}=\begin{bmatrix}\tilde{r}_{11}&\tilde{r}_{12}\\ 0&\tilde{r}_{22}\end{bmatrix},\qquad\underline{\widetilde{R}}=\begin{bmatrix}\underline{\tilde{r}_{11}}&\underline{\tilde{r}_{12}}\\ 0&\underline{\tilde{r}_{22}}\end{bmatrix}=\begin{bmatrix}\tilde{r}_{11}\delta_{11}&\tilde{r}_{12}\delta_{12}\\ 0&\tilde{r}_{22}\delta_{22}\end{bmatrix},\quad\underline{\tilde{r}_{11}}\geq\underline{\tilde{r}_{22}}. (36)

It remains to compute the SVD of R¯~\underline{\widetilde{R}} by an alternative to xLASV2, as described in Section 3.3.1, and to assemble the SVD of GG, as explained in Section 3.3.2.

3.3.1 The SVD of R~\widetilde{R}

The key observation in this part is that the traditional [5, Eq. (4.12)] formula for tan(2φ)/2\tan(2\varphi)/2, involving the squares of the elements of R~\widetilde{R}, can be simplified to an expression that does not require any explicit squaring if the hypotenuse calculation is considered a basic arithmetic operation. The following two expressions for tan(2φ)\tan(2\varphi) are equivalent,

tan(2φ)=2r~12r~22r~112+r~122r~222=2r~12r~22(hr~22)(h+r~22),h=r~112+r~122.\tan(2\varphi)=\frac{2\tilde{r}_{12}\tilde{r}_{22}}{\tilde{r}_{11}^{2}+\tilde{r}_{12}^{2}-\tilde{r}_{22}^{2}}=\frac{2\tilde{r}_{12}\tilde{r}_{22}}{(h-\tilde{r}_{22})(h+\tilde{r}_{22})},\quad h=\sqrt{\tilde{r}_{11}^{2}+\tilde{r}_{12}^{2}}. (37)

With h¯=hypot(r~11¯,r~12¯)\underline{h}=\mathop{\mathrm{hypot}}(\underline{\tilde{r}_{11}},\underline{\tilde{r}_{12}}), let 𝐬=h¯r~22¯\mathbf{s}=\underline{h}\oplus\underline{\tilde{r}_{22}} and 𝐝=h¯r~22¯\mathbf{d}=\underline{h}\ominus\underline{\tilde{r}_{22}} be the sum and the difference of h¯\underline{h} and r~22¯\underline{\tilde{r}_{22}} as in (4) and (5), respectively, for h¯>r~22¯\underline{h}>\underline{\tilde{r}_{22}} (with the prescaling as employed, \ominus can be replaced by the subtraction d=hr~22d=h-\tilde{r}_{22} and 𝐝=(ed,fd)\mathbf{d}=(e_{d},f_{d})). From (36) and since 0<r~ijν/(22)0<\tilde{r}_{ij}\leq\nu/(2\sqrt{2}) for 1ij21\leq i\leq j\leq 2, it holds 0<r~22¯r~11¯h¯ν0<\underline{\tilde{r}_{22}}\leq\underline{\tilde{r}_{11}}\leq\underline{h}\leq\nu. Thus (37) can be re-written using (6) and (7), with 𝐫~12=(er~12¯,fr~12¯)\tilde{\mathbf{r}}_{12}=(e_{\underline{\tilde{r}_{12}}},f_{\underline{\tilde{r}_{12}}}) and 𝐫~22=(er~22¯,fr~22¯)\tilde{\mathbf{r}}_{22}=(e_{\underline{\tilde{r}_{22}}},f_{\underline{\tilde{r}_{22}}}), as

h¯=r~22¯tan(2φ)¯=,h¯>r~22¯tan(2φ)¯=fl((2𝐫~12𝐫~22)(𝐝𝐬)),\underline{h}=\underline{\tilde{r}_{22}}\implies\underline{\tan(2\varphi)}=\infty,\quad\underline{h}>\underline{\tilde{r}_{22}}\implies\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\odot\tilde{\mathbf{r}}_{12}\odot\tilde{\mathbf{r}}_{22})\oslash(\mathbf{d}\odot\mathbf{s})), (38)

where the computation’s precision is unchanged, but the exponent range is widened.

In (19), the denominator of the expression for tan(2ϕ)\tan(2\phi), 𝖽=g112+g122g212g222\mathsf{d}=g_{11}^{2}+g_{12}^{2}-g_{21}^{2}-g_{22}^{2}, can also be computed using hypot\mathop{\mathrm{hypot}}, without explicitly squaring any matrix element, as

𝖽=(g112+g122g212+g222)(g112+g122+g212+g222).\mathsf{d}=\left(\sqrt{g_{11}^{2}+g_{12}^{2}}-\sqrt{g_{21}^{2}+g_{22}^{2}}\right)\left(\sqrt{g_{11}^{2}+g_{12}^{2}}+\sqrt{g_{21}^{2}+g_{22}^{2}}\right).

Only if 𝔱13\mathfrak{t}^{\prime}\cong 13 can it happen that fl(r~11¯/r~12¯)<ε\mathop{\mathrm{fl}}(\underline{\tilde{r}_{11}}/\underline{\tilde{r}_{12}})<\varepsilon. In the first denominator in (37), r~112\tilde{r}_{11}^{2} and r~222\tilde{r}_{22}^{2} then have a negligible effect on r~122\tilde{r}_{12}^{2}, so the expression for tan(2φ)¯\underline{\tan(2\varphi)} can be simplified, as in xLASV2, to the same formula that the case r~11¯=r~22¯\underline{\tilde{r}_{11}}=\underline{\tilde{r}_{22}} would imply,

tan(2φ)¯=fl((2r~22¯)/r~12¯).\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\underline{\tilde{r}_{22}})/\underline{\tilde{r}_{12}}). (39)

Let 𝐫~11=(er~11¯,fr~11¯)\tilde{\mathbf{r}}_{11}=(e_{\underline{\tilde{r}_{11}}},f_{\underline{\tilde{r}_{11}}}). If r~12¯=r~22¯\underline{\tilde{r}_{12}}=\underline{\tilde{r}_{22}}, (37) can be simplified by explicit squaring to

tan(2φ)¯=fl((2𝐫~12𝐫~22)(𝐫~11𝐫~11)).\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\odot\tilde{\mathbf{r}}_{12}\odot\tilde{\mathbf{r}}_{22})\oslash(\tilde{\mathbf{r}}_{11}\odot\tilde{\mathbf{r}}_{11})). (40)

Both (39) and (40) admit a simple roundoff analysis. However, (38) does not, due to a subtraction of potentially inexact values of similar magnitude when computing 𝐝\mathbf{d}. Section 4 shows, with high probability through exhaustive testing, that (38) does not cause excessive relative errors in the singular values for 𝔱13\mathfrak{t}^{\prime}\cong 13, nor for 𝔱=15\mathfrak{t}^{\prime}=15 if the range of the exponents of the elements of input matrices is limited in width to about (eνeμ)/2(e_{\nu}-e_{\mu})/2. If hypot\mathop{\mathrm{hypot}} is not correctly rounded, the procedure from [14, 15] for computing tan(2φ)¯\underline{\tan(2\varphi)} without squaring the input values can be adopted instead of (38), as shown in Algorithm 1, but still without theoretical relative error bounds.

Algorithm 1 Computation of the functions of φ\varphi from R¯~\underline{\widetilde{R}}.
1:  if r~11¯=r~22¯\underline{\tilde{r}_{11}}=\underline{\tilde{r}_{22}} then
2:     tan(2φ)¯=fl((2r~22¯)/r~12¯)\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\underline{\tilde{r}_{22}})/\underline{\tilde{r}_{12}}) /​/ (39)
3:  else if r~12¯=r~22¯\underline{\tilde{r}_{12}}=\underline{\tilde{r}_{22}} then
4:     tan(2φ)¯=fl((2𝐫~12𝐫~22)(𝐫~11𝐫~11))\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\odot\tilde{\mathbf{r}}_{12}\odot\tilde{\mathbf{r}}_{22})\oslash(\tilde{\mathbf{r}}_{11}\odot\tilde{\mathbf{r}}_{11})) /​/ (40)
5:  else if fl(r~11¯/r~12¯)<ε\mathop{\mathrm{fl}}(\underline{\tilde{r}_{11}}/\underline{\tilde{r}_{12}})<\varepsilon then /​/ only if 𝔱13\mathfrak{t}^{\prime}\cong 13
6:     tan(2φ)¯=fl((2r~22¯)/r~12¯)\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\underline{\tilde{r}_{22}})/\underline{\tilde{r}_{12}}) /​/ (39)
7:  else if hypot\mathop{\mathrm{hypot}} is not correctly rounded then /​/ [14, 15]
8:     if r~11¯>r~12¯\underline{\tilde{r}_{11}}>\underline{\tilde{r}_{12}} then
9:        x¯=fl(r~12¯/r~11¯);y¯=fl(r~22¯/r~11¯)\underline{x}=\mathop{\mathrm{fl}}(\underline{\tilde{r}_{12}}/\underline{\tilde{r}_{11}});\quad\underline{y}=\mathop{\mathrm{fl}}(\underline{\tilde{r}_{22}}/\underline{\tilde{r}_{11}})
10:        tan(2φ)¯=fl(fl((2x¯)y¯)/max(fma(fl(x¯y¯),fl(x¯+y¯),1),0))\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}(\mathop{\mathrm{fl}}((2\underline{x})\underline{y})/\max(\mathop{\mathrm{fma}}(\mathop{\mathrm{fl}}(\underline{x}-\underline{y}),\mathop{\mathrm{fl}}(\underline{x}+\underline{y}),1),0))
11:     else /​/ r~11¯r~12¯\underline{\tilde{r}_{11}}\leq\underline{\tilde{r}_{12}}
12:        x¯=fl(r~11¯/r~12¯);y¯=fl(r~22¯/r~12¯)\underline{x}=\mathop{\mathrm{fl}}(\underline{\tilde{r}_{11}}/\underline{\tilde{r}_{12}});\quad\underline{y}=\mathop{\mathrm{fl}}(\underline{\tilde{r}_{22}}/\underline{\tilde{r}_{12}})
13:        tan(2φ)¯=fl((2y¯)/max(fma(fl(x¯y¯),fl(x¯+y¯),1),0))\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\underline{y})/\max(\mathop{\mathrm{fma}}(\mathop{\mathrm{fl}}(\underline{x}-\underline{y}),\mathop{\mathrm{fl}}(\underline{x}+\underline{y}),1),0))
14:     end if
15:  else /​/ the general case
16:     h¯=hypot(r~11¯,r~12¯)\underline{h}=\mathop{\mathrm{hypot}}(\underline{\tilde{r}_{11}},\underline{\tilde{r}_{12}})
17:     if h¯=r~22¯\underline{h}=\underline{\tilde{r}_{22}} then
18:        tan(2φ)¯=\underline{\tan(2\varphi)}=\infty /​/ (38)
19:     else /​/ h¯>r~22¯\underline{h}>\underline{\tilde{r}_{22}}
20:        𝐬=h¯r~22¯;𝐝=h¯r~22¯;tan(2φ)¯=fl((2𝐫~12𝐫~22)(𝐝𝐬))\mathbf{s}=\underline{h}\oplus\underline{\tilde{r}_{22}};\quad\mathbf{d}=\underline{h}\ominus\underline{\tilde{r}_{22}};\quad\underline{\tan(2\varphi)}=\mathop{\mathrm{fl}}((2\odot\tilde{\mathbf{r}}_{12}\odot\tilde{\mathbf{r}}_{22})\oslash(\mathbf{d}\odot\mathbf{s})) /​/ (38)
21:     end if
22:  end if
23:  if tan(2φ)¯=\underline{\tan(2\varphi)}=\infty then
24:     tanφ¯=1\underline{\tan\varphi}=1
25:  else /​/ the general case
26:     tanφ¯=fl(tan(2φ)¯/fl(1+hypot(tan(2φ)¯,1)))\underline{\tan\varphi}=\mathop{\mathrm{fl}}(\underline{\tan(2\varphi)}/\mathop{\mathrm{fl}}(1+\mathop{\mathrm{hypot}}(\underline{\tan(2\varphi)},1))) /​/ (3)
27:  end if
28:  secφ¯=hypot(tanφ¯,1);cosφ¯=fl(1/secφ¯);sinφ¯=fl(tanφ¯/secφ¯)\underline{\sec\varphi}=\mathop{\mathrm{hypot}}(\underline{\tan\varphi},1);\quad\underline{\cos\varphi}=\mathop{\mathrm{fl}}(1/\underline{\sec\varphi});\quad\underline{\sin\varphi}=\mathop{\mathrm{fl}}(\underline{\tan\varphi}/\underline{\sec\varphi})

All cases of Algorithm 1 lead to 0tanφ¯10\leq\underline{\tan\varphi}\leq 1. From tanφ¯\underline{\tan\varphi} follow secφ¯\underline{\sec\varphi}, as well as cosφ¯\underline{\cos\varphi} and sinφ¯\underline{\sin\varphi} when explicitly required, which completely determines U~φ¯\underline{\widetilde{U}_{\varphi}}.

To determine V~ψ¯\underline{\widetilde{V}_{\psi}}, tanψ\tan\psi is obtained from tanφ\tan\varphi (see [5, Eq. (4.10)]) as

tanψ=(r~12+r~22tanφ)/r~11,tanψ¯=fl(t¯/r~11¯),t¯=fma(r~22¯,tanφ¯,r~12¯).\tan\psi=(\tilde{r}_{12}+\tilde{r}_{22}\tan\varphi)/\tilde{r}_{11},\qquad\underline{\tan\psi}=\mathop{\mathrm{fl}}(\underline{t}/\underline{\tilde{r}_{11}}),\quad\underline{t}=\mathop{\mathrm{fma}}(\underline{\tilde{r}_{22}},\underline{\tan\varphi},\underline{\tilde{r}_{12}}). (41)

Let 𝐬𝐞𝐜φ=(esecφ¯,fsecφ¯)\mathop{\mathbf{sec}}\varphi=(e_{\underline{\sec\varphi}},f_{\underline{\sec\varphi}}). If tanψ¯\underline{\tan\psi} is finite (e.g., when 𝔱=15\mathfrak{t}^{\prime}=15, due to the pivoting [14, Theorem 1]), so is secψ¯\underline{\sec\psi}. Then, let 𝐬𝐞𝐜ψ=(esecψ¯,fsecψ¯)\mathop{\mathbf{sec}}\psi=(e_{\underline{\sec\psi}},f_{\underline{\sec\psi}}). By fixing the evaluation order for reproducibility, the singular values 𝝈~1′′\tilde{\bm{\sigma}}_{1}^{\prime\prime} and 𝝈~2′′\tilde{\bm{\sigma}}_{2}^{\prime\prime} of R¯~\underline{\widetilde{R}} are computed [8, 15] as

𝐬ψφ=𝐬𝐞𝐜φ𝐬𝐞𝐜ψ,𝝈~2′′=𝐫~22𝐬ψφ,𝝈~1′′=𝐫~11𝐬ψφ.\mathbf{s}_{\psi}^{\varphi}=\mathop{\mathbf{sec}}\varphi\oslash\mathop{\mathbf{sec}}\psi,\quad\tilde{\bm{\sigma}}_{2}^{\prime\prime}=\tilde{\mathbf{r}}_{22}\odot\mathbf{s}_{\psi}^{\varphi},\quad\tilde{\bm{\sigma}}_{1}^{\prime\prime}=\tilde{\mathbf{r}}_{11}\oslash\mathbf{s}_{\psi}^{\varphi}. (42)

If tanψ¯\underline{\tan\psi} overflows due to a small r~11¯\underline{\tilde{r}_{11}} (the prescaling ensures that t¯\underline{t} is always finite), let 𝐭=(et¯,ft¯)\mathbf{t}=(e_{\underline{t}},f_{\underline{t}}). In this case, similar to the one handled by xLASV2 when fl(r~11¯/r~12¯)<ε\mathop{\mathrm{fl}}(\underline{\tilde{r}_{11}}/\underline{\tilde{r}_{12}})<\varepsilon, it holds that secψtanψ\sec\psi\gtrapprox\tan\psi, so cosψ1/tanψ\cos\psi\lessapprox 1/\tan\psi. To confine subnormal values to outputs only, let

𝐜𝐨𝐬ψ=𝐫~11𝐭,cosψ¯=fl(𝐜𝐨𝐬ψ),sinψ¯=1.\mathop{\mathbf{cos}}\psi=\tilde{\mathbf{r}}_{11}\oslash\mathbf{t},\quad\underline{\cos\psi}=\mathop{\mathrm{fl}}(\mathop{\mathbf{cos}}\psi),\quad\underline{\sin\psi}=1. (43)

By substituting 1/𝐜𝐨𝐬ψ𝐭𝐫~111/\mathop{\mathbf{cos}}\psi\approx\mathbf{t}\oslash\tilde{\mathbf{r}}_{11} from (43) for 𝐬𝐞𝐜ψ\mathop{\mathbf{sec}}\psi in (42), simplifying the results, and fixing the evaluation order, the singular values of R¯~\underline{\widetilde{R}} in this case are obtained as

𝝈~1′′=𝐭𝐬𝐞𝐜φ,𝝈~2′′=𝐫~22(𝐬𝐞𝐜φ𝐜𝐨𝐬ψ).\tilde{\bm{\sigma}}_{1}^{\prime\prime}=\mathbf{t}\oslash\mathop{\mathbf{sec}}\varphi,\quad\tilde{\bm{\sigma}}_{2}^{\prime\prime}=\tilde{\mathbf{r}}_{22}\odot(\mathop{\mathbf{sec}}\varphi\odot\mathop{\mathbf{cos}}\psi). (44)

From (41), tanψ>ν\tan\psi>\nu implies tanφtan(2φ)/2r~22/r~12r~11/r~12<1/(ν1)\tan\varphi\lessapprox\tan(2\varphi)/2\lessapprox\tilde{r}_{22}/\tilde{r}_{12}\leq\tilde{r}_{11}/\tilde{r}_{12}<1/(\nu-1), so secφ1\sec\varphi\gtrapprox 1. Therefore, 𝐬𝐞𝐜φ\mathop{\mathbf{sec}}\varphi may be eliminated from (44), as is done in xLASV2.

The SVD of R¯~\underline{\widetilde{R}} has thus been computed (without explicitly forming U~φ¯\underline{\widetilde{U}_{\varphi}} and V~ψ¯\underline{\widetilde{V}_{\psi}}) as

R¯~\displaystyle\underline{\widetilde{R}} [cosφ¯sinφ¯sinφ¯cosφ¯]P𝝈~P𝝈~T[𝝈~1′′00𝝈~2′′]P𝝈~P𝝈~T[cosψ¯sinψ¯sinψ¯cosψ¯]\displaystyle\approx\begin{bmatrix}\underline{\cos\varphi}&-\underline{\sin\varphi}\\ \underline{\sin\varphi}&\hphantom{-}\underline{\cos\varphi}\end{bmatrix}P_{\tilde{\bm{\sigma}}}P_{\tilde{\bm{\sigma}}}^{T}\begin{bmatrix}\tilde{\bm{\sigma}}_{1}^{\prime\prime}&0\\ 0&\tilde{\bm{\sigma}}_{2}^{\prime\prime}\end{bmatrix}P_{\tilde{\bm{\sigma}}}P_{\tilde{\bm{\sigma}}}^{T}\begin{bmatrix}\hphantom{-}\underline{\cos\psi}&\underline{\sin\psi}\\ -\underline{\sin\psi}&\underline{\cos\psi}\end{bmatrix} (45)
=(U~φ¯P𝝈~)(P𝝈~TΣ~𝝈~′′¯P𝝈~)(P𝝈~TV~ψT¯)=U¯~Σ~𝝈~¯V~T¯.\displaystyle=(\underline{\widetilde{U}_{\varphi}}P_{\tilde{\bm{\sigma}}})(P_{\tilde{\bm{\sigma}}}^{T}\underline{\widetilde{\Sigma}_{\tilde{\bm{\sigma}}}^{\prime\prime}}P_{\tilde{\bm{\sigma}}})(P_{\tilde{\bm{\sigma}}}^{T}\underline{\widetilde{V}_{\psi}^{T}})=\underline{\widetilde{U}}\underline{\widetilde{\Sigma}_{\tilde{\bm{\sigma}}}^{\prime}}\underline{\widetilde{V}^{T}}.

If 𝝈~1′′𝝈~2′′\tilde{\bm{\sigma}}_{1}^{\prime\prime}\prec\tilde{\bm{\sigma}}_{2}^{\prime\prime}, then 𝝈~1=𝝈~2′′\tilde{\bm{\sigma}}_{1}^{\prime}=\tilde{\bm{\sigma}}_{2}^{\prime\prime}, 𝝈~2=𝝈~1′′\tilde{\bm{\sigma}}_{2}^{\prime}=\tilde{\bm{\sigma}}_{1}^{\prime\prime}, and P𝝈~=[0110]P_{\tilde{\bm{\sigma}}}=\left[\begin{smallmatrix}0&1\\ 1&0\end{smallmatrix}\right], else 𝝈~i=𝝈~i′′\tilde{\bm{\sigma}}_{i}^{\prime}=\tilde{\bm{\sigma}}_{i}^{\prime\prime} and P𝝈~=IP_{\tilde{\bm{\sigma}}}=I, as presented in Algorithm 2. If r11¯<r22¯\underline{r_{11}}<\underline{r_{22}} then V¯ˇ\underline{\check{V}} should be formed as in (35), and U¯ˇ\underline{\check{U}} as well if 𝔱13\mathfrak{t}^{\prime}\cong 13. Else, if r11¯r22¯\underline{r_{11}}\geq\underline{r_{22}}, then V¯ˇ=V¯~\underline{\check{V}}=\underline{\widetilde{V}}, and (only implicitly for 𝔱=15\mathfrak{t}^{\prime}=15) U¯ˇ=U¯~\underline{\check{U}}=\underline{\widetilde{U}}.

Algorithm 2 Computation of the functions of ψ\psi and the singular values of R¯~\underline{\widetilde{R}}.
1:  t¯=fma(r~22¯,tanφ¯,r~12¯);tanψ¯=fl(t¯/r~11¯)\underline{t}=\mathop{\mathrm{fma}}(\underline{\tilde{r}_{22}},\underline{\tan\varphi},\underline{\tilde{r}_{12}});\quad\underline{\tan\psi}=\mathop{\mathrm{fl}}(\underline{t}/\underline{\tilde{r}_{11}}) /​/ (41)
2:  if tanψ¯=\underline{\tan\psi}=\infty then /​/ only if 𝔱13\mathfrak{t}^{\prime}\cong 13
3:     𝐭=(et¯,ft¯);𝐜𝐨𝐬ψ=𝐫~11𝐭;cosψ¯=fl(𝐜𝐨𝐬ψ);sinψ¯=1\mathbf{t}=(e_{\underline{t}},f_{\underline{t}});\quad\mathop{\mathbf{cos}}\psi=\tilde{\mathbf{r}}_{11}\oslash\mathbf{t};\quad\underline{\cos\psi}=\mathop{\mathrm{fl}}(\mathop{\mathbf{cos}}\psi);\quad\underline{\sin\psi}=1 /​/ (43)
4:     𝝈~1′′=𝐭;𝝈~2′′=𝐫~22𝐜𝐨𝐬ψ\tilde{\bm{\sigma}}_{1}^{\prime\prime}=\mathbf{t};\quad\tilde{\bm{\sigma}}_{2}^{\prime\prime}=\tilde{\mathbf{r}}_{22}\odot\mathop{\mathbf{cos}}\psi /​/ (44)
5:  else /​/ the general case
6:     secψ¯=hypot(tanψ¯,1);cosψ¯=fl(1/secψ¯);sinψ¯=fl(tanψ¯/secψ¯)\underline{\sec\psi}=\mathop{\mathrm{hypot}}(\underline{\tan\psi},1);\quad\underline{\cos\psi}=\mathop{\mathrm{fl}}(1/\underline{\sec\psi});\quad\underline{\sin\psi}=\mathop{\mathrm{fl}}(\underline{\tan\psi}/\underline{\sec\psi})
7:     𝐬𝐞𝐜φ=(esecφ¯,fsecφ¯);𝐬𝐞𝐜ψ=(esecψ¯,fsecψ¯);𝐬ψφ=𝐬𝐞𝐜φ𝐬𝐞𝐜ψ\mathop{\mathbf{sec}}\varphi=(e_{\underline{\sec\varphi}},f_{\underline{\sec\varphi}});\quad\mathop{\mathbf{sec}}\psi=(e_{\underline{\sec\psi}},f_{\underline{\sec\psi}});\quad\mathbf{s}_{\psi}^{\varphi}=\mathop{\mathbf{sec}}\varphi\oslash\mathop{\mathbf{sec}}\psi
8:     𝝈~1′′=𝐫~11𝐬ψφ;𝝈~2′′=𝐫~22𝐬ψφ\tilde{\bm{\sigma}}_{1}^{\prime\prime}=\tilde{\mathbf{r}}_{11}\oslash\mathbf{s}_{\psi}^{\varphi};\quad\tilde{\bm{\sigma}}_{2}^{\prime\prime}=\tilde{\mathbf{r}}_{22}\odot\mathbf{s}_{\psi}^{\varphi} /​/ (42)
9:  end if
10:  if 𝝈~1′′𝝈~2′′\tilde{\bm{\sigma}}_{1}^{\prime\prime}\prec\tilde{\bm{\sigma}}_{2}^{\prime\prime} then /​/ (8)
11:     𝝈~1=𝝈~2′′;𝝈~2=𝝈~1′′;P𝝈~=[0110]\tilde{\bm{\sigma}}_{1}^{\prime}=\tilde{\bm{\sigma}}_{2}^{\prime\prime};\quad\tilde{\bm{\sigma}}_{2}^{\prime}=\tilde{\bm{\sigma}}_{1}^{\prime\prime};\quad P_{\tilde{\bm{\sigma}}}=\left[\begin{smallmatrix}0&1\\ 1&0\end{smallmatrix}\right]
12:  else /​/ the general case
13:     𝝈~1=𝝈~1′′;𝝈~2=𝝈~2′′;P𝝈~=[1001]\tilde{\bm{\sigma}}_{1}^{\prime}=\tilde{\bm{\sigma}}_{1}^{\prime\prime};\quad\tilde{\bm{\sigma}}_{2}^{\prime}=\tilde{\bm{\sigma}}_{2}^{\prime\prime};\quad P_{\tilde{\bm{\sigma}}}=\left[\begin{smallmatrix}1&0\\ 0&1\end{smallmatrix}\right]
14:  end if

3.3.2 The SVD of GG

The approximate backscaled singular values of GG are 𝝈i=2s𝝈~i\bm{\sigma}_{i}=2^{-s}\odot\tilde{\bm{\sigma}}_{i}^{\prime}. They should remain in the exponent-“mantissa” form if possible, to avoid overflows and underflows.

Recall that, for 𝔱=15\mathfrak{t}^{\prime}=15, U~φT¯\underline{\widetilde{U}_{\varphi}^{T}} and the QR rotation UϑT¯\underline{U_{\vartheta}^{T}} have not been explicitly formed. The reason is that U^T¯=Uˇ𝝋T¯U+T¯\underline{\widehat{U}^{T}}=\underline{\check{U}_{\bm{\varphi}}^{T}}\underline{U_{+}^{T}}, where U+T¯\underline{U_{+}^{T}} is constructed from UϑT¯\underline{U_{\vartheta}^{T}} as in (21), requires a matrix-matrix multiplication that can and sporadically will degrade the numerical orthogonality of U¯^\underline{\widehat{U}}. On its own, such a problem is expected and can be tolerated, but if the left singular vectors of a pivot submatrix are applied to a pair of pivot rows of a large iteration matrix, many times throughout the Kogbetliantz process (1), it is imperative to make the vectors as orthogonal as possible, and thus try not to destroy the singular values of the iteration matrix. In the following, U¯^\underline{\widehat{U}} is generated from a single tanϕ\tan{\bm{\phi}}, where tanϕ\tan{\bm{\phi}} is a function of the already computed tan𝝋\tan{\bm{\varphi}} and tanϑ\tan{\vartheta}.

If 𝔱13\mathfrak{t}^{\prime}\cong 13, let U¯^=U+U¯ˇ\underline{\widehat{U}}=U_{+}\underline{\check{U}}, where U+U_{+} comes from Section 3.2.1. Else, due to (35), if S22T=IS_{22}^{T}=I in (21), the product Uˇ𝝋TUϑT=U𝝋+ϑT\check{U}_{\bm{\varphi}}^{T}U_{\vartheta}^{T}=U_{\bm{\varphi}+\vartheta}^{T} can be written in terms of 𝝋+ϑ\bm{\varphi}+\vartheta as

U𝝋+ϑT=S2T[cos𝝋sin𝝋sin𝝋cos𝝋][cosϑsinϑsinϑcosϑ]=S2T[cos(𝝋+ϑ)sin(𝝋+ϑ)sin(𝝋+ϑ)cos(𝝋+ϑ)].U_{\bm{\varphi}+\vartheta}^{T}=S_{2}^{T}\begin{bmatrix}\hphantom{-}\cos{\bm{\varphi}}&\sin{\bm{\varphi}}\\ -\sin{\bm{\varphi}}&\cos{\bm{\varphi}}\end{bmatrix}\begin{bmatrix}\hphantom{-}\cos\vartheta&\sin\vartheta\\ -\sin\vartheta&\cos\vartheta\end{bmatrix}=S_{2}^{T}\begin{bmatrix}\hphantom{-}\cos(\bm{\varphi}+\vartheta)&\sin(\bm{\varphi}+\vartheta)\\ -\sin(\bm{\varphi}+\vartheta)&\cos(\bm{\varphi}+\vartheta)\end{bmatrix}. (46)

If S22T=[1001]S_{22}^{T}=\left[\begin{smallmatrix}1&\hphantom{-}0\\ 0&-1\end{smallmatrix}\right], U𝝋ϑT=Uˇ𝝋TS22TUϑTU_{\bm{\varphi}-\vartheta}^{T}=\check{U}_{\bm{\varphi}}^{T}S_{22}^{T}U_{\vartheta}^{T} can be written in terms of 𝝋ϑ\bm{\varphi}-\vartheta as

U𝝋ϑT=S2T[cos(𝝋ϑ)sin(𝝋ϑ)sin(𝝋ϑ)cos(𝝋ϑ)]=S2T[cos(𝝋ϑ)sin(𝝋ϑ)sin(𝝋ϑ)cos(𝝋ϑ)]S2,U_{\bm{\varphi}-\vartheta}^{T}=S_{2}^{T}\begin{bmatrix}\hphantom{-}\cos(\bm{\varphi}-\vartheta)&-\sin(\bm{\varphi}-\vartheta)\\ -\sin(\bm{\varphi}-\vartheta)&-\cos(\bm{\varphi}-\vartheta)\end{bmatrix}=S_{2}^{T}\begin{bmatrix}\hphantom{-}\cos(\bm{\varphi}-\vartheta)&\sin(\bm{\varphi}-\vartheta)\\ -\sin(\bm{\varphi}-\vartheta)&\cos(\bm{\varphi}-\vartheta)\end{bmatrix}S_{2}, (47)

which is obtained by multiplying the matrices UφT[1001]UϑTU_{\varphi}^{T}\left[\begin{smallmatrix}1&\hphantom{-}0\\ 0&-1\end{smallmatrix}\right]U_{\vartheta}^{T} and simplifying the result using the trigonometric identities for the (co)sine of the difference of the angles 𝝋\bm{\varphi} and ϑ\vartheta. The middle matrix factor represents a possible sign change of r22r_{22}^{\prime} as in Section 3.2.2. The matrices defined in (46) and (47) are determined by tan(𝝋+ϑ)\tan(\bm{\varphi}+\vartheta) and tan(𝝋ϑ)\tan(\bm{\varphi}-\vartheta), respectively, where these tangents follow from the already computed ones as

tan(𝝋+ϑ)=tan𝝋+tanϑ1tan𝝋tanϑ,tan(𝝋ϑ)=tan𝝋tanϑ1+tan𝝋tanϑ.\tan(\bm{\varphi}+\vartheta)=\frac{\tan\bm{\varphi}+\tan\vartheta}{1-\tan{\bm{\varphi}}\tan\vartheta},\qquad\tan(\bm{\varphi}-\vartheta)=\frac{\tan\bm{\varphi}-\tan\vartheta}{1+\tan{\bm{\varphi}}\tan\vartheta}. (48)

Finally, from (21) and (35), using either (46) or (47), the SVD of GG is completed as

U^T¯=U𝝋±ϑT¯PUTS1T,U¯=U¯^P𝝈~,V¯=V¯ˇP𝝈~.\underline{\widehat{U}^{T}}=\underline{U_{\bm{\varphi}\pm\vartheta}^{T}}P_{U}^{T}S_{1}^{T},\quad\underline{U}=\underline{\widehat{U}}P_{\tilde{\bm{\sigma}}},\qquad\underline{V}=\underline{\check{V}}P_{\tilde{\bm{\sigma}}}. (49)

For (49), PUTS1TP_{U}^{T}S_{1}^{T} from (21) is explicitly built and stored. It contains exactly one ±1\pm 1 element in each row and column, while the other is zero. Its multiplication by U𝝋±ϑT¯\underline{U_{\bm{\varphi}\pm\vartheta}^{T}} is thus performed error-free. The tangents computed as in (48) might be relatively inaccurate in theory, but the transformations they define via the cosines and the sines from either (46) or (47) are numerically orthogonal in practice, as shown in Section 4.

This heuristic might become irrelevant if the ab+cdab+cd floating-point operation with a single rounding [23] becomes supported in hardware. Then, each element of Uˇ𝝋T¯UϑT¯\underline{\check{U}_{\bm{\varphi}}^{T}}\underline{U_{\vartheta}^{T}} (a product of two 2×22\times 2 matrices) can be formed with one such operation. It remains to be seen if the multiplication approach improves accuracy of the computed left singular vectors without spoiling their orthogonality, compared to the proposed heuristic.

From the method’s construction, it follows that if the side (left or right) on which the signs are extracted while preparing RR is fixed (see Section 3.1) and whenever the assumptions on the arithmetic hold, the SVD of GG as proposed here is bitwise reproducible for any GG with finite elements. Also, the method does not produce any infinite or undefined element in its outputs UU, VV, and (conditionally, as described) Σ\Sigma.

3.4 A complex input matrix

If GG is purely imaginary, ±iG\pm\mathrm{i}G is real. Else, if GG has at least one complex element, the proposed real method is altered, as detailed in [14, 15], in the following ways:

  1. 1.

    To make the element 0gij=|gij|eiαij0\neq g_{ij}=|g_{ij}|\mathrm{e}^{\mathrm{i}\alpha_{ij}} real and positive, its row or column is multiplied by eiαij\mathrm{e}^{-\mathrm{i}\alpha_{ij}} (which goes into a sign matrix), and the element is replaced by its absolute value. To avoid overflow, let 𝔰=𝔰+1\mathfrak{s}_{\mathbb{C}}=\mathfrak{s}+1 in (13). The exponents of each component (real and imaginary) of every element are considered in (9).

  2. 2.

    U+T¯\underline{U_{+}^{T}} is explicitly constructed in (21), and Uˇ𝝋T¯U+T¯\underline{\check{U}_{\bm{\varphi}}^{T}}\underline{U_{+}^{T}} is formed by a real-complex matrix multiplication. The correctly rounded ab+cdab+cd operation [23] would be helpful here. Merging Uˇ𝝋T¯UϑT¯\underline{\check{U}_{\bm{\varphi}}^{T}}\underline{U_{\vartheta}^{T}} as in (46) or (47) remains a possibility if S22S_{22} happens to be real.

  3. 3.

    Since (30) is no longer directly applicable for ensuring stability, no computation is performed in a wider datatype. Reproducibility of the whole method is conditional upon reproducibility of the complex multiplication and the absolute value (hypot\mathop{\mathrm{hypot}}).

Once R¯\underline{R} is obtained, the algorithms from Section 3.3 work unchanged.

4 Numerical testing

Numerical testing was performed on machines with a 64-core Intel Xeon Phi 7210 CPU, a 64-bit Linux, and the Intel oneAPI Base and HPC toolkits, version 2024.1.

Let the LAPACK’s xLASV2 routine be denoted by 𝙻\mathtt{L}. The Kogbetliantz SVD in the same datatype is denoted by 𝙺\mathtt{K}. Unless such information is clear from the context, let the results’ designators carry a subscript 𝙺\mathtt{K} or 𝙻\mathtt{L} in the following figures, depending on the routine that computed them, and also a superscript \circ or \bullet, signifying how the input matrices were generated. All inputs were random. Those denoted by \circ had their elements generated as Fortran’s pseudorandom numbers not above unity in magnitude, and those symbolized by \bullet had their elements’ magnitudes in the “safe” range [μ,ν/4][\mu,\nu/4], as defined by (11), to avoid overflows with 𝙻\mathtt{L} and underflows due to the prescaling in 𝙺\mathtt{K}. The latter random numbers were provided by the CPU’s rdrand instructions. If not kept, the \bullet inputs are thus not reproducible, unlike the \circ ones if the seed is preserved.

All relative error measures were computed in quadruple precision from data in the working (single or double) precision. The unknown exact singular values of the input matrices were approximated by the Kogbetliantz SVD method adapted to quadruple precision (with a hypot\mathop{\mathrm{hypot}} operation that might not have been correctly rounded).

With GG given and U¯\underline{U}, Σ¯\underline{\Sigma}, V¯\underline{V} computed, let the relative SVD residual be defined as

reG=GU¯Σ¯VT¯F/GF,\mathop{\mathrm{re}}G=\|G-\underline{U}\underline{\Sigma}\underline{V^{T}}\|_{F}/\|G\|_{F}, (50)

the maximal relative error in the computed singular values σi¯\underline{\sigma_{i}} (with σi\sigma_{i} being exact) as

reσi=|σiσi¯|/σi,1i2,σi=0σi¯=0reσi=0,\mathop{\mathrm{re}}\sigma_{i}=|\sigma_{i}-\underline{\sigma_{i}}|/\sigma_{i},\quad 1\leq i\leq 2,\qquad\sigma_{i}=0\wedge\underline{\sigma_{i}}=0\implies\mathop{\mathrm{re}}\sigma_{i}=0, (51)

and the departure from orthogonality in the Frobenius norm for matrices of the left and right singular vectors (what can be seen as the relative error with respect to II) as

reU=U¯TU¯IF,reV=V¯TV¯IF.\mathop{\mathrm{re}}U=\|\underline{U}^{T}\underline{U}-I\|_{F},\qquad\mathop{\mathrm{re}}V=\|\underline{V}^{T}\underline{V}-I\|_{F}. (52)

Every datapoint in the figures shows the maximum of a particular relative error measure over a batch of input matrices, where each batch (run) contained 2302^{30} matrices.

Figure 1 covers the case of upper triangular input matrices, which can be processed by both 𝙺\mathtt{K} and 𝙻\mathtt{L}, and the measures (50) and (52). Numerical orthogonality of the singular vectors computed by 𝙺\mathtt{K} is noticeably better in the worst case than that of the vectors obtained by 𝙻\mathtt{L}. Also, the relative SVD residuals are slightly better, in both the \bullet and the \circ runs.

Figure 1: Numerical orthogonality of the singular vectors and the relative SVD residuals with 𝙺\mathtt{K} and 𝙻\mathtt{L} on random upper triangular double precision matrices.

Figure 2 shows the relative errors in the singular values (51) of the same matrices from Figure 1. The unity mark for re𝙻σ2\mathop{\mathrm{re}_{\mathtt{L}}}\sigma_{2}^{\bullet} indicates that, in the \bullet case, 𝙻\mathtt{L} can cause the relative errors in the smaller singular values σ2¯\underline{\sigma_{2}} to be so high that their maximum was unity in all runs, so it cannot be displayed in Figure 2; the most likely cause is underflow to zero of σ2¯\underline{\sigma_{2}^{\bullet}} when the “exact” σ2>0\sigma_{2}^{\bullet}>0 in (51). However, when 𝙻\mathtt{L} managed to compute the smaller singular values accurately in the \circ case, the maximum of their relative errors was a bit smaller than the one from 𝙺\mathtt{K}, the cause of which is worth exploring. The same holds for the larger singular values, which were computed accurately by both 𝙻\mathtt{L} and 𝙺\mathtt{K}.

Figure 2: The relative errors in the singular values with 𝙺\mathtt{K} and 𝙻\mathtt{L} on random upper triangular double precision matrices, with maxκ29.45101229\max\kappa_{2}^{\bullet}\lessapprox 9.45\cdot 10^{1229} and maxκ27.321010\max\kappa_{2}^{\circ}\lessapprox 7.32\cdot 10^{10}.

To put maxκ29.45101229\max\kappa_{2}^{\bullet}\lessapprox 9.45\cdot 10^{1229}, for which 𝙺\mathtt{K} still accurately computed all singular values (in the exponent-“mantissa” form, and thus not underflowing), into perspective, the highest possible condition number for triangular matrices in the \bullet case can be estimated by recalling that Algorithms 1 and 2 were also performed in quadruple precision (to get σ1\sigma_{1}, σ2\sigma_{2}, and so κ2\kappa_{2}), where μ\mu and ν\nu of double precision, as well as ν/μ\nu/\mu, are within the normal range. Then, tanφ\tan\varphi can be made small and tanψ\tan\psi huge by, e.g.,

G=[μν/40μ]tan(2φ)=8μν2μν<tanφ4μνtanψν4μ.G=\begin{bmatrix}\mu&\nu/4\\ 0&\mu\end{bmatrix}\implies\tan(2\varphi)=\frac{8\mu}{\nu}\implies\frac{2\mu}{\nu}<\tan\varphi\lessapprox\frac{4\mu}{\nu}\implies\tan\psi\gtrapprox\frac{\nu}{4\mu}.

Therefore, the condition number of GG is a cubic expression in ν/μ\nu/\mu, since, from (42),

σ2=μsecφsecψ4μ2ν,σ1=ν4secψsecφν216μ,κ2=σ1σ2ν364μ3.\sigma_{2}=\mu\frac{\sec\varphi}{\sec\psi}\approx\frac{4\mu^{2}}{\nu},\quad\sigma_{1}=\frac{\nu}{4}\frac{\sec\psi}{\sec\varphi}\approx\frac{\nu^{2}}{16\mu},\qquad\kappa_{2}=\frac{\sigma_{1}}{\sigma_{2}}\approx\frac{\nu^{3}}{64\mu^{3}}.
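This extreme case can be reproduced in a small C sketch, with DBL_MIN and DBL_MAX standing in for μ\mu and ν\nu: the directly evaluated σ2\sigma_{2} flushes to zero, while an exponent-“mantissa” pair (f,e)(f,e) with σ2=f2e\sigma_{2}=f\cdot 2^{e} retains it, which is the point of the method's representation (the helper name is illustrative):

```c
#include <math.h>
#include <float.h>

/* For G = [mu, nu/4; 0, mu], sigma_2 ~ 4*mu^2/nu underflows to zero in
   double precision, whereas tracking the binary exponent separately
   keeps sigma_2 = f * 2^e representable.  Illustrative only. */
void sigma2_example(double *direct, double *f, int *e) {
    const double mu = DBL_MIN, nu = DBL_MAX;
    *direct = 4.0 * mu * (mu / nu);     /* mu/nu underflows to zero */
    int emu, enu;
    const double fmu = frexp(mu, &emu); /* mu = fmu * 2^emu, fmu in [0.5,1) */
    const double fnu = frexp(nu, &enu);
    *f = 4.0 * fmu * (fmu / fnu);       /* fractions stay near one */
    *e = 2 * emu - enu;                 /* exponents add and subtract */
}
```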

Figure 3 focuses on 𝙺\mathtt{K} and general input matrices, with all their elements random. Inaccuracy of the smaller singular values in the \bullet case motivated the search for safe exponent ranges of the elements of input matrices that should preserve accuracy of σ2¯\underline{\sigma_{2}} from 𝙻\mathtt{L} for 𝔱=13\mathfrak{t}=13 and from 𝙺\mathtt{K} for 𝔱=15\mathfrak{t}=15. For that, the range of random values was restricted, and only those outputs xx from rdrand for which |x|[2ςμ,ν/4]|x|\in[2^{\varsigma}\mu,\nu/4] were accepted, where ς\varsigma was a positive integer parameter independently chosen for each run.

Figure 3: Numerical orthogonality of the singular vectors, the relative SVD residuals, and the relative errors in the singular values with 𝙺\mathtt{K} on random double precision matrices, with maxκ21.4210616\max\kappa_{2}^{\bullet}\lessapprox 1.42\cdot 10^{616} and maxκ22.841010\max\kappa_{2}^{\circ}\lessapprox 2.84\cdot 10^{10}.

Figure 4 shows the results of this search for 𝙺\mathtt{K} and 𝙻\mathtt{L}. Approximately half-way through the entire normal exponent range, the relative errors in the smaller singular values stabilize to a single-digit multiple of ε\varepsilon. Thus, when the exponents of the elements of GG satisfy

max1i,j2egijmin1i,j2egij<(eνeμ)/2\max_{1\leq i,j\leq 2}{e_{g_{ij}}}-\min_{1\leq i,j\leq 2}{e_{g_{ij}}}<(e_{\nu}-e_{\mu})/2

(ignoring the exponent of 0) it might be expected that 𝙺\mathtt{K} computes σ2¯\underline{\sigma_{2}} accurately, while 𝙻\mathtt{L} should additionally be safeguarded by its user from the elements too close to μ\mu. The proposed prescaling, but with 𝔰𝙻=𝔰+1\mathfrak{s}_{\mathtt{L}}=\mathfrak{s}+1 (or more), might be applied to GG before 𝙻\mathtt{L}.

Figure 4: The observed decay of the relative errors in the smaller singular values by narrowing of the exponent range of the elements of double precision input matrices, where κ2𝙺\kappa_{2}^{\mathtt{K}} falls from 1.59103291.59\cdot 10^{329} for ς=953\varsigma=953 to 2.33103082.33\cdot 10^{308} for ς=1027\varsigma=1027, and κ2𝙻\kappa_{2}^{\mathtt{L}} from 2.23106562.23\cdot 10^{656} to 4.93106114.93\cdot 10^{611}.

A timing comparison of xLASV2 and the proposed method was not performed since the correctly rounded routines are still in development. By construction the proposed method is more computationally complex than xLASV2, so it is expected to be slower.

An unoptimized OpenMP-parallel implementation of the Kogbetliantz SVD for GG of order n>2n>2, with the scaling of GG in the spirit of [22] but stronger (accounting for the two-sided transformations of GG) and with the modified modulus pivot strategy [26], was run with 64 threads spread across the CPU cores, a deterministic reduction procedure, and OMP_DYNAMIC=FALSE. On the highly conditioned input matrices from [22], it showed up to 10% speedup over DGESVJ [27, 28], the one-sided Jacobi SVD routine without preconditioning from the threaded Intel MKL library, for large enough nn (up to 53765376). The left singular vectors from the former were a bit more orthogonal than those from the latter, while the opposite was true for the right singular vectors. The singular values from DGESVJ were less than an order of magnitude more accurate.

5 Conclusions and future work

The proposed Kogbetliantz method for the SVD of order two computed highly numerically orthogonal singular vectors in all tests. The larger singular values were relatively accurate up to a few ε\varepsilon in all tests, and the smaller ones were when the input matrices were triangular, or, for the general (without zeros) input matrices, if the range of their elements was narrower than or about half of the width of the range of normal values.

The constituent phases of the method can be used on their own. The prescaling might help xLASV2 when its inputs are small. The highly accurate triangularization might instead be combined with xLASV2, as an alternative method for general matrices. And the proposed SVD of triangular matrices demonstrates some of the benefits of the more complex correctly rounded operations (hypot\mathop{\mathrm{hypot}}), although their potential uses go beyond this method.

High relative accuracy for tan(2φ)\tan(2\varphi) from (19) might be achieved, barring underflow, if the four-way fused dot product operation ab+cd+ef+ghab+cd+ef+gh, DOT4, with a single rounding of the exact value [29], becomes available in hardware. Then the denominator of the expression for tan(2φ)\tan(2\varphi) in (19) could be computed, even without scaling if in a wider datatype, by the DOT4, and the numerator by the DOT2 (ab+cdab+cd) operation.

The proposed heuristic for improving orthogonality of the left singular vectors might be helpful in other cases when two plane rotations have to be composed into one and the tangents of their angles are known. It already brings a slight advantage to the Kogbetliantz SVD of order nn with respect to the one-sided Jacobi SVD in this regard.

With a proper vectorization, and by removing all redundancies from the preliminary implementation, it might be feasible to speed up the Kogbetliantz SVD of order n further, since adding more threads is beneficial as long as their number does not exceed n.

Supplementary information. The document sm.pdf supplements this paper with further remarks on methods for larger matrices and with the single precision testing results.

Acknowledgments. The author would like to thank Dean Singer for his material support and Vjeran Hari for fruitful discussions.

Declarations

Funding. This work was supported in part by the Croatian Science Foundation under the now-expired project IP–2014–09–3670 “Matrix Factorizations and Block Diagonalization Algorithms” (MFBDA), in the form of unlimited compute time granted for the testing.

Competing interests. The author has no relevant competing interests to declare.

Code availability. The code is available in the https://github.com/venovako/KogAcc repository and in the supporting https://github.com/venovako/libpvn repository.

References

  • Anderson et al. [1999] Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide, 3rd edn. Software, Environments and Tools. SIAM, Philadelphia, PA, USA (1999). https://doi.org/10.1137/1.9780898719604
  • Moler and Stewart [1973] Moler, C.B., Stewart, G.W.: An algorithm for generalized matrix eigenvalue problems. SIAM J. Numer. Anal. 10(2), 241–256 (1973) https://doi.org/10.1137/0710024
  • Demmel and Kahan [1990] Demmel, J., Kahan, W.: Accurate singular values of bidiagonal matrices. SIAM J. Sci. Statist. Comput. 11(5), 873–912 (1990) https://doi.org/10.1137/0911052
  • Kogbetliantz [1955] Kogbetliantz, E.G.: Solution of linear equations by diagonalization of coefficients matrix. Quart. Appl. Math. 13(2), 123–132 (1955) https://doi.org/10.1090/qam/88795
  • Charlier et al. [1987] Charlier, J.P., Vanbegin, M., Van Dooren, P.: On efficient implementations of Kogbetliantz’s algorithm for computing the singular value decomposition. Numer. Math. 52(3), 279–300 (1987) https://doi.org/10.1007/BF01398880
  • Stewart [1992] Stewart, G.W.: An updating algorithm for subspace tracking. IEEE Trans. Signal Process. 40(6), 1535–1541 (1992) https://doi.org/10.1109/78.139256
  • Quintana-Ortí et al. [1998] Quintana-Ortí, G., Sun, X., Bischof, C.H.: A BLAS-3 version of the QR factorization with column pivoting. SIAM J. Sci. Comp. 19(5), 1486–1494 (1998) https://doi.org/10.1137/S1064827595296732
  • Hari and Matejaš [2009] Hari, V., Matejaš, J.: Accuracy of two SVD algorithms for 2×2 triangular matrices. Appl. Math. Comput. 210(1), 232–257 (2009) https://doi.org/10.1016/j.amc.2008.12.086
  • Charlier and Van Dooren [1987] Charlier, J.-P., Van Dooren, P.: On Kogbetliantz’s SVD algorithm in the presence of clusters. Linear Algebra Appl. 95, 135–160 (1987) https://doi.org/10.1016/0024-3795(87)90031-0
  • Hari and Veselić [1987] Hari, V., Veselić, K.: On Jacobi methods for singular value decompositions. SIAM J. Sci. Statist. Comput. 8(5), 741–754 (1987) https://doi.org/10.1137/0908064
  • Hari and Zadelj-Martić [2007] Hari, V., Zadelj-Martić, V.: Parallelizing the Kogbetliantz method: A first attempt. J. Numer. Anal. Ind. Appl. Math. 2(1–2), 49–66 (2007)
  • Hari [1991] Hari, V.: On sharp quadratic convergence bounds for the serial Jacobi methods. Numer. Math. 60(1), 375–406 (1991) https://doi.org/10.1007/BF01385728
  • Matejaš and Hari [2015] Matejaš, J., Hari, V.: On high relative accuracy of the Kogbetliantz method. Linear Algebra Appl. 464, 100–129 (2015) https://doi.org/10.1016/j.laa.2014.02.024
  • Novaković [2020] Novaković, V.: Batched computation of the singular value decompositions of order two by the AVX-512 vectorization. Parallel Process. Lett. 30(4), Article 2050015, 1–23 (2020) https://doi.org/10.1142/S0129626420500152
  • Novaković and Singer [2022] Novaković, V., Singer, S.: A Kogbetliantz-type algorithm for the hyperbolic SVD. Numer. Algorithms 90(2), 523–561 (2022) https://doi.org/10.1007/s11075-021-01197-4
  • Bečka et al. [2002] Bečka, M., Okša, G., Vajteršic, M.: Dynamic ordering for a parallel block-Jacobi SVD algorithm. Parallel Comp. 28(2), 243–262 (2002) https://doi.org/10.1016/S0167-8191(01)00138-7
  • Okša et al. [2022] Okša, G., Yamamoto, Y., Vajteršic, M.: Convergence to singular triplets in the two-sided block-Jacobi SVD algorithm with dynamic ordering. SIAM J. Matrix Anal. Appl. 43(3), 1238–1262 (2022) https://doi.org/10.1137/21M1411895
  • IEEE Computer Society [2019] IEEE Computer Society: 754-2019 - IEEE Standard for Floating-Point Arithmetic, (2019). https://doi.org/10.1109/IEEESTD.2019.8766229
  • Borges [2020] Borges, C.F.: Algorithm 1014: An improved algorithm for hypot(x,y). ACM Trans. Math. Softw. 47(1), Article 9, 1–12 (2020) https://doi.org/10.1145/3428446
  • Sibidanov et al. [2022] Sibidanov, A., Zimmermann, P., Glondu, S.: The CORE-MATH project. In: 2022 IEEE 29th Symposium on Computer Arithmetic (ARITH), pp. 26–34 (2022). https://doi.org/10.1109/ARITH54963.2022.00014
  • Novaković [2024] Novaković, V.: Accurate complex Jacobi rotations. J. Comput. Appl. Math. 450, 116003 (2024) https://doi.org/10.1016/j.cam.2024.116003
  • Novaković [2023] Novaković, V.: Vectorization of a thread-parallel Jacobi singular value decomposition method. SIAM J. Sci. Comput. 45(3), C73–C100 (2023) https://doi.org/10.1137/22M1478847
  • Lauter [2017] Lauter, C.: An efficient software implementation of correctly rounded operations extending FMA: a+b+c and a×b+c×d. In: 2017 51st Asilomar Conference on Signals, Systems, and Computers, pp. 452–456 (2017). https://doi.org/10.1109/ACSSC.2017.8335379
  • Hubrecht et al. [2024] Hubrecht, T., Jeannerod, C.-P., Muller, J.-M.: Useful applications of correctly-rounded operators of the form ab+cd+e. In: 2024 IEEE 31st Symposium on Computer Arithmetic (ARITH), pp. 32–39 (2024). https://doi.org/10.1109/ARITH61463.2024.00015
  • Jeannerod et al. [2013] Jeannerod, C.-P., Louvet, N., Muller, J.-M.: Further analysis of Kahan’s algorithm for the accurate computation of 2×2 determinants. Math. Comp. 82(284), 2245–2264 (2013) https://doi.org/10.1090/S0025-5718-2013-02679-8
  • Novaković and Singer [2011] Novaković, V., Singer, S.: A GPU-based hyperbolic SVD algorithm. BIT 51(4), 1009–1030 (2011) https://doi.org/10.1007/s10543-011-0333-5
  • Drmač [1997] Drmač, Z.: Implementation of Jacobi rotations for accurate singular value computation in floating point arithmetic. SIAM J. Sci. Comput. 18(4), 1200–1222 (1997) https://doi.org/10.1137/S1064827594265095
  • Drmač and Veselić [2008] Drmač, Z., Veselić, K.: New fast and accurate Jacobi SVD algorithm. II. SIAM J. Matrix Anal. Appl. 29(4), 1343–1362 (2008) https://doi.org/10.1137/05063920X
  • Lutz et al. [2024] Lutz, D.R., Saini, A., Kroes, M., Elmer, T., Valsaraju, H.: Fused FP8 4-way dot product with scaling and FP32 accumulation. In: 2024 IEEE 31st Symposium on Computer Arithmetic (ARITH), pp. 40–47 (2024). https://doi.org/10.1109/ARITH61463.2024.00016