
2022

Statistical Properties of BayesCG under the Krylov prior

Tim W. Reid¹, Ilse C. F. Ipsen¹, Jon Cockayne², Chris J. Oates³

¹Department of Mathematics, North Carolina State University, Raleigh, NC 27695, US
²Mathematical Sciences, University of Southampton, Southampton, SO17 1BJ, UK
³School of Mathematics and Statistics, Newcastle University, Newcastle-upon-Tyne, NE1 7RU, UK
Abstract

1 Introduction

We present a rigorous analysis of the probabilistic numeric solver BayesCG under the Krylov prior Cockayne:BCG; RICO21 for solving systems of linear equations

𝐀𝐱=𝐛,\mathbf{A}\mathbf{x}_{*}=\mathbf{b}, (1)

with symmetric positive definite coefficient matrix 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n}.

Probabilistic numerics.

This area Cockayne:BPNM; HOG15; Oates seeks to quantify the uncertainty due to limited computational resources, and to propagate these uncertainties through computational pipelines—sequences of computations where the output of one computation is the input for the next. At the core of many computational pipelines are iterative linear solvers CIOR20; HBH20; NW06; PZSHG12; SHB21, whose computational resources are limited by the impracticality of running the solver to completion. The premature termination generates uncertainty in the computed solution.

Probabilistic numeric linear solvers.

Probabilistic numeric extensions of Krylov space and stationary iterative methods Bartels; CIOR20; Cockayne:BCG; Fanaskov21; Hennig; RICO21; WH20 model the ‘epistemic uncertainty’ in a quantity of interest, which can be the matrix inverse 𝐀1\mathbf{A}^{-1} Hennig; Bartels; WH20 or the solution 𝐱\mathbf{x}_{*} Cockayne:BCG; Bartels; CIOR20; Fanaskov21. Our quantity of interest is the solution 𝐱\mathbf{x}_{*}, and the ‘epistemic uncertainty’ is the uncertainty in the user’s knowledge of the true value of 𝐱\mathbf{x}_{*}.

The probabilistic solver takes as input a prior distribution which models the initial uncertainty in 𝐱\mathbf{x}_{*} and then computes posterior distributions which model the uncertainty remaining after each iteration. Figure 1.1 depicts a prior and posterior distribution for the solution 𝐱\mathbf{x}_{*} of a 2-dimensional linear system.

Figure 1.1: Prior and posterior distributions for a linear system (1) with n=2n=2. Top plot: prior distribution. Bottom plots: posterior distributions, where the bottom right is a zoomed in version of the bottom left. The gray shaded contours represent the areas in which the distributions are concentrated, the symbol ‘×\times’ represents the solution, and the symbol ‘+’ the mean of the prior or posterior.
Calibration.

An important criterion for probabilistic solvers is the statistical quality of their posterior distributions. A solver is ‘calibrated’ if its posterior distributions accurately model the uncertainty in 𝐱\mathbf{x}_{*} (Cockayne:BCG, Section 6.1). Probabilistic Krylov solvers are not always calibrated because their posterior distributions tend to be pessimistic: the posteriors imply that the error is larger than it actually is (Bartels, Section 6.4), (Cockayne:BCG, Section 6.1). Previous efforts for improving calibration have focused on scaling the posterior covariances (Cockayne:BCG, Section 4.2), (Fanaskov21, Section 7), (WH20, Section 3).

BayesCG.

We analyze the calibration of BayesCG under the Krylov prior Cockayne:BCG; RICO21. BayesCG was introduced in Cockayne:BCG as a probabilistic numeric extension of the Conjugate Gradient (CG) method Hestenes for solving (1). The Krylov prior proposed in RICO21 makes BayesCG competitive with CG. The numerical properties of BayesCG under the Krylov prior are analysed in RICO21, while here we analyse its statistical properties.

1.1 Contributions and Overview

Overall Conclusion.

BayesCG under the Krylov prior is not calibrated in the strict sense, but has the desirable properties of a calibrated solver. Under the efficient approximate Krylov posteriors, BayesCG is only slightly optimistic and competitive with CG.

Background (Section 2).

We present a short review of BayesCG, and the Krylov prior and posteriors.

Approximate Krylov posteriors (Section 3).

We define the 𝐀\mathbf{A}-Wasserstein distance (Definition 3.3, Theorem 3.4); determine the error between Krylov posteriors and their low-rank approximations in the 𝐀\mathbf{A}-Wasserstein distance (Theorem 3.5); and present a statistical interpretation of a Krylov prior as an empirical Bayesian procedure (Theorem 3.7, Remark 3.8).

Calibration (Section 4).

We review the strict notion of calibration for probabilistic solvers (Definition 4.1, Lemma 4.2), and show that it does not apply to BayesCG under the Krylov prior (Remark 4.6).

We relax the strict notion above and propose as an alternative form of assessment two test statistics that are necessary but not sufficient for calibration: the ZZ-statistic (Theorem 4.9) and the new SS-statistic (Theorem 4.15, Definition 4.17). We present implementations for both (Algorithms 4.1 and 4.2); and apply a Kolmogorov-Smirnov statistic (Definition 4.11) for evaluating the quality of samples from the ZZ-statistic.

The ZZ-statistic is inconclusive about the calibration of BayesCG under the Krylov prior (Theorem 4.13), while the SS-statistic indicates that it is not calibrated (Section 4.3.4).

Numerical Experiments (Section 5).

We create a calibrated but slowly converging version of BayesCG which has random search directions, and use it as a baseline for comparison with two BayesCG versions that both replicate CG: BayesCG under the inverse and the Krylov priors.

We assess calibration with the ZZ- and SS-statistics for BayesCG with random search directions (Algorithms B.1 and B.2); BayesCG under the inverse prior (Algorithms 2.1 and B.3); and BayesCG under the Krylov prior with full posteriors (Algorithm B.4) and approximate posteriors (Algorithm B.5).

Both the ZZ- and SS-statistics indicate that BayesCG with random search directions is indeed a calibrated solver, while BayesCG under the inverse prior is pessimistic.

The SS-statistic indicates that BayesCG under full Krylov posteriors mimics a calibrated solver, and that BayesCG under rank-50 approximate posteriors does so to a lesser extent, being slightly optimistic.

1.2 Notation

Matrices are represented in bold uppercase, such as 𝐀\mathbf{A}; vectors in bold lowercase, such as 𝐛\mathbf{b}; and scalars in lowercase, such as mm.

The m×mm\times m identity matrix is 𝐈m\mathbf{I}_{m}, or just 𝐈\mathbf{I} if the dimension is clear. The Moore-Penrose inverse of a matrix 𝐀\mathbf{A} is 𝐀\mathbf{A}^{\dagger}, and the matrix square root is 𝐀1/2\mathbf{A}^{1/2} (Higham08, Chapter 6).

Probability distributions are represented in lowercase Greek, such as μm\mu_{m}; and random variables in uppercase, such as XX. The random variable XX having distribution μ\mu is represented by XμX\sim\mu, and its expectation by 𝔼[X]\operatorname{\mathbb{E}}[X].

The Gaussian distribution with mean 𝐱n\mathbf{x}\in\mathbb{R}^{n} and covariance 𝚺n×n\mathbf{\Sigma}\in\mathbb{R}^{n\times n} is denoted by 𝒩(𝐱,𝚺)\operatorname{\mathcal{N}}(\mathbf{x},\mathbf{\Sigma}), and the chi-squared distribution with ff degrees of freedom by χf2\chi_{f}^{2}.

2 Review of Existing Work

We review BayesCG (Section 2.1), the ideal Krylov prior (Section 2.2), and practical approximations for Krylov posteriors (Section 2.3). All statements in this section hold in exact arithmetic.

2.1 BayesCG

We review the computation of posterior distributions for BayesCG under general priors (Theorem 2.1), and present pseudocode for BayesCG (Algorithm 2.1).

Given an initial guess 𝐱0\mathbf{x}_{0}, BayesCG Cockayne:BCG solves symmetric positive definite linear systems (1) by computing iterates 𝐱m\mathbf{x}_{m} that converge to the solution 𝐱\mathbf{x}_{*}. In addition, BayesCG computes probability distributions that quantify the uncertainty about the solution at each iteration mm. Specifically, for a user-specified Gaussian prior μ0𝒩(𝐱0,𝚺0)\mu_{0}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{0},\mathbf{\Sigma}_{0}), BayesCG computes posterior distributions μm𝒩(𝐱m,𝚺m)\mu_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{\Sigma}_{m}), by conditioning a random variable Xμ0X\sim\mu_{0} on information from mm search directions 𝐒m\mathbf{S}_{m}.

Theorem 2.1 ((Cockayne:BCG, Proposition 1), (RICO21, Theorem 2.1)).

Let 𝐀𝐱=𝐛\mathbf{A}\mathbf{x}_{*}=\mathbf{b} be a linear system where 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} is symmetric positive definite. Let μ0𝒩(𝐱0,𝚺0)\mu_{0}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{0},\mathbf{\Sigma}_{0}) be a prior with symmetric positive semi-definite covariance 𝚺0n×n\mathbf{\Sigma}_{0}\in\mathbb{R}^{n\times n}, and initial residual 𝐫0𝐛𝐀𝐱0\mathbf{r}_{0}\equiv\mathbf{b}-\mathbf{A}\mathbf{x}_{0}.

Pick mnm\leq n so that 𝐒m[𝐬1𝐬2𝐬m]n×m\mathbf{S}_{m}\equiv\begin{bmatrix}\mathbf{s}_{1}&\mathbf{s}_{2}&\cdots&\mathbf{s}_{m}\end{bmatrix}\in\mathbb{R}^{n\times m} has rank(𝐒m)=m\operatorname{rank}(\mathbf{S}_{m})=m and 𝚲m𝐒mT𝐀𝚺0𝐀𝐒m\mathbf{\Lambda}_{m}\equiv\mathbf{S}_{m}^{T}\mathbf{A}\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{S}_{m} is non-singular. Then, the BayesCG posterior μm𝒩(𝐱m,𝚺m)\mu_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{\Sigma}_{m}) has mean and covariance

𝐱m\displaystyle\mathbf{x}_{m} =𝐱0+𝚺0𝐀𝐒m𝚲m1𝐒mT𝐫0\displaystyle=\mathbf{x}_{0}+\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{S}_{m}\mathbf{\Lambda}_{m}^{-1}\mathbf{S}_{m}^{T}\mathbf{r}_{0} (2)
𝚺m\displaystyle\mathbf{\Sigma}_{m} =𝚺0𝚺0𝐀𝐒m𝚲m1𝐒mT𝐀𝚺0.\displaystyle=\mathbf{\Sigma}_{0}-\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{S}_{m}\mathbf{\Lambda}_{m}^{-1}\mathbf{S}_{m}^{T}\mathbf{A}\mathbf{\Sigma}_{0}. (3)
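For concreteness, (2) and (3) can be evaluated directly once the search directions are available. The following is a minimal NumPy sketch of these two formulas; the function and variable names are ours, not from the paper.

```python
import numpy as np

def bayescg_posterior(A, b, x0, Sigma0, S):
    """Posterior mean (2) and covariance (3) from Theorem 2.1, given m search
    directions S (columns) with nonsingular Lambda_m = S^T A Sigma0 A S."""
    r0 = b - A @ x0                                   # initial residual r_0
    AS = A @ S                                        # A S_m
    K = Sigma0 @ AS                                   # Sigma0 A S_m
    Lam = AS.T @ K                                    # Lambda_m = S_m^T A Sigma0 A S_m
    x_m = x0 + K @ np.linalg.solve(Lam, S.T @ r0)     # posterior mean (2)
    Sigma_m = Sigma0 - K @ np.linalg.solve(Lam, K.T)  # posterior covariance (3)
    return x_m, Sigma_m
```

If the columns of S are the \mathbf{A}\mathbf{\Sigma}_{0}\mathbf{A}-orthogonal directions discussed below, then \mathbf{\Lambda}_{m} is diagonal and the two solves reduce to divisions.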

Algorithm 2.1 represents the iterative computation of the posteriors from (Cockayne:BCG, Propositions 6 and 7), (RICO21, Theorem 2.7). To illustrate the resemblance of BayesCG and the Conjugate Gradient method, we present the most common implementation of CG in Algorithm 2.2.

BayesCG (Algorithm 2.1) computes specific search directions 𝐒m\mathbf{S}_{m} with two additional properties:

  1. 1.

    They are 𝐀𝚺0𝐀\mathbf{A}\mathbf{\Sigma}_{0}\mathbf{A}-orthogonal, which means that 𝚲m=𝐒mT𝐀𝚺0𝐀𝐒m\mathbf{\Lambda}_{m}=\mathbf{S}_{m}^{T}\mathbf{A}\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{S}_{m} is diagonal (Cockayne:BCG, Section 2.3), thus easy to invert.

  2. 2.

    They form a basis for the Krylov space (BCG:Supp, Proposition S4)

    range(𝐒m)=𝒦m(𝐀𝚺0𝐀,𝐫0)span{𝐫0,𝐀𝚺0𝐀𝐫0,,(𝐀𝚺0𝐀)m1𝐫0}.\operatorname{range}(\mathbf{S}_{m})=\mathcal{K}_{m}(\mathbf{A}\mathbf{\Sigma}_{0}\mathbf{A},\mathbf{r}_{0})\equiv\operatorname{span}\{\mathbf{r}_{0},\mathbf{A}\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{r}_{0},\ldots,(\mathbf{A}\mathbf{\Sigma}_{0}\mathbf{A})^{m-1}\mathbf{r}_{0}\}.
Remark 2.2.

The additional requirement 𝐱𝐱0range(𝚺0)\mathbf{x}_{*}-\mathbf{x}_{0}\in\operatorname{range}(\mathbf{\Sigma}_{0}) in Algorithm 2.1 ensures the nonsingularity of 𝚲m\mathbf{\Lambda}_{m} as required by Theorem 2.1, even for singular prior covariance matrices 𝚺0\mathbf{\Sigma}_{0} (RICO21, Theorem 2.7).

Algorithm 2.1 BayesCG (RICO21, Algorithm 2.1)
1:Input: spd 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n}, 𝐛n\mathbf{b}\in\mathbb{R}^{n}, prior μ0=𝒩(𝐱0,𝚺0)\mu_{0}=\operatorname{\mathcal{N}}(\mathbf{x}_{0},\mathbf{\Sigma}_{0}) \ignorespaces\triangleright with 𝐱𝐱0range(𝚺0)\mathbf{x}_{*}-\mathbf{x}_{0}\in\operatorname{range}(\mathbf{\Sigma}_{0})
2:𝐫0=𝐛𝐀𝐱0\mathbf{r}_{0}=\mathbf{b}-\mathbf{A}\mathbf{x}_{0} \ignorespaces\triangleright Initial residual
3:𝐬1=𝐫0\mathbf{s}_{1}=\mathbf{r}_{0} \ignorespaces\triangleright Initial search direction
4:m=0m=0 \ignorespaces\triangleright Initial iteration count
5:while not converged do
6:     m=m+1m=m+1 \ignorespaces\triangleright Increment iteration count
7:     αm=(𝐫m1T𝐫m1)/(𝐬mT𝐀𝚺0𝐀𝐬m)\alpha_{m}=\left(\mathbf{r}_{m-1}^{T}\mathbf{r}_{m-1}\right)\big{/}\left(\mathbf{s}_{m}^{T}\mathbf{A}\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{s}_{m}\right)
8:     𝐱m=𝐱m1+αm𝚺0𝐀𝐬m\mathbf{x}_{m}=\mathbf{x}_{m-1}+\alpha_{m}\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{s}_{m} \ignorespaces\triangleright Next posterior mean
9:     𝚺m=𝚺m1𝚺0𝐀𝐬m(𝚺0𝐀𝐬m)T/(𝐬mT𝐀𝚺0𝐀𝐬m)\mathbf{\Sigma}_{m}=\mathbf{\Sigma}_{m-1}-\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{s}_{m}\left(\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{s}_{m}\right)^{T}\big{/}(\mathbf{s}_{m}^{T}\mathbf{A}\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{s}_{m}) \ignorespaces\triangleright Next posterior covariance
10:     𝐫m=𝐫m1αm𝐀𝚺0𝐀𝐬m\mathbf{r}_{m}=\mathbf{r}_{m-1}-\alpha_{m}\mathbf{A}\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{s}_{m} \ignorespaces\triangleright Next residual
11:     βm=(𝐫mT𝐫m)/(𝐫m1T𝐫m1)\beta_{m}=\left(\mathbf{r}_{m}^{T}\mathbf{r}_{m}\right)\big{/}\left(\mathbf{r}_{m-1}^{T}\mathbf{r}_{m-1}\right)
12:     𝐬m+1=𝐫m+βm𝐬m\mathbf{s}_{m+1}=\mathbf{r}_{m}+\beta_{m}\mathbf{s}_{m} \ignorespaces\triangleright Next 𝐀𝚺0𝐀\mathbf{A}\mathbf{\Sigma}_{0}\mathbf{A}-orthogonal search direction
13:end while
14:Output: μm=𝒩(𝐱m,𝚺m)\mu_{m}=\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{\Sigma}_{m}) \ignorespaces\triangleright Final posterior
Algorithm 2.2 Conjugate Gradient Method (CG) (Hestenes, Section 3)
1:Input: spd 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n}, 𝐛n\mathbf{b}\in\mathbb{R}^{n}, 𝐱0n\mathbf{x}_{0}\in\mathbb{R}^{n}
2:𝐫0=𝐛𝐀𝐱0\mathbf{r}_{0}=\mathbf{b}-\mathbf{A}\mathbf{x}_{0} \ignorespaces\triangleright Initial residual
3:𝐰1=𝐫0\mathbf{w}_{1}=\mathbf{r}_{0} \ignorespaces\triangleright Initial search direction
4:m=0m=0 \ignorespaces\triangleright Initial iteration count
5:while Not converged do
6:     m=m+1m=m+1 \ignorespaces\triangleright Increment iteration count
7:     γm=(𝐫m1T𝐫m1)/(𝐰mT𝐀𝐰m)\gamma_{m}=(\mathbf{r}_{m-1}^{T}\mathbf{r}_{m-1})\big{/}(\mathbf{w}_{m}^{T}\mathbf{A}\mathbf{w}_{m}) \ignorespaces\triangleright Next step size
8:     𝐱m=𝐱m1+γm𝐰m\mathbf{x}_{m}=\mathbf{x}_{m-1}+\gamma_{m}\mathbf{w}_{m} \ignorespaces\triangleright Next iterate
9:     𝐫m=𝐫m1γm𝐀𝐰m\mathbf{r}_{m}=\mathbf{r}_{m-1}-\gamma_{m}\mathbf{A}\mathbf{w}_{m} \ignorespaces\triangleright Next residual
10:     δm=(𝐫mT𝐫m)/(𝐫m1T𝐫m1)\delta_{m}=(\mathbf{r}_{m}^{T}\mathbf{r}_{m})\big{/}(\mathbf{r}_{m-1}^{T}\mathbf{r}_{m-1})
11:     𝐰m+1=𝐫m+δm𝐰m\mathbf{w}_{m+1}=\mathbf{r}_{m}+\delta_{m}\mathbf{w}_{m} \ignorespaces\triangleright Next search direction
12:end while
13:Output: 𝐱m\mathbf{x}_{m} \ignorespaces\triangleright Final approximation for 𝐱\mathbf{x}_{*}
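For reference, the iteration in Algorithm 2.1 can also be written in a few lines of NumPy. The sketch below is a direct transcription of the pseudocode; the residual-norm stopping test and the variable names are our choices.

```python
import numpy as np

def bayescg(A, b, x0, Sigma0, max_iter, tol=1e-8):
    """A minimal sketch of BayesCG (Algorithm 2.1)."""
    x, Sigma = x0.copy(), Sigma0.copy()
    r = b - A @ x                                  # initial residual
    s = r.copy()                                   # initial search direction
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:               # convergence test (our choice)
            break
        Sigma0As = Sigma0 @ (A @ s)                # Sigma0 A s_m
        denom = s @ (A @ Sigma0As)                 # s_m^T A Sigma0 A s_m
        alpha = (r @ r) / denom
        x = x + alpha * Sigma0As                   # next posterior mean
        Sigma = Sigma - np.outer(Sigma0As, Sigma0As) / denom  # next posterior covariance
        r_new = r - alpha * (A @ Sigma0As)         # next residual
        beta = (r_new @ r_new) / (r @ r)
        s, r = r_new + beta * s, r_new             # next A Sigma0 A-orthogonal direction
    return x, Sigma                                # final posterior N(x_m, Sigma_m)
```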

2.2 The ideal Krylov Prior

After defining the Krylov space of maximal dimension (Definition 2.3), we review the ideal but impractical Krylov prior (Definition 2.4), and discuss its construction (Lemma 2.6) and properties (Theorem 2.7).

Definition 2.3.

The Krylov space of maximal dimension for Algorithm 2.2 is

𝒦g(𝐀,𝐫0)span{𝐫0,𝐀𝐫0,,𝐀g1𝐫0}.\mathcal{K}_{g}(\mathbf{A},\mathbf{r}_{0})\equiv\operatorname{span}\{\mathbf{r}_{0},\mathbf{A}\mathbf{r}_{0},\ldots,\mathbf{A}^{g-1}\mathbf{r}_{0}\}.

Here gng\leq n represents the grade of 𝐫0\mathbf{r}_{0} with respect to 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} (Liesen, Definition 4.2.1), or the invariance index for (𝐀,𝐫0)(\mathbf{A},\mathbf{r}_{0}) (Berljafa, Section 2), which is the minimum value where

𝒦g(𝐀,𝐫0)=𝒦g+i(𝐀,𝐫0),i1.\displaystyle\mathcal{K}_{g}(\mathbf{A},\mathbf{r}_{0})=\mathcal{K}_{g+i}(\mathbf{A},\mathbf{r}_{0}),\qquad i\geq 1.

The Krylov prior is a Gaussian distribution whose covariance is constructed from a basis for the maximal dimensional CG Krylov space.

Definition 2.4 ((RICO21, Definition 3.1)).

The ideal Krylov prior for 𝐀𝐱=𝐛\mathbf{A}\mathbf{x}_{*}=\mathbf{b} is η0𝒩(𝐱0,𝚪0)\eta_{0}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{0},\mathbf{\Gamma}_{0}) with symmetric positive semi-definite covariance

𝚪0𝐕𝚽𝐕Tn×n.\mathbf{\Gamma}_{0}\equiv\mathbf{V}\mathbf{\Phi}\mathbf{V}^{T}\in\mathbb{R}^{n\times n}. (4)

The columns of 𝐕[𝐯1𝐯2𝐯g]n×g\mathbf{V}\equiv\begin{bmatrix}\mathbf{v}_{1}&\mathbf{v}_{2}&\cdots&\mathbf{v}_{g}\end{bmatrix}\in\mathbb{R}^{n\times g} are an 𝐀\mathbf{A}-orthonormal basis for 𝒦g(𝐀,𝐫0)\mathcal{K}_{g}(\mathbf{A},\mathbf{r}_{0}), which means that

𝐕T𝐀𝐕=𝐈gandspan{𝐯1,,𝐯i}=𝒦i(𝐀,𝐫0),1ig.\mathbf{V}^{T}\mathbf{A}\mathbf{V}=\mathbf{I}_{g}\quad\text{and}\quad\operatorname{span}\{\mathbf{v}_{1},\ldots,\mathbf{v}_{i}\}=\mathcal{K}_{i}(\mathbf{A},\mathbf{r}_{0}),\quad 1\leq i\leq g.

The diagonal matrix 𝚽diag(ϕ1ϕg)g×g\mathbf{\Phi}\equiv\operatorname{diag}\begin{pmatrix}\phi_{1}&\cdots&\phi_{g}\end{pmatrix}\in\mathbb{R}^{g\times g} has diagonal elements

ϕi=(𝐯iT𝐫0)2,1ig.\phi_{i}=(\mathbf{v}_{i}^{T}\mathbf{r}_{0})^{2},\qquad 1\leq i\leq g. (5)
Remark 2.5.

The Krylov prior covariance satisfies the requirement of Algorithm 2.1 that 𝐱𝐱0range(𝚪0)\mathbf{x}_{*}-\mathbf{x}_{0}\in\operatorname{range}(\mathbf{\Gamma}_{0}). This follows from (Liesen, Section 5.6),

𝐱𝐱0+𝒦g(𝐀,𝐫0)=range(𝚪0).\displaystyle\mathbf{x}_{*}\in\mathbf{x}_{0}+\mathcal{K}_{g}(\mathbf{A},\mathbf{r}_{0})=\operatorname{range}(\mathbf{\Gamma}_{0}).

If the maximal Krylov space 𝒦g(𝐀,𝐫0)\mathcal{K}_{g}(\mathbf{A},\mathbf{r}_{0}) has dimension g<ng<n, then 𝚪0\mathbf{\Gamma}_{0} is singular.

Lemma 2.6 ((RICO21, Remark SM2.1)).

The Krylov prior 𝚪0\mathbf{\Gamma}_{0} can be constructed from quantities computed by CG (Algorithm 2.2),

\mathbf{v}_{i}\equiv\mathbf{w}_{i}/\sqrt{\mathbf{w}_{i}^{T}\mathbf{A}\mathbf{w}_{i}},\quad\text{and}\quad\phi_{i}\equiv\gamma_{i}\|\mathbf{r}_{i-1}\|_{2}^{2},\qquad 1\leq i\leq g.

The posterior distributions from BayesCG under the Krylov prior depend on submatrices of 𝐕\mathbf{V} and 𝚽\mathbf{\Phi},

𝐕i:j[𝐯i𝐯j]𝚽i:jdiag(ϕiϕj),1ijg,\begin{split}\mathbf{V}_{i:j}&\equiv\begin{bmatrix}\mathbf{v}_{i}&\cdots&\mathbf{v}_{j}\end{bmatrix}\\ \mathbf{\Phi}_{i:j}&\equiv\operatorname{diag}\begin{pmatrix}\phi_{i}&\cdots&\phi_{j}\end{pmatrix},\qquad 1\leq i\leq j\leq g,\end{split} (6)

where 𝐕1:g=𝐕\mathbf{V}_{1:g}=\mathbf{V}, 𝚽1:g=𝚽\mathbf{\Phi}_{1:g}=\mathbf{\Phi}, and 𝐕j+1:j=𝚽j+1:j=𝟎\mathbf{V}_{j+1:j}=\mathbf{\Phi}_{j+1:j}=\mathbf{0}, 1jn1\leq j\leq n.

Under suitable assumptions, BayesCG (Algorithm 2.1) produces the same iterates as CG (Algorithm 2.2).

Theorem 2.7 ((RICO21, Theorem 3.3)).

Let 𝐱0\mathbf{x}_{0} be the starting vector for CG (Algorithm 2.2). Then BayesCG (Algorithm 2.1) under the Krylov prior η0𝒩(𝐱0,𝚪0)\eta_{0}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{0},\mathbf{\Gamma}_{0}) produces Krylov posteriors ηm𝒩(𝐱m,𝚪m)\eta_{m}\equiv\mathcal{N}(\mathbf{x}_{m},\mathbf{\Gamma}_{m}) whose mean vectors

𝐱m=𝐱0+𝐕1:m𝐕1:mT𝐫0,1mg,\displaystyle\mathbf{x}_{m}=\mathbf{x}_{0}+\mathbf{V}_{1:m}\mathbf{V}_{1:m}^{T}\mathbf{r}_{0},\qquad 1\leq m\leq g,

are identical to the iterates in CG (Algorithm 2.2), and whose covariance matrices

𝚪m=𝐕m+1:g𝚽m+1:g𝐕m+1:gT,1m<g,\displaystyle\mathbf{\Gamma}_{m}=\mathbf{V}_{m+1:g}\mathbf{\Phi}_{m+1:g}\mathbf{V}_{m+1:g}^{T},\qquad 1\leq m<g, (7)

satisfy

trace(𝐀𝚪m)=trace(𝚽m+1:g)=𝐱𝐱m𝐀2.\operatorname{trace}(\mathbf{A}\mathbf{\Gamma}_{m})=\operatorname{trace}(\mathbf{\Phi}_{m+1:g})=\|\mathbf{x}_{*}-\mathbf{x}_{m}\|_{\mathbf{A}}^{2}. (8)

Explicit construction of the ideal Krylov prior, followed by explicit computation of the Krylov posteriors in Algorithm 2.1 is impractical, because it is more expensive than solving the linear system (1) in the first place. That is the reason for introducing practical, approximate Krylov posteriors.

2.3 Practical Krylov posteriors

We dispense with the explicit computation of the Krylov prior, and instead compute a low-rank approximation of the final posterior (Definition 2.8) by running dd additional iterations. The corresponding CG-based implementation of BayesCG under approximate Krylov posteriors is relegated to Algorithm B.5 in Appendix B.

Definition 2.8 ((RICO21, Definition 3.4)).

Given the Krylov prior η0𝒩(𝐱0,𝚪0)\eta_{0}\equiv\mathcal{N}(\mathbf{x}_{0},\mathbf{\Gamma}_{0}) with posteriors ηm𝒩(𝐱m,𝚪m)\eta_{m}\equiv\mathcal{N}(\mathbf{x}_{m},\mathbf{\Gamma}_{m}), pick some d1d\geq 1. The rank-dd approximation of ηm\eta_{m} is a Gaussian distribution η^m𝒩(𝐱m,𝚪^m)\widehat{\eta}_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\widehat{\mathbf{\Gamma}}_{m}) with the same mean 𝐱m\mathbf{x}_{m} as ηm\eta_{m}, and a rank-dd covariance

𝚪^m𝐕m+1:m+d𝚽m+1:m+d𝐕m+1:m+dT,1m<gd,\widehat{\mathbf{\Gamma}}_{m}\equiv\mathbf{V}_{m+1:m+d}\mathbf{\Phi}_{m+1:m+d}\,\mathbf{V}_{m+1:m+d}^{T},\qquad 1\leq m<g-d,

that consists of the leading dd columns of 𝐕m+1:g\mathbf{V}_{m+1:g}.

In contrast to the full Krylov posteriors, which reproduce the error as in (8), approximate Krylov posteriors underestimate the error (RICO21, Section 3.4),

trace(𝐀𝚪^m)=trace(𝚽m+1:m+d)=𝐱𝐱m𝐀2𝐱𝐱m+d𝐀2,\operatorname{trace}(\mathbf{A}\widehat{\mathbf{\Gamma}}_{m})=\operatorname{trace}(\mathbf{\Phi}_{m+1:m+d})=\|\mathbf{x}_{*}-\mathbf{x}_{m}\|_{\mathbf{A}}^{2}-\|\mathbf{x}_{*}-\mathbf{x}_{m+d}\|_{\mathbf{A}}^{2}, (9)

where 𝐱𝐱m+d𝐀2\|\mathbf{x}_{*}-\mathbf{x}_{m+d}\|_{\mathbf{A}}^{2} is the error after m+dm+d iterations of CG. The error underestimate trace(𝐀𝚪^m)\operatorname{trace}(\mathbf{A}\widehat{\mathbf{\Gamma}}_{m}) coincides with the error estimate in (StrakosTichy, Equation (4.9)), and it becomes more accurate when convergence is fast: fast convergence implies 𝐱𝐱m+d𝐀2𝐱𝐱m𝐀2\|\mathbf{x}_{*}-\mathbf{x}_{m+d}\|_{\mathbf{A}}^{2}\ll\|\mathbf{x}_{*}-\mathbf{x}_{m}\|_{\mathbf{A}}^{2}, which together with (9) implies trace(𝐀𝚪^m)𝐱𝐱m𝐀2\operatorname{trace}(\mathbf{A}\widehat{\mathbf{\Gamma}}_{m})\approx\|\mathbf{x}_{*}-\mathbf{x}_{m}\|_{\mathbf{A}}^{2} (StrakosTichy, Section 4).
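The construction of approximate Krylov posteriors from CG quantities (Lemma 2.6 and Definition 2.8) can be sketched as follows. This is an illustration of the idea behind Algorithm B.5, not a transcription of it; all names are ours, and the sketch ignores the reorthogonalization and stopping issues that a robust implementation would need.

```python
import numpy as np

def approximate_krylov_posterior(A, b, x0, m, d):
    """Run m + d CG iterations (Algorithm 2.2), collect v_i = w_i / sqrt(w_i^T A w_i)
    and phi_i = gamma_i ||r_{i-1}||_2^2 (Lemma 2.6), and return the iterate x_m
    together with the rank-d covariance Gamma_hat_m of Definition 2.8."""
    x, r = x0.copy(), b - A @ x0
    w = r.copy()
    V, phi, xm = [], [], None
    for i in range(1, m + d + 1):
        Aw = A @ w
        wAw = w @ Aw
        gamma = (r @ r) / wAw                      # CG step size gamma_i
        V.append(w / np.sqrt(wAw))                 # A-orthonormal direction v_i
        phi.append(gamma * (r @ r))                # phi_i = gamma_i ||r_{i-1}||_2^2
        x = x + gamma * w
        r_new = r - gamma * Aw
        delta = (r_new @ r_new) / (r @ r)
        w, r = r_new + delta * w, r_new
        if i == m:
            xm = x.copy()                          # posterior mean x_m
    Vd = np.column_stack(V[m:m + d])               # V_{m+1:m+d}
    Gamma_hat = Vd @ np.diag(phi[m:m + d]) @ Vd.T  # rank-d approximate covariance
    return xm, Gamma_hat
```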

3 Approximate Krylov Posteriors

We determine the error in approximate Krylov posteriors (Section 3.1), and interpret the Krylov prior as an empirical Bayesian method (Section 3.2).

3.1 Error in Approximate Krylov Posteriors

We review the pp-Wasserstein distance (Definition 3.1), extend the 2-Wasserstein distance to the 𝐀\mathbf{A}-Wasserstein distance weighted by a symmetric positive definite matrix 𝐀\mathbf{A} (Theorem 3.4), and derive the 𝐀\mathbf{A}-Wasserstein distance between approximate and full Krylov posteriors (Theorem 3.5).

The pp-Wasserstein distance is a metric on the set of probability distributions.

Definition 3.1 ((KLMU20, Definition 2.1), (Villani09, Definition 6.1)).

The pp-Wasserstein distance between probability distributions μ\mu and ν\nu on n\mathbb{R}^{n} is

Wp(μ,ν)(infπΠ(μ,ν)n×nMN2p𝑑π(M,N))1/p,p1,W_{p}(\mu,\nu)\equiv\left(\inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathbb{R}^{n}\times\mathbb{R}^{n}}\|M-N\|_{2}^{p}\ d\pi(M,N)\right)^{1/p},\quad p\geq 1, (10)

where Π(μ,ν)\Pi(\mu,\nu) is the set of couplings between μ\mu and ν\nu, that is, the set of probability distributions on n×n\mathbb{R}^{n}\times\mathbb{R}^{n} that have μ\mu and ν\nu as marginal distributions.

In the special case p=2p=2, the 22-Wasserstein or Fréchet distance between two Gaussian distributions admits an explicit expression.

Lemma 3.2 ((Gelbrich90, Theorem 2.1)).

The 2-Wasserstein distance between Gaussian distributions μ𝒩(𝐱μ,𝚺μ)\mu\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{\mu},\mathbf{\Sigma}_{\mu}) and ν𝒩(𝐱ν,𝚺ν)\nu\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{\nu},\mathbf{\Sigma}_{\nu}) on n\mathbb{R}^{n} is

(W2(μ,ν))2=𝐱μ𝐱ν22+trace(𝚺μ+𝚺ν2(𝚺μ1/2𝚺ν𝚺μ1/2)1/2).\left(W_{2}(\mu,\nu)\right)^{2}=\|\mathbf{x}_{\mu}-\mathbf{x}_{\nu}\|_{2}^{2}+\operatorname{trace}\left(\mathbf{\Sigma}_{\mu}+\mathbf{\Sigma}_{\nu}-2\left(\mathbf{\Sigma}_{\mu}^{1/2}\mathbf{\Sigma}_{\nu}\mathbf{\Sigma}_{\mu}^{1/2}\right)^{1/2}\right).

We generalize the 2-Wasserstein distance to the 𝐀\mathbf{A}-Wasserstein distance weighted by a symmetric positive definite matrix 𝐀\mathbf{A}.

Definition 3.3.

The two-norm of 𝐱n\mathbf{x}\in\mathbb{R}^{n} weighted by a symmetric positive definite 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} is

𝐱𝐀𝐀1/2𝐱2.\|\mathbf{x}\|_{\mathbf{A}}\equiv\|\mathbf{A}^{1/2}\mathbf{x}\|_{2}. (11)

The 𝐀\mathbf{A}-Wasserstein distance between Gaussian distributions μ𝒩(𝐱μ,𝚺μ)\mu\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{\mu},\mathbf{\Sigma}_{\mu}) and ν𝒩(𝐱ν,𝚺ν)\nu\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{\nu},\mathbf{\Sigma}_{\nu}) on n\mathbb{R}^{n} is

W𝐀(μ,ν)(infπΠ(μ,ν)n×nMN𝐀2𝑑π(M,N))1/2,\displaystyle W_{\mathbf{A}}(\mu,\nu)\equiv\left(\inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathbb{R}^{n}\times\mathbb{R}^{n}}\|M-N\|_{\mathbf{A}}^{2}\ d\pi(M,N)\right)^{1/2}, (12)

where Π(μ,ν)\Pi(\mu,\nu) is the set of couplings between μ\mu and ν\nu.

We derive an explicit expression for the 𝐀\mathbf{A}-Wasserstein distance analogous to the one for the 2-Wasserstein distance in Lemma 3.2.

Theorem 3.4.

For symmetric positive definite 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n}, the 𝐀\mathbf{A}-Wasserstein distance between Gaussian distributions μ𝒩(𝐱μ,𝚺μ)\mu\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{\mu},\mathbf{\Sigma}_{\mu}) and ν𝒩(𝐱ν,𝚺ν)\nu\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{\nu},\mathbf{\Sigma}_{\nu}) on n\mathbb{R}^{n} is

(W𝐀(μ,ν))2\displaystyle\left(W_{\mathbf{A}}(\mu,\nu)\right)^{2} =𝐱μ𝐱ν𝐀2+trace(𝚺~μ)+trace(𝚺~ν)\displaystyle=\|\mathbf{x}_{\mu}-\mathbf{x}_{\nu}\|_{\mathbf{A}}^{2}+\operatorname{trace}(\widetilde{\mathbf{\Sigma}}_{\mu})+\operatorname{trace}(\widetilde{\mathbf{\Sigma}}_{\nu})
2trace((𝚺~μ1/2𝚺~ν𝚺~μ1/2)1/2),\displaystyle\qquad-2\operatorname{trace}\left((\widetilde{\mathbf{\Sigma}}_{\mu}^{1/2}\widetilde{\mathbf{\Sigma}}_{\nu}\widetilde{\mathbf{\Sigma}}_{\mu}^{1/2})^{1/2}\right), (13)

where 𝚺~μ𝐀1/2𝚺μ𝐀1/2\widetilde{\mathbf{\Sigma}}_{\mu}\equiv\mathbf{A}^{1/2}\mathbf{\Sigma}_{\mu}\mathbf{A}^{1/2} and 𝚺~ν𝐀1/2𝚺ν𝐀1/2\widetilde{\mathbf{\Sigma}}_{\nu}\equiv\mathbf{A}^{1/2}\mathbf{\Sigma}_{\nu}\mathbf{A}^{1/2}.

Proof: First express the 𝐀\mathbf{A}-Wasserstein distance as a 2-Wasserstein distance, by substituting (11) into (12),

(W𝐀(μ,ν))2=infπΠ(μ,ν)n×n𝐀1/2M𝐀1/2N22𝑑π(M,N).\left(W_{\mathbf{A}}(\mu,\nu)\right)^{2}=\inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathbb{R}^{n}\times\mathbb{R}^{n}}\|\mathbf{A}^{1/2}M-\mathbf{A}^{1/2}N\|_{2}^{2}\ d\pi(M,N). (14)

Lemma A.1 in Appendix A implies that 𝐀1/2M\mathbf{A}^{1/2}M and 𝐀1/2N\mathbf{A}^{1/2}N are again Gaussian random variables with respective means and covariances

μ~𝒩(𝐀1/2𝐱μ,𝐀1/2𝚺μ𝐀1/2𝚺~μ),ν~𝒩(𝐀1/2𝐱ν,𝐀1/2𝚺ν𝐀1/2𝚺~ν).\tilde{\mu}\equiv\mathcal{N}(\mathbf{A}^{1/2}\mathbf{x}_{\mu},\,\underbrace{\mathbf{A}^{1/2}\mathbf{\Sigma}_{\mu}\mathbf{A}^{1/2}}_{\widetilde{\mathbf{\Sigma}}_{\mu}}),\qquad\tilde{\nu}\equiv\mathcal{N}(\mathbf{A}^{1/2}\mathbf{x}_{\nu},\,\underbrace{\mathbf{A}^{1/2}\mathbf{\Sigma}_{\nu}\mathbf{A}^{1/2}}_{\widetilde{\mathbf{\Sigma}}_{\nu}}).

Thus (14) is equal to the 2-Wasserstein distance

(W𝐀(μ,ν))2=infπΠ(μ~,ν~)n×nM~N~22𝑑π(M~,N~)=(W2(μ~,ν~))2.\displaystyle\left(W_{\mathbf{A}}(\mu,\nu)\right)^{2}=\inf_{\pi\in\Pi(\tilde{\mu},\tilde{\nu})}\int_{\mathbb{R}^{n}\times\mathbb{R}^{n}}\|\widetilde{M}-\widetilde{N}\|_{2}^{2}\ d\pi(\widetilde{M},\widetilde{N})=\left(W_{2}(\tilde{\mu},\tilde{\nu})\right)^{2}. (15)

At last, apply Lemma 3.2 and the linearity of the trace.        \blacksquare
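A direct transcription of (13) into Python can be useful later on; the names are ours, and scipy.linalg.sqrtm is used for the matrix square roots.

```python
import numpy as np
from scipy.linalg import sqrtm

def wasserstein_A(A, x_mu, Sig_mu, x_nu, Sig_nu):
    """A-Wasserstein distance (13) between N(x_mu, Sig_mu) and N(x_nu, Sig_nu)."""
    Ah = np.real(sqrtm(A))                         # A^{1/2}
    St_mu = Ah @ Sig_mu @ Ah                       # congruence-transformed covariances
    St_nu = Ah @ Sig_nu @ Ah
    Sh = np.real(sqrtm(St_mu))
    cross = np.real(sqrtm(Sh @ St_nu @ Sh))        # (St_mu^{1/2} St_nu St_mu^{1/2})^{1/2}
    diff = x_mu - x_nu
    w2 = diff @ (A @ diff) + np.trace(St_mu) + np.trace(St_nu) - 2 * np.trace(cross)
    return np.sqrt(max(w2, 0.0))                   # clip tiny negative round-off
```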

We are ready to derive the 𝐀\mathbf{A}-Wasserstein distance between approximate and full Krylov posteriors.

Theorem 3.5.

Let ηm𝒩(𝐱m,𝚪m)\eta_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{\Gamma}_{m}) be a Krylov posterior from Theorem 2.7, and for some d1d\geq 1 let η^m𝒩(𝐱m,𝚪^m)\widehat{\eta}_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\widehat{\mathbf{\Gamma}}_{m}) be a rank-dd approximation from Definition 2.8. The 𝐀\mathbf{A}-Wasserstein distance between ηm\eta_{m} and η^m\widehat{\eta}_{m} is

W𝐀(ηm,η^m)=(i=m+d+1gϕi)1/2.W_{\mathbf{A}}(\eta_{m},\widehat{\eta}_{m})=\left(\sum_{i=m+d+1}^{g}\phi_{i}\right)^{1/2}. (16)

Proof: We factor the covariances into square factors, to obtain an eigenvalue decomposition for the congruence transformations of the covariances in (13).

Expand the column dimension of 𝐕m+1:g\mathbf{V}_{m+1:g} from gmg-m to nn by adding an 𝐀\mathbf{A}-orthogonal complement 𝐕mn×(ng+m)\mathbf{V}_{m}^{\perp}\in\mathbb{R}^{n\times(n-g+m)} to create an 𝐀\mathbf{A}-orthogonal matrix

𝐕~\displaystyle\widetilde{\mathbf{V}} [𝐕m+1:g𝐕m]n×n\displaystyle\equiv\begin{bmatrix}\mathbf{V}_{m+1:g}&\mathbf{V}_{m}^{\perp}\end{bmatrix}\in\mathbb{R}^{n\times n}

with 𝐕~T𝐀𝐕~=𝐈n\widetilde{\mathbf{V}}^{T}\mathbf{A}\widetilde{\mathbf{V}}=\mathbf{I}_{n}. Analogously expand the dimension of the diagonal matrices by padding with trailing zeros,

𝚽~m+1:g\displaystyle\widetilde{\mathbf{\Phi}}_{m+1:g} diag(ϕm+1ϕg𝟎1×(ng+m))n×n,\displaystyle\equiv\operatorname{diag}\begin{pmatrix}\phi_{m+1}&\cdots&\phi_{g}&\mathbf{0}_{1\times(n-g+m)}\end{pmatrix}\in\mathbb{R}^{n\times n},
𝚽~m+1:m+d\displaystyle\widetilde{\mathbf{\Phi}}_{m+1:m+d} diag(ϕm+1ϕm+d𝟎1×(nd))n×n.\displaystyle\equiv\operatorname{diag}\begin{pmatrix}\phi_{m+1}&\cdots&\phi_{m+d}&\mathbf{0}_{1\times(n-d)}\end{pmatrix}\in\mathbb{R}^{n\times n}.

Factor the covariances in terms of the above square matrices,

𝚪m=𝐕~𝚽~m+1:g𝐕~Tand𝚪^m=𝐕~𝚽~m+1:m+d𝐕~T.\mathbf{\Gamma}_{m}=\widetilde{\mathbf{V}}\widetilde{\mathbf{\Phi}}_{m+1:g}\widetilde{\mathbf{V}}^{T}\quad\text{and}\quad\widehat{\mathbf{\Gamma}}_{m}=\widetilde{\mathbf{V}}\widetilde{\mathbf{\Phi}}_{m+1:m+d}\widetilde{\mathbf{V}}^{T}.

Substitute the factorizations into (13), and compute the 𝐀\mathbf{A}-Wasserstein distance between ηm\eta_{m} and η^m\widehat{\eta}_{m} as

(W𝐀(ηm,η^m))2=trace(𝐆)+trace(𝐉)2trace((𝐆1/2𝐉𝐆1/2)1/2),\left(W_{\mathbf{A}}(\eta_{m},\widehat{\eta}_{m})\right)^{2}=\operatorname{trace}(\mathbf{G})+\operatorname{trace}(\mathbf{J})-2\operatorname{trace}\left((\mathbf{G}^{1/2}\,\mathbf{J}\,\mathbf{G}^{1/2})^{1/2}\right), (17)

where the congruence transformations of 𝚪m\mathbf{\Gamma}_{m} and 𝚪^m\widehat{\mathbf{\Gamma}}_{m} are again Hermitian,

𝐆\displaystyle\mathbf{G} 𝐀1/2𝐕~𝚽~m+1:g𝐕~T𝚪m𝐀1/2=𝐔𝚽~m+1:g𝐔T,𝐔𝐀1/2𝐕~\displaystyle\equiv\mathbf{A}^{1/2}\underbrace{\widetilde{\mathbf{V}}\widetilde{\mathbf{\Phi}}_{m+1:g}\widetilde{\mathbf{V}}^{T}}_{\mathbf{\Gamma}_{m}}\mathbf{A}^{1/2}=\mathbf{U}\widetilde{\mathbf{\Phi}}_{m+1:g}\mathbf{U}^{T},\qquad\mathbf{U}\equiv\mathbf{A}^{1/2}\widetilde{\mathbf{V}}
𝐉\displaystyle\mathbf{J} 𝐀1/2𝐕~𝚽~m+1:m+d𝐕~T𝚪^m𝐀1/2=𝐔𝚽~m+1:m+d𝐔T.\displaystyle\equiv\mathbf{A}^{1/2}\underbrace{\widetilde{\mathbf{V}}\widetilde{\mathbf{\Phi}}_{m+1:m+d}\widetilde{\mathbf{V}}^{T}}_{\widehat{\mathbf{\Gamma}}_{m}}\mathbf{A}^{1/2}=\mathbf{U}\widetilde{\mathbf{\Phi}}_{m+1:m+d}\mathbf{U}^{T}.

Lemma A.3 implies that 𝐔\mathbf{U} is an orthogonal matrix, so that the second factorizations of 𝐆\mathbf{G} and 𝐉\mathbf{J} represent eigenvalue decompositions. Commutativity of the trace implies

trace(𝐆)\displaystyle\operatorname{trace}(\mathbf{G}) =trace(𝚽~m+1:g)=i=m+1gϕi\displaystyle=\operatorname{trace}(\widetilde{\mathbf{\Phi}}_{m+1:g})=\sum_{i=m+1}^{g}{\phi_{i}}
trace(𝐉)\displaystyle\operatorname{trace}(\mathbf{J}) =trace(𝚽~m+1:m+d)=i=m+1m+dϕi.\displaystyle=\operatorname{trace}(\widetilde{\mathbf{\Phi}}_{m+1:m+d})=\sum_{i=m+1}^{m+d}{\phi_{i}}.

Since 𝐆\mathbf{G} and 𝐉\mathbf{J} have the same eigenvector matrix, they commute, and so do diagonal matrices,

𝐆1/2𝐉𝐆1/2\displaystyle\mathbf{G}^{1/2}\mathbf{J}\mathbf{G}^{1/2} =𝐔𝚽~m+1:g𝚽~m+1:m+d𝐔T\displaystyle=\mathbf{U}\widetilde{\mathbf{\Phi}}_{m+1:g}\widetilde{\mathbf{\Phi}}_{m+1:m+d}\mathbf{U}^{T}
=𝐔diag(ϕm+12ϕm+d2𝟎1×(nd))𝐔T\displaystyle=\mathbf{U}\operatorname{diag}\begin{pmatrix}\phi_{m+1}^{2}&\cdots\phi_{m+d}^{2}&\mathbf{0}_{1\times(n-d)}\end{pmatrix}\mathbf{U}^{T}

where the last equality follows from the fact that 𝚽~m+1:g\widetilde{\mathbf{\Phi}}_{m+1:g} and 𝚽~m+1:m+d\widetilde{\mathbf{\Phi}}_{m+1:m+d} share the leading dd diagonal elements. Thus

trace((𝐆1/2𝐉𝐆1/2)1/2)=i=m+1m+dϕi.\displaystyle\operatorname{trace}\left((\mathbf{G}^{1/2}\,\mathbf{J}\,\mathbf{G}^{1/2})^{1/2}\right)=\sum_{i=m+1}^{m+d}\phi_{i}.

Substituting the above expressions into (17) gives

(W𝐀(ηm,η^m))2=i=m+1gϕi+i=m+1m+dϕi2i=m+1m+dϕi=i=m+d+1gϕi.\displaystyle\left(W_{\mathbf{A}}(\eta_{m},\widehat{\eta}_{m})\right)^{2}=\sum_{i=m+1}^{g}\phi_{i}+\sum_{i=m+1}^{m+d}\phi_{i}-2\sum_{i=m+1}^{m+d}\phi_{i}=\sum_{i=m+d+1}^{g}{\phi_{i}}.

\blacksquare

Theorem 3.5 implies that the squared 𝐀\mathbf{A}-Wasserstein distance between approximate and full Krylov posteriors is the sum of the CG step sizes skipped by the approximate posterior, which, as seen in (8), (9) and (StrakosTichy, Equation (4.4)), equals the difference between the true error 𝐱𝐱m𝐀2\|\mathbf{x}_{*}-\mathbf{x}_{m}\|_{\mathbf{A}}^{2} and the error estimate trace(𝐀𝚪^m)\operatorname{trace}(\mathbf{A}\widehat{\mathbf{\Gamma}}_{m}). As a consequence, the approximation error decreases as the convergence of the posterior mean accelerates, or as the rank dd of the approximation increases.
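As a quick numerical illustration of (16), one can compare the general formula (13) with the closed form. The sketch below reuses the hypothetical helpers wasserstein_A and approximate_krylov_posterior from the earlier sketches, takes g = n (which holds generically), and uses (8)-(9) to express the sum of skipped step sizes through traces. In exact arithmetic the two printed numbers coincide; in floating point, agreement degrades once CG loses A-orthogonality.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d = 30, 5, 3
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                          # symmetric positive definite test matrix
x_star = rng.standard_normal(n)
b, x0 = A @ x_star, np.zeros(n)

xm, Gam_full = approximate_krylov_posterior(A, b, x0, m, n - m)  # full posterior (d = g - m)
_,  Gam_hat  = approximate_krylov_posterior(A, b, x0, m, d)      # rank-d approximation

lhs = wasserstein_A(A, xm, Gam_full, xm, Gam_hat)                # formula (13)
rhs = np.sqrt(max(np.trace(A @ Gam_full) - np.trace(A @ Gam_hat), 0.0))  # sqrt of skipped phi_i
print(lhs, rhs)
```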

Remark 3.6.

The distance in Theorem 3.5 is a special case of the 2-Wasserstein distance between two distributions whose covariance matrices commute (KLMU20, Corollary 2.4).

To see this, consider the 𝐀\mathbf{A}-Wasserstein distance between ηm\eta_{m} and η^m\widehat{\eta}_{m} from Theorem 3.5, and the 2-Wasserstein distance between νm𝒩(𝐱m,𝐀1/2𝚪m𝐀1/2)\nu_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{A}^{1/2}\mathbf{\Gamma}_{m}\mathbf{A}^{1/2}) and ν^m𝒩(𝐱m,𝐀1/2𝚪^m𝐀1/2)\widehat{\nu}_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{A}^{1/2}\widehat{\mathbf{\Gamma}}_{m}\mathbf{A}^{1/2}). Then (15) implies that the 𝐀\mathbf{A}-Wasserstein distance is equal to the 2-Wasserstein distance of a congruence transformation,

W𝐀(ηm,η^m)=W2(νm,ν^m).W_{\mathbf{A}}(\eta_{m},\widehat{\eta}_{m})=W_{2}(\nu_{m},\widehat{\nu}_{m}).

The covariance matrices 𝐀1/2𝚪m𝐀1/2\mathbf{A}^{1/2}\mathbf{\Gamma}_{m}\mathbf{A}^{1/2} and 𝐀1/2𝚪^m𝐀1/2\mathbf{A}^{1/2}\widehat{\mathbf{\Gamma}}_{m}\mathbf{A}^{1/2} associated with the 2-Wasserstein distance commute because they are both diagonalized by the same orthogonal matrix 𝐀1/2𝐕~\mathbf{A}^{1/2}\widetilde{\mathbf{V}}.

3.2 Probabilistic Interpretation of the Krylov Prior

We interpret the Krylov prior as an ‘empirical Bayesian procedure’ (Theorem 3.7), and elucidate the connection between the random variables and the deterministic solution (Remark 3.8).

An empirical Bayesian procedure estimates the prior from data (Berger1985, Section 4.5). Our ‘data’ are the pairs of normalized search directions 𝐯i\mathbf{v}_{i} and step sizes ϕi\phi_{i}, 1im+d1\leq i\leq m+d, from m+dm+d iterations of CG. In contrast, the usual data for BayesCG are the inner products 𝐯iT𝐛\mathbf{v}_{i}^{T}\mathbf{b}, 1im1\leq i\leq m. However, if we augment the usual data with the search directions, which is natural due to their dependence on 𝐱\mathbf{x}_{*}, then ϕi\phi_{i} is just a function of the data.

From these data we construct a prior in an empirical Bayesian fashion, starting with a random variable

X=𝐱0+i=1m+dϕi𝐯iQin,X=\mathbf{x}_{0}+\sum_{i=1}^{m+d}\sqrt{\phi_{i}}\mathbf{v}_{i}Q_{i}\in\mathbb{R}^{n},

where Qi𝒩(0,1)Q_{i}\sim\operatorname{\mathcal{N}}(0,1) are independent and identically distributed scalar Gaussian random variables, 1im+d1\leq i\leq m+d. Due to the independence of the QiQ_{i}, the above sum is the matrix vector product

X=𝐱0+𝐕1:m+d𝚽1:m+d1/2QX=\mathbf{x}_{0}+\mathbf{V}_{1:m+d}\,\mathbf{\Phi}_{1:m+d}^{1/2}\,Q (18)

where Q𝒩(𝟎,𝐈m+d)Q\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{m+d}) is a vector-valued Gaussian random variable.
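Sampling from (18) requires only the m+d quantities computed by CG; a short sketch, with names of our choosing:

```python
import numpy as np

def sample_empirical_prior(x0, V, phi, num_samples, rng=None):
    """Draw samples of X = x0 + V_{1:m+d} Phi_{1:m+d}^{1/2} Q with Q ~ N(0, I),
    as in (18). V holds the m+d columns v_1, ..., v_{m+d}; phi holds the
    diagonal of Phi_{1:m+d}."""
    rng = np.random.default_rng() if rng is None else rng
    Q = rng.standard_normal((V.shape[1], num_samples))     # i.i.d. N(0,1) entries
    return x0[:, None] + V @ (np.sqrt(phi)[:, None] * Q)   # one sample per column
```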

The distribution of XX is the empirical prior, while the distribution of XX conditioned on the random variable Y𝐕1:mT𝐀XY\equiv\mathbf{V}_{1:m}^{T}\mathbf{A}X taking the value 𝐕1:mT𝐛\mathbf{V}_{1:m}^{T}\mathbf{b} is the empirical posterior. We relate these distributions to the Krylov prior.

Theorem 3.7.

Under the assumptions of Theorem 2.7, the random variable XX in (18) is distributed according to the empirical prior

𝒩(𝐱0,𝐕1:m+d𝚽1:m+d𝐕1:m+dT),\operatorname{\mathcal{N}}\left(\mathbf{x}_{0},\mathbf{V}_{1:m+d}\mathbf{\Phi}_{1:m+d}\mathbf{V}_{1:m+d}^{T}\right),

which is the rank-(m+d)(m+d) approximation of the Krylov prior 𝚪0\mathbf{\Gamma}_{0}. The variable XX conditioned on Y𝐕1:mT𝐀XY\equiv\mathbf{V}_{1:m}^{T}\mathbf{A}X taking the value 𝐕1:mT𝐛\mathbf{V}_{1:m}^{T}\mathbf{b} is distributed according to the empirical posterior

𝒩(𝐱m,𝐕m+1:m+d𝚽m+1:m+d𝐕m+1:m+dT)=𝒩(𝐱m,𝚪^m),\operatorname{\mathcal{N}}\left(\mathbf{x}_{m},\mathbf{V}_{m+1:m+d}\mathbf{\Phi}_{m+1:m+d}\mathbf{V}_{m+1:m+d}^{T}\right)=\operatorname{\mathcal{N}}\left(\mathbf{x}_{m},\widehat{\mathbf{\Gamma}}_{m}\right),

which, in turn, is the rank-dd approximation of the Krylov posterior.

Proof: As in the proof of Theorem 2.1 in (BCG:Supp, Proof of Proposition 1), we exploit the stability and conjugacy of Gaussian distributions in Lemmas A.1 and A.2 in Appendix A.

Prior.

Lemma A.1 implies that XX in (18) is a Gaussian random variable with mean and covariance

X𝒩(𝐱0,𝐕1:m+d𝚽1:m+d𝐕1:m+dT).X\sim\mathcal{N}\left(\mathbf{x}_{0},\mathbf{V}_{1:m+d}\mathbf{\Phi}_{1:m+d}\mathbf{V}_{1:m+d}^{T}\right). (19)

Thus, the approximate Krylov prior is an empirical Bayesian prior.

Posterior.

From (19) follows that XX and Y𝐕1:mT𝐀XY\equiv\mathbf{V}_{1:m}^{T}\mathbf{A}X have the joint distribution

[XY]𝒩([𝐱0𝔼[Y]],[𝐕1:m+d𝚽1:m+d𝐕1:m+dTCov(X,Y)Cov(X,Y)TCov(Y,Y)])\begin{bmatrix}X\\ Y\end{bmatrix}\sim\operatorname{\mathcal{N}}\left(\begin{bmatrix}\mathbf{x}_{0}\\ \operatorname{\mathbb{E}}[Y]\end{bmatrix},\begin{bmatrix}\mathbf{V}_{1:m+d}\mathbf{\Phi}_{1:m+d}\mathbf{V}_{1:m+d}^{T}&\operatorname{Cov}(X,Y)\\ \operatorname{Cov}(X,Y)^{T}&\operatorname{Cov}(Y,Y)\end{bmatrix}\right) (20)

and that 𝔼[Y]=𝐕1:mT𝐀𝐱0\operatorname{\mathbb{E}}[Y]=\mathbf{V}_{1:m}^{T}\mathbf{A}\mathbf{x}_{0}. This, together with the linearity of the expectation and the 𝐀\mathbf{A}-orthonormality of 𝐕\mathbf{V} implies

Cov(Y,Y)\displaystyle\operatorname{Cov}(Y,Y) =𝔼[(Y𝔼[Y])(Y𝔼[Y])T]\displaystyle=\operatorname{\mathbb{E}}\left[(Y-\operatorname{\mathbb{E}}[Y])(Y-\operatorname{\mathbb{E}}[Y])^{T}\right]
=𝐕1:mT𝐀𝔼[(X𝐱0)(X𝐱0)T]𝐀𝐕1:m\displaystyle=\mathbf{V}_{1:m}^{T}\mathbf{A}\,\operatorname{\mathbb{E}}\left[(X-\mathbf{x}_{0})(X-\mathbf{x}_{0})^{T}\right]\mathbf{A}\mathbf{V}_{1:m}
=𝐕1:mT𝐀(𝐕1:m+d𝚽1:m+d𝐕1:m+dT)𝐀𝐕1:m\displaystyle=\mathbf{V}_{1:m}^{T}\mathbf{A}\>\left(\mathbf{V}_{1:m+d}\mathbf{\Phi}_{1:m+d}\mathbf{V}_{1:m+d}^{T}\right)\>\mathbf{A}\mathbf{V}_{1:m}
=[𝐈m𝟎]𝚽1:m+d[𝐈m𝟎]=𝚽1:m.\displaystyle=\begin{bmatrix}\mathbf{I}_{m}&\mathbf{0}\end{bmatrix}\mathbf{\Phi}_{1:m+d}\begin{bmatrix}\mathbf{I}_{m}\\ \mathbf{0}\end{bmatrix}=\mathbf{\Phi}_{1:m}.

Analogously,

Cov(X,Y)\displaystyle\operatorname{Cov}(X,Y) =𝔼[(X𝐱0)(Y𝔼[Y])T]=𝔼[(X𝐱0)(Y𝐕1:mT𝐀𝐱0)T]\displaystyle=\operatorname{\mathbb{E}}[(X-\mathbf{x}_{0})(Y-\operatorname{\mathbb{E}}[Y])^{T}]=\operatorname{\mathbb{E}}[(X-\mathbf{x}_{0})(Y-\mathbf{V}_{1:m}^{T}\mathbf{A}\mathbf{x}_{0})^{T}]
=𝔼[(X𝐱0)(X𝐱0)T]𝐀𝐕1:m=𝐕1:m+d𝚽1:m+d𝐕1:m+dT𝐀𝐕1:m\displaystyle=\operatorname{\mathbb{E}}[(X-\mathbf{x}_{0})(X-\mathbf{x}_{0})^{T}]\mathbf{A}\mathbf{V}_{1:m}=\mathbf{V}_{1:m+d}\mathbf{\Phi}_{1:m+d}\mathbf{V}_{1:m+d}^{T}\mathbf{A}\mathbf{V}_{1:m}
=𝐕1:m+d𝚽1:m+d[𝐈m𝟎]=𝐕1:m𝚽1:m.\displaystyle=\mathbf{V}_{1:m+d}\mathbf{\Phi}_{1:m+d}\begin{bmatrix}\mathbf{I}_{m}&\mathbf{0}\end{bmatrix}=\mathbf{V}_{1:m}\mathbf{\Phi}_{1:m}.

From (Stuart:BayesInverse, Theorem 6.20) follows the expression for the posterior mean,

𝐱m\displaystyle\mathbf{x}_{m} =𝐱0+Cov(X,Y)Cov(Y,Y)1(𝐕1:mT𝐛𝐕1:mT𝐀𝐱0)\displaystyle=\mathbf{x}_{0}+\operatorname{Cov}(X,Y)\operatorname{Cov}(Y,Y)^{-1}\left(\mathbf{V}_{1:m}^{T}\mathbf{b}-\mathbf{V}_{1:m}^{T}\mathbf{A}\mathbf{x}_{0}\right)
=𝐱0+𝐕1:m𝚽1:m𝚽1:m1𝐕1:mT𝐫0=𝐱0+𝐕1:m𝐕1:mT𝐫0,\displaystyle=\mathbf{x}_{0}+\mathbf{V}_{1:m}\mathbf{\Phi}_{1:m}\mathbf{\Phi}_{1:m}^{-1}\mathbf{V}_{1:m}^{T}\mathbf{r}_{0}=\mathbf{x}_{0}+\mathbf{V}_{1:m}\mathbf{V}_{1:m}^{T}\mathbf{r}_{0},

and for the posterior covariance

𝚪^m\displaystyle\widehat{\mathbf{\Gamma}}_{m} =𝐕1:m+d𝚽1:m+d𝐕1:m+dTCov(X,Y)Cov(Y,Y)1Cov(X,Y)T,\displaystyle=\mathbf{V}_{1:m+d}\mathbf{\Phi}_{1:m+d}\mathbf{V}_{1:m+d}^{T}-\operatorname{Cov}(X,Y)\operatorname{Cov}(Y,Y)^{-1}\operatorname{Cov}(X,Y)^{T},

where

Cov(X,Y)Cov(Y,Y)1Cov(X,Y)T\displaystyle\operatorname{Cov}(X,Y)\operatorname{Cov}(Y,Y)^{-1}\operatorname{Cov}(X,Y)^{T} =𝐕1:m𝚽1:m𝚽1:m1𝚽1:m𝐕1:mT\displaystyle=\mathbf{V}_{1:m}\mathbf{\Phi}_{1:m}\mathbf{\Phi}_{1:m}^{-1}\mathbf{\Phi}_{1:m}\mathbf{V}_{1:m}^{T}
=𝐕1:m𝚽1:m𝐕1:mT.\displaystyle=\mathbf{V}_{1:m}\mathbf{\Phi}_{1:m}\mathbf{V}_{1:m}^{T}.

Substituting this into 𝚪^m\widehat{\mathbf{\Gamma}}_{m} gives the expression for the posterior covariance

𝚪^m\displaystyle\widehat{\mathbf{\Gamma}}_{m} =𝐕1:m+d𝚽1:m+d𝐕1:m+dT𝐕1:m𝚽1:m𝐕1:mT\displaystyle=\mathbf{V}_{1:m+d}\mathbf{\Phi}_{1:m+d}\mathbf{V}_{1:m+d}^{T}-\mathbf{V}_{1:m}\mathbf{\Phi}_{1:m}\mathbf{V}_{1:m}^{T}
=𝐕m+1:m+d𝚽m+1:m+d𝐕m+1:m+dT.\displaystyle=\mathbf{V}_{m+1:m+d}\mathbf{\Phi}_{m+1:m+d}\mathbf{V}_{m+1:m+d}^{T}.

Thus, the posterior mean 𝐱m\mathbf{x}_{m} is equal to the one in Theorem 2.7, and the posterior covariance 𝚪^m\widehat{\mathbf{\Gamma}}_{m} is equal to the rank-dd approximate Krylov posterior in Definition 2.8.        \blacksquare

Remark 3.8.

The random variable XX in Theorem 3.7 is a surrogate for the unknown solution 𝐱\mathbf{x}_{*}. The solution 𝐱\mathbf{x}_{*} is a deterministic quantity, but prior to solving the linear system (1), we are uncertain of 𝐱\mathbf{x}_{*}, and the prior models this uncertainty.

During the course of the BayesCG iterations, we acquire information about 𝐱\mathbf{x}_{*}, and the posterior distributions μm\mu_{m}, 1mn1\leq m\leq n incorporate our increasing knowledge and, consequently, our diminishing uncertainty.

4 Calibration of BayesCG Under the Krylov Prior

We review the notion of calibration for probabilistic solvers, and show that this notion does not apply to BayesCG under the Krylov prior (Section 4.1). Then we relax this notion and analyze BayesCG with two test statistics that are necessary but not sufficient for calibration: the ZZ-statistic (Section 4.2) and the SS-statistic (Section 4.3).

4.1 Calibration

We review the definition of calibration for probabilistic linear solvers (Definition 4.1, Lemma 4.2), discuss the difference between certain random variables (Remark 4.3), present two illustrations (Examples 4.4 and 4.5), and explain why this notion of calibration does not apply to BayesCG under the Krylov prior (Remark 4.6).

Informally, a probabilistic numerical solver is calibrated if its posterior distributions accurately model the uncertainty in the solution CGOS21; CIOR20.

Definition 4.1 ((CIOR20, Definition 6)).

Let 𝐀X=B\mathbf{A}X_{*}=B be a class of linear systems where 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} is symmetric positive definite, and the random right hand sides BnB\in\mathbb{R}^{n} are defined by random solutions Xμ0𝒩(𝐱0,𝚺0)X_{*}\sim\mu_{0}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{0},\mathbf{\Sigma}_{0}).

Assume that a probabilistic linear solver under the prior μ0\mu_{0} and applied to a system 𝐀X=B\mathbf{A}X_{*}=B computes posteriors μm𝒩(𝐱m,𝚺m)\mu_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{\Sigma}_{m}), 1mn1\leq m\leq n. Let rank(𝚺m)=pm\operatorname{rank}(\mathbf{\Sigma}_{m})=p_{m}, and let 𝚺m\mathbf{\Sigma}_{m} have an orthogonal eigenvector matrix 𝐔=[𝐔m𝐔m]n×n\mathbf{U}=\begin{bmatrix}\mathbf{U}_{m}&\mathbf{U}_{m}^{\perp}\end{bmatrix}\in\mathbb{R}^{n\times n} where 𝐔mn×pm\mathbf{U}_{m}\in\mathbb{R}^{n\times p_{m}} and 𝐔mn×(npm)\mathbf{U}_{m}^{\perp}\in\mathbb{R}^{n\times(n-p_{m})} satisfy

range(𝐔m)=range(𝚺m),range(𝐔m)=ker(𝚺m).\operatorname{range}(\mathbf{U}_{m})=\operatorname{range}(\mathbf{\Sigma}_{m}),\qquad\quad\operatorname{range}(\mathbf{U}_{m}^{\perp})=\ker(\mathbf{\Sigma}_{m}).

The probabilistic solver is calibrated if all posterior covariances 𝚺m\mathbf{\Sigma}_{m} are independent of BB and satisfy

(𝐔mT𝚺m𝐔m)1/2𝐔mT(X𝐱m)𝒩(𝟎,𝐈pm),(𝐔m)T(X𝐱m)=𝟎,1mn.\displaystyle\begin{split}(\mathbf{U}_{m}^{T}\mathbf{\Sigma}_{m}\mathbf{U}_{m})^{-1/2}\mathbf{U}_{m}^{T}(X_{*}-\mathbf{x}_{m})&\sim\operatorname{\mathcal{N}}(\mathbf{0},\mathbf{I}_{p_{m}}),\\ (\mathbf{U}_{m}^{\perp})^{T}(X_{*}-\mathbf{x}_{m})&=\mathbf{0},\qquad 1\leq m\leq n.\end{split} (21)

Alternatively, one can think of a probabilistic linear solver as calibrated if and only if the solutions XX_{*} are distributed according to the posteriors.

Lemma 4.2.

Under the conditions of Definition 4.1, a probabilistic linear solver is calibrated, if and only if

X𝒩(𝐱m,𝚺m),1mn.X_{*}\sim\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{\Sigma}_{m}),\qquad 1\leq m\leq n.

Proof: Let 𝚺m=𝐔𝐃𝐔T\mathbf{\Sigma}_{m}=\mathbf{U}\mathbf{D}\mathbf{U}^{T} be an eigendecomposition where the eigenvalue matrix 𝐃=diag(𝐃m𝟎)\mathbf{D}=\operatorname{diag}\begin{pmatrix}\mathbf{D}_{m}&\mathbf{0}\end{pmatrix} is commensurately partitioned with 𝐔\mathbf{U} in Definition 4.1. Multiply the first equation of (21) on the left by 𝐃m=𝐔mT𝚺m𝐔m\mathbf{D}_{m}=\mathbf{U}_{m}^{T}\mathbf{\Sigma}_{m}\mathbf{U}_{m},

𝐔mT(X𝐱m)𝒩(𝟎,𝐃m),\mathbf{U}_{m}^{T}(X_{*}-\mathbf{x}_{m})\sim\operatorname{\mathcal{N}}(\mathbf{0},\mathbf{D}_{m}),

combine the result with the second equation in (21),

𝐔T(X𝐱m)𝒩(𝟎,𝐃).\mathbf{U}^{T}(X_{*}-\mathbf{x}_{m})\sim\operatorname{\mathcal{N}}\left(\mathbf{0},\mathbf{D}\right).

and multiply by 𝐔\mathbf{U} on the left,

(X𝐱m)𝒩(𝟎,𝐔𝐃𝐔T),1mn.(X_{*}-\mathbf{x}_{m})\sim\operatorname{\mathcal{N}}(\mathbf{0},\mathbf{U}\mathbf{D}\mathbf{U}^{T}),\qquad 1\leq m\leq n.

At last, substitute 𝚺m=𝐔𝐃𝐔T\mathbf{\Sigma}_{m}=\mathbf{U}\mathbf{D}\mathbf{U}^{T} and subtract 𝐱m\mathbf{x}_{m}.        \blacksquare

Since the covariance matrix 𝚺m\mathbf{\Sigma}_{m} is singular, its probability density function is zero on the subspace of n\mathbb{R}^{n} where the solver has eliminated the uncertainty about XX_{*}. From (21) follows that X=𝐱mX_{*}=\mathbf{x}_{m} in ker(𝚺m)\ker(\mathbf{\Sigma}_{m}). Hence, this subspace must be ker(𝚺m)\ker(\mathbf{\Sigma}_{m}), and any remaining uncertainty about XX_{*} lies in range(𝚺m)\operatorname{range}(\mathbf{\Sigma}_{m}).

Remark 4.3.

We discuss the difference between the random variable XX_{*} in Definition 4.1 and the random variable XX in Theorem 3.7.

In the context of calibration, the random variable Xμ0X_{*}\sim\mu_{0} represents the set of all possible solutions that are accurately modeled by the prior μ0\mu_{0}. If the solver is calibrated, then Lemma 4.2 shows that XμmX_{*}\sim\mu_{m}. Thus, solutions accurately modeled by the prior μ0\mu_{0} are also accurately modeled by all posteriors μm\mu_{m}.

By contrast, in the context of a deterministic linear system 𝐀𝐱=𝐛\mathbf{A}\mathbf{x}_{*}=\mathbf{b}, the random variable XX represents a surrogate for the particular solution 𝐱\mathbf{x}_{*} and can be viewed as an abbreviation for XX=𝐱X\mid X_{*}=\mathbf{x}_{*}. The prior μ0\mu_{0} models the uncertainty in the user’s initial knowledge of 𝐱\mathbf{x}_{*}, and the posteriors μm\mu_{m} model the uncertainty remaining after mm iterations of the solver.

The following two examples illustrate Definition 4.1.

Example 4.4.

Suppose there are three people: Alice, Bob, and Carol.

  1. 1.

    Alice samples 𝐱\mathbf{x}_{*} from the prior μ0\mu_{0} and computes the matrix vector product 𝐛=𝐀𝐱\mathbf{b}=\mathbf{A}\mathbf{x}_{*}.

  2. 2.

    Bob receives μ0\mu_{0}, 𝐛\mathbf{b}, and 𝐀\mathbf{A} from Alice. He estimates 𝐱\mathbf{x}_{*} by solving the linear system with a probabilistic solver under the prior μ0\mu_{0}, and then samples 𝐲\mathbf{y} from a posterior μm\mu_{m}.

  3. 3.

    Carol receives μm\mu_{m}, 𝐱\mathbf{x}_{*} and 𝐲\mathbf{y}, but she is not told which vector is 𝐱\mathbf{x}_{*} and which is 𝐲\mathbf{y}. Carol then attempts to determine which one of 𝐱\mathbf{x}_{*} or 𝐲\mathbf{y} is the sample from μm\mu_{m}. If Carol cannot distinguish between 𝐱\mathbf{x}_{*} and 𝐲\mathbf{y}, then the solver is calibrated.

Figure 4.1: Posterior distributions and solutions from three different probabilistic solvers: calibrated (top), optimistic (bottom left), and pessimistic (bottom right). The gray contours represent the posterior distributions, the red symbols “×\times” the solutions, and the blue symbols “+” samples from the posterior distributions.
Example 4.5.

This is the visual equivalent of Example 4.4, where Carol receives the images in Figure 4.1 of three different probabilistic solvers, but without any identification of the solutions and posterior samples.

  • Top plot. This solver is calibrated because the solutions look indistinguishable from the samples of the posterior distribution.

  • Bottom left plot. This solver is not calibrated because the solutions are unlikely to be samples from the posterior distribution.
    The solver is optimistic because the posterior distribution is concentrated in an area of n\mathbb{R}^{n} that is too small to cover the solutions.

  • Bottom right plot. The solver is not calibrated. Although the solutions could plausibly be sampled from the posterior, they are concentrated too close to the center of the distribution.
    The solver is pessimistic because the area covered by the posterior distribution is much larger than the area containing the solutions.

Remark 4.6.

The posterior means and covariances from a probabilistic solver can depend on the solution 𝐱\mathbf{x}_{*}, as is the case for BayesCG. If a solver is applied to a random linear system in Definition 4.1 and if the posterior means and covariances depend on the solution XX_{*}, then the posterior means and covariances are also random variables.

Definition 4.1 prevents the posterior covariances from being random variables by forcing them to be independent of the random right hand side BB. Although this is a realistic constraint for the stationary iterative solvers in RICO21, it does not apply to BayesCG under the Krylov prior, because Krylov posterior covariances depend non-linearly on the right-hand side. In Sections 4.2 and 4.3, we present a remedy for BayesCG in the form of test statistics that are motivated by Definition 4.1 and Lemma 4.2.

4.2 The ZZ-statistic

We assess BayesCG under the Krylov prior with an existing test statistic, the ZZ-statistic, which is a necessary condition for calibration and can be viewed as a weaker normwise version of criterion (21). We review the ZZ-statistic (Section 4.2.1), and apply it to BayesCG under the Krylov prior (Section 4.2.2).

4.2.1 Review of the ZZ-statistic

We review the ZZ-statistic (Definition 4.7), and the chi-square distribution (Definition 4.8), which links the ZZ-statistic to calibration (Theorem 4.9). Then we discuss how to generate samples of the ZZ-statistic (Algorithm 4.1), how to use them for the assessment of calibration (Remark 4.10), and then present the Kolmogorov-Smirnov statistic as a computationally inexpensive estimate for the difference between two general distributions (Definition 4.11).

The ZZ-statistic was introduced in (Cockayne:BCG, Section 6.1) as a means to assess the calibration of BayesCG, and has subsequently been applied to other probabilistic linear solvers (Bartels, Section 6.4), (Fanaskov21, Section 9).

Definition 4.7 ((Cockayne:BCG, Section 6.1)).

Let 𝐀X=B\mathbf{A}X_{*}=B be a class of linear systems where 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} is symmetric positive definite, and Xμ0𝒩(𝐱0,𝚺0)X_{*}\sim\mu_{0}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{0},\mathbf{\Sigma}_{0}). Let μm𝒩(𝐱m,𝚺m)\mu_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{\Sigma}_{m}), 1mn1\leq m\leq n, be the posterior distributions from a probabilistic solver under the prior μ0\mu_{0} applied to 𝐀X=B\mathbf{A}X_{*}=B. The ZZ-statistic is

Zm(X)(X𝐱m)T𝚺m(X𝐱m),1mn.\mathrm{Z}_{m}(X_{*})\equiv(X_{*}-\mathbf{x}_{m})^{T}\mathbf{\Sigma}_{m}^{\dagger}(X_{*}-\mathbf{x}_{m}),\qquad 1\leq m\leq n. (22)

The chi-squared distribution below furnishes the link from the ZZ-statistic to calibration.

Definition 4.8 ((Ross07, Definition 2.2)).

If X1,,Xf𝒩(0,1)X_{1},\ldots,X_{f}\sim\mathcal{N}(0,1) are independent standard normal random variables, then j=1fXj2\sum_{j=1}^{f}{X_{j}^{2}} is distributed according to the chi-squared distribution χf2\chi_{f}^{2} with ff degrees of freedom and mean ff.
In other words, if X𝒩(𝟎,𝐈f)X\sim\operatorname{\mathcal{N}}(\mathbf{0},\mathbf{I}_{f}), then XTXχf2X^{T}X\sim\chi_{f}^{2} and 𝔼[XTX]=f\operatorname{\mathbb{E}}[X^{T}X]=f.

We show that the ZZ-statistic is a necessary condition for calibration. That is: If a probabilistic solver is calibrated, then the ZZ-statistic is distributed according to a chi-squared distribution.

Theorem 4.9 ((BCG:Supp, Proposition 1)).

Let 𝐀X=B\mathbf{A}X_{*}=B be a class of linear systems where 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} is symmetric positive definite, and Xμ0𝒩(𝐱0,𝚺0)X_{*}\sim\mu_{0}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{0},\mathbf{\Sigma}_{0}). Assume that a probabilistic solver under the prior μ0\mu_{0} applied to 𝐀X=B\mathbf{A}X_{*}=B computes the posteriors μm𝒩(𝐱m,𝚺m)\mu_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{\Sigma}_{m}) with rank(𝚺m)=pm\operatorname{rank}(\mathbf{\Sigma}_{m})=p_{m}, 1mn1\leq m\leq n.

If the solver is calibrated, then

Zm(X)χpm2,1mn.\mathrm{Z}_{m}(X_{*})\sim\chi_{p_{m}}^{2},\qquad 1\leq m\leq n.

Proof: Write Zm(X)=MmTMm\mathrm{Z}_{m}(X_{*})=M_{m}^{T}M_{m}, where Mm(𝚺m)1/2(X𝐱m)M_{m}\equiv(\mathbf{\Sigma}_{m}^{\dagger})^{1/2}(X_{*}-\mathbf{x}_{m}). Lemma 4.2 implies that a calibrated solver produces posteriors with

(X𝐱m)𝒩(𝟎,𝚺m),1mn.(X_{*}-\mathbf{x}_{m})\sim\operatorname{\mathcal{N}}(\mathbf{0},\mathbf{\Sigma}_{m}),\qquad 1\leq m\leq n.

With the eigenvector matrix 𝐔mn×pm\mathbf{U}_{m}\in\mathbb{R}^{n\times p_{m}} as in Definition 4.1, Lemma A.1 in Appendix A implies

Mm𝒩(𝟎,𝐔m𝐔mT),1mn.M_{m}\sim\operatorname{\mathcal{N}}(\mathbf{0},\mathbf{U}_{m}\mathbf{U}_{m}^{T}),\qquad 1\leq m\leq n.

Since the covariance of MmM_{m} is an orthogonal projector, Lemma A.7 implies Zm(X)=(MmTMm)χpm2\mathrm{Z}_{m}(X_{*})=(M_{m}^{T}M_{m})\sim\chi_{p_{m}}^{2}.        \blacksquare

Theorem 4.9 implies that if BayesCG is calibrated, then the ZZ-statistic is distributed according to a chi-squared distribution with pm=rank(𝚺0)mp_{m}=\operatorname{rank}(\mathbf{\Sigma}_{0})-m degrees of freedom. For the Krylov prior specifically, pm=gmp_{m}=g-m.

Generating samples from the ZZ-statistic and assessing calibration.

For a user-specified probabilistic linear solver and a symmetric positive definite matrix 𝐀\mathbf{A}, Algorithm 4.1 samples NtestN_{\text{test}} solutions 𝐱\mathbf{x}_{*} from the prior distribution μ0\mu_{0}, defines the test systems 𝐀𝐱=𝐛\mathbf{A}\mathbf{x}_{*}=\mathbf{b} with 𝐛𝐀𝐱\mathbf{b}\equiv\mathbf{A}\mathbf{x}_{*}, runs mm iterations of the solver on each system, and computes Zm(𝐱)\mathrm{Z}_{m}(\mathbf{x}_{*}) in (22).

The application of the Moore-Penrose inverse in Line 6 can be implemented by computing the minimal norm solution 𝐪=𝚺m(𝐱𝐱m)\mathbf{q}=\mathbf{\Sigma}_{m}^{\dagger}(\mathbf{x}_{*}-\mathbf{x}_{m}) of the least squares problem

min𝐮n(𝐱𝐱m)𝚺m𝐮2,\min_{\mathbf{u}\in\mathbb{R}^{n}}\|(\mathbf{x}_{*}-\mathbf{x}_{m})-\mathbf{\Sigma}_{m}\mathbf{u}\|_{2}, (23)

followed by the inner product zi=(𝐱𝐱m)T𝐪z_{i}=(\mathbf{x}_{*}-\mathbf{x}_{m})^{T}\mathbf{q}.

Algorithm 4.1 Sampling from the ZZ-statistic
1:Input: spd 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n}, μ0=𝒩(𝐱0,𝚺0)\mu_{0}=\operatorname{\mathcal{N}}(\mathbf{x}_{0},\mathbf{\Sigma}_{0}), solver, mm, NtestN_{\text{test}}
2:for i=1:Ntesti=1:N_{\text{test}} do
3:     Sample 𝐱\mathbf{x}_{*} from prior distribution μ0\mu_{0} \ignorespaces\triangleright Sample solution vector
4:     𝐛=𝐀𝐱\mathbf{b}=\mathbf{A}\mathbf{x}_{*} \ignorespaces\triangleright Define test problem
5:     [𝐱m,𝚺m]=solver(𝐀,𝐛,μ0,m)[\mathbf{x}_{m},\mathbf{\Sigma}_{m}]=\texttt{solver}(\mathbf{A},\mathbf{b},\mu_{0},m) \ignorespaces\triangleright Compute posterior μm𝒩(𝐱m,𝚺m)\mu_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{\Sigma}_{m})
6:     zi=(𝐱𝐱m)T𝚺m(𝐱𝐱m)z_{i}=(\mathbf{x}_{*}-\mathbf{x}_{m})^{T}\mathbf{\Sigma}_{m}^{\dagger}(\mathbf{x}_{*}-\mathbf{x}_{m}) \ignorespaces\triangleright Compute ZZ-statistic sample
7:end for
8:Output: ZZ-statistic samples ziz_{i}, 1iNtest1\leq i\leq N_{test}.
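The following is a Python sketch of Algorithm 4.1. The calling convention solver(A, b, (x0, Sigma0), m), returning the posterior mean and covariance, and the eigendecomposition used to sample from a possibly singular prior covariance, are assumptions of ours; Line 6 of the algorithm is realized through the least squares problem (23).

```python
import numpy as np

def z_statistic_samples(A, x0, Sigma0, solver, m, n_test, rng=None):
    """Sample the Z-statistic (22) as in Algorithm 4.1."""
    rng = np.random.default_rng() if rng is None else rng
    evals, U = np.linalg.eigh(Sigma0)                 # eigendecomposition of the PSD prior covariance
    evals = np.clip(evals, 0.0, None)                 # guard against round-off
    z = np.empty(n_test)
    for i in range(n_test):
        x_star = x0 + U @ (np.sqrt(evals) * rng.standard_normal(len(x0)))  # sample from mu_0
        b = A @ x_star                                # define test problem
        xm, Sigma_m = solver(A, b, (x0, Sigma0), m)   # posterior N(x_m, Sigma_m)
        e = x_star - xm
        q = np.linalg.lstsq(Sigma_m, e, rcond=None)[0]  # minimal norm solution of (23)
        z[i] = e @ q                                  # Z-statistic sample (22)
    return z
```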
Remark 4.10.

We assess calibration of the solver by comparing the ZZ-statistic samples ziz_{i} from Algorithm 4.1 to the chi-squared distribution χpm2\chi_{p_{m}}^{2} with pmrank(𝚺0)mp_{m}\equiv\operatorname{rank}(\mathbf{\Sigma}_{0})-m degrees of freedom, based on the following criteria from (Cockayne:BCG, Section 6.1).

Calibrated:

If ziχpm2z_{i}\sim\chi_{p_{m}}^{2}, then 𝐱μm\mathbf{x}_{*}\sim\mu_{m} and the solutions 𝐱\mathbf{x}_{*} are distributed according to the posteriors μm\mu_{m}.

Pessimistic:

If the ziz_{i} are concentrated around smaller values than χpm2\chi_{p_{m}}^{2}, then the solutions 𝐱\mathbf{x}_{*} occupy a smaller area of n\mathbb{R}^{n} than predicted by μm\mu_{m}.

Optimistic:

If the ziz_{i} are concentrated around larger values than χpm2\chi_{p_{m}}^{2}, then the solutions cover a larger area of n\mathbb{R}^{n} than predicted by μm\mu_{m}.

In (Cockayne:BCG, Section 6.1) and (Bartels, Section 6.4), the ZZ-statistic samples and the predicted chi-squared distribution are compared visually. In Section 5, we make an additional quantitative comparison with the Kolmogorov-Smirnov test to estimate the difference between two probability distributions.

Definition 4.11 ((Kaltenbach12, Section 3.4.1)).

Given two distributions μ\mu and ν\nu on n\mathbb{R}^{n} with cumulative distribution functions FμF_{\mu} and FνF_{\nu}, the Kolmogorov-Smirnov statistic is

KS(\mu,\nu)=\sup_{x\in\mathbb{R}}\,\lvert F_{\mu}(x)-F_{\nu}(x)\rvert,\qquad 0\leq KS(\mu,\nu)\leq 1.

If KS(μ,ν)=0, then μ and ν have the same cumulative distribution functions, Fμ=Fν. If KS(μ,ν)=1, then μ and ν do not overlap. In general, the lower KS(μ,ν), the closer μ and ν are to each other.

In contrast to the Wasserstein distance in Definition 3.1, the Kolmogorov-Smirnov statistic can be easier to estimate, especially if the distributions are not Gaussian, but it is less informative for well-separated distributions: if μ and ν do not overlap, then KS(μ,ν)=1 regardless of how far apart they are, while the Wasserstein metric still gives information about the distance between μ and ν.
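
As an illustration, a short Python sketch of the comparison used later in Section 5, assuming SciPy is available: scipy.stats.kstest computes the statistic of Definition 4.11 between an empirical sample and a reference cumulative distribution function.

```python
import numpy as np
from scipy import stats

def ks_against_chi2(z_samples, p_m):
    """Kolmogorov-Smirnov statistic between Z-statistic samples and chi^2 with p_m dof."""
    return stats.kstest(z_samples, stats.chi2(df=p_m).cdf).statistic

# Samples that actually follow the reference chi-squared distribution
# produce a statistic close to zero.
rng = np.random.default_rng(0)
z = stats.chi2(df=50).rvs(size=500, random_state=rng)
print(ks_against_chi2(z, 50))
```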

4.2.2 ZZ-Statistic for BayesCG under the Krylov prior

We apply the ZZ-statistic to BayesCG under the Krylov prior. We start with an expression for the Moore-Penrose inverse of the Krylov posterior covariances (Lemma 4.12). Then we show that the ZZ-statistic for the full Krylov posteriors has the same mean as the corresponding chi-squared distribution (Theorem 4.13), but its distribution is different. Therefore the ZZ-statistic is inconclusive about the calibration of BayesCG under the Krylov prior (Remark 4.14).

Lemma 4.12.

In Definition 2.8, abbreviate 𝐕^𝐕m+1:m+d\widehat{\mathbf{V}}\equiv\mathbf{V}_{m+1:m+d} and 𝚽^𝚽m+1:m+d\widehat{\mathbf{\Phi}}\equiv\mathbf{\Phi}_{m+1:m+d}. The rank-dd approximate Krylov posterior covariances have the Moore-Penrose inverse

𝚪^m=(𝐕^𝚽^𝐕^T)=𝐕^(𝐕^T𝐕^)1𝚽^1(𝐕^T𝐕^)1𝐕^T,1mgd.\widehat{\mathbf{\Gamma}}_{m}^{\dagger}=\left(\widehat{\mathbf{V}}\widehat{\mathbf{\Phi}}\widehat{\mathbf{V}}^{T}\right)^{\dagger}=\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{T}\widehat{\mathbf{V}})^{-1}\widehat{\mathbf{\Phi}}^{-1}(\widehat{\mathbf{V}}^{T}\widehat{\mathbf{V}})^{-1}\widehat{\mathbf{V}}^{T},\qquad 1\leq m\leq g-d.

Proof: We exploit the fact that all factors of 𝚪^m\widehat{\mathbf{\Gamma}}_{m} have full column rank.

The factors 𝐕^\widehat{\mathbf{V}} and 𝐕^T\widehat{\mathbf{V}}^{T} have full column and row rank, respectively, because 𝐕\mathbf{V} has 𝐀\mathbf{A}-orthonormal columns. Additionally, the diagonal matrix 𝚽^\widehat{\mathbf{\Phi}} is nonsingular. Then Lemma A.5 in Appendix A implies that the Moore-Penrose inverses can be expressed in terms of the matrices proper,

𝐕^=(𝐕^T𝐕^)1𝐕^T,(𝐕^T)=𝐕^(𝐕^T𝐕^)1,\widehat{\mathbf{V}}^{\dagger}=(\widehat{\mathbf{V}}^{T}\widehat{\mathbf{V}})^{-1}\widehat{\mathbf{V}}^{T},\qquad(\widehat{\mathbf{V}}^{T})^{\dagger}=\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{T}\widehat{\mathbf{V}})^{-1}, (24)

and

(𝚽^𝐕^T)=(𝐕^T)𝚽^1=𝐕^(𝐕^T𝐕^)1𝚽^1.(\widehat{\mathbf{\Phi}}\widehat{\mathbf{V}}^{T})^{\dagger}=(\widehat{\mathbf{V}}^{T})^{\dagger}\widehat{\mathbf{\Phi}}^{-1}=\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{T}\widehat{\mathbf{V}})^{-1}\widehat{\mathbf{\Phi}}^{-1}. (25)

Since 𝚽^𝐕^T\widehat{\mathbf{\Phi}}\widehat{\mathbf{V}}^{T} also has full row rank, apply Lemma A.5 to 𝚪^m\widehat{\mathbf{\Gamma}}_{m},

𝚪^m=(𝚽^𝐕^T)𝐕^,\widehat{\mathbf{\Gamma}}_{m}^{\dagger}=(\widehat{\mathbf{\Phi}}\widehat{\mathbf{V}}^{T})^{\dagger}\widehat{\mathbf{V}}^{\dagger},

and substitute (24) and (25) into the above expression.        \blacksquare
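
As a sanity check of Lemma 4.12, the following sketch compares the explicit expression with numpy.linalg.pinv for a small randomly generated example (the SPD matrix 𝐀, the 𝐀-orthonormalized factor, and the positive diagonal are hypothetical test data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)                      # SPD matrix defining the A-inner product

Vhat = rng.standard_normal((n, d))               # A-orthonormalize d random vectors
for j in range(d):
    for k in range(j):
        Vhat[:, j] -= (Vhat[:, k] @ A @ Vhat[:, j]) * Vhat[:, k]
    Vhat[:, j] /= np.sqrt(Vhat[:, j] @ A @ Vhat[:, j])

Phihat = np.diag(rng.uniform(0.5, 2.0, size=d))  # nonsingular diagonal
Gamma = Vhat @ Phihat @ Vhat.T

G = Vhat.T @ Vhat
explicit = Vhat @ np.linalg.inv(G) @ np.linalg.inv(Phihat) @ np.linalg.inv(G) @ Vhat.T
print(np.linalg.norm(np.linalg.pinv(Gamma) - explicit))   # small, at the level of roundoff
```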

We apply the ZZ-statistic to the full Krylov posteriors, and show that ZZ-statistic samples reproduce the dimension of the unexplored Krylov space.

Theorem 4.13.

Under the assumptions of Theorem 2.7, let BayesCG under the Krylov prior η0𝒩(𝐱0,𝚪0)\eta_{0}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{0},\mathbf{\Gamma}_{0}) produce full Krylov posteriors ηm𝒩(𝐱m,𝚪m)\eta_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{\Gamma}_{m}). Then the ZZ-statistic is equal to

Zm(𝐱)=(𝐱𝐱m)T𝚪m(𝐱𝐱m)=gm,1mg.\mathrm{Z}_{m}(\mathbf{x}_{*})=(\mathbf{x}_{*}-\mathbf{x}_{m})^{T}\mathbf{\Gamma}_{m}^{\dagger}(\mathbf{x}_{*}-\mathbf{x}_{m})=g-m,\qquad 1\leq m\leq g.

Proof: Express the error 𝐱∗−𝐱m in terms of 𝐕^≡𝐕m+1:g, that is, Lemma 4.12 with d=g−m, by inserting

𝐱=𝐱0+𝐕1:g𝐕1:gT𝐫0,𝐱m=𝐱0+𝐕1:m𝐕1:mT𝐫0,1mg,\displaystyle\mathbf{x}_{*}=\mathbf{x}_{0}+\mathbf{V}_{1:g}\mathbf{V}_{1:g}^{T}\mathbf{r}_{0},\qquad\mathbf{x}_{m}=\mathbf{x}_{0}+\mathbf{V}_{1:m}\mathbf{V}_{1:m}^{T}\mathbf{r}_{0},\quad 1\leq m\leq g, (26)

from Theorem 2.7 into

𝐱𝐱m=𝐕m+1:g𝐕m+1:gT𝐫0=𝐕^𝐕^T𝐫0.\displaystyle\mathbf{x}_{*}-\mathbf{x}_{m}=\mathbf{V}_{m+1:g}\mathbf{V}_{m+1:g}^{T}\mathbf{r}_{0}=\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{T}\,\mathbf{r}_{0}.

This expression is identical to (Liesen, Equation (5.6.5)), which relates the CG error to the search directions and step sizes of the remaining iterations.

With Lemma 4.12, this implies for the ZZ-statistic in Theorem 4.9

\begin{aligned}
\mathrm{Z}_{m}(\mathbf{x}_{*}) &=(\mathbf{x}_{*}-\mathbf{x}_{m})^{T}\mathbf{\Gamma}_{m}^{\dagger}(\mathbf{x}_{*}-\mathbf{x}_{m})\\
&=\underbrace{\mathbf{r}_{0}^{T}\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{T}}_{(\mathbf{x}_{*}-\mathbf{x}_{m})^{T}}\;\underbrace{\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{T}\widehat{\mathbf{V}})^{-1}\widehat{\mathbf{\Phi}}^{-1}(\widehat{\mathbf{V}}^{T}\widehat{\mathbf{V}})^{-1}\widehat{\mathbf{V}}^{T}}_{\mathbf{\Gamma}_{m}^{\dagger}}\;\underbrace{\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{T}\mathbf{r}_{0}}_{(\mathbf{x}_{*}-\mathbf{x}_{m})}\\
&=\mathbf{r}_{0}^{T}\widehat{\mathbf{V}}\>\underbrace{(\widehat{\mathbf{V}}^{T}\widehat{\mathbf{V}})(\widehat{\mathbf{V}}^{T}\widehat{\mathbf{V}})^{-1}}_{\mathbf{I}}\>\widehat{\mathbf{\Phi}}^{-1}\>\underbrace{(\widehat{\mathbf{V}}^{T}\widehat{\mathbf{V}})^{-1}(\widehat{\mathbf{V}}^{T}\widehat{\mathbf{V}})}_{\mathbf{I}}\>\widehat{\mathbf{V}}^{T}\mathbf{r}_{0}\\
&=\mathbf{r}_{0}^{T}\widehat{\mathbf{V}}\widehat{\mathbf{\Phi}}^{-1}\widehat{\mathbf{V}}^{T}\mathbf{r}_{0}.
\end{aligned}

In other words,

𝐱𝐱m𝚪^m2\displaystyle\|\mathbf{x}_{*}-\mathbf{x}_{m}\|_{\widehat{\mathbf{\Gamma}}_{m}^{\dagger}}^{2} =(𝐕m+1:gT𝐫0)T𝚽m+1:m+d1(𝐕m+1:gT𝐫0)\displaystyle=\left(\mathbf{V}_{m+1:g}^{T}\mathbf{r}_{0}\right)^{T}\mathbf{\Phi}_{m+1:m+d}^{-1}\left(\mathbf{V}_{m+1:g}^{T}\mathbf{r}_{0}\right)
=j=m+1gϕj1(𝐯jT𝐫0)2=gm,0m<g,\displaystyle=\sum_{j=m+1}^{g}\phi_{j}^{-1}(\mathbf{v}_{j}^{T}\mathbf{r}_{0})^{2}=g-m,\qquad 0\leq m<g,

where the last equality follows from ϕj=(𝐯jT𝐫0)2 in Definition 2.4.        \blacksquare
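
The identity Zm(𝐱∗)=g−m can be checked numerically. The sketch below builds the 𝐀-orthonormal vectors 𝐯j and weights ϕj from CG quantities, as in Definition 2.4 and Algorithm B.5, for a small synthetic SPD system chosen purely for illustration; the computed value agrees with g−m up to rounding.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 12, 4
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(0, 3, n)) @ U.T        # SPD with n distinct eigenvalues
x_star = rng.standard_normal(n)
b, x0 = A @ x_star, np.zeros(n)

r0 = b - A @ x0
r, s = r0.copy(), r0.copy()
V, phi = [], []
for _ in range(n):                                  # CG run up to the grade g = n
    As = A @ s
    eta = s @ As
    v = s / np.sqrt(eta)                            # A-orthonormalized search direction
    V.append(v)
    phi.append((v @ r0) ** 2)                       # phi_j = (v_j^T r_0)^2, Definition 2.4
    gamma = (r @ r) / eta
    r_new = r - gamma * As
    s = r_new + ((r_new @ r_new) / (r @ r)) * s
    r = r_new

V, phi, g = np.column_stack(V), np.array(phi), n
x_m = x0 + V[:, :m] @ (V[:, :m].T @ r0)             # posterior mean, eq. (26)
Gamma_m = V[:, m:] @ np.diag(phi[m:]) @ V[:, m:].T  # full Krylov posterior covariance
err = x_star - x_m
print(err @ np.linalg.pinv(Gamma_m) @ err, g - m)   # both approximately g - m = 8
```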

Remark 4.14.

The ZZ-statistic is inconclusive about the calibration of BayesCG under the Krylov prior.

Theorem 4.13 shows that the Z-statistic is distributed according to a Dirac distribution at g−m. Thus, the Z-statistic has the same mean as the chi-squared distribution \chi_{g-m}^{2}, which suggests that BayesCG under the Krylov prior is neither optimistic nor pessimistic. However, although the means are the same, the distributions are not. Hence, Theorem 4.13 does not imply that BayesCG under the Krylov prior is calibrated.

4.3 The SS-statistic

We introduce a new test statistic for assessing the calibration of probabilistic solvers, the SS-statistic. After discussing the relation between calibration and error estimation (Section 4.3.1), we define the SS-statistic (Section 4.3.2), compare the SS-statistic to the ZZ-statistic (Section 4.3.3), and then apply the SS-statistic to BayesCG under the Krylov prior (Section 4.3.4).

4.3.1 Calibration and Error Estimation

We establish a relation between the error of the posterior means (approximations to the solution) and the trace of posterior covariances (Theorem 4.15).

Theorem 4.15.

Let 𝐀X=B\mathbf{A}X_{*}=B be a class of linear systems where 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} is symmetric positive definite and Xμ0𝒩(𝐱0,𝚺0)X_{*}\sim\mu_{0}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{0},\mathbf{\Sigma}_{0}). Let μm𝒩(𝐱m,𝚺m)\mu_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{\Sigma}_{m}), 1mn1\leq m\leq n be the posterior distributions from a probabilistic solver under the prior μ0\mu_{0} applied to 𝐀X=B\mathbf{A}X_{*}=B.

If the solver is calibrated, then

𝔼[X𝐱m𝐀2]=trace(𝐀𝚺m),1mn.\operatorname{\mathbb{E}}[\|X_{*}-\mathbf{x}_{m}\|_{\mathbf{A}}^{2}]=\operatorname{trace}(\mathbf{A}\mathbf{\Sigma}_{m}),\qquad 1\leq m\leq n. (27)

Proof: For a calibrated solver Lemma 4.2 implies that XμmX_{*}\sim\mu_{m}. Then apply Lemma A.8 in Appendix A to the error X𝐱m𝐀2\|X_{*}-\mathbf{x}_{m}\|_{\mathbf{A}}^{2}.        \blacksquare

For a calibrated solver, Theorem 4.15 implies that the equality 𝐱𝐱m𝐀2=trace(𝐀𝚺m)\|\mathbf{x}_{*}-\mathbf{x}_{m}\|_{\mathbf{A}}^{2}=\operatorname{trace}(\mathbf{A}\mathbf{\Sigma}_{m}) holds on average. This means the trace can overestimate the error for some solutions, while for others, it can underestimate the error.
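
A small Monte Carlo illustration of (27), in which a randomly generated low-rank Gaussian posterior stands in for the posterior of a calibrated solver (an assumption made for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 6
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)                      # SPD test matrix
F = rng.standard_normal((n, p))
Sigma_m = F @ F.T                                # rank-p posterior covariance
x_m = rng.standard_normal(n)

# Draws X ~ N(x_m, Sigma_m) via X = x_m + F W with W ~ N(0, I_p), Lemma A.1
W = rng.standard_normal((p, 100_000))
E = F @ W                                        # errors X - x_m
mc_mean = np.mean(np.einsum('ij,ij->j', E, A @ E))
print(mc_mean, np.trace(A @ Sigma_m))            # the two values agree closely
```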

We explain how Theorem 4.15 relates the errors of a calibrated solver to the area in which its posteriors are concentrated.

Remark 4.16.

The trace of a posterior covariance matrix quantifies the spread of its probability distribution—because the trace is the sum of the eigenvalues, which in the case of a covariance are the variances of the principal components (JWHT21, Section 12.2.1).

In analogy to viewing the 𝐀\mathbf{A}-norm as the 2-norm weighted by 𝐀\mathbf{A}, we can view trace(𝐀𝚺m)\operatorname{trace}(\mathbf{A}\mathbf{\Sigma}_{m}) as the trace of 𝚺m\mathbf{\Sigma}_{m} weighted by 𝐀\mathbf{A}. Theorem 4.15 shows that the 𝐀\mathbf{A}-norm errors of a calibrated solver are equal to the weighted sum of the principal component variances from the posterior. Thus, the posterior means 𝐱m\mathbf{x}_{m} and the areas in which the posteriors are concentrated both converge to the solution at the same speed, provided the solver is calibrated.

4.3.2 Definition of the SS-statistic

We introduce the SS-statistic (Definition 4.17), present an algorithm for generating samples from the SS-statistic (Algorithm 4.2), and discuss their use for assessing calibration of solvers (Remark 4.18).

The SS-statistic represents a necessary condition for calibration, as established in Theorem 4.15.

Definition 4.17.

Let 𝐀X=B\mathbf{A}X_{*}=B be a class of linear systems where 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} is symmetric positive definite, and Xμ0𝒩(𝐱0,𝚺0)X_{*}\sim\mu_{0}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{0},\mathbf{\Sigma}_{0}). Let μm𝒩(𝐱m,𝚺m)\mu_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{\Sigma}_{m}), 1mn1\leq m\leq n, be the posterior distributions from a probabilistic solver under the prior μ0\mu_{0} applied to 𝐀X=B\mathbf{A}X_{*}=B. The SS-statistic is

Sm(X)X𝐱m𝐀2.\mathrm{S}_{m}(X_{*})\equiv\|X_{*}-\mathbf{x}_{m}\|_{\mathbf{A}}^{2}. (28)

If the solver is calibrated then Theorem 4.15 implies

\operatorname{\mathbb{E}}[\mathrm{S}_{m}(X_{*})]=\operatorname{trace}(\mathbf{A}\mathbf{\Sigma}_{m}). (29)
Generating samples from the SS-statistic and assessing calibration.

For a user-specified probabilistic linear solver and a symmetric positive definite matrix 𝐀, Algorithm 4.2 samples Ntest solutions 𝐱∗ from the prior distribution μ0, defines the linear systems 𝐛=𝐀𝐱∗, runs m iterations of the solver on each system, and computes Sm(𝐱∗) in (28) and trace(𝐀𝚺m) in (29); a code sketch follows Algorithm 4.2 below.

As with the Z-statistic, Algorithm 4.2 requires a separate reference distribution μref for sampling the solutions 𝐱∗ when the solver is BayesCG under the Krylov prior (see Section 5.1.3).

Algorithm 4.2 Sampling from the SS-statistic
1:Input: spd 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n}, μ0=𝒩(𝐱0,𝚺0)\mu_{0}=\operatorname{\mathcal{N}}(\mathbf{x}_{0},\mathbf{\Sigma}_{0}), solver, mm, NtestN_{\text{test}}
2:for i=1:Ntesti=1:N_{\text{test}} do
3:     Sample 𝐱\mathbf{x}_{*} from prior distribution μ0\mu_{0} \ignorespaces\triangleright Sample solution vector
4:     𝐛=𝐀𝐱\mathbf{b}=\mathbf{A}\mathbf{x}_{*} \ignorespaces\triangleright Define test problem
5:     [𝐱m,𝚺m]=solver(𝐀,𝐛,μ0,m)[\mathbf{x}_{m},\mathbf{\Sigma}_{m}]=\texttt{solver}(\mathbf{A},\mathbf{b},\mu_{0},m) \ignorespaces\triangleright Compute posterior μm𝒩(𝐱m,𝚺m)\mu_{m}\equiv\operatorname{\mathcal{N}}(\mathbf{x}_{m},\mathbf{\Sigma}_{m})
6:     si=𝐱𝐱m𝐀2s_{i}=\|\mathbf{x}_{*}-\mathbf{x}_{m}\|_{\mathbf{A}}^{2} \ignorespaces\triangleright Compute SS-statistic for test problem
7:     ti=trace(𝐀𝚺m)t_{i}=\operatorname{trace}(\mathbf{A}\mathbf{\Sigma}_{m}) \ignorespaces\triangleright Compute trace for test problem
8:end for
9:h=(1/Ntest)i=1Ntestsih=(1/N_{test})\sum_{i=1}^{N_{\text{test}}}s_{i} \ignorespaces\triangleright Compute empirical mean of SS-statistic samples
10:Output: SS-statistic samples sis_{i} and traces tit_{i}, 1iNtest1\leq i\leq N_{\text{test}}; SS-statistic mean hh
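
A minimal sketch of Lines 6 and 7 of Algorithm 4.2, assuming the solver returns the posterior covariance in the factored form 𝚺m=𝐅m𝐅mT of Algorithm B.3, so that the trace is accumulated column by column and 𝚺m is never formed explicitly:

```python
import numpy as np

def s_statistic_and_trace(A, x_star, x_m, F_m):
    """Lines 6-7 of Algorithm 4.2 for a factored posterior covariance Sigma_m = F_m F_m^T."""
    e = x_star - x_m
    s_i = e @ A @ e                        # S-statistic ||x_star - x_m||_A^2, eq. (28)
    t_i = np.sum(F_m * (A @ F_m))          # trace(A F_m F_m^T) = trace(F_m^T A F_m)
    return s_i, t_i
```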
Remark 4.18.

We assess calibration of the solver by comparing the SS-statistic samples sis_{i} from Algorithm 4.2 to the traces tit_{i}, 1iNtest1\leq i\leq N_{test}. The following criteria are based on Theorem 4.15 and Remark 4.16.

Calibrated:

If the solver is calibrated, the traces tit_{i} should all be equal to the empirical mean hh of the SS-statistic samples sis_{i}.

Pessimistic:

If the sis_{i} are concentrated around smaller values than the tit_{i}, then the solutions 𝐱\mathbf{x}_{*} occupy a smaller area of n\mathbb{R}^{n} than predicted by the posteriors μm\mu_{m}.

Optimistic:

If the sis_{i} are concentrated around larger values than the tit_{i}, then the solutions 𝐱\mathbf{x}_{*} occupy a larger area of n\mathbb{R}^{n} than predicted by μm\mu_{m}.

We can also compare the empirical means of the sis_{i} and tit_{i}, because a calibrated solver should produce sis_{i} and tit_{i} with the same mean. Note that a comparison via the Kolmogorov-Smirnov statistic is not appropriate because the empirical distributions of sis_{i} and tit_{i} are generally different.

4.3.3 Comparison of the ZZ- and SS-statistics

Both the Z- and the S-statistic represent necessary conditions for calibration (Theorem 4.9 and (29)), and both measure a norm of the error X∗−𝐱m: the Z-statistic in the 𝚺m†-pseudo norm (Definition 4.7), and the S-statistic in the 𝐀-norm (Definition 4.17). Deeper down, though, the Z-statistic projects the error onto a single dimension (Theorem 4.9), while the S-statistic relates the error to the area in which the posterior distribution is concentrated.

Due to its focus on the area of the posteriors, the SS-statistic can give a false positive for calibration. This occurs when the solution is not in the area of posterior concentration but the size of the posteriors is consistent with the errors. The ZZ-statistic is less likely to encounter this problem, as illustrated in Figure 4.2.

The Z-statistic is better at assessing calibration, while the S-statistic produces accurate error estimates, which reduce to traditional 𝐀-norm error estimates. The S-statistic is also faster to compute because it does not require the solution of a least squares problem.

Figure 4.2: Assessment of calibration from ZZ-statistic and SS-statistic. The contour plots represent the posterior distributions, and the symbol ‘×\times’ represents the solution.
Top left: Both statistics decide that the solver is not calibrated. Top right: The SS-statistic decides that the solver is calibrated, while the ZZ-statistic does not. Bottom: Both statistics decide that the solver is calibrated.

4.3.4 SS-statistic for BayesCG under the Krylov prior

We show that BayesCG under the Krylov prior is not calibrated, but that it performs similarly to a calibrated solver under full posteriors and is optimistic under approximate posteriors.

Calibration of BayesCG under full Krylov posteriors.

Theorem 2.7 implies that the SS-statistic for any solution 𝐱\mathbf{x}_{*} is equal to

Sm(𝐱)=𝐱𝐱m𝐀2=trace(𝐀𝚪m),1mg.\mathrm{S}_{m}(\mathbf{x}_{*})=\|\mathbf{x}_{*}-\mathbf{x}_{m}\|_{\mathbf{A}}^{2}=\operatorname{trace}(\mathbf{A}\mathbf{\Gamma}_{m}),\qquad 1\leq m\leq g.

Thus, the SS-statistic indicates that the size of Krylov posteriors is consistent with the errors, which is a desirable property of calibrated solvers. However, BayesCG under the Krylov prior is not a calibrated solver because the traces of posterior covariances from calibrated solvers are distributed around the average error instead of always being equal to the error.

Calibration of BayesCG under approximate posteriors.

From (9) it follows that trace(𝐀𝚪^m) is concentrated around smaller values than the S-statistic, and that the amount by which the trace underestimates the error equals the Wasserstein distance between full and approximate Krylov posteriors in Theorem 3.5. This underestimate points to the optimism of BayesCG under approximate Krylov posteriors. The optimism is expected because approximate posteriors model the uncertainty about 𝐱∗ in a lower dimensional space than full posteriors.

5 Numerical experiments

We present numerical assessments of BayesCG calibration via the ZZ- and SS-statistics.

After describing the setup of the numerical experiments (Section 5.1), we assess the calibration of three implementations of BayesCG: (i) BayesCG with random search directions (Section 5.2)—a solver known to be calibrated—so as to establish a baseline for comparisons with other versions of BayesCG; (ii) BayesCG under the inverse prior (Section 5.3); and (iii) BayesCG under the Krylov prior (Section 5.4). We choose the inverse prior and the Krylov priors because under each of these priors, the posterior mean from BayesCG coincides with the solution from CG.

Conclusions from all the experiments.

Both the Z- and the S-statistic indicate that BayesCG with random search directions is indeed a calibrated solver, and that BayesCG under the inverse prior is pessimistic.

The S-statistic indicates that BayesCG under full Krylov posteriors mimics a calibrated solver, and that BayesCG under rank-50 approximate posteriors comes close to doing so but is slightly optimistic.

However, among all versions, BayesCG under approximate Krylov posteriors is the only one that is computationally practical and that is competitive with CG.

5.1 Experimental Setup

We describe the matrix 𝐀\mathbf{A} in the linear system (Section 5.1.1); the setup of the ZZ- and SS-statistic experiments (Section 5.1.2); and the three BayesCG implementations (Section 5.1.3).

5.1.1 The matrix 𝐀\mathbf{A} in the linear system (1)

The symmetric positive definite matrix 𝐀∈ℝn×n of dimension n=1806 is a preconditioned version of the matrix BCSSTK14 from the Harwell-Boeing collection in bcsstk14. Specifically,

𝐀=𝐃1/2𝐁𝐃1/2,where𝐃diag(𝐁11𝐁nn)\mathbf{A}=\mathbf{D}^{-1/2}\mathbf{B}\mathbf{D}^{-1/2},\qquad\text{where}\quad\mathbf{D}\equiv\operatorname{diag}\begin{pmatrix}\mathbf{B}_{11}&\cdots&\mathbf{B}_{nn}\end{pmatrix}

and 𝐁\mathbf{B} is BCSSTK14. Calibration is assessed at iterations m=10,100,300m=10,100,300.
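
A sketch of this diagonal scaling, assuming BCSSTK14 is available locally as a Matrix Market file named bcsstk14.mtx (the file name is an assumption for illustration):

```python
import numpy as np
import scipy.io
import scipy.sparse as sp

B = scipy.io.mmread('bcsstk14.mtx').tocsr()      # BCSSTK14, n = 1806
d = B.diagonal()
D_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
A = D_inv_sqrt @ B @ D_inv_sqrt                  # A = D^{-1/2} B D^{-1/2}, unit diagonal
```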

5.1.2 ZZ-statistic and SS-statistic

The ZZ-statistic and SS-statistic experiments are implemented as described in Algorithms 4.1 and 4.2, respectively. The calibration criteria for the ZZ-statistic are given in Remark 4.10, and for the SS-statistic in Remark 4.18.

We sample from Gaussian distributions by exploiting their stability. According to Lemma A.1 in Appendix A, if Z𝒩(𝟎,𝐈)Z\sim\operatorname{\mathcal{N}}(\mathbf{0},\mathbf{I}), and 𝐅𝐅T=𝚺\mathbf{F}\mathbf{F}^{T}=\mathbf{\Sigma} is a factorization of the covariance, then

𝐅Z+𝐳=X𝒩(𝐱,𝚺).\mathbf{F}Z+\mathbf{z}=X\sim\operatorname{\mathcal{N}}(\mathbf{x},\mathbf{\Sigma}).

Samples Z𝒩(𝟎,𝐈)Z\sim\operatorname{\mathcal{N}}(\mathbf{0},\mathbf{I}) are generated with randn(n,1n,1) in Matlab, and with numpy.random.randn(n,1n,1) in NumPy.
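
For the reference distribution 𝒩(𝟎,𝐀−1) used in Section 5.1.3, one convenient factorization is the Cholesky factor of 𝐀: with 𝐀=𝐋𝐋T and Z∼𝒩(𝟎,𝐈), the vector 𝐋−TZ has covariance 𝐀−1 by Lemma A.1. A minimal sketch, assuming 𝐀 is a dense SPD array:

```python
import numpy as np

def sample_from_inverse_prior(A, rng):
    """One draw X ~ N(0, A^{-1}) via the stability of Gaussians (Lemma A.1)."""
    L = np.linalg.cholesky(A)               # A = L L^T
    z = rng.standard_normal(A.shape[0])     # Z ~ N(0, I)
    return np.linalg.solve(L.T, z)          # X = L^{-T} Z has covariance A^{-1}
```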

ZZ-statistic experiments.

We quantify the distance between the ZZ-statistic samples and the chi-squared distribution by applying the Kolmogorov-Smirnov statistic (Definition 4.11) to the empirical cumulative distribution function of the ZZ-statistic samples and the analytical cumulative distribution function of the chi-squared distribution.

The number of degrees of freedom of the chi-squared distribution is chosen as the median numerical rank of the posterior covariances. Note that the numerical rank of 𝚺m can differ from

rank(𝚺m)=rank(𝚺0)m,\operatorname{rank}(\mathbf{\Sigma}_{m})=\operatorname{rank}(\mathbf{\Sigma}_{0})-m,

and choosing the median rank gives an integer equal to the rank of at least one posterior covariance.

In compliance with the Matlab function rank and the NumPy function numpy.linalg.matrix_rank, we compute the numerical rank of 𝚺m as

\operatorname{rank}(\mathbf{\Sigma}_{m})=\operatorname{cardinality}\{\sigma_{i}\,:\,\sigma_{i}>n\,\varepsilon\,\|\mathbf{\Sigma}_{m}\|_{2}\}, \qquad (30)

where ε is machine epsilon and σi, 1≤i≤n, are the singular values of 𝚺m (GoVa13, Section 5.4.1).
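
A sketch of (30), consistent with the default tolerances of the Matlab function rank and the NumPy function numpy.linalg.matrix_rank, assuming 𝚺m is stored as a dense floating point array:

```python
import numpy as np

def numerical_rank(Sigma_m):
    """Numerical rank (30): singular values above n * eps * ||Sigma_m||_2."""
    sigma = np.linalg.svd(Sigma_m, compute_uv=False)   # sorted in decreasing order
    tol = Sigma_m.shape[0] * np.finfo(Sigma_m.dtype).eps * sigma[0]
    return int(np.sum(sigma > tol))
```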

5.1.3 Three BayesCG implementations

We use three versions of BayesCG: BayesCG with random search directions, BayesCG under the inverse prior, and BayesCG under the Krylov prior.

BayesCG with random search directions.

The implementation in Algorithm B.2 in Appendix B.2 computes posterior covariances that do not depend on the solution 𝐱∗. This, in turn, requires search directions that do not depend on 𝐱∗ (CIOR20, Section 1.1), which is achieved by starting with a random search direction 𝐬1=𝐮∼𝒩(𝟎,𝐈) instead of the initial residual 𝐫0≡𝐛−𝐀𝐱0. The prior is 𝒩(𝟎,𝐀−1).

By design, this version of BayesCG is calibrated. However, it is also impractical due to its slow convergence, see Figure 5.1, and an accurate solution is available only after nn iterations. The random initial search direction 𝐬1\mathbf{s}_{1} leads to uninformative mm-dimensional subspaces, so that the solver has to explore all of n\mathbb{R}^{n} before finding the solution.

Figure 5.1: Relative error 𝐱𝐱m𝐀2/𝐱𝐀2\|\mathbf{x}_{*}-\mathbf{x}_{m}\|_{\mathbf{A}}^{2}/\|\mathbf{x}_{*}\|_{\mathbf{A}}^{2} for BayesCG under the inverse prior (solid line) and Krylov prior (dashed line), and BayesCG with random search directions (dotted line). Vertical axis has a logarithmic scale (left plot) and a linear scale (right plot).
BayesCG under the inverse prior μ0𝒩(𝟎,𝐀1)\mu_{0}\equiv\operatorname{\mathcal{N}}(\mathbf{0},\mathbf{A}^{-1}).

The implementation in Algorithm B.3 in Appendix B.3 is a modified version of Algorithm 2.1 for general priors that maintains the posterior covariances in factored form.

BayesCG under the Krylov prior.

For full posteriors, Algorithm B.4 in Appendix B.4.1 first computes the full Krylov prior with the modified Lanczos method (Algorithm B.1), and then computes the posteriors directly from the prior.

For approximate posteriors, Algorithm B.5 in Appendix B.4 computes rank-dd covariances at the same computational cost as m+dm+d iterations of CG.

In Z- and S-statistic experiments, solutions 𝐱∗ are usually sampled from the prior distribution. We cannot sample solutions from the Krylov prior because it depends on the solution 𝐱∗ and thus differs from system to system. Instead we sample solutions from the reference distribution 𝒩(𝟎,𝐀−1). This is a reasonable choice because the posterior means in BayesCG under the inverse and Krylov priors coincide with the CG iterates (Cockayne:BCG, Section 3).

5.2 BayesCG with random search directions

By design, BayesCG with random search directions is a calibrated solver. The purpose is to establish a baseline for comparisons with BayesCG under the inverse and Krylov priors, and to demonstrate that the ZZ- and SS-statistics perform as expected on a calibrated solver.

Summary of experiments below.

Both the Z- and the S-statistic strongly confirm that BayesCG with random search directions is indeed a calibrated solver, thereby corroborating the statements in Theorem 4.9 and Definition 4.17.

Figure 5.2: ZZ-statistic samples for BayesCG with random search directions after m=10,100,300m=10,100,300 iterations. The solid curve represents the chi-squared distribution and the dashed curve the ZZ-statistic samples.
Iteration | Z-stat mean | χ^2 mean | K-S statistic
10 | 1.79×10^3 | 1.8×10^3 | 0.139
100 | 1.69×10^3 | 1.71×10^3 | 0.161
300 | 1.5×10^3 | 1.51×10^3 | 9.65×10^-2
Table 5.1: This table corresponds to Figure 5.2. For BayesCG with random search directions, it shows the ZZ-statistic sample means; the chi-squared distribution means; and the Kolmogorov-Smirnov statistic between the ZZ-statistic samples and the chi-squared distribution.
Figure 5.3: SS-statistic samples and traces for BayesCG with random search directions after m=10,100,300m=10,100,300 iterations. The solid curve represents the traces and the dashed curve the SS-statistic samples.
Iteration | S-stat mean | Trace mean | Trace standard deviation
10 | 1.78×10^3 | 1.8×10^3 | 2.93×10^-12
100 | 1.69×10^3 | 1.71×10^3 | 2.29×10^-12
300 | 1.51×10^3 | 1.51×10^3 | 2.07×10^-12
Table 5.2: This table corresponds to Figure 5.3. For BayesCG with random search directions, it shows the SS-statistic sample means, the trace means, and the trace standard deviations.
Figure 5.2 and Table 5.1.

The ZZ-statistic samples in Figure 5.2 almost match the chi-squared distribution; and the Kolmogorov-Smirnov statistics in Table 5.1 are on the order of 10110^{-1}, meaning close to zero. This confirms that BayesCG with random search directions is indeed calibrated.

Figure 5.3 and Table 5.2.

The traces in Figure 5.3 are tightly concentrated around the empirical mean of the S-statistic samples. Table 5.2 confirms the close agreement of the trace means and the S-statistic sample means, both on the order of 10^3, together with the very small standard deviation of the traces. Thus, the area in which the posteriors are concentrated is consistent with the error, confirming again that BayesCG with random search directions is calibrated.

5.3 BayesCG under the inverse prior μ0=𝒩(𝟎,𝐀1)\mu_{0}=\operatorname{\mathcal{N}}(\mathbf{0},\mathbf{A}^{-1}).

Summary of experiments below.

Both the Z- and the S-statistic indicate that BayesCG under the inverse prior is pessimistic, and that the pessimism increases with the iteration count. This is consistent with the experiments in (Cockayne:BCG, Section 6.1).

Figure 5.4: ZZ-statistic samples for BayesCG under the inverse prior after m=10,100,300m=10,100,300 iterations. The solid curve represents the chi-squared distribution and the dashed curve the ZZ-statistic samples.
Iteration | Z-stat mean | χ^2 mean | K-S statistic
10 | 51.9 | 1.8×10^3 | 1.0
100 | 0.545 | 1.72×10^3 | 1.0
300 | 1.33×10^-5 | 1.56×10^3 | 1.0
Table 5.3: This table corresponds to Figure 5.4. For BayesCG under the inverse prior, it shows the Z-statistic sample means; the chi-squared distribution means; and the Kolmogorov-Smirnov statistic between the Z-statistic samples and the chi-squared distribution.
Figure 5.5: SS-statistic samples and traces for BayesCG under the inverse prior after m=10,100,300m=10,100,300 iterations. The solid curve represents the traces and the dashed curve the SS-statistic samples.
Iteration | S-stat mean | Trace mean | Trace standard deviation
10 | 51.8 | 1.8×10^3 | 2.99×10^-12
100 | 0.574 | 1.71×10^3 | 0.164
300 | 3.57×10^-6 | 1.61×10^3 | 1.02
Table 5.4: This table corresponds to Figure 5.5. For BayesCG under the inverse prior, it shows the SS-statistic sample means, the trace means, and the trace standard deviations.
Figure 5.4 and Table 5.3.

The Z-statistic samples in Figure 5.4 are concentrated around smaller values than the predicted chi-squared distribution. The Kolmogorov-Smirnov statistics in Table 5.3 are all equal to 1, indicating no overlap between Z-statistic samples and chi-squared distribution. The first two columns of Table 5.3 show that Z-statistic samples move further away from the chi-squared distribution as the iterations progress. Thus, BayesCG under the inverse prior is pessimistic, and the pessimism increases with the iteration count.

Figure 5.5 and Table 5.4.

The S-statistic samples in Figure 5.5 are concentrated around smaller values than the traces. Table 5.4 indicates trace values on the order of 10^3, while the S-statistic samples move towards zero as the iterations progress. Thus the errors are much smaller than the area in which the posteriors are concentrated, meaning the posteriors overestimate the error. This again confirms the pessimism of BayesCG under the inverse prior.

5.4 BayesCG under the Krylov prior

We consider full posteriors (Section 5.4.1), and then rank-50 approximate posteriors (Section 5.4.2).

5.4.1 Full Krylov posteriors

Summary of experiments below.

The ZZ-statistic indicates that BayesCG under full Krylov posteriors is somewhat optimistic, while the SS-statistic indicates resemblance to a calibrated solver.

Figure 5.6: ZZ-statistic samples for BayesCG under the Krylov prior and full posteriors at m=10,100,300m=10,100,300 iterations. The solid curve represents the predicted chi-squared distribution and the dashed curve the ZZ-statistic samples.
Iteration | Z-stat mean | χ^2 mean | K-S statistic
10 | 631.0 | 566.0 | 0.902
100 | 509.0 | 450.0 | 0.752
300 | 201.0 | 152.0 | 0.941
Table 5.5: This table corresponds to Figure 5.6. For BayesCG under the Krylov prior and full posteriors, it shows the ZZ-statistic sample means; the chi-squared distribution means; and the Kolmogorov-Smirnov statistic between the ZZ-statistic samples and the chi-squared distribution.
Figure 5.7: SS-statistic samples and traces for BayesCG under the Krylov prior and full posteriors at m=10,100,300m=10,100,300 iterations. The solid curve represents the traces and the dashed curve the SS-statistic samples.
Iteration | S-stat mean | Trace mean | Trace standard deviation
10 | 52.9 | 52.9 | 9.4
100 | 0.515 | 0.515 | 0.241
300 | 8.59×10^-7 | 8.59×10^-7 | 2.84×10^-7
Table 5.6: This table corresponds to Figure 5.7. For BayesCG under the Krylov prior and full posteriors, it shows the SS-statistic sample means, the trace means, and the trace standard deviations.
Figure 5.6 and Table 5.5.

The Z-statistic samples in Figure 5.6 are concentrated at somewhat larger values than the predicted chi-squared distribution. The Kolmogorov-Smirnov statistics in Table 5.5 are around 0.8 and 0.9, thus close to 1, and indicate very little overlap between the Z-statistic samples and the chi-squared distribution. Thus, BayesCG under full Krylov posteriors is somewhat optimistic.

These numerical results differ from Theorem 4.13, which predicts ZZ-statistic samples equal to gmg-m. A possible reason might be that the rank of the Krylov prior computed by Algorithm B.1 is smaller than the exact rank. In exact arithmetic, rank(𝚪0)=g=n=1806\operatorname{rank}(\mathbf{\Gamma}_{0})=g=n=1806. However, in finite precision, rank(𝚪0)\operatorname{rank}(\mathbf{\Gamma}_{0}) is determined by the convergence tolerance which is set to 101210^{-12}, resulting in rank(𝚪0)<g\operatorname{rank}(\mathbf{\Gamma}_{0})<g.

Figure 5.7 and Table 5.6.

The S-statistic samples in Figure 5.7 match the traces extremely well, with Table 5.6 showing agreement to three figures, as predicted in Section 4.3.4. Thus, the area in which the posteriors are concentrated is consistent with the error, as would be expected from a calibrated solver.

However, BayesCG under the Krylov prior does not behave exactly like a calibrated solver, such as BayesCG with random search directions in Section 5.2, where all traces are concentrated at the empirical mean of the SS-statistic samples. Thus, BayesCG under the Krylov prior is not calibrated but has a performance similar to that of a calibrated solver.

5.4.2 Rank-50 approximate Krylov posteriors

Summary of the experiments below.

Both the Z- and the S-statistic indicate that BayesCG under rank-50 approximate Krylov posteriors is somewhat optimistic, and not as close to a calibrated solver as BayesCG with full Krylov posteriors. In contrast to the Z-statistic, the respective S-statistic samples and traces for BayesCG under full and rank-50 posteriors are close.

Figure 5.8: ZZ-statistic samples for BayesCG under rank-50 approximate Krylov posteriors at m=10,100,300m=10,100,300 iterations. The solid curve represents the predicted chi-squared distribution and the dashed curve the ZZ-statistic samples.
Iteration | Z-stat mean | χ^2 mean | K-S statistic
10 | 319.0 | 50.0 | 1.0
100 | 375.0 | 50.0 | 1.0
300 | 194.0 | 50.0 | 1.0
Table 5.7: This table corresponds to Figure 5.8. For BayesCG under rank-50 approximate Krylov posteriors, it shows the ZZ-statistic sample means; chi-squared distribution means; and Kolmogorov-Smirnov statistic between the ZZ-statistic samples and the chi-squared distribution.
Figure 5.9: SS-statistic samples and traces for BayesCG under rank-50 approximate Krylov posteriors at m=10,100,300m=10,100,300 iterations. The solid curve represents the traces and the dashed curve the SS-statistic samples.
Iteration | S-stat mean | Trace mean | Trace standard deviation
10 | 57.2 | 53.9 | 8.32
100 | 0.517 | 0.467 | 0.214
300 | 3.37×10^-6 | 3.29×10^-6 | 1.6×10^-6
Table 5.8: This table corresponds to Figure 5.9. For BayesCG under rank-50 approximate Krylov posteriors, it shows the SS-statistic sample means, trace means, and trace standard deviations.
Figure 5.8 and Table 5.7.

The ZZ-statistic samples in Figure 5.8 are concentrated around larger values than the predicted chi-squared distribution, which is steady at 50. All Kolmogorov-Smirnov statistics in Table 5.7 are equal to 1, indicating no overlap between ZZ-statistic samples and chi-squared distribution. Thus, BayesCG under approximate Krylov posteriors is more optimistic than BayesCG under full posteriors.

Figure 5.9 and Table 5.8.

The traces in Figure 5.9 are concentrated around slightly smaller values than the SS-statistic samples, but they all have the same order of magnitude, as shown in Table 5.8. This means, the errors are slightly larger than the area in which the posteriors are concentrated; and the posteriors slightly underestimate the errors.

A comparison of Tables 5.6 and 5.8 shows that the SS-statistic samples and traces, respectively, for full and rank-50 posteriors are close. From the point of view of the SS-statistic, BayesCG under approximate Krylov posteriors is somewhat optimistic, and close to being a calibrated solver but not as close as BayesCG under full Krylov posteriors.

Appendix A Auxiliary Results

We present auxiliary results required for proofs in other sections.

The stability of Gaussian distributions implies that a linear transformation of a Gaussian random variable remains Gaussian.

Lemma A.1 (Stability of Gaussian Distributions (Muirhead, Section 1.2)).

Let X𝒩(𝐱,𝚺)X\sim\operatorname{\mathcal{N}}(\mathbf{x},\mathbf{\Sigma}) be a Gaussian random variable with mean 𝐱n\mathbf{x}\in\mathbb{R}^{n} and covariance 𝚺n×n\mathbf{\Sigma}\in\mathbb{R}^{n\times n}. If 𝐲n\mathbf{y}\in\mathbb{R}^{n} and 𝐅n×n\mathbf{F}\in\mathbb{R}^{n\times n}, then

Z=𝐲+𝐅X𝒩(𝐲+𝐅𝐱,𝐅𝚺𝐅T).Z=\mathbf{y}+\mathbf{F}X\sim\operatorname{\mathcal{N}}(\mathbf{y}+\mathbf{F}\mathbf{x},\mathbf{F}\mathbf{\Sigma}\mathbf{F}^{T}).

The conjugacy of Gaussian distributions implies that the distribution of a Gaussian random variable conditioned on information that linearly depends on the random variable is a Gaussian distribution.

Lemma A.2 (Conjugacy of Gaussian Distributions (Ouellette, Section 6.1), (Stuart:BayesInverse, Corollary 6.21)).

Let X𝒩(𝐱,𝚺x)X\sim\operatorname{\mathcal{N}}(\mathbf{x},\mathbf{\Sigma}_{x}) and Y𝒩(𝐲,𝚺y)Y\sim\operatorname{\mathcal{N}}(\mathbf{y},\mathbf{\Sigma}_{y}). The jointly Gaussian random variable [XTYT]T\begin{bmatrix}X^{T}&Y^{T}\end{bmatrix}^{T} has the distribution

[XY]𝒩([𝐱𝐲],[𝚺x𝚺xy𝚺xyT𝚺y]),\begin{bmatrix}X\\ Y\end{bmatrix}\sim\operatorname{\mathcal{N}}\left(\begin{bmatrix}\mathbf{x}\\ \mathbf{y}\end{bmatrix},\begin{bmatrix}\mathbf{\Sigma}_{x}&\mathbf{\Sigma}_{xy}\\ \mathbf{\Sigma}_{xy}^{T}&\mathbf{\Sigma}_{y}\end{bmatrix}\right),

where 𝚺xyCov(X,Y)=𝔼[(X𝐱)(Y𝐲)T]\mathbf{\Sigma}_{xy}\equiv\operatorname{Cov}(X,Y)=\operatorname{\mathbb{E}}[(X-\mathbf{x})(Y-\mathbf{y})^{T}] and the conditional distribution of XX given YY is

(X\mid Y)\sim\operatorname{\mathcal{N}}\Bigl(\underbrace{\mathbf{x}+\mathbf{\Sigma}_{xy}\mathbf{\Sigma}_{y}^{\dagger}(Y-\mathbf{y})}_{\text{mean}},\ \underbrace{\mathbf{\Sigma}_{x}-\mathbf{\Sigma}_{xy}\mathbf{\Sigma}_{y}^{\dagger}\mathbf{\Sigma}_{xy}^{T}}_{\text{covariance}}\Bigr).

We show how to transform a 𝐁\mathbf{B}-orthogonal matrix into an orthogonal matrix.

Lemma A.3.

Let 𝐁n×n\mathbf{B}\in\mathbb{R}^{n\times n} be symmetric positive definite, and let 𝐇n×n\mathbf{H}\in\mathbb{R}^{n\times n} be a 𝐁\mathbf{B}-orthogonal matrix with 𝐇T𝐁𝐇=𝐇𝐁𝐇T=𝐈\mathbf{H}^{T}\mathbf{B}\mathbf{H}=\mathbf{H}\mathbf{B}\mathbf{H}^{T}=\mathbf{I}. Then

𝐔𝐁1/2𝐇\mathbf{U}\equiv\mathbf{B}^{1/2}\mathbf{H}

is an orthogonal matrix with 𝐔T𝐔=𝐔𝐔T=𝐈\mathbf{U}^{T}\mathbf{U}=\mathbf{U}\mathbf{U}^{T}=\mathbf{I}.

Proof: The symmetry of 𝐁\mathbf{B} and the 𝐁\mathbf{B}-orthogonality of 𝐇\mathbf{H} imply

𝐔T𝐔=𝐇T𝐁𝐇=𝐈.\mathbf{U}^{T}\mathbf{U}=\mathbf{H}^{T}\mathbf{B}\mathbf{H}=\mathbf{I}.

From the orthonormality of the columns of 𝐔\mathbf{U}, and the fact that 𝐔\mathbf{U} is square follows that 𝐔\mathbf{U} is an orthogonal matrix (HornJohnson85, Definition 2.1.3).        \blacksquare

Definition A.4.

(HornJohnson85, Section 7.3) The thin singular value decomposition of the rank-pp matrix 𝐆m×n\mathbf{G}\in\mathbb{R}^{m\times n} is

𝐆=𝐔𝐃𝐖T,\mathbf{G}=\mathbf{U}\mathbf{D}\mathbf{W}^{T},

where 𝐔m×p\mathbf{U}\in\mathbb{R}^{m\times p} and 𝐖n×p\mathbf{W}\in\mathbb{R}^{n\times p} are matrices with orthonormal columns and 𝐃p×p\mathbf{D}\in\mathbb{R}^{p\times p} is a diagonal matrix with positive diagonal elements. The Moore-Penrose inverse of 𝐆\mathbf{G} is

𝐆=𝐖𝐃1𝐔T.\mathbf{G}^{\dagger}=\mathbf{W}\mathbf{D}^{-1}\mathbf{U}^{T}.

If a matrix has full column rank or full row rank, then its Moore-Penrose inverse can be expressed in terms of the matrix itself. Furthermore, the Moore-Penrose inverse of a product is equal to the product of the Moore-Penrose inverses, provided the first matrix has full column rank and the second matrix has full row rank.

Lemma A.5 ((CM09, Corollary 1.4.2)).

Let 𝐆m×n\mathbf{G}\in\mathbb{R}^{m\times n} and 𝐉n×p\mathbf{J}\in\mathbb{R}^{n\times p} have full column and row rank respectively, so rank(𝐆)=rank(𝐉)=n\operatorname{rank}(\mathbf{G})=\operatorname{rank}(\mathbf{J})=n. The Moore-Penrose inverses of 𝐆\mathbf{G} and 𝐉\mathbf{J} are

𝐆=(𝐆T𝐆)1𝐆Tand𝐉=𝐉T(𝐉𝐉T)1\mathbf{G}^{\dagger}=(\mathbf{G}^{T}\mathbf{G})^{-1}\mathbf{G}^{T}\quad\text{and}\quad\mathbf{J}^{\dagger}=\mathbf{J}^{T}(\mathbf{J}\mathbf{J}^{T})^{-1}

respectively, and the Moore-Penrose inverse of the product equals

(𝐆𝐉)=𝐉𝐆.(\mathbf{G}\mathbf{J})^{\dagger}=\mathbf{J}^{\dagger}\mathbf{G}^{\dagger}.

Below is an explicit expression for the mean of a quadratic form of Gaussians.

Lemma A.6 ((Mathai, Sections 3.2b.1–3.2b.3)).

Let Z𝒩(𝐱z,𝚺z)Z\sim\operatorname{\mathcal{N}}(\mathbf{x}_{z},\mathbf{\Sigma}_{z}) be a Gaussian random variable in n\mathbb{R}^{n}, and 𝐁n×n\mathbf{B}\in\mathbb{R}^{n\times n} be symmetric positive definite. The mean of ZT𝐁ZZ^{T}\mathbf{B}Z is

𝔼[ZT𝐁Z]\displaystyle\operatorname{\mathbb{E}}[Z^{T}\mathbf{B}Z] =trace(𝐁𝚺z)+𝐱zT𝐁𝐱z.\displaystyle=\operatorname{trace}(\mathbf{B}\mathbf{\Sigma}_{z})+\mathbf{x}_{z}^{T}\mathbf{B}\mathbf{x}_{z}.

We show that the squared Euclidean norm of a Gaussian random variable with an orthogonal projector as its covariance matrix is distributed according to a chi-squared distribution.

Lemma A.7.

Let X∼𝒩(𝟎,𝐏) be a Gaussian random variable in ℝn. If the covariance matrix 𝐏 is an orthogonal projector, that is, if 𝐏2=𝐏 and 𝐏=𝐏T, then

X22=(XTX)χp2,\|X\|_{2}^{2}=(X^{T}X)\sim\chi_{p}^{2},

where p=rank(𝐏)p=\operatorname{rank}(\mathbf{P}).

Proof: We express the projector in terms of orthonormal matrices and then use the invariance of the 2-norm under orthogonal matrices and the stability of Gaussians.

Since 𝐏 is an orthogonal projector, there exists 𝐔1∈ℝn×p such that 𝐔1𝐔1T=𝐏 and 𝐔1T𝐔1=𝐈p. Choose 𝐔2∈ℝn×(n−p) so that 𝐔=[𝐔1 𝐔2] is an orthogonal matrix. Thus,

XTX=XT𝐔𝐔TX=XT𝐔1𝐔1TX+XT𝐔2𝐔2TX.X^{T}X=X^{T}\mathbf{U}\mathbf{U}^{T}X=X^{T}\mathbf{U}_{1}\mathbf{U}_{1}^{T}X+X^{T}\mathbf{U}_{2}\mathbf{U}_{2}^{T}X. (31)

Lemma A.1 implies that Y=𝐔1TX is distributed according to a Gaussian distribution with mean 𝟎 and covariance 𝐔1T𝐔1𝐔1T𝐔1=𝐈p. Similarly, Z=𝐔2TX is distributed according to a Gaussian distribution with mean 𝟎 and covariance 𝐔2T𝐔1𝐔1T𝐔2=𝟎, thus Z=𝟎.

Substituting YY and ZZ into (31) gives XTX=YTY+𝟎T𝟎X^{T}X=Y^{T}Y+\mathbf{0}^{T}\mathbf{0}. From Y𝒩(𝟎,𝐈p)Y\sim\operatorname{\mathcal{N}}(\mathbf{0},\mathbf{I}_{p}) follows (XTX)χp2(X^{T}X)\sim\chi_{p}^{2}.        \blacksquare
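
A quick Monte Carlo illustration of Lemma A.7 with a randomly generated rank-p orthogonal projector (test data chosen for illustration); the Kolmogorov-Smirnov statistic against the chi-squared distribution with p degrees of freedom is close to zero:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 10, 4
U1, _ = np.linalg.qr(rng.standard_normal((n, p)))   # U1 has orthonormal columns
P = U1 @ U1.T                                       # orthogonal projector of rank p
X = (U1 @ rng.standard_normal((p, 50_000))).T       # draws from N(0, P)
sq_norms = np.sum(X ** 2, axis=1)                   # ||X||_2^2 for each draw
print(stats.kstest(sq_norms, stats.chi2(df=p).cdf).statistic)   # close to zero
```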

Lemma A.8.

If 𝐀∈ℝn×n is symmetric positive definite, and M∼𝒩(𝐱μ,𝚺μ) and N∼𝒩(𝐱ν,𝚺ν) are independent random variables in ℝn, then

𝔼[MN𝐀2]=𝐱μ𝐱ν𝐀2+trace(𝐀𝚺μ)+trace(𝐀𝚺ν).\operatorname{\mathbb{E}}[\|M-N\|_{\mathbf{A}}^{2}]=\|\mathbf{x}_{\mu}-\mathbf{x}_{\nu}\|_{\mathbf{A}}^{2}+\operatorname{trace}(\mathbf{A}\mathbf{\Sigma}_{\mu})+\operatorname{trace}(\mathbf{A}\mathbf{\Sigma}_{\nu}).

Proof: The random variable MNM-N has mean 𝔼[MN]=𝐱μ𝐱ν\operatorname{\mathbb{E}}[M-N]=\mathbf{x}_{\mu}-\mathbf{x}_{\nu}, and covariance

𝚺MN\displaystyle\mathbf{\Sigma}_{M-N} Cov(MN,MN)\displaystyle\equiv\operatorname{Cov}(M-N,M-N)
=Cov(M,M)+Cov(N,N)Cov(M,N)Cov(N,M)\displaystyle=\operatorname{Cov}(M,M)+\operatorname{Cov}(N,N)-\operatorname{Cov}(M,N)-\operatorname{Cov}(N,M)
=Cov(M,M)+Cov(N,N)=𝚺μ+𝚺ν,\displaystyle=\operatorname{Cov}(M,M)+\operatorname{Cov}(N,N)=\mathbf{\Sigma}_{\mu}+\mathbf{\Sigma}_{\nu},

where the covariances Cov(M,N)=Cov(N,M)=0\operatorname{Cov}(M,N)=\operatorname{Cov}(N,M)=0 because MM and NN are independent. Now apply Lemma A.6 to MNM-N.        \blacksquare

Appendix B Algorithms

We present algorithms for the modified Lanczos method (Section B.1), BayesCG with random search directions (Section B.2), BayesCG with covariances in factored form (Section B.3), and BayesCG under the Krylov prior (Section B.4).

B.1 Modified Lanczos method

The Lanczos method (Saad, Algorithm 6.15) produces an orthonormal basis for the Krylov space 𝒦g(𝐀,𝐯1)\mathcal{K}_{g}(\mathbf{A},\mathbf{v}_{1}), while the modified version in Algorithm B.1 produces an 𝐀\mathbf{A}-orthonormal basis.

Algorithm B.1 Modified Lanczos Method
1:Input: spd 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n}, 𝐯1n\mathbf{v}_{1}\in\mathbb{R}^{n}, basis dimension mm, convergence tolerance ε\varepsilon
2:𝐯0=𝟎n\mathbf{v}_{0}=\mathbf{0}\in\mathbb{R}^{n}
3:i=1i=1
4:β=(𝐯iT𝐀𝐯i)1/2\beta=(\mathbf{v}_{i}^{T}\mathbf{A}\mathbf{v}_{i})^{1/2}
5:𝐯i=𝐯i/β\mathbf{v}_{i}=\mathbf{v}_{i}/\beta
6:while  imi\leq m do
7:     𝐰=𝐀𝐯iβ𝐯i1\mathbf{w}=\mathbf{A}\mathbf{v}_{i}-\beta\mathbf{v}_{i-1}
8:     α=𝐰T𝐀𝐯i\alpha=\mathbf{w}^{T}\mathbf{A}\mathbf{v}_{i}
9:     𝐰=𝐰−α𝐯i
10:     𝐰=𝐰j=1i𝐯j𝐯jT𝐀𝐰\mathbf{w}=\mathbf{w}-\sum_{j=1}^{i}\mathbf{v}_{j}\mathbf{v}_{j}^{T}\mathbf{A}\mathbf{w} \ignorespaces\triangleright Reorthogonalize 𝐰\mathbf{w}
11:     𝐰=𝐰j=1i𝐯j𝐯jT𝐀𝐰\mathbf{w}=\mathbf{w}-\sum_{j=1}^{i}\mathbf{v}_{j}\mathbf{v}_{j}^{T}\mathbf{A}\mathbf{w}
12:     β=(𝐰T𝐀𝐰)1/2\beta=(\mathbf{w}^{T}\mathbf{A}\mathbf{w})^{1/2}
13:     if  β<ε\beta<\varepsilon then
14:         Exit while loop
15:     end if
16:     i=i+1i=i+1
17:     𝐯i=𝐰/β\mathbf{v}_{i}=\mathbf{w}/\beta
18:end while
19:m=i1m=i-1 \ignorespaces\triangleright Number of basis vectors
20:Output: {𝐯1,𝐯2,,𝐯m}\{\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{m}\} \ignorespaces\triangleright 𝐀\mathbf{A}-orthonormal basis of 𝒦m(𝐀,𝐯1)\mathcal{K}_{m}(\mathbf{A},\mathbf{v}_{1})

Algorithm B.1 reorthogonalizes the basis vectors 𝐯i\mathbf{v}_{i} with Classical Gram-Schmidt performed twice, see Lines 10 and 11. This reorthogonalization technique can be implemented efficiently and produces vectors that are orthogonal to machine precision Giraud:CGS2Exp; Giraud:CGS2Theory.
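
A NumPy sketch of Algorithm B.1, assuming a dense symmetric positive definite 𝐀; the twice-applied classical Gram-Schmidt of Lines 10 and 11 is retained.

```python
import numpy as np

def modified_lanczos(A, v1, m, eps=1e-12):
    """A-orthonormal basis of the Krylov space K_m(A, v1), as in Algorithm B.1."""
    n = A.shape[0]
    V = np.zeros((n, 0))
    beta = np.sqrt(v1 @ A @ v1)
    v, v_prev = v1 / beta, np.zeros(n)
    for _ in range(m):
        V = np.column_stack([V, v])
        w = A @ v - beta * v_prev          # three-term recurrence (Line 7)
        alpha = w @ A @ v                  # Line 8
        w = w - alpha * v                  # Line 9
        for _ in range(2):                 # classical Gram-Schmidt twice (Lines 10-11)
            w = w - V @ (V.T @ (A @ w))
        beta = np.sqrt(w @ A @ w)          # Line 12
        if beta < eps:                     # convergence test (Lines 13-15)
            break
        v_prev, v = v, w / beta
    return V
```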

B.2 BayesCG with random search directions

The version of BayesCG in Algorithm B.2 is designed to be calibrated because the search directions do not depend on 𝐱\mathbf{x}_{*}, hence the posteriors do not depend on 𝐱\mathbf{x}_{*} either (CIOR20, Section 1.1).

After sampling an initial random search direction 𝐬1𝒩(𝟎,𝐈)\mathbf{s}_{1}\sim\operatorname{\mathcal{N}}(\mathbf{0},\mathbf{I}), Algorithm B.2 computes an 𝐀𝚺0𝐀\mathbf{A}\mathbf{\Sigma}_{0}\mathbf{A}-orthonormal basis for the Krylov space 𝒦m(𝐀𝚺0𝐀,𝐬1)\mathcal{K}_{m}(\mathbf{A}\mathbf{\Sigma}_{0}\mathbf{A},\mathbf{s}_{1}) with Algorithm B.1. Then Algorithm B.2 computes the BayesCG posteriors directly with (2) and (3) from Theorem 2.1. The numerical experiments in Section 5 run Algorithm B.2 with the inverse prior μ0=𝒩(𝟎,𝐀1)\mu_{0}=\operatorname{\mathcal{N}}(\mathbf{0},\mathbf{A}^{-1}).

Algorithm B.2 BayesCG with random search directions
1:Inputs: spd 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n}, 𝐛n\mathbf{b}\in\mathbb{R}^{n}, prior μ0=𝒩(𝐱0,𝚺0)\mu_{0}=\operatorname{\mathcal{N}}(\mathbf{x}_{0},\mathbf{\Sigma}_{0}), iteration count mm
2:𝐫0=𝐛𝐀𝐱0{\mathbf{r}}_{0}={\mathbf{b}}-{\mathbf{A}\mathbf{x}}_{0} \ignorespaces\triangleright Initial residual
3:Sample 𝐬1\mathbf{s}_{1} from 𝒩(𝟎,𝐈)\operatorname{\mathcal{N}}(\mathbf{0},\mathbf{I}) \ignorespaces\triangleright Initial search direction
4:Compute columns of 𝐒\mathbf{S} with Algorithm B.1
5:𝚲m=𝐒mT𝐀𝚺0𝐀𝐒m\mathbf{\Lambda}_{m}=\mathbf{S}_{m}^{T}\mathbf{A}\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{S}_{m} \ignorespaces\triangleright 𝚲m\mathbf{\Lambda}_{m} is diagonal
6:𝐱m=𝐱0+𝚺0𝐀𝐒m𝚲m1𝐒mT𝐫0\mathbf{x}_{m}=\mathbf{x}_{0}+\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{S}_{m}\mathbf{\Lambda}_{m}^{-1}\mathbf{S}_{m}^{T}\mathbf{r}_{0} \ignorespaces\triangleright Compute posterior mean with (2)
7:𝚺m=𝚺0𝚺0𝐀𝐒m𝚲m1𝐒mT𝐀𝚺0\mathbf{\Sigma}_{m}=\mathbf{\Sigma}_{0}-\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{S}_{m}\mathbf{\Lambda}_{m}^{-1}\mathbf{S}_{m}^{T}\mathbf{A}\mathbf{\Sigma}_{0} \ignorespaces\triangleright Compute posterior covariance with (3)
8:Output: μm=𝒩(𝐱m,𝚺m)\mu_{m}=\mathcal{N}(\mathbf{x}_{m},\mathbf{\Sigma}_{m})

B.3 BayesCG with covariances in factored form

Algorithm B.3 takes as input a general prior covariance 𝚺0\mathbf{\Sigma}_{0} in factored form, and subsequently maintains the posterior covariances 𝚺m\mathbf{\Sigma}_{m} in factored form as well. Theorem B.1 presents the correctness proof for Algorithm B.3.

Theorem B.1.

Under the conditions of Theorem 2.1, if 𝚺0=𝐅0𝐅0T\mathbf{\Sigma}_{0}=\mathbf{F}_{0}\mathbf{F}_{0}^{T} for 𝐅0n×\mathbf{F}_{0}\in\mathbb{R}^{n\times\ell} and some mnm\leq\ell\leq n, then 𝚺m=𝐅m𝐅mT\mathbf{\Sigma}_{m}=\mathbf{F}_{m}\mathbf{F}_{m}^{T} with

\mathbf{F}_{m}=\mathbf{F}_{0}\left(\mathbf{I}-\mathbf{F}_{0}^{T}\mathbf{A}\mathbf{S}_{m}(\mathbf{S}_{m}^{T}\mathbf{A}\mathbf{F}_{0}\mathbf{F}_{0}^{T}\mathbf{A}\mathbf{S}_{m})^{-1}\mathbf{S}_{m}^{T}\mathbf{A}\mathbf{F}_{0}\right)\in\mathbb{R}^{n\times\ell},\qquad 1\leq m\leq n.

Proof: Fix mm. Substituting 𝚺0=𝐅0𝐅0T\mathbf{\Sigma}_{0}=\mathbf{F}_{0}\mathbf{F}_{0}^{T} into (3) and factoring out 𝐅0\mathbf{F}_{0} on the left and 𝐅0T\mathbf{F}_{0}^{T} on the right gives 𝚺m=𝐅0𝐏𝐅0T\mathbf{\Sigma}_{m}=\mathbf{F}_{0}\mathbf{P}\mathbf{F}_{0}^{T} where

\mathbf{P} \equiv\mathbf{I}-\mathbf{F}_{0}^{T}\mathbf{A}\mathbf{S}_{m}(\mathbf{S}_{m}^{T}\mathbf{A}\mathbf{F}_{0}\mathbf{F}_{0}^{T}\mathbf{A}\mathbf{S}_{m})^{-1}\mathbf{S}_{m}^{T}\mathbf{A}\mathbf{F}_{0}
=(𝐈𝐐(𝐐T𝐐)1𝐐T)where𝐐𝐅0T𝐀𝐒m.\displaystyle=(\mathbf{I}-\mathbf{Q}(\mathbf{Q}^{T}\mathbf{Q})^{-1}\mathbf{Q}^{T})\qquad\text{where}\quad\mathbf{Q}\equiv\mathbf{F}_{0}^{T}\mathbf{A}\mathbf{S}_{m}.

Show that 𝐏\mathbf{P} is a projector,

𝐏2\displaystyle\mathbf{P}^{2} =𝐈2𝐐(𝐐T𝐐)1𝐐T+𝐐(𝐐T𝐐)1𝐐T𝐐(𝐐T𝐐)1𝐐T\displaystyle=\mathbf{I}-2\mathbf{Q}(\mathbf{Q}^{T}\mathbf{Q})^{-1}\mathbf{Q}^{T}+\mathbf{Q}(\mathbf{Q}^{T}\mathbf{Q})^{-1}\mathbf{Q}^{T}\mathbf{Q}(\mathbf{Q}^{T}\mathbf{Q})^{-1}\mathbf{Q}^{T}
=𝐈𝐐(𝐐T𝐐)1𝐐T=𝐏.\displaystyle=\mathbf{I}-\mathbf{Q}(\mathbf{Q}^{T}\mathbf{Q})^{-1}\mathbf{Q}^{T}=\mathbf{P}.

Hence 𝚺m=𝐅0𝐏𝐅0T=𝐅0𝐏𝐏𝐅0T=𝐅m𝐅mT.        \blacksquare
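
A small numerical check of Theorem B.1 with randomly generated prior factor 𝐅0 and search directions 𝐒m (hypothetical test data), comparing 𝐅m𝐅mT with the posterior covariance computed from (3):

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, m = 10, 7, 3
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)                          # SPD
F0 = rng.standard_normal((n, l))                     # prior covariance factor
S = rng.standard_normal((n, m))                      # search directions (random, for the check)

Sigma0 = F0 @ F0.T
Lam = S.T @ A @ Sigma0 @ A @ S
Sigma_m = Sigma0 - Sigma0 @ A @ S @ np.linalg.inv(Lam) @ S.T @ A @ Sigma0   # covariance (3)

Qm = F0.T @ A @ S
F_m = F0 @ (np.eye(l) - Qm @ np.linalg.inv(Qm.T @ Qm) @ Qm.T)               # Theorem B.1
print(np.linalg.norm(Sigma_m - F_m @ F_m.T))         # agrees up to rounding
```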

Algorithm B.3 BayesCG with covariances in factored form
1:Input: spd 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n}, 𝐛n\mathbf{b}\in\mathbb{R}^{n}, 𝐱0n\mathbf{x}_{0}\in\mathbb{R}^{n}, 𝐅0n×\mathbf{F}_{0}\in\mathbb{R}^{n\times\ell} \ignorespaces\triangleright need 𝐱𝐱0range(𝚺0)\mathbf{x}_{*}-\mathbf{x}_{0}\in\operatorname{range}(\mathbf{\Sigma}_{0})
2:𝐫0=𝐛𝐀𝐱0\mathbf{r}_{0}=\mathbf{b}-\mathbf{A}\mathbf{x}_{0}
3:𝐬1=𝐫0\mathbf{s}_{1}=\mathbf{r}_{0}
4:𝐏=𝟎n×n\mathbf{P}=\mathbf{0}\in\mathbb{R}^{n\times n}
5:m=0m=0
6:while not converged do
7:     m=m+1m=m+1
8:     𝐏(:,m)=𝐅0T𝐀𝐬m\mathbf{P}(:,m)=\mathbf{F}_{0}^{T}\mathbf{A}\mathbf{s}_{m} \ignorespaces\triangleright Save column mm of 𝐏\mathbf{P}
9:     𝐪=𝐅0𝐏(:,m)\mathbf{q}=\mathbf{F}_{0}\mathbf{P}(:,m) \ignorespaces\triangleright Compute 𝐪=𝚺0𝐀𝐬m\mathbf{q}=\mathbf{\Sigma}_{0}\mathbf{A}\mathbf{s}_{m}
10:     ηm=𝐬mT𝐀𝐪\eta_{m}=\mathbf{s}_{m}^{T}\mathbf{A}\mathbf{q}
11:     𝐏(:,m)=𝐏(:,m)/ηm\mathbf{P}(:,m)=\mathbf{P}(:,m)\big{/}\eta_{m} \ignorespaces\triangleright Normalize column mm of 𝐏\mathbf{P}
12:     αm=(𝐫m1T𝐫m1)/ηm\alpha_{m}=\left(\mathbf{r}_{m-1}^{T}\mathbf{r}_{m-1}\right)\big{/}\eta_{m}
13:     𝐱m=𝐱m1+αm𝐪\mathbf{x}_{m}=\mathbf{x}_{m-1}+\alpha_{m}\mathbf{q}
14:     𝐫m=𝐫m1αm𝐀𝐪\mathbf{r}_{m}=\mathbf{r}_{m-1}-\alpha_{m}\mathbf{A}\mathbf{q}
15:     βm=(𝐫mT𝐫m)/(𝐫m1T𝐫m1)\beta_{m}=\left(\mathbf{r}_{m}^{T}\mathbf{r}_{m}\right)\big{/}\left(\mathbf{r}_{m-1}^{T}\mathbf{r}_{m-1}\right)
16:     𝐬m+1=𝐫m+βm𝐬m\mathbf{s}_{m+1}=\mathbf{r}_{m}+\beta_{m}\mathbf{s}_{m}
17:end while
18:𝐏=𝐏(:,1:m)\mathbf{P}=\mathbf{P}(:,1:m) \ignorespaces\triangleright Discard unused columns of 𝐏\mathbf{P}
19:𝐅m=𝐅0(𝐈𝐏𝐏T)\mathbf{F}_{m}=\mathbf{F}_{0}(\mathbf{I}-\mathbf{P}\mathbf{P}^{T})
20:Output: 𝐱m\mathbf{x}_{m}, 𝐅m\mathbf{F}_{m} \ignorespaces\triangleright Final posterior

B.4 BayesCG under the Krylov prior

We present algorithms for BayesCG under full Krylov posteriors (Section B.4.1) and under approximate Krylov posteriors (Section B.4.2).

B.4.1 Full Krylov posteriors

Algorithm B.4 computes the following: a matrix 𝐕\mathbf{V} whose columns are an 𝐀\mathbf{A}-orthonormal basis for 𝒦g(𝐀,𝐫0)\mathcal{K}_{g}(\mathbf{A},\mathbf{r}_{0}); the diagonal matrix 𝚽\mathbf{\Phi} in (5); and the posterior mean 𝐱m\mathbf{x}_{m} in (26). The output consists of the posterior mean 𝐱m\mathbf{x}_{m}, and the factors 𝐕m+1:g\mathbf{V}_{m+1:g} and 𝚽m+1:g\mathbf{\Phi}_{m+1:g} for the posterior covariance.

Algorithm B.4 BayesCG under the Krylov prior with full posteriors
1:Inputs: spd 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n}, 𝐛n\mathbf{b}\in\mathbb{R}^{n}, 𝐱0n\mathbf{x}_{0}\in\mathbb{R}^{n}, iteration count mm
2:𝐫0=𝐛𝐀𝐱0{\mathbf{r}}_{0}={\mathbf{b}}-{\mathbf{A}\mathbf{x}}_{0} \ignorespaces\triangleright Initial residual
3:𝐯1=𝐫0{\mathbf{v}}_{1}={\mathbf{r}}_{0} \ignorespaces\triangleright Initial search direction
4:Compute columns of 𝐕\mathbf{V} with Algorithm B.1
5:𝚽=diag((𝐕T𝐫0)2)\mathbf{\Phi}=\mathrm{diag}((\mathbf{V}^{T}\mathbf{r}_{0})^{2}) \ignorespaces\triangleright Compute 𝚽\mathbf{\Phi} with (5)
6:𝐱m=𝐱0+𝐕1:m𝐕1:mT𝐫0\mathbf{x}_{m}=\mathbf{x}_{0}+\mathbf{V}_{1:m}\mathbf{V}_{1:m}^{T}\mathbf{r}_{0} \ignorespaces\triangleright Compute posterior mean with (26)
7:Output: 𝐱m\mathbf{x}_{m}, 𝐕m+1:g\mathbf{V}_{m+1:g}, 𝚽m+1:g\mathbf{\Phi}_{m+1:g}

B.4.2 Approximate Krylov posteriors

Algorithm B.5 computes rank-dd approximate Krylov posteriors in two main steps: (i) posterior mean and iterates 𝐱m\mathbf{x}_{m} in Lines 5-14; and (ii) factorization of the posterior covariance 𝚪^m\widehat{\mathbf{\Gamma}}_{m} in Lines 16-26.

Algorithm B.5 BayesCG under the Krylov prior (RICO21, Algorithm 3.1)
1:Inputs: spd 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n}, 𝐛n\mathbf{b}\in\mathbb{R}^{n}, 𝐱0n\mathbf{x}_{0}\in\mathbb{R}^{n}, iteration count mm, posterior rank dd
2:𝐫0=𝐛𝐀𝐱0{\mathbf{r}}_{0}={\mathbf{b}}-{\mathbf{A}\mathbf{x}}_{0} \ignorespaces\triangleright Initial residual
3:𝐯1=𝐫0{\mathbf{v}}_{1}={\mathbf{r}}_{0} \ignorespaces\triangleright Initial search direction
4:i=0i=0 \ignorespaces\triangleright Initial iteration counter
5:while i<mi<m do \ignorespaces\triangleright CG recursions for posterior means
6:     i=i+1i=i+1 \ignorespaces\triangleright Increment iteration count
7:     ηi=𝐯iT𝐀𝐯i\eta_{i}=\mathbf{v}_{i}^{T}\mathbf{A}\mathbf{v}_{i}
8:     γi=(𝐫i1T𝐫i1)/ηi\gamma_{i}=(\mathbf{r}_{i-1}^{T}\mathbf{r}_{i-1})\big{/}\eta_{i} \ignorespaces\triangleright Next step size
9:     𝐱i=𝐱i1+γi𝐯i{\mathbf{x}}_{i}={\mathbf{x}}_{i-1}+\gamma_{i}\mathbf{v}_{i} \ignorespaces\triangleright Next iterate
10:     𝐫i=𝐫i1γi𝐀𝐯i{\mathbf{r}}_{i}={\mathbf{r}}_{i-1}-\gamma_{i}\mathbf{A}\mathbf{v}_{i} \ignorespaces\triangleright Next residual
11:     δi=(𝐫iT𝐫i)/(𝐫i1T𝐫i1)\delta_{i}=(\mathbf{r}_{i}^{T}\mathbf{r}_{i})\big{/}(\mathbf{r}_{i-1}^{T}\mathbf{r}_{i-1})
12:     𝐯i+1=𝐫i+δi𝐯i\mathbf{v}_{i+1}=\mathbf{r}_{i}+\delta_{i}\mathbf{v}_{i} \ignorespaces\triangleright Next search direction
13:end while
14:d=min{d,gm}d=\min\{d,g-m\} \ignorespaces\triangleright Compute full rank posterior if d>gmd>g-m
15:𝐕m+1:m+d=𝟎n×d\mathbf{V}_{m+1:m+d}=\mathbf{0}_{n\times d} \ignorespaces\triangleright Initialize approximate posterior matrices
16:𝚽m+1:m+d=𝟎d×d\mathbf{\Phi}_{m+1:m+d}=\mathbf{0}_{d\times d}
17:for j=m+1:m+dj=m+1:m+d do \ignorespaces\triangleright dd additional iterations for posterior covariance
18:     ηj=𝐯jT𝐀𝐯j\eta_{j}=\mathbf{v}_{j}^{T}\mathbf{A}\mathbf{v}_{j}
19:     γj=(𝐫j1T𝐫j1)/ηj\gamma_{j}=(\mathbf{r}_{j-1}^{T}\mathbf{r}_{j-1})\big{/}\eta_{j}
20:     𝐕(:,j)=𝐯j/ηj\mathbf{V}(:,j)=\mathbf{v}_{j}\big{/}\sqrt{\eta_{j}} \ignorespaces\triangleright Next column of 𝐕m+1,m+d\mathbf{V}_{m+1,m+d}
21:     𝚽(j,j)=γj𝐫j122\mathbf{\Phi}(j,j)=\gamma_{j}\|\mathbf{r}_{j-1}\|_{2}^{2} \ignorespaces\triangleright Next diagonal element of 𝚽m+1,m+d\mathbf{\Phi}_{m+1,m+d}
22:     𝐫j=𝐫j1γj𝐀𝐯j\mathbf{r}_{j}=\mathbf{r}_{j-1}-\gamma_{j}\mathbf{A}\mathbf{v}_{j}
23:     δj=(𝐫jT𝐫j)/(𝐫j1T𝐫j1)\delta_{j}=(\mathbf{r}_{j}^{T}\mathbf{r}_{j})\big{/}(\mathbf{r}_{j-1}^{T}\mathbf{r}_{j-1})
24:     𝐯j+1=𝐫j+δj𝐯j\mathbf{v}_{j+1}=\mathbf{r}_{j}+\delta_{j}\mathbf{v}_{j} \ignorespaces\triangleright Next un-normalized column of 𝐕m+1,m+d\mathbf{V}_{m+1,m+d}
25:end for
26:Output: 𝐱m\mathbf{x}_{m}, 𝐕m+1:m+d\mathbf{V}_{m+1:m+d}, 𝚽m+1:m+d\mathbf{\Phi}_{m+1:m+d}