
Joint State and Noise Covariance Estimation

Kasra Khosoussi (also affiliated with CSIRO Robotics, Data61)
School of Electrical Engineering and Computer Science
The University of Queensland
St Lucia, QLD, Australia
[email protected]
   Iman Shames School of Engineering
The Australian National University
Canberra, ACT, Australia
[email protected]
Abstract

This paper tackles the problem of jointly estimating the noise covariance matrix alongside primary parameters (such as poses and points) from measurements corrupted by Gaussian noise. In such settings, the noise covariance matrix determines the weights assigned to individual measurements in the least squares problem. We show that the joint problem exhibits a convex structure and provide a full characterization of the optimal noise covariance estimate (with analytical solutions) within joint maximum a posteriori and likelihood frameworks and several variants. Leveraging this theoretical result, we propose two novel algorithms that jointly estimate the primary parameters and the noise covariance matrix. To validate our approach, we conduct extensive experiments across diverse scenarios and offer practical insights into their application in robotics and computer vision estimation problems with a particular focus on SLAM.

I Introduction

Maximum likelihood (ML) and maximum a posteriori (MAP) are the two most common point-estimation criteria in robotics and computer vision applications such as all variants of simultaneous localization and mapping (SLAM) and bundle adjustment. Under the standard assumption of zero-mean Gaussian measurement noise—and, for MAP, Gaussian priors—these estimation problems reduce to least squares; i.e., finding estimates that minimize the weighted sum of squared errors between observed and expected measurements (and, when priors are available, the weighted squared errors between parameter values and their expected priors).

Each measurement error in the least squares objective is weighted by its corresponding noise information matrix (i.e., the inverse of the covariance matrix). Intuitively, more precise sensors receive “larger” weights, thus exerting greater influence on the final estimate. Moreover, the weight matrices enable the estimator to account for correlations between different components of measurements, preventing “double counting” of information. Therefore, obtaining an accurate estimate of the noise covariance matrix is critical for achieving high estimation accuracy. In addition, the (estimated) noise covariance matrix also determines the (estimated) covariance of the estimated parameter values often used for control and decision making in robotics. Consequently, an inaccurate noise covariance estimate can cause overconfidence or underconfidence in state estimates, potentially leading to poor or even catastrophic decisions.

In principle, the noise covariance matrix can be estimated a priori (offline) using a calibration dataset where the true values of the primary parameters (e.g., robot poses) are known (see Remark 6). This can be done either by the sensor manufacturer or the end user. However, in practice, several challenges arise:

  1. Calibration is a labor-intensive process and may not always be feasible, particularly when obtaining ground truth for primary parameters requires additional instrumentation.

  2. In many cases, raw measurements are preprocessed by intermediate algorithms before being used in the estimation problem (e.g., in a SLAM front-end), making it difficult to model their noise characteristics.

  3. The noise characteristics may evolve over time (e.g., due to dynamic environmental factors such as temperature), making the pre-calibrated noise model obsolete.

Due to these challenges, many applications rely on ad hoc noise covariances, such as arbitrary isotropic (or diagonal) covariances, which are either manually set by experts or determined through trial-and-error tuning. Despite being recognized as one of the most critical and widely acknowledged challenges in SLAM [12, Sections III.B, III.G, and V], the problem of noise covariance estimation remains unsolved and understudied in robotics and computer vision literature.

This paper addresses the problem of jointly estimating the (primary) parameters and noise covariance matrices from noisy measurements in a general setting that covers a wide range of estimation problems in robotics and computer vision. For the first time, our algorithms enable automatic estimation of the noise covariance matrix online (i.e., during deployment) without the need for a separate calibration stage with ground truth information. Furthermore, our MAP framework allows the users to incorporate prior information about the noise covariance, such as manufacturer-provided noise calibration.

Our main contributions are as follows:

  1. We formulate the joint MAP and ML estimation problems along with several variants that incorporate different structural assumptions on the true noise covariance matrix. To the best of our knowledge, this is the first time such an approach has been proposed in the SLAM literature.

  2. We provide analytical expressions for (conditionally) optimal noise covariances (under the MAP and ML criteria, and their constrained variants) in Theorem 1.

  3. Building on our theoretical results, we propose two (local) types of optimization methods to solve the joint estimation problem and present a rigorous convergence analysis (Theorems 2 and 3). Our algorithms can be easily incorporated into popular (sparse) nonlinear least squares solvers such as [2, 10, 17], as we demonstrate in our experimental results.

  4. We offer practical insights for noise covariance estimation in sparse graph-structured problems such as SLAM, and present extensive experimental and ablation studies for linear(ized)-Gaussian problems and pose-graph optimization (PGO).

Notation

We use [n] to denote the set of integers from 1 to n. The abbreviated notation x_{1:n} is used to denote x_{1},\ldots,x_{n}. The zero matrix (and vector) is denoted by 0, where the size should be clear from the context. \mathbb{S}^{d}_{\succeq 0} and \mathbb{S}^{d}_{\succ 0} denote the sets of d\times d symmetric positive semidefinite and positive definite real matrices, respectively. For two symmetric real matrices A and B, A\succeq B (resp., A\succ B) means A-B is positive semidefinite (resp., positive definite). A_{ij} denotes the (i,j) element of matrix A, and \mathrm{Diag}(A) denotes the diagonal matrix obtained by zeroing out the off-diagonal elements of A. The standard (Frobenius) inner product between n\times n real matrices A and B is denoted by \langle A,B\rangle\triangleq\mathrm{trace}(A^{\top}B). The Frobenius norm of A is denoted by \|A\|=\sqrt{\langle A,A\rangle}. The weighted Euclidean norm of x given a weight matrix W\succ 0 is denoted by \|x\|_{W}\triangleq\sqrt{x^{\top}Wx}. The probability density function of the multivariate normal distribution of a random variable x with mean vector \mu and covariance matrix \Sigma is denoted by \mathcal{N}(x;\mu,\Sigma).

II Related Works

We refer the reader to [6, 11, 12, 24] for comprehensive reviews of state-of-the-art estimation frameworks in robotics and computer vision. Sparse nonlinear least squares solvers for solving these estimation problems can be found in [2, 10, 17]. However, all of these works assume that the noise covariance is known beforehand. In contrast, our approach simultaneously estimates both the primary parameters (e.g., robot’s trajectory) and the noise covariance matrix directly from noisy measurements. The importance of automatic hyperparameter tuning has recently gained recognition in the SLAM literature; see, e.g., [12, Section V] and [13, 14]. We share this perspective and present, to the best of our knowledge, the first principled approach for optimal noise covariance estimation in SLAM and related problems.

II-A Optimal Covariance Estimation via Convex Optimization

ML estimation of the mean and covariance from independent and identically distributed (i.i.d.) Gaussian samples using the sample mean and sample covariance is a classic example found in textbooks. Boyd and Vandenberghe [5, Chapter 7.1.1] show that covariance estimation in this standard setting and several of its variants can be formulated as convex optimization problems. However, many estimation problems that arise in robotics and other engineering disciplines extend beyond the standard setting. While noise samples are assumed to be i.i.d. for each measurement type (Section IV-C1), the measurements themselves are not identically distributed. Each measurement follows a Gaussian distribution with the corresponding noise covariance matrix and a unique mean that depends on an unknown parameter belonging to a manifold. Furthermore, the measurement function varies across different measurements and is often nonlinear. We demonstrate that the noise covariance estimation problem in this more general setting can also be formulated as a convex optimization problem. Similar to [5], we explore several problem variants that incorporate prior information and additional structural constraints on the noise covariance matrix. These variants differ from those studied in [5, Chapter 7.1.1] and admit analytical (closed-form) solutions.

II-B Iterated Generalized Least Squares

Zhan et al. [26] propose a joint pose and noise covariance estimation method for the perspective-n-point (PnP) problem in computer vision. Their approach is based on the iterated (or iterative) generalized least squares (IGLS) method (see [22, Chapter 12.5] and references therein), alternating between pose and noise covariance estimation. They report improvements in estimation accuracy ranging from 2\% to 34\% compared to baseline methods that assume a fixed isotropic noise covariance. To the best of our knowledge, before [26], IGLS had not been applied in robotics or computer vision. (We were unable to find the original reference for IGLS; however, the concept was already known and analyzed in the econometrics literature in the 1970s [18]. IGLS is closely related to the feasible generalized least squares (FGLS) method, which also has a long history in econometrics and regression.) Our work (developed concurrently with [26]) generalizes and extends both IGLS and [26] in several key ways. First, we prove that the joint ML estimation problem is ill-posed when the sample covariance matrix is singular. This critical problem arises frequently in real-world applications, where the sample covariance matrix can be singular or poorly conditioned (Remark 5). We address this issue by (i) constraining the minimum eigenvalue of the noise covariance matrix and, in our MAP formulation, (ii) imposing a Wishart prior on the noise information matrix. We derive analytical optimal solutions for the noise covariance matrix, conditioned on fixed values of the primary parameters, for the MAP and ML joint estimation problems and several of their constrained variants (Theorem 1). These enable the end user to leverage prior information about the noise covariance matrix (from, e.g., the manufacturer's calibration) in addition to noisy measurements when solving the joint estimation problem. We propose several algorithms for solving these joint estimation problems and present a rigorous theoretical analysis of their convergence properties. Our formulation is more general than IGLS, and we show how our framework can be extended to heteroscedastic measurements, noise models that enter the measurements nonlinearly, and manifold-valued parameters. Finally, we provide insights into the application of our approach to “graph-structured” estimation problems [16], such as PGO and other SLAM variants. These problems present additional challenges compared to the PnP problem due to the much larger number of primary parameters (thousands of poses in PGO vs. a single pose in PnP) and the sparsity of measurements, which reduce the effective signal-to-noise ratio.

II-C Noise Covariance Estimation in Kalman Filtering

The problem of identifying process and measurement noise models in Kalman filtering has been extensively studied since the 1970s; see, e.g., [27, 9, 15] and references therein. The Kalman filter (KF) is equivalent to the recursive MAP estimator in linear-Gaussian models. Similarly, nonlinear variants of the KF, such as the extended Kalman filter (EKF), can be viewed as approximate MAP estimators. Despite conceptual similarities between the problems and proposed solutions, Kalman filtering methods are specifically tailored to linear(ized) models in filtering problems within a recursive framework. Consequently, they are not directly applicable to the smoothing problems (or batch settings) addressed in this paper, which represent the state-of-the-art approach in SLAM and related computer vision problems; see, e.g., [11].

Refer to caption
Figure 1: Confidence ellipses for 10 samples drawn from the Wishart distribution \mathcal{W}(P;V,\nu) with different parameters. The scale matrix V is set to identity in the left and middle plots, and to a correlated matrix in the right plot. The degrees of freedom \nu are set to 2 (left) and 10 (middle and right). As \nu increases, the samples become more concentrated. Increasing \nu changes the scale of the samples as well (cf. left and middle).

III Problem Statement

Consider the standard problem of estimating an unknown vector x_{\text{true}}\in\mathcal{M} given k noisy m-dimensional measurements \mathcal{Z}\triangleq\{z_{i}\}_{i=1}^{k} corrupted by i.i.d. zero-mean Gaussian noise:

z_{i}=h_{i}(x_{\text{true}})\boxplus\epsilon_{i},\quad\epsilon_{i}\sim\mathcal{N}(0_{m},\Sigma_{\text{true}}), (1)

where \Sigma_{\text{true}} is the unknown noise covariance. For Euclidean-valued measurements (such as the relative position of a landmark with respect to a robot pose), \boxplus reduces to addition in \mathbb{R}^{m}. For matrix Lie group-valued measurements (such as the relative pose or orientation between two poses), \boxplus is equivalent to multiplication by \mathrm{Exp}(\epsilon_{i}), where \mathrm{Exp} denotes the matrix exponential composed with the so-called hat operator. We denote the residual of measurement z_{i} evaluated at x\in\mathcal{M} by r_{i}(x)\triangleq z_{i}\boxminus h_{i}(x). As above, \boxminus is subtraction in \mathbb{R}^{m} for Euclidean measurements; in the case of matrix Lie group-valued measurements it is \mathrm{Log}(h_{i}(x)^{-1}z_{i}), where \mathrm{Log} is the matrix logarithm composed with the so-called vee operator (in this case, m refers to the dimension of the Lie algebra).

In this paper, we refer to x_{\text{true}} as the primary “parameters” to distinguish them from \Sigma_{\text{true}}. To simplify the discussion, we first consider the case where all measurements share the same noise distribution, meaning there is a single noise covariance matrix \Sigma_{\text{true}} in (1). See Section IV-C1 for extensions to more general cases. In robotics and computer vision applications, \mathcal{M} is typically a (smooth) product manifold comprised of \mathbb{R}^{d} and \operatorname{SO}(d) components (d\in\{2,3\}) and other real components (e.g., time offsets, IMU biases). We assume the measurement functions h_{i}:\mathcal{M}\to\mathbb{R}^{m} are smooth. This standard model (along with the extensions in Section IV) is quite general, capturing many estimation problems such as SLAM (with various sensing modalities and variants), PGO, point cloud registration (with known correspondences), perspective-n-point, and bundle adjustment.

In this paper, we are interested in the setting where the noise covariance matrix \Sigma_{\text{true}}\succ 0 is unknown and must be estimated jointly with x_{\text{true}} based on the collected measurements \{z_{i}\}_{i=1}^{k}. For convenience, we formulate the problem in the information form and estimate the noise information (or precision) matrix P_{\text{true}}\triangleq\Sigma_{\text{true}}^{-1}. Without loss of generality, we assume a non-informative prior on x_{\text{true}}, which is almost always the case in real-world applications (effectively treating it as an unknown “parameter”). We assume a Wishart prior on the noise information matrix P_{\text{true}} and denote its probability density function by \mathcal{W}(P;V,\nu), where V\in\mathbb{S}^{m}_{\succ 0} is the scale matrix and the integer \nu\geq m+1 is the number of degrees of freedom; Figure 1 illustrates random samples drawn from \mathcal{W}(P;V,\nu). The Wishart distribution is the standard choice of prior in Bayesian statistics for estimating the information matrix from multivariate Gaussian data (in part due to conjugacy); see, e.g., [3, Eq. (2.155)]. In Algorithm 1, we propose a procedure for setting the parameters of the prior, V and \nu, when the user has access to a prior estimate \Sigma_{0} for the noise covariance matrix \Sigma_{\text{true}} (e.g., from prior calibration by the sensor manufacturer); see Appendix A for a justification.

Algorithm 1 Setting Wishart Parameters via Mode Matching
1:procedure PriorModeMatching(\Sigma_{0}, w_{\text{prior}}, k)
2:   // \Sigma_{0}\succ 0 is a prior estimate for \Sigma_{\text{true}}
3:   // w_{\text{prior}}>0 is the weight assigned to the prior relative to the measurement likelihood
4:   // m is the dimension of \Sigma_{0}
5:   // k is the number of measurements
6:  V\leftarrow(w_{\text{prior}}\,k\,\Sigma_{0})^{-1}
7:  \nu\leftarrow w_{\text{prior}}\,k+m+1
8:  return (V,\nu)
9:end procedure
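For concreteness, a minimal NumPy sketch of Algorithm 1 follows (the function name is ours):

import numpy as np

def prior_mode_matching(Sigma0, w_prior, k):
    """Set the Wishart prior parameters (V, nu) by mode matching (Algorithm 1).

    Sigma0  : (m, m) positive-definite prior estimate of Sigma_true.
    w_prior : weight of the prior relative to the measurement likelihood.
    k       : number of measurements.
    """
    m = Sigma0.shape[0]
    V = np.linalg.inv(w_prior * k * Sigma0)   # line 6: scale matrix
    nu = w_prior * k + m + 1                  # line 7: degrees of freedom
    return V, nu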

The joint MAP estimates of x_{\text{true}} and P_{\text{true}} are the maximizers of the posterior. Invoking Bayes' theorem (and omitting the normalizing constant) results in the following problem:

\underset{x\in\mathcal{M},\,P\succeq 0}{\text{maximize}}\quad\underbrace{\mathcal{W}(P;V,\nu)}_{\text{prior on }P_{\text{true}}}\;\overbrace{\prod_{i=1}^{k}\mathcal{N}(z_{i};h_{i}(x),P^{-1})}^{\text{measurement likelihood}}. (2)

For any x\in\mathcal{M}, define the sample covariance at x as follows:

S(x)\triangleq\frac{1}{k}\sum_{i=1}^{k}r_{i}(x)r_{i}(x)^{\top}\succeq 0. (3)

Then, computing the negative log-posterior, omitting normalizing constants, dividing the objective by k+\nu-m-1, and using the cyclic property of the trace yields the following equivalent problem.

Problem 1 (Joint MAP).
\underset{x\in\mathcal{M},\,P\succeq 0}{\text{minimize}}\,\,F(x,P)\triangleq-\log\det P+\left\langle M(x),P\right\rangle, (4)

where

M(x)\triangleq\frac{kS(x)+V^{-1}}{k+\nu-m-1}\succ 0. (5)
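As a reference implementation of (3) and (5), here is a minimal NumPy sketch (the helper names are ours):

import numpy as np

def sample_covariance(residuals):
    """S(x) in (3); `residuals` is a (k, m) array stacking the residuals r_i(x)."""
    k = residuals.shape[0]
    return residuals.T @ residuals / k

def M_matrix(residuals, V, nu):
    """M(x) in (5): blend the sample covariance with the Wishart prior (V, nu)."""
    k, m = residuals.shape
    return (k * sample_covariance(residuals) + np.linalg.inv(V)) / (k + nu - m - 1)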

We also introduce and study three new variants of Problem 1 by imposing additional (hard) constraints on the noise covariance matrix \Sigma. These constraints enable the user to enforce prior structural information about the covariance.

In the first variant, \Sigma (and thus P) is forced to be diagonal. This allows the user to enforce independence between noise components.

Problem 2 (Diagonal Joint MAP).
\begin{aligned}&\underset{x\in\mathcal{M},\,P\succeq 0}{\text{minimize}}&&\,\,F(x,P)\\&\text{subject to}&&P\text{ is diagonal}.\end{aligned} (6)

In the second variant, we constrain the eigenvalues of \Sigma=P^{-1} to lie in [\lambda_{\text{min}},\lambda_{\text{max}}], where \lambda_{\text{max}}\geq\lambda_{\text{min}}>0. These eigenvalues specify the minimum and maximum variance along all directions in \mathbb{R}^{m} (i.e., the variance of all normalized linear combinations of noise components). Therefore, this constraint allows the user to incorporate prior knowledge about the sensor's noise limits. We will also show that in many real-world instances (especially with a weak or no prior, w_{\text{prior}}\to 0), constraining the smallest eigenvalue of \Sigma (or the largest eigenvalue of P) is essential for preventing \Sigma from collapsing to zero.

Problem 3 (Eigenvalue-constrained Joint MAP).
\begin{aligned}&\underset{x\in\mathcal{M},\,P\succeq 0}{\text{minimize}}&&\,\,F(x,P)\\&\text{subject to}&&\lambda_{\text{max}}^{-1}I\preceq P\preceq\lambda_{\text{min}}^{-1}I.\end{aligned} (7)

The last variant imposes both constraints simultaneously.

Problem 4 (Diagonal Eigenvalue-constrained Joint MAP).
\begin{aligned}&\underset{x\in\mathcal{M},\,P\succeq 0}{\text{minimize}}&&\,\,F(x,P)\\&\text{subject to}&&\lambda_{\text{max}}^{-1}I\preceq P\preceq\lambda_{\text{min}}^{-1}I,\\&&&P\text{ is diagonal}.\end{aligned} (8)
Remark 1.

It is easy to verify that by fixing P to a predetermined value, Problem 1 reduces to the standard (weighted) nonlinear least squares problem over x and can be (locally) solved using existing solvers [2, 10, 17]. (Note that \langle S(x),P\rangle=\sum_{i=1}^{k}\|r_{i}(x)\|^{2}_{P}.) Similarly, the problem algebraically reduces to joint ML estimation when w_{\text{prior}}\to 0, where w_{\text{prior}} is the prior weight in Algorithm 1. (The prior weight cannot be exactly zero due to the matrix inversion in line 6 of Algorithm 1; to be precise, w_{\text{prior}}=0 refers to the case where V^{-1} is set to zero in M(x) (5) and \nu=m+1.) This abuse of language allows us to save space by treating ML estimation as a special case of the MAP problems introduced in Problems 1-4. In that case, the linear term in the objective becomes \langle S(x),P\rangle. See also [26] for the unconstrained ML formulation.

Remark 2.

Singularity of S(x) may lead to an ill-conditioned problem when the prior is weak (i.e., w_{\text{prior}} is small), as the corresponding objective function becomes very sensitive in certain directions that correspond to the smallest eigenvalues of M(x). In the unconstrained joint ML estimation case (i.e., w_{\text{prior}}=0 and without \Sigma\succeq\lambda_{\text{min}}I), the minimization problem is ill-posed (unbounded from below) when S(x) is singular. (This can be shown by inspecting the objective value along P+c\cdot u_{0}u_{0}^{\top}, where u_{0} is in the nullspace of S(x) and c is an arbitrarily large positive scalar.) See Remark 5 for more information regarding scenarios where S(x) may become singular.

IV Optimal Information Matrix Estimation

Problems 1-4 are in general non-convex in x because of the residuals r_{i} and, when estimating rotations (e.g., in SLAM), the \operatorname{SO}(d) constraints imposed on the rotational components of x. In this section, we reveal a convexity structure in these problems and provide analytical (globally) optimal solutions for estimating the covariance matrix for a given x\in\mathcal{M}.

IV-A Inner Subproblem: Covariance Estimation

Problems 1-4 can be separated into two nested subproblems: an inner subproblem and an outer subproblem, i.e.,

\begin{aligned}\underset{x\in\mathcal{M}}{\text{minimize}}&\qquad\underset{P\succeq 0}{\min}&&\,\,F(x,P)\\&\quad\quad\text{subject to}&&\text{appropriate constraints on }P,\end{aligned} (9)

where the constraints for each problem are given in Problems 1-4. The inner subproblem minimizes the objective function over the information matrix P as a function of x\in\mathcal{M}. The outer subproblem minimizes the overall objective function by optimizing over x\in\mathcal{M} when the objective is evaluated at the optimal information matrix obtained from the inner subproblem.

Remark 3.

For (9) to be well defined, we require the constraint set to be closed and the cost function to be bounded from below. As M(x) is positive definite by construction, the cost is bounded from below. The positive semidefinite constraint guarantees the closedness of the constraint set. However, due to the presence of the \log\det P term in the cost function, if a solution exists, the cone constraints are not active at this solution. Thus, a singular P^{\star} can never be a solution to this problem.

IV-B Analytical Solution to the Inner Subproblem

The inner subproblem for a fixed x can be written as

\begin{aligned}&\underset{P\succeq 0}{\text{minimize}}&&-\log\det P+\langle M(x),P\rangle\\&\text{subject to}&&\text{appropriate constraints on }P.\end{aligned} (10)
Proposition 1.

For any x\in\mathcal{M}, the inner subproblem (10) is a convex optimization problem and has at most one optimal solution.

Proof.

See Appendix B. ∎

Theorem 1 (Analytical Solution to the Inner Problem).

Consider the inner problem (10) for a given x\in\mathcal{M}. The following statements hold:

  1. Inner Subproblem in Problem 1: The optimal solution is given by

     P^{\star}(x)=M(x)^{-1}. (11)

  2. Inner Subproblem in Problem 2: The optimal solution is given by

     P^{\star}(x)=\mathrm{Diag}(M(x))^{-1}. (12)

  3. Inner Subproblem in Problem 3: Let

     M(x)=U(x)D(x)U(x)^{\top} (13)

     be an eigendecomposition of M(x), where U(x) and D(x) are orthogonal and diagonal, respectively. The optimal solution is given by

     P^{\star}(x)=U(x)\Lambda(x)U(x)^{\top}, (14)

     where \Lambda(x) is a diagonal matrix with the following elements:

     \Lambda_{ii}(x)=\begin{cases}\lambda_{\text{min}}^{-1}&D_{ii}(x)\in[0,\lambda_{\text{min}}],\\D_{ii}(x)^{-1}&D_{ii}(x)\in(\lambda_{\text{min}},\lambda_{\text{max}}),\\\lambda_{\text{max}}^{-1}&D_{ii}(x)\in[\lambda_{\text{max}},\infty).\end{cases} (15)

  4. Inner Subproblem in Problem 4: The optimal solution is a diagonal matrix with the following elements:

     P^{\star}_{ii}(x)=\begin{cases}\lambda_{\text{min}}^{-1}&M_{ii}(x)\in[0,\lambda_{\text{min}}],\\M_{ii}(x)^{-1}&M_{ii}(x)\in(\lambda_{\text{min}},\lambda_{\text{max}}),\\\lambda_{\text{max}}^{-1}&M_{ii}(x)\in[\lambda_{\text{max}},\infty).\end{cases} (16)
Proof.

See Appendix C. ∎
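For concreteness, the following minimal NumPy sketch evaluates the analytical solutions of Theorem 1 for a given M(x) (or S(x) in the ML case); the function name and interface are ours:

import numpy as np

def optimal_information_matrix(M, lam_min=None, lam_max=None, diagonal=False):
    """Analytical solutions (11), (12), (14)-(15), and (16) of the inner subproblem.

    M        : the m-by-m matrix M(x) from (5) (or S(x) for ML estimation).
    lam_min, lam_max : optional eigenvalue bounds on Sigma = P^{-1} (Problems 3-4).
    diagonal : if True, restrict P (and hence Sigma) to be diagonal (Problems 2 and 4).
    """
    if diagonal:
        d = np.diag(M).astype(float)
        if lam_min is not None:
            d = np.clip(d, lam_min, lam_max)     # (16): clip the diagonal of M(x)
        return np.diag(1.0 / d)                  # (12) in the unconstrained case
    if lam_min is None:
        return np.linalg.inv(M)                  # (11)
    # (14)-(15): keep the eigenvectors of M(x), clip its eigenvalues
    D, U = np.linalg.eigh(M)
    D = np.clip(D, lam_min, lam_max)
    return U @ np.diag(1.0 / D) @ U.T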

Remark 4 (ML Estimation).

As we mentioned earlier, the objective function in the ML formulation for estimating the noise covariance can be written as -\log\det P+\langle S(x),P\rangle, which is similar to (10). However, unlike M(x), the sample covariance S(x) can become singular. Therefore, Theorem 1 readily applies to the ML case (and its constrained variants) with an important exception: without the constraint \Sigma\succeq\lambda_{\text{min}}I, if S(x) becomes singular, the problem becomes unbounded from below and thus ML estimation becomes ill-posed.

Remark 5 (Singularities of S(x)).

According to Theorem 1, when S(x) is singular, the spectrum of M(x), and consequently the solution to the inner subproblems associated with Problems 1 and 2, becomes more influenced by the Wishart prior than by the observed data. The sample covariance matrix can be singular in many different situations; two particular cases that lead to singularity are as follows:

  1. When k<m (i.e., there are not enough measurements to estimate \Sigma_{\text{true}}), S(x) will be singular at any x\in\mathcal{M}.

  2. Specific values of x\in\mathcal{M} can lead to singularity in some problems. For example, consider the problem of estimating the odometry noise covariance matrix in PGO. Let x_{\text{odo}} be the odometry estimate obtained by composing the odometry measurements. At x=x_{\text{odo}}, the residuals r_{i}(x_{\text{odo}}) (and thus S(x_{\text{odo}})) will be zero.

Theorem 1 has a clear geometric interpretation. For instance, the optimal covariance matrix \Sigma^{\star}(x)={P^{\star}(x)}^{-1} for Problem 3 (14) has the following properties: (i) it preserves the eigenvectors of M(x), which define the principal axes of the associated confidence ellipsoid; (ii) it matches M(x) along directions corresponding to eigenvalues within the range [\lambda_{\text{min}},\lambda_{\text{max}}]; and (iii) it only adjusts the radii of the ellipsoid along the remaining principal axes to satisfy the constraint. This is visualized in Figure 2. Similarly, the confidence ellipsoid associated with (12) is obtained by projecting the confidence ellipsoid of M(x) onto the standard basis. In the case of (16), the radii of the projected ellipsoid are adjusted to satisfy the eigenvalue constraint.

Refer to caption
Figure 2: The confidence ellipses corresponding to M(x) (in blue) and the optimal covariance matrix \Sigma^{\star}(x)={P^{\star}(x)}^{-1} as given in (14) for Problem 3 (in green). The circles show the confidence ellipses associated with \lambda_{\text{min}}I and \lambda_{\text{max}}I. Note that the principal axes u_{1}(x) and u_{2}(x) of M(x) remain unchanged, while the radii (along the principal axes) are adjusted to fit within the bounds of the circles.
Remark 6 (Calibration).

The inner subproblem (10) also arises in offline calibration. In this context, a calibration dataset \mathcal{Z}^{\text{cal}} is provided with ground truth x_{\text{true}}^{\text{cal}} (or a close approximation). The objective is to estimate the noise covariance matrix \Sigma_{\text{true}} based on the calibration dataset, which can then be used for future datasets collected with the same sensors. Note that this approach differs from the joint problems (Problems 1-4), where x_{\text{true}} and \Sigma_{\text{true}} must be estimated simultaneously without the aid of a calibration dataset containing ground truth information. Applying Theorem 1 at x=x_{\text{true}}^{\text{cal}} directly provides the optimal noise covariance matrix for this scenario. If calibration is performed using an approximate ground truth x^{\text{cal}}\approx x_{\text{true}}^{\text{cal}}, the estimated noise covariance matrix will be biased.

IV-C Two Important Extensions

IV-C1 Heteroscedastic Measurements

For simplicity, we have so far assumed that measurement noises are identically distributed (i.e., homoscedastic). However, in general, there may be T distinct types of measurements (e.g., obtained using different sensors in sensor fusion problems, odometry vs. loop closure in PGO, etc.), where the noises corrupting each type are identically distributed. In such cases, the inner problem (10) decomposes into T independent problems (with different M(x)); solving each yields one of the T noise information matrices. Theorem 1 can then be applied independently to each of these problems to find the analytical optimal solution for each noise information matrix as a function of x.

IV-C2 Preprocessed Measurements and Non-Additive Noise

In SLAM, raw measurements are often preprocessed and transformed nonlinearly into standard models supported by popular solvers. For instance, raw range-bearing measurements (corrupted by additive noise with covariance \Sigma_{\text{true}}) are often expressed in Cartesian coordinates. As a result, although the raw measurements generated by the sensor have identically distributed noise, the transformed measurements that appear in the least squares problem may have different covariances because of the nonlinear transformation. Let \Sigma_{\text{true}} be the covariance of the raw measurements, and \Sigma_{i} be the covariance of the ith transformed measurement. In practice, \Sigma_{i} is approximated by linearization, i.e., \Sigma_{i}\approx J_{i}\Sigma_{\text{true}}J_{i}^{\top}, in which J_{i} is the (known) Jacobian of the transformation. It is easy to verify that Theorem 1 readily extends to this case when the J_{i}'s are full-rank square matrices by replacing the sample covariance S(x) as defined in (3) with

\widetilde{S}(x)\triangleq\frac{1}{k}\sum_{i=1}^{k}J_{i}^{-1}r_{i}(x)r_{i}(x)^{\top}J_{i}^{-\top}. (17)

A similar technique can be used when measurements are affected by zero-mean Gaussian noise in a nonlinear manner, i.e., z_{i}=h_{i}(x_{\text{true}},\epsilon_{i}) (in that case, the Jacobians of the h_{i}'s with respect to the noise will in general depend on x).
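A minimal sketch of the modified sample covariance (17), assuming full-rank square Jacobians J_i (the helper name is ours):

import numpy as np

def transformed_sample_covariance(residuals, jacobians):
    """S_tilde(x) in (17): undo each measurement's transformation before averaging.

    residuals : (k, m) array stacking r_i(x).
    jacobians : length-k sequence of full-rank (m, m) Jacobians J_i.
    """
    m = residuals.shape[1]
    S = np.zeros((m, m))
    for r, J in zip(residuals, jacobians):
        v = np.linalg.solve(J, r)        # J_i^{-1} r_i(x)
        S += np.outer(v, v)
    return S / len(jacobians)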

V Algorithms for Joint Estimation

In principle, one can employ existing constrained optimization methods to directly (locally) solve Problems 1-4. However, this requires substantial modification of existing highly optimized solvers such as [2, 10, 17] and thus may not be ideal. In this section, we present two types of algorithms that leverage Theorem 1 for solving the joint estimation problems.

V-A Variable Elimination

Algorithm 2 Variable Elimination for Joint MAP
1:procedure VariableElimination
2:  Use a local optimization method to find
x^{\star}\in\operatorname*{argmin}_{x\in\mathcal{M}}-\log\det P^{\star}(x)+\langle M(x),P^{\star}(x)\rangle
3:  return (x^{\star},P^{\star}(x^{\star}))
4:end procedure

We can eliminate P from the joint problems (9) by plugging in the optimal information matrix P^{\star}(x) (as a function of x) for the inner subproblem (10), provided in Theorem 1. This leads to the following reduced optimization problem in x:

\underset{x\in\mathcal{M}}{\text{minimize}}\quad-\log\det P^{\star}(x)+\langle M(x),P^{\star}(x)\rangle. (18)

Therefore, if x^{\star}\in\mathcal{M} is an optimal solution to the above problem, then x^{\star} and P^{\star}(x^{\star}) are MAP estimates for x_{\text{true}} and P_{\text{true}}, respectively. This suggests the simple procedure outlined in Algorithm 2. Note that the reduced problem (18) (like Problems 1-4) is a non-convex optimization problem, and thus the first step in Algorithm 2 is subject to local minima.

Remark 7 (Reduced Problem for Problems 1 and 2).

The objective function of the reduced problem (18) further simplifies in the case of Problems 1 and 2. Specifically, for any x\in\mathcal{M}, the linear term in the objective (i.e., \langle M(x),P^{\star}(x)\rangle) is constant for the values of P^{\star}(x) given in (11) and (12):

P^{\star}(x)=M(x)^{-1}\;\Rightarrow\;\langle M(x),P^{\star}(x)\rangle=m, (19)
P^{\star}(x)=\mathrm{Diag}\big(M(x)\big)^{-1}\;\Rightarrow\;\langle M(x),P^{\star}(x)\rangle=m. (20)

Therefore, in these cases the reduced problem further simplifies to the following:

\underset{x\in\mathcal{M}}{\text{minimize}}\quad\log\det M(x), (21)
\underset{x\in\mathcal{M}}{\text{minimize}}\quad\log\det\mathrm{Diag}(M(x)). (22)

These simplified problems have an intuitive geometric interpretation (as noted in [26] for the unconstrained ML estimation case): the MAP estimate of x_{\text{true}} is the value of x\in\mathcal{M} that minimizes the volume of the confidence ellipsoid characterized by M(x) (in Problem 1), or the volume of that ellipsoid projected onto the standard basis (in Problem 2).
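As an illustration, here is a minimal sketch of Algorithm 2 using the simplified reduced objective (21), assuming a linear measurement model as in Section VI-A and SciPy's L-BFGS as the local solver (the function name is ours):

import numpy as np
from scipy.optimize import minimize

def variable_elimination(H_list, z_list, V, nu, x0):
    """Algorithm 2 via the reduced objective log det M(x) in (21),
    sketched for a linear model z_i = H_i x + eps_i."""
    k, m = len(z_list), z_list[0].shape[0]
    V_inv = np.linalg.inv(V)

    def M_of(x):
        R = np.stack([z - H @ x for H, z in zip(H_list, z_list)])
        S = R.T @ R / k                              # sample covariance (3)
        return (k * S + V_inv) / (k + nu - m - 1)    # M(x) in (5)

    # Reduced problem (21): minimize log det M(x) over x
    res = minimize(lambda x: np.linalg.slogdet(M_of(x))[1], x0, method="L-BFGS-B")
    x_star = res.x
    P_star = np.linalg.inv(M_of(x_star))             # optimal information matrix (11)
    return x_star, P_star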

The variable elimination algorithm has the following practical drawbacks:

  1. In many real-world estimation problems, the residuals r_{i} are typically sparse, meaning each measurement depends on only a small subset of the elements of x. Exploiting this sparsity is essential for solving large-scale problems efficiently. However, this sparse structure is generally lost after eliminating P in the reduced problem (18).

  2. Popular solvers in robotics and computer vision such as [2, 10, 17] are primarily nonlinear least squares solvers that assume \Sigma_{\text{true}} is known and focus solely on optimizing x. Consequently, these highly optimized tools cannot be directly applied to solve the reduced problem (18).

V-B Block-Coordinate Descent

In this section we show how block-coordinate descent (BCD) methods can be used to solve the problem of interest. A BCD-type algorithm alternates between the following two steps until a stopping condition (e.g., convergence) is satisfied:

  1. Fix P to its most recent value and minimize the joint MAP objective function in Problems 1-4 with respect to x\in\mathcal{M}. This results in a standard nonlinear least squares problem (where residuals are weighted by P) over \mathcal{M}, which can be (locally) solved using existing solvers such as [2, 10, 17].

  2. Fix x to its most recent value and minimize the joint MAP objective function with respect to P\succeq 0, subject to the constraints in (10). This step reduces to solving the inner subproblem, for which we have the analytical optimal solutions P^{\star}(x) provided by Theorem 1.

We explore two variants of this procedure, shown in Algorithms 3 and 4. BCD algorithms of the type considered here address the limitations of Algorithm 2. Specifically, the problem in the first step can be readily solved using standard solvers widely used in robotics and computer vision, which are also capable of exploiting sparsity in residuals. The second step is highly efficient, as it only requires computing the m\times m matrix M(x) as defined in (5), which can be done in O(k\cdot m^{2}) time. In the case of Problem 3, one must also compute the eigendecomposition of M(x). In practice, m (the dimension of the residuals) is typically a small constant (i.e., m=O(1); e.g., in 3D PGO, m=6), and therefore the overall time complexity of the second step of BCD is O(k), i.e., linear in the number of measurements. A minimal sketch of the block-exact variant for the linear-Gaussian case is given after the listings below.

Algorithm 3 Hybrid Block-Coordinate Descent for Joint MAP
1:procedure BCD(x_{\text{init}})
2:  t\leftarrow 1
3:   // Initialize the information matrix
4:  P^{t}\leftarrow P^{\star}(x_{\text{init}})
5:  while t\leq\tau do
6:    // Step 1: update x^{t} using a descent step with retraction \text{Retr}_{x^{t}}(\cdot), Riemannian gradient \mathrm{grad}_{x}\,R(\cdot) with respect to x, and step size \eta; see Theorem 2 and Appendix E.
7:   x^{t}\leftarrow\text{Retr}_{x^{t}}(-\eta\,\mathrm{grad}_{x}\,R(x^{t},P^{t}))
8:    // Step 2: optimize P (Theorem 1)
9:   P^{t}\leftarrow P^{\star}(x^{t})
10:   t\leftarrow t+1
11:  end while
12:  return (x^{t},P^{t})
13:end procedure
Algorithm 4 Block-Exact BCD for Joint MAP
1:procedure BCD(x_{\text{init}})
2:  t\leftarrow 1
3:   // Initialize the information matrix
4:  P^{t}\leftarrow P^{\star}(x_{\text{init}})
5:  while not converged do
6:    // Step 1: optimize x
7:   x^{t}\in\operatorname*{argmin}_{x\in\mathcal{M}}\,\,\frac{1}{2}\sum_{i=1}^{k}\|r_{i}(x)\|^{2}_{P^{t}}
8:    // Step 2: optimize P (Theorem 1)
9:   P^{t}\leftarrow P^{\star}(x^{t})
10:   t\leftarrow t+1
11:  end while
12:  return (x^{t},P^{t})
13:end procedure
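As promised above, here is a minimal sketch of Algorithm 4 for the linear-Gaussian setting of Section VI-A, where both block updates are exact (the function name is ours):

import numpy as np

def block_exact_bcd(H_list, z_list, V, nu, x_init, n_iters=25):
    """Algorithm 4 sketched for a linear model z_i = H_i x + eps_i.

    Step 1 is an exact weighted linear least-squares solve; Step 2 is the
    analytical update P <- M(x)^{-1} from Theorem 1 (unconstrained MAP case).
    """
    k, m = len(z_list), z_list[0].shape[0]
    V_inv = np.linalg.inv(V)

    def optimal_P(x):
        # Inner subproblem: sample covariance (3), M(x) in (5), P*(x) in (11)
        R = np.stack([z - H @ x for H, z in zip(H_list, z_list)])
        M = (k * (R.T @ R / k) + V_inv) / (k + nu - m - 1)
        return np.linalg.inv(M)

    x = np.asarray(x_init, dtype=float)
    P = optimal_P(x)                      # line 4: initialize the information matrix
    for _ in range(n_iters):
        # Step 1: weighted least squares in x with P fixed (normal equations)
        A = sum(H.T @ P @ H for H in H_list)
        b = sum(H.T @ P @ z for H, z in zip(H_list, z_list))
        x = np.linalg.solve(A, b)
        # Step 2: optimal information matrix with x fixed (Theorem 1)
        P = optimal_P(x)
    return x, P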

Before studying the convergence properties of these algorithms, we introduce the necessary assumptions below. See Appendix E for background information. To simplify the presentation, we define R(x,P)\triangleq\langle M(x),P\rangle.

Assumption 1.

Let \mathcal{P} denote the constraint set for the noise information matrix P. The sets \mathcal{M} and \mathcal{P} are closed and nonempty, the function F is differentiable, and its level set \{(x,P):F(x,P)\leq\gamma\} is bounded for every scalar \gamma.

Assumption 2 (Lipschitz Smoothness in x).

The function R(x,P) is continuously differentiable and Lipschitz smooth in x, i.e., there exists a positive constant L such that for all (\xi,\Pi)\in\mathcal{M}\times\mathcal{P} and all \zeta\in\mathcal{M}:

\|\nabla_{x}R(\xi,\Pi)-\nabla_{x}R(\zeta,\Pi)\|\leq L\|\xi-\zeta\|. (23)

Next, we state the convergence result for Algorithm 3 when applied to Problems 3 and 4.

Theorem 2.

Let \mathcal{M} and \mathcal{P} be compact submanifolds of Euclidean space. Under Assumptions 1 and 2, and setting \eta=1/\widetilde{L} with \widetilde{L} defined in Lemma 1 in Appendix E, the sequence \{(x^{t},P^{t})\} generated by Algorithm 3 satisfies

\min_{t\in[\tau]}\|\mathrm{grad}\,F(x^{t},P^{t})\|\leq C\sqrt{\dfrac{F(x_{\mathrm{init}},P_{\mathrm{init}})-F(x^{\star},P^{\star})}{\tau}}, (24)

where C=\sqrt{2\widetilde{L}}(1+\sqrt{2}\alpha) and \alpha is given in Lemma 1.

Proof.

See Appendix D. ∎

As can be seen in Algorithm 4, one might be able to solve the optimization problems associated with each of the coordinates uniquely and exactly. For example, when the residuals r_{i}(x) are affine functions of x and \mathcal{M} is convex, the optimization problem associated with x has a unique solution (assuming a non-singular Hessian) and can be solved exactly. This case satisfies the following assumption.

Assumption 3.

For all \Pi\in\mathcal{P} and all \xi\in\mathcal{M}, the following problems have unique solutions:

\min_{x\in\mathcal{M}}\;F(x,\Pi),\qquad\min_{P\in\mathcal{P}}\;F(\xi,P). (25)
Theorem 3.

Under Assumptions 1 and 3, the sequence \{(x^{t},P^{t})\} generated by Algorithm 4 is bounded and has limit points. Moreover, every limit point (x^{\star},P^{\star}) is a local minimum of the optimization problem.

Proof.

The proof of the theorem follows directly from [20, Theorem 1]. ∎

VI Experiments

VI-A Linear Measurement Model

Refer to caption
(a) Objective value over iterations
Refer to caption
(b) Average RMSE in estimating x_{\text{true}}
Refer to caption
(c) Average 2-Wasserstein error for the noise covariance
Figure 3: Results of experiments with linear measurement models (Section VI-A). The results shown in Figures 3(b) and 3(c) are averaged over 50 Monte Carlo (MC) runs. The error bars represent 95\% confidence intervals.

We first evaluate the algorithms on a simple linear measurement model. Although the joint problem is still non-convex in this scenario, it is (strictly) convex in x and in P separately, enabling exact minimization in both steps of BCD (Algorithm 4) with analytical solutions (i.e., linear least squares and (10)).

Setup: In these experiments, \mathcal{M}=\mathbb{R}^{20} and x_{\text{true}} is set to the all-ones vector. We generated k=50 measurements, where the dimension of each measurement is m=5. Measurement z_{i} is generated according to

z_{i}=H_{i}x_{\text{true}}+\epsilon_{i},\quad i\in[k]. (26)

Each measurement matrix H_{i}\in\mathbb{R}^{5\times 20} is a random matrix drawn from the standard normal distribution. The measurement noise \epsilon_{i}\sim\mathcal{N}(0,\Sigma_{\text{true}}), where

\Sigma_{\text{true}}=\Sigma^{\text{base}}+\sigma^{2}I, (27)

in which \Sigma^{\text{base}}\succeq 0 is a fixed random covariance matrix, and \sigma^{2}\in\{10^{-2},10^{-1},1,10,10^{2}\} is a variable that controls the noise level in our experiments. We did not impose a prior on P (i.e., prior weight w_{\text{prior}}=0; see (31) in Appendix A), leading to unconstrained joint ML estimation of x_{\text{true}} and \Sigma_{\text{true}}.
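To make the setup concrete, here is a minimal sketch of a data generator consistent with (26)-(27); the normalization of \Sigma^{\text{base}} and the helper name are our own assumptions:

import numpy as np

def generate_linear_dataset(n=20, m=5, k=50, sigma2=0.1, seed=0):
    """Synthetic data for the linear model (26)-(27).

    Returns the measurement matrices H_i, measurements z_i, the all-ones
    x_true, and the true noise covariance Sigma_true.
    """
    rng = np.random.default_rng(seed)
    x_true = np.ones(n)
    A = rng.standard_normal((m, m))
    Sigma_true = A @ A.T + sigma2 * np.eye(m)     # Sigma_base + sigma^2 I, cf. (27)
    H_list, z_list = [], []
    for _ in range(k):
        H = rng.standard_normal((m, n))           # entries drawn from N(0, 1)
        eps = rng.multivariate_normal(np.zeros(m), Sigma_true)
        H_list.append(H)
        z_list.append(H @ x_true + eps)           # (26)
    return H_list, z_list, x_true, Sigma_true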

We conducted 50 Monte Carlo simulations per noise level, each with a different noise realization. In each trial, we generated measurement noise according to the model described above and applied Elimination and BCD (Algorithms 2 and 4) to estimate x_{\text{true}} and \Sigma_{\text{true}} under the Problem 1 formulation. Both algorithms were initialized with x_{\text{init}}=0 and executed for up to 25 iterations. The Elimination algorithm uses the limited-memory BFGS solver from SciPy [25].

Metrics: We use the root mean square error (RMSE) to evaluate the accuracy of estimating x_{\text{true}}. For covariance estimation accuracy, we use the 2-Wasserstein distance between the true noise distribution \mathcal{N}(0,\Sigma_{\text{true}}) and the estimated noise distribution \mathcal{N}(0,\Sigma^{\star}):

\mathfrak{W}_{2}(\mathcal{N}_{\text{true}},\mathcal{N}^{\star})=\sqrt{\mathrm{trace}\Big(\Sigma_{\text{true}}+\Sigma^{\star}-2\big(\Sigma_{\text{true}}^{\frac{1}{2}}\Sigma^{\star}\Sigma_{\text{true}}^{\frac{1}{2}}\big)^{\frac{1}{2}}\Big)}. (28)
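A minimal sketch of (28) using SciPy's matrix square root (the function name is ours):

import numpy as np
from scipy.linalg import sqrtm

def wasserstein2_zero_mean(Sigma_true, Sigma_est):
    """2-Wasserstein distance (28) between N(0, Sigma_true) and N(0, Sigma_est)."""
    root = sqrtm(Sigma_true)
    cross = sqrtm(root @ Sigma_est @ root)
    # Discard the negligible imaginary parts that sqrtm can introduce numerically
    return float(np.sqrt(np.trace(Sigma_true + Sigma_est - 2.0 * np.real(cross))))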

Results: The results are shown in Figure 3.

Figure 3(a) illustrates the value of the objective function in Problem 1 during one of the Monte Carlo simulations (\sigma=0.1). The objective value for BCD is recorded after updating x (Step 1 of Algorithm 4). This figure demonstrates that both methods eventually converge to the same objective value, although BCD exhibits faster convergence.

Figure 3(b) presents the average RMSE for the solution x^{\star} obtained by Elimination and BCD (at their final iteration) across the Monte Carlo simulations for various noise levels. The figure also includes average RMSEs for estimates obtained using fixed covariance matrices: \Sigma=\Sigma_{\text{true}} (i.e., the true noise covariance) and \Sigma=I (i.e., an arbitrary identity matrix often used in practice when \Sigma_{\text{true}} is unknown). The results show that Elimination and BCD achieve identical RMSEs across all noise levels, with RMSE increasing as the noise level rises. Notably, under low noise, the accuracy of the solutions produced by these algorithms matches that achieved when the true covariance matrix \Sigma_{\text{true}} is known. This indicates that the algorithms can accurately estimate x_{\text{true}} without prior knowledge of \Sigma_{\text{true}}. The gap between the RMSEs widens as the noise level increases, which is expected because the algorithms must jointly estimate x_{\text{true}} and \Sigma_{\text{true}} under a low signal-to-noise ratio. Nonetheless, the RMSE follows a consistent trend across noise levels.

Furthermore, the results highlight that using a naïve approximation of the noise covariance matrix (i.e., fixing \Sigma=I) leads to a poor estimate x^{\star}. However, as the noise level increases, the RMSE for this approximation eventually approaches that of the case where \Sigma_{\text{true}} is perfectly known. This behavior is partly due to the setup in (27): as \sigma^{2} grows, the diagonal components of the covariance matrix dominate, making the true covariance approximately isotropic. Since the estimation of x_{\text{true}} (given a fixed noise covariance matrix) is invariant to the scaling of the covariance matrix, the performance of \Sigma=I aligns with that of \Sigma=\Sigma_{\text{true}} despite the scaling discrepancy.

Finally, Figure 3(c) shows the 2-Wasserstein distance averaged over the Monte Carlo simulations. Similar to the previous figure, the covariance estimates obtained by Elimination and BCD are consistently close to those derived using x=x_{\text{true}}. As the noise level increases, the covariance estimation error also rises, widening the gap, as expected.

VI-B Pose-Graph Optimization Ablations

Dataset: We used a popular synthetic PGO benchmark, the Manhattan dataset [19], and generated new measurement realizations with varying values of the actual noise covariance matrix. The dataset consists of 3,500 poses and k=5,598 relative-pose measurements. This dataset is notoriously poorly connected [16]. Therefore, to analyze the effect of connectivity on covariance estimation, we also performed experiments on modified versions of this dataset, where additional loop closures were introduced by connecting pose i to poses i+2 and i+3. This modification increased the total number of measurements to k=12,593.

We generated zero-mean Gaussian noise in the Lie algebra \mathrm{se}(2)\cong\mathbb{R}^{3} for the following models:

  1. Homoscedastic Measurements: In these experiments, all measurements share the same information matrix, given by \alpha\times\mathrm{diag}(20,40,30), where \alpha is a scaling factor that controls the noise level (hereafter referred to as the “information level”).

  2. Heteroscedastic Measurements: We introduce two distinct noise models for odometry and loop-closure edges. The true information matrix for odometry noise is fixed at \mathrm{diag}(1000,1000,800), while the loop-closure noise information matrix is varied as \alpha\times\mathrm{diag}(20,40,30).

For each value of the information level \alpha\in\{5,10,20,30,40\}, we conducted 50 Monte Carlo simulations with different noise realizations.

Algorithms: We implemented a variant of Algorithm 3 in C++, based on g2o [17], for PGO problems. Instead of using Riemannian gradient descent, we used g2o's implementation of Powell's Dog-Leg method [21] on \mathrm{SE}(2) to optimize x. In each outer iteration, our implementation retrieves the residuals for each measurement, computes M(x) as defined in (5), and updates the noise covariance estimate using Theorem 1. To handle cases where w_{\text{prior}}=0 (i.e., ML estimation) and S(x) is poorly conditioned, we set \lambda_{\text{min}}=10^{-4} and \lambda_{\text{max}}=10^{4} in all experiments. We then perform a single iteration of Powell's Dog-Leg method [21] to update the primary parameters x, initializing the solver at the latest estimate of x. To ensure convergence across all Monte Carlo trials, we ran 13 outer iterations.

We report the results for the following methods:

  1. BCD: In line 9 of Algorithm 3, the noise information matrix is estimated using the ML estimate (i.e., w_{\text{prior}}\to 0) subject to eigenvalue constraints. This is equivalent to (14) after replacing M(x) with the sample covariance S(x) at the current value of x.

  2. BCD (diag): In line 9 of Algorithm 3, the noise information matrix is estimated using the ML estimate (as above), subject to both diagonal and eigenvalue constraints. This is equivalent to (16) after replacing M(x) with the sample covariance S(x) at the current value of x.

  3. BCD (Wishart): In line 9 of Algorithm 3, the information matrix is updated according to (14), i.e., using the MAP estimate with a Wishart prior and under the eigenvalue constraints.

  4. BCD (diag+Wishart): In line 9 of Algorithm 3, the noise information matrix is updated according to (16), i.e., using the MAP estimate with a Wishart prior and under the diagonal and eigenvalue constraints.

For BCD (Wishart) and BCD (diag+Wishart), where a Wishart prior was used, we applied our Algorithm 1 to set the prior parameters (i.e., V and \nu), with a prior weight of w_{\text{prior}}=0.1 and a prior estimate of \Sigma_{0}=0.002I for both odometry and loop-closure edges. Note that this prior estimate is far from the true noise covariance value.

Additionally, we report results for estimating x using Powell's Dog-Leg solver in g2o under two fixed noise covariance settings: (i) the true covariance matrix (as a reference) and (ii) the identity matrix (a common ad hoc approximation used in practice). To ensure convergence across all trials, we performed eight iterations for these methods. All algorithms were initialized using a spanning tree to compute x_{\text{init}} [17].

Metrics: We use RMSE to measure error in the estimated robot trajectory (positions). In all cases, the first pose is fixed to the origin and therefore aligning the estimates with the ground truth is not needed. We also use the 2-Wasserstein distance (28) to measure the covariance estimation error.

Refer to caption
(a) Homoscedastic Scenario
Refer to caption
(b) Heteroscedastic Scenario
Refer to caption
(c) Heteroscedastic Scenario with extra loop closures
Figure 4: Average RMSE obtained by variants of BCD and g2o (with fixed true and identity covariances) as a function of the information level \alpha.
Refer to caption
(a) Homoscedastic Scenario
Refer to caption
(b) Heteroscedastic Scenario – odometry (middle) and loop closure (right)
Figure 5: Average 2-Wasserstein error achieved by variants of BCD as a function of the information level \alpha.

Results: The results are shown in Figures 4 and 5.

Figure 4 shows the average RMSE over Monte Carlo trials for different values of the information level \alpha. Figure 4(a) presents the results for the homoscedastic case, while Figures 4(b) and 4(c) show the results for the heteroscedastic case before and after additional loop closures, respectively. The results show that our framework, across all variants, successfully produced solutions with an RMSE close to the reference setting where the true noise covariance matrix is available (g2o with \Sigma_{\text{true}}). For almost all values of \alpha and all experiments, BCD (Wishart) and BCD (diag+Wishart) achieved a lower average RMSE than the other variants, highlighting the importance of incorporating a prior. Notably, this is despite the prior being assigned a small weight w_{\text{prior}} and the prior estimate \Sigma_{0} not being close to \Sigma_{\text{true}}. Incorporating a prior is particularly crucial in PGO, as we observed that the eigenvalues of S(x) can become severely small in practice. In such cases, without a prior and without enforcing a minimum eigenvalue constraint \Sigma\succeq\lambda_{\text{min}}I, the ML estimate for the noise covariance can collapse to zero, leading to invalid results and excessive overconfidence. The average RMSEs achieved by the variants with and without the diagonal constraint are generally similar: interestingly, the diagonal variants perform only about as well as their unconstrained counterparts, despite the true noise covariance being diagonal.

For some values of \alpha in Figure 4(b), the BCD variants with the Wishart prior achieved a lower RMSE than the reference solution. However, we expect that, on average, the reference solution will perform better with a larger number of Monte Carlo simulations. The RMSE trends look similar in all experiments, with the exception of a slight increase at \alpha=30 for BCD and BCD (diag) in Figure 4(b). In terms of RMSE, BCD variants with a Wishart prior outperform or perform comparably to the baseline with a fixed identity covariance in most cases. That said, especially at larger information levels (i.e., lower noise), this naïve baseline estimates x quite accurately.

Figure 5 presents the average 2-Wasserstein distance (28) between the noise covariance matrices estimated by the various BCD variants and the true covariance in both the homoscedastic and heteroscedastic scenarios. In all cases, the BCD variants achieved a small 2-Wasserstein error. For reference, the 2-Wasserstein error between \mathcal{N}(0,\Sigma_{\text{true}}) and the baseline \mathcal{N}(0,I) exceeds 1, which is more than 20 times the errors attained by our algorithms. As noted in Section I, an incorrect noise covariance leads to severe overconfidence or underconfidence in the estimated state x (e.g., the estimated trajectory in SLAM), potentially resulting in poor or catastrophic decisions.

The results indicate that, in most cases, the diagonal BCD variants yield lower errors. This is expected, as the true noise covariance matrices are diagonal, and enforcing this prior information enhances estimation accuracy. Additionally, the MAP estimates obtained by BCD (Wishart) and BCD (diag+Wishart) outperform their ML-based counterparts. This is because the eigenvalues of the sample covariance matrix S(x) can be very small, indicating insufficient information for accurate noise covariance estimation. In such cases, the eigenvalue constraint effectively prevents a complete collapse of the estimated covariance. Interestingly, this issue was not encountered in [26] for ML estimation of the noise covariance matrix in PnP. This discrepancy may suggest additional challenges in estimating noise covariance in SLAM (and related) problems, which typically involve a larger number of variables and sparse, graph-structured relative measurements. In such cases, incorporating a prior estimate may be necessary for accurate identification of noise covariance matrices. Finally, Figure 5(b) illustrates that the MAP estimation error is lower for the noise covariance of loop closures than for odometry edges. This difference is partly because the prior value based on \Sigma_{0} is closer to the true noise covariance matrix for loop closures.

VI-C RIM Dataset

Figure 6: RIM Dataset. (a) g2o w/ original covariance; (b) BCD – joint ML estimate (w/o prior); (c) BCD – joint MAP estimate w/ prior.

In this section, we present qualitative results on RIM, a large real-world 3D PGO dataset collected at Georgia Tech [8]. This dataset consists of 10,19510,195 poses and 29,74329,743 measurements. The g2o dataset includes a default noise covariance matrix, but with these default values, g2o struggles to solve the problem. Specifically, the Gauss-Newton and Dog-Leg solvers terminate after a single iteration, while the Levenberg-Marquardt solver completes 10 iterations but produces the trajectory estimate shown in Figure 6(a).

We ran BCD without and with the Wishart prior (i.e., ML and MAP estimation) for 10 outer iterations, using a single Dog-Leg iteration to optimize xx in each round. We set the prior parameters to wprior=0.1w_{\text{prior}}=0.1 and Σ0=0.01I\Sigma_{0}=0.01I. The results are shown in Figures 6(b) and 6(c), respectively. The trajectory estimates obtained by BCD are clearly more accurate than those obtained with the original noise covariance values.
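For concreteness, the outer loop just described can be sketched as follows. This is a hypothetical illustration only: optimize_x_one_iteration stands in for a single Dog-Leg iteration of the pose-graph solver with the current noise covariance, and residuals stands in for computing the stacked measurement errors at the current estimate; neither corresponds to an actual g2o or GTSAM API.

import numpy as np

def bcd_map(x0, sigma0, w_prior, residuals, optimize_x_one_iteration, outer_iters=10):
    # Alternate between updating the primary parameters x and the noise covariance.
    x, sigma = x0, sigma0
    for _ in range(outer_iters):
        x = optimize_x_one_iteration(x, sigma)            # one Dog-Leg iteration on x
        r = residuals(x)                                  # (k, m) array of measurement errors
        s = r.T @ r / r.shape[0]                          # sample covariance S(x)
        sigma = (w_prior * sigma0 + s) / (w_prior + 1.0)  # MAP blend with the prior (Appendix A)
    return x, sigma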

A closer visual inspection reveals that the trajectory estimated with the Wishart prior (Figure 6(c)) appears more accurate than the ML estimate (Figure 6(b)). This is expected: without a prior, some eigenvalues of S(x)S(x) collapsed until they reached the eigenvalue constraint ΣλminI\Sigma\succeq\lambda_{\text{min}}I, whereas in the MAP case the estimated covariance did not reach this lower bound.

Interestingly, despite Σ0\Sigma_{0} being isotropic, the estimate obtained by BCD with the Wishart prior was not. In particular, the information matrix component associated with the zz coordinate was almost twice as large as those of xx and yy. This likely reflects the fact that the trajectory in this dataset is relatively flat. This finding highlights that in this case, in addition to the prior, the measurements also contributed information to the estimation of the noise covariance matrix.

VII Limitations

This paper did not explore the issue of outliers. Our BCD-type methods can be viewed as (unconventional) instances of iteratively reweighted least squares (IRLS) techniques, where we estimate the entire weight matrix instead of a scalar weight. Consequently, we believe that our joint MAP estimation framework can be easily adapted to integrate with M-estimation IRLS techniques commonly used for outlier-robust estimation, to accurately estimate both the primary parameters and the noise covariance matrix (for inlier measurements) when the dataset contains outliers. We are currently exploring this direction and also aim to implement our method in real-time visual and LiDAR-based SLAM and odometry systems such as [7, 23].

Moreover, our Theorem 2 requires \mathcal{M} to be compact. This result therefore does not immediately apply to problems where xx contains translational components. However, we believe we can address this issue in SLAM by confining translational components to potentially large but bounded subsets of d\mathbb{R}^{d} (where d{2,3}d\in\{2,3\} is the problem dimension).

Finally, we acknowledge the extensive research on IGLS and Feasible GLS (FGLS) conducted as early as the 1970s in econometrics and statistics, which remains largely unknown in engineering; see the relevant references in [22, Chapter 12.5] and [26]. Exploring this rich literature may yield new insights and methods that further improve noise covariance estimation in SLAM and computer vision applications.

VIII Conclusion

This work presented a novel and rigorous framework for the joint estimation of primary parameters and noise covariance matrices in SLAM and related computer vision estimation problems. We derived analytical expressions for the (conditionally) optimal noise covariance matrix (under ML and MAP criteria) under various structural constraints on the true covariance matrix. Building on these solutions, we proposed two types of algorithms for finding the optimal estimates and theoretically analyzed their convergence properties. Our results and algorithms are general and can be readily applied to a broad range of estimation problems across engineering disciplines. We validated our algorithms through extensive experiments using linear measurement models and PGO problems.

References

  • Absil et al. [2009] P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
  • Agarwal et al. [2023] Sameer Agarwal, Keir Mierle, and The Ceres Solver Team. Ceres Solver, 10 2023. URL https://github.com/ceres-solver/ceres-solver.
  • Bishop and Nasrabadi [2006] Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning, volume 4. Springer, 2006.
  • Boumal [2020] Nicolas Boumal. An introduction to optimization on smooth manifolds. Available online, Aug 2020. URL http://www.nicolasboumal.net/book.
  • Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • Cadena et al. [2016] Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, Jose Neira, Ian D Reid, and John J Leonard. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. arXiv preprint arXiv:1606.05830, 2016.
  • Campos et al. [2021] Carlos Campos, Richard Elvira, Juan J. Gómez, José M. M. Montiel, and Juan D. Tardós. ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.
  • Carlone et al. [2015] Luca Carlone, Roberto Tron, Kostas Daniilidis, and Frank Dellaert. Initialization techniques for 3d slam: A survey on rotation estimation and its use in pose graph optimization. In 2015 IEEE international conference on robotics and automation (ICRA), pages 4597–4604. IEEE, 2015.
  • Chen et al. [2023] Zhaozhong Chen, Harel Biggie, Nisar Ahmed, Simon Julier, and Christoffer Heckman. Kalman filter auto-tuning through enforcing chi-squared normalized error distributions with bayesian optimization. arXiv preprint arXiv:2306.07225, 2023.
  • Dellaert and Contributors [2022] Frank Dellaert and GTSAM Contributors. borglab/gtsam, May 2022. URL https://github.com/borglab/gtsam.
  • Dellaert et al. [2017] Frank Dellaert, Michael Kaess, et al. Factor graphs for robot perception. Foundations and Trends® in Robotics, 6(1-2):1–139, 2017.
  • Ebadi et al. [2023] Kamak Ebadi, Lukas Bernreiter, Harel Biggie, Gavin Catt, Yun Chang, Arghya Chatterjee, Christopher E Denniston, Simon-Pierre Deschênes, Kyle Harlow, Shehryar Khattak, et al. Present and future of SLAM in extreme environments: The DARPA SubT challenge. IEEE Transactions on Robotics, 2023.
  • Fontan et al. [2024a] Alejandro Fontan, Javier Civera, Tobias Fischer, and Michael Milford. Look ma, no ground truth! ground-truth-free tuning of structure from motion and visual SLAM. arXiv preprint arXiv:2412.01116, 2024a.
  • Fontan et al. [2024b] Alejandro Fontan, Javier Civera, and Michael Milford. Anyfeature-vslam: Automating the usage of any chosen feature into visual SLAM. In Robotics: Science and Systems, volume 2, 2024b.
  • Forsling et al. [2024] Robin Forsling, Simon J Julier, and Gustaf Hendeby. Matrix-valued measures and wishart statistics for target tracking applications. arXiv preprint arXiv:2406.00861, 2024.
  • Khosoussi et al. [2019] Kasra Khosoussi, Matthew Giamou, Gaurav S Sukhatme, Shoudong Huang, Gamini Dissanayake, and Jonathan P How. Reliable graphs for SLAM. The International Journal of Robotics Research, 38(2-3):260–298, 2019.
  • Kuemmerle et al. [2011] Rainer Kuemmerle, Giorgio Grisetti, Hauke Strasdat, Kurt Konolige, and Wolfram Burgard. g2o: A general framework for graph optimization. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), 2011.
  • Malinvaud [1980] Edmond Malinvaud. Statistical methods of econometrics. 1980.
  • Olson et al. [2006] E. Olson, J. Leonard, and S. Teller. Fast iterative alignment of pose graphs with poor initial estimates. In Robotics and Automation, 2006. ICRA 2006. Proceedings 2006 IEEE International Conference on, pages 2262–2269. IEEE, 2006.
  • Peng and Vidal [2023] Liangzu Peng and René Vidal. Block coordinate descent on smooth manifolds: Convergence theory and twenty-one examples. arXiv preprint arXiv:2305.14744, 2023.
  • Powell [1970] MJD Powell. A new algorithm for unconstrained optimization. UKAEA, 1970.
  • Seber and Wild [1989] George A. F. Seber and C. J. Wild. Nonlinear Regression. Wiley-Interscience, 1989.
  • Shan et al. [2020] Tixiao Shan, Brendan Englot, Drew Meyers, Wei Wang, Carlo Ratti, and Rus Daniela. Lio-sam: Tightly-coupled lidar inertial odometry via smoothing and mapping. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5135–5142. IEEE, 2020.
  • Triggs et al. [1999] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. In International workshop on vision algorithms, pages 298–372. Springer, 1999.
  • Virtanen et al. [2020] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, and et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.
  • Zhan et al. [2025] Tian Zhan, Chunfeng Xu, Cheng Zhang, and Ke Zhu. Generalized maximum likelihood estimation for perspective-n-point problem. IEEE Robotics and Automation Letters, 2025.
  • Zhang et al. [2020] Lingyi Zhang, David Sidoti, Adam Bienkowski, Krishna R Pattipati, Yaakov Bar-Shalom, and David L Kleinman. On the identification of noise covariances and adaptive kalman filtering: A new look at a 50 year-old problem. IEEE Access, 8:59362–59388, 2020.

Appendix A Wishart Prior

Consider 𝒲(P;V,ν)\mathcal{W}(P;V,\nu) as a prior for the information matrix. Here we assume νm+1\nu\geq m+1, where mm is the dimension of the covariance matrix, and V0V\succ 0. The mode of this distribution is given by

argmaxP0𝒲(P;V,ν)=(νm1)V.\operatorname*{argmax}_{P\succeq 0}\quad\mathcal{W}(P;V,\nu)=(\nu-m-1)V. (29)

Let Σ00\Sigma_{0}\succ 0 be our prior guess for the covariance matrix. We set the value of the scale matrix VV by matching the mode with our prior estimate Σ01\Sigma_{0}^{-1}:

(νm1)V=Σ01V1=(νm1)Σ0\displaystyle(\nu-m-1)V=\Sigma_{0}^{-1}\Leftrightarrow V^{-1}=(\nu-m-1)\Sigma_{0} (30)

Let wpriorw_{\text{prior}} be the following:

wpriorνm1k0.w_{\text{prior}}\triangleq\frac{\nu-m-1}{k}\geq 0. (31)

Then we can rewrite V1V^{-1} as

V1=wpriorkΣ0.V^{-1}=w_{\text{prior}}k\Sigma_{0}. (32)

In the unconstrained case (11), the optimal (MAP) estimate (conditioned on a fixed value of xx) is given by:

Σ(x)P(x)1\displaystyle\Sigma^{\star}(x)\triangleq{P^{\star}(x)}^{-1} =M(x)\displaystyle=M(x) (33)
=V1+kS(x)k+νm1\displaystyle=\frac{V^{-1}+kS(x)}{k+\nu-m-1} (34)
=wpriorkΣ0+kS(x)(wprior+1)k\displaystyle=\frac{w_{\text{prior}}k\Sigma_{0}+kS(x)}{(w_{\text{prior}}+1)k} (35)
=wpriorwprior+1Σ0+1wprior+1S(x).\displaystyle=\frac{w_{\text{prior}}}{w_{\text{prior}}+1}\Sigma_{0}+\frac{1}{w_{\text{prior}}+1}S(x). (36)

This shows that the (conditional) MAP estimator simply blends the sample covariance matrix S(x)S(x) and prior Σ0\Sigma_{0} based on the prior weight wpriorw_{\text{prior}}. We directly set wpriorw_{\text{prior}} (e.g., wprior=0.1w_{\text{prior}}=0.1 by default) based on our confidence in the prior relative to the likelihood. Large values of wpriorw_{\text{prior}} will result in stronger priors. For the chosen value of wpriorw_{\text{prior}},

ν\displaystyle\nu =wpriork+m+1,\displaystyle=w_{\text{prior}}k+m+1, (37)
V1\displaystyle V^{-1} =wpriorkΣ0.\displaystyle=w_{\text{prior}}k\Sigma_{0}. (38)

This procedure is summarized in Algorithm 1.
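As a concrete illustration (not a transcription of Algorithm 1), the following Python sketch maps a chosen prior weight to the Wishart hyperparameters via (37)–(38) and evaluates the blended conditional MAP covariance (36). It assumes the kk residuals are stacked as rows of a NumPy array; the function names are ours.

import numpy as np

def wishart_hyperparams(w_prior, k, sigma0):
    # Map the user-chosen prior weight to (nu, V^{-1}) as in (37)-(38).
    m = sigma0.shape[0]
    nu = w_prior * k + m + 1
    v_inv = w_prior * k * sigma0
    return nu, v_inv

def map_covariance(residuals, sigma0, w_prior):
    # Conditional MAP estimate (36): a convex blend of S(x) and the prior Sigma_0.
    k = residuals.shape[0]
    s = residuals.T @ residuals / k       # sample covariance S(x)
    return (w_prior * sigma0 + s) / (w_prior + 1.0)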

Appendix B Proof of Proposition 1

The objective function is strictly convex because log-determinant is strictly concave over 𝕊0\mathbb{S}_{\succ 0} and the second term is linear in PP. The equality (diagonal) and “inequality” (eigenvalue) constraints are also linear and convex, respectively. Since the objective is strictly convex, there is at most one optimal solution (i.e., if a minimizer exists, it is unique).

Appendix C Proof of Theorem 1

1)

For a given xx, the Lagrangian of the inner subproblem in Problem 1 is given by

(P,Q)=logdetP+M(x),P+Q,P,\mathcal{L}(P,Q)=-\log\det P+\langle M(x),P\rangle+\langle Q,P\rangle, (39)

where QQ is the Lagrange multiplier corresponding to the semidefinite cone constraint. Observe that for P(x)=M(x)1P^{\star}(x)=M(x)^{-1}, the stationarity condition P(P(x),Q)=0\nabla_{P}\mathcal{L}(P^{\star}(x),Q^{\star})=0 holds with Q=0Q^{\star}=0. Since P(x)P^{\star}(x) is primal feasible and Q=0Q^{\star}=0 is dual feasible with complementary slackness satisfied, P(x)P^{\star}(x) is the optimal solution to the problem.

2)

For the case of the inner subproblem in Problem 2, the problem can be rewritten as

minimizeλ1,,λm\displaystyle\underset{\lambda_{1},\dots,\lambda_{m}}{\text{minimize}} i=1mlogλi+i=1mλiMii(x)\displaystyle\qquad-\sum_{i=1}^{m}\log\lambda_{i}+\sum_{i=1}^{m}\lambda_{i}M_{ii}(x)
subject to λi0,i[m].\displaystyle\qquad\lambda_{i}\geq 0,\quad i\in[m]. (40)

The Lagrangian of this problem is given by

(λ1,,λm,q1,,qm)=i=1m[logλi+λiMii(x)+λiqi],\mathcal{L}(\lambda_{1},\dots,\lambda_{m},q_{1},\dots,q_{m})=\sum_{i=1}^{m}\left[-\log\lambda_{i}+\lambda_{i}M_{ii}(x)+\lambda_{i}q_{i}\right], (41)

where qiq_{i} are the Lagrange multipliers corresponding to the constraints. The KKT conditions for this problem are

1/λi+Mii(x)+qi=0,i[m]\displaystyle-1/\lambda_{i}+M_{ii}(x)+q_{i}=0,\quad i\in[m] (42)
qiλi=0,qi0,λi0,i[m].\displaystyle q_{i}\lambda_{i}=0,\quad q_{i}\geq 0,\quad\lambda_{i}\geq 0,\quad i\in[m]. (43)

Since Mii(x)M_{ii}(x) is positive by virtue of M(x)M(x) being positive definite, λi=Mii(x)1\lambda_{i}^{\star}={M_{ii}(x)}^{-1} and qi=0q_{i}^{\star}=0 for all i[m]i\in[m] solve the above KKT system.
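As a quick numerical sanity check of this argument (with an illustrative value for Mii(x)M_{ii}(x)), each coordinate of (40) can be minimized independently in Python:

import numpy as np
from scipy.optimize import minimize_scalar

# Each coordinate of (40) is separable: f_i(lam) = -log(lam) + lam * M_ii.
# Its minimizer should equal 1 / M_ii, matching the KKT argument above.
m_ii = 0.7   # illustrative value of M_ii(x)
res = minimize_scalar(lambda lam: -np.log(lam) + lam * m_ii,
                      bounds=(1e-9, 1e3), method="bounded")
print(res.x, 1.0 / m_ii)   # both approximately 1.4286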

3)

The Lagrangian of the inner subproblem of Problem 3 is given by

(P,Q¯,Q¯)\displaystyle\mathcal{L}(P,\underline{Q},\overline{Q}) =logdetP+M(x),P\displaystyle=-\log\det P+\langle M(x),P\rangle (44)
+Q¯,I/λmaxP+Q¯,PI/λmin.\displaystyle\quad+\langle\underline{Q},I/\lambda_{\max}-P\rangle+\langle\overline{Q},P-I/\lambda_{\min}\rangle. (45)

The KKT conditions corresponding to this problem read

P(P,Q¯,Q¯)=P1+M(x)Q¯+Q¯=0\displaystyle\nabla_{P}\mathcal{L}(P,\underline{Q},\overline{Q})=-P^{-1}+M(x)-\underline{Q}+\overline{Q}=0 (46)
Q¯,I/λmaxP=0\displaystyle\langle\underline{Q},I/\lambda_{\max}-P\rangle=0 (47)
Q¯,PI/λmin=0\displaystyle\langle\overline{Q},P-I/\lambda_{\min}\rangle=0 (48)
Q¯0,Q¯0,P=P\displaystyle\underline{Q}\succeq 0,\;\overline{Q}\succeq 0,\;P=P^{\top} (49)
I/λmaxPI/λmin.\displaystyle I/\lambda_{\max}\preceq P\preceq I/\lambda_{\min}. (50)

Let Q¯=U(x)Λ¯(x)U(x)\underline{Q}^{\star}=U(x)\underline{\Lambda}(x)U(x)^{\top} and Q¯=U(x)Λ¯(x)U(x)\overline{Q}^{\star}=U(x)\overline{\Lambda}(x)U(x)^{\top} where the elements of diagonal matrices Λ¯(x)\underline{\Lambda}(x) and Λ¯(x)\overline{\Lambda}(x) are

Λ¯ii(x)={0Dii(x)[0,λmax],Dii(x)λmaxDii(x)(λmax,),\displaystyle\underline{\Lambda}_{ii}(x)=\begin{cases}0&D_{ii}(x)\in[0,\lambda_{\text{max}}],\\ D_{ii}(x)-\lambda_{\max}&D_{ii}(x)\in(\lambda_{\text{max}},\infty),\\ \end{cases} (51)

and

Λ¯ii(x)={λminDii(x)Dii(x)[0,λmin],0Dii(x)(λmin,).\displaystyle\overline{\Lambda}_{ii}(x)=\begin{cases}\lambda_{\min}-D_{ii}(x)&D_{ii}(x)\in[0,\lambda_{\text{min}}],\\ 0&D_{ii}(x)\in(\lambda_{\text{min}},\infty).\\ \end{cases} (52)

It can be observed that the choice of P(x)P^{\star}(x) as in (14), along with Q¯(x)\overline{Q}^{\star}(x) and Q¯(x)\underline{Q}^{\star}(x) defined above, satisfies the KKT conditions; these are therefore the primal-dual optimal solutions to the problem. This completes the proof.

4)

The solution to the inner subproblem of Problem 4 can be obtained via a reformulation similar to that of case 2) above and by verifying conditions analogous to those in case 3).
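Our reading of the KKT construction in case 3) is that the constrained solution amounts to clipping the eigenvalues of M(x)M(x) to [λmin,λmax][\lambda_{\min},\lambda_{\max}] before inverting, assuming M(x)=U(x)D(x)U(x)M(x)=U(x)D(x)U(x)^{\top} is the eigendecomposition used in (51)–(52). The sketch below illustrates this reading (it is not a verbatim transcription of (14)); case 4) corresponds to applying the same clipping to the diagonal entries Mii(x)M_{ii}(x) only.

import numpy as np

def constrained_precision(M, lam_min, lam_max):
    # Clip the eigenvalues of M(x) to [lam_min, lam_max] and invert:
    # the eigenvalues of Sigma*(x) then lie in the allowed range, and
    # P*(x) = U diag(1 / clipped eigenvalues) U^T.
    d, u = np.linalg.eigh(M)
    d_clipped = np.clip(d, lam_min, lam_max)
    return u @ np.diag(1.0 / d_clipped) @ u.T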

Appendix D Proof of Theorem 2

The proof follows from specializing the proof of [20, Theorem 4] to the problem considered in this paper. Specifically, the compactness of the manifolds, together with Assumptions 1 and 2, implies the existence of L~\tilde{L} as described in Lemma 1. Following the steps of the proof of [20, Theorem 4] and setting b=2b=2 then completes the proof.

Appendix E Background Information for Section V-B

Let 𝒴\mathcal{Y} be a smooth submanifold in Euclidean space. Each y𝒴y\in\mathcal{Y} is associated with a linear subspace, called the tangent space, denoted by Ty𝒴T_{y}\mathcal{Y}. Informally, the tangent space contains all directions in which one can tangentially pass through yy. For a formal definition see [1, Definition 3.5.1, p. 34]. A smooth function F:𝒴F:\mathcal{Y}\rightarrow\mathbb{R} is associated with the gradient F(y)\nabla F(y) of FF, as well as the orthogonal projection of F(y)\nabla F(y) onto the tangent space Ty𝒴T_{y}\mathcal{Y}, known as the Riemannian gradient of FF at yy, denoted by gradF(y)\mathrm{grad}\,F(y). See [1, p. 46] for more detail. The manifold 𝒴\mathcal{Y} is also associated with a map Retry:Ty𝒴𝒴\mathrm{Retr}_{y}:T_{y}\mathcal{Y}\rightarrow\mathcal{Y}, called a retraction that enables moving from yy along the direction vTy𝒴v\in T_{y}\mathcal{Y} while remaining on the manifold – see [4, Definition 3.47] and [1, Definition 4.1.1] for formal definitions.

A sufficient condition for Assumption 2 to hold is for the manifolds to be compact subsets of the Euclidean space. Under this assumption we have the following lemma from [20].

Lemma 1.

Let \mathcal{M} be a compact submanifold of the Euclidean space. Then we have,

Retrξ(v)ξαv,ξ,vTξ\|\mathrm{Retr}_{\xi}(v)-\xi\|\leq\alpha\|v\|,\quad\forall\xi\in\mathcal{M},\quad\forall v\in T_{\xi}{\mathcal{M}}

for some constant α\alpha. Moreover, if RR satisfies Assumption 2 and 𝒫\mathcal{P} is compact, then there exists a positive L~\widetilde{L} such that for all (ξ,Π)×𝒫(\xi,\Pi)\in\mathcal{M}\times\mathcal{P} and all vTξv\in T_{\xi}{\mathcal{M}}, we have

R(Retrξ(v),Π)\displaystyle R(\mathrm{Retr}_{\xi}(v),\Pi) R(ξ,Π)+gradxR(ξ,Π),v+L~2v2.\displaystyle\leq R(\xi,\Pi)+\langle\mathrm{grad}_{x}\,R(\xi,\Pi),v\rangle+\dfrac{\widetilde{L}}{2}\|v\|^{2}. (53)
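As a small numerical illustration of the first bound in Lemma 1, the following check uses the unit sphere with its metric-projection retraction (an example manifold chosen only for illustration, not one used in this paper); empirically the ratio stays below α=1\alpha=1.

import numpy as np

rng = np.random.default_rng(0)

def retract_sphere(x, v):
    # Metric-projection retraction on the unit sphere.
    return (x + v) / np.linalg.norm(x + v)

worst = 0.0
for _ in range(1000):
    x = rng.normal(size=3); x /= np.linalg.norm(x)
    v = rng.normal(size=3); v -= (v @ x) * x          # project v onto the tangent space at x
    worst = max(worst, np.linalg.norm(retract_sphere(x, v) - x) / np.linalg.norm(v))
print(worst)   # remains below 1 in this experiment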